# Initial packages required
library(tidyverse)
Iteration
You can download this .qmd file from here. Just hit the Download Raw File button.
This leans on parts of R4DS Chapter 26: Iteration, in addition to parts of the first edition of R4DS.
Iteration
Reducing duplication of code will reduce errors and make debugging much easier. We’ve already seen how functions (Ch 25) can help reduce duplication by extracting repeated patterns of code. Another tool is iteration, when you find you’re doing the same thing to multiple inputs – repeating the same operation on different columns or datasets.
Here we’ll see two important iteration paradigms: imperative programming and functional programming.
Imperation programming for iteration
Examples: for loops and while loops
Pros: relatively easy to learn, make iteration very explicit so it’s obvious what’s happening, not as inefficient as some people believe
Cons: require lots of bookkeeping code that’s duplicated for every loop
Every for loop has three components:
- output - plan ahead and allocate enough space for output
- sequence - determines what to loop over; cycles through different values of \(i\)
- body - code that does the work; run repeatedly with different values of \(i\)
<- tibble(
df a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
) df
# A tibble: 10 × 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 -0.502 -0.668 0.396 0.653
2 1.87 0.363 -0.644 -0.816
3 -0.943 -0.737 1.01 -2.81
4 0.617 0.174 -0.214 -0.0285
5 -0.850 -1.22 1.36 0.564
6 -0.848 0.598 -0.247 -0.0209
7 0.162 1.24 0.196 1.57
8 0.248 -0.366 0.983 -1.25
9 -0.286 -1.09 0.143 -1.46
10 0.459 1.61 0.566 0.863
# want median of each column (w/o cutting and pasting)
# Be careful using square brackets vs double square brackets when
# selecting elements
median(df[[1]])
[1] -0.06208288
median(df[1])
Error in median.default(df[1]): need numeric data
1] df[
# A tibble: 10 × 1
a
<dbl>
1 -0.502
2 1.87
3 -0.943
4 0.617
5 -0.850
6 -0.848
7 0.162
8 0.248
9 -0.286
10 0.459
1]] df[[
[1] -0.5023795 1.8742121 -0.9428545 0.6167671 -0.8504647 -0.8482801
[7] 0.1619869 0.2481099 -0.2861527 0.4591192
class(df[1])
[1] "tbl_df" "tbl" "data.frame"
class(df[[1]])
[1] "numeric"
# basic for loop to take median of each column
<- vector("double", ncol(df)) # 1. output
output for (i in 1:4) { # 2. sequence
<- median(df[[i]]) # 3. body
output[[i]]
} output
[1] -0.06208288 -0.09602160 0.29629139 -0.02472320
# ?seq_along - a safer option if had zero length vectors
<- vector("double", ncol(df)) # 1. output
output for (i in seq_along(df)) { # 2. sequence
<- median(df[[i]]) # 3. body
output[[i]]
} output
[1] -0.06208288 -0.09602160 0.29629139 -0.02472320
# use [[.]] even if don't have to to signal working with single elements
# alternative solution - don't hardcode in "4"
<- vector("double", ncol(df)) # 1. output
output for(i in 1:ncol(df)) { # 2. sequence
<- median(df[[i]]) # 3. body
output[i]
} output
[1] -0.06208288 -0.09602160 0.29629139 -0.02472320
# another approach - no double square brackets since df not a tibble
<- as.data.frame(df)
df <- vector("double", ncol(df)) # 1. output
output for(i in 1:ncol(df)) { # 2. sequence
<- median(df[,i]) # 3. body
output[i]
} output
[1] -0.06208288 -0.09602160 0.29629139 -0.02472320
One advantage of seq_along()
: works with unknown output length. However, the second approach below is much more efficient, since each iteration doesn’t copy all data from previous iterations.
[Pause to Ponder:] What does the code below do? Be prepared to explain both chunks line-by-line!
# for loop: unknown output length
<- c(0, 1, 2)
means <- double()
output for (i in seq_along(means)) {
<- sample(100, 1)
n <- c(output, rnorm(n, means[[i]]))
output
}str(output) ## inefficient
num [1:214] -0.306 -1.187 0.183 1.271 0.917 ...
<- vector("list", length(means))
out for (i in seq_along(means)) {
<- sample(100, 1)
n <- rnorm(n, means[[i]])
out[[i]]
}str(out) ## more efficient
List of 3
$ : num [1:62] 0.938 0.133 -2.019 1.091 -1.101 ...
$ : num [1:43] 2.0109 1.0148 2.5274 -0.0122 0.3437 ...
$ : num [1:74] 1.93 2.76 2.95 1.32 1.78 ...
str(unlist(out)) ## flatten list of vectors into single vector
num [1:179] 0.938 0.133 -2.019 1.091 -1.101 ...
Finally, the while()
loop can be used with unknown sequence length. This is used more in simulation than in data analysis.
[Pause to Ponder:] What does the following code do?
<- function() sample(c("T", "H"), 1)
flip <- 0
flips <- 0
nheads while (nheads < 3) {
if (flip() == "H") {
<- nheads + 1
nheads else {
} <- 0
nheads
}<- flips + 1
flips
} flips
[1] 4
Using iteration for simulation
This applet contains data from a 2005 study on the use of dolphin-facilitated therapy on the treatment of depression. In that study, 10 of the 15 subjects (67%) assigned to dolphin therapy showed improvement, compared to only 3 of the 15 subjects (20%) assigned to the control group. But with such small sample sizes, is this significant evidence that the dolphin group had greater improvement of their depressive symptoms? To answer that question, we can use simulation to conduct a randomization test.
We will simulate behavior in the “null world” where there is no real effect of treatment. In that case, the 13 total improvers would have improved no matter the treatment assigned, and the 17 total non-improvers would have not improved no matter the treatment assigned. So in the “null world”, treatment is a meaningless label that can be just as easily shuffled among subjects without any effect. In that world, the fact we observed a 47 percentage point difference in success rates (67 - 20) was just random luck. But we should ask: how often would we expect a difference as large as 47% by chance, assuming we’re living in the null world where there is no effect of treatment?
You could think about simulating this situation with the following steps:
write code to calculate the difference in success rates in the observed data
write a loop to calculate the differences in success rates from 1000 simulated data sets from the null world. Store those 1000 simulated differences
calculate how often we found a difference in the null world as large as that found in the observed data. In statistics, when this probability is below .05, we typically reject the null world, and conclude that there is likely a real difference between the two groups (i.e. a “statistically significant” difference)
[Pause to Ponder:] Fill in Step 2 in the second R chunk below to carry out the three steps above. (The first R chunk provides some preliminary code.) Then describe what you can conclude from this study based on your plot and “p_value” from Step 3.
### Preliminary code ###
# generate a tibble with our observed data
<- tibble(treatment = rep(c("Dolphin", "Control"), each = 15),
dolphin_data improve = c(rep("Yes", 10), rep("No", 5),
rep("Yes", 3), rep("No", 12)))
print(dolphin_data, n = Inf)
# A tibble: 30 × 2
treatment improve
<chr> <chr>
1 Dolphin Yes
2 Dolphin Yes
3 Dolphin Yes
4 Dolphin Yes
5 Dolphin Yes
6 Dolphin Yes
7 Dolphin Yes
8 Dolphin Yes
9 Dolphin Yes
10 Dolphin Yes
11 Dolphin No
12 Dolphin No
13 Dolphin No
14 Dolphin No
15 Dolphin No
16 Control Yes
17 Control Yes
18 Control Yes
19 Control No
20 Control No
21 Control No
22 Control No
23 Control No
24 Control No
25 Control No
26 Control No
27 Control No
28 Control No
29 Control No
30 Control No
# `sample()` can be used to shuffle the treatments among the 30 subjects
sample(dolphin_data$treatment)
[1] "Control" "Dolphin" "Dolphin" "Dolphin" "Dolphin" "Control" "Control"
[8] "Control" "Dolphin" "Dolphin" "Control" "Control" "Dolphin" "Dolphin"
[15] "Control" "Control" "Dolphin" "Dolphin" "Control" "Dolphin" "Control"
[22] "Dolphin" "Dolphin" "Dolphin" "Control" "Control" "Dolphin" "Control"
[29] "Control" "Control"
### Fill in Step 2 and remove "eval: FALSE" ###
# Step 1
<- dolphin_data |>
dolphin_summary group_by(treatment) |>
summarize(prop_yes = mean(improve == "Yes"))
dolphin_summary<- dolphin_summary[[2]][2] - dolphin_summary[[2]][1]
observed_diff
# Step 2
### Write a loop to create 1000 simulated differences from the null world
# Step 3
<- tibble(simulated_diffs = simulated_diffs)
null_world ggplot(null_world, aes(x = simulated_diffs)) +
geom_histogram() +
geom_vline(xintercept = observed_diff, color = "red")
<- sum(abs(simulated_diffs) >= abs(observed_diff)) / 1000
p_value p_value
You have written code to conduct a randomization test for the difference in two proportions, a powerful test of statistical significance that is demonstrated in the original applet!
Functional programming for iteration
Examples: map functions and across()
Pros: less code, fewer errors, code that’s easier to read; takes advantage of fact that R is a functional programming language
Cons: little more complicated to master vocabulary and use – a step up in abstraction
R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly. Passing one function to another is a very powerful coding approach!!
# Below you can avoid writing separate functions for mean, median,
# SD, etc. by column
<- tibble(
df a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
<- function(df, fun) {
col_summary <- vector("double", length(df))
out for (i in seq_along(df)) {
<- fun(df[[i]])
out[i]
}
out
}col_summary(df, median)
[1] -0.1952978 -0.2067583 0.2349411 0.6016533
col_summary(df, mean)
[1] -0.07689049 -0.15886096 0.08754637 0.51398147
col_summary(df, IQR)
[1] 1.094586 1.656958 1.283876 1.020525
The purrr
package provides map
functions to eliminate need for for loops, plus it makes code easier to read!
# using map functions for summary stats by column as above
map_dbl(df, mean)
a b c d
-0.07689049 -0.15886096 0.08754637 0.51398147
map_dbl(df, median)
a b c d
-0.1952978 -0.2067583 0.2349411 0.6016533
map_dbl(df, sd)
a b c d
0.9438331 0.9889397 0.9287586 0.7635153
map_dbl(df, mean, trim = 0.5)
a b c d
-0.1952978 -0.2067583 0.2349411 0.6016533
# map_dbl means make a double vector
# can also do map() for list, map_lgl(), map_int(), and map_chr()
# even more clear
|> map_dbl(mean) df
a b c d
-0.07689049 -0.15886096 0.08754637 0.51398147
|> map_dbl(median) df
a b c d
-0.1952978 -0.2067583 0.2349411 0.6016533
|> map_dbl(sd) df
a b c d
0.9438331 0.9889397 0.9287586 0.7635153
The across() function from dplyr also works well:
|> summarize(
df n = n(),
across(.cols = a:d, .fns = median, .names = "median_{.col}")
)
# A tibble: 1 × 5
n median_a median_b median_c median_d
<int> <dbl> <dbl> <dbl> <dbl>
1 10 -0.195 -0.207 0.235 0.602
# other ways to repeat across the numeric columns of df:
|> summarize(
df n = n(),
across(everything(), median, .names = "median_{.col}")
)
# A tibble: 1 × 6
n median_a median_b median_c median_d median_n
<int> <dbl> <dbl> <dbl> <dbl> <int>
1 10 -0.195 -0.207 0.235 0.602 10
|> summarize(
df n = n(),
across(where(is.numeric), median, .names = "median_{.col}")
)
# A tibble: 1 × 6
n median_a median_b median_c median_d median_n
<int> <dbl> <dbl> <dbl> <dbl> <int>
1 10 -0.195 -0.207 0.235 0.602 10
# Here "across" effectively expands to the following code. Note that
# across() will write over old columns unless you change the name!
|>
df summarize(
median_a = median(a),
median_b = median(b),
median_c = median(c),
median_d = median(d),
n = n()
)
# A tibble: 1 × 5
median_a median_b median_c median_d n
<dbl> <dbl> <dbl> <dbl> <int>
1 -0.195 -0.207 0.235 0.602 10
# And if we're worried about NAs, we can't call median directly, we
# must create a new function that we can pass options into
<- df
df_miss 2, 1] <- NA
df_miss[4:5, 2] <- NA
df_miss[ df_miss
# A tibble: 10 × 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 -0.456 0.594 0.901 0.547
2 NA -0.685 -0.815 1.01
3 -0.966 0.932 0.634 -0.718
4 0.697 NA -0.470 1.01
5 0.579 NA 0.228 -0.184
6 -0.195 -0.935 0.317 1.69
7 -1.95 0.990 -1.14 0.521
8 -0.338 -0.533 0.242 0.656
9 1.07 -1.80 -0.882 1.10
10 -0.195 0.119 1.86 -0.490
|>
df_miss summarize(
across(
:d,
alist(
median = \(x) median(x, na.rm = TRUE),
n_miss = \(x) sum(is.na(x))
),.names = "{.fn}_{.col}"
),n = n(),
)
# A tibble: 1 × 9
median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d n
<dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int>
1 -0.195 1 -0.207 2 0.235 0 0.602 0 10
# where \ is shorthand for an anonymous function - i.e. you could
# replace "\" with "function" if you like typing more letters :)
# across-like functions can also be used with filter():
# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
|> filter(if_any(a:d, is.na)) df_miss
# A tibble: 3 × 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 NA -0.685 -0.815 1.01
2 0.697 NA -0.470 1.01
3 0.579 NA 0.228 -0.184
# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
|> filter(if_all(a:d, is.na)) df_miss
# A tibble: 0 × 4
# ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>
When you input a list of functions (like the lubridate functions below), across() assigns default names as columnname_functionname:
library(lubridate)
<- function(df) {
expand_dates |>
df mutate(
across(where(is.Date), list(year = year, month = month, day = mday))
)
}
<- tibble(
df_date name = c("Amy", "Bob"),
date = ymd(c("2009-08-03", "2010-01-16"))
)
|>
df_date expand_dates()
# A tibble: 2 × 5
name date date_year date_month date_day
<chr> <date> <dbl> <dbl> <int>
1 Amy 2009-08-03 2009 8 3
2 Bob 2010-01-16 2010 1 16
Here is default is to summarize all numeric columns, but as with all functions, we can override the default if we choose:
<- function(df, summary_vars = where(is.numeric)) {
summarize_means |>
df summarize(
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
n = n(),
.groups = "drop"
)
}|>
diamonds group_by(cut) |>
summarize_means()
# A tibble: 5 × 9
cut carat depth table price x y z n
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 1.05 64.0 59.1 4359. 6.25 6.18 3.98 1610
2 Good 0.849 62.4 58.7 3929. 5.84 5.85 3.64 4906
3 Very Good 0.806 61.8 58.0 3982. 5.74 5.77 3.56 12082
4 Premium 0.892 61.3 58.7 4584. 5.97 5.94 3.65 13791
5 Ideal 0.703 61.7 56.0 3458. 5.51 5.52 3.40 21551
|>
diamonds group_by(cut) |>
summarize_means(c(carat, x:z))
# A tibble: 5 × 6
cut carat x y z n
<ord> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 1.05 6.25 6.18 3.98 1610
2 Good 0.849 5.84 5.85 3.64 4906
3 Very Good 0.806 5.74 5.77 3.56 12082
4 Premium 0.892 5.97 5.94 3.65 13791
5 Ideal 0.703 5.51 5.52 3.40 21551
pivot_longer() with group_by() and summarize() also provides a nice solution:
<- df |>
long pivot_longer(a:d) |>
group_by(name) |>
summarize(
median = median(value),
mean = mean(value)
) long
# A tibble: 4 × 3
name median mean
<chr> <dbl> <dbl>
1 a -0.195 -0.0769
2 b -0.207 -0.159
3 c 0.235 0.0875
4 d 0.602 0.514
Here are a couple of other nice features of map
functions: - perform analyses (like fitting a line) by subgroup - extracting components from a model or elements by position
# fit linear model to each group based on cylinder
# - split designed to split into new dfs (unlike group_by)
# - map returns a vector or list, which can be limiting
= purrr::map
map <- split(mtcars, mtcars$cyl) |>
models map(function(df) lm(mpg ~ wt, data = df))
models
$`4`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
39.571 -5.647
$`6`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
28.41 -2.78
$`8`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
23.868 -2.192
1]] models[[
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
39.571 -5.647
# shortcut using purrr - 1-sided formulas
<- split(mtcars, mtcars$cyl) |>
models map(~lm(mpg ~ wt, data = .))
models
$`4`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
39.571 -5.647
$`6`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
28.41 -2.78
$`8`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
23.868 -2.192
# extract named components from each model
str(models)
List of 3
$ 4:List of 12
..$ coefficients : Named num [1:2] 39.57 -5.65
.. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
..$ residuals : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
.. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
..$ effects : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
.. ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
..$ rank : int 2
..$ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
.. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
..$ assign : int [1:2] 0 1
..$ qr :List of 5
.. ..$ qr : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
.. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
.. .. ..- attr(*, "assign")= int [1:2] 0 1
.. ..$ qraux: num [1:2] 1.3 1.5
.. ..$ pivot: int [1:2] 1 2
.. ..$ tol : num 1e-07
.. ..$ rank : int 2
.. ..- attr(*, "class")= chr "qr"
..$ df.residual : int 9
..$ xlevels : Named list()
..$ call : language lm(formula = mpg ~ wt, data = .)
..$ terms :Classes 'terms', 'formula' language mpg ~ wt
.. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. ..$ : chr "wt"
.. .. ..- attr(*, "term.labels")= chr "wt"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290>
.. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
..$ model :'data.frame': 11 obs. of 2 variables:
.. ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
.. ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
.. ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt
.. .. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. .. ..$ : chr "wt"
.. .. .. ..- attr(*, "term.labels")= chr "wt"
.. .. .. ..- attr(*, "order")= int 1
.. .. .. ..- attr(*, "intercept")= int 1
.. .. .. ..- attr(*, "response")= int 1
.. .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290>
.. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
..- attr(*, "class")= chr "lm"
$ 6:List of 12
..$ coefficients : Named num [1:2] 28.41 -2.78
.. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
..$ residuals : Named num [1:7] -0.125 0.584 1.929 -0.69 0.355 ...
.. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
..$ effects : Named num [1:7] -52.235 -2.427 2.111 -0.353 0.679 ...
.. ..- attr(*, "names")= chr [1:7] "(Intercept)" "wt" "" "" ...
..$ rank : int 2
..$ fitted.values: Named num [1:7] 21.1 20.4 19.5 18.8 18.8 ...
.. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
..$ assign : int [1:2] 0 1
..$ qr :List of 5
.. ..$ qr : num [1:7, 1:2] -2.646 0.378 0.378 0.378 0.378 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
.. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
.. .. ..- attr(*, "assign")= int [1:2] 0 1
.. ..$ qraux: num [1:2] 1.38 1.12
.. ..$ pivot: int [1:2] 1 2
.. ..$ tol : num 1e-07
.. ..$ rank : int 2
.. ..- attr(*, "class")= chr "qr"
..$ df.residual : int 5
..$ xlevels : Named list()
..$ call : language lm(formula = mpg ~ wt, data = .)
..$ terms :Classes 'terms', 'formula' language mpg ~ wt
.. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. ..$ : chr "wt"
.. .. ..- attr(*, "term.labels")= chr "wt"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3066b308>
.. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
..$ model :'data.frame': 7 obs. of 2 variables:
.. ..$ mpg: num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
.. ..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
.. ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt
.. .. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. .. ..$ : chr "wt"
.. .. .. ..- attr(*, "term.labels")= chr "wt"
.. .. .. ..- attr(*, "order")= int 1
.. .. .. ..- attr(*, "intercept")= int 1
.. .. .. ..- attr(*, "response")= int 1
.. .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3066b308>
.. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
..- attr(*, "class")= chr "lm"
$ 8:List of 12
..$ coefficients : Named num [1:2] 23.87 -2.19
.. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
..$ residuals : Named num [1:14] 2.374 -1.741 1.455 1.61 -0.381 ...
.. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
..$ effects : Named num [1:14] -56.499 -6.003 0.816 1.22 -0.807 ...
.. ..- attr(*, "names")= chr [1:14] "(Intercept)" "wt" "" "" ...
..$ rank : int 2
..$ fitted.values: Named num [1:14] 16.3 16 14.9 15.7 15.6 ...
.. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
..$ assign : int [1:2] 0 1
..$ qr :List of 5
.. ..$ qr : num [1:14, 1:2] -3.742 0.267 0.267 0.267 0.267 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
.. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
.. .. ..- attr(*, "assign")= int [1:2] 0 1
.. ..$ qraux: num [1:2] 1.27 1.11
.. ..$ pivot: int [1:2] 1 2
.. ..$ tol : num 1e-07
.. ..$ rank : int 2
.. ..- attr(*, "class")= chr "qr"
..$ df.residual : int 12
..$ xlevels : Named list()
..$ call : language lm(formula = mpg ~ wt, data = .)
..$ terms :Classes 'terms', 'formula' language mpg ~ wt
.. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. ..$ : chr "wt"
.. .. ..- attr(*, "term.labels")= chr "wt"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3084fe80>
.. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
..$ model :'data.frame': 14 obs. of 2 variables:
.. ..$ mpg: num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
.. ..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
.. ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt
.. .. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. .. ..$ : chr "wt"
.. .. .. ..- attr(*, "term.labels")= chr "wt"
.. .. .. ..- attr(*, "order")= int 1
.. .. .. ..- attr(*, "intercept")= int 1
.. .. .. ..- attr(*, "response")= int 1
.. .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3084fe80>
.. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
..- attr(*, "class")= chr "lm"
str(models[[1]])
List of 12
$ coefficients : Named num [1:2] 39.57 -5.65
..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
$ residuals : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
$ effects : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
.. .. ..$ : chr [1:2] "(Intercept)" "wt"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.3 1.5
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 9
$ xlevels : Named list()
$ call : language lm(formula = mpg ~ wt, data = .)
$ terms :Classes 'terms', 'formula' language mpg ~ wt
.. ..- attr(*, "variables")= language list(mpg, wt)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. ..$ : chr "wt"
.. ..- attr(*, "term.labels")= chr "wt"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290>
.. ..- attr(*, "predvars")= language list(mpg, wt)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
$ model :'data.frame': 11 obs. of 2 variables:
..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt
.. .. ..- attr(*, "variables")= language list(mpg, wt)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. ..$ : chr "wt"
.. .. ..- attr(*, "term.labels")= chr "wt"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290>
.. .. ..- attr(*, "predvars")= language list(mpg, wt)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
- attr(*, "class")= chr "lm"
str(summary(models[[1]]))
List of 11
$ call : language lm(formula = mpg ~ wt, data = .)
$ terms :Classes 'terms', 'formula' language mpg ~ wt
.. ..- attr(*, "variables")= language list(mpg, wt)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. ..$ : chr "wt"
.. ..- attr(*, "term.labels")= chr "wt"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290>
.. ..- attr(*, "predvars")= language list(mpg, wt)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
$ residuals : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
$ coefficients : num [1:2, 1:4] 39.57 -5.65 4.35 1.85 9.1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "wt"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:2] FALSE FALSE
..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
$ sigma : num 3.33
$ df : int [1:3] 2 9 2
$ r.squared : num 0.509
$ adj.r.squared: num 0.454
$ fstatistic : Named num [1:3] 9.32 1 9
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:2, 1:2] 1.701 -0.705 -0.705 0.308
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "wt"
.. ..$ : chr [1:2] "(Intercept)" "wt"
- attr(*, "class")= chr "summary.lm"
|>
models map(summary) |>
map_dbl("r.squared")
4 6 8
0.5086326 0.4645102 0.4229655
# can use integer to select elements by position
<- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x |> map_dbl(2) x
[1] 2 5 8
Iterative techniques for reading multiple files
library(readxl)
# our usual path doesn't work with excel files
# read_excel("https://proback.github.io/264_fall_2024/Data/gapminder/1952.xlsx")
# this will work if .xlsx files live in a Data folder that's at the same
# level as your .qmd file
<- read_excel("Data/gapminder/1952.xlsx")
gap1952 <- read_excel("Data/gapminder/1957.xlsx") gap1957
Since the 1952 and 1957 data have the same 5 columns, if we want to combine this data into a single data set showing time trends, we could simply bind_rows()
(after adding a 6th column for year
)
<- gap1952 |>
gap1952 mutate(year = 1952)
<- gap1957 |>
gap1957 mutate(year = 1957)
<- bind_rows(gap1952, gap1957) gap_data
Of course, with 10 more years worth of data still left to read in and merge, this process could get pretty onerous. Section 26.3 shows how to automate this process in 3 steps:
- use list.files() to list all the files in a directory
<- list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
paths paths
[1] "Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx"
[3] "Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx"
[5] "Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx"
[7] "Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx"
[9] "Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx"
[11] "Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx"
- use purrr::map() to read each of them into a list (we will discuss lists more in 06_data_types.qmd)
<- map(paths, readxl::read_excel)
gap_files length(gap_files)
[1] 12
str(gap_files)
List of 12
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 28.8 55.2 43.1 30 62.5 ...
..$ pop : num [1:142] 8425333 1282697 9279525 4232095 17876956 ...
..$ gdpPercap: num [1:142] 779 1601 2449 3521 5911 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 30.3 59.3 45.7 32 64.4 ...
..$ pop : num [1:142] 9240934 1476505 10270856 4561361 19610538 ...
..$ gdpPercap: num [1:142] 821 1942 3014 3828 6857 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 32 64.8 48.3 34 65.1 ...
..$ pop : num [1:142] 10267083 1728137 11000948 4826015 21283783 ...
..$ gdpPercap: num [1:142] 853 2313 2551 4269 7133 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 34 66.2 51.4 36 65.6 ...
..$ pop : num [1:142] 11537966 1984060 12760499 5247469 22934225 ...
..$ gdpPercap: num [1:142] 836 2760 3247 5523 8053 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 36.1 67.7 54.5 37.9 67.1 ...
..$ pop : num [1:142] 13079460 2263554 14760787 5894858 24779799 ...
..$ gdpPercap: num [1:142] 740 3313 4183 5473 9443 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 38.4 68.9 58 39.5 68.5 ...
..$ pop : num [1:142] 14880372 2509048 17152804 6162675 26983828 ...
..$ gdpPercap: num [1:142] 786 3533 4910 3009 10079 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 39.9 70.4 61.4 39.9 69.9 ...
..$ pop : num [1:142] 12881816 2780097 20033753 7016384 29341374 ...
..$ gdpPercap: num [1:142] 978 3631 5745 2757 8998 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 40.8 72 65.8 39.9 70.8 ...
..$ pop : num [1:142] 13867957 3075321 23254956 7874230 31620918 ...
..$ gdpPercap: num [1:142] 852 3739 5681 2430 9140 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 41.7 71.6 67.7 40.6 71.9 ...
..$ pop : num [1:142] 16317921 3326498 26298373 8735988 33958947 ...
..$ gdpPercap: num [1:142] 649 2497 5023 2628 9308 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 41.8 73 69.2 41 73.3 ...
..$ pop : num [1:142] 22227415 3428038 29072015 9875024 36203463 ...
..$ gdpPercap: num [1:142] 635 3193 4797 2277 10967 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 42.1 75.7 71 41 74.3 ...
..$ pop : num [1:142] 25268405 3508512 31287142 10866106 38331121 ...
..$ gdpPercap: num [1:142] 727 4604 5288 2773 8798 ...
$ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
..$ country : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
..$ lifeExp : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
..$ pop : num [1:142] 31889923 3600523 33333216 12420476 40301927 ...
..$ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
1]] # pull off the first object in the list (i.e. 1952 data) gap_files[[
# A tibble: 142 × 5
country continent lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 28.8 8425333 779.
2 Albania Europe 55.2 1282697 1601.
3 Algeria Africa 43.1 9279525 2449.
4 Angola Africa 30.0 4232095 3521.
5 Argentina Americas 62.5 17876956 5911.
6 Australia Oceania 69.1 8691212 10040.
7 Austria Europe 66.8 6927772 6137.
8 Bahrain Asia 50.9 120447 9867.
9 Bangladesh Asia 37.5 46886859 684.
10 Belgium Europe 68 8730405 8343.
# ℹ 132 more rows
- use purrr::list_rbind() to combine them into a single data frame
<- list_rbind(gap_files)
gap_tidy class(gap_tidy)
[1] "tbl_df" "tbl" "data.frame"
gap_tidy
# A tibble: 1,704 × 5
country continent lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 28.8 8425333 779.
2 Albania Europe 55.2 1282697 1601.
3 Algeria Africa 43.1 9279525 2449.
4 Angola Africa 30.0 4232095 3521.
5 Argentina Americas 62.5 17876956 5911.
6 Australia Oceania 69.1 8691212 10040.
7 Austria Europe 66.8 6927772 6137.
8 Bahrain Asia 50.9 120447 9867.
9 Bangladesh Asia 37.5 46886859 684.
10 Belgium Europe 68 8730405 8343.
# ℹ 1,694 more rows
We could even do all steps in a single pipeline:
list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
map(readxl::read_excel) |>
list_rbind()
# A tibble: 1,704 × 5
country continent lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 28.8 8425333 779.
2 Albania Europe 55.2 1282697 1601.
3 Algeria Africa 43.1 9279525 2449.
4 Angola Africa 30.0 4232095 3521.
5 Argentina Americas 62.5 17876956 5911.
6 Australia Oceania 69.1 8691212 10040.
7 Austria Europe 66.8 6927772 6137.
8 Bahrain Asia 50.9 120447 9867.
9 Bangladesh Asia 37.5 46886859 684.
10 Belgium Europe 68 8730405 8343.
# ℹ 1,694 more rows
Note that we are lacking a 6th column with the year represented by each row of data. Here is one way to solve that issue:
# This extracts file names, which are carried along as data frame names by
# map functions
|> set_names(basename) paths
1952.xlsx 1957.xlsx
"Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx"
1962.xlsx 1967.xlsx
"Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx"
1972.xlsx 1977.xlsx
"Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx"
1982.xlsx 1987.xlsx
"Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx"
1992.xlsx 1997.xlsx
"Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx"
2002.xlsx 2007.xlsx
"Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx"
# The middle line ensures that each of the 12 data frames in the list for
# gap_files has a name determined by its filepath, unlike the gap_files
# we created in step 2 above, which had no names (we could only identify
# data frames by their position)
<- paths |>
gap_files set_names(basename) |>
map(readxl::read_excel)
# Now we can extract a particular year by its name:
"1962.xlsx"]] gap_files[[
# A tibble: 142 × 5
country continent lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 32.0 10267083 853.
2 Albania Europe 64.8 1728137 2313.
3 Algeria Africa 48.3 11000948 2551.
4 Angola Africa 34 4826015 4269.
5 Argentina Americas 65.1 21283783 7133.
6 Australia Oceania 70.9 10794968 12217.
7 Austria Europe 69.5 7129864 10751.
8 Bahrain Asia 56.9 171863 12753.
9 Bangladesh Asia 41.2 56839289 686.
10 Belgium Europe 70.2 9218400 10991.
# ℹ 132 more rows
# Finally, take advantage of the `names_to` argument in list_rbind to
# create that 6th column with `year`
<- paths |>
gap_tidy set_names(basename) |>
map(readxl::read_excel) |>
list_rbind(names_to = "year") |>
mutate(year = parse_number(year))
You could then save your result using write_csv
so you don’t have to run the reading and wrangling code every time!
On Your Own
Compute the mean of every column of the
mtcars
data set using (a) a for loop, (b) amap
function, (c) the across() function, and (d) pivot_longer().Write a function that prints the mean of each numeric column in a data frame. Try it on the
iris
data set. (Hint:keep(is.numeric)
)Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:
<- ""
out for (x in letters) {
<- stringr::str_c(out, x)
out
} out
[1] "abcdefghijklmnopqrstuvwxyz"
<- runif(100)
x <- vector("numeric", length(x))
out 1] <- x[1]
out[for (i in 2:length(x)) {
<- out[i - 1] + x[i]
out[i]
} out
[1] 0.2506384 0.9759137 1.5069357 2.4550164 3.0671610 3.4692085
[7] 3.5170362 3.5680981 4.5310981 5.4915534 6.3652519 6.6017948
[13] 7.3308325 7.4322275 7.9820764 8.4232124 8.9626225 9.4099158
[19] 9.8411847 9.9983242 10.9370628 11.2184210 11.7131817 11.9979545
[25] 12.7336841 13.4807135 13.5142562 13.9353312 14.3954739 14.5238460
[31] 15.2531725 15.3331256 16.0800002 16.9047467 17.7124190 18.5836777
[37] 19.2685514 20.2217972 20.7590128 21.7258832 22.3571406 23.0150318
[43] 23.0555192 23.2183946 23.8984369 24.7742049 24.7830053 25.6539862
[49] 26.0658080 26.1555196 26.4515061 26.5169040 27.4832727 28.0914213
[55] 29.0909732 29.9250165 30.1126262 30.4355826 30.4838046 31.3994133
[61] 31.9182786 32.6152841 33.5040339 34.0445952 34.9451669 35.5634332
[67] 36.3727143 36.5811334 36.7382437 37.5619114 38.3995447 38.4714948
[73] 38.8461349 39.8411913 40.3614437 40.4626199 41.1813205 41.6758823
[79] 42.4902015 43.0768056 43.1641257 43.6831914 43.7996635 44.2088854
[85] 45.0964847 45.9281900 46.3899264 46.8555729 47.3223114 48.0272623
[91] 48.8482467 49.7457907 50.4621209 50.6371379 51.4485006 51.4670713
[97] 52.4013657 52.8815296 53.1895940 53.9850197
Compute the number of unique values in each column of the
iris
data set using at least 2 of your favorite iteration methods. Bonus points if you can use pivot_longer()!Carefully explain each step in the pipeline below:
<- function(df, group_vars, summary_vars = everything()) {
show_missing |>
df group_by(pick({{ group_vars }})) |>
summarize(
across({{ summary_vars }}, \(x) sum(is.na(x))),
.groups = "drop"
|>
) select(where(\(x) any(x > 0)))
}::flights |> show_missing(c(year, month, day)) nycflights13
# A tibble: 365 × 9
year month day dep_time dep_delay arr_time arr_delay tailnum air_time
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2013 1 1 4 4 5 11 0 11
2 2013 1 2 8 8 10 15 2 15
3 2013 1 3 10 10 10 14 2 14
4 2013 1 4 6 6 6 7 2 7
5 2013 1 5 3 3 3 3 1 3
6 2013 1 6 1 1 1 3 0 3
7 2013 1 7 3 3 3 3 1 3
8 2013 1 8 4 4 4 7 1 7
9 2013 1 9 5 5 7 9 2 9
10 2013 1 10 3 3 3 3 2 3
# ℹ 355 more rows
- Write a function called
summary_stats()
that allows a user to input a tibble, numeric variables in that tibble, and summary statistics that they would like to see for each variable. Usingacross()
, the function’s output should look like the example below.
summary_stats(mtcars,
vars = c(mpg, hp, wt),
stat_fcts = list(mean = mean,
median = median,
sd = sd,
IQR = IQR))
# mpg_mean mpg_median mpg_sd mpg_IQR hp_mean hp_median hp_sd hp_IQR
#1 20.09062 19.2 6.026948 7.375 146.6875 123 68.56287 83.5
# wt_mean wt_median wt_sd wt_IQR n
#1 3.21725 3.325 0.9784574 1.02875 32
- The power of a statistical test is the probability that it rejects the null hypothesis when the null hypothesis is false. In other words, it’s the probability that a statistical test can detect when a true difference exists. The power depends on a number of factors, including:
- sample size
- type I error level (probability of declaring there is a statistically significant difference when there really isn’t)
- variability in the data
- size of the true difference
The following steps can be followed to simulate a power calculation using iteration techniques:
generate simulated data where is a true difference or effect
run your desired test on the simulated data and record if the null hypothesis was rejected or not (i.e. if the p-value was below .05)
repeat (a)-(b) a large number of times and record the total proportion of times that the null hypothesis was rejected; that proportion is the power of your test under those conditions
Create a power curve for a two-sample t-test by filling in Step C below and then removing eval: FALSE
:
# Step A
# set parameters for two-sample t-test
<- 100 # mean response in Group 1
mean1 <- 5 # true mean difference between Groups 1 and 2
truediff <- mean1 + truediff # mean response in Group 2
mean2 <- 10 # standard deviation in Group 1
sd1 <- 10 # standard deviation in Group 2
sd2 <- 20 # sample size in Group 1
n1 <- 20 # sample size in Group 2
n2 <- 1000 # number of simulations (iterations) to run
numsims
# generate sample data for Groups 1 and 2 based on normal distributions
# with the parameters above (note that there is truly a difference in means!)
<- rnorm(n1, mean1, sd1)
samp1 <- rnorm(n2, mean2, sd2)
samp2
# organize the simulated data into a tibble
<- tibble(response = c(samp1, samp2),
sim_data group = c(rep("Group 1", n1), rep("Group 2", n2)))
sim_data
# Step B
# exploratory analysis of the simulated data
::favstats(response ~ group, data = sim_data)
mosaicggplot(sim_data, aes(x = response, y = group)) +
geom_boxplot()
# run a two-sample t-test to see if there is a significant difference
# in means between Groups 1 and 2 (i.e. is the p-value < .05?)
<- t.test(x = samp1, y = samp2)$p.value
p_value
p_value< .05 # if TRUE, then we reject the null hypothesis and conclude
p_value # there is a statistically significant difference
# Step C
# find the power = proportion of time null is rejected when
# true difference is not 0 (i.e. number of simulated data sets that
# result in p-values below .05)