Iteration

You can download this .qmd file from here. Just hit the Download Raw File button.

This leans on parts of R4DS Chapter 26: Iteration, in addition to parts of the first edition of R4DS.

# Initial packages required
library(tidyverse)

Iteration

Reducing duplication of code will reduce errors and make debugging much easier. We’ve already seen how functions (Ch 25) can help reduce duplication by extracting repeated patterns of code. Another tool is iteration, when you find you’re doing the same thing to multiple inputs – repeating the same operation on different columns or datasets.

Here we’ll see two important iteration paradigms: imperative programming and functional programming.

Imperation programming for iteration

Examples: for loops and while loops

Pros: relatively easy to learn, make iteration very explicit so it’s obvious what’s happening, not as inefficient as some people believe

Cons: require lots of bookkeeping code that’s duplicated for every loop

Every for loop has three components:

output - plan ahead and allocate enough space for output
sequence - determines what to loop over; cycles through different values of \(i\)
body - code that does the work; run repeatedly with different values of \(i\)

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

# A tibble: 10 × 4
         a      b       c      d
     <dbl>  <dbl>   <dbl>  <dbl>
 1  0.301   0.249  1.29   -0.412
 2 -0.962   0.789  0.243   1.03 
 3 -0.0396  0.932  0.198   0.671
 4 -1.28    3.08   0.821  -1.22 
 5 -0.716  -0.643 -2.12    0.916
 6  2.10    0.990  0.0462 -1.30 
 7  0.115   1.73  -2.60   -1.01 
 8  0.133  -1.20   0.620   1.58 
 9  0.850   0.767  1.64   -0.299
10  3.84   -0.166  1.31   -1.56

# want median of each column (w/o cutting and pasting)
#   Be careful using square brackets vs double square brackets when
#   selecting elements
median(df[[1]])

[1] 0.1240396

median(df[1])

Error in median.default(df[1]): need numeric data

df[1]

# A tibble: 10 × 1
         a
     <dbl>
 1  0.301 
 2 -0.962 
 3 -0.0396
 4 -1.28  
 5 -0.716 
 6  2.10  
 7  0.115 
 8  0.133 
 9  0.850 
10  3.84

df[[1]]

 [1]  0.30078994 -0.96163880 -0.03959508 -1.27703392 -0.71551756  2.10163450
 [7]  0.11513889  0.13294032  0.84962136  3.84028921

class(df[1])

[1] "tbl_df"     "tbl"        "data.frame"

class(df[[1]])

[1] "numeric"

# basic for loop to take median of each column
output <- vector("double", ncol(df))  # 1. output
for (i in 1:4) {                      # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output

[1]  0.1240396  0.7779508  0.4312663 -0.3556261

# ?seq_along - a safer option if had zero length vectors
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output

[1]  0.1240396  0.7779508  0.4312663 -0.3556261

# use [[.]] even if don't have to to signal working with single elements

# alternative solution - don't hardcode in "4"
output <- vector("double", ncol(df))  # 1. output
for(i in 1:ncol(df)) {                # 2. sequence
  output[i] <- median(df[[i]])        # 3. body
}
output

[1]  0.1240396  0.7779508  0.4312663 -0.3556261

# another approach - no double square brackets since df not a tibble
df <- as.data.frame(df)
output <- vector("double", ncol(df))  # 1. output
for(i in 1:ncol(df)) {                # 2. sequence
  output[i] <- median(df[,i])         # 3. body
}
output

[1]  0.1240396  0.7779508  0.4312663 -0.3556261

One advantage of seq_along(): works with unknown output length. However, the second approach below is much more efficient, since each iteration doesn’t copy all data from previous iterations.

[Pause to Ponder:] What does the code below do? Be prepared to explain both chunks line-by-line!

# for loop: unknown output length

means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)        ## inefficient

 num [1:68] 1.257 -0.605 -0.195 0.49 -0.278 ...

out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)           ## more efficient

List of 3
 $ : num [1:42] 0.516 -0.23 0.921 -0.79 -0.138 ...
 $ : num [1:35] 0.0588 3.1537 0.9891 1.605 0.028 ...
 $ : num [1:24] 1.656 0.906 1.455 2.342 3.311 ...

str(unlist(out))   ## flatten list of vectors into single vector

 num [1:101] 0.516 -0.23 0.921 -0.79 -0.138 ...

Finally, the while() loop can be used with unknown sequence length. This is used more in simulation than in data analysis.

[Pause to Ponder:] What does the following code do?

flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips

[1] 11

Using iteration for simulation

This applet contains data from a 2005 study on the use of dolphin-facilitated therapy on the treatment of depression. In that study, 10 of the 15 subjects (67%) assigned to dolphin therapy showed improvement, compared to only 3 of the 15 subjects (20%) assigned to the control group. But with such small sample sizes, is this significant evidence that the dolphin group had greater improvement of their depressive symptoms? To answer that question, we can use simulation to conduct a randomization test.

We will simulate behavior in the “null world” where there is no real effect of treatment. In that case, the 13 total improvers would have improved no matter the treatment assigned, and the 17 total non-improvers would have not improved no matter the treatment assigned. So in the “null world”, treatment is a meaningless label that can be just as easily shuffled among subjects without any effect. In that world, the fact we observed a 47 percentage point difference in success rates (67 - 20) was just random luck. But we should ask: how often would we expect a difference as large as 47% by chance, assuming we’re living in the null world where there is no effect of treatment?

You could think about simulating this situation with the following steps:

write code to calculate the difference in success rates in the observed data
write a loop to calculate the differences in success rates from 1000 simulated data sets from the null world. Store those 1000 simulated differences
calculate how often we found a difference in the null world as large as that found in the observed data. In statistics, when this probability is below .05, we typically reject the null world, and conclude that there is likely a real difference between the two groups (i.e. a “statistically significant” difference)

[Pause to Ponder:] Fill in Step 2 in the second R chunk below to carry out the three steps above. (The first R chunk provides some preliminary code.) Then describe what you can conclude from this study based on your plot and “p_value” from Step 3.

### Preliminary code ###

# generate a tibble with our observed data
dolphin_data <- tibble(treatment = rep(c("Dolphin", "Control"), each = 15),
                       improve = c(rep("Yes", 10), rep("No", 5), 
                                   rep("Yes", 3), rep("No", 12)))
print(dolphin_data, n = Inf)

# A tibble: 30 × 2
   treatment improve
   <chr>     <chr>  
 1 Dolphin   Yes    
 2 Dolphin   Yes    
 3 Dolphin   Yes    
 4 Dolphin   Yes    
 5 Dolphin   Yes    
 6 Dolphin   Yes    
 7 Dolphin   Yes    
 8 Dolphin   Yes    
 9 Dolphin   Yes    
10 Dolphin   Yes    
11 Dolphin   No     
12 Dolphin   No     
13 Dolphin   No     
14 Dolphin   No     
15 Dolphin   No     
16 Control   Yes    
17 Control   Yes    
18 Control   Yes    
19 Control   No     
20 Control   No     
21 Control   No     
22 Control   No     
23 Control   No     
24 Control   No     
25 Control   No     
26 Control   No     
27 Control   No     
28 Control   No     
29 Control   No     
30 Control   No

# `sample()` can be used to shuffle the treatments among the 30 subjects
sample(dolphin_data$treatment)

 [1] "Control" "Control" "Dolphin" "Control" "Dolphin" "Dolphin" "Control"
 [8] "Control" "Control" "Dolphin" "Control" "Dolphin" "Dolphin" "Control"
[15] "Dolphin" "Control" "Dolphin" "Control" "Control" "Control" "Control"
[22] "Dolphin" "Dolphin" "Dolphin" "Control" "Control" "Dolphin" "Dolphin"
[29] "Dolphin" "Dolphin"

### Fill in Step 2 and remove "eval: FALSE" ###

# Step 1
dolphin_summary <- dolphin_data |>
  group_by(treatment) |>
  summarize(prop_yes = mean(improve == "Yes"))
dolphin_summary
observed_diff <- dolphin_summary[[2]][2] - dolphin_summary[[2]][1]

# Step 2

### Write a loop to create 1000 simulated differences from the null world

# Step 3
null_world <- tibble(simulated_diffs = simulated_diffs)
ggplot(null_world, aes(x = simulated_diffs)) +
  geom_histogram() +
  geom_vline(xintercept = observed_diff, color = "red")

p_value <- sum(abs(simulated_diffs) >= abs(observed_diff)) / 1000
p_value

You have written code to conduct a randomization test for the difference in two proportions, a powerful test of statistical significance that is demonstrated in the original applet!

Functional programming for iteration

Examples: map functions and across()

Pros: less code, fewer errors, code that’s easier to read; takes advantage of fact that R is a functional programming language

Cons: little more complicated to master vocabulary and use – a step up in abstraction

R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly. Passing one function to another is a very powerful coding approach!!

# Below you can avoid writing separate functions for mean, median, 
#   SD, etc. by column
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)

[1] -0.003660362 -0.171796570  0.647822037 -0.029577492

col_summary(df, mean)

[1]  0.1603757 -0.1719900  0.4256136  0.0802427

col_summary(df, IQR)

[1] 1.831473 1.240390 1.108879 1.683681

The purrr package provides map functions to eliminate need for for loops, plus it makes code easier to read!

# using map functions for summary stats by column as above
map_dbl(df, mean)

         a          b          c          d 
 0.1603757 -0.1719900  0.4256136  0.0802427

map_dbl(df, median)

           a            b            c            d 
-0.003660362 -0.171796570  0.647822037 -0.029577492

map_dbl(df, sd)

        a         b         c         d 
1.5262516 1.1643800 0.9235822 1.2619116

map_dbl(df, mean, trim = 0.5)

           a            b            c            d 
-0.003660362 -0.171796570  0.647822037 -0.029577492

# map_dbl means make a double vector
# can also do map() for list, map_lgl(), map_int(), and map_chr()

# even more clear
df |> map_dbl(mean)

         a          b          c          d 
 0.1603757 -0.1719900  0.4256136  0.0802427

df |> map_dbl(median)

           a            b            c            d 
-0.003660362 -0.171796570  0.647822037 -0.029577492

df |> map_dbl(sd)

        a         b         c         d 
1.5262516 1.1643800 0.9235822 1.2619116

The across() function from dplyr also works well:

df |> summarize(
  n = n(),
  across(.cols = a:d, .fns = median, .names = "median_{.col}")
)

# A tibble: 1 × 5
      n median_a median_b median_c median_d
  <int>    <dbl>    <dbl>    <dbl>    <dbl>
1    10 -0.00366   -0.172    0.648  -0.0296

# other ways to repeat across the numeric columns of df:
df |> summarize(
  n = n(),
  across(everything(), median, .names = "median_{.col}")
)

# A tibble: 1 × 6
      n median_a median_b median_c median_d median_n
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <int>
1    10 -0.00366   -0.172    0.648  -0.0296       10

df |> summarize(
  n = n(),
  across(where(is.numeric), median, .names = "median_{.col}")
)

# A tibble: 1 × 6
      n median_a median_b median_c median_d median_n
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <int>
1    10 -0.00366   -0.172    0.648  -0.0296       10

# Here "across" effectively expands to the following code.  Note that 
#   across() will write over old columns unless you change the name!
df |> 
  summarize(
    median_a = median(a),
    median_b = median(b),
    median_c = median(c),
    median_d = median(d),
    n = n()
  )

# A tibble: 1 × 5
  median_a median_b median_c median_d     n
     <dbl>    <dbl>    <dbl>    <dbl> <int>
1 -0.00366   -0.172    0.648  -0.0296    10

# And if we're worried about NAs, we can't call median directly, we
#   must create a new function that we can pass options into
df_miss <- df
df_miss[2, 1] <- NA
df_miss[4:5, 2] <- NA
df_miss

# A tibble: 10 × 4
        a      b      c       d
    <dbl>  <dbl>  <dbl>   <dbl>
 1 -0.611 -0.924 -0.812  1.52  
 2 NA      0.603 -0.245  2.25  
 3  2.60  -0.196  0.745  1.22  
 4  0.281 NA      1.97  -1.47  
 5  0.968 NA     -1.06   0.0798
 6  1.90   0.239  1.05  -1.52  
 7 -2.18  -0.975  0.550  0.257 
 8 -1.62   1.97   0.200 -0.139 
 9 -0.288 -0.147  0.764 -0.715 
10  1.23  -2.43   1.10  -0.683

df_miss |> 
  summarize(
    across(
      a:d,
      list(
        median = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_{.col}"
    ),
    n = n(),
  )

# A tibble: 1 × 9
  median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d     n
     <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
1    0.281        1   -0.172        2    0.648        0  -0.0296        0    10

# where \ is shorthand for an anonymous function - i.e. you could
#   replace "\" with "function" if you like typing more letters :)

# across-like functions can also be used with filter():

# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
df_miss |> filter(if_any(a:d, is.na))

# A tibble: 3 × 4
       a      b      c       d
   <dbl>  <dbl>  <dbl>   <dbl>
1 NA      0.603 -0.245  2.25  
2  0.281 NA      1.97  -1.47  
3  0.968 NA     -1.06   0.0798

# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
df_miss |> filter(if_all(a:d, is.na))

# A tibble: 0 × 4
# ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>

When you input a list of functions (like the lubridate functions below), across() assigns default names as columnname_functionname:

library(lubridate)
expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

df_date <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

df_date |> 
  expand_dates()

# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Amy   2009-08-03      2009          8        3
2 Bob   2010-01-16      2010          1       16

Here is default is to summarize all numeric columns, but as with all functions, we can override the default if we choose:

summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |> 
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n(),
      .groups = "drop"
    )
}
diamonds |> 
  group_by(cut) |> 
  summarize_means()

# A tibble: 5 × 9
  cut       carat depth table price     x     y     z     n
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair      1.05   64.0  59.1 4359.  6.25  6.18  3.98  1610
2 Good      0.849  62.4  58.7 3929.  5.84  5.85  3.64  4906
3 Very Good 0.806  61.8  58.0 3982.  5.74  5.77  3.56 12082
4 Premium   0.892  61.3  58.7 4584.  5.97  5.94  3.65 13791
5 Ideal     0.703  61.7  56.0 3458.  5.51  5.52  3.40 21551

diamonds |> 
  group_by(cut) |> 
  summarize_means(c(carat, x:z))

# A tibble: 5 × 6
  cut       carat     x     y     z     n
  <ord>     <dbl> <dbl> <dbl> <dbl> <int>
1 Fair      1.05   6.25  6.18  3.98  1610
2 Good      0.849  5.84  5.85  3.64  4906
3 Very Good 0.806  5.74  5.77  3.56 12082
4 Premium   0.892  5.97  5.94  3.65 13791
5 Ideal     0.703  5.51  5.52  3.40 21551

pivot_longer() with group_by() and summarize() also provides a nice solution:

long <- df |> 
  pivot_longer(a:d) |> 
  group_by(name) |> 
  summarize(
    median = median(value),
    mean = mean(value)
  )
long

# A tibble: 4 × 3
  name    median    mean
  <chr>    <dbl>   <dbl>
1 a     -0.00366  0.160 
2 b     -0.172   -0.172 
3 c      0.648    0.426 
4 d     -0.0296   0.0802

Here are a couple of other nice features of map functions: - perform analyses (like fitting a line) by subgroup - extracting components from a model or elements by position

# fit linear model to each group based on cylinder
#   - split designed to split into new dfs (unlike group_by)
#   - map returns a vector or list, which can be limiting
map = purrr::map
models <- split(mtcars, mtcars$cyl) |>
  map(function(df) lm(mpg ~ wt, data = df))
models

$`4`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`6`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`8`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     23.868       -2.192

models[[1]]


Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     39.571       -5.647

# shortcut using purrr - 1-sided formulas
models <- split(mtcars, mtcars$cyl) |> 
  map(~lm(mpg ~ wt, data = .))
models

$`4`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`6`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`8`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     23.868       -2.192

# extract named components from each model
str(models)

List of 3
 $ 4:List of 12
  ..$ coefficients : Named num [1:2] 39.57 -5.65
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  .. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  ..$ effects      : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
  .. ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
  .. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.3 1.5
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 9
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000025784e90880> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   11 obs. of  2 variables:
  .. ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  .. ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000025784e90880> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
 $ 6:List of 12
  ..$ coefficients : Named num [1:2] 28.41 -2.78
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:7] -0.125 0.584 1.929 -0.69 0.355 ...
  .. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  ..$ effects      : Named num [1:7] -52.235 -2.427 2.111 -0.353 0.679 ...
  .. ..- attr(*, "names")= chr [1:7] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:7] 21.1 20.4 19.5 18.8 18.8 ...
  .. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:7, 1:2] -2.646 0.378 0.378 0.378 0.378 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.38 1.12
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 5
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000025784f4b0a0> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   7 obs. of  2 variables:
  .. ..$ mpg: num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
  .. ..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000025784f4b0a0> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
 $ 8:List of 12
  ..$ coefficients : Named num [1:2] 23.87 -2.19
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:14] 2.374 -1.741 1.455 1.61 -0.381 ...
  .. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  ..$ effects      : Named num [1:14] -56.499 -6.003 0.816 1.22 -0.807 ...
  .. ..- attr(*, "names")= chr [1:14] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:14] 16.3 16 14.9 15.7 15.6 ...
  .. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:14, 1:2] -3.742 0.267 0.267 0.267 0.267 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.27 1.11
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 12
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000025785123760> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   14 obs. of  2 variables:
  .. ..$ mpg: num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
  .. ..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000025785123760> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"

str(models[[1]])

List of 12
 $ coefficients : Named num [1:2] 39.57 -5.65
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ effects      : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
  ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.3 1.5
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 9
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ wt, data = .)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x0000025784e90880> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ model        :'data.frame':  11 obs. of  2 variables:
  ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000025784e90880> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 - attr(*, "class")= chr "lm"

str(summary(models[[1]]))

List of 11
 $ call         : language lm(formula = mpg ~ wt, data = .)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x0000025784e90880> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ coefficients : num [1:2, 1:4] 39.57 -5.65 4.35 1.85 9.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ sigma        : num 3.33
 $ df           : int [1:3] 2 9 2
 $ r.squared    : num 0.509
 $ adj.r.squared: num 0.454
 $ fstatistic   : Named num [1:3] 9.32 1 9
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.701 -0.705 -0.705 0.308
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..$ : chr [1:2] "(Intercept)" "wt"
 - attr(*, "class")= chr "summary.lm"

models |>
  map(summary) |> 
  map_dbl("r.squared")

        4         6         8 
0.5086326 0.4645102 0.4229655

# can use integer to select elements by position
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x |> map_dbl(2)

[1] 2 5 8

Iterative techniques for reading multiple files

library(readxl)

# our usual path doesn't work with excel files
# read_excel("https://proback.github.io/264_fall_2024/Data/gapminder/1952.xlsx")

# this will work if .xlsx files live in a Data folder that's at the same
#   level as your .qmd file
gap1952 <- read_excel("Data/gapminder/1952.xlsx")
gap1957 <- read_excel("Data/gapminder/1957.xlsx")

Since the 1952 and 1957 data have the same 5 columns, if we want to combine this data into a single data set showing time trends, we could simply bind_rows() (after adding a 6th column for year)

gap1952 <- gap1952 |>
  mutate(year = 1952)
gap1957 <- gap1957 |>
  mutate(year = 1957)
gap_data <- bind_rows(gap1952, gap1957)

Of course, with 10 more years worth of data still left to read in and merge, this process could get pretty onerous. Section 26.3 shows how to automate this process in 3 steps:

use list.files() to list all the files in a directory

paths <- list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
paths

 [1] "Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx"
 [3] "Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx"
 [5] "Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx"
 [7] "Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx"
 [9] "Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx"
[11] "Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx"

use purrr::map() to read each of them into a list (we will discuss lists more in 06_data_types.qmd)

gap_files <- map(paths, readxl::read_excel)
length(gap_files)

[1] 12

str(gap_files)

List of 12
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 28.8 55.2 43.1 30 62.5 ...
  ..$ pop      : num [1:142] 8425333 1282697 9279525 4232095 17876956 ...
  ..$ gdpPercap: num [1:142] 779 1601 2449 3521 5911 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 30.3 59.3 45.7 32 64.4 ...
  ..$ pop      : num [1:142] 9240934 1476505 10270856 4561361 19610538 ...
  ..$ gdpPercap: num [1:142] 821 1942 3014 3828 6857 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 32 64.8 48.3 34 65.1 ...
  ..$ pop      : num [1:142] 10267083 1728137 11000948 4826015 21283783 ...
  ..$ gdpPercap: num [1:142] 853 2313 2551 4269 7133 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 34 66.2 51.4 36 65.6 ...
  ..$ pop      : num [1:142] 11537966 1984060 12760499 5247469 22934225 ...
  ..$ gdpPercap: num [1:142] 836 2760 3247 5523 8053 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 36.1 67.7 54.5 37.9 67.1 ...
  ..$ pop      : num [1:142] 13079460 2263554 14760787 5894858 24779799 ...
  ..$ gdpPercap: num [1:142] 740 3313 4183 5473 9443 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 38.4 68.9 58 39.5 68.5 ...
  ..$ pop      : num [1:142] 14880372 2509048 17152804 6162675 26983828 ...
  ..$ gdpPercap: num [1:142] 786 3533 4910 3009 10079 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 39.9 70.4 61.4 39.9 69.9 ...
  ..$ pop      : num [1:142] 12881816 2780097 20033753 7016384 29341374 ...
  ..$ gdpPercap: num [1:142] 978 3631 5745 2757 8998 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 40.8 72 65.8 39.9 70.8 ...
  ..$ pop      : num [1:142] 13867957 3075321 23254956 7874230 31620918 ...
  ..$ gdpPercap: num [1:142] 852 3739 5681 2430 9140 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 41.7 71.6 67.7 40.6 71.9 ...
  ..$ pop      : num [1:142] 16317921 3326498 26298373 8735988 33958947 ...
  ..$ gdpPercap: num [1:142] 649 2497 5023 2628 9308 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 41.8 73 69.2 41 73.3 ...
  ..$ pop      : num [1:142] 22227415 3428038 29072015 9875024 36203463 ...
  ..$ gdpPercap: num [1:142] 635 3193 4797 2277 10967 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 42.1 75.7 71 41 74.3 ...
  ..$ pop      : num [1:142] 25268405 3508512 31287142 10866106 38331121 ...
  ..$ gdpPercap: num [1:142] 727 4604 5288 2773 8798 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
  ..$ pop      : num [1:142] 31889923 3600523 33333216 12420476 40301927 ...
  ..$ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...

gap_files[[1]]   # pull off the first object in the list (i.e. 1952 data)

# A tibble: 142 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 132 more rows

use purrr::list_rbind() to combine them into a single data frame

gap_tidy <- list_rbind(gap_files)
class(gap_tidy)

[1] "tbl_df"     "tbl"        "data.frame"

gap_tidy

# A tibble: 1,704 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,694 more rows

We could even do all steps in a single pipeline:

list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
  map(readxl::read_excel) |>
  list_rbind()

# A tibble: 1,704 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,694 more rows

Note that we are lacking a 6th column with the year represented by each row of data. Here is one way to solve that issue:

# This extracts file names, which are carried along as data frame names by
#   map functions
paths |> set_names(basename)

                 1952.xlsx                  1957.xlsx 
"Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx" 
                 1962.xlsx                  1967.xlsx 
"Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx" 
                 1972.xlsx                  1977.xlsx 
"Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx" 
                 1982.xlsx                  1987.xlsx 
"Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx" 
                 1992.xlsx                  1997.xlsx 
"Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx" 
                 2002.xlsx                  2007.xlsx 
"Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx"

# The middle line ensures that each of the 12 data frames in the list for
#   gap_files has a name determined by its filepath, unlike the gap_files
#   we created in step 2 above, which had no names (we could only identify
#   data frames by their position)
gap_files <- paths |> 
  set_names(basename) |>  
  map(readxl::read_excel)

# Now we can extract a particular year by its name:
gap_files[["1962.xlsx"]]

# A tibble: 142 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         32.0 10267083      853.
 2 Albania     Europe       64.8  1728137     2313.
 3 Algeria     Africa       48.3 11000948     2551.
 4 Angola      Africa       34    4826015     4269.
 5 Argentina   Americas     65.1 21283783     7133.
 6 Australia   Oceania      70.9 10794968    12217.
 7 Austria     Europe       69.5  7129864    10751.
 8 Bahrain     Asia         56.9   171863    12753.
 9 Bangladesh  Asia         41.2 56839289      686.
10 Belgium     Europe       70.2  9218400    10991.
# ℹ 132 more rows

# Finally, take advantage of the `names_to` argument in list_rbind to 
#   create that 6th column with `year`
gap_tidy <- paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))

You could then save your result using write_csv so you don’t have to run the reading and wrangling code every time!

On Your Own

Compute the mean of every column of the mtcars data set using (a) a for loop, (b) a map function, (c) the across() function, and (d) pivot_longer().
Write a function that prints the mean of each numeric column in a data frame. Try it on the iris data set. (Hint: keep(is.numeric))
Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}
out

[1] "abcdefghijklmnopqrstuvwxyz"

x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}
out

  [1]  0.7659542  1.2170385  1.4631939  2.0642360  2.9087141  3.2263928
  [7]  3.6097007  3.8137376  4.5678801  5.4855052  5.8603469  6.5126671
 [13]  7.0896233  7.9390025  8.8806870  8.9509445  9.4712987 10.0536424
 [19] 10.7498980 11.0469931 11.4688280 11.5515210 11.8689802 12.5501227
 [25] 12.7265628 13.4051848 14.2168808 14.3068506 15.0218993 15.5253211
 [31] 16.4996536 16.9528035 16.9537413 17.9148991 18.0475934 18.8577373
 [37] 18.9987444 19.2107286 19.3617838 20.0185779 20.2254060 20.8054480
 [43] 21.3095485 22.0575582 23.0281798 23.3801044 23.7102771 24.1054422
 [49] 24.5924026 24.7837967 24.7838145 25.4549247 26.0966358 26.8904540
 [55] 27.6074262 28.2528229 28.5076951 29.3567196 30.1808465 30.3444989
 [61] 30.7537603 31.1689662 31.9023482 32.1775384 32.5465328 32.9278041
 [67] 33.1765324 34.1210389 35.1054234 35.5651167 35.9227279 36.7737112
 [73] 37.6588950 38.3155885 39.1752387 39.5529214 40.4191225 40.4736776
 [79] 41.0140061 41.8181455 42.1813463 43.1254625 43.8479465 43.9401933
 [85] 44.5191770 45.1500025 46.0941437 46.1888272 46.5240323 47.1761070
 [91] 47.4068767 48.3708435 48.9851770 49.2902708 49.3857626 50.2547584
 [97] 50.4076073 50.8591056 51.8306170 52.0590785

Compute the number of unique values in each column of the iris data set using at least 2 of your favorite iteration methods. Bonus points if you can use pivot_longer()!
Carefully explain each step in the pipeline below:

show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))

# A tibble: 365 × 9
    year month   day dep_time dep_delay arr_time arr_delay tailnum air_time
   <int> <int> <int>    <int>     <int>    <int>     <int>   <int>    <int>
 1  2013     1     1        4         4        5        11       0       11
 2  2013     1     2        8         8       10        15       2       15
 3  2013     1     3       10        10       10        14       2       14
 4  2013     1     4        6         6        6         7       2        7
 5  2013     1     5        3         3        3         3       1        3
 6  2013     1     6        1         1        1         3       0        3
 7  2013     1     7        3         3        3         3       1        3
 8  2013     1     8        4         4        4         7       1        7
 9  2013     1     9        5         5        7         9       2        9
10  2013     1    10        3         3        3         3       2        3
# ℹ 355 more rows

Write a function called summary_stats() that allows a user to input a tibble, numeric variables in that tibble, and summary statistics that they would like to see for each variable. Using across(), the function’s output should look like the example below.

summary_stats(mtcars, 
              vars = c(mpg, hp, wt), 
              stat_fcts = list(mean = mean, 
                               median = median, 
                               sd = sd, 
                               IQR = IQR))

#  mpg_mean mpg_median   mpg_sd mpg_IQR  hp_mean hp_median    hp_sd hp_IQR
#1 20.09062       19.2 6.026948   7.375 146.6875       123 68.56287   83.5
#  wt_mean wt_median     wt_sd  wt_IQR  n
#1 3.21725     3.325 0.9784574 1.02875 32

The power of a statistical test is the probability that it rejects the null hypothesis when the null hypothesis is false. In other words, it’s the probability that a statistical test can detect when a true difference exists. The power depends on a number of factors, including:

sample size
type I error level (probability of declaring there is a statistically significant difference when there really isn’t)
variability in the data
size of the true difference

The following steps can be followed to simulate a power calculation using iteration techniques:

generate simulated data where is a true difference or effect
run your desired test on the simulated data and record if the null hypothesis was rejected or not (i.e. if the p-value was below .05)
repeat (a)-(b) a large number of times and record the total proportion of times that the null hypothesis was rejected; that proportion is the power of your test under those conditions

Create a power curve for a two-sample t-test by filling in Step C below and then removing eval: FALSE:

# Step A

# set parameters for two-sample t-test
mean1 <- 100   # mean response in Group 1
truediff <- 5    # true mean difference between Groups 1 and 2
mean2 <- mean1 + truediff   # mean response in Group 2
sd1 <- 10   # standard deviation in Group 1
sd2 <- 10   # standard deviation in Group 2
n1 <- 20    # sample size in Group 1
n2 <- 20    # sample size in Group 2
numsims <- 1000   # number of simulations (iterations) to run

# generate sample data for Groups 1 and 2 based on normal distributions
#   with the parameters above (note that there is truly a difference in means!)
samp1 <- rnorm(n1, mean1, sd1)
samp2 <- rnorm(n2, mean2, sd2)

# organize the simulated data into a tibble
sim_data <- tibble(response = c(samp1, samp2), 
       group = c(rep("Group 1", n1), rep("Group 2", n2)))
sim_data

# Step B

# exploratory analysis of the simulated data
mosaic::favstats(response ~ group, data = sim_data)
ggplot(sim_data, aes(x = response, y = group)) +
  geom_boxplot()

# run a two-sample t-test to see if there is a significant difference
#   in means between Groups 1 and 2 (i.e. is the p-value < .05?)
p_value <- t.test(x = samp1, y = samp2)$p.value
p_value
p_value < .05   # if TRUE, then we reject the null hypothesis and conclude
                #   there is a statistically significant difference

# Step C

# find the power = proportion of time null is rejected when
#   true difference is not 0 (i.e. number of simulated data sets that
#   result in p-values below .05)