Iteration

You can download this .qmd file from here. Just hit the Download Raw File button.

This leans on parts of R4DS Chapter 26: Iteration, in addition to parts of the first edition of R4DS.

# Initial packages required
library(tidyverse)

Iteration

Reducing duplication of code will reduce errors and make debugging much easier. We’ve already seen how functions (Ch 25) can help reduce duplication by extracting repeated patterns of code. Another tool is iteration, when you find you’re doing the same thing to multiple inputs – repeating the same operation on different columns or datasets.

Here we’ll see two important iteration paradigms: imperative programming and functional programming.

Imperation programming for iteration

Examples: for loops and while loops

Pros: relatively easy to learn, make iteration very explicit so it’s obvious what’s happening, not as inefficient as some people believe

Cons: require lots of bookkeeping code that’s duplicated for every loop

Every for loop has three components:

  1. output - plan ahead and allocate enough space for output
  2. sequence - determines what to loop over; cycles through different values of \(i\)
  3. body - code that does the work; run repeatedly with different values of \(i\)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df
# A tibble: 10 × 4
        a      b      c       d
    <dbl>  <dbl>  <dbl>   <dbl>
 1 -0.502 -0.668  0.396  0.653 
 2  1.87   0.363 -0.644 -0.816 
 3 -0.943 -0.737  1.01  -2.81  
 4  0.617  0.174 -0.214 -0.0285
 5 -0.850 -1.22   1.36   0.564 
 6 -0.848  0.598 -0.247 -0.0209
 7  0.162  1.24   0.196  1.57  
 8  0.248 -0.366  0.983 -1.25  
 9 -0.286 -1.09   0.143 -1.46  
10  0.459  1.61   0.566  0.863 
# want median of each column (w/o cutting and pasting)
#   Be careful using square brackets vs double square brackets when
#   selecting elements
median(df[[1]])
[1] -0.06208288
median(df[1])
Error in median.default(df[1]): need numeric data
df[1]
# A tibble: 10 × 1
        a
    <dbl>
 1 -0.502
 2  1.87 
 3 -0.943
 4  0.617
 5 -0.850
 6 -0.848
 7  0.162
 8  0.248
 9 -0.286
10  0.459
df[[1]]
 [1] -0.5023795  1.8742121 -0.9428545  0.6167671 -0.8504647 -0.8482801
 [7]  0.1619869  0.2481099 -0.2861527  0.4591192
class(df[1])
[1] "tbl_df"     "tbl"        "data.frame"
class(df[[1]])
[1] "numeric"
# basic for loop to take median of each column
output <- vector("double", ncol(df))  # 1. output
for (i in 1:4) {                      # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
[1] -0.06208288 -0.09602160  0.29629139 -0.02472320
# ?seq_along - a safer option if had zero length vectors
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
[1] -0.06208288 -0.09602160  0.29629139 -0.02472320
# use [[.]] even if don't have to to signal working with single elements

# alternative solution - don't hardcode in "4"
output <- vector("double", ncol(df))  # 1. output
for(i in 1:ncol(df)) {                # 2. sequence
  output[i] <- median(df[[i]])        # 3. body
}
output
[1] -0.06208288 -0.09602160  0.29629139 -0.02472320
# another approach - no double square brackets since df not a tibble
df <- as.data.frame(df)
output <- vector("double", ncol(df))  # 1. output
for(i in 1:ncol(df)) {                # 2. sequence
  output[i] <- median(df[,i])         # 3. body
}
output
[1] -0.06208288 -0.09602160  0.29629139 -0.02472320

One advantage of seq_along(): works with unknown output length. However, the second approach below is much more efficient, since each iteration doesn’t copy all data from previous iterations.

[Pause to Ponder:] What does the code below do? Be prepared to explain both chunks line-by-line!

# for loop: unknown output length

means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)        ## inefficient
 num [1:214] -0.306 -1.187 0.183 1.271 0.917 ...
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)           ## more efficient
List of 3
 $ : num [1:62] 0.938 0.133 -2.019 1.091 -1.101 ...
 $ : num [1:43] 2.0109 1.0148 2.5274 -0.0122 0.3437 ...
 $ : num [1:74] 1.93 2.76 2.95 1.32 1.78 ...
str(unlist(out))   ## flatten list of vectors into single vector
 num [1:179] 0.938 0.133 -2.019 1.091 -1.101 ...

Finally, the while() loop can be used with unknown sequence length. This is used more in simulation than in data analysis.

[Pause to Ponder:] What does the following code do?

flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
[1] 4

Using iteration for simulation

This applet contains data from a 2005 study on the use of dolphin-facilitated therapy on the treatment of depression. In that study, 10 of the 15 subjects (67%) assigned to dolphin therapy showed improvement, compared to only 3 of the 15 subjects (20%) assigned to the control group. But with such small sample sizes, is this significant evidence that the dolphin group had greater improvement of their depressive symptoms? To answer that question, we can use simulation to conduct a randomization test.

We will simulate behavior in the “null world” where there is no real effect of treatment. In that case, the 13 total improvers would have improved no matter the treatment assigned, and the 17 total non-improvers would have not improved no matter the treatment assigned. So in the “null world”, treatment is a meaningless label that can be just as easily shuffled among subjects without any effect. In that world, the fact we observed a 47 percentage point difference in success rates (67 - 20) was just random luck. But we should ask: how often would we expect a difference as large as 47% by chance, assuming we’re living in the null world where there is no effect of treatment?

You could think about simulating this situation with the following steps:

  1. write code to calculate the difference in success rates in the observed data

  2. write a loop to calculate the differences in success rates from 1000 simulated data sets from the null world. Store those 1000 simulated differences

  3. calculate how often we found a difference in the null world as large as that found in the observed data. In statistics, when this probability is below .05, we typically reject the null world, and conclude that there is likely a real difference between the two groups (i.e. a “statistically significant” difference)

[Pause to Ponder:] Fill in Step 2 in the second R chunk below to carry out the three steps above. (The first R chunk provides some preliminary code.) Then describe what you can conclude from this study based on your plot and “p_value” from Step 3.

### Preliminary code ###

# generate a tibble with our observed data
dolphin_data <- tibble(treatment = rep(c("Dolphin", "Control"), each = 15),
                       improve = c(rep("Yes", 10), rep("No", 5), 
                                   rep("Yes", 3), rep("No", 12)))
print(dolphin_data, n = Inf)
# A tibble: 30 × 2
   treatment improve
   <chr>     <chr>  
 1 Dolphin   Yes    
 2 Dolphin   Yes    
 3 Dolphin   Yes    
 4 Dolphin   Yes    
 5 Dolphin   Yes    
 6 Dolphin   Yes    
 7 Dolphin   Yes    
 8 Dolphin   Yes    
 9 Dolphin   Yes    
10 Dolphin   Yes    
11 Dolphin   No     
12 Dolphin   No     
13 Dolphin   No     
14 Dolphin   No     
15 Dolphin   No     
16 Control   Yes    
17 Control   Yes    
18 Control   Yes    
19 Control   No     
20 Control   No     
21 Control   No     
22 Control   No     
23 Control   No     
24 Control   No     
25 Control   No     
26 Control   No     
27 Control   No     
28 Control   No     
29 Control   No     
30 Control   No     
# `sample()` can be used to shuffle the treatments among the 30 subjects
sample(dolphin_data$treatment)
 [1] "Control" "Dolphin" "Dolphin" "Dolphin" "Dolphin" "Control" "Control"
 [8] "Control" "Dolphin" "Dolphin" "Control" "Control" "Dolphin" "Dolphin"
[15] "Control" "Control" "Dolphin" "Dolphin" "Control" "Dolphin" "Control"
[22] "Dolphin" "Dolphin" "Dolphin" "Control" "Control" "Dolphin" "Control"
[29] "Control" "Control"
### Fill in Step 2 and remove "eval: FALSE" ###

# Step 1
dolphin_summary <- dolphin_data |>
  group_by(treatment) |>
  summarize(prop_yes = mean(improve == "Yes"))
dolphin_summary
observed_diff <- dolphin_summary[[2]][2] - dolphin_summary[[2]][1]

# Step 2

### Write a loop to create 1000 simulated differences from the null world

# Step 3
null_world <- tibble(simulated_diffs = simulated_diffs)
ggplot(null_world, aes(x = simulated_diffs)) +
  geom_histogram() +
  geom_vline(xintercept = observed_diff, color = "red")

p_value <- sum(abs(simulated_diffs) >= abs(observed_diff)) / 1000
p_value

You have written code to conduct a randomization test for the difference in two proportions, a powerful test of statistical significance that is demonstrated in the original applet!

Functional programming for iteration

Examples: map functions and across()

Pros: less code, fewer errors, code that’s easier to read; takes advantage of fact that R is a functional programming language

Cons: little more complicated to master vocabulary and use – a step up in abstraction

R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly. Passing one function to another is a very powerful coding approach!!

# Below you can avoid writing separate functions for mean, median, 
#   SD, etc. by column
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
[1] -0.1952978 -0.2067583  0.2349411  0.6016533
col_summary(df, mean)
[1] -0.07689049 -0.15886096  0.08754637  0.51398147
col_summary(df, IQR)
[1] 1.094586 1.656958 1.283876 1.020525

The purrr package provides map functions to eliminate need for for loops, plus it makes code easier to read!

# using map functions for summary stats by column as above
map_dbl(df, mean)
          a           b           c           d 
-0.07689049 -0.15886096  0.08754637  0.51398147 
map_dbl(df, median)
         a          b          c          d 
-0.1952978 -0.2067583  0.2349411  0.6016533 
map_dbl(df, sd)
        a         b         c         d 
0.9438331 0.9889397 0.9287586 0.7635153 
map_dbl(df, mean, trim = 0.5)
         a          b          c          d 
-0.1952978 -0.2067583  0.2349411  0.6016533 
# map_dbl means make a double vector
# can also do map() for list, map_lgl(), map_int(), and map_chr()

# even more clear
df |> map_dbl(mean)
          a           b           c           d 
-0.07689049 -0.15886096  0.08754637  0.51398147 
df |> map_dbl(median)
         a          b          c          d 
-0.1952978 -0.2067583  0.2349411  0.6016533 
df |> map_dbl(sd)
        a         b         c         d 
0.9438331 0.9889397 0.9287586 0.7635153 

The across() function from dplyr also works well:

df |> summarize(
  n = n(),
  across(.cols = a:d, .fns = median, .names = "median_{.col}")
)
# A tibble: 1 × 5
      n median_a median_b median_c median_d
  <int>    <dbl>    <dbl>    <dbl>    <dbl>
1    10   -0.195   -0.207    0.235    0.602
# other ways to repeat across the numeric columns of df:
df |> summarize(
  n = n(),
  across(everything(), median, .names = "median_{.col}")
)
# A tibble: 1 × 6
      n median_a median_b median_c median_d median_n
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <int>
1    10   -0.195   -0.207    0.235    0.602       10
df |> summarize(
  n = n(),
  across(where(is.numeric), median, .names = "median_{.col}")
)
# A tibble: 1 × 6
      n median_a median_b median_c median_d median_n
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <int>
1    10   -0.195   -0.207    0.235    0.602       10
# Here "across" effectively expands to the following code.  Note that 
#   across() will write over old columns unless you change the name!
df |> 
  summarize(
    median_a = median(a),
    median_b = median(b),
    median_c = median(c),
    median_d = median(d),
    n = n()
  )
# A tibble: 1 × 5
  median_a median_b median_c median_d     n
     <dbl>    <dbl>    <dbl>    <dbl> <int>
1   -0.195   -0.207    0.235    0.602    10
# And if we're worried about NAs, we can't call median directly, we
#   must create a new function that we can pass options into
df_miss <- df
df_miss[2, 1] <- NA
df_miss[4:5, 2] <- NA
df_miss
# A tibble: 10 × 4
        a      b      c      d
    <dbl>  <dbl>  <dbl>  <dbl>
 1 -0.456  0.594  0.901  0.547
 2 NA     -0.685 -0.815  1.01 
 3 -0.966  0.932  0.634 -0.718
 4  0.697 NA     -0.470  1.01 
 5  0.579 NA      0.228 -0.184
 6 -0.195 -0.935  0.317  1.69 
 7 -1.95   0.990 -1.14   0.521
 8 -0.338 -0.533  0.242  0.656
 9  1.07  -1.80  -0.882  1.10 
10 -0.195  0.119  1.86  -0.490
df_miss |> 
  summarize(
    across(
      a:d,
      list(
        median = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_{.col}"
    ),
    n = n(),
  )
# A tibble: 1 × 9
  median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d     n
     <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
1   -0.195        1   -0.207        2    0.235        0    0.602        0    10
# where \ is shorthand for an anonymous function - i.e. you could
#   replace "\" with "function" if you like typing more letters :)

# across-like functions can also be used with filter():

# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
df_miss |> filter(if_any(a:d, is.na))
# A tibble: 3 × 4
       a      b      c      d
   <dbl>  <dbl>  <dbl>  <dbl>
1 NA     -0.685 -0.815  1.01 
2  0.697 NA     -0.470  1.01 
3  0.579 NA      0.228 -0.184
# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
df_miss |> filter(if_all(a:d, is.na))
# A tibble: 0 × 4
# ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>

When you input a list of functions (like the lubridate functions below), across() assigns default names as columnname_functionname:

library(lubridate)
expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

df_date <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

df_date |> 
  expand_dates()
# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Amy   2009-08-03      2009          8        3
2 Bob   2010-01-16      2010          1       16

Here is default is to summarize all numeric columns, but as with all functions, we can override the default if we choose:

summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |> 
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n(),
      .groups = "drop"
    )
}
diamonds |> 
  group_by(cut) |> 
  summarize_means()
# A tibble: 5 × 9
  cut       carat depth table price     x     y     z     n
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair      1.05   64.0  59.1 4359.  6.25  6.18  3.98  1610
2 Good      0.849  62.4  58.7 3929.  5.84  5.85  3.64  4906
3 Very Good 0.806  61.8  58.0 3982.  5.74  5.77  3.56 12082
4 Premium   0.892  61.3  58.7 4584.  5.97  5.94  3.65 13791
5 Ideal     0.703  61.7  56.0 3458.  5.51  5.52  3.40 21551
diamonds |> 
  group_by(cut) |> 
  summarize_means(c(carat, x:z))
# A tibble: 5 × 6
  cut       carat     x     y     z     n
  <ord>     <dbl> <dbl> <dbl> <dbl> <int>
1 Fair      1.05   6.25  6.18  3.98  1610
2 Good      0.849  5.84  5.85  3.64  4906
3 Very Good 0.806  5.74  5.77  3.56 12082
4 Premium   0.892  5.97  5.94  3.65 13791
5 Ideal     0.703  5.51  5.52  3.40 21551

pivot_longer() with group_by() and summarize() also provides a nice solution:

long <- df |> 
  pivot_longer(a:d) |> 
  group_by(name) |> 
  summarize(
    median = median(value),
    mean = mean(value)
  )
long
# A tibble: 4 × 3
  name  median    mean
  <chr>  <dbl>   <dbl>
1 a     -0.195 -0.0769
2 b     -0.207 -0.159 
3 c      0.235  0.0875
4 d      0.602  0.514 

Here are a couple of other nice features of map functions: - perform analyses (like fitting a line) by subgroup - extracting components from a model or elements by position

# fit linear model to each group based on cylinder
#   - split designed to split into new dfs (unlike group_by)
#   - map returns a vector or list, which can be limiting
map = purrr::map
models <- split(mtcars, mtcars$cyl) |>
  map(function(df) lm(mpg ~ wt, data = df))
models
$`4`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`6`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`8`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  
models[[1]]

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  
# shortcut using purrr - 1-sided formulas
models <- split(mtcars, mtcars$cyl) |> 
  map(~lm(mpg ~ wt, data = .))
models
$`4`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`6`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`8`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  
# extract named components from each model
str(models)
List of 3
 $ 4:List of 12
  ..$ coefficients : Named num [1:2] 39.57 -5.65
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  .. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  ..$ effects      : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
  .. ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
  .. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.3 1.5
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 9
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   11 obs. of  2 variables:
  .. ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  .. ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
 $ 6:List of 12
  ..$ coefficients : Named num [1:2] 28.41 -2.78
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:7] -0.125 0.584 1.929 -0.69 0.355 ...
  .. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  ..$ effects      : Named num [1:7] -52.235 -2.427 2.111 -0.353 0.679 ...
  .. ..- attr(*, "names")= chr [1:7] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:7] 21.1 20.4 19.5 18.8 18.8 ...
  .. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:7, 1:2] -2.646 0.378 0.378 0.378 0.378 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.38 1.12
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 5
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3066b308> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   7 obs. of  2 variables:
  .. ..$ mpg: num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
  .. ..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3066b308> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
 $ 8:List of 12
  ..$ coefficients : Named num [1:2] 23.87 -2.19
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:14] 2.374 -1.741 1.455 1.61 -0.381 ...
  .. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  ..$ effects      : Named num [1:14] -56.499 -6.003 0.816 1.22 -0.807 ...
  .. ..- attr(*, "names")= chr [1:14] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:14] 16.3 16 14.9 15.7 15.6 ...
  .. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:14, 1:2] -3.742 0.267 0.267 0.267 0.267 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.27 1.11
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 12
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3084fe80> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   14 obs. of  2 variables:
  .. ..$ mpg: num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
  .. ..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c3084fe80> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
str(models[[1]])
List of 12
 $ coefficients : Named num [1:2] 39.57 -5.65
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ effects      : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
  ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.3 1.5
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 9
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ wt, data = .)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ model        :'data.frame':  11 obs. of  2 variables:
  ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 - attr(*, "class")= chr "lm"
str(summary(models[[1]]))
List of 11
 $ call         : language lm(formula = mpg ~ wt, data = .)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x0000015c30619290> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ coefficients : num [1:2, 1:4] 39.57 -5.65 4.35 1.85 9.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ sigma        : num 3.33
 $ df           : int [1:3] 2 9 2
 $ r.squared    : num 0.509
 $ adj.r.squared: num 0.454
 $ fstatistic   : Named num [1:3] 9.32 1 9
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.701 -0.705 -0.705 0.308
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..$ : chr [1:2] "(Intercept)" "wt"
 - attr(*, "class")= chr "summary.lm"
models |>
  map(summary) |> 
  map_dbl("r.squared")
        4         6         8 
0.5086326 0.4645102 0.4229655 
# can use integer to select elements by position
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x |> map_dbl(2)
[1] 2 5 8

Iterative techniques for reading multiple files

library(readxl)

# our usual path doesn't work with excel files
# read_excel("https://proback.github.io/264_fall_2024/Data/gapminder/1952.xlsx")

# this will work if .xlsx files live in a Data folder that's at the same
#   level as your .qmd file
gap1952 <- read_excel("Data/gapminder/1952.xlsx")
gap1957 <- read_excel("Data/gapminder/1957.xlsx")

Since the 1952 and 1957 data have the same 5 columns, if we want to combine this data into a single data set showing time trends, we could simply bind_rows() (after adding a 6th column for year)

gap1952 <- gap1952 |>
  mutate(year = 1952)
gap1957 <- gap1957 |>
  mutate(year = 1957)
gap_data <- bind_rows(gap1952, gap1957)

Of course, with 10 more years worth of data still left to read in and merge, this process could get pretty onerous. Section 26.3 shows how to automate this process in 3 steps:

  1. use list.files() to list all the files in a directory
paths <- list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
paths
 [1] "Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx"
 [3] "Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx"
 [5] "Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx"
 [7] "Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx"
 [9] "Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx"
[11] "Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx"
  1. use purrr::map() to read each of them into a list (we will discuss lists more in 06_data_types.qmd)
gap_files <- map(paths, readxl::read_excel)
length(gap_files)
[1] 12
str(gap_files)
List of 12
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 28.8 55.2 43.1 30 62.5 ...
  ..$ pop      : num [1:142] 8425333 1282697 9279525 4232095 17876956 ...
  ..$ gdpPercap: num [1:142] 779 1601 2449 3521 5911 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 30.3 59.3 45.7 32 64.4 ...
  ..$ pop      : num [1:142] 9240934 1476505 10270856 4561361 19610538 ...
  ..$ gdpPercap: num [1:142] 821 1942 3014 3828 6857 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 32 64.8 48.3 34 65.1 ...
  ..$ pop      : num [1:142] 10267083 1728137 11000948 4826015 21283783 ...
  ..$ gdpPercap: num [1:142] 853 2313 2551 4269 7133 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 34 66.2 51.4 36 65.6 ...
  ..$ pop      : num [1:142] 11537966 1984060 12760499 5247469 22934225 ...
  ..$ gdpPercap: num [1:142] 836 2760 3247 5523 8053 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 36.1 67.7 54.5 37.9 67.1 ...
  ..$ pop      : num [1:142] 13079460 2263554 14760787 5894858 24779799 ...
  ..$ gdpPercap: num [1:142] 740 3313 4183 5473 9443 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 38.4 68.9 58 39.5 68.5 ...
  ..$ pop      : num [1:142] 14880372 2509048 17152804 6162675 26983828 ...
  ..$ gdpPercap: num [1:142] 786 3533 4910 3009 10079 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 39.9 70.4 61.4 39.9 69.9 ...
  ..$ pop      : num [1:142] 12881816 2780097 20033753 7016384 29341374 ...
  ..$ gdpPercap: num [1:142] 978 3631 5745 2757 8998 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 40.8 72 65.8 39.9 70.8 ...
  ..$ pop      : num [1:142] 13867957 3075321 23254956 7874230 31620918 ...
  ..$ gdpPercap: num [1:142] 852 3739 5681 2430 9140 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 41.7 71.6 67.7 40.6 71.9 ...
  ..$ pop      : num [1:142] 16317921 3326498 26298373 8735988 33958947 ...
  ..$ gdpPercap: num [1:142] 649 2497 5023 2628 9308 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 41.8 73 69.2 41 73.3 ...
  ..$ pop      : num [1:142] 22227415 3428038 29072015 9875024 36203463 ...
  ..$ gdpPercap: num [1:142] 635 3193 4797 2277 10967 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 42.1 75.7 71 41 74.3 ...
  ..$ pop      : num [1:142] 25268405 3508512 31287142 10866106 38331121 ...
  ..$ gdpPercap: num [1:142] 727 4604 5288 2773 8798 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
  ..$ pop      : num [1:142] 31889923 3600523 33333216 12420476 40301927 ...
  ..$ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
gap_files[[1]]   # pull off the first object in the list (i.e. 1952 data)
# A tibble: 142 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 132 more rows
  1. use purrr::list_rbind() to combine them into a single data frame
gap_tidy <- list_rbind(gap_files)
class(gap_tidy)
[1] "tbl_df"     "tbl"        "data.frame"
gap_tidy
# A tibble: 1,704 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,694 more rows

We could even do all steps in a single pipeline:

list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
  map(readxl::read_excel) |>
  list_rbind()
# A tibble: 1,704 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,694 more rows

Note that we are lacking a 6th column with the year represented by each row of data. Here is one way to solve that issue:

# This extracts file names, which are carried along as data frame names by
#   map functions
paths |> set_names(basename)
                 1952.xlsx                  1957.xlsx 
"Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx" 
                 1962.xlsx                  1967.xlsx 
"Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx" 
                 1972.xlsx                  1977.xlsx 
"Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx" 
                 1982.xlsx                  1987.xlsx 
"Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx" 
                 1992.xlsx                  1997.xlsx 
"Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx" 
                 2002.xlsx                  2007.xlsx 
"Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx" 
# The middle line ensures that each of the 12 data frames in the list for
#   gap_files has a name determined by its filepath, unlike the gap_files
#   we created in step 2 above, which had no names (we could only identify
#   data frames by their position)
gap_files <- paths |> 
  set_names(basename) |>  
  map(readxl::read_excel)

# Now we can extract a particular year by its name:
gap_files[["1962.xlsx"]]
# A tibble: 142 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         32.0 10267083      853.
 2 Albania     Europe       64.8  1728137     2313.
 3 Algeria     Africa       48.3 11000948     2551.
 4 Angola      Africa       34    4826015     4269.
 5 Argentina   Americas     65.1 21283783     7133.
 6 Australia   Oceania      70.9 10794968    12217.
 7 Austria     Europe       69.5  7129864    10751.
 8 Bahrain     Asia         56.9   171863    12753.
 9 Bangladesh  Asia         41.2 56839289      686.
10 Belgium     Europe       70.2  9218400    10991.
# ℹ 132 more rows
# Finally, take advantage of the `names_to` argument in list_rbind to 
#   create that 6th column with `year`
gap_tidy <- paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))

You could then save your result using write_csv so you don’t have to run the reading and wrangling code every time!

On Your Own

  1. Compute the mean of every column of the mtcars data set using (a) a for loop, (b) a map function, (c) the across() function, and (d) pivot_longer().

  2. Write a function that prints the mean of each numeric column in a data frame. Try it on the iris data set. (Hint: keep(is.numeric))

  3. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}
out
[1] "abcdefghijklmnopqrstuvwxyz"
x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}
out
  [1]  0.2506384  0.9759137  1.5069357  2.4550164  3.0671610  3.4692085
  [7]  3.5170362  3.5680981  4.5310981  5.4915534  6.3652519  6.6017948
 [13]  7.3308325  7.4322275  7.9820764  8.4232124  8.9626225  9.4099158
 [19]  9.8411847  9.9983242 10.9370628 11.2184210 11.7131817 11.9979545
 [25] 12.7336841 13.4807135 13.5142562 13.9353312 14.3954739 14.5238460
 [31] 15.2531725 15.3331256 16.0800002 16.9047467 17.7124190 18.5836777
 [37] 19.2685514 20.2217972 20.7590128 21.7258832 22.3571406 23.0150318
 [43] 23.0555192 23.2183946 23.8984369 24.7742049 24.7830053 25.6539862
 [49] 26.0658080 26.1555196 26.4515061 26.5169040 27.4832727 28.0914213
 [55] 29.0909732 29.9250165 30.1126262 30.4355826 30.4838046 31.3994133
 [61] 31.9182786 32.6152841 33.5040339 34.0445952 34.9451669 35.5634332
 [67] 36.3727143 36.5811334 36.7382437 37.5619114 38.3995447 38.4714948
 [73] 38.8461349 39.8411913 40.3614437 40.4626199 41.1813205 41.6758823
 [79] 42.4902015 43.0768056 43.1641257 43.6831914 43.7996635 44.2088854
 [85] 45.0964847 45.9281900 46.3899264 46.8555729 47.3223114 48.0272623
 [91] 48.8482467 49.7457907 50.4621209 50.6371379 51.4485006 51.4670713
 [97] 52.4013657 52.8815296 53.1895940 53.9850197
  1. Compute the number of unique values in each column of the iris data set using at least 2 of your favorite iteration methods. Bonus points if you can use pivot_longer()!

  2. Carefully explain each step in the pipeline below:

show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))
# A tibble: 365 × 9
    year month   day dep_time dep_delay arr_time arr_delay tailnum air_time
   <int> <int> <int>    <int>     <int>    <int>     <int>   <int>    <int>
 1  2013     1     1        4         4        5        11       0       11
 2  2013     1     2        8         8       10        15       2       15
 3  2013     1     3       10        10       10        14       2       14
 4  2013     1     4        6         6        6         7       2        7
 5  2013     1     5        3         3        3         3       1        3
 6  2013     1     6        1         1        1         3       0        3
 7  2013     1     7        3         3        3         3       1        3
 8  2013     1     8        4         4        4         7       1        7
 9  2013     1     9        5         5        7         9       2        9
10  2013     1    10        3         3        3         3       2        3
# ℹ 355 more rows
  1. Write a function called summary_stats() that allows a user to input a tibble, numeric variables in that tibble, and summary statistics that they would like to see for each variable. Using across(), the function’s output should look like the example below.
summary_stats(mtcars, 
              vars = c(mpg, hp, wt), 
              stat_fcts = list(mean = mean, 
                               median = median, 
                               sd = sd, 
                               IQR = IQR))

#  mpg_mean mpg_median   mpg_sd mpg_IQR  hp_mean hp_median    hp_sd hp_IQR
#1 20.09062       19.2 6.026948   7.375 146.6875       123 68.56287   83.5
#  wt_mean wt_median     wt_sd  wt_IQR  n
#1 3.21725     3.325 0.9784574 1.02875 32
  1. The power of a statistical test is the probability that it rejects the null hypothesis when the null hypothesis is false. In other words, it’s the probability that a statistical test can detect when a true difference exists. The power depends on a number of factors, including:
  • sample size
  • type I error level (probability of declaring there is a statistically significant difference when there really isn’t)
  • variability in the data
  • size of the true difference

The following steps can be followed to simulate a power calculation using iteration techniques:

  1. generate simulated data where is a true difference or effect

  2. run your desired test on the simulated data and record if the null hypothesis was rejected or not (i.e. if the p-value was below .05)

  3. repeat (a)-(b) a large number of times and record the total proportion of times that the null hypothesis was rejected; that proportion is the power of your test under those conditions

Create a power curve for a two-sample t-test by filling in Step C below and then removing eval: FALSE:

# Step A

# set parameters for two-sample t-test
mean1 <- 100   # mean response in Group 1
truediff <- 5    # true mean difference between Groups 1 and 2
mean2 <- mean1 + truediff   # mean response in Group 2
sd1 <- 10   # standard deviation in Group 1
sd2 <- 10   # standard deviation in Group 2
n1 <- 20    # sample size in Group 1
n2 <- 20    # sample size in Group 2
numsims <- 1000   # number of simulations (iterations) to run

# generate sample data for Groups 1 and 2 based on normal distributions
#   with the parameters above (note that there is truly a difference in means!)
samp1 <- rnorm(n1, mean1, sd1)
samp2 <- rnorm(n2, mean2, sd2)

# organize the simulated data into a tibble
sim_data <- tibble(response = c(samp1, samp2), 
       group = c(rep("Group 1", n1), rep("Group 2", n2)))
sim_data

# Step B

# exploratory analysis of the simulated data
mosaic::favstats(response ~ group, data = sim_data)
ggplot(sim_data, aes(x = response, y = group)) +
  geom_boxplot()

# run a two-sample t-test to see if there is a significant difference
#   in means between Groups 1 and 2 (i.e. is the p-value < .05?)
p_value <- t.test(x = samp1, y = samp2)$p.value
p_value
p_value < .05   # if TRUE, then we reject the null hypothesis and conclude
                #   there is a statistically significant difference

# Step C

# find the power = proportion of time null is rejected when
#   true difference is not 0 (i.e. number of simulated data sets that
#   result in p-values below .05)