Data types

You can download this .qmd file from here. Just hit the Download Raw File button.

This leans on parts of R4DS Chapter 27: A field guide to base R, in addition to parts of the first edition of R4DS.

# Initial packages required
library(tidyverse)

What is a vector?

We’ve seen them:

1:5

[1] 1 2 3 4 5

c(3, 6, 1, 7)

[1] 3 6 1 7

c("a", "b", "c")

[1] "a" "b" "c"

x <- c(0:3, NA)
is.na(x)

[1] FALSE FALSE FALSE FALSE  TRUE

sqrt(x)

[1] 0.000000 1.000000 1.414214 1.732051       NA

This doesn’t really fit the mathematical definition of a vector (direction and magnitude)… it’s really just some numbers (or letters, or TRUE’s…) strung together.

Types of vectors

Atomic vectors are homogeneous… they can contain only one “type”. Types include logical, integer, double, and character (Also complex and raw, but we will ignore those).

Lists can be heterogeneous…. they can be made up of vectors of different types, or even of other lists!

NULL denotes the absence of a vector (whereas NA denotes absence of a value in a vector).
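
To make the NULL versus NA distinction concrete, here is a minimal sketch (the names with_null and with_na are made up for illustration). NULL disappears when you combine it with other values, while NA holds a position:

```r
# NULL has length 0 and vanishes inside c(); NA occupies a slot
length(NULL)                 # 0
length(NA)                   # 1

with_null <- c(1, NULL, 3)   # hypothetical example vectors
with_na   <- c(1, NA, 3)

length(with_null)            # 2
length(with_na)              # 3
is.null(NULL)                # TRUE
is.na(with_na)               # FALSE  TRUE FALSE
```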

Let’s check out some vector types:

x <- c(0:3, NA)
typeof(x)

[1] "integer"

sqrt(x)

[1] 0.000000 1.000000 1.414214 1.732051       NA

typeof(sqrt(x))

[1] "double"

[Pause to Ponder:] State the types of the following vectors, then use typeof() to check:

is.na(x)

[1] FALSE FALSE FALSE FALSE  TRUE

x > 2

[1] FALSE FALSE FALSE  TRUE    NA

c("apple", "banana", "pear")

[1] "apple"  "banana" "pear"

A logical vector can be coerced to numeric, either explicitly with as.numeric or implicitly inside functions like sum and mean: TRUE becomes 1 and FALSE becomes 0.

x <- sample(1:20, 100, replace = TRUE)
y <- x > 10
is_logical(y)

[1] TRUE

as.numeric(y)

  [1] 1 1 0 1 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1
 [38] 1 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0
 [75] 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1

sum(y)  # how many are greater than 10?

[1] 49

mean(y) # what proportion are greater than 10?

[1] 0.49
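
Because the example above uses a random sample, your counts will differ from run to run. Here is a small fixed sketch of the same idea (the vector passed is made up for illustration), so the results are predictable:

```r
# A fixed logical vector, so the results are reproducible
passed <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(passed)   # 1 0 1 1  (explicit coercion)
sum(passed)          # 3        (implicit coercion: counts the TRUEs)
mean(passed)         # 0.75     (proportion of TRUEs)
TRUE + TRUE          # 2        (arithmetic also coerces)
```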

If you combine multiple data types in a vector, then the most complex type wins, because you cannot mix types in an atomic vector (although you can in a list).

typeof(c(TRUE, 1L))

[1] "integer"

typeof(c(1L, 1.5))

[1] "double"

typeof(c(1.5, "a"))

[1] "character"

Integers are whole numbers. “double” refers to “Double-precision” representation of fractional values… don’t worry about the details here (Google it if you care), but just recognize that computers have to round at some point. A double stores roughly 15 to 17 significant decimal digits, so many fractional values are very close approximations rather than exact values.

But weird stuff can happen:

y <- sqrt(2) ^2
y

[1] 2

y == 2

[1] FALSE

The function near (from dplyr) is better here, because it compares with a small tolerance instead of demanding exact equality:

near(y, 2)

[1] TRUE
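
If you want to see what is going on without near, here is a rough sketch using only base R. The tolerance 1e-8 is an arbitrary choice for illustration, not the value near uses internally:

```r
y <- sqrt(2)^2

y == 2                     # FALSE: the stored value is not exactly 2
y - 2                      # a tiny nonzero difference
print(y, digits = 17)      # shows the extra digits that the default printing hides

# Two base R ways to compare with a tolerance
isTRUE(all.equal(y, 2))    # TRUE
abs(y - 2) < 1e-8          # TRUE (1e-8 is an assumed tolerance)
```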

And doubles have a couple extra possible values: Inf, -Inf, and NaN, in addition to NA:

1/0

[1] Inf

-1/0

[1] -Inf

0/0

[1] NaN

Inf*0

[1] NaN

Inf/Inf

[1] NaN

Inf/NA

[1] NA

Inf*NA

[1] NA

It’s not a good idea to check for special values (NA, NaN, Inf, -Inf) with ==. Use these instead:

is.finite(Inf)

[1] FALSE

is.infinite(Inf)

[1] TRUE

is.finite(NA)

[1] FALSE

is.finite(NaN)

[1] FALSE

is.infinite(NA)

[1] FALSE

is.infinite(NaN)

[1] FALSE

is.na(NA)

[1] TRUE

is.na(NaN)

[1] TRUE

is.nan(NA)

[1] FALSE

is.nan(NaN)

[1] TRUE

is.na(Inf)

[1] FALSE

is.nan(Inf)

[1] FALSE

Why not use == ?

# Sometimes it works how you think it would:
1/0

[1] Inf

1/0 == Inf

[1] TRUE

# Sometimes it doesn't (Because NA is contagious!)
0/0

[1] NaN

0/0 == NaN

[1] NA

NA == NA

[1] NA

x <- c(0, 1, 1/0, 0/0)
# Doesn't work well
x == NA

[1] NA NA NA NA

x == Inf

[1] FALSE FALSE  TRUE    NA

# Works better
is.na(x)

[1] FALSE FALSE FALSE  TRUE

is.infinite(x)

[1] FALSE FALSE  TRUE FALSE
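
One place these helpers pay off is in cleaning a vector before summarizing it. A small sketch (the name ok is just for illustration): is.finite is FALSE for NA, NaN, Inf, and -Inf, so it drops all of them at once.

```r
x <- c(0, 1, 1/0, 0/0, NA)

ok <- is.finite(x)       # FALSE for Inf, NaN, and NA
x[ok]                    # 0 1
mean(x[ok])              # 0.5

# na.rm = TRUE removes NA and NaN but keeps Inf, so the mean is still Inf
mean(x, na.rm = TRUE)    # Inf
```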

Another note: technically, each type of vector has its own type of NA… this usually doesn’t matter, but is good to know in case one day you get very very strange errors.
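
Here is a quick look at those typed missing values. The plain NA is logical, and R quietly converts it to the right flavor when you combine it with other values:

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# The logical NA is coerced to match its neighbors
typeof(c(NA, 1L))      # "integer"
typeof(c(NA, "a"))     # "character"
```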

Augmented vectors

Vectors may carry additional metadata in the form of attributes which create augmented vectors.

Factors are built on top of integer vectors
Dates and date-times are built on top of numeric (either integer or double) vectors
Data frames and tibbles are built on top of lists
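
To peek under the hood of an augmented vector, you can compare typeof (the underlying storage) with attributes (the metadata). A minimal sketch with a made-up factor and date:

```r
# A factor is an integer vector plus "levels" and "class" attributes
f <- factor(c("low", "high", "low"))
typeof(f)        # "integer"
attributes(f)    # $levels and $class
as.integer(f)    # 2 1 2 (positions in the alphabetically sorted levels)

# A Date is a number of days since 1970-01-01 plus a "class" attribute
d <- as.Date("1970-01-02")
typeof(d)        # "double"
unclass(d)       # 1
```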

Naming items in vectors

Each element of a vector can be named, either when it is created or afterwards with set_names from package purrr.

x <- c(a = 1, b = 2, c = 3)
x

a b c 
1 2 3
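
You can also add names after the fact. This sketch sticks to base R (names<- and setNames) rather than purrr, so it runs without loading any packages:

```r
# Name an existing vector in place
x <- c(1, 2, 3)
names(x) <- c("a", "b", "c")

# Or create a named copy in one step with base R's setNames()
y <- setNames(c(1, 2, 3), c("a", "b", "c"))

identical(x, y)   # TRUE
names(x)          # "a" "b" "c"
unname(x)         # 1 2 3 (names stripped)
```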

This is more commonly used when you’re dealing with lists or tibbles (which are just a special kind of list!)

tibble(x = 1:4, y = 5:8)

# A tibble: 4 × 2
      x     y
  <int> <int>
1     1     5
2     2     6
3     3     7
4     4     8

Subsetting vectors

So many ways to do this.

I. Subset with numbers.

Use positive integers to keep elements at those positions:

x <- c("one", "two", "three", "four", "five")
x[1]

[1] "one"

x[4]

[1] "four"

x[1:2]

[1] "one" "two"

[Pause to Ponder:] How would you extract values 1 and 3?

You can also repeat values:

x[c(1, 1, 3, 3, 5, 5, 2, 2, 4, 4, 4)]

 [1] "one"   "one"   "three" "three" "five"  "five"  "two"   "two"   "four" 
[10] "four"  "four"

Use negative integers to drop elements:

x[-3]

[1] "one"  "two"  "four" "five"

[Pause to Ponder:] How would you drop values 2 and 4?

What happens if you mix positive and negative values?

x[c(1, -1)]

Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts

You can just subset with 0… this isn’t usually helpful, except perhaps for testing weird cases when you write functions:

x[0]

character(0)

II. Subset with a logical vector (“Logical subsetting”).

x == "one"

[1]  TRUE FALSE FALSE FALSE FALSE

x[x == "one"]

[1] "one"

y <- c(10, 3, NA, 5, 8, 1, NA)
is.na(y)

[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

y[!is.na(y)]

[1] 10  3  5  8  1

[Pause to Ponder:] Extract values of y that are less than or equal to 5 (what happens to NAs?). Then extract all non-missing values of y that are less than or equal to 5

III. If named, subset with a character vector.

z <- c(abc = 1, def = 2, xyz = 3)
z["abc"]

abc 
  1

# A slightly more useful example:
summary(y)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0     3.0     5.0     5.4     8.0    10.0       2

summary(y)["Min."]

Min. 
   1

[Pause to Ponder:] Extract abc and xyz from the vector z, and then extract the mean from summary(y)

Note: Using $ is just for lists (and tibbles, since tibbles are lists)! Not atomic vectors!

z$abc

Error in z$abc: $ operator is invalid for atomic vectors
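
If you want a single value from a named atomic vector, [[ works where $ does not. A small sketch for comparison (converting to a list with as.list is only there to show where $ is allowed):

```r
z <- c(abc = 1, def = 2, xyz = 3)

z["abc"]          # keeps the name: a named vector of length 1
z[["abc"]]        # drops the name: just the value 1

# $ works once the values live in a list
as.list(z)$abc    # 1
```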

IV. Blank space. (Don’t subset).

x

[1] "one"   "two"   "three" "four"  "five"

x[]

[1] "one"   "two"   "three" "four"  "five"

This seems kind of silly. But blank is useful for higher-dimensional objects… like a matrix, or data frame. But our book doesn’t use matrices, so this may be the last one you see this semester:

z <- matrix(1:8, nrow= 2)
z

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

z[1, ]

[1] 1 3 5 7

z[, 1]

[1] 1 2

z[, -3]

     [,1] [,2] [,3]
[1,]    1    3    7
[2,]    2    4    8

We could use this with tibbles too, but it is generally better to use the column names (more readable, and less likely to get the wrong columns by accident), and you should probably use select, filter, or slice:

mpg[, 1:2]

# A tibble: 234 × 2
   manufacturer model     
   <chr>        <chr>     
 1 audi         a4        
 2 audi         a4        
 3 audi         a4        
 4 audi         a4        
 5 audi         a4        
 6 audi         a4        
 7 audi         a4        
 8 audi         a4 quattro
 9 audi         a4 quattro
10 audi         a4 quattro
# ℹ 224 more rows

mpg[1:3, ]

# A tibble: 3 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…

Recycling

What does R do when you combine vectors in arithmetic?

1:5 + 1:5

[1]  2  4  6  8 10

1:5 * 1:5

[1]  1  4  9 16 25

1:5 + 2

[1] 3 4 5 6 7

1:5 * 2

[1]  2  4  6  8 10

These last two lines make sense… but R is doing something important here, called recycling. In other words, it is really doing this:

1:5 * c(2, 2, 2, 2, 2)

[1]  2  4  6  8 10

You never need to do this explicit iteration! (This is different from some other more general purpose computing languages…. R was built for analyzing data, so this type of behavior is really desirable!)
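
For comparison, here is roughly what the explicit iteration would look like, next to the vectorized version. This is only a sketch to show what you get to skip (the names doubled_loop and doubled_vec are made up):

```r
x <- 1:5

# The explicit loop you might write in another language
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# The vectorized version: R recycles the 2 for you
doubled_vec <- x * 2

identical(doubled_loop, doubled_vec)   # TRUE
```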

R can also recycle a shorter vector against a longer one, and it only warns you if the longer length is not a multiple of the shorter length:

1:10 + 1:2

 [1]  2  4  4  6  6  8  8 10 10 12

1:10 + 1:3

Warning in 1:10 + 1:3: longer object length is not a multiple of shorter object
length

 [1]  2  4  6  5  7  9  8 10 12 11
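
To see exactly what R did in that last case, you can build the recycled vector yourself with base R's rep_len, which repeats a vector until it reaches a given length. A small sketch:

```r
# Stretch 1:3 out to length 10, cutting off the last partial repeat
short <- rep_len(1:3, length.out = 10)
short                                    # 1 2 3 1 2 3 1 2 3 1

explicit <- 1:10 + short                 # no warning: lengths match
implicit <- suppressWarnings(1:10 + 1:3) # same values, warning silenced

identical(explicit, implicit)            # TRUE
```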

However, functions within the tidyverse will not allow you to recycle anything other than scalars (math word for single number… in R, a vector of length 1).

#OK:
tibble(x = 1:4, y = 1)

# A tibble: 4 × 2
      x     y
  <int> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1

#not OK:
tibble(x = 1:4, y = 1:2)

Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 4: Existing data.
• Size 2: Column `y`.
ℹ Only values of size one are recycled.

To intentionally recycle, use rep:

rep(1:3, times = 2)

[1] 1 2 3 1 2 3

rep(1:3, each = 2)

[1] 1 1 2 2 3 3
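
Putting rep together with the tibble example from above: if you spell out the recycling yourself, every column has length 4 and tibble is happy. A minimal sketch, assuming the tibble package is installed (the name recycled is made up):

```r
library(tibble)

# Each column is built to length 4 explicitly, so no implicit recycling is needed
recycled <- tibble(
  x = 1:4,
  y = rep(1:2, times = 2),   # 1 2 1 2
  z = rep(1:2, each = 2)     # 1 1 2 2
)
recycled
```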

Lists

Lists can contain a mix of objects, even other lists.

As noted previously, tibbles are augmented lists. Augmented lists have additional attributes, such as the names of the columns in a tibble.

Another list you may have encountered in a stats class is output from lm, linear regression:

mpg_model <- lm(hwy ~ cty, data = mpg)

mpg_model


Call:
lm(formula = hwy ~ cty, data = mpg)

Coefficients:
(Intercept)          cty  
      0.892        1.337

typeof(mpg_model)

[1] "list"

str(mpg_model)

List of 12
 $ coefficients : Named num [1:2] 0.892 1.337
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "cty"
 $ residuals    : Named num [1:234] 4.0338 0.0214 3.3588 1.0214 3.7087 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ effects      : Named num [1:234] -358.566 -86.887 3.121 0.787 3.458 ...
  ..- attr(*, "names")= chr [1:234] "(Intercept)" "cty" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:234] 25 29 27.6 29 22.3 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "cty"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.07 1.06
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 232
 $ xlevels      : Named list()
 $ call         : language lm(formula = hwy ~ cty, data = mpg)
 $ terms        :Classes 'terms', 'formula'  language hwy ~ cty
  .. ..- attr(*, "variables")= language list(hwy, cty)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "hwy" "cty"
  .. .. .. ..$ : chr "cty"
  .. ..- attr(*, "term.labels")= chr "cty"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(hwy, cty)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
 $ model        :'data.frame':  234 obs. of  2 variables:
  ..$ hwy: int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
  ..$ cty: int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language hwy ~ cty
  .. .. ..- attr(*, "variables")= language list(hwy, cty)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "hwy" "cty"
  .. .. .. .. ..$ : chr "cty"
  .. .. ..- attr(*, "term.labels")= chr "cty"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(hwy, cty)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
 - attr(*, "class")= chr "lm"

There are three ways to extract from a list. Check out the pepper shaker analogy in Section 27.3.3 (note: shaker = list)

[] returns a new, smaller list (fewer pepper packs in shaker)
[[]] drills down one level (individual pepper packs not in shaker)

I. [ to extract a sub-list. The result is a list.

mpg_model[1]

$coefficients
(Intercept)         cty 
  0.8920411   1.3374556

typeof(mpg_model[1])

[1] "list"

You can also do this by name, rather than number:

mpg_model["coefficients"]

$coefficients
(Intercept)         cty 
  0.8920411   1.3374556

II. [[ extracts a single component from the list… It removes a level of hierarchy

mpg_model[[1]]

(Intercept)         cty 
  0.8920411   1.3374556

typeof(mpg_model[[1]])

[1] "double"

Again, it can be done by name instead:

mpg_model[["coefficients"]]

(Intercept)         cty 
  0.8920411   1.3374556

III. $ is a shorthand way of extracting elements by name… it is similar to [[ in that it removes a level of hierarchy. You don’t need quotes. (We’ve seen this with tibbles before too!)

mpg_model$coefficients

(Intercept)         cty 
  0.8920411   1.3374556
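
Here are the three extraction methods side by side on a small made-up list (shaker and its contents are hypothetical, borrowing the pepper shaker idea), which is easier to inspect than a full lm object:

```r
shaker <- list(
  pepper = c(1.5, 2.5),
  salt   = "fine",
  nested = list(a = 1L)
)

typeof(shaker["pepper"])      # "list": still a shaker, just with one pack
typeof(shaker[["pepper"]])    # "double": the pack itself

# $ gives the same result as [[ with a name
identical(shaker$pepper, shaker[["pepper"]])   # TRUE

# Drilling down two levels, two equivalent ways
shaker$nested$a               # 1
shaker[["nested"]][["a"]]     # 1
```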

str

The str function allows us to see the structure of a list, as well as any attributes.

mpg

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

str(mpg)

tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr [1:234] "f" "f" "f" "f" ...
 $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

mpg_model


Call:
lm(formula = hwy ~ cty, data = mpg)

Coefficients:
(Intercept)          cty  
      0.892        1.337

str(mpg_model)

List of 12
 $ coefficients : Named num [1:2] 0.892 1.337
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "cty"
 $ residuals    : Named num [1:234] 4.0338 0.0214 3.3588 1.0214 3.7087 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ effects      : Named num [1:234] -358.566 -86.887 3.121 0.787 3.458 ...
  ..- attr(*, "names")= chr [1:234] "(Intercept)" "cty" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:234] 25 29 27.6 29 22.3 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "cty"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.07 1.06
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 232
 $ xlevels      : Named list()
 $ call         : language lm(formula = hwy ~ cty, data = mpg)
 $ terms        :Classes 'terms', 'formula'  language hwy ~ cty
  .. ..- attr(*, "variables")= language list(hwy, cty)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "hwy" "cty"
  .. .. .. ..$ : chr "cty"
  .. ..- attr(*, "term.labels")= chr "cty"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(hwy, cty)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
 $ model        :'data.frame':  234 obs. of  2 variables:
  ..$ hwy: int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
  ..$ cty: int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language hwy ~ cty
  .. .. ..- attr(*, "variables")= language list(hwy, cty)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "hwy" "cty"
  .. .. .. .. ..$ : chr "cty"
  .. .. ..- attr(*, "term.labels")= chr "cty"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(hwy, cty)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
 - attr(*, "class")= chr "lm"

As you can see, the mpg_model is a very complicated list with lots of attributes. The elements of the list can be all different types.

The last attribute shown is the object’s class, which str reports as lm.

class(mpg_model)

[1] "lm"
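
To see that the class really is just an attribute sitting on top of an ordinary list, here is a sketch that uses the built-in mtcars data instead of mpg (an assumption made so the code runs without loading any packages; fit and plain are made-up names):

```r
# mtcars ships with base R; the model itself is only for illustration
fit <- lm(mpg ~ wt, data = mtcars)

typeof(fit)               # "list": the underlying storage
class(fit)                # "lm": the class attribute
names(attributes(fit))    # the attributes attached to the list

# Strip the class and you are left with a plain list
plain <- unclass(fit)
class(plain)              # "list"
inherits(fit, "lm")       # TRUE
```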

Now let’s see how extracting from a list works with a tibble (since a tibble is built on top of a list).

ugly_data <- tibble(
  truefalse = c("TRUE", "FALSE", "NA"),
  numbers = c("1", "2", "3"),
  dates = c("2010-01-01", "1979-10-14", "2013-08-17"),
  more_numbers = c("1", "231", ".")
)
ugly_data

# A tibble: 3 × 4
  truefalse numbers dates      more_numbers
  <chr>     <chr>   <chr>      <chr>       
1 TRUE      1       2010-01-01 1           
2 FALSE     2       1979-10-14 231         
3 NA        3       2013-08-17 .

str(ugly_data)   # we've seen str before... stands for "structure"

tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
 $ truefalse   : chr [1:3] "TRUE" "FALSE" "NA"
 $ numbers     : chr [1:3] "1" "2" "3"
 $ dates       : chr [1:3] "2010-01-01" "1979-10-14" "2013-08-17"
 $ more_numbers: chr [1:3] "1" "231" "."

pretty_data <- ugly_data %>% 
  mutate(truefalse = parse_logical(truefalse),
         numbers = parse_number(numbers),
         dates = parse_date(dates),
         more_numbers = parse_number(more_numbers))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `more_numbers = parse_number(more_numbers)`.
Caused by warning:
! 1 parsing failure.
row col expected actual
  3  -- a number      .

pretty_data

# A tibble: 3 × 4
  truefalse numbers dates      more_numbers
  <lgl>       <dbl> <date>            <dbl>
1 TRUE            1 2010-01-01            1
2 FALSE           2 1979-10-14          231
3 NA              3 2013-08-17           NA

str(pretty_data)

tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
 $ truefalse   : logi [1:3] TRUE FALSE NA
 $ numbers     : num [1:3] 1 2 3
 $ dates       : Date[1:3], format: "2010-01-01" "1979-10-14" ...
 $ more_numbers: num [1:3] 1 231 NA
  ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
  .. ..$ row     : int 3
  .. ..$ col     : int NA
  .. ..$ expected: chr "a number"
  .. ..$ actual  : chr "."

# Get a smaller tibble
pretty_data[1]

# A tibble: 3 × 1
  truefalse
  <lgl>    
1 TRUE     
2 FALSE    
3 NA

class(pretty_data[1])

[1] "tbl_df"     "tbl"        "data.frame"

typeof(pretty_data[1])

[1] "list"

pretty_data[2:3]

# A tibble: 3 × 2
  numbers dates     
    <dbl> <date>    
1       1 2010-01-01
2       2 1979-10-14
3       3 2013-08-17

pretty_data[1, 3:4]

# A tibble: 1 × 2
  dates      more_numbers
  <date>            <dbl>
1 2010-01-01            1

pretty_data["dates"]

# A tibble: 3 × 1
  dates     
  <date>    
1 2010-01-01
2 1979-10-14
3 2013-08-17

pretty_data[c("dates", "more_numbers")]

# A tibble: 3 × 2
  dates      more_numbers
  <date>            <dbl>
1 2010-01-01            1
2 1979-10-14          231
3 2013-08-17           NA

pretty_data %>% select(dates, more_numbers)

# A tibble: 3 × 2
  dates      more_numbers
  <date>            <dbl>
1 2010-01-01            1
2 1979-10-14          231
3 2013-08-17           NA

pretty_data %>% select(dates, more_numbers) %>% slice(1:2)

# A tibble: 2 × 2
  dates      more_numbers
  <date>            <dbl>
1 2010-01-01            1
2 1979-10-14          231

# Remove a level of hierarchy - drill down one level to get a new object
pretty_data$dates

[1] "2010-01-01" "1979-10-14" "2013-08-17"

class(pretty_data$dates)

[1] "Date"

typeof(pretty_data$dates)

[1] "double"

pretty_data[[1]]

[1]  TRUE FALSE    NA

class(pretty_data[[1]])

[1] "logical"

typeof(pretty_data[[1]])

[1] "logical"

[Pause to Ponder:] Predict what these lines will produce BEFORE running them:

pretty_data[[c("dates", "more_numbers")]]

Error in `pretty_data[[c("dates", "more_numbers")]]`:
! Can't extract column with `c("dates", "more_numbers")`.
✖ Subscript `c("dates", "more_numbers")` must be size 1, not 2.

pretty_data[[2]][[3]]

[1] 3

pretty_data[[2]][3]

[1] 3

pretty_data[[2]][c(TRUE, FALSE, TRUE)]

[1] 1 3

pretty_data[[1]][c(1, 2, 1, 2)]

[1]  TRUE FALSE  TRUE FALSE

Generic functions

Another important feature of R is generic functions. Some functions, like plot and summary for example, behave very differently depending on the class of their input.

class(mpg)

[1] "tbl_df"     "tbl"        "data.frame"

summary(mpg)

 manufacturer          model               displ            year     
 Length:234         Length:234         Min.   :1.600   Min.   :1999  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
 Mode  :character   Mode  :character   Median :3.300   Median :2004  
                                       Mean   :3.472   Mean   :2004  
                                       3rd Qu.:4.600   3rd Qu.:2008  
                                       Max.   :7.000   Max.   :2008  
      cyl           trans               drv                 cty       
 Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
 1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
 Median :6.000   Mode  :character   Mode  :character   Median :17.00  
 Mean   :5.889                                         Mean   :16.86  
 3rd Qu.:8.000                                         3rd Qu.:19.00  
 Max.   :8.000                                         Max.   :35.00  
      hwy             fl               class          
 Min.   :12.00   Length:234         Length:234        
 1st Qu.:18.00   Class :character   Class :character  
 Median :24.00   Mode  :character   Mode  :character  
 Mean   :23.44                                        
 3rd Qu.:27.00                                        
 Max.   :44.00

class(mpg_model)

[1] "lm"

summary(mpg_model)


Call:
lm(formula = hwy ~ cty, data = mpg)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3408 -1.2790  0.0214  1.0338  4.0461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.89204    0.46895   1.902   0.0584 .  
cty          1.33746    0.02697  49.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.752 on 232 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.9134 
F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

As a simpler case, consider the mean function.

mean

function (x, ...) 
UseMethod("mean")
<bytecode: 0x0000020a3c7703a8>
<environment: namespace:base>

As a generic function, we can see what methods are available:

methods(mean)

[1] mean.Date        mean.default     mean.difftime    mean.POSIXct    
[5] mean.POSIXlt     mean.quosure*    mean.vctrs_vctr*
see '?methods' for accessing help and source code

mean(c(20, 21, 23))

[1] 21.33333

library(lubridate)
date_test <- ymd(c("2020-03-20", "2020-03-21", "2020-03-23"))
mean(date_test)

[1] "2020-03-21"
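
You can write your own generic the same way mean is written. This is a toy sketch: describe and its methods are invented here for illustration and are not part of any package. UseMethod looks at the class of x and calls the matching describe.<class> function, falling back to describe.default.

```r
# A hypothetical generic function
describe <- function(x, ...) UseMethod("describe")

# Methods: one per class, plus a fallback
describe.default <- function(x, ...) "no special method for this class"
describe.numeric <- function(x, ...) paste("numbers with mean", mean(x))
describe.Date    <- function(x, ...) paste("dates starting", format(min(x)))

describe(c(20, 22, 24))                            # "numbers with mean 22"
describe(as.Date(c("2020-03-20", "2020-03-21")))   # "dates starting 2020-03-20"
describe("a")                                      # "no special method for this class"
```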

What makes Tibbles special?

Tibbles are lists that:

have names attributes (column/variable names) as well as row.names attributes
have elements that are all vectors of the same length

attributes(mpg)

$class
[1] "tbl_df"     "tbl"        "data.frame"

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
[181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
[199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
[217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234

$names
 [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
 [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"
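
You can check both properties yourself. This sketch uses a base R data frame rather than a tibble (an assumption made so it runs without packages; data frames share the same list-plus-attributes structure), and the name df is made up:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df)               # TRUE: a data frame is a list of columns
lengths(df)               # every column has the same length (3)
names(attributes(df))     # includes "names", "class", and "row.names"
attr(df, "names")         # "x" "y"
```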

On Your Own

The dataset roster includes 24 names (the first 24 alphabetically on this list of names). Let’s suppose this is our class, and you want to divide students into 6 groups. Modify the code below using the rep function to create groups in two different ways.

babynames <- read_csv("https://proback.github.io/264_fall_2024/Data/babynames_2000.csv")

roster <- babynames %>%
  sample_n(size = 24) %>%
  select(name) 

roster %>%
  mutate(group_method1 = , 
         group_method2 = )

Here is a really crazy list that tells you some stuff about data science.

data_sci <- list(first = c("first it must work", "then it can be" , "pretty"),
                 DRY = c("Do not", "Repeat", "Yourself"),
                 dont_forget = c("garbage", "in", "out"),
                 our_first_tibble = mpg,
                 integers = 1:25,
                 doubles = sqrt(1:25),
                 tidyverse = c(pack1 = "ggplot2", pack2 = "dplyr", 
                               pack3 = "lubridate", etc = "and more!"),
                 opinion = list("MSCS 264 is",  "awesome!", "amazing!", "rainbows!")
                  )

Use str to learn about data_sci.

Now, figure out how to get exactly the following outputs. Bonus points if you can do it more than one way!

[1] "first it must work" "then it can be"     "pretty"

$DRY
[1] "Do not"   "Repeat"   "Yourself"

[1] 3 1 4 1 5 9 3

      pack1         etc 
  "ggplot2" "and more!"

[1] "rainbows!"

[1] "garbage" "in"      "garbage" "out"

# A tibble: 234 x 2
     hwy   cty
 1    29    18
 2    29    21
 3    31    20
 4    30    21
 5    26    16
 6    26    18
 7    27    18
 8    26    18
 9    25    16
10    28    20
# … with 224 more rows

[[1]]
[1] "MSCS 264 is"

[[2]]
[1] "amazing!"