# Initial packages required
library(tidyverse)
Data types
You can download this .qmd file from here. Just hit the Download Raw File button.
This leans on parts of R4DS Chapter 27: A field guide to base R, in addition to parts of the first edition of R4DS.
What is a vector?
We’ve seen them:
1:5
[1] 1 2 3 4 5
c(3, 6, 1, 7)
[1] 3 6 1 7
c("a", "b", "c")
[1] "a" "b" "c"
x <- c(0:3, NA)
is.na(x)
[1] FALSE FALSE FALSE FALSE TRUE
sqrt(x)
[1] 0.000000 1.000000 1.414214 1.732051 NA
This doesn’t really fit the mathematical definition of a vector (direction and magnitude)… it’s really just some numbers (or letters, or TRUE’s…) strung together.
Types of vectors
Atomic vectors are homogeneous… they can contain only one “type”. Types include logical, integer, double, and character (Also complex and raw, but we will ignore those).
Lists can be heterogeneous…. they can be made up of vectors of different types, or even of other lists!
NULL denotes the absence of a vector (whereas NA denotes absence of a value in a vector).
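To make the NULL vs. NA distinction concrete, here is a small base R sketch. The vector names `with_na` and `with_null` are made up for illustration.

```r
# NA is a placeholder for a missing value: it still occupies a slot.
with_na <- c(1, NA, 3)
length(with_na)    # 3

# NULL is the absence of a vector: it has length 0 and vanishes inside c().
with_null <- c(1, NULL, 3)
length(with_null)  # 2

length(NA)         # 1
length(NULL)       # 0
is.null(NULL)      # TRUE
```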
Let’s check out some vector types:
x <- c(0:3, NA)
typeof(x)
[1] "integer"
sqrt(x)
[1] 0.000000 1.000000 1.414214 1.732051 NA
typeof(sqrt(x))
[1] "double"
[Pause to Ponder:] State the types of the following vectors, then use typeof() to check:
is.na(x)
[1] FALSE FALSE FALSE FALSE TRUE
x > 2
[1] FALSE FALSE FALSE TRUE NA
c("apple", "banana", "pear")
[1] "apple" "banana" "pear"
A logical vector can be implicitly coerced to numeric: TRUE becomes 1 and FALSE becomes 0.
x <- sample(1:20, 100, replace = TRUE)
y <- x > 10
is_logical(y)
[1] TRUE
as.numeric(y)
[1] 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 1 1 1 0
[38] 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 0
[75] 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0
sum(y) # how many are greater than 10?
[1] 50
mean(y) # what proportion are greater than 10?
[1] 0.5
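Since the sample above is random, your counts will differ from run to run. Here is a tiny deterministic sketch of the same idea (the vector `passed` is invented for illustration):

```r
# A small hand-made logical vector, so the results are reproducible.
passed <- c(TRUE, FALSE, TRUE, TRUE)

as.integer(passed)  # 1 0 1 1
sum(passed)         # 3: each TRUE counts as 1, so this counts the TRUEs
mean(passed)        # 0.75: the proportion of TRUEs
TRUE + TRUE         # 2: arithmetic triggers the same coercion
```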
If there are multiple data types in a vector, then the most complex type wins, because you cannot mix types in a vector (although you can in a list)
typeof(c(TRUE, 1L))
[1] "integer"
typeof(c(1L, 1.5))
[1] "double"
typeof(c(1.5, "a"))
[1] "character"
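Putting all four types into one call shows the whole chain at once, and contrasts it with a list, which leaves each element alone. This is just a sketch; `mixed` and `mixed_list` are illustrative names.

```r
# Everything in an atomic vector is coerced to the most complex type present.
mixed <- c(TRUE, 1L, 1.5, "a")
typeof(mixed)  # "character"
mixed          # "TRUE" "1" "1.5" "a"

# A list keeps each element's own type.
mixed_list <- list(TRUE, 1L, 1.5, "a")
sapply(mixed_list, typeof)  # "logical" "integer" "double" "character"
```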
Integers are whole numbers. “double” refers to the “double-precision” representation of fractional values… don’t worry about the details here (Google it if you care), but just recognize that computers have to round at some point. A double is stored in 64 bits, which is good for roughly 15 to 17 significant decimal digits, so some values can only be stored approximately.
But weird stuff can happen:
y <- sqrt(2) ^ 2
y
[1] 2
y == 2
[1] FALSE
The function near is better here:
near(y, 2)
[1] TRUE
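If you are curious what is going on, here is a base R sketch of the same idea: compare within a small tolerance instead of testing exact equality. The tolerance below is an assumption chosen to match what the dplyr documentation gives as the default for near, `.Machine$double.eps^0.5`.

```r
y <- sqrt(2)^2

print(y, digits = 17)             # shows the tiny rounding error R normally hides
y == 2                            # FALSE: exact comparison is too strict

# Compare within a tolerance instead.
tol <- sqrt(.Machine$double.eps)  # about 1.5e-8
abs(y - 2) < tol                  # TRUE

# Base R's own "close enough" check.
isTRUE(all.equal(y, 2))           # TRUE
```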
And doubles have a couple extra possible values: Inf, -Inf, and NaN, in addition to NA:
1/0
[1] Inf
-1/0
[1] -Inf
0/0
[1] NaN
Inf*0
[1] NaN
Inf/Inf
[1] NaN
Inf/NA
[1] NA
Inf*NA
[1] NA
It’s not a good idea to check for special values (NA, NaN, Inf, -Inf) with ==. Use these instead:
is.finite(Inf)
[1] FALSE
is.infinite(Inf)
[1] TRUE
is.finite(NA)
[1] FALSE
is.finite(NaN)
[1] FALSE
is.infinite(NA)
[1] FALSE
is.infinite(NaN)
[1] FALSE
is.na(NA)
[1] TRUE
is.na(NaN)
[1] TRUE
is.nan(NA)
[1] FALSE
is.nan(NaN)
[1] TRUE
is.na(Inf)
[1] FALSE
is.nan(Inf)
[1] FALSE
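All of those checks are easier to compare side by side. Here is a small sketch that runs the four functions over one vector of special values and collects the answers in a data frame (`vals` and `checks` are illustrative names):

```r
vals <- c(1, NA, NaN, Inf, -Inf)

# One row per value, one column per check.
checks <- data.frame(
  value       = vals,
  is.finite   = is.finite(vals),
  is.infinite = is.infinite(vals),
  is.na       = is.na(vals),
  is.nan      = is.nan(vals)
)
checks
```

Note that is.na is TRUE for both NA and NaN, while is.nan is TRUE only for NaN.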
Why not use == ?
# Sometimes it works how you think it would:
1/0
[1] Inf
1/0 == Inf
[1] TRUE
# Sometimes it doesn't (Because NA is contagious!)
0/0
[1] NaN
0/0 == NaN
[1] NA
NA == NA
[1] NA
x <- c(0, 1, 1/0, 0/0)
# Doesn't work well
x == NA
[1] NA NA NA NA
x == Inf
[1] FALSE FALSE TRUE NA
# Works better
is.na(x)
[1] FALSE FALSE FALSE TRUE
is.infinite(x)
[1] FALSE FALSE TRUE FALSE
Another note: technically, each type of vector has its own type of NA… this usually doesn’t matter, but is good to know in case one day you get very very strange errors.
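For the curious, these are the typed missing values that base R provides. A plain NA is logical, but it gets coerced to match whatever vector it lands in:

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# A plain NA quietly takes on the type of its neighbors.
typeof(c(1L, NA))      # "integer"
typeof(c("a", NA))     # "character"
```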
Augmented vectors
Vectors may carry additional metadata in the form of attributes which create augmented vectors.
Factors are built on top of integer vectors
Dates and date-times are built on top of numeric (either integer or double) vectors
Data frames and tibbles are built on top of lists
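You can peek underneath an augmented vector with typeof, attributes, and unclass. The sketch below uses base R only; the objects `f`, `d`, and `df` are made up for illustration.

```r
# A factor is an integer vector plus "levels" and "class" attributes.
f <- factor(c("low", "high", "low"))
typeof(f)      # "integer"
attributes(f)  # $levels and $class
unclass(f)     # the integer codes 2 1 2 (levels are sorted: "high", "low")

# A Date is a number of days since 1970-01-01 plus a class attribute.
d <- as.Date("1970-01-10")
typeof(d)      # "double"
unclass(d)     # 9

# A data frame is a list of columns.
df <- data.frame(x = 1:2)
typeof(df)     # "list"
```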
Naming items in vectors
Each element of a vector can be named, either when it is created or with set_names from package purrr.
x <- c(a = 1, b = 2, c = 3)
x
a b c
1 2 3
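Names can also be attached after the fact. Here is a sketch using only base R tools (names<- and setNames), which should give the same result as naming at creation; `x2` is an illustrative name.

```r
# Add names to an existing vector.
x <- c(1, 2, 3)
names(x) <- c("a", "b", "c")
x

# Or create the named vector in one step.
x2 <- setNames(c(1, 2, 3), c("a", "b", "c"))
identical(x, x2)  # TRUE

names(x)          # "a" "b" "c"
unname(x)         # strips the names again
```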
This is more commonly used when you’re dealing with lists or tibbles (which are just a special kind of list!)
tibble(x = 1:4, y = 5:8)
# A tibble: 4 × 2
x y
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
Subsetting vectors
So many ways to do this.
I. Subset with numbers.
Use positive integers to keep elements at those positions:
x <- c("one", "two", "three", "four", "five")
x[1]
[1] "one"
x[4]
[1] "four"
x[1:2]
[1] "one" "two"
[Pause to Ponder:] How would you extract values 1 and 3?
You can also repeat values:
x[c(1, 1, 3, 3, 5, 5, 2, 2, 4, 4, 4)]
[1] "one" "one" "three" "three" "five" "five" "two" "two" "four"
[10] "four" "four"
Use negative integers to drop elements:
x[-3]
[1] "one" "two" "four" "five"
[Pause to Ponder:] How would you drop values 2 and 4?
What happens if you mix positive and negative values?
x[c(1, -1)]
Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts
You can just subset with 0… this isn’t usually helpful, except perhaps for testing weird cases when you write functions:
x[0]
character(0)
II. Subset with a logical vector (“Logical subsetting”).
x == "one"
[1] TRUE FALSE FALSE FALSE FALSE
x[x == "one"]
[1] "one"
y <- c(10, 3, NA, 5, 8, 1, NA)
is.na(y)
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE
y[!is.na(y)]
[1] 10 3 5 8 1
[Pause to Ponder:] Extract values of y that are less than or equal to 5 (what happens to NAs?). Then extract all non-missing values of y that are less than or equal to 5
III. If named, subset with a character vector.
z <- c(abc = 1, def = 2, xyz = 3)
z["abc"]
abc
1
# A slightly more useful example:
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.0 3.0 5.0 5.4 8.0 10.0 2
summary(y)["Min."]
Min.
1
[Pause to Ponder:] Extract abc and xyz from the vector z, and then extract the mean from summary(y)
Note: Using $ is just for lists (and tibbles, since tibbles are lists)! Not atomic vectors!
z$abc
Error in z$abc: $ operator is invalid for atomic vectors
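If you want a single element of a named atomic vector without its name, [[ does work. And if you really want $, converting to a list first is one option. A small sketch (`z_list` is an illustrative name):

```r
z <- c(abc = 1, def = 2, xyz = 3)

z["abc"]    # keeps the name
z[["abc"]]  # 1, with the name dropped

# $ only works once the vector has been turned into a list.
z_list <- as.list(z)
z_list$abc  # 1
```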
IV. Blank space. (Don’t subset).
x
[1] "one" "two" "three" "four" "five"
x[]
[1] "one" "two" "three" "four" "five"
This seems kind of silly. But blank is useful for higher-dimensional objects… like a matrix, or data frame. But our book doesn’t use matrices, so this may be the last one you see this semester:
z <- matrix(1:8, nrow = 2)
z
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
z[1, ]
[1] 1 3 5 7
z[, 1]
[1] 1 2
z[, -3]
[,1] [,2] [,3]
[1,] 1 3 7
[2,] 2 4 8
We could use this with tibbles too, but it is generally better to use the column names (more readable, and less likely to get the wrong columns by accident), and you should probably use select, filter, or slice:
mpg[, 1:2]
# A tibble: 234 × 2
manufacturer model
<chr> <chr>
1 audi a4
2 audi a4
3 audi a4
4 audi a4
5 audi a4
6 audi a4
7 audi a4
8 audi a4 quattro
9 audi a4 quattro
10 audi a4 quattro
# ℹ 224 more rows
mpg[1:3, ]
# A tibble: 3 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
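One thing to watch for if you ever use a base data frame instead of a tibble: matrix-style subsetting of a single column drops down to a plain vector unless you ask it not to. As far as I know, tibbles never do this with [. The sketch below uses a made-up base data frame `df`.

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

class(df[, 1])                # "integer": a base data frame drops to a vector
class(df[, 1, drop = FALSE])  # "data.frame": drop = FALSE keeps the structure
class(df[1])                  # "data.frame": list-style [ never drops
class(df[["a"]])              # "integer": [[ always drills down
```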
Recycling
What does R do with vectors:
1:5 + 1:5
[1] 2 4 6 8 10
1:5 * 1:5
[1] 1 4 9 16 25
1:5 + 2
[1] 3 4 5 6 7
1:5 * 2
[1] 2 4 6 8 10
These last two lines make sense… but R is doing something important here, called recycling. In other words, it is really doing this:
1:5 * c(2, 2, 2, 2, 2)
[1] 2 4 6 8 10
You never need to do this explicit iteration! (This is different from some other more general purpose computing languages…. R was built for analyzing data, so this type of behavior is really desirable!)
R can recycle longer vectors too, and only warns you if lengths are not multiples of each other:
1:10 + 1:2
[1] 2 4 4 6 6 8 8 10 10 12
1:10 + 1:3
Warning in 1:10 + 1:3: longer object length is not a multiple of shorter object
length
[1] 2 4 6 5 7 9 8 10 12 11
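To see exactly what R did in that last example, here is a sketch that writes out the recycling by hand with rep_len and checks that it matches (the names `short`, `long`, and `recycled` are illustrative):

```r
short <- 1:3
long <- 1:10

# Stretch the short vector to the length of the long one, cutting off the extra.
recycled <- rep_len(short, length(long))
recycled                       # 1 2 3 1 2 3 1 2 3 1

explicit <- long + recycled
implicit <- suppressWarnings(long + short)
identical(explicit, implicit)  # TRUE
```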
However, functions within the tidyverse will not allow you to recycle anything other than scalars (math word for single number… in R, a vector of length 1).
#OK:
tibble(x = 1:4, y = 1)
# A tibble: 4 × 2
x y
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
#not OK:
tibble(x = 1:4, y = 1:2)
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 4: Existing data.
• Size 2: Column `y`.
ℹ Only values of size one are recycled.
To intentionally recycle, use rep:
rep(1:3, times = 2)
[1] 1 2 3 1 2 3
rep(1:3, each = 2)
[1] 1 1 2 2 3 3
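A few more variations on rep that can come in handy; this is only a sketch of some of its other arguments, with illustrative variable names.

```r
# A vector for times repeats each element a different number of times.
uneven <- rep(1:3, times = c(3, 2, 1))
uneven      # 1 1 1 2 2 3

# length.out keeps recycling until the result has the requested length.
stretched <- rep(1:3, length.out = 7)
stretched   # 1 2 3 1 2 3 1

# each and times can be combined: each is applied first.
both <- rep(c("a", "b"), each = 2, times = 2)
both        # "a" "a" "b" "b" "a" "a" "b" "b"
```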
Lists
Lists can contain a mix of objects, even other lists.
As noted previously, tibbles are an augmented list. Augmented lists have additional attributes. For example, the names of the columns in a tibble.
Another list you may have encountered in a stats class is output from lm, linear regression:
mpg_model <- lm(hwy ~ cty, data = mpg)
mpg_model
Call:
lm(formula = hwy ~ cty, data = mpg)
Coefficients:
(Intercept) cty
0.892 1.337
typeof(mpg_model)
[1] "list"
str(mpg_model)
List of 12
$ coefficients : Named num [1:2] 0.892 1.337
..- attr(*, "names")= chr [1:2] "(Intercept)" "cty"
$ residuals : Named num [1:234] 4.0338 0.0214 3.3588 1.0214 3.7087 ...
..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
$ effects : Named num [1:234] -358.566 -86.887 3.121 0.787 3.458 ...
..- attr(*, "names")= chr [1:234] "(Intercept)" "cty" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:234] 25 29 27.6 29 22.3 ...
..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "cty"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.07 1.06
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 232
$ xlevels : Named list()
$ call : language lm(formula = hwy ~ cty, data = mpg)
$ terms :Classes 'terms', 'formula' language hwy ~ cty
.. ..- attr(*, "variables")= language list(hwy, cty)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "hwy" "cty"
.. .. .. ..$ : chr "cty"
.. ..- attr(*, "term.labels")= chr "cty"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(hwy, cty)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
$ model :'data.frame': 234 obs. of 2 variables:
..$ hwy: int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
..$ cty: int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language hwy ~ cty
.. .. ..- attr(*, "variables")= language list(hwy, cty)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "hwy" "cty"
.. .. .. .. ..$ : chr "cty"
.. .. ..- attr(*, "term.labels")= chr "cty"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(hwy, cty)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
- attr(*, "class")= chr "lm"
There are three ways to extract from a list. Check out the pepper shaker analogy in Section 27.3.3 (note: shaker = list)
- [] returns new, smaller list (fewer pepper packs in shaker)
- [[]] drills down one level (individual pepper packs not in shaker)
I. [ to extract a sub-list. The result is a list.
mpg_model[1]
$coefficients
(Intercept) cty
0.8920411 1.3374556
typeof(mpg_model[1])
[1] "list"
You can also do this by name, rather than number:
mpg_model["coefficients"]
$coefficients
(Intercept) cty
0.8920411 1.3374556
II. [[ extracts a single component from the list… It removes a level of hierarchy
mpg_model[[1]]
(Intercept) cty
0.8920411 1.3374556
typeof(mpg_model[[1]])
[1] "double"
Again, it can be done by name instead:
mpg_model[["coefficients"]]
(Intercept) cty
0.8920411 1.3374556
III. $ is a shorthand way of extracting elements by name… it is similar to [[ in that it removes a level of hierarchy. You don’t need quotes. (We’ve seen this with tibbles before too!)
mpg_model$coefficients
(Intercept) cty
0.8920411 1.3374556
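The three extraction methods are easier to compare on a tiny list. Here is a sketch with a made-up list called `shaker` (in the spirit of the pepper shaker analogy):

```r
shaker <- list(pepper = c(1.5, 2.5), label = "black pepper")

# [ returns a smaller list (still a shaker).
sub_list <- shaker["pepper"]
typeof(sub_list)   # "list"

# [[ drills down one level and returns the element itself.
contents <- shaker[["pepper"]]
typeof(contents)   # "double"

# $ is shorthand for [[ with a name.
identical(shaker$pepper, shaker[["pepper"]])  # TRUE

# Only [ can pull out several elements at once.
length(shaker[c("pepper", "label")])          # 2
```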
str
The str function allows us to see the structure of a list, as well as any attributes.
mpg
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
str(mpg)
tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
$ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
$ model : chr [1:234] "a4" "a4" "a4" "a4" ...
$ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr [1:234] "f" "f" "f" "f" ...
$ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr [1:234] "p" "p" "p" "p" ...
$ class : chr [1:234] "compact" "compact" "compact" "compact" ...
mpg_model
Call:
lm(formula = hwy ~ cty, data = mpg)
Coefficients:
(Intercept) cty
0.892 1.337
str(mpg_model)
List of 12
$ coefficients : Named num [1:2] 0.892 1.337
..- attr(*, "names")= chr [1:2] "(Intercept)" "cty"
$ residuals : Named num [1:234] 4.0338 0.0214 3.3588 1.0214 3.7087 ...
..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
$ effects : Named num [1:234] -358.566 -86.887 3.121 0.787 3.458 ...
..- attr(*, "names")= chr [1:234] "(Intercept)" "cty" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:234] 25 29 27.6 29 22.3 ...
..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "cty"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.07 1.06
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 232
$ xlevels : Named list()
$ call : language lm(formula = hwy ~ cty, data = mpg)
$ terms :Classes 'terms', 'formula' language hwy ~ cty
.. ..- attr(*, "variables")= language list(hwy, cty)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "hwy" "cty"
.. .. .. ..$ : chr "cty"
.. ..- attr(*, "term.labels")= chr "cty"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(hwy, cty)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
$ model :'data.frame': 234 obs. of 2 variables:
..$ hwy: int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
..$ cty: int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language hwy ~ cty
.. .. ..- attr(*, "variables")= language list(hwy, cty)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "hwy" "cty"
.. .. .. .. ..$ : chr "cty"
.. .. ..- attr(*, "term.labels")= chr "cty"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(hwy, cty)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "hwy" "cty"
- attr(*, "class")= chr "lm"
As you can see, the mpg_model is a very complicated list with lots of attributes. The elements of the list can be all different types.
The last attribute is the object class, which it lists as lm.
class(mpg_model)
[1] "lm"
Now let’s see how extracting from a list works with a tibble (since a tibble is built on top of a list).
ugly_data <- tibble(
  truefalse = c("TRUE", "FALSE", "NA"),
  numbers = c("1", "2", "3"),
  dates = c("2010-01-01", "1979-10-14", "2013-08-17"),
  more_numbers = c("1", "231", ".")
)
ugly_data
# A tibble: 3 × 4
truefalse numbers dates more_numbers
<chr> <chr> <chr> <chr>
1 TRUE 1 2010-01-01 1
2 FALSE 2 1979-10-14 231
3 NA 3 2013-08-17 .
str(ugly_data) # we've seen str before... stands for "structure"
tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
$ truefalse : chr [1:3] "TRUE" "FALSE" "NA"
$ numbers : chr [1:3] "1" "2" "3"
$ dates : chr [1:3] "2010-01-01" "1979-10-14" "2013-08-17"
$ more_numbers: chr [1:3] "1" "231" "."
pretty_data <- ugly_data %>%
  mutate(truefalse = parse_logical(truefalse),
         numbers = parse_number(numbers),
         dates = parse_date(dates),
         more_numbers = parse_number(more_numbers))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `more_numbers = parse_number(more_numbers)`.
Caused by warning:
! 1 parsing failure.
row col expected actual
3 -- a number .
pretty_data
# A tibble: 3 × 4
truefalse numbers dates more_numbers
<lgl> <dbl> <date> <dbl>
1 TRUE 1 2010-01-01 1
2 FALSE 2 1979-10-14 231
3 NA 3 2013-08-17 NA
str(pretty_data)
tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
$ truefalse : logi [1:3] TRUE FALSE NA
$ numbers : num [1:3] 1 2 3
$ dates : Date[1:3], format: "2010-01-01" "1979-10-14" ...
$ more_numbers: num [1:3] 1 231 NA
..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
.. ..$ row : int 3
.. ..$ col : int NA
.. ..$ expected: chr "a number"
.. ..$ actual : chr "."
# Get a smaller tibble
pretty_data[1]
# A tibble: 3 × 1
truefalse
<lgl>
1 TRUE
2 FALSE
3 NA
class(pretty_data[1])
[1] "tbl_df" "tbl" "data.frame"
typeof(pretty_data[1])
[1] "list"
pretty_data[2:3]
# A tibble: 3 × 2
numbers dates
<dbl> <date>
1 1 2010-01-01
2 2 1979-10-14
3 3 2013-08-17
pretty_data[1, 3:4]
# A tibble: 1 × 2
dates more_numbers
<date> <dbl>
1 2010-01-01 1
pretty_data["dates"]
# A tibble: 3 × 1
dates
<date>
1 2010-01-01
2 1979-10-14
3 2013-08-17
pretty_data[c("dates", "more_numbers")]
# A tibble: 3 × 2
dates more_numbers
<date> <dbl>
1 2010-01-01 1
2 1979-10-14 231
3 2013-08-17 NA
pretty_data %>% select(dates, more_numbers)
# A tibble: 3 × 2
dates more_numbers
<date> <dbl>
1 2010-01-01 1
2 1979-10-14 231
3 2013-08-17 NA
pretty_data %>% select(dates, more_numbers) %>% slice(1:2)
# A tibble: 2 × 2
dates more_numbers
<date> <dbl>
1 2010-01-01 1
2 1979-10-14 231
# Remove a level of hierarchy - drill down one level to get a new object
pretty_data$dates
[1] "2010-01-01" "1979-10-14" "2013-08-17"
class(pretty_data$dates)
[1] "Date"
typeof(pretty_data$dates)
[1] "double"
pretty_data[[1]]
[1] TRUE FALSE NA
class(pretty_data[[1]])
[1] "logical"
typeof(pretty_data[[1]])
[1] "logical"
[Pause to Ponder:] Predict what these lines will produce BEFORE running them:
pretty_data[[c("dates", "more_numbers")]]
Error in `pretty_data[[c("dates", "more_numbers")]]`:
! Can't extract column with `c("dates", "more_numbers")`.
✖ Subscript `c("dates", "more_numbers")` must be size 1, not 2.
pretty_data[[2]][[3]]
[1] 3
pretty_data[[2]][3]
[1] 3
pretty_data[[2]][c(TRUE, FALSE, TRUE)]
[1] 1 3
pretty_data[[1]][c(1, 2, 1, 2)]
[1] TRUE FALSE TRUE FALSE
Generic functions
Another important feature of R is generic functions. Some functions, like plot and summary for example, behave very differently depending on the class of their input.
class(mpg)
[1] "tbl_df" "tbl" "data.frame"
summary(mpg)
manufacturer model displ year
Length:234 Length:234 Min. :1.600 Min. :1999
Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
Mode :character Mode :character Median :3.300 Median :2004
Mean :3.472 Mean :2004
3rd Qu.:4.600 3rd Qu.:2008
Max. :7.000 Max. :2008
cyl trans drv cty
Min. :4.000 Length:234 Length:234 Min. : 9.00
1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
Median :6.000 Mode :character Mode :character Median :17.00
Mean :5.889 Mean :16.86
3rd Qu.:8.000 3rd Qu.:19.00
Max. :8.000 Max. :35.00
hwy fl class
Min. :12.00 Length:234 Length:234
1st Qu.:18.00 Class :character Class :character
Median :24.00 Mode :character Mode :character
Mean :23.44
3rd Qu.:27.00
Max. :44.00
class(mpg_model)
[1] "lm"
summary(mpg_model)
Call:
lm(formula = hwy ~ cty, data = mpg)
Residuals:
Min 1Q Median 3Q Max
-5.3408 -1.2790 0.0214 1.0338 4.0461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.89204 0.46895 1.902 0.0584 .
cty 1.33746 0.02697 49.585 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.752 on 232 degrees of freedom
Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
As a simpler case, consider the mean function.
mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x000001896df06880>
<environment: namespace:base>
As a generic function, we can see what methods are available:
methods(mean)
[1] mean.Date mean.default mean.difftime mean.POSIXct
[5] mean.POSIXlt mean.quosure* mean.vctrs_vctr*
see '?methods' for accessing help and source code
mean(c(20, 21, 23))
[1] 21.33333
library(lubridate)
date_test <- ymd(c("2020-03-20", "2020-03-21", "2020-03-23"))
mean(date_test)
[1] "2020-03-21"
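To see how this kind of dispatch works, here is a minimal sketch of a home-made generic. The function `describe` and its methods are hypothetical, invented only for this illustration; the mechanism (UseMethod plus functions named generic.class) is the same one mean uses.

```r
# The generic just hands off to a method based on the class of x.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no more specific method exists.
describe.default <- function(x, ...) "an object with no special method"

# Methods are ordinary functions named generic.class.
describe.numeric <- function(x, ...) {
  paste("a numeric vector of length", length(x))
}
describe.data.frame <- function(x, ...) {
  paste("a data frame with", nrow(x), "rows and", ncol(x), "columns")
}

describe(c(20, 21, 23))                 # "a numeric vector of length 3"
describe(data.frame(a = 1:2, b = 3:4))  # "a data frame with 2 rows and 2 columns"
describe("hello")                       # "an object with no special method"
```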
What makes Tibbles special?
Tibbles are lists that:
- have names attributes (column/variable names) as well as row.names attributes.
- have elements that are all vectors of the same length
attributes(mpg)
$class
[1] "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
[181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
[199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
[217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
$names
[1] "manufacturer" "model" "displ" "year" "cyl"
[6] "trans" "drv" "cty" "hwy" "fl"
[11] "class"
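The same structure can be checked on a small example. The sketch below uses a base data frame (which tibbles are built on) with a made-up object `df`, so it runs without any packages:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df)            # TRUE: a data frame is a list underneath
names(attributes(df))  # the names, class, and row.names attributes

length(df)             # 2: one list element per column
lengths(df)            # every column has the same length (3)

# Stripping the class leaves an ordinary list of columns.
plain <- unclass(df)
class(plain)           # "list"
```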
On Your Own
- The dataset roster includes 24 names (the first 24 alphabetically on this list of names). Let’s suppose this is our class, and you want to divide students into 6 groups. Modify the code below using the rep function to create groups in two different ways.
babynames <- read_csv("https://proback.github.io/264_fall_2024/Data/babynames_2000.csv")
roster <- babynames %>%
  sample_n(size = 24) %>%
  select(name)
roster %>%
  mutate(group_method1 = ,
         group_method2 = )
- Here is a really crazy list that tells you some stuff about data science.
data_sci <- list(first = c("first it must work", "then it can be" , "pretty"),
DRY = c("Do not", "Repeat", "Yourself"),
dont_forget = c("garbage", "in", "out"),
our_first_tibble = mpg,
integers = 1:25,
doubles = sqrt(1:25),
tidyverse = c(pack1 = "ggplot2", pack2 = "dplyr",
pack3 = "lubridate", etc = "and more!"),
opinion = list("MSCS 264 is", "awesome!", "amazing!", "rainbows!")
)
Use str to learn about data_sci.
Now, figure out how to get exactly the following outputs. Bonus points if you can do it more than one way!
[1] "first it must work" "then it can be" "pretty"
$DRY
[1] "Do not" "Repeat" "Yourself"
[1] 3 1 4 1 5 9 3
pack1 etc
"ggplot2" "and more!"
[1] "rainbows!"
[1] "garbage" "in" "garbage" "out"
# A tibble: 234 x 2
hwy cty
[[1]]
[1] "MSCS 264 is"
[[2]]
[1] "amazing!"