# This is a tibble with many columns of "string variables", or "character variables"
spot_smaller
# A tibble: 10 × 6
title artist album_release_date album_name subgenre playlist_name
<chr> <chr> <chr> <chr> <chr> <chr>
1 Hear Me Now Alok 2016-01-01 Hear Me N… indie p… "Chillout & …
2 Run the World (G… Beyon… 2011-06-24 4 post-te… "post-teen a…
3 Formation Beyon… 2016-04-23 Lemonade hip pop "Feeling Acc…
4 7/11 Beyon… 2014-11-24 BEYONCÉ [… hip pop "Feeling Acc…
5 My Oh My (feat. … Camil… 2019-12-06 Romance latin p… "2020 Hits &…
6 It's Automatic Frees… 2013-11-28 It's Auto… latin h… "80's Freest…
7 Poetic Justice Kendr… 2012 good kid,… hip hop "Hip Hop Con…
8 A.D.H.D Kendr… 2011-07-02 Section.80 souther… "Hip-Hop 'n …
9 Ya Estuvo Kid F… 1990-01-01 Hispanic … latin h… "HIP-HOP: La…
10 Runnin (with A$A… Mike … 2018-11-16 Creed II:… gangste… "RAP Gangsta"
# Each column of the tibble is a vector of strings.
spot_smaller$title
[1] "Hear Me Now"
[2] "Run the World (Girls)"
[3] "Formation"
[4] "7/11"
[5] "My Oh My (feat. DaBaby)"
[6] "It's Automatic"
[7] "Poetic Justice"
[8] "A.D.H.D"
[9] "Ya Estuvo"
[10] "Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj)"
# Each item in the tibble is a string.
spot_smaller$title[1]
[1] "Hear Me Now"
Functions whose names start with str_ operate on strings!
str_length()
# when the input to str_length is a single string, the output is a single value:
str_length("hi")
[1] 2
str_length(single_string)
[1] 17
# when the input to str_length is a vector, the output is a vector:
str_length(string_vector)
[1] 4 2 1 6 10
str_length() takes a vector and returns a vector of the same length (or takes a single string and returns a single value). This makes it easy to use within mutate()!
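As a minimal sketch of that idea, the block below adds a length column with mutate(). The tibble `songs` and the column name `title_length` are hypothetical stand-ins (three titles borrowed from spot_smaller), not part of the original notes.

```r
library(dplyr)
library(stringr)

# Hypothetical three-row stand-in for spot_smaller
songs <- tibble(title = c("Formation", "7/11", "Ya Estuvo"))

# str_length() returns one number per title, so it can fill a new column
songs_len <- songs |>
  mutate(title_length = str_length(title))
songs_len
```

Because the output has exactly one value per row, no reshaping is needed before it goes into the tibble.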
# A tibble: 10 × 1
song_by
<chr>
1 Hear Me Now by Alok
2 Run the World (Girls) by Beyoncé
3 Formation by Beyoncé
4 7/11 by Beyoncé
5 My Oh My (feat. DaBaby) by Camila Cabello
6 It's Automatic by Freestyle
7 Poetic Justice by Kendrick Lamar
8 A.D.H.D by Kendrick Lamar
9 Ya Estuvo by Kid Frost
10 Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj) by Mike WiLL Made-It
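The code that produced the song_by table is not shown here. A table of that shape could be built with str_c() inside mutate(), roughly as sketched below; the two-row tibble `songs` is a hypothetical stand-in for spot_smaller.

```r
library(dplyr)
library(stringr)

# Hypothetical two-row stand-in for spot_smaller
songs <- tibble(
  title  = c("Hear Me Now", "Ya Estuvo"),
  artist = c("Alok", "Kid Frost")
)

# str_c() pastes its arguments together element by element
song_by <- songs |>
  mutate(song_by = str_c(title, " by ", artist)) |>
  select(song_by)
song_by
```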
# A tibble: 10 × 3
title title_to_lower title_to_upper
<chr> <chr> <chr>
1 Hear Me Now hear me now HEAR ME NOW
2 Run the World (Girls) run the world… RUN THE WORLD…
3 Formation formation FORMATION
4 7/11 7/11 7/11
5 My Oh My (feat. DaBaby) my oh my (fea… MY OH MY (FEA…
6 It's Automatic it's automatic IT'S AUTOMATIC
7 Poetic Justice poetic justice POETIC JUSTICE
8 A.D.H.D a.d.h.d A.D.H.D
9 Ya Estuvo ya estuvo YA ESTUVO
10 Runnin (with A$AP Rocky, A$AP Ferg & Nicki Min… runnin (with … RUNNIN (WITH …
# title is already in title case, so:
str_to_title("makes this into title case")
[1] "Makes This Into Title Case"
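The case-conversion table above could plausibly come from str_to_lower() and str_to_upper() used inside mutate(). The sketch below assumes that, using a hypothetical two-row tibble `songs` in place of spot_smaller.

```r
library(dplyr)
library(stringr)

# Hypothetical two-row stand-in for spot_smaller
songs <- tibble(title = c("Hear Me Now", "7/11"))

# Each function returns one converted string per input string;
# characters without a case (digits, "/") are left unchanged
songs_case <- songs |>
  mutate(
    title_to_lower = str_to_lower(title),
    title_to_upper = str_to_upper(title)
  )
songs_case
```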
Matching Patterns
In addition to manipulating strings, we might want to search through them to find matches. For example, can I find all the songs that start with M? The songs from 2016? The album titles that include a number?
str_view()
This function is helpful for viewing matches. It shows only the elements that contain the pattern you’re searching for, marking each match with < > symbols and a different color.
The first input is the vector, and the second input is the string/substring/pattern you are looking for.
str_view(spot_smaller$subgenre, "pop")
[1] │ indie <pop>timism
[2] │ post-teen <pop>
[3] │ hip <pop>
[4] │ hip <pop>
[5] │ latin <pop>
str_view(spot_smaller$subgenre, "hip hop")
[6] │ latin <hip hop>
[7] │ <hip hop>
[8] │ southern <hip hop>
[9] │ latin <hip hop>
str_subset()
str_subset() takes a vector input and returns a (usually shorter) vector containing only the elements that match. Compare the output from str_view() and str_subset() here. Both of these functions can be hard to work with in a tibble, because neither returns one value per row.
[1] "Hear Me Now"
[2] "My Oh My (feat. DaBaby)"
[3] "Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj)"
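The calls behind the output above are not shown. The sketch below assumes the pattern was "M" and runs both functions on a hypothetical three-element vector `titles` so the two kinds of output can be compared side by side.

```r
library(stringr)

# Hypothetical short vector of titles taken from spot_smaller
titles <- c("Hear Me Now", "Formation", "My Oh My (feat. DaBaby)")

# str_view() prints the matching elements with each match marked
str_view(titles, "M")

# str_subset() returns the matching elements as an ordinary character vector
with_m <- str_subset(titles, "M")
with_m
```

The result of str_subset() is shorter than its input, which is why it cannot be dropped straight into mutate().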
str_detect()
str_detect() takes a vector of strings (or a single string) and returns a vector of TRUE/FALSE values (or a single TRUE/FALSE). Because it returns one value per input, it is easy to work with in tibbles, using mutate() or filter().
# A tibble: 3 × 3
title album_name artist
<chr> <chr> <chr>
1 Hear Me Now Hear Me Now Alok
2 My Oh My (feat. DaBaby) Romance Camila …
3 Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj) Creed II: The Album Mike Wi…
# A tibble: 5 × 4
title album_name artist subgenre
<chr> <chr> <chr> <chr>
1 Hear Me Now Hear Me Now Alok indie popti…
2 Run the World (Girls) 4 Beyoncé post-teen p…
3 Formation Lemonade Beyoncé hip pop
4 7/11 BEYONCÉ [Platinum Edition] Beyoncé hip pop
5 My Oh My (feat. DaBaby) Romance Camila Cabello latin pop
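The two tables above look like the result of filter() combined with str_detect(), although the code itself is not shown. A minimal sketch of that combination, on a hypothetical three-row tibble `songs`:

```r
library(dplyr)
library(stringr)

# Hypothetical three-row stand-in for spot_smaller
songs <- tibble(
  title    = c("Hear Me Now", "Formation", "Poetic Justice"),
  subgenre = c("indie poptimism", "hip pop", "hip hop")
)

# One TRUE/FALSE per row
is_pop <- str_detect(songs$subgenre, "pop")
is_pop

# filter() keeps the rows where str_detect() is TRUE
pop_songs <- songs |>
  filter(str_detect(subgenre, "pop"))
pop_songs
```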
str_extract()
str_extract() takes a vector of strings (or a single string) and returns a vector of strings (or a single string): the first match in each input, or NA where there is no match.
single_string
[1] "this is a string!"
str_extract(single_string, "this")
[1] "this"
str_extract() is more interesting when we want to identify a particular pattern to extract from the string.
# A tibble: 10 × 4
title artist album_name numbers
<chr> <chr> <chr> <chr>
1 Hear Me Now Alok Hear Me N… <NA>
2 Run the World (Girls) Beyoncé 4 4
3 Formation Beyoncé Lemonade <NA>
4 7/11 Beyoncé BEYONCÉ [… <NA>
5 My Oh My (feat. DaBaby) Camila … Romance <NA>
6 It's Automatic Freesty… It's Auto… <NA>
7 Poetic Justice Kendric… good kid,… <NA>
8 A.D.H.D Kendric… Section.80 8
9 Ya Estuvo Kid Fro… Hispanic … <NA>
10 Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj) Mike Wi… Creed II:… <NA>
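The numbers column above is consistent with str_extract(album_name, "\\d") inside mutate(). The sketch below assumes that call and uses a hypothetical three-row tibble `albums`; the second column, `all_digits`, is an added illustration of how "\\d+" differs.

```r
library(dplyr)
library(stringr)

# Hypothetical three-row stand-in for the album_name column
albums <- tibble(album_name = c("4", "Lemonade", "Section.80"))

albums_num <- albums |>
  mutate(
    numbers    = str_extract(album_name, "\\d"),   # first single digit, or NA
    all_digits = str_extract(album_name, "\\d+")   # first run of digits, or NA
  )
albums_num
```

This is why "Section.80" yields 8 rather than 80 in the table: "\\d" matches exactly one digit, and str_extract() keeps only the first match.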
The patterns we show here, “\d” and “[aeiou]”, are called regular expressions.
Regular Expressions
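Regular expressions are a way to write general patterns. For instance, the pattern “\d” matches any digit; inside an R string the backslash must be doubled, so it is typed as "\\d". We can also specify whether we want the string to start or end with a certain letter.
Notice the difference between the regular expressions “M” and “^M”, and between “o” and “o$”: ^ anchors the match to the start of the string and $ anchors it to the end.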
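To make the contrast concrete, here is a small sketch on a hypothetical three-element vector `titles` (titles borrowed from spot_smaller). It uses str_detect() so each pattern gives one TRUE/FALSE per title.

```r
library(stringr)

# Hypothetical short vector of titles taken from spot_smaller
titles <- c("Hear Me Now", "My Oh My (feat. DaBaby)", "Ya Estuvo")

has_M    <- str_detect(titles, "M")    # an "M" anywhere in the title
starts_M <- str_detect(titles, "^M")   # "M" as the first character
has_o    <- str_detect(titles, "o")    # a lowercase "o" anywhere
ends_o   <- str_detect(titles, "o$")   # "o" as the last character

# Collect the four results in one table for comparison
data.frame(titles, has_M, starts_M, has_o, ends_o)
```

Matching is case sensitive, so the "O" in "Oh" does not count as a match for "o".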
step 1: use str_view() to figure out an appropriate regular expression to use for searching.
str_view(spot_smaller$album_name, "\\d")
[2] │ <4>
[8] │ Section.<8><0>
step 2: what kind of output do I want?
# A list of the album names?
str_subset(spot_smaller$album_name, "\\d")
[1] "4" "Section.80"
# A tibble?
spot_smaller |>
  filter(str_detect(album_name, "\\d"))
# A tibble: 2 × 6
title artist album_release_date album_name subgenre playlist_name
<chr> <chr> <chr> <chr> <chr> <chr>
1 Run the World (Gi… Beyon… 2011-06-24 4 post-te… post-teen al…
2 A.D.H.D Kendr… 2011-07-02 Section.80 souther… Hip-Hop 'n R…
More regular expressions
[abc] - a, b, or c
str_view(spot_smaller$subgenre, "[hp]op")
[1] │ indie <pop>timism
[2] │ post-teen <pop>
[3] │ hip <pop>
[4] │ hip <pop>
[5] │ latin <pop>
[6] │ latin hip <hop>
[7] │ hip <hop>
[8] │ southern hip <hop>
[9] │ latin hip <hop>
[2] │ <4>
[4] │ BEYONC<É> <[>Platinum Edition<]>
[6] │ It<'>s Automatic
[7] │ good kid<,> m<.>A<.>A<.>d city <(>Deluxe<)>
[8] │ Section<.><8><0>
[10] │ Creed II<:> The Album
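The pattern behind the last output above is not shown. It is consistent with a negated character class such as "[^a-zA-Z ]" (anything that is not an ASCII letter or a space), but that is an assumption. The sketch below shows both kinds of class on a hypothetical vector `albums`.

```r
library(stringr)

# Hypothetical short vector of album names taken from spot_smaller
albums <- c("4", "Lemonade", "Section.80", "Creed II: The Album")

# [hp]op matches "hop" or "pop"
hp_match <- str_detect(c("hip hop", "latin pop", "gangster rap"), "[hp]op")
hp_match

# [^a-zA-Z ] matches any single character that is NOT an ASCII letter or a space
non_letters <- str_subset(albums, "[^a-zA-Z ]")
non_letters

# str_extract_all() returns every match, not just the first
pieces <- str_extract_all("Section.80", "[^a-zA-Z ]")[[1]]
pieces
```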
Bonus content not in the pre-class video
str_glue()
This is a nice alternative to str_c(), where you only need a single set of quotes, and anything inside curly brackets {} is evaluated like it’s outside the quotes.
# Thus, this code from earlier...
spot_smaller
# A tibble: 10 × 6
title artist album_release_date album_name subgenre playlist_name
<chr> <chr> <chr> <chr> <chr> <chr>
1 Hear Me Now Alok 2016-01-01 Hear Me N… indie p… "Chillout & …
2 Run the World (G… Beyon… 2011-06-24 4 post-te… "post-teen a…
3 Formation Beyon… 2016-04-23 Lemonade hip pop "Feeling Acc…
4 7/11 Beyon… 2014-11-24 BEYONCÉ [… hip pop "Feeling Acc…
5 My Oh My (feat. … Camil… 2019-12-06 Romance latin p… "2020 Hits &…
6 It's Automatic Frees… 2013-11-28 It's Auto… latin h… "80's Freest…
7 Poetic Justice Kendr… 2012 good kid,… hip hop "Hip Hop Con…
8 A.D.H.D Kendr… 2011-07-02 Section.80 souther… "Hip-Hop 'n …
9 Ya Estuvo Kid F… 1990-01-01 Hispanic … latin h… "HIP-HOP: La…
10 Runnin (with A$A… Mike … 2018-11-16 Creed II:… gangste… "RAP Gangsta"
song_count <- spot_smaller |>
  count(artist) |>
  slice_max(n, n = 1)
song_count
# A tibble: 1 × 2
artist n
<chr> <int>
1 Beyoncé 3
str_c("The artist with the most songs in spot_smaller is", song_count$artist, "with", song_count$n, "songs.", sep = " ")
[1] "The artist with the most songs in spot_smaller is Beyoncé with 3 songs."
# ... becomes this:
song_count |>
  mutate(statement = str_glue("The artist with the most songs in spot_smaller is {artist} with {n} songs."))
# A tibble: 1 × 3
artist n statement
<chr> <int> <glue>
1 Beyoncé 3 The artist with the most songs in spot_smaller is Beyoncé with …
# or
str_glue("The artist with the most songs in spot_smaller is {song_count$artist} with {song_count$n} songs.")
The artist with the most songs in spot_smaller is Beyoncé with 3 songs.
str_glue() can also be applied to an entire column vector:
spot_smaller |>
  mutate(statement = str_glue("{artist} released {album_name} on {album_release_date}.")) |>
  select(statement)
# A tibble: 10 × 1
statement
<glue>
1 Alok released Hear Me Now on 2016-01-01.
2 Beyoncé released 4 on 2011-06-24.
3 Beyoncé released Lemonade on 2016-04-23.
4 Beyoncé released BEYONCÉ [Platinum Edition] on 2014-11-24.
5 Camila Cabello released Romance on 2019-12-06.
6 Freestyle released It's Automatic on 2013-11-28.
7 Kendrick Lamar released good kid, m.A.A.d city (Deluxe) on 2012.
8 Kendrick Lamar released Section.80 on 2011-07-02.
9 Kid Frost released Hispanic Causing Panic on 1990-01-01.
10 Mike WiLL Made-It released Creed II: The Album on 2018-11-16.
And if you want literal { } characters in your statement, double the braces ({{ and }}) to escape them:
spot_smaller |>
  mutate(statement = str_glue("{artist} released {album_name} on {album_release_date} {{according to Spotify}}.")) |>
  select(statement)
# A tibble: 10 × 1
statement
<glue>
1 Alok released Hear Me Now on 2016-01-01 {according to Spotify}.
2 Beyoncé released 4 on 2011-06-24 {according to Spotify}.
3 Beyoncé released Lemonade on 2016-04-23 {according to Spotify}.
4 Beyoncé released BEYONCÉ [Platinum Edition] on 2014-11-24 {according to Spot…
5 Camila Cabello released Romance on 2019-12-06 {according to Spotify}.
6 Freestyle released It's Automatic on 2013-11-28 {according to Spotify}.
7 Kendrick Lamar released good kid, m.A.A.d city (Deluxe) on 2012 {according t…
8 Kendrick Lamar released Section.80 on 2011-07-02 {according to Spotify}.
9 Kid Frost released Hispanic Causing Panic on 1990-01-01 {according to Spotif…
10 Mike WiLL Made-It released Creed II: The Album on 2018-11-16 {according to S…
separate_wider_delim() and its cousins
When multiple variables are crammed together into a single string, the separate_ functions can be used to extract the pieces and produce additional rows (longer) or columns (wider). We show one such example below, using the optional “too_few” setting to diagnose issues after getting a warning message the first time.
# A tibble: 10 × 8
title artist year month day album_name subgenre playlist_name
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Hear Me Now Alok 2016 01 01 Hear Me N… indie p… "Chillout & …
2 Run the World (Gi… Beyon… 2011 06 24 4 post-te… "post-teen a…
3 Formation Beyon… 2016 04 23 Lemonade hip pop "Feeling Acc…
4 7/11 Beyon… 2014 11 24 BEYONCÉ [… hip pop "Feeling Acc…
5 My Oh My (feat. D… Camil… 2019 12 06 Romance latin p… "2020 Hits &…
6 It's Automatic Frees… 2013 11 28 It's Auto… latin h… "80's Freest…
7 Poetic Justice Kendr… 2012 <NA> <NA> good kid,… hip hop "Hip Hop Con…
8 A.D.H.D Kendr… 2011 07 02 Section.80 souther… "Hip-Hop 'n …
9 Ya Estuvo Kid F… 1990 01 01 Hispanic … latin h… "HIP-HOP: La…
10 Runnin (with A$AP… Mike … 2018 11 16 Creed II:… gangste… "RAP Gangsta"
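The call that produced this table is not shown. The sketch below is one way to get the same shape with separate_wider_delim(), assuming the delimiter is "-" and that too_few = "align_start" was the setting used (it fills missing pieces on the right with NA, matching the row for 2012). The tibble `dates` is a hypothetical two-row stand-in.

```r
library(tibble)
library(tidyr)

# Hypothetical two-row stand-in: one full date, one year-only date
dates <- tibble(
  title              = c("Formation", "Poetic Justice"),
  album_release_date = c("2016-04-23", "2012")
)

# Split on "-" into three columns; short rows get NA in the trailing columns
split_dates <- dates |>
  separate_wider_delim(
    album_release_date,
    delim   = "-",
    names   = c("year", "month", "day"),
    too_few = "align_start"
  )
split_dates
```

Setting too_few = "debug" instead keeps the original column and adds diagnostic columns, which can help locate the rows that have too few pieces.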
If there is a definable pattern, but the pattern is a bit weird, we can often use separate_wider_regex() to extract the correct values and build a tidy data set: