Strings: In-class Exercises (Part 2)

You can download this .qmd file from here. Just hit the Download Raw File button.

This uses parts of R4DS Ch 14: Strings and Ch 15: Regular Expressions (both the first and second editions).

Manipulating strings

str functions to know for manipulating strings:

str_length()
str_sub()
str_c()
str_to_lower()
str_to_upper()
str_to_title()
str_replace() not in video examples

library(tidyverse)

#spotify <- read_csv("Data/spotify.csv") 
spotify <- read_csv("https://proback.github.io/264_fall_2024/Data/spotify.csv")

spot_smaller <- spotify |>
  select(
    title, 
    artist, 
    album_release_date, 
    album_name, 
    subgenre, 
    playlist_name
  )

spot_smaller <- spot_smaller[c(5, 32, 49, 52, 83, 175, 219, 231, 246, 265), ]
spot_smaller

# A tibble: 10 × 6
   title             artist album_release_date album_name subgenre playlist_name
   <chr>             <chr>  <chr>              <chr>      <chr>    <chr>        
 1 Hear Me Now       Alok   2016-01-01         Hear Me N… indie p… "Chillout & …
 2 Run the World (G… Beyon… 2011-06-24         4          post-te… "post-teen a…
 3 Formation         Beyon… 2016-04-23         Lemonade   hip pop  "Feeling Acc…
 4 7/11              Beyon… 2014-11-24         BEYONCÉ [… hip pop  "Feeling Acc…
 5 My Oh My (feat. … Camil… 2019-12-06         Romance    latin p… "2020 Hits &…
 6 It's Automatic    Frees… 2013-11-28         It's Auto… latin h… "80's Freest…
 7 Poetic Justice    Kendr… 2012               good kid,… hip hop  "Hip Hop Con…
 8 A.D.H.D           Kendr… 2011-07-02         Section.80 souther… "Hip-Hop 'n …
 9 Ya Estuvo         Kid F… 1990-01-01         Hispanic … latin h… "HIP-HOP: La…
10 Runnin (with A$A… Mike … 2018-11-16         Creed II:… gangste… "RAP Gangsta"

Warm-up

Describe what EACH of the str_ functions below does. Then, create a new variable “month” which is the two digit month from album_release_date

spot_new <- spot_smaller |>
  select(title, album_release_date) |>
  mutate(title_length = str_length(title),
         year = str_sub(album_release_date, 1, 4),
         title_lower = str_to_lower(title),
         album_release_date2 = str_replace_all(album_release_date, "-", "/"))
spot_new

# A tibble: 10 × 6
   title   album_release_date title_length year  title_lower album_release_date2
   <chr>   <chr>                     <int> <chr> <chr>       <chr>              
 1 Hear M… 2016-01-01                   11 2016  hear me now 2016/01/01         
 2 Run th… 2011-06-24                   21 2011  run the wo… 2011/06/24         
 3 Format… 2016-04-23                    9 2016  formation   2016/04/23         
 4 7/11    2014-11-24                    4 2014  7/11        2014/11/24         
 5 My Oh … 2019-12-06                   23 2019  my oh my (… 2019/12/06         
 6 It's A… 2013-11-28                   14 2013  it's autom… 2013/11/28         
 7 Poetic… 2012                         14 2012  poetic jus… 2012               
 8 A.D.H.D 2011-07-02                    7 2011  a.d.h.d     2011/07/02         
 9 Ya Est… 1990-01-01                    9 1990  ya estuvo   1990/01/01         
10 Runnin… 2018-11-16                   49 2018  runnin (wi… 2018/11/16

max_length <- max(spot_new$title_length)

str_c("The longest title is", max_length, "characters long.", sep = " ")

[1] "The longest title is 49 characters long."

Important functions for identifying strings which match

str_view() : most useful for testing str_subset() : useful for printing matches to the console str_detect() : useful when working within a tibble

Identify the input type and output type for each of these examples:

str_view(spot_smaller$subgenre, "pop")

[1] │ indie <pop>timism
[2] │ post-teen <pop>
[3] │ hip <pop>
[4] │ hip <pop>
[5] │ latin <pop>

typeof(str_view(spot_smaller$subgenre, "pop"))

[1] "character"

class(str_view(spot_smaller$subgenre, "pop"))

[1] "stringr_view"

str_view(spot_smaller$subgenre, "pop", match = NA)

 [1] │ indie <pop>timism
 [2] │ post-teen <pop>
 [3] │ hip <pop>
 [4] │ hip <pop>
 [5] │ latin <pop>
 [6] │ latin hip hop
 [7] │ hip hop
 [8] │ southern hip hop
 [9] │ latin hip hop
[10] │ gangster rap

str_view(spot_smaller$subgenre, "pop", html = TRUE)

str_subset(spot_smaller$subgenre, "pop")

[1] "indie poptimism" "post-teen pop"   "hip pop"         "hip pop"        
[5] "latin pop"

str_detect(spot_smaller$subgenre, "pop")

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Use str_detect to print the rows of the spot_smaller tibble containing songs that have “pop” in the subgenre. (i.e. make a new tibble with fewer rows)
Find the mean song title length for songs with “pop” in the subgenre and songs without “pop” in the subgenre.

Producing a table like this would be great:

A tibble: 2 × 2

sub_pop mean_title_length 1 FALSE 18.6 2 TRUE 13.6

Producing a table like this would be SUPER great (hint: ifelse()):

A tibble: 2 × 2

sub_pop mean_title_length 1 Genre with pop 13.6 2 Genre without pop 18.6

In the bigspotify dataset, find the proportion of songs which contain “love” in the title (track_name) by playlist_genre.

bigspotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Rows: 32833 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bigspotify

# A tibble: 32,833 × 23
   track_id              track_name track_artist track_popularity track_album_id
   <chr>                 <chr>      <chr>                   <dbl> <chr>         
 1 6f807x0ima9a1j3VPbc7… I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
 2 0r7CVbZTWZgbTCYdfa2P… Memories … Maroon 5                   67 63rPSO264uRjW…
 3 1z1Hg7Vb0AhHDiEmnDE7… All the T… Zara Larsson               70 1HoSmj2eLcsrR…
 4 75FpbthrwQmzHlBJLuGd… Call You … The Chainsm…               60 1nqYsOef1yKKu…
 5 1e8PAfcKUYoKkxPhrHqw… Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
 6 7fvUMiyapMsRRxr07cU8… Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
 7 2OAylPUDDfwRGfe0lYql… Never Rea… Katy Perry                 62 7INHYSeusaFly…
 8 6b1RNvAcJjQH73eZO4BL… Post Malo… Sam Feldt                  69 6703SRPsLkS4b…
 9 7bF6tCO3gFb8INrEDcjN… Tough Lov… Avicii                     68 7CvAfGvq4RlIw…
10 1IXGILkPm0tOCNeq00kC… If I Can'… Shawn Mendes               67 4QxzbfSsVryEQ…
# ℹ 32,823 more rows
# ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
#   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
#   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
#   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>

Matching patterns with regular expressions

^abc string starts with abc abc$ string ends with abc . any character [abc] a or b or c [^abc] anything EXCEPT a or b or c

# Guess the output!

str_view(spot_smaller$artist, "^K")

[7] │ <K>endrick Lamar
[8] │ <K>endrick Lamar
[9] │ <K>id Frost

str_view(spot_smaller$album_release_date, "01$")

[1] │ 2016-01-<01>
[9] │ 1990-01-<01>

str_view(spot_smaller$title, "^.. ")

[5] │ <My >Oh My (feat. DaBaby)
[9] │ <Ya >Estuvo

str_view(spot_smaller$artist, "[^A-Za-z ]")

 [2] │ Beyonc<é>
 [3] │ Beyonc<é>
 [4] │ Beyonc<é>
[10] │ Mike WiLL Made<->It

Given the corpus of common words in stringr::words, create regular expressions that find all words that:

Start with “y”.
End with “x”
Are exactly three letters long.
Have seven letters or more.
Start with a vowel.
End with ed, but not with eed.
Words where q is not followed by u. (are there any in words?)

# Try using str_view() or str_subset()

# For example, to find words with "tion" at any point, I could use:
str_view(words, "tion")

[181] │ condi<tion>
[347] │ func<tion>
[516] │ men<tion>
[536] │ mo<tion>
[543] │ na<tion>
[631] │ posi<tion>
[667] │ ques<tion>
[695] │ rela<tion>
[732] │ sec<tion>
[804] │ sta<tion>

str_subset(words, "tion")

 [1] "condition" "function"  "mention"   "motion"    "nation"    "position" 
 [7] "question"  "relation"  "section"   "station"

More useful regular expressions:

\d - any number \s - any space, tab, etc \b - any boundary: space, ., etc.

str_view(spot_smaller$album_name, "\\d")

[2] │ <4>
[8] │ Section.<8><0>

str_view(spot_smaller$album_name, "\\s")

 [1] │ Hear< >Me< >Now
 [4] │ BEYONCÉ< >[Platinum< >Edition]
 [6] │ It's< >Automatic
 [7] │ good< >kid,< >m.A.A.d< >city< >(Deluxe)
 [9] │ Hispanic< >Causing< >Panic
[10] │ Creed< >II:< >The< >Album

str_view_all(spot_smaller$album_name, "\\b")

Warning: `str_view_all()` was deprecated in stringr 1.5.0.
ℹ Please use `str_view()` instead.

 [1] │ <>Hear<> <>Me<> <>Now<>
 [2] │ <>4<>
 [3] │ <>Lemonade<>
 [4] │ <>BEYONCÉ<> [<>Platinum<> <>Edition<>]
 [5] │ <>Romance<>
 [6] │ <>It<>'<>s<> <>Automatic<>
 [7] │ <>good<> <>kid<>, <>m<>.<>A<>.<>A<>.<>d<> <>city<> (<>Deluxe<>)
 [8] │ <>Section<>.<>80<>
 [9] │ <>Hispanic<> <>Causing<> <>Panic<>
[10] │ <>Creed<> <>II<>: <>The<> <>Album<>

Here are the regular expression special characters that require an escape character (a preceding ): ^ $ . ? * | + ( ) [ {

For any characters with special properties, use to “escape” its special meaning … but is itself a special character … so we need two \! (e.g. \$, \., etc.)

str_view(spot_smaller$title, "$")

 [1] │ Hear Me Now<>
 [2] │ Run the World (Girls)<>
 [3] │ Formation<>
 [4] │ 7/11<>
 [5] │ My Oh My (feat. DaBaby)<>
 [6] │ It's Automatic<>
 [7] │ Poetic Justice<>
 [8] │ A.D.H.D<>
 [9] │ Ya Estuvo<>
[10] │ Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj)<>

str_view(spot_smaller$title, "\\$")

[10] │ Runnin (with A<$>AP Rocky, A<$>AP Ferg & Nicki Minaj)

In bigspotify, how many track_names include a $? Be sure you print the track_names you find and make sure the dollar sign is not just in a featured artist!
In bigspotify, how many track_names include a dollar amount (a $ followed by a number).

Repetition

? 0 or 1 times + 1 or more * 0 or more {n} exactly n times {n,} n or more times {,m} at most m times {n,m} between n and m times

str_view(spot_smaller$album_name, "[A-Z]{2,}")

 [4] │ <BEYONC>É [Platinum Edition]
[10] │ Creed <II>: The Album

str_view(spot_smaller$album_release_date, "\\d{4}-\\d{2}")

 [1] │ <2016-01>-01
 [2] │ <2011-06>-24
 [3] │ <2016-04>-23
 [4] │ <2014-11>-24
 [5] │ <2019-12>-06
 [6] │ <2013-11>-28
 [8] │ <2011-07>-02
 [9] │ <1990-01>-01
[10] │ <2018-11>-16

Use at least 1 repetition symbol when solving 8-10 below

Modify the first regular expression above to also pick up “A.A” (in addition to “BEYONC” and “II”). That is, pick up strings where there might be a period between capital letters.
Create some strings that satisfy these regular expressions and explain.

“^.*$”
“\{.+\}”

Create regular expressions to find all stringr::words that:

Start with three consonants.
Have two or more vowel-consonant pairs in a row.

Useful functions for handling patterns

str_extract() : extract a string that matches a pattern str_count() : count how many times a pattern occurs within a string

str_extract(spot_smaller$album_release_date, "\\d{4}-\\d{2}")

 [1] "2016-01" "2011-06" "2016-04" "2014-11" "2019-12" "2013-11" NA       
 [8] "2011-07" "1990-01" "2018-11"

spot_smaller |>
  select(album_release_date) |>
  mutate(year_month = str_extract(album_release_date, "\\d{4}-\\d{2}"))

# A tibble: 10 × 2
   album_release_date year_month
   <chr>              <chr>     
 1 2016-01-01         2016-01   
 2 2011-06-24         2011-06   
 3 2016-04-23         2016-04   
 4 2014-11-24         2014-11   
 5 2019-12-06         2019-12   
 6 2013-11-28         2013-11   
 7 2012               <NA>      
 8 2011-07-02         2011-07   
 9 1990-01-01         1990-01   
10 2018-11-16         2018-11

spot_smaller |>
  select(artist) |>
  mutate(n_vowels = str_count(artist, "[aeiou]"))

# A tibble: 10 × 2
   artist            n_vowels
   <chr>                <int>
 1 Alok                     1
 2 Beyoncé                  2
 3 Beyoncé                  2
 4 Beyoncé                  2
 5 Camila Cabello           6
 6 Freestyle                3
 7 Kendrick Lamar           4
 8 Kendrick Lamar           4
 9 Kid Frost                2
10 Mike WiLL Made-It        5

In the spot_smaller dataset, how many words are in each title? (hint \b)
In the spot_smaller dataset, extract the first word from every title. Show how you would print out these words as a vector and how you would create a new column on the spot_smaller tibble. That is, produce this:

# [1] "Hear"      "Run"       "Formation" "7/11"      "My"        "It's"     
# [7] "Poetic"    "A.D.H.D"   "Ya"        "Runnin"

Then this:

# A tibble: 10 × 2
#   title                                             first_word
#   <chr>                                             <chr>     
# 1 Hear Me Now                                       Hear      
# 2 Run the World (Girls)                             Run       
# 3 Formation                                         Formation 
# 4 7/11                                              7/11      
# 5 My Oh My (feat. DaBaby)                           My        
# 6 It's Automatic                                    It's      
# 7 Poetic Justice                                    Poetic    
# 8 A.D.H.D                                           A.D.H.D   
# 9 Ya Estuvo                                         Ya        
#10 Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj) Runnin

Which decades are popular for playlist_names? Using the bigspotify dataset, try doing each of these steps one at a time!

filter the bigspotify dataset to only include playlists that include something like “80’s” or “00’s” in their title.
create a new column that extracts the decade
use count to find how many playlists include each decade
what if you include both “80’s” and “80s”?
how can you count “80’s” and “80s” together in your final tibble?

Grouping and backreferences

# find all fruits with repeated pair of letters.  
fruit = stringr::fruit
fruit

 [1] "apple"             "apricot"           "avocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
[34] "grape"             "grapefruit"        "guava"            
[37] "honeydew"          "huckleberry"       "jackfruit"        
[40] "jambul"            "jujube"            "kiwi fruit"       
[43] "kumquat"           "lemon"             "lime"             
[46] "loquat"            "lychee"            "mandarine"        
[49] "mango"             "mulberry"          "nectarine"        
[52] "nut"               "olive"             "orange"           
[55] "pamelo"            "papaya"            "passionfruit"     
[58] "peach"             "pear"              "persimmon"        
[61] "physalis"          "pineapple"         "plum"             
[64] "pomegranate"       "pomelo"            "purple mangosteen"
[67] "quince"            "raisin"            "rambutan"         
[70] "raspberry"         "redcurrant"        "rock melon"       
[73] "salal berry"       "satsuma"           "star fruit"       
[76] "strawberry"        "tamarillo"         "tangerine"        
[79] "ugli fruit"        "watermelon"

str_view(fruit, "(..)\\1", match = TRUE)

 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

# why does the code below add "pepper" and even "nectarine"?
str_view(fruit, "(..)(.*)\\1", match = TRUE)

 [4] │ b<anan>a
 [5] │ bell <peppe>r
[17] │ chili <peppe>r
[20] │ <coco>nut
[22] │ <cucu>mber
[29] │ eld<erber>ry
[41] │ <juju>be
[51] │ <nectarine>
[56] │ <papa>ya
[73] │ s<alal> berry

Tips with backreference: - You must use () around the the thing you want to reference. - To backreference multiple times, use \1 again. - The number refers to which spot you are referencing… e.g. \2 references the second set of ()

x1 <- c("abxyba", "abccba", "xyaayx", "abxyab", "abcabc")
str_view(x1, "(.)(.)(..)\\2\\1")

[1] │ <abxyba>
[2] │ <abccba>
[3] │ <xyaayx>

str_view(x1, "(.)(.)(..)\\1\\2")

[4] │ <abxyab>

str_view(x1, "(.)(.)(.)\\1\\2\\3")

[5] │ <abcabc>

Describe to your groupmates what these expressions will match, and provide a word or expression as an example:

(.)\1\1
“(.)(.)(.).*\3\2\1”

Which words in stringr::words match each expression?

Construct a regular expression to match words in stringr::words that contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice) but not match repeated pairs of numbers (e.g. 507-786-3861).
Reformat the album_release_date variable in spot_smaller so that it is MM-DD-YYYY instead of YYYY-MM-DD. (Hint: str_replace().)
BEFORE RUNNING IT, explain to your partner(s) what the following R chunk will do:

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)

[1] "The canoe birch slid on the smooth planks." 
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."     
[4] "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."