Important functions for identifying strings which match
str_view() : most useful for testing
str_subset() : useful for printing matches to the console
str_detect() : useful when working within a tibble
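As a quick illustration of how the last two functions differ, here is a minimal sketch. The `subgenre` vector is a hypothetical stand-in for `spot_smaller$subgenre`, with values copied from the output shown in this section.

```r
library(stringr)

# Hypothetical stand-in for spot_smaller$subgenre
subgenre <- c("indie poptimism", "post-teen pop", "latin hip hop", "gangster rap")

# str_subset(): keeps only the elements that contain the pattern
matches <- str_subset(subgenre, "pop")
matches

# str_detect(): returns one TRUE/FALSE per element
has_pop <- str_detect(subgenre, "pop")
has_pop
```

Because str_detect() returns one value per input element, it slots directly into a row-wise filter on a tibble, whereas str_subset() may return a shorter vector than it was given.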
Identify the input type and output type for each of these examples:
str_view(spot_smaller$subgenre, "pop")
[1] │ indie <pop>timism
[2] │ post-teen <pop>
[3] │ hip <pop>
[4] │ hip <pop>
[5] │ latin <pop>
typeof(str_view(spot_smaller$subgenre, "pop"))
[1] "character"
class(str_view(spot_smaller$subgenre, "pop"))
[1] "stringr_view"
str_view(spot_smaller$subgenre, "pop", match = NA)
[1] │ indie <pop>timism
[2] │ post-teen <pop>
[3] │ hip <pop>
[4] │ hip <pop>
[5] │ latin <pop>
[6] │ latin hip hop
[7] │ hip hop
[8] │ southern hip hop
[9] │ latin hip hop
[10] │ gangster rap
str_view(spot_smaller$subgenre, "pop", html = TRUE)
Rows: 32833 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bigspotify
# A tibble: 32,833 × 23
track_id track_name track_artist track_popularity track_album_id
<chr> <chr> <chr> <dbl> <chr>
1 6f807x0ima9a1j3VPbc7… I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
2 0r7CVbZTWZgbTCYdfa2P… Memories … Maroon 5 67 63rPSO264uRjW…
3 1z1Hg7Vb0AhHDiEmnDE7… All the T… Zara Larsson 70 1HoSmj2eLcsrR…
4 75FpbthrwQmzHlBJLuGd… Call You … The Chainsm… 60 1nqYsOef1yKKu…
5 1e8PAfcKUYoKkxPhrHqw… Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
6 7fvUMiyapMsRRxr07cU8… Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
7 2OAylPUDDfwRGfe0lYql… Never Rea… Katy Perry 62 7INHYSeusaFly…
8 6b1RNvAcJjQH73eZO4BL… Post Malo… Sam Feldt 69 6703SRPsLkS4b…
9 7bF6tCO3gFb8INrEDcjN… Tough Lov… Avicii 68 7CvAfGvq4RlIw…
10 1IXGILkPm0tOCNeq00kC… If I Can'… Shawn Mendes 67 4QxzbfSsVryEQ…
# ℹ 32,823 more rows
# ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
# playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
# playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
# loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
# instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
# duration_ms <dbl>
Matching patterns with regular expressions
^abc : string starts with abc
abc$ : string ends with abc
. : any character
[abc] : a or b or c
[^abc] : anything EXCEPT a or b or c
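The patterns in this list can be tried on a few made-up strings. This is only a sketch: the vector `x` is hypothetical and chosen so that each pattern matches a different subset.

```r
library(stringr)

x <- c("abc", "xabc", "abcx", "b", "d")  # hypothetical test strings

starts     <- str_detect(x, "^abc")      # TRUE for "abc" and "abcx"
ends       <- str_detect(x, "abc$")      # TRUE for "abc" and "xabc"
any_one    <- str_detect(x, "^.$")       # exactly one character: "b" and "d"
in_set     <- str_detect(x, "^[abc]$")   # a single a, b, or c: only "b"
not_in_set <- str_detect(x, "^[^abc]$")  # a single character other than a, b, c: only "d"
```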
# Guess the output!
str_view(spot_smaller$artist, "^K")
Here are the regular expression special characters that require an escape character (a preceding \): ^ $ . ? * | + ( ) [ {
For any character with special properties, use \ to “escape” its special meaning … but \ is itself a special character in an R string … so we need two \\! (e.g. \\$, \\., etc.)
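The difference between an escaped and an unescaped special character can be seen in a small sketch. The `title` vector is a hypothetical stand-in, loosely based on titles that appear in this section.

```r
library(stringr)

title <- c("Runnin (with A$AP Rocky)", "A.D.H.D", "Formation")  # hypothetical titles

# Unescaped, $ means "end of string" and . means "any character",
# so both patterns match every string here
all_end <- str_detect(title, "$")
all_any <- str_detect(title, ".")

# Escaped: the R string "\\$" is the regex \$, a literal dollar sign
has_dollar <- str_detect(title, "\\$")
has_period <- str_detect(title, "\\.")

# writeLines() prints the regex that stringr actually receives
writeLines("\\$")
```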
str_view(spot_smaller$title, "$")
[1] │ Hear Me Now<>
[2] │ Run the World (Girls)<>
[3] │ Formation<>
[4] │ 7/11<>
[5] │ My Oh My (feat. DaBaby)<>
[6] │ It's Automatic<>
[7] │ Poetic Justice<>
[8] │ A.D.H.D<>
[9] │ Ya Estuvo<>
[10] │ Runnin (with A$AP Rocky, A$AP Ferg & Nicki Minaj)<>
In bigspotify, how many track_names include a $? Be sure you print the track_names you find and make sure the dollar sign is not just in a featured artist!
In bigspotify, how many track_names include a dollar amount (a $ followed by a number)?
Repetition
? : 0 or 1 times
+ : 1 or more times
* : 0 or more times
{n} : exactly n times
{n,} : n or more times
{,m} : at most m times
{n,m} : between n and m times
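A short sketch of several of these quantifiers in action, using a hypothetical vector `x` that is not part of the Spotify data:

```r
library(stringr)

x <- c("color", "colour", "colouur", "aaa", "a")  # hypothetical test strings

zero_or_one  <- str_detect(x, "^colou?r$")  # "color" and "colour"
one_or_more  <- str_detect(x, "^colou+r$")  # "colour" and "colouur"
zero_or_more <- str_detect(x, "^colou*r$")  # all three spellings

exactly_two  <- str_extract(x, "a{2}")      # "aa" from "aaa", otherwise NA
two_or_more  <- str_extract(x, "a{2,}")     # "aaa" from "aaa", otherwise NA
one_to_two   <- str_extract(x, "a{1,2}")    # "aa" from "aaa", "a" from "a"
```

Quantifiers are greedy by default, which is why `a{1,2}` takes two letters from "aaa" rather than one.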
str_view(spot_smaller$album_name, "[A-Z]{2,}")
[4] │ <BEYONC>É [Platinum Edition]
[10] │ Creed <II>: The Album
Use at least 1 repetition symbol when solving 8-10 below
Modify the first regular expression above to also pick up “A.A” (in addition to “BEYONC” and “II”). That is, pick up strings where there might be a period between capital letters.
Create some strings that satisfy these regular expressions and explain.
"^.*$"
"\\{.+\\}"
Create regular expressions to find all stringr::words that:
Start with three consonants.
Have two or more vowel-consonant pairs in a row.
Useful functions for handling patterns
str_extract() : extract a string that matches a pattern
str_count() : count how many times a pattern occurs within a string
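To see what each function returns, here is a minimal sketch on a hypothetical `album` vector (not the real spot_smaller column):

```r
library(stringr)

album <- c("Creed II: The Album", "Formation")  # hypothetical album names

# str_extract(): the first match in each string, or NA if there is none
first_caps <- str_extract(album, "[A-Z]{2,}")

# str_count(): the number of matches in each string
n_e <- str_count(album, "e")
```

Both functions return one value per input string, so either result can be stored as a new column with the same number of rows as the original data.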
In the spot_smaller dataset, how many words are in each title? (hint \b)
In the spot_smaller dataset, extract the first word from every title. Show how you would print out these words as a vector and how you would create a new column on the spot_smaller tibble. That is, produce this:
Tips with backreference:
- You must use () around the thing you want to reference.
- To backreference multiple times, use \1 again.
- The number refers to which spot you are referencing… e.g. \2 references the second set of ()
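These tips can be sketched on a few made-up words. The vector `x` is hypothetical, and the patterns are written as R strings, so each backslash is doubled.

```r
library(stringr)

x <- c("apple", "banana", "abba", "noon")  # hypothetical test words

# (.)\1 : any character followed immediately by itself
doubled <- str_detect(x, "(.)\\1")

# (.)(.)\2\1 : two characters followed by the same two in reverse order
mirrored <- str_detect(x, "(.)(.)\\2\\1")
```

Here "apple" has a doubled letter ("pp") but no mirrored group, while "abba" and "noon" satisfy both patterns.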
Describe to your groupmates what these expressions will match, and provide a word or expression as an example:
(.)\1\1
"(.)(.)(.).*\\3\\2\\1"
Which words in stringr::words match each expression?
Construct a regular expression to match words in stringr::words that contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice) but not match repeated pairs of numbers (e.g. 507-786-3861).
Reformat the album_release_date variable in spot_smaller so that it is MM-DD-YYYY instead of YYYY-MM-DD. (Hint: str_replace().)
BEFORE RUNNING IT, explain to your partner(s) what the following R chunk will do:
[1] "The canoe birch slid on the smooth planks."
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."
[4] "These a days chicken leg is a rare dish."
[5] "Rice often is served in round bowls."
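The R chunk itself does not appear in this copy of the section. The output above has the second and third words of each sentence swapped, which is consistent with a chunk along the following lines. This is an assumption based only on the printed result and on the `sentences` data that ships with stringr, and it is not necessarily the original code.

```r
library(stringr)

# Capture the first three space-separated words, then write them back
# with the second and third swapped (assumed reconstruction)
swapped <- str_replace(sentences, "([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
head(swapped, 5)
```

Because str_replace() changes only the first match in each string, the rest of every sentence is left untouched.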