library(tidyverse)
library(rvest)
library(httr)
Strings: Extra Practice (Part 3)
You can download this .qmd file from here. Just hit the Download Raw File button.
On Your Own - Extra practice with strings and regular expressions
Describe the equivalents of ?, +, * in {m,n} form.
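If you want to sanity-check a candidate {m,n} translation, one quick verification sketch (not the exercise answer itself, and the test strings are my own) is to compare the two patterns on the same inputs:

```r
library(stringr)

# Put your {m,n} guess in `candidate` and confirm it agrees with the original quantifier.
test      <- c("ac", "abc", "abbc", "abbbc")
original  <- "^ab?c$"      # pattern using ?
candidate <- "^ab{0,1}c$"  # swap in your own {m,n} translation here
str_detect(test, original)
str_detect(test, candidate)  # should produce the identical logical vector
```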
Describe, in words, what the expression “(.)(.)\2\1” will match, and provide a word or expression as an example.
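If no example comes to mind, one optional way to let R hunt for one (a sketch, not required for the exercise) is to filter stringr::words with the pattern; remember that the backslashes must be doubled inside an R string:

```r
library(stringr)

# Words containing a pair of letters followed by the same pair reversed
str_subset(stringr::words, "(.)(.)\\2\\1")
```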
Produce an R string which the regular expression represented by “\..\..\..” matches. In other words, find a string y below that produces a TRUE in str_detect().

Solve with str_subset(), using the words from stringr::words (a starter sketch follows the list below):
- Find all words that start or end with x.
- Find all words that start with a vowel and end with a consonant.
- Find all words that start and end with the same letter.
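As a starting point, here is a sketch for the first bullet only; the other two bullets follow the same str_subset() idea with different patterns:

```r
library(stringr)

# Words that start or end with x: anchor at the start (^) or at the end ($)
str_subset(words, "^x|x$")
```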
What words in stringr::words have the highest number of vowels? What words have the highest proportion of vowels? (Hint: what is the denominator?) Figure this out using the tidyverse and piping, starting with as_tibble(words) |>.
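One possible shape for that pipeline (a sketch only; the column names are my own choices) is:

```r
library(tidyverse)

as_tibble(words) |>
  mutate(
    n_vowels    = str_count(value, "[aeiou]"),  # vowels per word
    word_length = str_length(value),            # the denominator for the proportion
    prop_vowels = n_vowels / word_length
  ) |>
  arrange(desc(n_vowels))  # or desc(prop_vowels) for the second question
```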
From the Harvard sentences data, use str_extract() to produce a tibble with 3 columns: the sentence, the first word in the sentence, and the first word ending in “ed” (NA if there isn’t one).

Find and output all contractions (words with apostrophes) in the Harvard sentences, assuming no sentence has multiple contractions.
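A rough sketch for both items (column names and regular expressions are my own and may need refining) could look like:

```r
library(tidyverse)

# stringr::sentences holds the Harvard sentences
tibble(sentence = sentences) |>
  mutate(
    first_word = str_extract(sentence, "[A-Za-z']+"),        # first word in the sentence
    first_ed   = str_extract(sentence, "\\b[A-Za-z]+ed\\b")  # first word ending in "ed" (NA if none)
  )

# For the contractions item, a similar idea:
str_extract(sentences, "[A-Za-z]+'[A-Za-z]+") |> na.omit()
```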
Carefully explain what the code below does, both line by line and in general terms.
temp <- str_replace_all(words, "^([A-Za-z])(.*)([a-z])$", "\\3\\2\\1")
as_tibble(words) |>
  semi_join(as_tibble(temp)) |>
  print(n = Inf)
Joining with `by = join_by(value)`
# A tibble: 45 × 1
value
<chr>
1 a
2 america
3 area
4 dad
5 dead
6 deal
7 dear
8 depend
9 dog
10 educate
11 else
12 encourage
13 engine
14 europe
15 evidence
16 example
17 excuse
18 exercise
19 expense
20 experience
21 eye
22 god
23 health
24 high
25 knock
26 lead
27 level
28 local
29 nation
30 no
31 non
32 on
33 rather
34 read
35 refer
36 remember
37 serious
38 stairs
39 test
40 tonight
41 transport
42 treat
43 trust
44 window
45 yesterday
Coco and Rotten Tomatoes
We will check out the Rotten Tomatoes page for the 2017 movie Coco, scrape information from that page (we’ll get into web scraping in a few weeks!), clean it up into a usable format, and answer some questions using strings and regular expressions.
# used to work
# coco <- read_html("https://www.rottentomatoes.com/m/coco_2017")
robotstxt::paths_allowed("https://www.rottentomatoes.com/m/coco_2017")
www.rottentomatoes.com
[1] TRUE
library(polite)
<- "https://www.rottentomatoes.com/m/coco_2017" |>
coco bow() |>
scrape()
<-
top_reviews "https://www.rottentomatoes.com/m/coco_2017/reviews?type=top_critics" |>
bow() |>
scrape()
<- html_nodes(top_reviews, ".review-text")
top_reviews <- html_text(top_reviews)
top_reviews
<-
user_reviews "https://www.rottentomatoes.com/m/coco_2017/reviews?type=user" |>
bow() |>
scrape()
<- html_nodes(user_reviews, ".js-review-text")
user_reviews <- html_text(user_reviews) user_reviews
top_reviews is a character vector containing the 20 most recent critic reviews (along with some other junk) for Coco, while user_reviews is a character vector with the 10 most recent user reviews.
- Explain how the code below helps clean up both user_reviews and top_reviews before we start using them.

user_reviews <- str_trim(user_reviews)
top_reviews <- str_trim(top_reviews)
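For intuition, here is a tiny illustration on a made-up string (not actual review text) of what str_trim() strips away:

```r
library(stringr)

messy <- "\n      A moving, beautifully animated film.   \n"
str_trim(messy)
#> [1] "A moving, beautifully animated film."
```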
Print out the critic reviews where the reviewer mentions “emotion” or “cry”. Think about various forms (“cried”, “emotional”, etc.) You may want to turn reviews to all lower case before searching for matches.
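One possible approach (a sketch; the exact set of word forms to search for is up to you) is to lower-case the reviews first and then match several stems at once:

```r
library(stringr)

# Match "emotion"/"emotional" and "cry"/"cried"/"crying" in one pattern
top_lower <- str_to_lower(top_reviews)
top_reviews[str_detect(top_lower, "emotion|cry|cried|crying")]
```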
In critic reviews, replace all instances where “Pixar” is used with its full name: “Pixar Animation Studios”.
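A sketch of the basic replacement follows (assigning to top_reviews_full is my own choice of name); note that any review already containing the full name would get “Animation Studios” doubled, which you may want to guard against:

```r
library(stringr)

# A lookahead such as "Pixar(?! Animation Studios)" is one way to avoid doubling
# an existing "Pixar Animation Studios".
top_reviews_full <- str_replace_all(top_reviews, "Pixar", "Pixar Animation Studios")
```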
Find out how many times each user uses “I” in their review. Remember that it could be used as upper or lower case, at the beginning, middle, or end of a sentence, etc.
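One sketch, treating “I” as a standalone word regardless of case or position in the sentence:

```r
library(stringr)

# Word boundaries (\b) avoid counting the letter i inside other words;
# ignore_case = TRUE also catches a stray lower-case "i".
str_count(user_reviews, regex("\\bi\\b", ignore_case = TRUE))
```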
Do critics or users have more complex reviews, as measured by average number of commas used? Be sure your code weeds out commas used in numbers, such as “12,345”.
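One way to count commas while skipping the ones inside numbers like “12,345” is a sketch along these lines (stringr’s ICU regex engine supports lookarounds, but check the pattern against the actual reviews):

```r
library(stringr)

# Count commas that are NOT sandwiched between digits, then compare averages
number_safe_comma <- "(?<!\\d),|,(?!\\d)"
mean(str_count(top_reviews, number_safe_comma))   # critics
mean(str_count(user_reviews, number_safe_comma))  # users
```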