library(tidyverse)
library(rvest)
library(httr)
Strings: Extra Practice (Part 3)
You can download this .qmd file from here. Just hit the Download Raw File button.
On Your Own - Extra practice with strings and regular expressions
Describe the equivalents of ?, +, * in {m,n} form.
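If you want to sanity-check a candidate {m,n} translation, one quick verification sketch (not the exercise answer itself, and the test strings are my own) is to compare the two patterns on the same inputs:

```r
library(stringr)

# Put your {m,n} guess in `candidate` and confirm it agrees with the original quantifier.
test      <- c("ac", "abc", "abbc", "abbbc")
original  <- "^ab?c$"      # pattern using ?
candidate <- "^ab{0,1}c$"  # swap in your own {m,n} translation here
str_detect(test, original)
str_detect(test, candidate)  # should produce the identical logical vector
```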
Describe, in words, what the expression “(.)(.)\2\1” will match, and provide a word or expression as an example.
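If no example comes to mind, one optional way to let R hunt for one (a sketch, not required for the exercise) is to filter stringr::words with the pattern; remember that the backslashes must be doubled inside an R string:

```r
library(stringr)

# Words containing a pair of letters followed by the same pair reversed
str_subset(stringr::words, "(.)(.)\\2\\1")
```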
Produce an R string which the regular expression represented by “\..\..\..” matches. In other words, find a string y below that produces a TRUE in str_detect().

Solve with str_subset(), using the words from stringr::words (a starter sketch follows the list below):
- Find all words that start or end with x.
- Find all words that start with a vowel and end with a consonant.
- Find all words that start and end with the same letter.
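As a starting point, here is a sketch for the first bullet only; the other two bullets follow the same str_subset() idea with different patterns:

```r
library(stringr)

# Words that start or end with x: anchor at the start (^) or at the end ($)
str_subset(words, "^x|x$")
```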
What words in stringr::words have the highest number of vowels? What words have the highest proportion of vowels? (Hint: what is the denominator?) Figure this out using the tidyverse and piping, starting with as_tibble(words) |>.
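One possible shape for that pipeline (a sketch only; the column names are my own choices) is:

```r
library(tidyverse)

as_tibble(words) |>
  mutate(
    n_vowels    = str_count(value, "[aeiou]"),  # vowels per word
    word_length = str_length(value),            # the denominator for the proportion
    prop_vowels = n_vowels / word_length
  ) |>
  arrange(desc(n_vowels))  # or desc(prop_vowels) for the second question
```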
From the Harvard sentences data, use str_extract() to produce a tibble with 3 columns: the sentence, the first word in the sentence, and the first word ending in “ed” (NA if there isn’t one).

Find and output all contractions (words with apostrophes) in the Harvard sentences, assuming no sentence has multiple contractions.
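A rough sketch for both items (column names and regular expressions are my own and may need refining) could look like:

```r
library(tidyverse)

# stringr::sentences holds the Harvard sentences
tibble(sentence = sentences) |>
  mutate(
    first_word = str_extract(sentence, "[A-Za-z']+"),        # first word in the sentence
    first_ed   = str_extract(sentence, "\\b[A-Za-z]+ed\\b")  # first word ending in "ed" (NA if none)
  )

# For the contractions item, a similar idea:
str_extract(sentences, "[A-Za-z]+'[A-Za-z]+") |> na.omit()
```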
Carefully explain what the code below does, both line by line and in general terms.
temp <- str_replace_all(words, "^([A-Za-z])(.*)([a-z])$", "\\3\\2\\1")
as_tibble(words) |>
  semi_join(as_tibble(temp)) |>
  print(n = Inf)
Joining with `by = join_by(value)`
# A tibble: 45 × 1
value
<chr>
1 a
2 america
3 area
4 dad
5 dead
6 deal
7 dear
8 depend
9 dog
10 educate
11 else
12 encourage
13 engine
14 europe
15 evidence
16 example
17 excuse
18 exercise
19 expense
20 experience
21 eye
22 god
23 health
24 high
25 knock
26 lead
27 level
28 local
29 nation
30 no
31 non
32 on
33 rather
34 read
35 refer
36 remember
37 serious
38 stairs
39 test
40 tonight
41 transport
42 treat
43 trust
44 window
45 yesterday
Coco and Rotten Tomatoes
We will check out the Rotten Tomatoes page for the 2017 movie Coco, scrape information from that page (we’ll get into web scraping in a few weeks!), clean it up into a usable format, and answer some questions using strings and regular expressions.
# used to work
# coco <- read_html("https://www.rottentomatoes.com/m/coco_2017")
robotstxt::paths_allowed("https://www.rottentomatoes.com/m/coco_2017")
www.rottentomatoes.com
[1] TRUE
library(polite)
<- "https://www.rottentomatoes.com/m/coco_2017" |>
coco bow() |>
scrape()
<-
top_reviews "https://www.rottentomatoes.com/m/coco_2017/reviews?type=top_critics" |>
bow() |>
scrape()
<- html_nodes(top_reviews, ".review-text")
top_reviews <- html_text(top_reviews)
top_reviews
<-
user_reviews "https://www.rottentomatoes.com/m/coco_2017/reviews?type=user" |>
bow() |>
scrape()
<- html_nodes(user_reviews, ".js-review-text")
user_reviews <- html_text(user_reviews) user_reviews
top_reviews is a character vector containing the 20 most recent critic reviews (along with some other junk) for Coco, while user_reviews is a character vector with the 10 most recent user reviews.
- Explain how the code below helps clean up both user_reviews and top_reviews before we start using them.

user_reviews <- str_trim(user_reviews)
top_reviews <- str_trim(top_reviews)
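For intuition, here is a tiny illustration on a made-up string (not actual review text) of what str_trim() strips away:

```r
library(stringr)

messy <- "\n      A moving, beautifully animated film.   \n"
str_trim(messy)
#> [1] "A moving, beautifully animated film."
```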
Print out the critic reviews where the reviewer mentions “emotion” or “cry”. Think about various forms (“cried”, “emotional”, etc.) You may want to turn reviews to all lower case before searching for matches.
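One possible approach (a sketch; the exact set of word forms to search for is up to you) is to lower-case the reviews first and then match several stems at once:

```r
library(stringr)

# Match "emotion"/"emotional" and "cry"/"cried"/"crying" in one pattern
top_lower <- str_to_lower(top_reviews)
top_reviews[str_detect(top_lower, "emotion|cry|cried|crying")]
```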
In critic reviews, replace all instances where “Pixar” is used with its full name: “Pixar Animation Studios”.
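A sketch of the basic replacement follows (assigning to top_reviews_full is my own choice of name); note that any review already containing the full name would get “Animation Studios” doubled, which you may want to guard against:

```r
library(stringr)

# A lookahead such as "Pixar(?! Animation Studios)" is one way to avoid doubling
# an existing "Pixar Animation Studios".
top_reviews_full <- str_replace_all(top_reviews, "Pixar", "Pixar Animation Studios")
```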
Find out how many times each user uses “I” in their review. Remember that it could be used as upper or lower case, at the beginning, middle, or end of a sentence, etc.
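One sketch, treating “I” as a standalone word regardless of case or position in the sentence:

```r
library(stringr)

# Word boundaries (\b) avoid counting the letter i inside other words;
# ignore_case = TRUE also catches a stray lower-case "i".
str_count(user_reviews, regex("\\bi\\b", ignore_case = TRUE))
```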
Do critics or users have more complex reviews, as measured by average number of commas used? Be sure your code weeds out commas used in numbers, such as “12,345”.
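One way to count commas while skipping the ones inside numbers like “12,345” is a sketch along these lines (stringr’s ICU regex engine supports lookarounds, but check the pattern against the actual reviews):

```r
library(stringr)

# Count commas that are NOT sandwiched between digits, then compare averages
number_safe_comma <- "(?<!\\d),|,(?!\\d)"
mean(str_count(top_reviews, number_safe_comma))   # critics
mean(str_count(user_reviews, number_safe_comma))  # users
```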