Web Scraping in R

You can download this .qmd file from here. Just hit the Download Raw File button.

Credit to Brianna Heggeseth and Leslie Myint from Macalester College for a few of these descriptions and examples.

Using rvest for web scraping

Please see 08_table_scraping.qmd for a preview of web scraping techniques when no API exists, along with ethical considerations when scraping data. In this file, we will turn to scenarios when the webpage contains data of interest, but it is not already in table form.

Recall the four steps to scraping data with functions in the rvest library (numbered 0 through 3 to match the code comments below):

  0. robotstxt::paths_allowed(): Check if the website allows scraping, and then make sure we scrape “politely”.
  1. read_html(): Input the URL containing the data and turn the HTML code into an XML document (another markup format that’s easier to work with).
  2. html_nodes(): Extract specific nodes from the XML document by using the CSS selector that leads to the content of interest (use css = "table" for tables).
  3. html_text(): Extract the content of interest from the nodes. You might also use html_table(), etc.
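
In rvest code, these steps might look like the following sketch (the URL and CSS selector here are placeholders, not a real example):

robotstxt::paths_allowed("https://example.com/page")   # Step 0: may we scrape?
page <- read_html("https://example.com/page")          # Step 1: parse the HTML
nodes <- html_nodes(page, ".some-class")               # Step 2: select nodes via CSS
html_text(nodes)                                       # Step 3: extract the text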

More scraping ethics

robots.txt

robots.txt is a file that some websites publish to clarify what can and cannot be scraped, along with other constraints on scraping. When a website publishes this file, we need to comply with the information in it for both ethical and legal reasons.

We will look through the information in this tutorial and apply it to the NIH robots.txt file.

From our investigation of the NIH robots.txt, we learn:

  • User-agent: *: Anyone is allowed to scrape
  • Crawl-delay: 2: Need to wait 2 seconds between each page scraped
  • No Visit-time entry: no restrictions on time of day that scraping is allowed
  • No Request-rate entry: no restrictions on simultaneous requests
  • No mention of ?page=, news-events, news-releases, or https://science.education.nih.gov/ in the Disallow sections. (This is what we want to scrape today.)
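
As a point of reference, a robots.txt file is plain text built from entries like these (hypothetical values, not NIH's actual file):

User-agent: *
Crawl-delay: 2
Disallow: /private/
Disallow: /search/

This example would allow any bot to scrape everything except paths starting with /private/ or /search/, with a 2-second pause between requests.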

robotstxt package

We can also use functions from the robotstxt package, which was built to download and parse robots.txt files (more info). Specifically, the paths_allowed() function can check if a bot has permission to access certain pages.
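
For example, here is a quick sketch (paths_allowed() also accepts a full URL, as we will use below):

# does the default bot ("*") have permission for this path?
robotstxt::paths_allowed(
  paths = "/news-events/news-releases",
  domain = "www.nih.gov"
)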

A timeout to preview some technical ideas

HTML structure

HTML (HyperText Markup Language) is the formatting language used to create webpages. We can see the core parts of HTML in the rvest vignette.
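
For instance, a minimal HTML page, along the lines of the vignette's example, is built from nested elements marked by tags like <html>, <head>, <body>, <h1>, and <p>, some of which carry attributes like id:

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id="first">A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src="myimg.png" width="100" height="100">
</body>
</html>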

Finding CSS Selectors

In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the NIH News Releases page, we can see that the data is represented in a consistent pattern of image + title + abstract.

We will identify data in a web page using a pattern matching language called CSS Selectors that can refer to specific patterns in HTML, the language used to write web pages.

For example:

  • Selecting by tag:
    • "a" selects all hyperlinks in a webpage (“a” represents “anchor” links in HTML)
    • "p" selects all paragraph elements
  • Selecting by ID and class:
    • ".description" selects all elements with class equal to “description”
      • The . at the beginning is what signifies class selection.
      • This is one of the most common CSS selectors for scraping because in HTML, the class attribute is extremely commonly used to format webpage elements. (Any number of HTML elements can have the same class, which is not true for the id attribute.)
    • "#mainTitle" selects the SINGLE element with id equal to “mainTitle”
      • The # at the beginning is what signifies id selection.
For example, in the following HTML snippet, the selector ".description" would match the two description paragraphs:

<p class="title">Title of resource 1</p>
<p class="description">Description of resource 1</p>

<p class="title">Title of resource 2</p>
<p class="description">Description of resource 2</p>
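
As a quick check, here is a minimal sketch that parses the snippet above with rvest's minimal_html() helper and applies the class selector:

doc <- minimal_html('
  <p class="title">Title of resource 1</p>
  <p class="description">Description of resource 1</p>
  <p class="title">Title of resource 2</p>
  <p class="description">Description of resource 2</p>
')

doc |>
  html_nodes(".description") |>
  html_text()
[1] "Description of resource 1" "Description of resource 2"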

Warning: Websites change often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website. Otherwise, you may return after some time to find that your scraping code no longer points to the right CSS selectors.

SelectorGadget

Although you can learn how to write CSS selectors by hand, we will use a shortcut by installing the SelectorGadget tool.

  • There is a version available for Chrome–add it to Chrome via the Chrome Web Store.
    • Make sure to pin the extension to the menu bar. (Click the 3 dots > Extensions > Manage extensions. Click the “Details” button under SelectorGadget and toggle the “Pin to toolbar” option.)
  • There is also a version that can be saved as a bookmark in the browser–see here.

You might watch the SelectorGadget tutorial video.

Case Study: NIH News Releases

Our goal is to build a data frame with the article title, publication date, and abstract text for the 50 most recent NIH news releases.

Head over to the NIH News Releases page. Click the SelectorGadget extension icon or bookmark button. As you mouse over the webpage, different parts will be highlighted in orange. Click on the title (but not the live link portion!) of the first news release. You’ll notice that the SelectorGadget information in the lower right describes what you clicked on. (If SelectorGadget ever highlights too much in green, you can click on the portions you do not want to select in order to turn them red.)

Scroll through the page to verify that only the information you intend (the article titles) is selected. The selector panel shows the CSS selector (.teaser-title) and the number of matches for that CSS selector (10). (You may have to be careful with your clicking–there are two overlapping boxes, and clicking on the link of the title can lead to the CSS selector of “a”.)

[Pause to Ponder:] Repeat the process above to find the correct selectors for the following fields. Make sure that each matches 10 results:

  • The publication date

.date-display-single

  • The article abstract paragraph (which will also include the publication date)

.teaser-description

Retrieving Data Using rvest and CSS Selectors

Now that we have identified CSS selectors for the information we need, let’s fetch the data using the rvest package similarly to our approach in 08_table_scraping.qmd.
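
The code below assumes the necessary packages have been loaded, most likely in a setup chunk not shown here:

library(tidyverse)   # dplyr, stringr, purrr, lubridate, tibble, ...
library(rvest)       # read_html(), html_nodes(), html_text()
library(polite)      # bow() and scrape()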

# check that scraping is allowed (Step 0)
robotstxt::paths_allowed("https://www.nih.gov/news-events/news-releases")

 www.nih.gov                      
[1] TRUE
# Step 1: Download the HTML and turn it into an XML file with read_html()
nih <- read_html("https://www.nih.gov/news-events/news-releases")

Finding the exact node (e.g., “.teaser-title”) is the tricky part. Among all of the HTML code used to produce a webpage, where do you go to grab the content of interest? This is where SelectorGadget comes to the rescue!

# Step 2: Extract specific nodes with html_nodes()
title_temp <- html_nodes(nih, ".teaser-title")
title_temp
{xml_nodeset (10)}
 [1] <h4 class="teaser-title"><a href="/news-events/news-releases/nih-funded- ...
 [2] <h4 class="teaser-title"><a href="/news-events/news-releases/single-dose ...
 [3] <h4 class="teaser-title"><a href="/news-events/news-releases/nih-study-f ...
 [4] <h4 class="teaser-title"><a href="/news-events/news-releases/influenza-v ...
 [5] <h4 class="teaser-title"><a href="/news-events/news-releases/therapy-hel ...
 [6] <h4 class="teaser-title"><a href="/news-events/news-releases/excess-weig ...
 [7] <h4 class="teaser-title"><a href="/news-events/news-releases/nih-lead-im ...
 [8] <h4 class="teaser-title"><a href="/news-events/news-releases/contact-len ...
 [9] <h4 class="teaser-title"><a href="/news-events/news-releases/nih-funded- ...
[10] <h4 class="teaser-title"><a href="/news-events/news-releases/nih-researc ...
# Step 3: Extract content from nodes with html_text(), html_name(), 
#    html_attrs(), html_children(), html_table(), etc.
# Usually will still need to do some stringr adjustments
title_vec <- html_text(title_temp)
title_vec
 [1] "NIH-funded clinical trial will evaluate new dengue therapeutic"                              
 [2] "Single dose of broadly neutralizing antibody protects macaques from H5N1 influenza"          
 [3] "NIH study finds infection-related hospitalizations linked to increased risk of heart failure"
 [4] "Influenza A viruses adapt shape in response to environmental pressures"                      
 [5] "Therapy helps peanut-allergic kids tolerate tablespoons of peanut butter"                    
 [6] "Excess weight gain in first trimester associated with fetal fat accumulation"                
 [7] "NIH to lead implementation of National Plan to End Parkinson’s Act"                          
 [8] "Contact lenses used to slow nearsightedness in youth have a lasting effect"                  
 [9] "NIH-funded study finds cases of ME/CFS increase following SARS-CoV-2"                        
[10] "NIH researchers discover novel class of anti-malaria antibodies"                             

You can also write this all together as a single pipeline:

robotstxt::paths_allowed("https://www.nih.gov/news-events/news-releases")

 www.nih.gov                      
[1] TRUE
read_html("https://www.nih.gov/news-events/news-releases") |>
  html_nodes(".teaser-title") |>
  html_text()
 [1] "NIH-funded clinical trial will evaluate new dengue therapeutic"                              
 [2] "Single dose of broadly neutralizing antibody protects macaques from H5N1 influenza"          
 [3] "NIH study finds infection-related hospitalizations linked to increased risk of heart failure"
 [4] "Influenza A viruses adapt shape in response to environmental pressures"                      
 [5] "Therapy helps peanut-allergic kids tolerate tablespoons of peanut butter"                    
 [6] "Excess weight gain in first trimester associated with fetal fat accumulation"                
 [7] "NIH to lead implementation of National Plan to End Parkinson’s Act"                          
 [8] "Contact lenses used to slow nearsightedness in youth have a lasting effect"                  
 [9] "NIH-funded study finds cases of ME/CFS increase following SARS-CoV-2"                        
[10] "NIH researchers discover novel class of anti-malaria antibodies"                             

And finally, we wrap the four steps above into the bow() and scrape() functions from the polite package: bow() introduces us to the host and checks its robots.txt file, and scrape() then retrieves the page while respecting any crawl delay.

session <- bow("https://www.nih.gov/news-events/news-releases", force = TRUE)

nih_title <- scrape(session) |>
  html_nodes(".teaser-title") |>
  html_text()
nih_title
 [1] "NIH-funded clinical trial will evaluate new dengue therapeutic"                              
 [2] "Single dose of broadly neutralizing antibody protects macaques from H5N1 influenza"          
 [3] "NIH study finds infection-related hospitalizations linked to increased risk of heart failure"
 [4] "Influenza A viruses adapt shape in response to environmental pressures"                      
 [5] "Therapy helps peanut-allergic kids tolerate tablespoons of peanut butter"                    
 [6] "Excess weight gain in first trimester associated with fetal fat accumulation"                
 [7] "NIH to lead implementation of National Plan to End Parkinson’s Act"                          
 [8] "Contact lenses used to slow nearsightedness in youth have a lasting effect"                  
 [9] "NIH-funded study finds cases of ME/CFS increase following SARS-CoV-2"                        
[10] "NIH researchers discover novel class of anti-malaria antibodies"                             

Putting multiple columns of data together

Now repeat the process above to extract the publication date and the abstract.

nih_pubdate <- scrape(session) |>
  html_nodes(".date-display-single") |>
  html_text()
nih_pubdate
 [1] "February 11, 2025" "February 11, 2025" "February 11, 2025"
 [4] "February 10, 2025" "February 10, 2025" "January 17, 2025" 
 [7] "January 17, 2025"  "January 16, 2025"  "January 13, 2025" 
[10] "January 3, 2025"  
nih_description <- scrape(session) |>
  html_nodes(".teaser-description") |>
  html_text()
nih_description
 [1] "February 11, 2025 —     \n          Dengue virus sickens as many as 400 million people each year, primarily in tropical and subtropical parts of the world. "
 [2] "February 11, 2025 —     \n          NIH science lays groundwork for future studies in people. "                                                              
 [3] "February 11, 2025 —     \n          Findings highlight the importance of infection prevention measures and personalized heart failure care. "                
 [4] "February 10, 2025 —     \n          NIH study identifies previously unknown adaptation. "                                                                    
 [5] "February 10, 2025 —     \n          NIH trial informs potential treatment strategy for kids who already tolerate half a peanut or more. "                    
 [6] "January 17, 2025 —     \n          Findings from NIH study suggest early intervention may prevent adult obesity associated with heavier birthweight. "       
 [7] "January 17, 2025 —     \n          Open call for participants to serve on Parkinson’s Advisory Council. "                                                    
 [8] "January 16, 2025 —     \n          NIH-funded study finds progression of eye growth returns to normal in older teens, with no loss of treatment benefit. "   
 [9] "January 13, 2025 —     \n          ME/CFS is a complex, serious, and chronic condition that often occurs following an infection. "                           
[10] "January 3, 2025 —     \n          New antibodies could lead to next generation of interventions against malaria. "                                           

Combine these extracted variables into a single tibble. Make sure the variables are formatted correctly: e.g., pubdate has date type, description does not contain the pubdate, etc.

# use tibble() to put multiple columns together into a tibble
nih_top10 <- tibble(title = nih_title, 
                    pubdate = nih_pubdate, 
                    description = nih_description)
nih_top10
# A tibble: 10 × 3
   title                                                     pubdate description
   <chr>                                                     <chr>   <chr>      
 1 NIH-funded clinical trial will evaluate new dengue thera… Februa… "February …
 2 Single dose of broadly neutralizing antibody protects ma… Februa… "February …
 3 NIH study finds infection-related hospitalizations linke… Februa… "February …
 4 Influenza A viruses adapt shape in response to environme… Februa… "February …
 5 Therapy helps peanut-allergic kids tolerate tablespoons … Februa… "February …
 6 Excess weight gain in first trimester associated with fe… Januar… "January 1…
 7 NIH to lead implementation of National Plan to End Parki… Januar… "January 1…
 8 Contact lenses used to slow nearsightedness in youth hav… Januar… "January 1…
 9 NIH-funded study finds cases of ME/CFS increase followin… Januar… "January 1…
10 NIH researchers discover novel class of anti-malaria ant… Januar… "January 3…
# now clean the data: convert pubdate to a Date, and remove the embedded
#   date line from description (".*\\n" matches everything up through the
#   newline) before trimming the remaining whitespace
nih_top10 <- nih_top10 |>
  mutate(pubdate = mdy(pubdate),
         description = str_trim(str_replace(description, ".*\\n", "")))
nih_top10
# A tibble: 10 × 3
   title                                                  pubdate    description
   <chr>                                                  <date>     <chr>      
 1 NIH-funded clinical trial will evaluate new dengue th… 2025-02-11 Dengue vir…
 2 Single dose of broadly neutralizing antibody protects… 2025-02-11 NIH scienc…
 3 NIH study finds infection-related hospitalizations li… 2025-02-11 Findings h…
 4 Influenza A viruses adapt shape in response to enviro… 2025-02-10 NIH study …
 5 Therapy helps peanut-allergic kids tolerate tablespoo… 2025-02-10 NIH trial …
 6 Excess weight gain in first trimester associated with… 2025-01-17 Findings f…
 7 NIH to lead implementation of National Plan to End Pa… 2025-01-17 Open call …
 8 Contact lenses used to slow nearsightedness in youth … 2025-01-16 NIH-funded…
 9 NIH-funded study finds cases of ME/CFS increase follo… 2025-01-13 ME/CFS is …
10 NIH researchers discover novel class of anti-malaria … 2025-01-03 New antibo…
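
To see exactly what the cleaning step did, here is the same transformation applied to one of the description strings from the output above. Since . does not match a newline by default, the regular expression ".*\\n" matches everything up through the newline (including the embedded date), which str_replace() deletes before str_trim() removes the leftover whitespace:

x <- "February 11, 2025 —     \n          NIH science lays groundwork for future studies in people. "
str_trim(str_replace(x, ".*\\n", ""))
[1] "NIH science lays groundwork for future studies in people."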

NOW continue this process to build a tibble with the 50 most recent NIH news releases, which will require that you iterate over 5 webpages! You should write at least one function, and you will need iteration–use both a for loop and appropriate map_() functions from purrr. Some additional hints:

  • Mouse over the page buttons at the very bottom of the news home page to see what the URLs look like.
  • Include Sys.sleep(2) in your function to respect the Crawl-delay: 2 in the NIH robots.txt file.
  • Recall that bind_rows() from dplyr takes a list of data frames and stacks them on top of each other.

[Pause to Ponder:] Create a function to scrape a single NIH press release page by filling missing pieces labeled ???:

# Helper function to reduce html_nodes() |> html_text() code duplication
get_text_from_page <- function(page, css_selector) {
  ???
}

# Main function to scrape and tidy desired attributes
scrape_page <- function(url) {
    Sys.sleep(2)
    page <- read_html(url)
    article_titles <- get_text_from_page(???)
    article_dates <- get_text_from_page(???)
    article_dates <- mdy(article_dates)
    article_description <- get_text_from_page(???)
    article_description <- str_trim(str_replace(article_description, 
                                                ".*\\n", 
                                                "")
                                    )
    
    tibble(
      ???
    )
}

[Pause to Ponder:] Use a for loop over the first 5 pages:

pages <- vector("list", length = 5)

for (i in 0:4) {
  url <- str_c(???)
  pages[[i + 1]] <- ???
}

df_articles <- bind_rows(pages)
head(df_articles)

[Pause to Ponder:] Use map functions in the purrr package:

# Create a character vector of URLs for the first 5 pages
base_url <- "???"
urls_all_pages <- c(base_url, str_c(???))

pages2 <- purrr::map(???)
df_articles2 <- bind_rows(pages2)
head(df_articles2)

On Your Own

  1. Go to https://www.bestplaces.net and search for Minneapolis, Minnesota. This is a site some people use when comparing cities they might consider working in and/or moving to. Using SelectorGadget, extract the following pieces of information from the Minneapolis page:
  • property crime (on a scale from 0 to 100)
  • minimum income required for a single person to live comfortably
  • average monthly rent for a 2-bedroom apartment
  • the “about” paragraph (the very first paragraph above “Location Details”)

  2. Write a function called scrape_bestplaces() with arguments for state and city. When you run, for example, scrape_bestplaces("minnesota", "minneapolis"), the output should be a 1 x 6 tibble with columns for state, city, crime, min_income_single, rent_2br, and about.

  3. Create a 5 x 6 tibble by running scrape_bestplaces() 5 times with 5 cities you are interested in. You might have to combine tibbles using bind_rows(). Be sure you look at the URL at bestplaces.net for the various cities to make sure it works as you expect. For bonus points, create the same 5 x 6 tibble for the same 5 cities using purrr::map2()!