Table Scraping in R

You can download this .qmd file from here. Just hit the Download Raw File button.

Using rvest for web scraping

If you would like to assemble data from a website with no API, you can often acquire the data using more brute-force methods commonly called web scraping. Typically, this involves finding content inside the HTML (HyperText Markup Language) code used to create webpages and web applications, along with the CSS (Cascading Style Sheets) used to control their appearance. We are used to reading data from .csv files, but most websites store their content in HTML/XML (a markup format for structured data). You can read more about XML here if you're interested: https://www.w3schools.com/xml/default.asp

XML (like HTML) has a tree-like structure, so we can identify information by the node it belongs to (html_nodes()) and then convert the content into something we can use in R (html_text() or html_table()).
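For instance, here is a minimal sketch of that node-to-content workflow using a tiny, made-up HTML snippet (rather than a real webpage):

library(rvest)

# a toy HTML document to illustrate pulling content out of the tree
toy_page <- read_html('
  <html><body>
    <h3>Gouda</h3>
    <h3>Brie</h3>
    <table>
      <tr><th>cheese</th><th>origin</th></tr>
      <tr><td>Gouda</td><td>Netherlands</td></tr>
    </table>
  </body></html>')

html_nodes(toy_page, "h3") |> html_text()      # character vector of the headings
html_nodes(toy_page, "table") |> html_table()  # list containing one tibble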

Here's one quick example of web scraping on a real site. First check out the webpage https://www.cheese.com/by_type and then select Semi-Soft. We can drill into the HTML code for this webpage to find and store specific information (like cheese names).

# packages used throughout (rvest for html_*; polite for bow/scrape; tidyverse for wrangling)
library(tidyverse)
library(rvest)
library(polite)

session <- bow("https://www.cheese.com/by_type", force = TRUE)
result <- scrape(session, query = list(t = "semi-soft", per_page = 100)) |>
  html_node("#main-body") |> 
  html_nodes("h3") |> 
  html_text()
head(result)
[1] "American Cheese"     "Mozzarella"          "Taleggio"           
[4] "Fontina Val d'Aosta" "Blue Cheese"         "Jarlsberg"          
#[1] "American Cheese"     "Mozzarella"          "Taleggio"           
#[4] "Fontina Val d'Aosta" "Blue Cheese"         "Jarlsberg"   

Four steps to scraping data (using functions from the robotstxt and rvest packages); a compact sketch follows the list:

  1. robotstxt::paths_allowed(). Check whether the website allows scraping, and then make sure we scrape "politely".
  2. read_html(). Input the URL containing the data and turn the HTML code into an XML document (another markup format that's easier to work with).
  3. html_nodes(). Extract specific nodes from the XML document by using the CSS selector that leads to the content of interest (use css = "table" for tables).
  4. html_text(). Extract the content of interest from the nodes. You might also use html_table(), etc.
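Strung together on the cheese page from earlier, the four steps look roughly like this. This is only a rough sketch without the polite package (in practice we would prefer the bow()/scrape() approach shown above), and exactly which headings come back depends on the page's layout:

# Step 1: check permission
robotstxt::paths_allowed("https://www.cheese.com/by_type")

# Steps 2-4: read the page, pick out nodes, extract their text
read_html("https://www.cheese.com/by_type") |>
  html_nodes("h3") |>
  html_text() |>
  head()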

Data scraping ethics

Before scraping, we should always check whether the website allows scraping. We should also consider whether any personal or confidential information is involved, and we should take care not to overload the server we're scraping from.

Chapter 24 in R4DS provides a nice overview of some of the important issues to consider. A couple of highlights:

  • be aware of terms of service, and, if available, the robots.txt file that some websites will publish to clarify what can and cannot be scraped and other constraints about scraping.
  • use the polite package to scrape public, non-personal, and factual data in a respectful manner
  • scrape with a good purpose and request only what you need; in particular, be extremely wary of personally identifiable information

See this article for more perspective on the ethics of data scraping.
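As a quick complement to the robots.txt point above, the robotstxt package can also show you a site's raw robots.txt file, and paths_allowed() can check specific paths. The Wikipedia domain below is the one we scrape later in this document:

library(robotstxt)

# fetch and print a site's robots.txt file
get_robotstxt("en.wikipedia.org")

# ask whether a specific path may be scraped
paths_allowed(paths = "/wiki/List_of_United_States_cities_by_population",
              domain = "en.wikipedia.org")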

When the data is already in table form:

In this example, we will scrape climate data from this website.

The website already presents the data in table form, so we use html_nodes() with css = "table", followed by html_table().

# Step 1: check that scraping is allowed
robotstxt::paths_allowed("https://www.usclimatedata.com/climate/minneapolis/minnesota/united-states/usmn0503")

 www.usclimatedata.com                      
[1] TRUE
# Step 2: read_html()
mpls <- read_html("https://www.usclimatedata.com/climate/minneapolis/minnesota/united-states/usmn0503")

# Step 3: html_nodes()
tables <- html_nodes(mpls, css = "table") 
tables  # have to guesstimate which table contains climate info
{xml_nodeset (8)}
[1] <table id="monthly_table_one" class="table table-hover tablesaw tablesaw- ...
[2] <table id="monthly_table_two" class="table table-hover tablesaw tablesaw- ...
[3] <table class="table table-hover tablesaw tablesaw-mode-swipe mt-4 daily_t ...
[4] <table class="table table-hover tablesaw tablesaw-mode-swipe mt-4 history ...
[5] <table class="table table-striped table-hover tablesaw tablesaw-mode-swip ...
[6] <table class="table table-hover tablesaw geo_table">\n<thead><tr>\n<th cl ...
[7] <table class="table table-hover tablesaw datetime_table" data-tablesaw-hi ...
[8] <table class="table table-hover tablesaw monthly_summary_table" data-tabl ...
# Step 4: html_table()
html_table(tables, header = TRUE, fill = TRUE)    # find the right table
[[1]]
# A tibble: 5 × 7
  ``                                         JanJa FebFe MarMa AprAp MayMa JunJu
  <chr>                                      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi             24    29    42    57    69    79   
2 Average low in ºF Av. low Lo                9    13    25    38    50    60   
3 Days with precipitation Days precip. DP     9     7     9    11    12    11   
4 Av. precipitation in inch Av. precip. Pre.  0.89  0.87  1.68  2.91  3.91  4.58
5 Av. snowfall in inch Snowfall Sn           11    10     8     4     0     0   

[[2]]
# A tibble: 5 × 7
  ``                                         JulJu AugAu SepSe OctOc NovNo DecDe
  <chr>                                      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi             83    81    73    58    42    29   
2 Average low in ºF Av. low Lo               65    63    54    41    28    15   
3 Days with precipitation Days precip. DP    10     9     9     9     8     9   
4 Av. precipitation in inch Av. precip. Pre.  4.06  4.34  3.02  2.58  1.61  1.17
5 Av. snowfall in inch Snowfall Sn            0     0     0     1     7    11   

[[3]]
# A tibble: 12 × 7
   Day    HighºF LowºF `Prec/moinch` `Prec/yrinch` `Snow/moinch` `Snow/yrinch`
   <chr>   <dbl> <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
 1 1 Jan    23.8   8.3          0.04          0.04          0.39           1  
 2 2 Jan    23.7   8.2          0.08          0.08          0.71           1.8
 3 3 Jan    23.6   8.1          0.12          0.12          1.1            2.8
 4 4 Jan    23.5   7.9          0.12          0.12          1.5            3.8
 5 5 Jan    23.5   7.8          0.16          0.16          1.81           4.6
 6 6 Jan    23.4   7.7          0.2           0.2           2.2            5.6
 7 7 Jan    23.4   7.6          0.24          0.24          2.6            6.6
 8 8 Jan    23.3   7.5          0.28          0.28          3.11           7.9
 9 9 Jan    23.3   7.4          0.28          0.28          3.5            8.9
10 10 Jan   23.3   7.3          0.31          0.31          3.9            9.9
11 11 Jan   23.3   7.3          0.35          0.35          4.29          10.9
12 12 Jan   23.3   7.2          0.39          0.39          4.69          11.9

[[4]]
# A tibble: 26 × 6
   Day    HighºF LowºF Precip.inch Snowinch `Snow d.inch`
   <chr>   <dbl> <dbl> <chr>       <chr>            <dbl>
 1 01 Dec   32    19   0.07        1.61                 7
 2 02 Dec   27    12   0.00        0.00                 6
 3 03 Dec   37.9  19.9 0.00        0.00                 6
 4 04 Dec   39    24.1 0.00        0.00                 6
 5 05 Dec   37    21.9 0.00        0.00                 5
 6 06 Dec   32    17.1 0.00        0.00                 5
 7 07 Dec   42.1  21.9 0.00        0.00                 5
 8 08 Dec   41    30.9 0.00        0.00                 5
 9 09 Dec   34    -0.9 0.16        2.52                 5
10 10 Dec    8.1  -4   T           T                    7
# ℹ 16 more rows

[[5]]
# A tibble: 9 × 4
  ``                                          `Dec 19`    ``    Normal     
  <chr>                                       <chr>       <lgl> <chr>      
1 "Average high temperature Av. high temp."   "29.9 ºF"   NA    "24 ºF"    
2 "Average low temperature Av. low temp."     "14.6 ºF"   NA    "9 ºF"     
3 "Total precipitation Total precip."         "0.39 inch" NA    "0.89 inch"
4 "Total snowfall Total snowfall"             "6.33 inch" NA    "11 inch"  
5 ""                                          ""          NA    ""         
6 "Highest max temperature Highest max temp." "44.1 ºF"   NA    "-"        
7 "Lowest max temperature Lowest max temp."   "8.1 ºF"    NA    "-"        
8 "Highest min temperature Highest min temp." "32.0 ºF"   NA    "-"        
9 "Lowest min temperature Lowest min temp."   "-5.1 ºF"   NA    "-"        

[[6]]
# A tibble: 10 × 3
   ``                   ``                ``   
   <chr>                <chr>             <lgl>
 1 Country              United States     NA   
 2 State                Minnesota         NA   
 3 County               Hennepin          NA   
 4 City                 Minneapolis       NA   
 5 Zip Code             55401             NA   
 6 Longitude            -93.27 dec. degr. NA   
 7 Latitude             44.98 dec. degr.  NA   
 8 Altitude - Elevation 840ft             NA   
 9 ICAO                 -                 NA   
10 IATA                 -                 NA   

[[7]]
# A tibble: 6 × 3
  ``          ``                ``   
  <chr>       <chr>             <lgl>
1 Local Time  02:55 PM          NA   
2 Sunrise     06:56 AM          NA   
3 Sunset      04:57 PM          NA   
4 Day/Night   Day               NA   
5 Timezone    Chicago -6:00     NA   
6 Timezone DB America - Chicago NA   

[[8]]
# A tibble: 5 × 2
  ``                         ``        
  <chr>                      <chr>     
1 Annual high temperature    55ºF      
2 Annual low temperature     38ºF      
3 Days per year with precip. 113 days  
4 Average annual precip.     31.62 inch
5 Av. annual snowfall        52 inch   
mpls_data1 <- html_table(tables, header = TRUE, fill = TRUE)[[1]]  
mpls_data1
# A tibble: 5 × 7
  ``                                         JanJa FebFe MarMa AprAp MayMa JunJu
  <chr>                                      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi             24    29    42    57    69    79   
2 Average low in ºF Av. low Lo                9    13    25    38    50    60   
3 Days with precipitation Days precip. DP     9     7     9    11    12    11   
4 Av. precipitation in inch Av. precip. Pre.  0.89  0.87  1.68  2.91  3.91  4.58
5 Av. snowfall in inch Snowfall Sn           11    10     8     4     0     0   
mpls_data2 <- html_table(tables, header = TRUE, fill = TRUE)[[2]]  
mpls_data2
# A tibble: 5 × 7
  ``                                         JulJu AugAu SepSe OctOc NovNo DecDe
  <chr>                                      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi             83    81    73    58    42    29   
2 Average low in ºF Av. low Lo               65    63    54    41    28    15   
3 Days with precipitation Days precip. DP    10     9     9     9     8     9   
4 Av. precipitation in inch Av. precip. Pre.  4.06  4.34  3.02  2.58  1.61  1.17
5 Av. snowfall in inch Snowfall Sn            0     0     0     1     7    11   

Now we repeat the steps above, but wrap the permission check and download into the bow() and scrape() functions from the polite package:

session <- bow("https://www.usclimatedata.com/climate/minneapolis/minnesota/united-states/usmn0503", force = TRUE)

result <- scrape(session) |>
  html_nodes(css = "table") |> 
  html_table(header = TRUE, fill = TRUE)
mpls_data1 <- result[[1]]
mpls_data2 <- result[[2]]

Even after finding the correct tables, there may still be a lot of work to make them tidy!

[Pause to Ponder:] What is each line of code doing below?

bind_cols(mpls_data1, mpls_data2) |>
  as_tibble() |> 
  select(-`...8`) |>
  mutate(`...1` = str_extract(`...1`, "[^ ]+ [^ ]+ [^ ]+")) |> 
  pivot_longer(cols = c(`JanJa`:`DecDe`), 
               names_to = "month", values_to = "weather") |>
  pivot_wider(names_from = `...1`, values_from = weather) |>
  mutate(month = str_sub(month, 1, 3))  |>
  rename(avg_high = "Average high in",
         avg_low = "Average low in")
New names:
• `` -> `...1`
• `` -> `...8`
# A tibble: 12 × 6
   month avg_high avg_low `Days with precipitation` `Av. precipitation in`
   <chr>    <dbl>   <dbl>                     <dbl>                  <dbl>
 1 Jan         24       9                         9                   0.89
 2 Feb         29      13                         7                   0.87
 3 Mar         42      25                         9                   1.68
 4 Apr         57      38                        11                   2.91
 5 May         69      50                        12                   3.91
 6 Jun         79      60                        11                   4.58
 7 Jul         83      65                        10                   4.06
 8 Aug         81      63                         9                   4.34
 9 Sep         73      54                         9                   3.02
10 Oct         58      41                         9                   2.58
11 Nov         42      28                         8                   1.61
12 Dec         29      15                         9                   1.17
# ℹ 1 more variable: `Av. snowfall in` <dbl>
# Probably want to rename the rest of the variables too!
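For example, one way to finish the renaming, assuming the pipeline above has been saved as mpls_weather (a name introduced here just for illustration; the column names come from the output above):

# hypothetical: suppose the pipeline output was stored first with
# mpls_weather <- bind_cols(mpls_data1, mpls_data2) |> ... (as above)
mpls_weather |>
  rename(precip_days = `Days with precipitation`,
         avg_precip  = `Av. precipitation in`,
         avg_snow    = `Av. snowfall in`)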

On Your Own

  1. In 13_maps.qmd we will see how to create an interactive U.S. map showing population densities by state. Right now, let’s see if we can use our new web scraping skills to scrape the correct population density data and create a tidy data frame!

A quick Wikipedia search yields this webpage with population densities in a nice table format. Use our 4 steps for rvest-ing data to acquire the data, and then create a tidy tibble with one row per state.

  2. Use the rvest package and html_table() to read in the table of data found at the link here and create a scatterplot of land area versus the 2024 estimated population. I give you some starter code below; fill in the "???" and be sure you can explain what EVERY line of code does and why it's necessary.
city_pop <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")

pop <- html_nodes(???, ???)
html_table(pop, header = TRUE, fill = TRUE)  # find right table
pop2 <- html_table(pop, header = TRUE, fill = TRUE)[[???]]
pop2

# perform the steps above with the polite package
session <- bow("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population", force = TRUE)

result <- scrape(session) |>
  html_nodes(???) |>
  html_table(header = TRUE, fill = TRUE)
pop2 <- result[[???]]
pop2

pop3 <- as_tibble(pop2[,c(1:6,8)]) |>
  slice(???) |>
  rename(`State` = `ST`,
         `Estimate2024` = `2024estimate`,
         `Census` = `2020census`,
         `Area` = `2020 land area`,
         `Density` = `2020 density`) |>
  mutate(Estimate2024 = parse_number(Estimate2024),
         Census = parse_number(Census),
         Change = ???,  # get rid of % but preserve +/-
         Area = parse_number(Area),
         Density = parse_number(Density)) |> 
  mutate(City = str_replace(City, "\\[.*$", ""))
pop3

# pick out unusual points
outliers <- pop3 |> 
  filter(Estimate2024 > ??? | Area > ???)

# This will work if you don't turn variables from chr to dbl, but in that
#  case notice how the axes are just evenly spaced categorical values
ggplot(pop3, aes(x = ???, y = ???)) +
  geom_point()  +
  geom_smooth() +
  ggrepel::geom_label_repel(data = ???, aes(label = ???))

  3. We would like to create a tibble with 4 years of data (2001-2004) from the Minnesota Wild hockey team. Specifically, we are interested in the "Scoring Regular Season" table from this webpage and the similar webpages from 2002, 2003, and 2004. Your final tibble should have 6 columns: player, year, age, pos (position), gp (games played), and pts (points).

You should (a) write a function called hockey_stats() with inputs for team and year to scrape data from the "Scoring Regular Season" table, and (b) use iteration techniques to scrape and combine 4 years worth of data. A generic sketch of the iteration pattern appears at the end of this section. Here are some functions you might consider:

  • row_to_names(row_number = 1) from the janitor package
  • clean_names() also from the janitor package
  • bow() and scrape() from the polite package
  • str_c() from the stringr package (for creating urls with user inputs)
  • map2() and list_rbind() for iterating and combining years

Try following these steps:

  1. Be sure you can find and clean the correct table from the 2021 season.

  2. Organize your rvest code from (1) into functions from the polite package.

  3. Place the code from (2) into a function where the user can input a team and year. You would then adjust the url accordingly and produce a clean table for the user.

  4. Use map2 and list_rbind to build one data set containing Minnesota Wild data from 2001-2004.
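Here is a generic sketch of the map2() + list_rbind() iteration pattern mentioned above. It is not the hockey solution itself; scrape_one_year() and the example.com URL are hypothetical stand-ins for your hockey_stats() function and the real webpage:

library(purrr)
library(stringr)

# hypothetical helper: builds a url from user inputs and returns a tibble
scrape_one_year <- function(team, year) {
  url <- str_c("https://www.example.com/", team, "/", year, ".html")  # made-up url pattern
  # in the real function: bow(url), scrape(), html_nodes("table"), html_table(), clean up
  tibble::tibble(team = team, year = year)  # placeholder so the sketch runs
}

# iterate over (team, year) pairs and stack the results into one tibble
map2(rep("MIN", 4), 2001:2004, scrape_one_year) |>
  list_rbind()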