You can download this .qmd file from here. Just hit the Download Raw File button.
Using rvest for web scraping
If you would like to assemble data from a website with no API, you can often acquire it using more brute-force methods commonly called web scraping. Typically, this involves finding content inside the HTML (HyperText Markup Language) code used to create webpages and web applications, along with the CSS (Cascading Style Sheets) language that controls their appearance. We are used to reading data from .csv files, but webpages store their content in markup like HTML and XML (XML is similar to HTML, but designed for data). You can read more about XML here if you're interested: https://www.w3schools.com/xml/default.asp
XML has a sort of tree or graph-like structure… so we can identify information by which node it belongs to (html_nodes) and then convert the content into something we can use in R (html_text or html_table).
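For instance, here is a tiny illustration using a made-up snippet of HTML (the div id and cheese names are invented just for this sketch):

library(rvest)

# a tiny, made-up HTML document to illustrate the tree structure
doc <- minimal_html('
  <div id="main-body">
    <h3>Brie</h3>
    <h3>Gouda</h3>
  </div>')

doc |>
  html_node("#main-body") |>   # grab the node we care about
  html_nodes("h3") |>          # then all <h3> nodes inside it
  html_text()                  # and convert their contents to text
# [1] "Brie"  "Gouda"

Each function walks one step further down the tree, which is exactly what we will do on real webpages below.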
Here's one quick example of web scraping. First check out the webpage https://www.cheese.com/by_type and then select Semi-Soft. We can drill into the HTML code for this webpage to find and store specific information (like cheese names).
library(polite)   # bow() and scrape()
library(rvest)    # html_node(), html_nodes(), html_text()

session <- bow("https://www.cheese.com/by_type", force = TRUE)
result <- scrape(session, query = list(t = "semi-soft", per_page = 100)) |>
  html_node("#main-body") |>
  html_nodes("h3") |>
  html_text()
head(result)
Four steps to scraping data with functions in the rvest library:
robotstxt::paths_allowed(). Check whether the website allows scraping, and then make sure we scrape "politely".
read_html(). Read in the URL containing the data and turn the HTML code into an XML document (another markup format that's easier to work with).
html_nodes(). Extract specific nodes from the XML document using the CSS selector that leads to the content of interest (use css = "table" for tables).
html_text(). Extract content of interest from nodes. Might also use html_table() etc.
Data scraping ethics
Before scraping, we should always check whether the website allows it. We should also consider whether the data contains any personal or confidential information, and we should be careful not to overload the server we're scraping from.
Chapter 24 in R4DS provides a nice overview of some of the important issues to consider. A couple of highlights:
be aware of the terms of service and, if available, the robots.txt file that some websites publish to clarify what can and cannot be scraped, along with other constraints on scraping.
use the polite package to scrape public, non-personal, and factual data in a respectful manner
scrape with a good purpose and request only what you need; in particular, be extremely wary of personally identifiable information
See this article for more perspective on the ethics of data scraping.
When the data is already in table form:
In this example, we will scrape climate data from this website.
The website already contains data in table form, so we use html_nodes(., css = "table") and html_table().
# check that scraping is allowed (Step 0)
robotstxt::paths_allowed("https://www.usclimatedata.com/climate/minneapolis/minnesota/united-states/usmn0503")
www.usclimatedata.com
[1] TRUE
# Step 1: read_html()
mpls <- read_html("https://www.usclimatedata.com/climate/minneapolis/minnesota/united-states/usmn0503")

# 2: html_nodes()
tables <- html_nodes(mpls, css = "table")
tables   # have to guesstimate which table contains climate info
# 3: html_table()
html_table(tables, header = TRUE, fill = TRUE)   # find the right table
[[1]]
# A tibble: 5 × 7
`` JanJa FebFe MarMa AprAp MayMa JunJu
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi 24 29 42 57 69 79
2 Average low in ºF Av. low Lo 9 13 25 38 50 60
3 Days with precipitation Days precip. DP 9 7 9 11 12 11
4 Av. precipitation in inch Av. precip. Pre. 0.89 0.87 1.68 2.91 3.91 4.58
5 Av. snowfall in inch Snowfall Sn 11 10 8 4 0 0
[[2]]
# A tibble: 5 × 7
`` JulJu AugAu SepSe OctOc NovNo DecDe
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi 83 81 73 58 42 29
2 Average low in ºF Av. low Lo 65 63 54 41 28 15
3 Days with precipitation Days precip. DP 10 9 9 9 8 9
4 Av. precipitation in inch Av. precip. Pre. 4.06 4.34 3.02 2.58 1.61 1.17
5 Av. snowfall in inch Snowfall Sn 0 0 0 1 7 11
[[3]]
# A tibble: 12 × 7
Day HighºF LowºF `Prec/moinch` `Prec/yrinch` `Snow/moinch` `Snow/yrinch`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Jan 23.8 8.3 0.04 0.04 0.39 1
2 2 Jan 23.7 8.2 0.08 0.08 0.71 1.8
3 3 Jan 23.6 8.1 0.12 0.12 1.1 2.8
4 4 Jan 23.5 7.9 0.12 0.12 1.5 3.8
5 5 Jan 23.5 7.8 0.16 0.16 1.81 4.6
6 6 Jan 23.4 7.7 0.2 0.2 2.2 5.6
7 7 Jan 23.4 7.6 0.24 0.24 2.6 6.6
8 8 Jan 23.3 7.5 0.28 0.28 3.11 7.9
9 9 Jan 23.3 7.4 0.28 0.28 3.5 8.9
10 10 Jan 23.3 7.3 0.31 0.31 3.9 9.9
11 11 Jan 23.3 7.3 0.35 0.35 4.29 10.9
12 12 Jan 23.3 7.2 0.39 0.39 4.69 11.9
[[4]]
# A tibble: 26 × 6
Day HighºF LowºF Precip.inch Snowinch `Snow d.inch`
<chr> <dbl> <dbl> <chr> <chr> <dbl>
1 01 Dec 32 19 0.07 1.61 7
2 02 Dec 27 12 0.00 0.00 6
3 03 Dec 37.9 19.9 0.00 0.00 6
4 04 Dec 39 24.1 0.00 0.00 6
5 05 Dec 37 21.9 0.00 0.00 5
6 06 Dec 32 17.1 0.00 0.00 5
7 07 Dec 42.1 21.9 0.00 0.00 5
8 08 Dec 41 30.9 0.00 0.00 5
9 09 Dec 34 -0.9 0.16 2.52 5
10 10 Dec 8.1 -4 T T 7
# ℹ 16 more rows
[[5]]
# A tibble: 9 × 4
`` `Dec 19` `` Normal
<chr> <chr> <lgl> <chr>
1 "Average high temperature Av. high temp." "29.9 ºF" NA "24 ºF"
2 "Average low temperature Av. low temp." "14.6 ºF" NA "9 ºF"
3 "Total precipitation Total precip." "0.39 inch" NA "0.89 inch"
4 "Total snowfall Total snowfall" "6.33 inch" NA "11 inch"
5 "" "" NA ""
6 "Highest max temperature Highest max temp." "44.1 ºF" NA "-"
7 "Lowest max temperature Lowest max temp." "8.1 ºF" NA "-"
8 "Highest min temperature Highest min temp." "32.0 ºF" NA "-"
9 "Lowest min temperature Lowest min temp." "-5.1 ºF" NA "-"
[[6]]
# A tibble: 10 × 3
`` `` ``
<chr> <chr> <lgl>
1 Country United States NA
2 State Minnesota NA
3 County Hennepin NA
4 City Minneapolis NA
5 Zip Code 55401 NA
6 Longitude -93.27 dec. degr. NA
7 Latitude 44.98 dec. degr. NA
8 Altitude - Elevation 840ft NA
9 ICAO - NA
10 IATA - NA
[[7]]
# A tibble: 6 × 3
`` `` ``
<chr> <chr> <lgl>
1 Local Time 02:55 PM NA
2 Sunrise 06:56 AM NA
3 Sunset 04:57 PM NA
4 Day/Night Day NA
5 Timezone Chicago -6:00 NA
6 Timezone DB America - Chicago NA
[[8]]
# A tibble: 5 × 2
`` ``
<chr> <chr>
1 Annual high temperature 55ºF
2 Annual low temperature 38ºF
3 Days per year with precip. 113 days
4 Average annual precip. 31.62 inch
5 Av. annual snowfall 52 inch
mpls_data1 <- html_table(tables, header = TRUE, fill = TRUE)[[1]]
mpls_data1
# A tibble: 5 × 7
`` JanJa FebFe MarMa AprAp MayMa JunJu
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi 24 29 42 57 69 79
2 Average low in ºF Av. low Lo 9 13 25 38 50 60
3 Days with precipitation Days precip. DP 9 7 9 11 12 11
4 Av. precipitation in inch Av. precip. Pre. 0.89 0.87 1.68 2.91 3.91 4.58
5 Av. snowfall in inch Snowfall Sn 11 10 8 4 0 0
mpls_data2 <- html_table(tables, header = TRUE, fill = TRUE)[[2]]
mpls_data2
# A tibble: 5 × 7
`` JulJu AugAu SepSe OctOc NovNo DecDe
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average high in ºF Av. high Hi 83 81 73 58 42 29
2 Average low in ºF Av. low Lo 65 63 54 41 28 15
3 Days with precipitation Days precip. DP 10 9 9 9 8 9
4 Av. precipitation in inch Av. precip. Pre. 4.06 4.34 3.02 2.58 1.61 1.17
5 Av. snowfall in inch Snowfall Sn 0 0 0 1 7 11
Now we wrap the 4 steps above into the bow and scrape functions from the polite package:
session <- bow("https://www.usclimatedata.com/climate/minneapolis/minnesota/united-states/usmn0503", force = TRUE)

result <- scrape(session) |>
  html_nodes(css = "table") |>
  html_table(header = TRUE, fill = TRUE)

mpls_data1 <- result[[1]]
mpls_data2 <- result[[2]]
Even after finding the correct tables, there may still be a lot of work to make it tidy!!!
[Pause to Ponder:] What is each line of code doing below?
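The original code chunk is not reproduced here, but below is a rough sketch of one way to build a tidy monthly table like the output that follows, assuming the mpls_data1 and mpls_data2 tibbles created above. The object name mpls_monthly is invented for this sketch, and the last three column names will come out longer than in the printed output:

library(tidyverse)

mpls_monthly <- bind_cols(mpls_data1, mpls_data2 |> select(-1)) |>          # Jan-Jun and Jul-Dec tables side by side
  rename(measure = 1) |>                                                    # give the row-label column a usable name
  pivot_longer(cols = -measure, names_to = "month", values_to = "value") |> # one row per measure-month pair
  mutate(month = str_sub(month, 1, 3)) |>                                   # "JanJa" -> "Jan", etc.
  pivot_wider(names_from = measure, values_from = value) |>                 # one row per month, one column per measure
  rename(avg_high = 2, avg_low = 3)                                         # rename the first two measures by position
mpls_monthly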
# A tibble: 12 × 6
month avg_high avg_low `Days with precipitation` `Av. precipitation in`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Jan 24 9 9 0.89
2 Feb 29 13 7 0.87
3 Mar 42 25 9 1.68
4 Apr 57 38 11 2.91
5 May 69 50 12 3.91
6 Jun 79 60 11 4.58
7 Jul 83 65 10 4.06
8 Aug 81 63 9 4.34
9 Sep 73 54 9 3.02
10 Oct 58 41 9 2.58
11 Nov 42 28 8 1.61
12 Dec 29 15 9 1.17
# ℹ 1 more variable: `Av. snowfall in` <dbl>
# Probably want to rename the rest of the variables too!
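For instance, continuing the hypothetical mpls_monthly sketch above, one way to rename the remaining columns by position:

mpls_monthly |>
  rename(days_precip = 4, avg_precip_in = 5, avg_snow_in = 6)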
On Your Own
In 13_maps.qmd we will see how to create an interactive U.S. map showing population densities by state. Right now, let’s see if we can use our new web scraping skills to scrape the correct population density data and create a tidy data frame!
A quick Wikipedia search yields this webpage with population densities in a nice table format. Use our 4 steps to rvest-ing data to acquire the data, and then create a tidy tibble with one row per state.
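A possible starting point is sketched below, in the spirit of the starter code in the next problem. The URL is my guess at the linked Wikipedia page (it may differ from the page actually linked above), and the ??? marks a value you will need to determine yourself:

library(rvest)
library(polite)

# the URL below is an assumption about which Wikipedia page is linked above
density_url <- "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"

session <- bow(density_url, force = TRUE)
density_tables <- scrape(session) |>
  html_nodes(css = "table") |>
  html_table(header = TRUE, fill = TRUE)
density_tables                         # inspect to find the table with one row per state
state_density <- density_tables[[???]]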
Use the rvest package and html_table to read in the table of data found at the link here and create a scatterplot of land area versus the 2024 estimated population. I give you some starter code below; fill in the “???” and be sure you can explain what EVERY line of code does and why it’s necessary.
city_pop <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")
pop <- html_nodes(???, ???)
html_table(pop, header = TRUE, fill = TRUE)   # find right table
pop2 <- html_table(pop, header = TRUE, fill = TRUE)[[???]]
pop2

# perform the steps above with the polite package
session <- bow("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population", force = TRUE)
result <- scrape(session) |>
  html_nodes(???) |>
  html_table(header = TRUE, fill = TRUE)
pop2 <- result[[???]]
pop2

pop3 <- as_tibble(pop2[, c(1:6, 8)]) |>
  slice(???) |>
  rename(`State` = `ST`,
         `Estimate2024` = `2024estimate`,
         `Census` = `2020census`,
         `Area` = `2020 land area`,
         `Density` = `2020 density`) |>
  mutate(Estimate2024 = parse_number(Estimate2024),
         Census = parse_number(Census),
         Change = ???,   # get rid of % but preserve +/-
         Area = parse_number(Area),
         Density = parse_number(Density)) |>
  mutate(City = str_replace(City, "\\[.*$", ""))
pop3

# pick out unusual points
outliers <- pop3 |>
  filter(Estimate2024 > ??? | Area > ???)

# This will work if you don't turn variables from chr to dbl, but in that
# case notice how the axes are just evenly spaced categorical variables
ggplot(pop3, aes(x = ???, y = ???)) +
  geom_point() +
  geom_smooth() +
  ggrepel::geom_label_repel(data = ???, aes(label = ???))
We would like to create a tibble with 4 years of data (2001-2004) from the Minnesota Wild hockey team. Specifically, we are interested in the “Scoring Regular Season” table from this webpage and the similar webpages from 2002, 2003, and 2004. Your final tibble should have 6 columns: player, year, age, pos (position), gp (games played), and pts (points).
You should (a) write a function called hockey_stats with inputs for team and year to scrape data from the "Scoring Regular Season" table, and (b) use iteration techniques to scrape and combine 4 years' worth of data. (A rough skeleton is sketched after the steps below.) Here are some functions you might consider:
row_to_names(row_number = 1) from the janitor package
clean_names() also from the janitor package
bow() and scrape() from the polite package
str_c() from the stringr package (for creating urls with user inputs)
map2() and list_rbind() for iterating and combining years
Try following these steps:
Be sure you can find and clean the correct table from the 2021 season.
Organize your rvest code from (1) into functions from the polite package.
Place the code from (2) into a function where the user can input a team and year. You would then adjust the url accordingly and produce a clean table for the user.
Use map2 and list_rbind to build one data set containing Minnesota Wild data from 2001-2004.
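To tie the hints and steps together, here is a rough, unverified skeleton of what hockey_stats and the iteration might look like. The URL pattern, the position of the scoring table on the page, and the column names after cleaning are all assumptions about the linked site (with ??? where you need to check for yourself), not a finished solution:

library(tidyverse)
library(rvest)
library(polite)
library(janitor)

hockey_stats <- function(team, year) {
  # build the url from the user inputs; this pattern is an assumption about the site
  url <- str_c("https://www.hockey-reference.com/teams/", team, "/", year, ".html")
  session <- bow(url, force = TRUE)

  scrape(session) |>
    html_nodes(css = "table") |>
    html_table(header = TRUE, fill = TRUE) |>
    pluck(???) |>                             # index of the "Scoring Regular Season" table
    row_to_names(row_number = 1) |>           # promote the first row to column names
    clean_names() |>
    mutate(year = year) |>                    # record which season this table came from
    select(player, year, age, pos, gp, pts)   # assumes these are the names after clean_names()
}

# scrape and combine the 2001-2004 seasons ("MIN" is recycled across the four years)
wild_stats <- map2("MIN", 2001:2004, hockey_stats) |>
  list_rbind()
wild_stats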