Data Acquisition with APIs in R

You can download this .qmd file from here. Just hit the Download Raw File button.

Credit to Brianna Heggeseth and Leslie Myint from Macalester College for a few of these descriptions and examples.

Getting data from websites

Option 1: APIs

When we interact with sites like The New York Times, Zillow, and Google, we are accessing their data via a graphical layout (e.g., images, colors, columns) that is easy for humans to read but hard for computers.

API stands for Application Programming Interface, and this term describes a general class of tool that allows computers, rather than humans, to interact with an organization’s data. How does this work?

  • When we use web browsers to navigate the web, our browsers communicate with web servers using a technology called HTTP or Hypertext Transfer Protocol to get information that is formatted into the display of a web page.
  • Programming languages such as R can also use HTTP to communicate with web servers. The easiest way to do this is via Web APIs, or Web Application Programming Interfaces, which focus on transmitting raw data, rather than images, colors, or other appearance-related information that humans interact with when viewing a web page.

A large variety of web APIs provide data accessible to programs written in R (and almost any other programming language!). Almost all reasonably large commercial websites offer APIs. Todd Motto has compiled an expansive list of Public Web APIs on GitHub, although it’s about 3 years old now so it’s not a perfect or complete list. Feel free to browse this list to see what data sources are available.

For our purposes of obtaining data, APIs exist where website developers make data nicely packaged for consumption. HTTP (Hypertext Transfer Protocol) underlies web APIs, and the R package httr (and now its successor, httr2) was written to map closely onto HTTP from R. Essentially you send a request to the website (server) you want data from, and it sends back a response, which should contain the data (plus other information, such as a status code and headers).
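The request half of that exchange can be sketched with httr2 without contacting any server. This is a minimal illustration, not code from the original notes: the host api.example.com and the query parameters topic and limit are made up, and the commented lines show what sending the request would look like.

```r
library(httr2)

# Build (but do not send) a request; the URL and parameters are hypothetical
req <- request("https://api.example.com/data") |>
  req_url_query(topic = "demo", limit = 5)

req$url   # the full URL that would be requested

# Sending the request requires an internet connection and a real server:
# resp <- req_perform(req)
# resp_status(resp)       # 200 means the request succeeded
# resp_body_json(resp)    # the data, parsed from JSON into an R list
```

Keeping the "build" and "send" steps separate makes it easy to inspect a request before spending any of a limited daily quota.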

The case studies in this document provide a really quick introduction to data acquisition, just to get you started and show you what’s possible. For more information, these links can be somewhat helpful:

  • https://towardsdatascience.com/functions-with-r-and-rvest-a-laymens-guide-acda42325a77
  • https://nceas.github.io/oss-lessons/data-liberation/intro-webscraping.html

Wrapper packages

In R, it is easiest to use Web APIs through a wrapper package, an R package written specifically for a particular Web API.

  • The R development community has already contributed wrapper packages for many large Web APIs (e.g. ZillowR, rtweet, genius, Rspotify, tidycensus, etc.)
  • To find a wrapper package, search the web for “R package” and the name of the website.
  • rOpenSci also has a good collection of wrapper packages.

In particular, tidycensus is a wrapper package that makes it easy to obtain desired census information for mapping and modeling:
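The call that created sample_acs_data does not appear in these notes. A plausible reconstruction, inferred from the variable codes, the Minnesota tracts, and the geom_sf() plots used afterwards, might look like the sketch below. The year is an assumption, and fetch_sample_acs() is a hypothetical wrapper used only so that nothing is downloaded until it is called.

```r
library(tidycensus)

# Hypothetical reconstruction of the call behind `sample_acs_data`;
# the year is an assumption, the rest is inferred from the later code
fetch_sample_acs <- function(year = 2021) {
  get_acs(
    geography = "tract",
    state     = "MN",
    variables = c("B01003_001", "B19013_001"),  # population, median income
    year      = year,
    output    = "wide",    # one row per tract, with *E and *M columns
    geometry  = TRUE       # attach tract shapes for geom_sf()
  )
}

# Requires an internet connection:
# sample_acs_data <- fetch_sample_acs()
```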

Warning: • You have not set a Census API key. Users without a key are limited to 500
queries per day and may experience performance limitations.
ℹ For best results, get a Census API key at
http://api.census.gov/data/key_signup.html and then supply the key to the
`census_api_key()` function to use it throughout your tidycensus session.
This warning is displayed once per session.

Obtaining raw data from the Census Bureau was that easy! Often we will have to obtain and use a secret API key to access the data, but that’s not always necessary with tidycensus.

Now we can tidy that data and produce plots and analyses.

# Rename cryptic variables from the census form
sample_acs_data <- sample_acs_data |>
  rename(population = B01003_001E,
         population_moe = B01003_001M,
         median_income = B19013_001E,
         median_income_moe = B19013_001M)

# Plot with geom_sf since our data contains 1 row per census tract
#   with its geometry
ggplot(data = sample_acs_data) + 
  geom_sf(aes(fill = median_income), colour = "white", linetype = 2) + 
  theme_void()  

# The whole state of MN is overwhelming, so focus on a single county
sample_acs_data |>
  filter(str_detect(NAME, "Hennepin")) |>
  ggplot() + 
    geom_sf(aes(fill = median_income), colour = "white", linetype = 2)

# Look for relationships between variables with 1 row per tract
as_tibble(sample_acs_data) |>
  ggplot(aes(x = population, y = median_income)) + 
    geom_point() + 
    geom_smooth(method = "lm")  

Extra resources:

get_acs() is one of the functions in tidycensus. Let’s explore what’s going on behind the scenes with get_acs().

Accessing web APIs directly

Getting a Census API key

Many APIs (and their wrapper packages) require users to obtain a key to use their services.

  • This lets organizations keep track of what data is being used.
  • It also lets them rate limit their API so that programs don’t make too many requests per minute/hour/day. Be aware that most APIs do have rate limits, especially for their free tiers.
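One simple way to respect a rate limit is to pause between requests. The helper below is a hypothetical sketch (call_politely() is not part of any package); it applies a function to each input and sleeps between calls, with toupper() standing in for a real request function.

```r
# Hypothetical helper: call `f` on each element of `x`,
# pausing `delay` seconds between calls to stay under a rate limit
call_politely <- function(x, f, delay = 0.5) {
  results <- vector("list", length(x))
  for (i in seq_along(x)) {
    if (i > 1) Sys.sleep(delay)   # wait before every call after the first
    results[[i]] <- f(x[[i]])
  }
  results
}

# Usage with a stand-in for a real request function
call_politely(c("a", "b", "c"), toupper, delay = 0.1)
```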

Navigate to https://api.census.gov/data/key_signup.html to obtain a Census API key:

  • Organization: St. Olaf College
  • Email: Your St. Olaf email address

You will get the message:

Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key.

Check your email. Copy and paste your key into a new text file:

  • File > New File > Text File (towards the bottom of the menu)
  • Save as census_api_key.txt in the same folder as this .qmd.

You could then read in the key with code like this:

myapikey <- readLines("~/264_fall_2024/DS2_preview_work/census_api_key.txt")

Handling API keys

While this works, once we start backing up our files to GitHub, the API key will also appear on GitHub, and you want to keep your API key secret. Thus, we might use environment variables instead.

One way to store a secret across sessions is with environment variables. Environment variables, or envvars for short, are a cross platform way of passing information to processes. For passing envvars to R, you can list name-value pairs in a file called .Renviron in your home directory. The easiest way to edit it is to run:

file.edit("~/.Renviron")

The file looks something like

PATH = "path"
VAR1 = "value1"
VAR2 = "value2"

And you can access the values in R using Sys.getenv():

Sys.getenv("VAR1")
#> [1] "value1"

Note that .Renviron is only processed on startup, so you’ll need to restart R to see changes.
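Because Sys.getenv() returns an empty string for a variable that has not been set, a missing key can otherwise go unnoticed until a request fails. A small guard along these lines may help; get_key() is a hypothetical helper and DEMO_API_KEY is a made-up variable name used only for the demonstration.

```r
# Hypothetical helper: fetch an environment variable, failing loudly if unset
get_key <- function(name) {
  key <- Sys.getenv(name)
  if (key == "") {
    stop("Environment variable ", name, " is not set. ",
         "Add it to ~/.Renviron and restart R.", call. = FALSE)
  }
  key
}

# Demonstration with a throwaway variable set for this session only
Sys.setenv(DEMO_API_KEY = "abc123")
get_key("DEMO_API_KEY")
```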

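Another option is to set the variable from within R using Sys.setenv() and read it back with Sys.getenv(). Unlike an entry in .Renviron, a value set with Sys.setenv() only lasts for the current R session:

# Uncomment the first line and paste in your key to set CENSUS_KEY for
#   this session; do not save the key itself in a file you push to GitHub
# Sys.setenv("CENSUS_KEY" = "my census api key pasted here")
# my_census_api_key <- Sys.getenv("CENSUS_KEY")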
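With a key available, a direct request to the Census API (roughly what get_acs() assembles behind the scenes) can be sketched with httr. The endpoint year, the variable list, and the geography are illustrative choices rather than values taken from these notes; 27 and 053 are the FIPS codes for Minnesota and Hennepin County. The exact form of the `in` parameter is an assumption based on the Census API's published examples.

```r
library(httr)

# Query parameters for the 5-year ACS endpoint (illustrative choices)
query <- list(
  get   = "NAME,B01003_001E,B19013_001E",
  `for` = "tract:*",
  `in`  = "state:27 county:053"
)

# Only attach the key if one has been stored in CENSUS_KEY
key <- Sys.getenv("CENSUS_KEY")
if (key != "") query$key <- key

census_url <- modify_url("https://api.census.gov/data/2021/acs/acs5",
                         query = query)

# Sending the request needs an internet connection:
# resp <- GET(census_url)
# status_code(resp)                        # 200 means success
# acs_raw <- content(resp, as = "parsed")  # list; first element holds column names
```

Avoid printing census_url in a document you share, since it contains the key in plain text.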

On Your Own

  1. Write a for loop to obtain the Hennepin County data from 2017-2021

  2. Write a function to give choices about year, county, and variables

  3. Use your function from (2) along with map and list_rbind to build a data set for Rice County for the years 2019-2021

One more example using an API key

Here’s an example of getting data from a website that attempts to make IMDb movie data available as an API.

Initial instructions:

  • go to omdbapi.com under the API Key tab and request a free API key
  • store your key as discussed earlier
  • explore the examples at omdbapi.com

We will first obtain data about the movie Coco from 2017.

library(tidyverse)   # str_c()
library(httr)        # GET(), content()

myapikey <- Sys.getenv("OMDB_KEY")

# Find url exploring examples at omdbapi.com
url <- str_c("http://www.omdbapi.com/?t=Coco&y=2017&apikey=", myapikey)

coco <- GET(url)   # coco holds response from server
coco               # Status of 200 is good!

details <- content(coco, "parsed")
details                         # get a list of 25 pieces of information
details$Year                    # how to access details
details[[2]]                    # since a list, another way to access
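As an alternative to pasting the URL together by hand, the query can be supplied as a named list so that spaces and punctuation in a title are encoded automatically. This is a sketch only: omdb_url() is a hypothetical helper, and "demo" is a placeholder rather than a working key.

```r
library(httr)

# Hypothetical helper: build an OMDb request URL from a movie title
omdb_url <- function(title, apikey = Sys.getenv("OMDB_KEY")) {
  modify_url(
    "http://www.omdbapi.com/",
    query = list(t = title, apikey = apikey)
  )
}

# Spaces and the colon are percent-encoded for us
omdb_url("Thor: Ragnarok", apikey = "demo")
```

With this approach, titles can be written naturally ("Wonder Woman") instead of with manual "+" separators.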

Now build a data set for a collection of movies

# Must figure out pattern in URL for obtaining different movies
#  - try searching for others
movies <- c("Coco", "Wonder+Woman", "Get+Out", 
            "The+Greatest+Showman", "Thor:+Ragnarok")

# Set up empty tibble
omdb <- tibble(Title = character(), Rated = character(), Genre = character(),
       Actors = character(), Metascore = double(), imdbRating = double(),
       BoxOffice = double())

# Use for loop to run through API request process once per movie,
#   each time filling the next row in the tibble
#  - can do max of 1000 GETs per day
for (i in seq_along(movies)) {
  url <- str_c("http://www.omdbapi.com/?t=", movies[i],
               "&apikey=", myapikey)
  Sys.sleep(0.5)                 # pause between requests
  onemovie <- GET(url)
  details <- content(onemovie, "parsed")
  omdb[i,1] <- details$Title
  omdb[i,2] <- details$Rated
  omdb[i,3] <- details$Genre
  omdb[i,4] <- details$Actors
  omdb[i,5] <- parse_number(details$Metascore)
  omdb[i,6] <- parse_number(details$imdbRating)
  omdb[i,7] <- parse_number(details$BoxOffice)   # no $ and ,'s
}

omdb

#  could use stringr functions to further organize this data - separate 
#    different genres, different actors, etc.
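The same table can also be built without pre-allocating an empty tibble, by converting each parsed response to a one-row tibble and stacking the rows with map() and list_rbind(). The sketch below runs offline: details_to_row() is a hypothetical helper, and fake_details holds made-up placeholder values standing in for what content() would return.

```r
library(tidyverse)

# Hypothetical helper: turn one parsed OMDb response (a list) into a one-row tibble
details_to_row <- function(details) {
  tibble(
    Title      = details$Title,
    Rated      = details$Rated,
    Genre      = details$Genre,
    Actors     = details$Actors,
    Metascore  = parse_number(details$Metascore),
    imdbRating = parse_number(details$imdbRating),
    BoxOffice  = parse_number(details$BoxOffice)   # strips $ and commas
  )
}

# Placeholder lists standing in for parsed responses; all values are made up
fake_details <- list(
  list(Title = "Movie A", Rated = "PG", Genre = "Drama", Actors = "Actor One",
       Metascore = "50", imdbRating = "5.0", BoxOffice = "$1,000"),
  list(Title = "Movie B", Rated = "PG-13", Genre = "Comedy", Actors = "Actor Two",
       Metascore = "60", imdbRating = "6.5", BoxOffice = "$2,500")
)

omdb_alt <- map(fake_details, details_to_row) |>
  list_rbind()
omdb_alt
```

In real use, fake_details would be replaced by a list of parsed responses, one per movie.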

On Your Own (continued)

  1. (Based on final project by Mary Wu and Jenna Graff, MSCS 264, Spring 2024). Start with a small data set on 56 national parks from Kaggle, and supplement with columns for the park address (a single column including address, city, state, and zip code) and a list of available activities (a single character column with activities separated by commas) from the park websites themselves.

Preliminaries:

np_kaggle <- read_csv("~/264_fall_2024/Data/parks.csv")
Rows: 56 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Park Code, Park Name, State
dbl (3): Acres, Latitude, Longitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.