Mini-Project 2: Data Acquisition

Overview

In this project, you will find data on the web, acquire it through an API or web scraping techniques, and create a tidy tibble suitable for future analysis.

Groups

I will form pairs with input from you (via the partner preference survey).

Timeline

Stage                             Tentative due date   Points
Stage I: Project Proposal         Mon Oct 27           10
Stage II: GitHub Collaboration    Wed Oct 29           10
Stage III: Project Submission     Fri Nov 7            80
Total                                                  100

Stage I: Project Proposal

For this project, you and your partner will find data on the web that’s not available as a neat .csv file or .xls spreadsheet (it may even be spread across multiple webpages). You will acquire it using techniques from class and tidy it so that it’s suitable for a future data/text analysis.

You must use one (or multiple) of these approaches to data acquisition:

  • APIs (using httr or httr2)
  • An API wrapper package (that requires an API key)
  • Harvesting data that appears in table form on a webpage (using rvest with html_table)
  • Harvesting pieces of data that could appear anywhere on a webpage (using rvest with html_text)
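
As a rough illustration of the two rvest approaches listed above, here is a minimal sketch. The page is a made-up HTML snippet built with minimal_html() so the code runs without a network connection; the hall names, the seat counts, and the p.note selector are all hypothetical. For a real site you would start from read_html() with the page's URL and choose selectors that match that page.

```r
library(rvest)

# Hypothetical page standing in for a real website (all content is made up)
page <- minimal_html('
  <h1>Campus Dining Halls</h1>
  <table>
    <tr><th>hall</th><th>seats</th></tr>
    <tr><td>North</td><td>120</td></tr>
    <tr><td>South</td><td>85</td></tr>
  </table>
  <p class="note">Updated weekly</p>
')

# Data in table form: select the table, then convert it to a tibble
halls <- page |>
  html_element("table") |>
  html_table()

# Data that could appear anywhere: select with a CSS selector, then pull the text
note <- page |>
  html_element("p.note") |>
  html_text()

halls
note
```

The same pattern scales to several pages: wrap the selection steps in a function that takes a URL, then apply it to each page.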

In your proposal, describe your motivations for obtaining the data you plan to gather, including questions you hope to answer. Describe the source(s) of your data, and what you expect your final data set(s) to look like (what do rows and columns represent, etc.). How will you use your data to answer questions of interest?

Stage II: GitHub Collaboration

You and your partner must (a) set up a GitHub repository where one person is the Owner and the other is a Collaborator, (b) connect the GitHub repository to an R project that you edit in your own RStudio on your laptop, and (c) certify that you and your partner can handle merge conflicts, should they arise.

To obtain your “certification” in (c), follow the instructions here while sitting with your partner. Read all of Chapter 10, then follow Steps 1-6 in Section 10.2 and Steps 1-11 in Section 10.4 (in both cases, don’t worry about the Exercise box that follows the last step). To receive your “certification”, upload screenshots as in Step 11, but with details from you and your partner.

The following notes may help as you’re working through the instructions in Chapter 10:

  • You can enter git config pull.rebase false in the Terminal tab in RStudio
  • The Owner needs to check “set up ReadMe” when creating the initial repo
  • Both the Owner and Collaborator need to connect to GitHub through RStudio (using File > New Project > Version Control) after the Owner sets up the initial GitHub repo. Keep checking that RStudio is the same at various stages.
  • Before Step 4 (in Part 1), the Owner may have to remove the .Rproj and .gitignore files, and they may also have to reload GitHub
  • In Step 8 (Part 2), click Stage even though it looks like the blue box is filled

Stage III: Project Submission

You will submit a link to a GitHub repo that contains (a) a Quarto file with your code, and (b) at least one .csv file (formed using write_csv()) containing data you have acquired and then cleaned. Your Quarto file should contain at least one custom function and possibly iteration techniques as well. Be sure that it is well commented (including a description of each variable in your final .csv file(s)) and that it conforms to style guidelines.
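
To make the requirements above concrete, here is a minimal sketch of a custom function, an iteration step, and write_csv() working together. Everything in it is hypothetical: raw_pages stands in for whatever you acquired from an API or by scraping, and tidy_record(), the city names, and the temperatures are invented for illustration. It assumes purrr 1.0.0 or later for list_rbind().

```r
library(purrr)
library(readr)
library(tibble)

# Hypothetical stand-in for acquired data: one list element per webpage or API response
raw_pages <- list(
  list(city = " Springfield ", temp_f = "41"),
  list(city = "Shelbyville", temp_f = "39")
)

# Custom function: turn one raw record into a one-row tidy tibble
#   city   - city name, with stray whitespace removed
#   temp_f - temperature in degrees Fahrenheit, converted to numeric
tidy_record <- function(record) {
  tibble(
    city   = trimws(record$city),
    temp_f = as.numeric(record$temp_f)
  )
}

# Iteration: apply the function to every record and stack the rows
weather <- map(raw_pages, tidy_record) |>
  list_rbind()

# Save the cleaned data (a temporary folder here; in your repo, use a project path)
out_file <- file.path(tempdir(), "weather.csv")
write_csv(weather, out_file)
```

In your own project you would write the file into the repository folder instead, for example write_csv(weather, "weather.csv"), so that it is committed alongside the Quarto file.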

Submission and Rubric

Mini-Project 2 must be submitted on Moodle by 11:00 PM on Fri Nov 7. You should submit a rendered PDF.

Check out this rubric for Mini-Project 2.