Mini-Project 4: Text Analysis

Overview

You will find a data set containing string data. This could be newspaper articles, tweets, songs, plays, movie reviews, or anything else you can imagine. Then you will answer questions of interest and tell a story about your data using skills you have developed in strings, regular expressions, and text analysis.

Your story must contain the following elements:

  • at least 3 different str_ functions
  • at least 3 different regular expressions
  • at least 2 different text analysis applications (count words, bing sentiment, afinn sentiment, nrc sentiment, wordclouds, trajectories over sections or time, tf-idf, bigrams, correlations, networks, LDA, etc.). Note that many interesting insights can be gained by strategic and thoughtful use of regular expressions paired with simple counts and summary statistics.
  • at least 3 illustrative, well-labeled plots or tables
  • a description of what insights can be gained from your plots and tables. Be sure you weave a compelling and interesting story!

Be sure to highlight the elements above so that they are easy for me to spot!
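
As a rough illustration of the first two requirements, the sketch below applies three stringr functions, each with a different regular expression, to a few made-up lines. The vector `toy_lines` and its contents are hypothetical and are not drawn from any of the data sets suggested on this page.

```r
library(stringr)

# Hypothetical toy lines, used only for illustration
toy_lines <- c(
  "Call me at 555-1234 before 5 PM!",
  "No phone number here.",
  "Two numbers: 555-0000 and 555-9999."
)

# str_detect(): does the line end with ! or ?
ends_excited <- str_detect(toy_lines, "[!?]$")

# str_extract(): first phone-like pattern (NA when there is none)
first_phone <- str_extract(toy_lines, "\\d{3}-\\d{4}")

# str_count(): number of separate runs of digits in each line
n_digit_runs <- str_count(toy_lines, "\\d+")

data.frame(toy_lines, ends_excited, first_phone, n_digit_runs)
```

Results like these can feed directly into simple counts, summary tables, or plots.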

Evaluation Rubric

Available here

Timeline

Mini-Project 4 must be submitted on Moodle by 11:00 PM on Tues Nov 26. Add a tab for Mini-Project 4 to your Quarto webpage, then submit its URL. Your webpage must also link to the GitHub repo containing your R code.

Topic Ideas

Obama tweets

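library(tidyverse)  # read_csv(), bind_rows(), mutate(), %>%
library(lubridate)  # ymd_hms()

# Read each file from the course website (local-path versions are commented out)
#barack <- read_csv("Data/tweets_potus.csv")
barack <- read_csv("https://proback.github.io/264_fall_2024/Data/tweets_potus.csv")
#michelle <- read_csv("Data/tweets_flotus.csv")
michelle <- read_csv("https://proback.github.io/264_fall_2024/Data/tweets_flotus.csv")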

tweets <- bind_rows(barack %>% 
                      mutate(person = "Barack"),
                    michelle %>% 
                      mutate(person = "Michelle")) %>%
  mutate(timestamp = ymd_hms(timestamp))

President Barack Obama became the first US President with an official Twitter account when @POTUS went live on May 18, 2015. (Yes, there was a time before Twitter/X.) First Lady Michelle Obama joined Twitter much earlier, though her first tweet was not from @FLOTUS. All of the tweets from @POTUS and @FLOTUS are now archived on Twitter as @POTUS44 and @FLOTUS44, and they are available as a csv download from the National Archives. You can read more here.

Potential things to investigate:

  • use of specific terms
  • use of @, #, RT (retweet), or -mo (personal tweet from Michelle Obama)
  • timestamp for date and time trends
  • sentiment analysis
  • anything else that seems interesting!
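
One possible way to start on the ideas above is sketched below. It uses a small made-up tibble instead of the real files, so the tweets, the timestamps, and the column name `text` are all assumptions; check `names(tweets)` before adapting it. The patterns for mentions, hashtags, retweets, and the -mo signature are simple first guesses, not a definitive tagging scheme.

```r
library(dplyr)
library(tibble)
library(stringr)
library(lubridate)

# Hypothetical toy tweets; the column name `text` is an assumption
toy_tweets <- tibble(
  person = c("Barack", "Michelle", "Michelle"),
  timestamp = c("2015-05-18 15:38:00", "2016-03-01 09:05:00",
                "2016-10-11 20:15:00"),
  text = c(
    "RT @WhiteHouse: Watch live #ActOnClimate",
    "So proud of these students! #ReachHigher -mo",
    "Join @FLOTUS and @DrBiden today #JoiningForces #ReachHigher"
  )
)

tweet_features <- toy_tweets %>%
  mutate(
    timestamp   = ymd_hms(timestamp),
    hour        = hour(timestamp),               # hour of day, for time trends
    n_mentions  = str_count(text, "@\\w+"),      # @handles
    n_hashtags  = str_count(text, "#\\w+"),      # #hashtags
    is_retweet  = str_detect(text, "^RT\\b"),    # starts with RT
    is_personal = str_detect(text, "-mo\\s*$")   # ends with -mo
  )

# Per-person totals, ready for a table or a bar chart
by_person <- tweet_features %>%
  group_by(person) %>%
  summarize(
    n_tweets = n(),
    mentions = sum(n_mentions),
    hashtags = sum(n_hashtags),
    personal = sum(is_personal)
  )

by_person
```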

Dear Abby advice column

Read in the “Dear Abby” data underlying The Pudding’s 30 Years of American Anxieties article.

posts <- read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")

Take a couple minutes to scroll through the 30 Years of American Anxieties article to get ideas for themes that you might want to search for and illustrate using regular expressions.
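
A theme search of this kind might look like the sketch below, which flags letters that mention money and then computes the share of such letters per year. The tibble `toy_posts` is made up, and the column names `year` and `question_only` are assumptions about the real file; confirm them with `names(posts)`. The word list in the pattern is only a starting point.

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical toy letters; column names are assumptions about the real data
toy_posts <- tibble(
  year = c(1990, 1990, 2005, 2005),
  question_only = c(
    "my husband refuses to talk about money.",
    "my mother-in-law criticizes my cooking.",
    "my boyfriend spends all our money on games.",
    "my wife and i disagree about our wedding budget."
  )
)

# Whole-word match on a few money-related terms, ignoring case
money_pattern <- regex("\\b(money|budget|debt)\\b", ignore_case = TRUE)

theme_by_year <- toy_posts %>%
  mutate(mentions_money = str_detect(question_only, money_pattern)) %>%
  group_by(year) %>%
  summarize(
    n_letters  = n(),
    prop_money = mean(mentions_money)  # share of letters matching the theme
  )

theme_by_year
```

Plotting `prop_money` against `year` for several themes is one way to show how concerns shift over time.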

Other sources for string data

library(RTextTools)  # may have to install first
library(tibble)      # as_tibble()
data(NYTimes)
as_tibble(NYTimes)
# A tibble: 3,104 × 5
   Article_ID Date      Title                                 Subject Topic.Code
        <int> <fct>     <fct>                                 <fct>        <int>
 1      41246 1-Jan-96  Nation's Smaller Jails Struggle To C… Jails …         12
 2      41257 2-Jan-96  FEDERAL IMPASSE SADDLING STATES WITH… Federa…         20
 3      41268 3-Jan-96  Long, Costly Prelude Does Little To … Conten…         20
 4      41279 4-Jan-96  Top Leader of the Bosnian Serbs Now … Bosnia…         19
 5      41290 5-Jan-96  BATTLE OVER THE BUDGET: THE OVERVIEW… Battle…          1
 6      41302 7-Jan-96  South African Democracy Stumbles on … politi…         19
 7      41314 8-Jan-96  Among Economists, Little Fear on Def… econom…          1
 8      41333 10-Jan-96 BATTLE OVER THE BUDGET: THE OVERVIEW… budget…          1
 9      41344 11-Jan-96 High Court Is Cool To Census Change   census…         20
10      41355 12-Jan-96 TURMOIL AT BARNEYS: THE DIFFICULTIES… barney…         15
# ℹ 3,094 more rows
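
A word count over headlines like these could be sketched as follows, assuming the tidytext package is installed. The tibble `toy_titles` is hypothetical (only its first title appears in the output above), and it stands in for the `Title` column. Because `Title` is stored as a factor in `NYTimes`, the sketch converts it to character before tokenizing.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy headlines standing in for NYTimes$Title
toy_titles <- tibble(
  Article_ID = 1:3,
  Title = c(
    "High Court Is Cool To Census Change",
    "Budget Talks Stall As Court Weighs Census Case",
    "Census Change Could Shift Budget"
  )
)

title_words <- toy_titles %>%
  mutate(Title = as.character(Title)) %>%  # needed when Title is a factor
  unnest_tokens(word, Title) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # most frequent words first

title_words
```

The same tokenized table is a natural starting point for sentiment joins, tf-idf, or bigrams.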