library(tidyverse)
library(lubridate)  # provides ymd_hms()

# barack <- read_csv("Data/tweets_potus.csv")
barack <- read_csv("https://proback.github.io/264_fall_2024/Data/tweets_potus.csv")
# michelle <- read_csv("Data/tweets_flotus.csv")
michelle <- read_csv("https://proback.github.io/264_fall_2024/Data/tweets_flotus.csv")

# Stack both accounts, label each row by author, and parse the timestamps
tweets <- bind_rows(
  barack %>% mutate(person = "Barack"),
  michelle %>% mutate(person = "Michelle")
) %>%
  mutate(timestamp = ymd_hms(timestamp))
Mini-Project 4: Text Analysis
Overview
You will find a data set containing string data. This could be newspaper articles, tweets, songs, plays, movie reviews, or anything else you can imagine. Then you will answer questions of interest and tell a story about your data using skills you have developed in strings, regular expressions, and text analysis.
Your story must contain the following elements:
- at least 3 different str_ functions
- at least 3 different regular expressions
- at least 2 different text analysis applications (count words, bing sentiment, afinn sentiment, nrc sentiment, wordclouds, trajectories over sections or time, tf-idf, bigrams, correlations, networks, LDA, etc.). Note that many interesting insights can be gained by strategic and thoughtful use of regular expressions paired with simple counts and summary statistics.
- at least 3 illustrative, well-labeled plots or tables
- a description of what insights can be gained from your plots and tables. Be sure you weave a compelling and interesting story!
Be sure to highlight the elements above so that they are easy for me to spot!
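As a rough illustration of two of the text analysis applications listed above (word counts and bing sentiment), here is a minimal sketch on a tiny made-up tibble. The `docs` data and its column names are hypothetical, and the tidytext package is assumed to be installed; your own project would substitute a real text column.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data; replace with your own text column
docs <- tibble(
  id = 1:3,
  text = c(
    "I love this wonderful song",
    "This movie was a terrible, boring mess",
    "A happy ending after a sad beginning"
  )
)

# One row per word (unnest_tokens lowercases and strips punctuation by default)
words <- docs %>%
  unnest_tokens(word, text)

# Application 1: most common words after dropping stop words
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Application 2: tally positive and negative bing words per document
bing_counts <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(id, sentiment)

word_counts
bing_counts
```

Either result can be passed to ggplot() or a table function to produce one of the required well-labeled plots or tables.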
Evaluation Rubric
Available here
Timeline
Mini-Project 4 must be submitted on Moodle by 11:00 PM on Tuesday, Nov 26. Simply add a tab for Mini-Project 4 to your Quarto webpage and submit its URL, as long as your webpage also has a link to the GitHub repo containing your R code.
Topic Ideas
Obama tweets
President Barack Obama became the first US President with an official Twitter account when @POTUS went live on May 18, 2015. (Yes, there was a time before Twitter/X.) First Lady Michelle Obama joined Twitter much earlier, though her first tweet was not from @FLOTUS. All of the tweets from @POTUS and @FLOTUS are now archived on Twitter as @POTUS44 and @FLOTUS44, and they are available as a csv download from the National Archives. You can read more here.
Potential things to investigate:
- use of specific terms
- use of @, #, RT (retweet), or -mo (personal tweet from Michelle Obama)
- timestamp for date and time trends
- sentiment analysis
- anything else that seems interesting!
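The @, #, RT, and -mo patterns listed above could be flagged with stringr along the following lines. This is only a sketch: the toy tweets, the handles and hashtags in them, and the `text` column name are all invented for illustration, so check the actual column names in the downloaded csv before adapting it.

```r
library(dplyr)
library(stringr)

# Hypothetical example tweets (not taken from the real archive)
toy_tweets <- tibble(
  person = c("Barack", "Michelle", "Michelle"),
  text = c(
    "RT @ExampleHandle: Watch live at 8pm #ExampleTag",
    "So proud of our students today! #AnotherTag -mo",
    "Thanks to @SomeGroup for supporting military families"
  )
)

tweet_features <- toy_tweets %>%
  mutate(
    is_retweet = str_detect(text, "^RT\\b"),      # tweet starts with RT
    signed_mo  = str_detect(text, "-mo\\s*$"),    # tweet ends with -mo
    n_mentions = str_count(text, "@\\w+"),        # number of @handles
    hashtags   = str_extract_all(text, "#\\w+")   # list-column of hashtags
  )

tweet_features
```

Grouping `tweet_features` by `person` and summarizing the logical columns with mean() would give the proportion of retweets or signed tweets for each account.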
Dear Abby advice column
Read in the “Dear Abby” data underlying The Pudding’s 30 Years of American Anxieties article.
posts <- read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")
Take a couple of minutes to scroll through the 30 Years of American Anxieties article to get ideas for themes that you might want to search for and illustrate using regular expressions.
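One way to search for a theme is to write a regular expression for it and track how often it matches over time. The sketch below uses a small invented stand-in for the data; the column names `year` and `question_only` are assumptions, so run names(posts) on the real data first, and the letters themselves are made up.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the Dear Abby data; column names are assumed
toy_posts <- tibble(
  year = c(1990, 1990, 2005, 2005),
  question_only = c(
    "my husband never helps with the housework",
    "my mother-in-law criticizes my cooking",
    "my boyfriend spends all our money",
    "how do i tell my wife i lost my job"
  )
)

# Theme regex: spouse-related words, matched on word boundaries
spouse_pattern <- "\\b(husband|wife|spouse)\\b"

# Share of letters per year that mention the theme
theme_by_year <- toy_posts %>%
  mutate(mentions_spouse = str_detect(question_only, spouse_pattern)) %>%
  group_by(year) %>%
  summarize(
    n_letters = n(),
    prop_spouse = mean(mentions_spouse)
  )

theme_by_year
```

Reporting a proportion rather than a raw count keeps years with different numbers of letters comparable.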
Other sources for string data
- Other articles from The Pudding
- NY Times headlines from the RTextTools package (see below)
- further analysis with the bigspotify data from class
- Tidy Tuesday
- Kaggle
- Data Is Plural
- the options are endless – be resourceful and creative!
library(RTextTools) # may have to install first
data(NYTimes)
as_tibble(NYTimes)
# A tibble: 3,104 × 5
   Article_ID Date      Title                                 Subject Topic.Code
        <int> <fct>     <fct>                                 <fct>        <int>
 1      41246 1-Jan-96  Nation's Smaller Jails Struggle To C… Jails …         12
 2      41257 2-Jan-96  FEDERAL IMPASSE SADDLING STATES WITH… Federa…         20
 3      41268 3-Jan-96  Long, Costly Prelude Does Little To … Conten…         20
 4      41279 4-Jan-96  Top Leader of the Bosnian Serbs Now … Bosnia…         19
 5      41290 5-Jan-96  BATTLE OVER THE BUDGET: THE OVERVIEW… Battle…          1
 6      41302 7-Jan-96  South African Democracy Stumbles on … politi…         19
 7      41314 8-Jan-96  Among Economists, Little Fear on Def… econom…          1
 8      41333 10-Jan-96 BATTLE OVER THE BUDGET: THE OVERVIEW… budget…          1
 9      41344 11-Jan-96 High Court Is Cool To Census Change   census…         20
10      41355 12-Jan-96 TURMOIL AT BARNEYS: THE DIFFICULTIES… barney…         15
# ℹ 3,094 more rows
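The printed tibble shows `Date` and `Title` stored as factors, so a likely first step is converting them before applying string functions. The sketch below uses three rows copied from the output above (with the truncated titles left truncated) rather than the full data set. It assumes an English locale for the month abbreviations and relies on lubridate reading two-digit years such as 96 as 1996.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Three rows copied from the printed output above (titles are truncated there)
nyt_sample <- tibble(
  Date  = factor(c("1-Jan-96", "2-Jan-96", "11-Jan-96")),
  Title = factor(c(
    "Nation's Smaller Jails Struggle To C",
    "FEDERAL IMPASSE SADDLING STATES WITH",
    "High Court Is Cool To Census Change"
  ))
)

nyt_clean <- nyt_sample %>%
  mutate(
    Date     = dmy(as.character(Date)),         # factor -> character -> Date
    Title    = as.character(Title),             # str_ functions expect character
    all_caps = str_detect(Title, "^[^a-z]+$"),  # TRUE if no lowercase letters
    n_words  = str_count(Title, "\\S+")         # whitespace-separated words
  )

nyt_clean
```

With the full `NYTimes` data, the same mutate() step could be applied directly to as_tibble(NYTimes).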