This project classifies tweets (source: Twitter) from the two major-party candidates in the 2016 United States presidential election, Hillary Clinton (Democrat) and Donald Trump (Republican), into two classes, positive or negative, with neutral as an optional third. This process is known as sentiment analysis, and researchers using it have reported high accuracies in recent years.
We use natural language processing (NLP) in the Python programming language, relying mainly on the NLTK package. Python's simple syntax and powerful modules let us bring you, the user, informative data visualizations of each candidate's tweets.
In this part I will be reading in the data, tokenizing the tweets, and visualizing them. In the next part, I will introduce the algorithms used to classify these tweets as "Positive", "Negative" or "Neutral".
We encourage you to try replicating this project and make your own contributions! You can fork this project on GitHub (link at the bottom of the page).
A translation to R for this project is in the works.
# Load Modules
First, we want to load the appropriate modules into our **Python** environment.
For this we use the `import` statement, followed by the module name.
Make sure to have these modules installed in your local environment first.
For this, you can use `sudo pip3 install MODULE_NAME` in your console, replacing `MODULE_NAME` with the package you would like to install.
```python
# START A script.py FILE
# Or clone the project off GitHub and use it for reference
# INSTALL any modules not already in your local environment
```
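As a rough sketch of what an import block for this part might look like, the snippet below loads only the modules this page names: `csv` and `random` from the standard library, plus NLTK. The exact list in `tweetAnalysis.py` may differ, so treat the `REQUIRED` list as an assumption.

```python
import csv              # reading and writing the tweet files
import random           # shuffling rows before the train/test split
import importlib.util   # used only to check whether a package is installed

# Third-party packages this sketch assumes; the real project may need others
REQUIRED = ["nltk"]

# Collect the packages that cannot be found in the current environment
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Install first: pip3 install " + " ".join(missing))
else:
    import nltk
    print("All modules loaded")
```

Checking with `find_spec` before importing gives a readable hint instead of an `ImportError` when a package has not been installed yet.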
Any modules imported in the main `tweetAnalysis.py` file that are not listed above should already be included with your local environment.
More than just surface-level knowledge of the modules is encouraged; you can read the documentation for any installed **Python** module by typing `python3 -m pydoc MODULE_NAME` in your console.
# Get Data
## Collecting Data
For now, we have provided a labeled data set of 400 tweets dating from July 26th to August 21st (**Clinton's** 200 tweets are from August 10th to August 21st, and **Trump's** 200 are from July 26th to August 10th), as well as researcher-labeled corpora of **Amazon**, **IMDB** and **Yelp!** customer reviews that will be used separately and compared for effectiveness.
This data was scraped using the process described [here](http://www.slideshare.net/cosmopolitanvan/five-steps-to-get-tweets-sent-by-a-list-of-users).
If you would like to scrape your own tweets and do your own analysis, please go to our GitHub page and follow the instructions in the [“Scraping Twitter”](https://github.com/inertia7/sentiment_ClintonTrump_2016#scraping-twitter) section.
## Subsetting Data
**If you are not planning to scrape your own data, skip to the next section.**
Using the **SQL** software described in the [“Scraping Twitter”](https://github.com/inertia7/sentiment_ClintonTrump_2016#scraping-twitter) section of our GitHub page, you can look through the raw output of the **Twitter Scraper** (which contains a lot of extraneous information) and select the specific fields you would like to look at.
For our project, we selected the following sections:
The last section was chosen arbitrarily to serve as a column for the labels, and its values were then changed by hand.
## Loading Data
Here is what some of the tweets look like:
1,767397907473502208,2016-08-21 16:28:13.000000,HillaryClinton,"If you were president, how would you spend $4 billion? (One guess as to Trump's answer: https://t.co/9wj6zlbB7j)",positive
2,767366136824467456,2016-08-21 14:21:58.000000,HillaryClinton,There is so much more that unites us than divides us. That's why we're the greatest country on Earth. https://t.co/6uZVP0tjK9,positive
3,767354949613346816,2016-08-21 13:37:31.000000,HillaryClinton,This lifelong Republican wrote a letter to his daughter about why he's voting for Hillary this November: https://t.co/iqmnZaR7xN,positive
4,767183928344059904,2016-08-21 02:17:56.000000,HillaryClinton,"It's too late now to say sorry, Donald. https://t.co/TWj7QwEfVh",negative
5,767134737257332737,2016-08-20 23:02:28.000000,HillaryClinton,"If you dream it, you should be able to build it. https://t.co/uArcPunfFy https://t.co/qKg0MOHVBv",positive
6,767123506148630528,2016-08-20 22:17:50.000000,HillaryClinton,This choice is pretty straightforward. https://t.co/8D3Og1UM14 https://t.co/1pbwefsVLy,neutral
7,767104881253548034,2016-08-20 21:03:50.000000,HillaryClinton,Donald Trump has shown us who he is. We should believe him. https://t.co/TWj7QwEfVh,neutral
8,767077465797918723,2016-08-20 19:14:54.000000,HillaryClinton,Trump’s tax plan could give his own family a $4 billion tax break. Here's what we could do with that money instead: https://t.co/89rNgjYcGk,positive
9,767069710903115778,2016-08-20 18:44:05.000000,HillaryClinton,"RT @Deadspin: ""We don't win anymore."" https://t.co/opLEssKAxA",negative
10,767046604041912320,2016-08-20 17:12:16.000000,HillaryClinton,Let's send Donald Trump a message in November: We're not going back. https://t.co/85ovKr3y2x,positive
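Several of these rows wrap the tweet text in double quotes because the text itself contains commas, so splitting each line on commas would break them. Here is a minimal sketch of parsing two of the sample rows with Python's `csv` module. The field names in `Tweet` are assumptions inferred from the sample rows, not names taken from the project.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names, inferred from the six values in each sample row
Tweet = namedtuple("Tweet", "row_id tweet_id created_at user text label")

# Sample rows 4 and 9, copied from the listing above
SAMPLE = '''4,767183928344059904,2016-08-21 02:17:56.000000,HillaryClinton,"It's too late now to say sorry, Donald. https://t.co/TWj7QwEfVh",negative
9,767069710903115778,2016-08-20 18:44:05.000000,HillaryClinton,"RT @Deadspin: ""We don't win anymore."" https://t.co/opLEssKAxA",negative
'''

# csv.reader handles quoted commas and doubled quotes inside a field
tweets = [Tweet(*fields) for fields in csv.reader(io.StringIO(SAMPLE))]

for tweet in tweets:
    print(tweet.user, tweet.label, tweet.text)
```

Each row comes back as exactly six fields, and the doubled quotes in row 9 are restored to ordinary quotation marks in the parsed text.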
Here we load the data using the `csv.reader()` function:
```python
import csv
import random

# How to read in files
# Change if necessary
file = "tweets.csv"
randomized_file = "randomized_tweets.csv"
training_data = "training_data.csv"
testing_data = "testing_data.csv"
unlabeled_data = "unlabeled.csv"
predicted_data_NB = "predicted_nb.csv"
predicted_data_LSVM = "predicted_lsvm.csv"

# Assuming Python 3 (the install step uses pip3): csv needs text mode,
# and newline="" avoids blank rows on Windows
training_file = csv.writer(open(training_data, "w", newline=""))
testing_file = csv.writer(open(testing_data, "w", newline=""))
unlabeled_file = csv.writer(open(unlabeled_data, "w", newline=""))

# Now to randomize the data
# (http://stackoverflow.com/questions/4618298/randomly-mix-lines-of-3-million-line-file)
with open(file, "rb") as source:
    # Tag every line with a random key
    data = [(random.random(), line) for line in source]
# Sorting by the random key is what actually shuffles the lines
data.sort()
with open(randomized_file, "wb") as target:
    for _, line in data:
        target.write(line)

prepped_tweet_file = csv.reader(open(randomized_file, "r", newline=""))
```
Here we use the `random` module in Python to pair every line with a random key and then sort by that key, which shuffles the rows so that a later train/test split is effectively random.
Random sampling is desired (not just here, but in all experiments and statistical problems) because it minimizes bias by giving all inputs an equal chance to be chosen and used in an experiment.
But the most important reason?
*"The mathematical theorems which justify most frequentist statistical procedures apply only to random samples."* ([source](https://www.ma.utexas.edu/users/mks/statmistakes/RandomSampleImportance.html))
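To make the idea of a random train/test split concrete, here is a minimal sketch that shuffles a list of rows with a fixed seed and cuts it in two. The 80/20 ratio, the seed value, the function name `shuffle_and_split`, and the placeholder rows are all assumptions for illustration; they are not taken from the project.

```python
import random

def shuffle_and_split(rows, train_fraction=0.8, seed=7):
    """Return (train, test) after shuffling a copy of rows."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows standing in for the 400 labeled tweets (not real data)
rows = [("tweet %d" % i, "positive" if i % 2 else "negative") for i in range(400)]

train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

Because every row has the same chance of landing on either side of the cut, neither set is biased toward one candidate or one date range.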
# Do Exploratory Analysis
## Bar (Frequency) Charts
Unsupervised learning would normally be a critical step in the exploratory analysis phase, since it highlights the relationships between explanatory features and helps determine which features are the most significant.
However, most unsupervised learning methods expect continuous numerical features. Ours are tokenized words, and therefore categorical, so we cannot perform many of the typical analyses. Instead, I have graphed the tweets by sentiment (positive, negative, and neutral) for each day of the week, Monday through Sunday, and here is the output:
**Sentiment of Tweets by Both Candidates**
<iframe width="100%" height=415 frameborder="0" scrolling="no" src="https://plot.ly/~raviolli77/69.embed?autosize=True&width=90%&height=100%"></iframe>
**Sentiment of Tweets by Hillary Clinton**
<iframe width="100%" height=415 frameborder="0" scrolling="no" src="https://plot.ly/~raviolli77/73.embed?autosize=True&width=90%&height=100%"></iframe>
**Sentiment of Tweets by Donald Trump**
<iframe width="100%" height=415 frameborder="0" scrolling="no" src="https://plot.ly/~raviolli77/71.embed?autosize=True&width=90%&height=100%"></iframe>
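The counts behind charts like these can be tallied by grouping each tweet's label under the weekday of its timestamp. The sketch below does this for four of the sample rows; the function name `sentiment_by_weekday` is hypothetical, and the timestamps are used as given, with no time-zone conversion.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# (timestamp, label) pairs copied from sample rows 4, 6, 7 and 10
labeled = [
    ("2016-08-21 02:17:56.000000", "negative"),
    ("2016-08-20 22:17:50.000000", "neutral"),
    ("2016-08-20 21:03:50.000000", "neutral"),
    ("2016-08-20 17:12:16.000000", "positive"),
]

def sentiment_by_weekday(pairs):
    """Count tweets per (weekday, label) combination."""
    counts = Counter()
    for stamp, label in pairs:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        counts[(DAYS[when.weekday()], label)] += 1
    return counts

counts = sentiment_by_weekday(labeled)
for (day, label), n in sorted(counts.items()):
    print(day, label, n)
```

Each `(weekday, label)` count corresponds to one bar in a grouped bar chart.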
From these graphs we notice a couple of things:
+ Over the course of the week, the number of negative tweets fluctuates widely: it starts relatively high (about 37 negative tweets), drops to about 5, then peaks and declines once more before the week ends.
+ Comparing **Clinton's** and **Trump's** tweets shows a sharp contrast: **Clinton's** tweets are dominated by the neutral sentiment, while **Trump's** are full of negative ones.
The notion that the media only talks about negativity seems to hold here as well. The number of positive tweets on any given day peaks at about 26, averages between 15 and 20, and is overshadowed by either neutral or negative tweets every single day. While neutral tweets are preferable to negative ones, it would do some good for the media and influential people to spread more positivity every day :)
If you look at the `tweets.csv` file and note the time stamps of each candidate's first and last tweets, you will see that **Clinton** actually tweets more often. It took **Clinton** only 11 days to reach 200 tweets, while **Trump** took 14 days. Keep in mind that this is a small sample, that the number of tweets could spike around election time, and that tweets could be deleted along the way. But taking this sample alone, **Clinton** does in fact tweet more often than **Trump**.
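The comparison is easier to see as a rate. This short sketch divides the 200 tweets per candidate by the day counts stated above; the variable names are illustrative.

```python
TWEETS_PER_CANDIDATE = 200

# Days each candidate took to reach 200 tweets, as stated in the text
days_to_reach = {"Clinton": 11, "Trump": 14}

rates = {name: TWEETS_PER_CANDIDATE / days
         for name, days in days_to_reach.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} tweets per day")
# Clinton: 18.2 tweets per day
# Trump: 14.3 tweets per day
```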
## Word Clouds
A word cloud is a collection of words that highlights the frequency of said words through size (bigger words show up more, smaller ones show up less).
Here is the word cloud generated by Donald Trump's tweets:
Here is the word cloud generated by Hillary Clinton's tweets:
Here is the word cloud generated by both candidates:
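The word sizes in a cloud come from plain word counts. The sketch below shows one way to compute them for two of the sample tweets. It uses a simple regular expression as a stand-in for the NLTK tokenizer, and the tiny `STOPWORDS` set is illustrative only; neither is the project's actual preprocessing.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")   # links carry no sentiment, so drop them
WORD = re.compile(r"[a-z']+")       # lowercase words, keeping apostrophes
STOPWORDS = {"a", "in", "is", "on", "so", "than", "that", "the"}

def word_frequencies(tweets):
    """Count how often each non-stopword appears across the tweets."""
    counts = Counter()
    for text in tweets:
        cleaned = URL.sub(" ", text.lower())
        counts.update(w for w in WORD.findall(cleaned) if w not in STOPWORDS)
    return counts

# Sample rows 2 and 10 from the data listing
sample = [
    "There is so much more that unites us than divides us. That's why "
    "we're the greatest country on Earth. https://t.co/6uZVP0tjK9",
    "Let's send Donald Trump a message in November: We're not going back. "
    "https://t.co/85ovKr3y2x",
]

freq = word_frequencies(sample)
print(freq.most_common(5))
```

A word cloud library would then map each count to a font size, so words such as "us" that appear twice would be drawn larger than words that appear once.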
# Part 2
Please click on the link [here](https://www.inertia7.com/projects/93) to proceed to Part 2 of the project!