This project focuses on classifying tweets (source: Twitter) from the 2016 candidates for President of the United States, Hillary Clinton (Democrat) and Donald Trump (Republican), into two (and sometimes three) classes: positive or negative (neutral is the optional third choice). This process is known as Sentiment Analysis, and researchers using it have reported high accuracies within the last couple of years.
Natural language processing (NLP) is done here in the Python programming language, where we focus on leveraging the NLTK package. Python's simple syntax and powerful modules let us bring to you, the user, informative data visualizations of the tweets by candidate.
The two algorithms used for this project, Naive Bayes and Linear Support Vector Machines, are both very commonly used for classification problems in which observations are classified into 2 or more categories. Given the type of problem we are analyzing here, along with the valuable NLTK package in Python, these two algorithms are a natural choice.
We encourage you to try replicating this project and make your own contributions! You can fork this project on GitHub (link at the bottom of the page).
A translation to R for this project is in the works.
First, we want to load the appropriate modules into our Python environment.
For this we use the import statement, followed by the module name and an optional alias (introduced with as).
Make sure to have these modules installed in your global environment first. For this, you can use sudo pip install MODULE_NAME in your console; replace MODULE_NAME with the package you would like to install.
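Before running the script, it can help to see which modules are still missing. The snippet below is a minimal sketch of such a check using only the standard library; the helper name missing_modules and the list required are our own, and the entries are import names (scikit-learn, for example, is imported as sklearn).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the third-party packages used in this project
required = ["pandas", "numpy", "matplotlib", "nltk", "sklearn", "twython"]

for name in missing_modules(required):
    print("Missing module:", name)
```

Anything the loop prints still needs to be installed with pip.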
    # START A script.py FILE
    # IMPORT MODULES (and pip install the ones you don't have)
    import pandas as pd
    import csv
    import sys
    import random
    import numpy as np
    import time
    import twython
    from time import strftime
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    import nltk
    from nltk import corpus
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    nltk.download('punkt')
    nltk.download('stopwords')
    from nltk.twitter import Twitter
    from sklearn import datasets
    from sklearn.naive_bayes import GaussianNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import SGDClassifier
    from sklearn import metrics
    # GridSearchCV lives in sklearn.model_selection in current scikit-learn
    # (the older sklearn.grid_search module has been removed)
    from sklearn.model_selection import GridSearchCV
More than just surface-level knowledge of the modules is encouraged; you can find the official documentation for each Python module by searching for python WRITE_MODULE_HERE.
For now, we have provided a labeled data set with 400 tweets dating from July 26th to August 21st (Hillary’s 200 tweets are from August 10th to August 21st, and Trump’s 200 are from July 26th to August 10th), as well as researcher-labeled corpora of Amazon, IMDB and Yelp! customer reviews that will be used separately and compared for effectiveness.
This data was scraped through this process here.
If you would like to scrape your own tweets and do your own analysis, please go to our Github Page here and follow these instructions in the “Scraping Twitter” Section.
IF YOU ARE NOT PLANNING TO SCRAPE YOUR OWN DATA, GO TO NEXT SECTION
Using the SQL software described in the “Scraping Twitter” section on our Github page, you can look through the raw output of the Twitter Scraper (which has a lot of extraneous information) and select what specifically you would like to look at.
For our project, we selected the following columns:
rowid, tweet_id, created_at, query, content, possibly_sensitive.
The last column was chosen arbitrarily to hold the labels, and its values were changed manually.
Here we load the data using the csv.reader() function.
    # How to read in files
    # Change these paths if necessary
    file = "tweets.csv"
    randomized_file = "randomized_tweets.csv"
    training_data = "training_data.csv"
    testing_data = "testing_data.csv"
    unlabeled_data = "unlabeled.csv"
    predicted_data_NB = "predicted_nb.csv"
    predicted_data_LSVM = "predicted_lsvm.csv"

    # In Python 3 the csv module expects text-mode files opened with newline=""
    training_file = csv.writer(open(training_data, "w", newline=""))
    testing_file = csv.writer(open(testing_data, "w", newline=""))
    unlabeled_file = csv.writer(open(unlabeled_data, "w", newline=""))

    # Now to randomize the data: pair every line with a random number,
    # sort by that number, and write the lines back out in the new order.
    # Adapted from Stack Overflow:
    # http://stackoverflow.com/questions/4618298/randomly-mix-lines-of-3-million-line-file
    with open(file, "rb") as source:
        data = [(random.random(), line) for line in source]
    data.sort()
    with open(randomized_file, "wb") as target:
        for _, line in data:
            target.write(line)

    prepped_tweet_file = csv.reader(open(randomized_file, "r", newline=""))
Here we use the random module in Python to attach a random number to every line of the file and then sort the lines by that number; this shuffles the data so that a random train/test split can be taken from it.
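The same idea can be written more directly with random.shuffle. This is only a sketch: the function name shuffle_and_split, the 80/20 ratio, and the toy rows are assumptions made for illustration, not values taken from the project.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Toy stand-in for the 400 labeled tweets
rows = [("tweet %d" % i, "positive" if i % 2 else "negative") for i in range(400)]
train_rows, test_rows = shuffle_and_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed is a design choice: it keeps the split stable while you compare algorithms, and you can change it to check that results do not depend on one lucky shuffle.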
Random sampling is desired (not just here, but in all experiments and statistical problems) because it minimizes bias by giving all inputs an equal chance to be chosen and used in an experiment.
But the most important reason?
"The mathematical theorems which justify most frequentist statistical procedures apply only to random samples." (source: here)
Performing unsupervised learning would normally be a critical step in the exploratory analysis phase. This phase highlights the relationships between explanatory features and helps determine which features are the most significant.
However, many common unsupervised methods expect continuous numerical explanatory features, and since this is not the case here (the features are tokenized words and are therefore categorical) we skip this step. Instead, I have graphed the tweets by sentiment throughout the week (positive, negative, and neutral for Monday through Sunday) and here is the output:
[Figure: Sentiment of Tweets by Both Presidents]
[Figure: Sentiment of Tweets by Hillary Clinton]
[Figure: Sentiment of Tweets by Donald Trump]
From these graphs we notice a couple of things:
It turns out the notion that the media only talks about negativity continues here. The number of positive tweets on any given day has a max of about 26, averages between 15 and 20 a day, and is overshadowed by either neutral or negative tweets every single day. While neutral tweets are preferable to negative ones, it would do some good for the media and influential people to spread more positivity every day :)
If one were to take a look at the tweets.csv file and note the time stamps of each candidate's first and last tweets, this would reveal that Clinton actually tweets more. It took Clinton only 11 days to get to 200 tweets, while for Trump this took 14 days. Keep in mind that this is a small sample, that the number of tweets could spike around election time, and that tweets could get deleted along the way. But taking this sample alone, Clinton does in fact tweet more than Trump.
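To put numbers on that comparison, the day counts quoted above can be turned into tweets per day. The variable names here are our own; the figures (200 tweets, 11 and 14 days) are the ones stated in the text.

```python
tweets_per_candidate = 200
# Day counts as stated in the text above
days_to_reach = {"Clinton": 11, "Trump": 14}

rates = {name: tweets_per_candidate / days for name, days in days_to_reach.items()}
for name, rate in rates.items():
    print("%s: %.1f tweets per day" % (name, rate))
```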
There are two algorithms I chose for this project due to their past effectiveness in classification and sentiment analysis: Naive Bayes and Linear Support Vector Machines.
The Naive Bayes algorithm is based on the widely used Bayes formula (go figure), which gives the conditional probability of one event given another. Let us declare two events c and x. The probability of c given x can be summarized as so:
$$ P(c|x) = \frac{P(x|c) \, P(c)}{P(x)} $$
In the formula above, $$ P(c|x) $$ is the posterior probability of the class (c, target) given the predictor (x, attributes). $$ P(c) $$ is the prior probability of the class. $$ P(x|c) $$ is the likelihood, which is the probability of the predictor given the class. $$ P(x) $$ is the prior probability of the predictor.
Thus this is a powerful algorithm when all of the possible values in the state space of x are in the data set you are working with; since there are a lot of repeated words in the data set, this is an ideal algorithm to use.
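A small worked example may make the Bayes formula concrete. Everything below is a sketch with made-up probabilities (they are not measured from the tweet data), and the function name posterior is our own.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' formula: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


# Made-up numbers for illustration only
p_positive = 0.4           # P(c): share of tweets that are positive
p_great_given_pos = 0.25   # P(x|c): share of positive tweets containing "great"
p_great_given_neg = 0.05   # share of non-positive tweets containing "great"

# P(x): overall share of tweets containing "great" (law of total probability)
p_great = p_great_given_pos * p_positive + p_great_given_neg * (1 - p_positive)

p_pos_given_great = posterior(p_positive, p_great_given_pos, p_great)
print("P(positive | 'great') = %.3f" % p_pos_given_great)
```

With these numbers, seeing the word "great" raises the probability of the positive class from the prior of 0.4 to roughly 0.77.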
What makes the Naive Bayes algorithm unique is that it assumes the predictor variables are independent of one another given the class; hence it "naively" assumes independence.
In other words, while the predictor variables may in reality depend on one another or upon the existence of other features, the model treats each one as contributing individually to the probability of the class.
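Written out for a tweet made of the words $$ x_1, \dots, x_n $$ (this indexing is our own notation), the independence assumption turns the posterior into a product of per-word likelihoods:

$$ P(c|x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i|c) $$

The classifier then picks the class c with the largest right-hand side; $$ P(x) $$ can be dropped because it is the same for every class.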
The main benefit of a model like this is that it scales to large data sets, and despite its simplicity it can hold its own against, and sometimes outperform, more sophisticated classification algorithms.
Other benefits include robustness when training on multi-class prediction and on many different predictor variables (including categorical ones).
One of its main pitfalls is that it struggles with data it has never encountered (or has not been trained on).
Let's say that we encounter a word A that never appeared with a given class in the training data. The estimated likelihood of A given that class is then zero, and because the likelihoods of all the words are multiplied together, the algorithm assigns that class a probability of zero no matter what the other words say. This is often called the zero-frequency problem, and smoothing techniques such as Laplace (add-one) smoothing are the usual remedy.
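The effect of an unseen word, and of additive smoothing as a common remedy, can be sketched in a few lines. The word counts, the vocabulary size, and the helper word_likelihood are made up for illustration; scikit-learn's MultinomialNB applies this kind of smoothing through its alpha parameter, which defaults to 1.0.

```python
from collections import Counter


def word_likelihood(word, class_counts, vocabulary_size, alpha=0.0):
    """Estimate P(word | class) with additive smoothing (alpha=0 means none)."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocabulary_size)


# Toy word counts for the "positive" class; "despair" never appears in it
positive_counts = Counter({"great": 3, "thank": 2, "wonderful": 1})
vocabulary_size = 4  # great, thank, wonderful, despair

unsmoothed = word_likelihood("despair", positive_counts, vocabulary_size)
smoothed = word_likelihood("despair", positive_counts, vocabulary_size, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing the unseen word gets a likelihood of zero, which would wipe out the whole product; with alpha=1.0 it gets a small but non-zero share.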
Other drawbacks include being a poor predictor of unlabeled data and the not so solid assumption that all of the predictor variables are independent (this is nearly impossible in real life).
If you would like to know more about the Naive Bayes algorithm within the context of Python, here is the scikit-learn documentation for your convenience.
The information for this section, as well as an excellent resource for Naive Bayes classification can be found here.
Linear Support Vector Machines are also an excellent choice for classification, as these models choose to represent the data points as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
The basis of this algorithm is the concept of a decision plane, an abstract boundary between groups of different class labels. The best way to imagine this would be the following image:
Since this is a classification problem, it really is just about putting different inputs into their respective classes (and trying to predict unlabeled inputs as well).
New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
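In code, "which side of the gap" comes down to the sign of a weighted sum. The sketch below uses a hypothetical, hand-picked boundary in a two-feature space; a trained SVM would learn the weights and bias from the data instead.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign says which side of the boundary x is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def predict_side(weights, bias, point):
    """Map the sign of the score to one of two class labels."""
    return "positive" if decision_value(weights, bias, point) >= 0 else "negative"


# Hypothetical boundary x1 + x2 - 3 = 0
weights, bias = (1.0, 1.0), -3.0
print(predict_side(weights, bias, (3.0, 2.0)))
print(predict_side(weights, bias, (0.5, 1.0)))
```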
However, data won't always be as pretty and nicely divided as the example in the image above, which only required a simple straight line to separate the classes. Most classification tasks require a more complex structure in order to make a successful separation and predict new instances of data based on the data available.
Most data require a classification structure like the one below:
A straight line would not be an ideal choice for this; lucky for us, Support Vector Machines are well suited to handle this type of task (they can perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces through this method here).
Our tweets are already represented in a high-dimensional space (one feature per word), which is why a Linear Support Vector Machine can handle this classification problem. The general gist of how it works is below:
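The mapping idea can be sketched with a toy example. The points, the map x -> (x, x**2), and the boundary are all hand-picked assumptions; a kernel SVM performs this kind of mapping implicitly through a kernel function rather than explicitly as done here.

```python
def feature_map(x):
    """Map a 1-D point into 2-D: x -> (x, x**2)."""
    return (x, x * x)


def linear_rule(weights, bias, point):
    """True when the point lies on the non-negative side of the linear boundary."""
    return sum(w * v for w, v in zip(weights, point)) + bias >= 0


def separable_by_threshold(group_a, group_b):
    """True when a single cut on the number line puts the groups on opposite sides."""
    return max(group_a) < min(group_b) or max(group_b) < min(group_a)


inner = [-1.0, 0.0, 1.0]   # one class sits in the middle of the line
outer = [-3.0, 3.0]        # the other class sits on both ends

# Hypothetical boundary in the mapped space: x**2 - 4 = 0
weights, bias = (0.0, 1.0), -4.0
inner_side = [linear_rule(weights, bias, feature_map(x)) for x in inner]
outer_side = [linear_rule(weights, bias, feature_map(x)) for x in outer]
print(separable_by_threshold(inner, outer), inner_side, outer_side)
```

No single cut separates the two groups on the original line, but after the mapping one straight boundary does.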
A side note is that when data is not labeled, this method is not possible because this is a supervised learning problem. Supervised learning algorithms require knowledge of the true class labels to "train" on in order to predict new instances of data ("test" data).
This means an unsupervised method (no knowledge of the true class labels) is needed, which will attempt to "cluster" the data into groups or labels and then use these clusters to predict unlabeled data.
This clustering algorithm, which builds on Support Vector Machines, is called Support Vector Clustering and can come in handy when little to none of your data is labeled (or if there are no known labels to begin with).
If you would like more information on how exactly the algorithm works within the context of Python, here is the scikit-learn documentation for your convenience.
An excellent explanation (and the source of the images and information here) of Linear Support Vector Machines in theory can be found here and here.
Now for the fun part! Let's see these bad boys in action
    # Read in csvs
    # And divide data into appropriate sections
    # Train on train data, test with test data
    train = pd.read_csv(training_data, names=train_columns)
    test = pd.read_csv(testing_data, names=test_columns)

    x_train = np.array((train['row_id'], train['tweet_id'], train['month'],
                        train['day'], train['hour'], train['president'],
                        train['tweet']))
    y_train = np.array(train['label'])
    x_test = np.array((test['row_id'], test['tweet_id'], test['month'],
                       test['day'], test['hour'], test['president'],
                       test['tweet']))
    y_test = np.array(test['label'])

    # The tweet text alone; CountVectorizer in the pipelines below expects raw text
    train_words = np.array(train['tweet'])
    test_words = np.array(test['tweet'])

    # Show unique labels and how often each one occurs
    unique, counts = np.unique(y_train, return_counts=True)
    print(np.asarray((unique, counts)).T)


    def naive_bayes(x_train, y_train, x_test, y_test):
        """
        Building a Pipeline; this does all of the work in the
        extract_and_train() function. Can be found on Github.
        """
        text_clf = Pipeline([('vect', CountVectorizer()),
                             ('tfidf', TfidfTransformer()),
                             ('clf', MultinomialNB()),
                             ])
        text_clf = text_clf.fit(x_train, y_train)

        # Evaluate performance on test set
        predicted = text_clf.predict(x_test)
        print("The accuracy of a Naive Bayes algorithm is: ")
        print(np.mean(predicted == y_test))
        print("Number of mislabeled points out of a total %d points for the Naive Bayes algorithm : %d"
              % (x_test.shape[0], (y_test != predicted).sum()))

        # Tune parameters and predict unlabeled tweets
        parameter_tuning(text_clf, x_train, y_train)
        predict_unlabeled_tweets(text_clf, predicted_data_NB)


    def linear_svm(x_train, y_train, x_test, y_test):
        """
        Let's try a Linear Support Vector Machine (SVM)
        """
        # max_iter=5 with tol=None runs exactly 5 passes over the data
        # (older scikit-learn releases called this parameter n_iter)
        text_clf_two = Pipeline([('vect', CountVectorizer()),
                                 ('tfidf', TfidfTransformer()),
                                 ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                                       alpha=1e-3, max_iter=5,
                                                       tol=None, random_state=42)),
                                 ])
        text_clf_two = text_clf_two.fit(x_train, y_train)

        predicted_two = text_clf_two.predict(x_test)
        print("The accuracy of a Linear SVM is: ")
        print(np.mean(predicted_two == y_test))
        print("Number of mislabeled points out of a total %d points for the Linear SVM algorithm: %d"
              % (x_test.shape[0], (y_test != predicted_two).sum()))

        # Tune parameters and predict unlabeled tweets
        parameter_tuning(text_clf_two, x_train, y_train)
        predict_unlabeled_tweets(text_clf_two, predicted_data_LSVM)
This will output the accuracy of the algorithms, as well as the number of incorrectly labeled points.
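The two numbers being reported can be reproduced without any libraries. This is a sketch with made-up labels, and accuracy_report is a hypothetical helper rather than a function from the project.

```python
def accuracy_report(y_true, y_pred):
    """Return (accuracy, number of mislabeled points) for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mislabeled = sum(1 for truth, guess in zip(y_true, y_pred) if truth != guess)
    return 1 - mislabeled / len(y_true), mislabeled


# Made-up labels for illustration
y_true = ["positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "neutral"]

accuracy, mislabeled = accuracy_report(y_true, y_pred)
print("Accuracy: %.2f, mislabeled: %d of %d" % (accuracy, mislabeled, len(y_true)))
```

Accuracy is simply the share of test tweets whose predicted label matches the true label, so the two numbers always describe the same thing from opposite directions.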
If you have cloned the repo from Github, you can simply run the following commands to parse the initial data file, perform exploratory analysis, and train our two algorithms (as well as predict unlabeled tweets, which we will look into in the next section):
    python tweetAnalysis.py parse
    python tweetAnalysis.py train
Or if you are familiar with a Makefile you can simply run make and the same commands will run.
Let's see how accurate our new algorithms will be with a test set of 100 unlabeled tweets (50 from each candidate).
    # Take a fitted classifier and an output path as input
    # Write the predicted label for every unlabeled tweet to the output file
    def predict_unlabeled_tweets(classifier, output):
        # Make predictions
        unlabeled_tweets = pd.read_csv(unlabeled_data, names=unlabeled_columns)
        unlabeled_words = np.array(unlabeled_tweets["tweet"])
        predictions = classifier.predict(unlabeled_words)
        print(predictions)

        # Create a new file for the predictions and use the csv module
        # to iterate through the unlabeled rows
        # Adapted from Stack Overflow:
        # http://stackoverflow.com/questions/23682236/add-a-new-column-to-an-existing-csv-file
        with open(unlabeled_data, "r", newline="") as source, \
                open(output, "w", newline="") as target:
            predicted_tweets = csv.writer(target)
            # Get the president and tweet from each row
            # and add the prediction to the end
            for index, row in enumerate(csv.reader(source)):
                (row_id, tweet_id, month, day, hour, president, tweet, label) = row
                predicted_tweets.writerow([president, tweet, predictions[index]])
This will already be done for you by running python tweetAnalysis.py train. Now let's take a look at our new csv files with each algorithm's predictions:
row_id | president | tweet | label |
---|---|---|---|
1 | realDonaldTrump | "['great', 'new', 'poll', 'thank', 'makeamericagreatagain', 'https', 'co', 'mxovx0tlpc']" | positive |
2 | realDonaldTrump | "['bernie', 'exhausted', 'wants', 'shut', 'go', 'home', 'bed']" | negative |
3 | HillaryClinton | "['people', 'taking', 'care', 'children', 'parents', 'deserve', 'good', 'wage', 'good', 'benefits', 'secure', 'retirement']" | negative |
4 | realDonaldTrump | "['rt', 'piersmorgan', 'trump', 'makes', 'funny', 'obvious', 'joke', 'russia', 'going', 'hillary', 'emails', 'amp', 'u', 'media', 'goes', 'insane', 'fury', 'plays', 't\xe2']" | negative |
5 | realDonaldTrump | "['crooked', 'hillary', 'clinton', 'wants', 'flood', 'country', 'syrian', 'immigrants', 'know', 'little', 'nothing', 'danger', 'massive']" | negative |
6 | realDonaldTrump | "['country', 'feel', 'great', 'already', 'millions', 'wonderful', 'people', 'living', 'poverty', 'violence', 'despair']" | negative |
7 | realDonaldTrump | "['one', 'worse', 'judgement', 'hillary', 'clinton', 'corruption', 'devastation', 'follows', 'wherever', 'goes']" | negative |
8 | realDonaldTrump | "['great', 'back', 'iowa', 'tbt', 'jerryjrfalwell', 'joining', 'davenport', 'past', 'winter', 'maga', 'https', 'co', 'a5if0qhnic']" | positive |
9 | HillaryClinton | "['multi', 'millionaires', 'able', 'pay', 'lower', 'tax', 'rate', 'secretaries', 'https', 'co', 'xfx93s1vgh']" | negative |
10 | realDonaldTrump | "['crookedhillary', 'https', 'co', 'lwi9gqdehe']" | negative |
From the above output, it seems like our Naive Bayes method did pretty well for itself. While it was certainly surprising that Clinton's tweet about parents deserving a good wage, benefits and a secure retirement drew the negative label, every other tweet seemed to be correctly labeled (even if only on the basis of the individual words in the tweet).
Now let's switch over to the Linear Support Vector Machine and compare its predictions for the same tweets.
row_id | president | tweet | label |
---|---|---|---|
1 | realDonaldTrump | "['great', 'new', 'poll', 'thank', 'makeamericagreatagain', 'https', 'co', 'mxovx0tlpc']" | positive |
2 | realDonaldTrump | "['bernie', 'exhausted', 'wants', 'shut', 'go', 'home', 'bed']" | negative |
3 | HillaryClinton | "['people', 'taking', 'care', 'children', 'parents', 'deserve', 'good', 'wage', 'good', 'benefits', 'secure', 'retirement']" | negative |
4 | realDonaldTrump | "['rt', 'piersmorgan', 'trump', 'makes', 'funny', 'obvious', 'joke', 'russia', 'going', 'hillary', 'emails', 'amp', 'u', 'media', 'goes', 'insane', 'fury', 'plays', 't\xe2']" | negative |
5 | realDonaldTrump | "['crooked', 'hillary', 'clinton', 'wants', 'flood', 'country', 'syrian', 'immigrants', 'know', 'little', 'nothing', 'danger', 'massive']" | negative |
6 | realDonaldTrump | "['country', 'feel', 'great', 'already', 'millions', 'wonderful', 'people', 'living', 'poverty', 'violence', 'despair']" | negative |
7 | realDonaldTrump | "['one', 'worse', 'judgement', 'hillary', 'clinton', 'corruption', 'devastation', 'follows', 'wherever', 'goes']" | negative |
8 | realDonaldTrump | "['great', 'back', 'iowa', 'tbt', 'jerryjrfalwell', 'joining', 'davenport', 'past', 'winter', 'maga', 'https', 'co', 'a5if0qhnic']" | positive |
9 | HillaryClinton | "['multi', 'millionaires', 'able', 'pay', 'lower', 'tax', 'rate', 'secretaries', 'https', 'co', 'xfx93s1vgh']" | neutral |
10 | realDonaldTrump | "['crookedhillary', 'https', 'co', 'lwi9gqdehe']" | negative |
We can see the predictions are almost exactly the same (apart from a neutral label from the Linear SVM for the 9th tweet), and again Clinton's tweet about parents gets the negative label for some reason.
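The comparison between the two tables can also be checked programmatically. The labels below are copied from the ten rows shown in the two tables; the variable names are our own.

```python
# Labels copied from the two tables above (rows 1 to 10)
naive_bayes_labels = ["positive", "negative", "negative", "negative", "negative",
                      "negative", "negative", "positive", "negative", "negative"]
linear_svm_labels = ["positive", "negative", "negative", "negative", "negative",
                     "negative", "negative", "positive", "neutral", "negative"]

# Row numbers (starting at 1) where the two classifiers disagree
differing_rows = [row for row, (nb, svm) in
                  enumerate(zip(naive_bayes_labels, linear_svm_labels), start=1)
                  if nb != svm]
agreement = 1 - len(differing_rows) / len(naive_bayes_labels)
print("Rows that differ:", differing_rows)
print("Agreement: %.0f%%" % (agreement * 100))
```

The same comparison would work on the full predicted_nb.csv and predicted_lsvm.csv files once the label columns are read into two lists.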
But hey, computers don't have emotions and would never understand, so the fact that we can tokenize strings, classify them and have them learn through this is rather mindblowing.
Have we found a way for machines to learn emotions?
Have we made machines just a bit smarter?
Are we bringing Judgment Day upon us?
Will the Terminator make an appearance?
Well the last one would be cool, but this is just one step towards proper Sentiment Analysis. There were more things we could have done to improve the accuracy and quality of the algorithms, such as:
Nonetheless, this was an interesting and eye-opening little project; if you get your own Twitter dev account you would be able to scrape tweets yourself, and do your own little classification experiment. And of course feel free to borrow this code from the Github link below; no reason to do all the work yourself!
If you Google "Sentiment Analysis" you will find amazing algorithms that report very high accuracy when classifying sentences. These were built by some of the world's leading minds, and are a reflection of how far Artificial Intelligence has come (i.e. machines "learning" behaviors and new ways of thinking, hence "machine learning").
It is my desire for this project not to be a one-time, resume-filling endeavor, but to be forever a work-in-progress. I intend to pursue all of the steps above and more to get professional-quality, working algorithms. If you would like to contribute and help speed up this process, please clone and contribute to the Github repo below.
Congratulations on getting this far! We hope you enjoyed this project. Please reach out to us here if you have any feedback or would like to publish your own project.