# SENTIMENT ANALYSIS ON TWITTER (PYTHON 3) [PART 2]

###### CLASSIFYING TWEETS OF THE 2016 U.S. PRESIDENTIAL CANDIDATES WITH PYTHON 3.5

PYTHON 3

HARD

last hacked on Aug 20, 2017

This is Part 2 of the Sentiment Analysis on Presidential Tweets project using Natural Language Processing (NLP). In this part I will introduce the algorithms used for this project: Naive Bayes, Linear Support Vector Machines, and Multi-Layer Perceptrons (Neural Networks). The first two are very commonly used for classification problems, in which observations are classified into 2 or more categories. I have added the Neural Network to see whether its effectiveness in many other areas carries over to this task. I encourage you to try replicating this project and make your own contributions! You can fork this project on GitHub (link at the bottom of the page). A translation to R for this project is in the works.

# Part 1

[Part 1 here](https://www.inertia7.com/projects/28) if you missed it (or just couldn't wait to see Part 2 ;) )

# Choosing the Model

There are two algorithms I chose for this project due to their past effectiveness in classification and sentiment analysis: **Naive Bayes** and **Linear Support Vector Machines**. I also introduced a third algorithm, a Neural Network, to test its highly popularized effectiveness.

## Naive Bayes

The **Naive Bayes** algorithm is based on the widely used **Bayes formula** (go figure), which gives the conditional probability of one event given another. Let us declare two events *c* and *x*. The probability of *c* given *x* can be summarized as so:

<img src = "https://www.analyticsvidhya.com/wp-content/uploads/2015/09/Bayes_rule-300x172-300x172.png" width = "50%", height = "50%">

<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"> </script>

In the above picture,

+ $$P(c|x)$$ is the **posterior probability** of class (c, target) given predictor (x, attributes).
+ $$P(c)$$ is the **prior probability** of class.
+ $$P(x|c)$$ is the likelihood, which is the **probability of predictor given class**.
+ $$P(x)$$ is the **prior probability of predictor**.

Thus this is a powerful algorithm when all of the possible values in the state space of *x* (i.e. the possible values *x* could take on) are in the data set you are working with. Since the inputs are words and many are repeated, this is an ideal algorithm to use.

What makes the **Naive Bayes** algorithm unique (and powerful) is the fact that it "naively" assumes that the predictor variables are independent of each other. In other words, even though the values of the predictor variables may in reality depend on one another or on the existence of other features, the algorithm treats each one as contributing independently to the probability of the class.
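
To make the formula concrete, here is a minimal sketch that applies Bayes' rule to a single word. The word "great", the class sizes and every count below are hypothetical numbers chosen only for illustration; they are not taken from the tweet data set.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical toy counts: 100 tweets, 40 positive and 60 negative.
# The word "great" appears in 20 positive tweets and 5 negative tweets.
n_total, n_pos = 100, 40
great_in_pos, great_in_neg = 20, 5

p_pos = n_pos / n_total                            # P(c): prior of the class
p_great_given_pos = great_in_pos / n_pos           # P(x|c): likelihood
p_great = (great_in_pos + great_in_neg) / n_total  # P(x): prior of the predictor

p_pos_given_great = posterior(p_pos, p_great_given_pos, p_great)
print(round(p_pos_given_great, 2))
```

With these made-up counts, seeing "great" raises the probability of the positive class from the prior of 0.4 to a posterior of 0.8. Naive Bayes repeats this for every word in a tweet and, because of the independence assumption, simply multiplies the per-word likelihoods together.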

The main benefit of a model like this is that it scales to large data sets, and despite its simplicity it is often competitive with, and sometimes outperforms, more sophisticated classification algorithms. Other benefits include robustness when training on multi-class prediction and on many different predictor variables (including categorical ones).

One of its main pitfalls is that it struggles with data it has never encountered (i.e. has not been trained on). Let's say that we encounter an instance *A* (in our case, a word) that we have not seen in the training data. Because *A* was never counted, its estimated probability is zero, and since the probabilities for each word are multiplied together, that single zero wipes out the probability of every class for any tweet containing *A*. A way to combat this issue is to implement a pseudocount; what this entails is adding one to the count of every word when calculating probabilities. A word that has never been seen before then gets a small non-zero probability, on the order of (1/Total number of words). The other word probabilities change slightly, but the change is marginal and it avoids this problem. I have not yet implemented this but plan to do so in the future.

Other drawbacks include being a poor predictor of unlabeled data and the not-so-solid assumption that all of the predictor variables are independent (this is nearly impossible in real life).

If you would like to know more about the Naive Bayes algorithm within the context of Python, here is the scikit learn [documentation](http://scikit-learn.org/stable/modules/naive_bayes.html) for your convenience.
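
The pseudocount idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's code: the function name `word_likelihoods`, the toy tokens and the vocabulary are all hypothetical. (For reference, scikit-learn's `MultinomialNB` exposes the same kind of smoothing through its `alpha` parameter.)

```python
from collections import Counter


def word_likelihoods(class_tokens, vocabulary, alpha=1.0):
    """Estimate P(word | class) with add-alpha (pseudocount) smoothing."""
    counts = Counter(class_tokens)
    # Every vocabulary word receives `alpha` extra counts, so the denominator
    # grows by alpha * |vocabulary| to keep the probabilities summing to 1.
    denom = len(class_tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denom for word in vocabulary}


# Hypothetical toy data: tokens taken from tweets labeled "positive".
positive_tokens = ["great", "jobs", "great", "future"]
# "crooked" is in the vocabulary but never occurs in the positive tweets.
vocabulary = {"great", "jobs", "future", "crooked"}

unsmoothed = word_likelihoods(positive_tokens, vocabulary, alpha=0.0)
smoothed = word_likelihoods(positive_tokens, vocabulary, alpha=1.0)

print(unsmoothed["crooked"])  # zero: would wipe out the whole product
print(smoothed["crooked"])    # small but non-zero: 1 / (4 + 4)
```

Note that the smoothed estimates still form a valid probability distribution; the seen words give up a little probability mass so that the unseen word is no longer impossible.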

The information for this section, as well as an excellent resource for Naive Bayes classification, can be found [here.](https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/)

## Linear Support Vector Machines

**Linear Support Vector Machines** are also an excellent choice for classification, as these models represent the data points as points in a high dimensional space and divide the separate categories by a boundary that is as wide as possible. This boundary, called a **decision plane**, forms the basis of the algorithm: it is an abstract boundary between groups with different class labels. The best way to imagine this would be the following image:

<img src = "http://i.imgur.com/HhvkyGf.gif" width="100" height="100">

Thus, the observations in the training set are grouped together via proximity to each other in the high dimensional space, separated by decision boundaries. New examples are then mapped into that same space and predicted to belong to a category based on which side of the boundary they fall on.

However, data won't always be as pretty and nicely divided as the example above, which only required a simple linear division to split up. Most classification tasks require a more complex structure in order to make a successful separation and predict new instances of data, like the one below:

<img src = "http://i.imgur.com/p0kzY1U.gif" width="100" height="100">

A linear boundary would not be an ideal choice for this; lucky for us, Support Vector Machines are well suited to this type of task: they can perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces through the method described [here.](https://en.wikipedia.org/wiki/Kernel_method) In our case the tweets are already represented as high-dimensional word-count vectors, which is why a **Linear Support Vector Machine** can handle this classification problem.
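
The idea of mapping inputs into a higher-dimensional space can be illustrated with a tiny, hand-built example. Everything here is hypothetical: the 1-D points, the explicit map `feature_map` and the hand-picked (not learned) plane. A real kernel SVM never computes the map explicitly and learns the plane from data; this sketch only shows why the extra dimension helps.

```python
# Hypothetical 1-D points: class +1 sits in the middle, class -1 on both sides.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [-1, -1, 1, 1, 1, -1, -1]


def separable_by_threshold(xs, ys):
    """True if a single threshold on the raw 1-D axis splits the classes."""
    for t in xs:
        for sign in (1, -1):
            if all((sign if x >= t else -sign) == y for x, y in zip(xs, ys)):
                return True
    return False


def feature_map(x):
    """Explicit map to 2-D: (x, x^2). Kernels achieve this implicitly."""
    return (x, x * x)


def linear_decision(features, weights, bias):
    """Which side of the decision plane w . phi(x) + b = 0 a point falls on."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Hand-picked plane in the mapped space: x^2 = 2 (an assumption, not a fit).
weights, bias = (0.0, -1.0), 2.0
predictions = [linear_decision(feature_map(x), weights, bias) for x in points]

print(separable_by_threshold(points, labels))  # no linear split in 1-D
print(predictions == labels)                   # linear split works in 2-D
```

In the original axis no threshold can isolate the middle points, but after adding the squared coordinate a straight line separates the two classes.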

The general gist of how it works is below:

![alt text](http://i.imgur.com/06ecxqK.gif)

If none of your data is labeled, then this algorithm cannot work; instead an unsupervised method is needed. A clustering version of a **Support Vector Machine** is called **Support Vector Clustering** and can come in handy when little to none of your data is labeled (or if there are no known labels to begin with).

If you would like more information on how exactly the algorithm works within the context of **Python**, here is the scikit learn [documentation](http://scikit-learn.org/stable/modules/svm.html) for your convenience.

An excellent explanation (and the source of the images and information here) of **Linear Support Vector Machines** in theory can be found [here](http://www.statsoft.com/Textbook/Support-Vector-Machines) and [here.](https://en.wikipedia.org/wiki/Support_vector_machine)

## Neural Networks

Neural networks are often billed as the next big thing in Machine Learning, for reasons such as

+ Lower bias AND variance (claimed to supersede the bias-variance tradeoff)
+ Generalizes well to new data; can be tuned with cross validation
+ Covers a wide range of machine learning applications

Here is a helpful visual (link to the article [here](https://visualstudiomagazine.com/articles/2013/05/01/neural-network-feed-forward.aspx)):

![alt text](https://visualstudiomagazine.com/articles/2013/05/01/~/media/ECG/visualstudiomagazine/Images/2013/05/0513vsm_McCaffreyNeuralNet2.ashx)

For more in-depth detail, I have linked a video [here](https://www.youtube.com/watch?v=bxe2T-V8XRs) for your convenience.

We will be putting the reputation of Neural Networks to the test to see what accuracy we can get! We will be using a specific type of Neural Network called a *Multi-layer Perceptron* (will add more explanation later).
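
Until that fuller explanation is written, here is a minimal sketch of what a Multi-layer Perceptron computes on a single input: a feed-forward pass through one hidden layer. The layer sizes, the weights and the choice of ReLU and sigmoid activations are all assumptions made for illustration; this is not the network trained in this project, and no learning takes place here.

```python
import math


def relu(values):
    """Hidden-layer activation: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is w . inputs + b."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(z):
    """Squash a score into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Feed-forward pass: input -> hidden layer -> single output unit."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    (logit,) = dense(hidden, out_w, out_b)
    return sigmoid(logit)


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
hidden_w = [[1.0, -1.0], [-1.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]

p = forward([2.0, 1.0], hidden_w, hidden_b, out_w, out_b)
print(round(p, 3))
```

Training adjusts the weights and biases so that outputs like `p` match the labels; the pass itself is just repeated weighted sums followed by simple non-linear functions.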

An excellent explanation (and the source of the images and information here) of **Multi-layer Perceptrons** in theory can be found [here](http://scikit-learn.org/stable/modules/neural_networks_supervised.html).

# Building the Model

Now for the fun part! Let's see these bad boys in action:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# The file paths (training_data, testing_data, predicted_data_NB), the column
# lists and parameter_tuning() are not shown here; see the repo on Github

# Read in csvs
# And divide data into appropriate sections
# Train on train data, test with test data
train = pd.read_csv(training_data, names = train_columns)
test = pd.read_csv(testing_data, names = test_columns)
x_train = np.array((train['row_id'], train['tweet_id'], train['month'],
                    train['day'], train['hour'], train['president'], train['tweet']))
y_train = np.array(train['label'])
x_test = np.array((test['row_id'], test['tweet_id'], test['month'],
                   test['day'], test['hour'], test['president'], test['tweet']))
y_test = np.array(test['label'])
train_words = np.array(train['tweet'])
test_words = np.array(test['tweet'])

# Show unique labels and how often each one occurs
unique, counts = np.unique(y_train, return_counts=True)
print(np.asarray((unique, counts)).T)

def naive_bayes(x_train, y_train, x_test, y_test):
    """
    Building a Pipeline; this does all of the work in the extract_and_train() function
    Can be found on Github
    """
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])
    text_clf = text_clf.fit(x_train, y_train)

    # Evaluate performance on test set
    predicted = text_clf.predict(x_test)
    print("The accuracy of a Naive Bayes algorithm is: ")
    print(np.mean(predicted == y_test))
    print("Number of mislabeled points out of a total %d points for the Naive Bayes algorithm : %d"
          % (x_test.shape[0], (y_test != predicted).sum()))

    # Tune parameters and predict unlabeled tweets
    parameter_tuning(text_clf, x_train, y_train)
    predict_unlabeled_tweets(text_clf, predicted_data_NB)

def linear_svm(x_train, y_train, x_test, y_test):
    """
    Let's try a Linear Support Vector Machine (SVM)
    """
    # Only the initialization is shown to show different parameters; the rest of the
    # model evaluation is the same
    # max_iter replaces the n_iter argument of older scikit-learn releases
    text_clf_two = Pipeline([('vect', CountVectorizer()),
                             ('tfidf', TfidfTransformer()),
                             ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                                   max_iter=5, tol=None, random_state=42)),
                             ])
    # Model fitting, evaluation and prediction of unlabeled points performed here

def neural_network(x_train, y_train, x_test, y_test):
    """
    Now let's try one of the most powerful algorithms: An Artificial Neural Network (ANN)
    """
    # Only the initialization is shown to show different parameters; the rest of the
    # model evaluation is the same
    text_clf_three = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('chi2', SelectKBest(chi2, k = 'all')),
                               ('clf', MLPClassifier(hidden_layer_sizes=(100,), max_iter=10,
                                                     alpha=1e-4, solver='sgd', verbose=10,
                                                     tol=1e-4, random_state=1,
                                                     learning_rate_init=.1)),
                               ])
    # Model fitting, evaluation and prediction of unlabeled points performed here
```

This will output the accuracy of the algorithms, as well as the number of incorrectly labeled points. If you have cloned the [repo](https://github.com/inertia7/sentiment_ClintonTrump_2016) from **Github**, you can simply run the following commands to parse the initial data file, perform exploratory analysis, and train the algorithms (as well as predict unlabeled tweets, which we will look into in the next section):

```bash
python tweetAnalysis.py parse
python tweetAnalysis.py train
```

Or, if you are familiar with Makefiles, you can simply run `make` and the same commands will run.

# Predicting the Unlabeled Tweets

Let's see how accurate our new algorithms will be with a test set of 100 unlabeled tweets (50 from each candidate).

```python
import csv

# Take a fitted classifier and an output file path as input
# Write each unlabeled tweet together with its predicted label to a new csv
def predict_unlabeled_tweets(classifier, output):
    # Make predictions
    unlabeled_tweets = pd.read_csv(unlabeled_data, names = unlabeled_columns)
    unlabeled_words = np.array(unlabeled_tweets["tweet"])
    predictions = classifier.predict(unlabeled_words)
    print(predictions)

    # Create new file for predictions
    # And utilize csv module to iterate through csv
    # (Python 3: csv files are opened in text mode with newline="")
    with open(output, "w", newline="") as out_file, \
         open(unlabeled_data, "r", newline="") as in_file:
        predicted_tweets = csv.writer(out_file)
        unlabeled_rows = csv.reader(in_file)

        # Iterate through csv and get president and tweet
        # Add prediction to end
        # Adapted from Stack Overflow:
        # http://stackoverflow.com/questions/23682236/add-a-new-column-to-an-existing-csv-file
        for index, row in enumerate(unlabeled_rows):
            (row_id, tweet_id, month, day, hour, president, tweet, label) = row
            predicted_tweets.writerow([president, tweet, predictions[index]])
```

This will already be done for you by running `python tweetAnalysis.py train`. Now let's take a look at our new csv files with each of the algorithm's predictions. First the **Naive Bayes** algorithm:

## 10 Predicted tweets from Naive Bayes Algorithm

| row_id | president | tweet | label |
| ----- | --------- | --------- | ------ |
| 3 | realDonaldTrump | ['shooting', 'deaths', 'police', 'officers', '78', 'year', 'must', 'restore', 'law', 'order', 'protect', 'great', 'law', 'enforcement', 'officers'] | positive |
| 5 | realDonaldTrump | ['crooked', 'hillary', 'clinton', 'wants', 'flood', 'country', 'syrian', 'immigrants', 'know', 'little', 'nothing', 'danger', 'massive'] | negative |
| 9 | HillaryClinton | ['country', 'going', 'clean', 'energy', 'superpower', '21st', 'century', 'create', 'millions', 'jobs', 'want', 'us', 'hillary'] | negative |
| 10 | HillaryClinton | ['trump', 'wants', 'eliminate', 'estate', 'tax', 'would', 'mean', 'family', 'gets', '4', 'billion', 'tax', 'cut', '99', '8', 'americans', 'get', 'nothing'] | negative |
| 26 | realDonaldTrump | ['america', 'future'] | positive |
| 42 | realDonaldTrump | ['crookedhillary'] | negative |
| 51 | realDonaldTrump | ['order', 'try', 'deflect', 'horror', 'stupidity', 'wikileakes', 'disaster', 'dems', 'said', 'maybe', 'russia', 'dealing', 'trump', 'crazy'] | negative |
| 53 | HillaryClinton | ['trump', 'may', 'talk', 'big', 'game', 'trade', 'approach', 'based', 'fear', 'strength', 'fear', 'compete', 'rest', 'world'] | negative |
| 76 | HillaryClinton | ['starting', 'day', 'one', 'work', 'parties', 'pass', 'biggest', 'investment', 'new', 'good', 'paying', 'jobs', 'since', 'world', 'war', 'ii'] | positive |
| 89 | HillaryClinton | ['1979'] | negative |

From the above output, it seems like our Naive Bayes method did a decent job. I did my best to pick a mix of correctly labeled tweets, incorrectly labeled tweets, and tweets that were labeled correctly by luck; with more data, perhaps the algorithm would have done better. Now let's switch over to the **Linear Support Vector Machine** and compare its predictions for the same tweets.

## 10 Predicted tweets from Linear Support Vector Machine Algorithm

| row_id | president | tweet | label |
| ----- | --------- | --------- | ------ |
| 3 | realDonaldTrump | ['shooting', 'deaths', 'police', 'officers', '78', 'year', 'must', 'restore', 'law', 'order', 'protect', 'great', 'law', 'enforcement', 'officers'] | positive |
| 5 | realDonaldTrump | ['crooked', 'hillary', 'clinton', 'wants', 'flood', 'country', 'syrian', 'immigrants', 'know', 'little', 'nothing', 'danger', 'massive'] | negative |
| 9 | HillaryClinton | ['country', 'going', 'clean', 'energy', 'superpower', '21st', 'century', 'create', 'millions', 'jobs', 'want', 'us', 'hillary'] | positive |
| 10 | HillaryClinton | ['trump', 'wants', 'eliminate', 'estate', 'tax', 'would', 'mean', 'family', 'gets', '4', 'billion', 'tax', 'cut', '99', '8', 'americans', 'get', 'nothing'] | negative |
| 26 | realDonaldTrump | ['america', 'future'] | positive |
| 42 | realDonaldTrump | ['crookedhillary'] | negative |
| 51 | realDonaldTrump | ['order', 'try', 'deflect', 'horror', 'stupidity', 'wikileakes', 'disaster', 'dems', 'said', 'maybe', 'russia', 'dealing', 'trump', 'crazy'] | negative |
| 53 | HillaryClinton | ['trump', 'may', 'talk', 'big', 'game', 'trade', 'approach', 'based', 'fear', 'strength', 'fear', 'compete', 'rest', 'world'] | negative |
| 76 | HillaryClinton | ['starting', 'day', 'one', 'work', 'parties', 'pass', 'biggest', 'investment', 'new', 'good', 'paying', 'jobs', 'since', 'world', 'war', 'ii'] | positive |
| 89 | HillaryClinton | ['1979'] | negative |

Looks like the Linear Support Vector Machine did a good job as well; it agrees with Naive Bayes on every tweet except the third, which it correctly labels as *positive* rather than *negative*, making its performance slightly better than the Naive Bayes algorithm. And last but not least, we have the **Neural Network.**

## 10 Predicted tweets from the Multi-layer Perceptron (Neural Network)

| row_id | president | tweet | label |
| ----- | --------- | --------- | ------ |
| 3 | realDonaldTrump | ['shooting', 'deaths', 'police', 'officers', '78', 'year', 'must', 'restore', 'law', 'order', 'protect', 'great', 'law', 'enforcement', 'officers'] | negative |
| 5 | realDonaldTrump | ['crooked', 'hillary', 'clinton', 'wants', 'flood', 'country', 'syrian', 'immigrants', 'know', 'little', 'nothing', 'danger', 'massive'] | negative |
| 9 | HillaryClinton | ['country', 'going', 'clean', 'energy', 'superpower', '21st', 'century', 'create', 'millions', 'jobs', 'want', 'us', 'hillary'] | negative |
| 10 | HillaryClinton | ['trump', 'wants', 'eliminate', 'estate', 'tax', 'would', 'mean', 'family', 'gets', '4', 'billion', 'tax', 'cut', '99', '8', 'americans', 'get', 'nothing'] | negative |
| 26 | realDonaldTrump | ['america', 'future'] | negative |
| 42 | realDonaldTrump | ['crookedhillary'] | negative |
| 51 | realDonaldTrump | ['order', 'try', 'deflect', 'horror', 'stupidity', 'wikileakes', 'disaster', 'dems', 'said', 'maybe', 'russia', 'dealing', 'trump', 'crazy'] | negative |
| 53 | HillaryClinton | ['trump', 'may', 'talk', 'big', 'game', 'trade', 'approach', 'based', 'fear', 'strength', 'fear', 'compete', 'rest', 'world'] | negative |
| 76 | HillaryClinton | ['starting', 'day', 'one', 'work', 'parties', 'pass', 'biggest', 'investment', 'new', 'good', 'paying', 'jobs', 'since', 'world', 'war', 'ii'] | negative |
| 89 | HillaryClinton | ['1979'] | negative |

Contrary to the popular belief that Neural Networks are the key to everything, here we have found an exception. According to the terminal output, the Multi-Layer Perceptron ended up doing the worst on this type of analysis. In fact, it ended up classifying every single tweet as *negative*; while the world can be a negative place, it can't be negative all the time.
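
The three sets of predictions above can also be compared programmatically. The short sketch below uses only the ten labels copied from the tables; the helper name `agreement` is my own. Since the true labels for these tweets are not listed here, it measures how often the classifiers agree with each other, not their accuracy.

```python
from collections import Counter

# Labels copied from the three tables above, in row order
# (row_id 3, 5, 9, 10, 26, 42, 51, 53, 76, 89).
naive_bayes_preds = ["positive", "negative", "negative", "negative", "positive",
                     "negative", "negative", "negative", "positive", "negative"]
linear_svm_preds = ["positive", "negative", "positive", "negative", "positive",
                    "negative", "negative", "negative", "positive", "negative"]
mlp_preds = ["negative"] * 10


def agreement(a, b):
    """Fraction of rows on which two classifiers predict the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Tally how many tweets each classifier put in each class.
for name, preds in [("Naive Bayes", naive_bayes_preds),
                    ("Linear SVM", linear_svm_preds),
                    ("MLP", mlp_preds)]:
    print(name, dict(Counter(preds)))

print("NB vs SVM:", agreement(naive_bayes_preds, linear_svm_preds))
print("NB vs MLP:", agreement(naive_bayes_preds, mlp_preds))
print("SVM vs MLP:", agreement(linear_svm_preds, mlp_preds))
```

A tally like this makes a collapsed classifier easy to spot: a model that puts every example in one class shows up immediately as a single-entry count.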

But hey, computers don't have emotions and are terrible at understanding the nuances of language, so the fact that we can tokenize strings, classify them, and have a machine learn from this is rather mind-blowing.

# Conclusion

Have we found a way for machines to learn emotions? Have we made machines just a bit smarter? Are we bringing Judgment Day upon us? Will the Terminator make an appearance? Well, the last one would be cool, but this is just one step towards proper Sentiment Analysis. There were more things we could have done to improve the accuracy and quality of the algorithms, such as:

+ Provide certain instructions to the algorithms where individual words, such as evil and amazing, are classified as negative and positive respectively (if you have unlabeled or partially labeled data)
+ Use a corpus of sentences properly labeled by professionals (I have found and formatted a researcher-labeled corpus, and will be putting it to use soon. More to come later).

Nonetheless, this was an interesting and eye-opening little project; if you get your own Twitter dev account you will be able to scrape tweets yourself and do your own little classification experiment. And of course feel free to borrow this code from the Github link below; no reason to do all the work yourself!

If you **Google** **"Sentiment Analysis"** you will find amazing algorithms that are reported to classify sentences with very high accuracy. These were built by the world's leading minds, and are a reflection of how close we are getting to Artificial Intelligence (i.e. machines "learning" behaviors and new ways of thinking, hence "machine learning").

It is my desire for this project not to be a one-time, resume-filling endeavor, but to be forever a work-in-progress. I intend to pursue all of the steps above and more to get professional-quality, working algorithms. If you would like to contribute and help speed up this process, please clone and contribute to the **Github** repo below.