MYERS BRIGGS PERSONALITY TYPE NATURAL LANGUAGE PROCESSING (NLP) ANALYSIS (PART 2)

ANALYZING TWEETS LABELED WITH THEIR CORRESPONDING PERSONALITY TYPE

PYTHON 3, VIRTUAL ENVIRONMENTS

HARD

last hacked on Oct 29, 2017

This project focuses on analyzing tweets by personality type (source: Twitter) and drawing various insights from them (word frequencies, word significance, etc.). This process is known as Sentiment Analysis, and researchers using it have reported astounding accuracies within the last couple of years. Natural language processing (NLP) is used in the context of the Python programming language -- we focus on leveraging the NLTK package. This project is written in Python, taking advantage of its simple syntax and powerful modules to bring you, the user, informative data visualizations of the tweets by personality type. In the last part I read in the data, tokenized the tweets, and visualized them. In this part, I will introduce the algorithms used to classify these tweets by their corresponding personality types. We encourage you to try replicating this project and to make your own contributions! You can fork this project on GitHub (link at the bottom of the page). A translation of this project to R is in the works. Click [here](https://www.inertia7.com/projects/109) for Part 1!
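The snippets in this part assume the processed data from Part 1 is already loaded into a `pandas` DataFrame named `file_unsep` with `type` and `posts` columns, and that the relevant scikit-learn modules have been imported. Here is a minimal sketch of that setup; the file name below is a placeholder, and the real processing step lives in the GitHub repo:

```python
import numpy as np
import pandas as pd

# scikit-learn pieces used in the snippets below
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Placeholder file name -- see the repo for the actual data processing step
file_unsep = pd.read_csv('mbti_data.csv')  # columns: 'type', 'posts'
```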
# Building the Model

Now for the fun part! Let's see these bad boys in action.

```python
mbtitype = np.array(file_unsep['type'])
mbtiposts = np.array(file_unsep['posts'])

X_train, X_test, y_train, y_test = train_test_split(
    mbtiposts, mbtitype, test_size=0.33, random_state=42)

print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))
"""
5812
2863
5812
2863
"""
```

Here we use the `train_test_split` function from scikit-learn to generate random indices for every data point and then reassemble the pieces, effectively creating a random train/test split. Random sampling is desirable (not just here, but in all experiments and statistical problems) because it minimizes bias by giving every input an equal chance of being chosen and used in the experiment. But the most important reason?

*"The mathematical theorems which justify most frequentist statistical procedures apply only to random samples."* ([source](https://www.ma.utexas.edu/users/mks/statmistakes/RandomSampleImportance.html))
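Before training a classifier, the tweets have to be turned into numeric features. The mock classifier below relies on `count_vect`, `tfidf_transformer`, and `X_train_tfidf`, which are created earlier in the full script on GitHub; here is a minimal sketch of that step, assuming the same scikit-learn classes that appear in the pipelines further down:

```python
# Bag-of-words counts on the training split
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

# Re-weight the raw counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
```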
First, we will build a mock classifier and make predictions on some mock sentences.

```python
# Training a classifier
clf = MultinomialNB().fit(X_train_tfidf, y_train)

INTJ_sentence = ['Writing college essays is stressful because I have to give a stranger a piece of myself and that piece has to incorporate all of who I am']
INTJ_X_new_counts = count_vect.transform(INTJ_sentence)
INTJ_X_new_tfidf = tfidf_transformer.transform(INTJ_X_new_counts)

ENFP_sentence = ['Our favorite friendships are the ones where you can go from talking about the latest episode of the Bachelorette to the meaning of life']
ENFP_X_new_counts = count_vect.transform(ENFP_sentence)
ENFP_X_new_tfidf = tfidf_transformer.transform(ENFP_X_new_counts)

# Make a prediction for each test sentence
predictedINTJ = clf.predict(INTJ_X_new_tfidf)
predictedENFP = clf.predict(ENFP_X_new_tfidf)

for words, category in zip(INTJ_sentence, predictedINTJ):
    print('%r => %s' % (INTJ_sentence, category))
for words, category in zip(ENFP_sentence, predictedENFP):
    print('%r => %s' % (ENFP_sentence, category))
```

We end up with the output:

```python
['Writing college essays is stressful because I have to give a stranger a piece of myself and that piece has to incorporate all of who I am'] => INFP
['Our favorite friendships are the ones where you can go from talking about the latest episode of the Bachelorette to the meaning of life'] => INFP
```

Now let's build the rest of the models.

```python
def naive_bayes():
    """
    Build a Pipeline; this does all of the work done
    in the extract_and_train() function (can be found on GitHub)
    """
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
        ])
    text_clf = text_clf.fit(X_train, y_train)

    # Evaluate performance on the test set
    predicted = text_clf.predict(X_test)
    print("The accuracy of a Naive Bayes algorithm is: ")
    print(np.mean(predicted == y_test))
    print("Number of mislabeled points out of a total %d points for the Naive Bayes algorithm: %d"
          % (X_test.shape[0], (y_test != predicted).sum()))


def linear_svm():
    """
    Let's try a Linear Support Vector Machine (SVM)
    """
    # Only the initialization is shown, to highlight the different parameters;
    # the rest of the model evaluation is the same
    text_clf_two = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=1e-3, n_iter=5, random_state=42)),
        ])
    # Model fitting, evaluation and prediction of unlabeled points performed here


def neural_network():
    """
    Now let's try one of the most powerful algorithms:
    an Artificial Neural Network (ANN)
    """
    # Only the initialization is shown, to highlight the different parameters;
    # the rest of the model evaluation is the same
    text_clf_three = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('chi2', SelectKBest(chi2, k='all')),
        ('clf', MLPClassifier(
            hidden_layer_sizes=(100,), max_iter=10, alpha=1e-4,
            solver='sgd', verbose=10, tol=1e-4, random_state=1,
            learning_rate_init=.1)),
        ])
    # Model fitting, evaluation and prediction of unlabeled points performed here
```

This will output the accuracy of each algorithm, as well as the number of incorrectly labeled points.

If you have cloned the [repo](https://github.com/Njfritter/myersBriggsNLPAnalysis) from **GitHub**, you can simply run the following commands to parse the initial data file, perform the exploratory analysis, and train the algorithms (as well as predict unlabeled tweets, which we will look at in the next section):

```bash
python3 NLPAnalysis.py process
python3 NLPAnalysis.py viz
python3 NLPAnalysis.py NB   # Naive Bayes
python3 NLPAnalysis.py SVM  # Linear Support Vector Machine
python3 NLPAnalysis.py NN   # Neural Network
```

Or, if you are familiar with a `Makefile`, you can simply run `make` and the same commands will run.

# Showing Predictions

```python
# This code was run for every model
for words, label in zip(X_test[1:5, ], predicted[1:5, ]):
    print(words + " => " + label)
```

This is already done for you when you run `python3 NLPAnalysis.py MODEL`. Now let's take a look at each algorithm's predictions.

First, the **Naive Bayes** algorithm:

### 10 Predicted Tweets from the Naive Bayes Algorithm

**See GitHub for the predicted tokenized words + results!**

### 10 Predicted Tweets from the Linear Support Vector Machine Algorithm

**See GitHub for the predicted tokenized words + results!**

And last but not least, we have the **Neural Network**:

### 10 Predicted Tweets from the Neural Network Algorithm

**See GitHub for the predicted tokenized words + results!**

# Conclusion

Have we found a way for machines to learn emotions? Have we made machines just a bit smarter? Are we bringing Judgment Day upon us? Will the Terminator make an appearance?

Well, the last one would be cool, but this is just one step towards proper Sentiment Analysis. There are more things we could have done to improve the accuracy and quality of the algorithms, such as:

+ Thing 1
+ Thing 2

Nonetheless, this was an interesting and eye-opening little project; if you get your own Twitter developer account, you can scrape tweets yourself and run your own little classification experiment. And of course, feel free to borrow this code from the GitHub link below; there's no reason to do all the work yourself!

If you **Google** **"Sentiment Analysis"**, you will find amazing algorithms that reportedly classify sentences with near-perfect accuracy. These were built by some of the world's leading minds, and they are a reflection of how far we have come with Artificial Intelligence (i.e. machines "learning" behaviors and new ways of thinking, hence "machine learning").

It is my desire for this project not to be a one-time, resume-filling endeavor, but to remain forever a work in progress. I intend to pursue all of the steps above and more to get professional-quality, working algorithms. If you would like to contribute and help speed up this process, please clone and contribute to the **GitHub** repo below.
