last hacked on Feb 12, 2018

# YouTube Data Prediction

## Abstract:
Using natural language processing (NLP) and machine learning (ML) algorithms to parse and classify YouTube comments as positive, negative, or neutral.

## Goals:
* Understand how word usage can be indicative of overall user sentiment towards, or approval of, videos on social media platforms
* Learn about NLP and how computers can be used to comprehend, analyze, and classify human language and emotions through text

## Current Progress:
Jupyter Notebook

## Method:
1. Extract data from the YouTube API using ApiCall
2. Use NLP to tokenize and track word frequencies in comments (using the scikit-learn package)
3. Classify using multinomial naive Bayes, support vector machines, and logistic regression
4. Visualize results with plots and a Plotly dashboard (for interactive material)

## Contributors:
* Andie Donovan
* Shon Inouye
* Conor O'Brien
* Matthew Peterschmidt

## Links:
1. GitHub
2. API
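
As a rough end-to-end illustration of steps 2 and 3 of the Method above, the sketch below chains a count vectorizer and a multinomial naive Bayes classifier with scikit-learn's make_pipeline. The four training comments, their labels, and the variable names are invented for illustration and are not taken from the project data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (not from labeledCom.csv)
toy_comments = [
    "love this video great",
    "great song love it",
    "hate this awful video",
    "awful boring hate it",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# Step 2 (word counts) and step 3 (classifier) chained into one estimator
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_comments, toy_labels)

# Classify two unseen toy comments
toy_predictions = toy_model.predict(["love great", "awful hate"])
print(list(toy_predictions))
```

Wrapping both steps in one pipeline means the vocabulary is always learned from the training comments only, which is the same separation the notebook keeps by hand further down.
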
## Read in & Clean Data

Note that we have to use encoding = "latin-1" instead of UTF-8 because we have foreign languages present.

The different encodings treat characters differently (in latin-1 each character is exactly one byte long, whereas in UTF-8 a character can be more than one byte in length). Typically UTF-8 captures more types of characters, so it was surprising that we had to use latin-1. One likely reason: latin-1 assigns a character to every possible byte value, so decoding never raises an error, whereas UTF-8 rejects byte sequences that are not valid UTF-8 (for example, a file saved in another encoding). For later: look into this more.

    # Basic imports
    import pandas as pd
    import os
    import csv
    import numpy as np

    # Sklearn
    import sklearn  # machine learning
    from sklearn.feature_extraction.text import CountVectorizer  # frequency counts matrix
    from sklearn.model_selection import train_test_split  # splitting up data
    from sklearn import metrics  # for accuracy/precision
    from sklearn.feature_extraction.text import TfidfTransformer
    # The multinomial Naive Bayes classifier is suitable for classification with
    # discrete features (e.g., word counts for text classification)
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier  # linear classifiers (SVM, logistic regression) trained with SGD

    # Dashboard/Plotly packages:
    import plotly.dashboard_objs as dashboard
    import IPython.display
    from IPython.display import Image
    import plotly.plotly as py
    import plotly.graph_objs as go
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
    init_notebook_mode(connected=True)
    from pylab import *

    os.chdir('/Users/andiedonovan/myProjects/Youtube/')  # change directory
    df = pd.read_csv('labeledCom.csv', delimiter=";", skiprows=2,
                     encoding='latin-1', engine='python')  # read in the data

    # rename the columns
    df.columns = ['label', 'comment', 'column3']
    df = df.drop('column3', axis=1).dropna()  # drop column 3 and missing values
    print(df.head(5))

## Split into Training and Test Data

Using scikit-learn's train_test_split function, we randomly split the data into training data (75%) and test data (25%).
We set the x variable for both to the comments, since these are the attributes we will use for classification, and the y variable to the label, as this is what we are trying to predict. The random_state parameter is simply for reproducibility (otherwise the function would produce a different split every time we ran it).

    X_train, X_test, Y_train, Y_test = train_test_split(
        df["comment"], df["label"], test_size=0.25, random_state=42)

    # Let's make sure all of the data looks good:
    print('lengths training variables: ', len(X_train), ",", len(Y_train))
    print('lengths testing variables: ', len(X_test), ",", len(Y_test), '\n')
    print('Are there any missing values?',
          '\n * Training:', pd.isnull(X_train).values.any(), ',', pd.isnull(Y_train).values.any(),
          '\n * Testing: ', pd.isnull(X_test).values.any(), ",", pd.isnull(Y_test).values.any())

    type(X_test)  # we have a pandas Series; we just want the comments in an array without the index
    # help(X_test)  # use the values attribute
    # we will use X_test.values, Y_train.values, ... (an attribute, not a method) to access the data as NumPy arrays

## Building a Model

Documentation: Scikit-Learn Documentation

We want to initialize a Count Vectorizer, which will convert the comments to a matrix of token (word) counts. This produces a sparse representation of the counts. We then fit the vectorizer using our training data.

    cv = CountVectorizer()
    x_train_counts = cv.fit_transform(X_train.values)  # learn the vocabulary and transform to counts
    type(x_train_counts)  # scipy.sparse.csr.csr_matrix

### Transform test values as well:

    x_test_counts = cv.transform(X_test.values)  # transform only: the vocabulary is learned from the training data, never the test data
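
To make the token-count matrix concrete, here is a plain-Python sketch of roughly what CountVectorizer computes with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). It is a simplified stand-in, not scikit-learn's implementation; the helper names and the example comments are made up.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every token seen in the training documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Return one row of counts per document; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical comments, for illustration only
train_docs = ["Great video, great song", "I hate this video"]
test_docs = ["great tutorial"]  # "tutorial" is not in the training vocabulary

vocab = fit_vocabulary(train_docs)          # learned from training data only
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)   # reuses the training vocabulary

print(vocab)
print(train_counts)
print(test_counts)
```

The real matrix is stored in sparse form because each comment contains only a few of the vocabulary's words, so most entries are zero.
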
### Initializing the Classifier:

    mnb = MultinomialNB().fit(x_train_counts, Y_train)  # fit the model on the training data word counts and training data labels

### Making the Predictions:

    mnb_predict = mnb.predict(x_test_counts)  # make our y predictions (labels) on the comment test data
    for i in mnb_predict[:10]:
        print(i)

### Accuracy Metrics

    mnb_acc = metrics.accuracy_score(Y_test, mnb_predict)  # fraction between 0 and 1
    print('We obtained ', round(100 * mnb_acc, 4), '% accuracy for the model')
    np.mean(mnb_predict == Y_test)  # same score, different method

    print('Here is the Classification Report: \n')
    print(metrics.classification_report(Y_test, mnb_predict))

    print('Here is the Confusion Matrix: \n')
    metrics.confusion_matrix(Y_test, mnb_predict)

## Using a TF-IDF Transformation
* Instead of just counting the number of occurrences, the term frequency-inverse document frequency (TF-IDF) transformation weights each word by how often it occurs in a document (aka comment), scaled down by how many documents in the entire corpus (aka collection of comments) contain it
* Because words that appear in nearly every comment receive low weights, it can also be used to down-weight stop words

Here we apply the transformation:

    tfidf_transformer = TfidfTransformer()
    x_tfidf_tr = tfidf_transformer.fit_transform(x_train_counts)
    x_tfidf_tst = tfidf_transformer.transform(x_test_counts)

    mnb2 = MultinomialNB().fit(x_tfidf_tr, Y_train)
    tfidf_pred = mnb2.predict(x_tfidf_tst)
    tfidf_acc = metrics.accuracy_score(Y_test, tfidf_pred)
    print('We obtained ', round(100 * tfidf_acc, 4), '% accuracy for the tf-idf transformed model')

Output:

    We obtained 54.3796 % accuracy for the tf-idf transformed model

Accuracy actually decreased! A possible explanation is that the comments are typically very short and likely do not have many repeating words, which could have created a skewed set of weights favoring the content of long or more repetitive comments (not really sure though...)

## Using the SGD (Stochastic Gradient Descent) Learning

Essentially SGD tries to find an extremum through iteration.
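
As a minimal sketch of that iterative search (a toy problem, not the SGDClassifier internals), the loop below minimizes the average squared distance to a handful of made-up numbers. Each step looks at one randomly chosen example and nudges the parameter against that example's gradient; the data, seed, and step-size schedule are all illustrative assumptions.

```python
import random

# Hypothetical data; the minimizer of the average loss is its mean (3.5)
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rng = random.Random(0)  # fixed seed for reproducibility

w = 0.0  # parameter being learned
for step in range(1, 5001):
    x = rng.choice(data)        # "stochastic": one random example per step
    gradient = w - x            # derivative of 0.5 * (w - x) ** 2 with respect to w
    learning_rate = 1.0 / step  # shrinking steps so the estimate settles down
    w -= learning_rate * gradient

print(round(w, 2))  # close to the data mean, up to sampling noise
```

No single step sees the whole data set, yet the parameter drifts toward the value that minimizes the average loss, which is what makes the approach cheap on large, sparse text data.
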
In this case we are dealing with likelihood functions: each algorithm tries to find the classification (positive, neutral, or negative) with the highest likelihood of occurring (or probability of being correct, in a sense). Strictly, this description fits logistic regression; the linear SVM below minimizes a hinge loss rather than maximizing a likelihood. We want to maximize this likelihood while minimizing inaccuracies (Type I/Type II errors) and computational expenditure. We might have to revisit material on statistical power.

The count vectorizer transformation converts our data to a sparse matrix, which represents the model features (aka comments & their word counts) in matrix form, where the majority of the entries are zero.

Loss functions: a loss function maps each prediction to a "cost", or price paid for an inaccurate prediction/classification. The purpose of training is to find the model parameters that minimize this cost (aka try to minimize costly inaccuracies).

### Logistic Regression

    sgd = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=1)
    sgd.fit(x_train_counts, Y_train)
    sgd_predict = sgd.predict(x_test_counts)
    sgd_acc = metrics.accuracy_score(Y_test, sgd_predict)
    print('We obtained ', round(100 * sgd_acc, 4), '% accuracy for the logistic regression model')

Output:

    We obtained 63.8686 % accuracy for the logistic regression model

Classifier Parameters:
* loss: set to 'log' to use logistic regression (renamed 'log_loss' in scikit-learn 1.1)
* penalty: the regularization term ('l2' here); an 'l1' or 'elasticnet' penalty can additionally bring sparsity (feature selection)
    * regularization: the process of introducing additional information in order to prevent overfitting
* alpha: constant that multiplies the regularization term (larger values mean stronger regularization); not a smoothing parameter as in naive Bayes
* max_iter: how many times to pass over the training data (epochs)
* tol: stopping criterion; training stops early once the loss stops improving by at least tol (tol=None disables early stopping, so all max_iter passes run)
* verbose: verbosity level of the training output (not set here). Separately, we want to maximize parsimony, i.e. avoid using more factors than necessary

### Linear Support Vector Machines (SVM)

    svm = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None,
                        random_state=1)  # penalty, loss, alpha parameters
    svm.fit(x_train_counts, Y_train)
    svm_predict = svm.predict(x_test_counts)
    svm_acc = metrics.accuracy_score(Y_test, svm_predict)
    print('We obtained ', round(100 * svm_acc, 4), '% accuracy for the SVM model')

Output:

    We obtained 61.3139 % accuracy for the SVM model

We set the loss parameter to 'hinge' to use a linear SVM.

## Next Steps
1. Feature Engineering:
    * Look at Peter Norvig's Spelling Corrector
    * Meta Features, Stemming, Part-of-Speech Tagging, etc.:
        * NLP Blog Post
        * Stanford Post
2. Try out more models; see if we can pre-process the data better?
    * Random Forests (Decision Trees), 'Ensembling' algorithms?
3. Add more videos/comments for diversity of diction/sentiments
4. Create a clean way of presenting results (i.e. with comments and their predicted classification. Maybe we can highlight the correct ones in green and the wrong ones in red?)
5. Data Visualization / Dashboard?
    * Plotly
    * Tutorial
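
One possible take on step 4, sketched under assumptions: print each comment next to its true and predicted label, and color the row green when they match and red when they do not, using ANSI escape codes (which terminals and Jupyter output cells generally render). The rows below are invented for illustration; in the notebook they would come from X_test, Y_test, and a classifier's predictions.

```python
GREEN = "\033[92m"
RED = "\033[91m"
RESET = "\033[0m"


def format_row(comment, true_label, predicted_label):
    """Return one colored line: green if the prediction is right, red otherwise."""
    color = GREEN if true_label == predicted_label else RED
    return f"{color}{predicted_label:<9} (true: {true_label:<9}) {comment}{RESET}"


# Hypothetical examples, not real model output
rows = [
    ("love this song", "positive", "positive"),
    ("worst video ever", "negative", "neutral"),
]

report = [format_row(*row) for row in rows]
for line in report:
    print(line)
```
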

