# Text Messages: Spam or Ham?

NLP/Machine Learning Using R

## Abstract

Using [this](https://www.kaggle.com/uciml/sms-spam-collection-dataset) data set from Kaggle, I will use NLP/machine learning to build a model that predicts whether or not a given text message is spam. Since there are already a number of NLP tutorials on this site (including a [basic one](http://staging-x.inertia7.com/projects/35) that I uploaded myself), I will not go as in depth with my explanations for this project. Instead I will focus on demonstrating a practical implementation of NLP.
## Loading Packages

First I need to load the packages I will be using.

```
library(tm)
library(dplyr)
library(stringr)
library(caret)
library(caTools)
```

The `tm` package (short for text mining) is the best package I have found in R for NLP. All of the text processing for this project will be done using the `tm` package. The `dplyr` package will come in handy for aggregating data, as well as stringing together `tm` functions using the pipe operator. The `caret` package is a very powerful machine learning package that I will be using to create the ML model.

Next, I need to set the seed for my machine learning model. This could technically be done later on, but I like to do it at the beginning of my machine learning projects so I don't forget. Setting the seed is essential to make the project reproducible.

```
set.seed(19)
```

## Getting the Data

Now I can import the data and begin to take a look at it.

```
texts <- read.csv("spam.csv")
```

```
> glimpse(texts)
Observations: 5,572
Variables: 5
$ v1  <fctr> ham, ham, spam, ham, ham, spam, ham, ham, spam, spam, ham, spam, spam, ham, ham, spam, ham, ham, ham, sp...
$ v2  <fctr> Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore...
$ X   <fctr> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ X.1 <fctr> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ X.2 <fctr> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

> head(texts)
    v1
1  ham
2  ham
3 spam
4  ham
5  ham
6 spam
                                                                                                                                             v2
1                Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2                                                                                                  Ok lar... Joking wif u oni...
3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4                                                                              U dun say so early hor... U c already then say...
5                                                                 Nah I don't think he goes to usf, he lives around here though
6        FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
  X X.1 X.2
1
2
3
4
5
6
```

It's immediately clear that there are some unnecessary variables, so I'll begin by getting rid of those.

```
texts <- select(texts, v1, v2)
```

I also want to convert the ham/spam column (v1) to a binary format. I will do this using the `stringr` package.

```
texts$v1 <- as.character(texts$v1)
texts$v1 <- str_replace(texts$v1, "ham", "0")
texts$v1 <- str_replace(texts$v1, "spam", "1")
texts$v1 <- as.factor(texts$v1)
```

Before I go any further, I want to quickly check how imbalanced the data set is.

```
> table(texts$v1)

   0    1
4825  747
```

Clearly the data set is imbalanced (roughly 85% ham and 15% spam). There should still be enough spam observations to fit a classification model, but there are a few techniques (e.g. downsampling, upsampling) that I could try if the imbalance proves to be an issue.
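For reference, here is a minimal sketch of what downsampling could look like using `caret`'s `downSample` function. I did not use this anywhere in the analysis below, and the `balanced` object name is purely illustrative.

```
# Hypothetical only: randomly drop majority-class ("ham") rows until the two
# classes are the same size. Not used in the model that follows.
balanced <- downSample(x = select(texts, v2),
                       y = texts$v1,
                       yname = "v1")
table(balanced$v1)
```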
## Text Processing

Now I can get started processing the text. The end goal is to create a Document Term Matrix that I can fit a machine learning model to, but there are a few steps I have to take before I can create one. First I need to convert the `texts` dataframe to a corpus.

```
texts_corp <- Corpus(VectorSource(texts$v2))
```

Now I'm going to apply a few text cleaning functions from the `tm` package. If you've taken a look at my basic NLP tutorial, this should look familiar.

```
texts_corp <- texts_corp %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords(kind = "en")) %>%
  tm_map(content_transformer(stripWhitespace))
```

Now I can create a Document Term Matrix.

```
dtm <- DocumentTermMatrix(x = texts_corp,
                          control = list(tokenize = "words",
                                         stemming = "english",
                                         weighting = weightTf))
```

For some reason, specifying `minDocFreq` (minimum document frequency) in the `DocumentTermMatrix` function wasn't working for me, so I will take care of this manually a little later. Before I do that, I need to convert the DTM to a dataframe and take a look at its dimensions.

```
tf <- as.data.frame.matrix(dtm)
```

```
> dim(tf)
[1] 5572 8227
```

Clearly I need to reduce the number of variables (in this case all the variables are terms that appear in the texts) before I can create a meaningful model. I'll do this by first finding the most frequent terms.

```
term_freq <- colSums(tf)
freq <- data.frame(term = names(term_freq), count = term_freq)
```

```
> arrange(freq, desc(count))[1:20,]
   term count
1  call   578
2   now   479
3   can   405
4   get   390
5  will   378
6  just   366
7  dont   279
8  free   278
9  ltgt   276
10 know   257
11 like   242
12  got   239
13  ill   237
14 good   234
15 come   226
16  day   211
17 time   208
18 love   195
19 want   192
20 send   190
```

To be a useful predictor, a term has to appear frequently enough that the model has a large enough sample size to learn from. There isn't a way to know exactly how frequently a term has to appear to be useful, so to start, I chose to keep the terms that appeared 50 times or more (roughly 1% of the total number of observations). Since the model I ended up creating was accurate, I didn't end up changing this number, but you can play around with it to see if setting it larger or smaller improves your results.

```
most_freq <- filter(freq, count >= 50)
```

Now that I have the terms that appear 50 times or more, I can select only those columns from my `tf` data frame. Unfortunately, since the `select` function from `dplyr` doesn't take a character vector as the input, I had to get creative: I copy pasted the output of `most_freq` into a text editor and turned it into a list that I could copy paste back into R. It isn't the most elegant solution, but it was quick to implement and it works. I also had to remove the term "next" from the list, because `next` is a reserved word in R, so you can't have a variable named "next". I could have chosen to just rename that variable instead, but frankly I'm surprised that a word as common as "next" wasn't already included as one of the default English stopwords.
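As an aside, a more programmatic route would avoid the copy paste step entirely. The sketch below is not what I ran for this project; it assumes the installed version of `dplyr` exports `one_of()` (which accepts a character vector of column names).

```
# Hypothetical alternative to the manual copy/paste: build a character vector of
# the frequent terms, drop the reserved word "next", and select with one_of().
keep_terms <- as.character(most_freq$term)
keep_terms <- setdiff(keep_terms, "next")   # "next" is a reserved word in R
tf <- select(tf, one_of(keep_terms))
```

The manual list I actually used is below.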
```
tf <- select(tf, already, also, always, amp, anything, around, ask, babe, back, box, buy,
             call, can, cant, care, cash, chat, claim, come, coming, contact, cos, customer,
             day, dear, didnt, dont, dun, even, every, feel, find, first, free, friends,
             get, getting, give, going, gonna, good, got, great, guaranteed, gud, happy,
             help, hey, home, hope, ill, ive, just, keep, know, last, late, later, leave,
             let, life, like, lol, lor, love, ltgt, make, many, meet, message, mins, miss,
             mobile, money, morning, msg, much, name, need, new, nice, night, nokia, now,
             number, one, people, per, phone, pick, place, please, pls, prize, really,
             reply, right, said, say, see, send, sent, service, sleep, someone, something,
             soon, sorry, still, stop, sure, take, tell, text, thanks, thats, thing, things,
             think, thk, time, today, told, tomorrow, tonight, txt, urgent, wait, waiting,
             wan, want, wat, way, week, well, went, will, win, wish, won, wont, work, yeah,
             year, yes, yet, youre)
```

The last step before I can implement a model is adding the Class (ham or spam) column to the new dataframe.

```
texts_tf <- mutate(tf, Class = texts$v1)
```

```
> dim(texts_tf)
[1] 5572  148
```

As you can see, the number of variables has now been greatly reduced (to 148). Now I can begin creating a machine learning model.

## Creating a Machine Learning Model

The first step is splitting the data into a training set and a test set. I will be using an 80/20 split.

```
trainIndex <- createDataPartition(texts_tf$Class, times = 1, p = 0.8, list = FALSE)
train <- texts_tf[trainIndex, ]
test <- texts_tf[-trainIndex, ]
```

Next, I'll fit a logistic regression model to the training set. (Note: in the `caret` package, `glm` (generalized linear model) is the default way to implement logistic regression.) I will use 10-fold cross-validation when training the model to prevent overfitting.

```
model <- train(Class ~ .,
               data = train,
               method = "glm",
               trControl = trainControl(method = "cv", number = 10))
```

```
> model
Generalized Linear Model

4458 samples
 147 predictor
   2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4012, 4012, 4012, 4013, 4012, 4012, ...
Resampling results:

  Accuracy  Kappa
  0.9585    0.810846
```

Initially, this looks like a solid model, but I will need to explore it further to see what it's doing. First I'll take a look at its variable importances. The `caret` package displays variable importance on a scale from 0 to 100.

```
> varImp(model)
glm variable importance

  only 20 most important variables shown (out of 147)

        Overall
call     100.00
reply     73.57
txt       70.95
text      67.73
send      47.95
stop      47.60
mobile    47.49
chat      43.46
free      42.43
new       38.16
message   36.49
help      36.47
find      36.25
box       34.71
nokia     34.55
now       33.70
ill       33.46
yes       33.02
service   32.89
mins      32.61
```

Now I'll see how the model performs on the test set.

```
p <- predict(model, test, type = "prob")
predictions <- factor(ifelse(p["0"] > 0.4, "0", "1"))
```

Through trial and error I found that setting the probability threshold to 0.4 rather than the default 0.5 improved the model's performance.
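If you would rather be systematic than rely on trial and error, you could sweep a grid of cutoffs and compare test-set accuracy. This is only an illustrative sketch added here, not part of the original analysis; it reuses the `p` and `test` objects from above, and the object names are arbitrary.

```
# Hypothetical threshold sweep: test-set accuracy for a range of cutoffs on the
# predicted probability of class "0" (ham).
thresholds <- seq(0.3, 0.7, by = 0.05)
accuracy <- sapply(thresholds, function(t) {
  preds <- factor(ifelse(p["0"] > t, "0", "1"), levels = levels(test$Class))
  mean(preds == test$Class)
})
data.frame(threshold = thresholds, accuracy = accuracy)
```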
One of the best ways to measure a binary classification model's performance is creating a confusion matrix. This will tell me the number of false negatives and false positives from my test set.

```
> confusionMatrix(predictions, test$Class)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 953  24
         1  12 125

               Accuracy : 0.9677
                 95% CI : (0.9555, 0.9773)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.8556
 Mcnemar's Test P-Value : 0.06675

            Sensitivity : 0.9876
            Specificity : 0.8389
         Pos Pred Value : 0.9754
         Neg Pred Value : 0.9124
             Prevalence : 0.8662
         Detection Rate : 0.8555
   Detection Prevalence : 0.8770
      Balanced Accuracy : 0.9132

       'Positive' Class : 0
```

The overall accuracy on the test set is 96.77%, and both the positive predictive value and negative predictive value are over 90%. That's a good sign that the model is performing well. I can also create an ROC curve to better visualize its performance.

```
colAUC(p, test$Class, plotROC = TRUE)
```

![alt-text](https://github.com/bryandaetz/TextMessageSpam/raw/master/ROC_Curve.png)

The ROC curve reaffirms what the confusion matrix told me: the model is sufficiently accurate.

## Final Thoughts

Hopefully you found this example of a practical application of NLP interesting. I could have tried to improve the performance of my model by implementing other machine learning techniques, but for my purposes I think 96.77% accuracy, with positive and negative predictive values both over 90%, is certainly adequate. For the problem at hand, simple logistic regression was able to get the job done.
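If you do want to experiment with the "other machine learning techniques" mentioned above, swapping models in `caret` is mostly a one-argument change. Here is a minimal sketch, not something I ran for this project, assuming the `randomForest` package that backs `caret`'s `"rf"` method is installed (expect it to train much more slowly than `glm`).

```
# Hypothetical: fit a random forest on the same training data with the same
# cross-validation setup, then evaluate it on the same test set.
rf_model <- train(Class ~ .,
                  data = train,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10))
rf_pred <- predict(rf_model, test)
confusionMatrix(rf_pred, test$Class)
```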
