Using [this](https://www.kaggle.com/uciml/sms-spam-collection-dataset) data set from Kaggle, I will use NLP/machine learning to build a model that predicts whether or not a given text message is spam. Since there are already a number of NLP tutorials on this site (including a
[basic one](http://staging-x.inertia7.com/projects/35) that I uploaded myself), I won't go as in-depth with my explanations for this project. Instead, I will focus more on demonstrating a practical implementation of NLP.
## Loading Packages
First I need to load the packages I will be using.
The `tm` package (short for text mining) is the best package I have found in R for NLP. All of the text processing for this project will be done using the `tm` package. The `dplyr` package will come in handy for aggregating data, as well as stringing together `tm` functions using the pipe operator. The `caret` package is a very powerful machine learning package that I will be using to create the ML model.
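A sketch of the corresponding `library()` calls (note that `stringr` and `caTools`, which come up later for label recoding and the ROC curve, are assumed to be installed as well):

```r
library(tm)       # text mining: corpora, cleaning, document-term matrices
library(dplyr)    # data manipulation and the %>% pipe operator
library(stringr)  # string replacement (recoding the ham/spam labels)
library(caret)    # training and evaluating the machine learning model
library(caTools)  # colAUC() for the ROC curve at the end
```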
Next, I need to set the seed for my machine learning model. This could technically be done later on, but I like to do it at the beginning of my machine learning projects so I don't forget. Setting the seed is essential to make the project reproducible.
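For example (the seed value itself is arbitrary; any fixed integer makes the run reproducible):

```r
# Fix the RNG state so the data split and CV folds are reproducible
set.seed(123)

# Sanity check: re-seeding reproduces the same random draws
set.seed(123); a <- runif(3)
set.seed(123); b <- runif(3)
identical(a, b)  # TRUE
```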
## Getting the Data
Now I can import the data and begin to take a look at it.
```r
texts <- read.csv("spam.csv")
glimpse(texts)
```

```
$ v1  <fctr> ham, ham, spam, ham, ham, spam, ham, ham, spam, spam, ham, spam, spam, ham, ham, spam, ham, ham, ham, sp...
$ v2  <fctr> Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore...
$ X   <fctr> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ X.1 <fctr> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ X.2 <fctr> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
```

Printing the first few rows shows the messages in `v2` (the empty `X` columns trail off the end):

```
1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 Ok lar... Joking wif u oni...
3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 U dun say so early hor... U c already then say...
5 Nah I don't think he goes to usf, he lives around here though
6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
  X X.1 X.2
```
It's immediately clear that there are some unnecessary variables, so I'll begin by getting rid of those.
```r
texts <- select(texts, v1, v2)
```
I also want to convert the ham/spam column (v1) to a binary format. I will do this using the `stringr` package.
```r
texts$v1 <- as.character(texts$v1)
texts$v1 <- str_replace(texts$v1, "ham", "0")
texts$v1 <- str_replace(texts$v1, "spam", "1")
texts$v1 <- as.factor(texts$v1)
```
Before I go any further, I want to quickly check to see how imbalanced the data set is.
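One quick way to do that is `prop.table(table(...))` on the label column; shown here with a small hypothetical stand-in vector rather than the real `texts$v1`:

```r
# Hypothetical labels standing in for texts$v1 (after the ham/spam -> 0/1 recoding)
labels <- factor(c("0", "0", "0", "0", "0", "0", "1"))

table(labels)              # raw counts per class
prop.table(table(labels))  # proportion of each class
```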
Clearly the data set is imbalanced (roughly 85% ham and 15% spam). There should still be enough spam observations to fit a classification model, but there are a few techniques (e.g. downsampling, upsampling) that I could try if the imbalance proves to be an issue.
## Text Processing
Now I can get started processing the text. The end goal is to create a Document Term Matrix that I can fit a machine learning model to, but there are a few steps I need to take before I can create one.
First I need to convert the `texts` dataframe to a corpus.
```r
texts_corp <- Corpus(VectorSource(texts$v2))
```
Now I'm going to apply a few text cleaning functions from the `tm` package. If you've taken a look at my basic NLP tutorial, this should look familiar.
```r
texts_corp <- texts_corp %>%
  tm_map(content_transformer(tolower)) %>%    # lower-case (implied by the terms seen later)
  tm_map(removePunctuation) %>%               # strip punctuation ("don't" -> "dont")
  tm_map(removeWords, stopwords(kind = "en")) # drop common English stopwords
```
Now I can create a Document Term Matrix.
```r
dtm <- DocumentTermMatrix(x = texts_corp,
                          control = list(tokenize = "words",
                                         stemming = "english",
                                         weighting = weightTf))
```
For some reason, specifying `minDocFreq` (minimum document frequency) in the `DocumentTermMatrix` function wasn't working for me (in `tm`, document-frequency limits for a DTM are normally set via the `bounds = list(global = c(min, max))` control option instead), so I will be taking care of this manually a little bit later.
Before I do that, I need to convert the DTM to a dataframe and take a look at its dimensions.
```r
tf <- as.data.frame(as.matrix(dtm))
dim(tf)
```

```
[1] 5572 8227
```
Clearly I need to reduce the number of variables (in this case all the variables are terms that appear in the texts) before I can create a meaningful model.
I'll do this by first finding the most frequent terms.
```r
term_freq <- colSums(tf)
freq <- data.frame(term = names(term_freq), count = term_freq)
arrange(freq, desc(count))[1:20, ]
```

```
   term count
1  call   578
2   now   479
3   can   405
4   get   390
5  will   378
6  just   366
7  dont   279
8  free   278
9  ltgt   276
10 know   257
11 like   242
12  got   239
13  ill   237
14 good   234
15 come   226
16  day   211
17 time   208
18 love   195
19 want   192
20 send   190
```
To be a useful predictor, a term will have to appear frequently enough that the computer has a large enough sample size to learn from. There isn't a way to know exactly how frequently a term has to appear to be useful, so to start, I chose to filter the terms that appeared 50 times or more (roughly 1% of the total number of observations). Since the model I ended up creating was accurate, I didn't end up changing this number. But you can play around with it to see if setting it larger or smaller improves your results.
```r
most_freq <- filter(freq, count >= 50)
```
Now that I have the terms that appear 50 or more times, I can select only those columns from my `tf` data frame. Unfortunately, `dplyr`'s `select` uses non-standard evaluation, so it doesn't accept a plain character vector of column names directly. I had to get creative: I copy-pasted the output of `most_freq` into a text editor, turned it into a comma-separated list, and pasted that back into R. It isn't the most elegant solution, but it was quick to implement and it works.
I also had to remove the term "next" from the list because `next` is a reserved word in R, so it can't be used as a bare variable name. I could have chosen to rename that variable instead, but frankly I'm surprised that a word as common as "next" wasn't already included as one of the default English stopwords.
```r
tf <- select(tf, already, ...)  # full comma-separated list of term names truncated here
```
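A less manual alternative (a sketch, not the approach used above): base R's bracket indexing does accept a character vector of column names, so the term list could be pulled straight from `most_freq`. The `tf_demo` and `most_freq_demo` objects below are small hypothetical stand-ins for the real data frames:

```r
# Hypothetical stand-ins for the real tf and most_freq data frames
tf_demo <- data.frame(call = 1:3, free = 4:6, rare = 7:9)
tf_demo[["next"]] <- 10:12
most_freq_demo <- data.frame(term = c("call", "free", "next"))

# Keep only the frequent terms, dropping the reserved word "next"
keep <- setdiff(as.character(most_freq_demo$term), "next")
tf_reduced <- tf_demo[, keep]
names(tf_reduced)  # "call" "free"
```

(Newer versions of `dplyr` also provide helpers such as `one_of()` that let `select` work with a character vector.)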
The last step before I can implement a model is adding the Class (ham or spam) column to the new dataframe.
```r
texts_tf <- mutate(tf, Class = texts$v1)
dim(texts_tf)
```

```
[1] 5572  148
```
As you can see, the number of variables has now been greatly reduced (to 148). Now I can begin creating a machine learning model.
## Creating a Machine Learning Model
The first step is splitting the data into a training set and a test set. I will be using an 80/20 split.
```r
trainIndex <- createDataPartition(texts_tf$Class,
                                  times = 1,
                                  p = 0.8,
                                  list = FALSE)

train <- texts_tf[trainIndex, ]
test <- texts_tf[-trainIndex, ]
```
Next, I'll fit a logistic regression model to the training set (note: in the `caret` package, `glm` (generalized linear model) is the default way to implement logistic regression when the outcome is a two-level factor). I will use 10-fold cross-validation when training the model to guard against overfitting.
```r
model <- train(Class ~ .,
               data = train,
               method = "glm",
               trControl = trainControl(method = "cv",
                                        number = 10))
```
```
Generalized Linear Model

2 classes: '0', '1'

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4012, 4012, 4012, 4013, 4012, 4012, ...
```
Initially, this looks like a solid model, but I will need to explore it further to see what it's doing.
First I'll take a look at its variable importances. The `caret` package displays variable importances on a scale from 0 to 100.
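The importance table shown below can be produced with `caret`'s `varImp()` function, applied to the `model` object trained above:

```r
# Variable importances for the fitted model, scaled by caret to 0-100
varImp(model)
```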
```
glm variable importance

  only 20 most important variables shown (out of 147)
```
Now I'll see how the model performs on the test set.
```r
p <- predict(model, test, type = "prob")
predictions <- factor(ifelse(p[, "0"] > 0.4, "0", "1"),
                      levels = levels(test$Class))
```
Through trial and error I found that setting the probability threshold to 0.4 rather than the default 0.5 improved the model's performance.
One of the best ways to measure a binary classification model's performance is creating a confusion matrix. This will tell me the number of false negatives and false positives from my test set.
```r
confusionMatrix(predictions, test$Class)
```

```
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 953  24
         1  12 125

               Accuracy : 0.9677
                 95% CI : (0.9555, 0.9773)
    No Information Rate : 0.8662
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.8556
 Mcnemar's Test P-Value : 0.06675

            Sensitivity : 0.9876
            Specificity : 0.8389
         Pos Pred Value : 0.9754
         Neg Pred Value : 0.9124
             Prevalence : 0.8662
         Detection Rate : 0.8555
   Detection Prevalence : 0.8770
      Balanced Accuracy : 0.9132

       'Positive' Class : 0
```
The overall accuracy on the test set is 96.77% and both the positive predictions value and negative predictions value are over 90%. That's a good sign that the model is performing well.
I can also create an ROC curve to better visualize its performance.
```r
colAUC(p, test$Class, plotROC = TRUE)
```
The ROC curve reaffirms what the confusion matrix told me: that the model is sufficiently accurate.
## Final Thoughts
Hopefully you found this example of a practical application of NLP interesting. I could have tried to improve the performance of my model by implementing other machine learning techniques, but for my purposes, 96.77% accuracy with positive and negative predictive values both over 90% is certainly adequate. For the problem at hand, simple logistic regression was able to get the job done.