# Iris Flower Classification (R)

KNN analysis of iris flowers with R
Language: R · Difficulty: Easy · Last hacked on Nov 22, 2018

This project focuses on classifying iris flowers into their respective species using the K-Nearest Neighbors (KNN) machine-learning algorithm. The three species in this classification problem are setosa, versicolor, and virginica. The explanatory variables are sepal length, sepal width, petal length, and petal width (see the Wikipedia entries for [sepal](https://en.wikipedia.org/wiki/Sepal) and [petal](https://en.wikipedia.org/wiki/Petal)). We are essentially trying to predict the species of an iris flower from its physical measurements! The K-Nearest Neighbors algorithm is interesting because it is a simple yet powerful classification method: it measures the Euclidean distance from a new observation to every training observation, takes the k closest neighbors, and assigns the class held by the majority of them.
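To make the voting idea concrete, here is a minimal base-R sketch, separate from the code used in the rest of this project, that classifies one made-up flower by Euclidean distance and a majority vote among its k = 5 nearest neighbours. The `new_flower` measurements are hypothetical.

```
# Minimal illustration of the KNN idea on the built-in iris data
k <- 5
new_flower <- c(5.8, 3.0, 4.3, 1.3)   # hypothetical sepal/petal measurements

features <- as.matrix(iris[, 1:4])    # the four measurements for all 150 flowers

# Euclidean distance from the new flower to every observation
dists <- sqrt(rowSums(sweep(features, 2, new_flower)^2))

# Majority vote among the k closest flowers
neighbours <- order(dists)[1:k]
table(iris$Species[neighbours])
```

The species with the most votes in the resulting table is the prediction; the `knn()` and `caret::train()` calls used later do essentially this, plus tie-breaking and tuning of k.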
# Load Packages

First we load the appropriate packages into our **R** environment. For this we use the `library()` function and include the package names as arguments. Make sure to first install the packages using `install.packages()` if you haven't done so already.

```
# Here if you haven't installed these packages do so!
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("caret")
install.packages("gridExtra")
install.packages("GGally")
install.packages("gmodels")
install.packages("plotly")   # needed for ggplotly() below
install.packages("tibble")   # needed for as_tibble() below
```

Recall that you only need to run the code above once. Now run this to load the packages into your **R** environment.

```
library(data.table)
library(ggplot2)
library(ggfortify)
library(caret)
library(class)
library(gridExtra)
library(GGally)
library(RGraphics)
library(gmodels)
library(plotly)
library(tibble)
```

Next we load our data.

# Get Data

The [iris dataset](https://archive.ics.uci.edu/ml/datasets/Iris) is very popular in statistical learning and is readily available in base **R**. To load the dataset all we have to do is call `iris` using the `data()` and `attach()` methods. We convert it to a tibble, rename the columns, and run `head()` to get a quick glance at our data.

```
data(iris)
attach(iris)

iris_tb <- as_tibble(iris)
colnames(iris_tb) <- c('sepal_length', 'sepal_width', 'petal_length',
                       'petal_width', 'species')
head(iris_tb)
```

### Terminal Output

```
> head(iris_tb)
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```

The terminal output above shows six observations of our data, with five variables in total. The goal is to predict `species` as a function of the other four variables. Next we do some exploratory analysis.

# Exploratory Analysis

We begin our exploratory analysis by looking for relationships between our explanatory variables and the variable we want to predict. For this, we use `ggplot()` and `plotly` to generate scatterplots of sepal length (y-axis) against sepal width (x-axis), and of petal length (y-axis) against petal width (x-axis).

```
gg1 <- ggplot(iris_tb,
              aes(x = sepal_width, y = sepal_length,
                  shape = species, color = species)) +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour = "black"),
        axis.line.x = element_line(colour = "gray"),
        axis.line.y = element_line(colour = "gray")) +
  geom_point(size = 2) +
  labs(title = "Sepal Width Vs. Sepal Length")

ggplotly(gg1)
```

<iframe width="100%" height="800" frameborder="0" scrolling="no" src="//plot.ly/~raviolli77/63.embed"></iframe>

Next we plot the `petal length` vs the `petal width`!

```
gg2 <- ggplot(iris_tb,
              aes(x = petal_width, y = petal_length,
                  shape = species, color = species)) +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour = "black"),
        axis.line.x = element_line(colour = "gray"),
        axis.line.y = element_line(colour = "gray")) +
  geom_point(size = 2) +
  labs(title = "Petal Length Vs. Petal Width")

ggplotly(gg2)
```

<iframe width="100%" height="800" frameborder="0" scrolling="no" src="//plot.ly/~raviolli77/61.embed"></iframe>

The plots above show setosa to be the most distinguishable of the three species with respect to both sepal and petal attributes. We can infer that the `setosa` species will yield the fewest prediction errors, while the other two species, `versicolor` and `virginica`, might be harder to tell apart.
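The separation that is visible in these scatterplots can also be checked numerically. Here is a quick optional sketch, assuming the `iris_tb` tibble and renamed columns from the Get Data step, that computes the mean of each measurement by species:

```
# Mean of each measurement per species (uses iris_tb from the Get Data step)
aggregate(cbind(sepal_length, sepal_width, petal_length, petal_width) ~ species,
          data = iris_tb, FUN = mean)
```

Setosa's petal measurements in particular sit well below those of the other two species, which is exactly what the petal scatterplot suggests.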
Below is a plot that shows the relationships across all of our explanatory variables.

```
pairs <- ggpairs(iris_tb, mapping = aes(color = species), columns = 1:4) +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour = "black"),
        axis.line.x = element_line(colour = "gray"),
        axis.line.y = element_line(colour = "gray"))

pairs
ggplotly(pairs) %>% layout(showlegend = FALSE)
```

<iframe width="100%" height="800" frameborder="0" scrolling="no" src="//plot.ly/~raviolli77/65.embed"></iframe>

This plot condenses the data into a single view and gives an overarching picture of how the different attributes interact. It will also come in handy for other classification models such as **Linear Discriminant Analysis**, which is not covered in this project, but we include more statistical analysis of `Iris` in the GitHub repository.

# Model Estimation

The **K-Nearest Neighbors** algorithm predicts by majority vote: it measures the *Euclidean distance* from a new observation to every training observation, takes the k closest neighbors, and assigns the most prevalent class among them. Here is the [documentation](https://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html) for the `knn()` method from the `class` package.

We begin this section by randomly splitting the observations into a training set (80%) and a test set (20%); calling `set.seed()` first makes the random split reproducible. We use the `caret` package to create the split. We don't print the indices or the resulting sets here because the output would be long, so we recommend running the code yourself; once you do, you'll get a pretty good idea of what's going on.

```
set.seed(88)

# Creating an 80/20 split
trainIndex <- createDataPartition(iris_tb$species, p = .8,
                                  list = FALSE, times = 1)

training_set <- iris_tb[ trainIndex, ]
test_set     <- iris_tb[-trainIndex, ]
```

Using the `train()` function we let `caret` select the best model for us through resampling (bootstrapping by default): it fits the model for several values of k and keeps the one with the highest accuracy. We call the fitted object `fit`.

```
fit <- train(species ~ ., data = training_set, method = "knn")
```

Let's output the metrics:

```
fit
```

### Terminal Output

```
> fit
k-Nearest Neighbors 

120 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 120, 120, 120, 120, 120, 120, ... 
Resampling results across tuning parameters:

  k  Accuracy   Kappa    
  5  0.9680846  0.9516459
  7  0.9656736  0.9479064
  9  0.9669771  0.9498683

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
```

The terminal output above shows that our optimal **k = 5**, based on the `Accuracy` and `Kappa` values.

# Prediction Results

Now we predict on the test set so that we get an unbiased performance metric: we predict each test-set flower's species from its four measurements and see how well the model does on data it has never seen.

```
predict_test_set <- predict(fit, newdata = test_set)
```
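Before looking at the full cross table, a quick sanity check: a two-line sketch, assuming the `predict_test_set` and `test_set` objects created above, that computes the hold-out accuracy and error rate directly.

```
# Proportion of test-set flowers classified correctly
mean(predict_test_set == test_set$species)

# Corresponding hold-out error rate
mean(predict_test_set != test_set$species)
```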
Below we can see the results of our model in a cross table.

```
CrossTable(x = test_set$species, y = predict_test_set,
           prop.chisq = FALSE)
```

### Terminal Output

```
> CrossTable(x = test_set$species,
+            y = predict_test_set,
+            prop.chisq = FALSE)

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  30

                 | predict_test_set 
test_set$species |     setosa | versicolor |  virginica |  Row Total | 
-----------------|------------|------------|------------|------------|
          setosa |         10 |          0 |          0 |         10 | 
                 |      1.000 |      0.000 |      0.000 |      0.333 | 
                 |      1.000 |      0.000 |      0.000 |            | 
                 |      0.333 |      0.000 |      0.000 |            | 
-----------------|------------|------------|------------|------------|
      versicolor |          0 |          8 |          2 |         10 | 
                 |      0.000 |      0.800 |      0.200 |      0.333 | 
                 |      0.000 |      1.000 |      0.167 |            | 
                 |      0.000 |      0.267 |      0.067 |            | 
-----------------|------------|------------|------------|------------|
       virginica |          0 |          0 |         10 |         10 | 
                 |      0.000 |      0.000 |      1.000 |      0.333 | 
                 |      0.000 |      0.000 |      0.833 |            | 
                 |      0.000 |      0.000 |      0.333 |            | 
-----------------|------------|------------|------------|------------|
    Column Total |         10 |          8 |         12 |         30 | 
                 |      0.333 |      0.267 |      0.400 |            | 
-----------------|------------|------------|------------|------------|
```

We can see that the model predicted `virginica` for two flowers that were actually `versicolor`. This is consistent with our exploratory analysis: those two species were the least distinguishable of the three, so that is where we expected prediction errors to land. The row, column, and table proportions all restate the same result, namely that 2 of the 30 test observations were misclassified, a hold-out test error rate of 2/30 ≈ 0.067. We can also look at the resampling estimate of the error rate stored in the fitted model:

```
round(1 - fit$results$Accuracy[1], 4)
```

### Terminal Output

```
0.0319
```

So the bootstrap resampling estimate of the error rate is 0.0319. Note that this number comes from resampling the training set rather than from the hold-out test set, but either way the error is low; granted, the iris data are a very easy setting for **K-Nearest Neighbors** modeling!

# Conclusions

Our model separates the three species with very little error (2 of 30 test observations misclassified, and a bootstrap resampling error estimate of 0.0319), not bad! As this project shows, **K-Nearest Neighbors** modeling is fairly simple. For datasets with a small number of variables *KNN* is a viable method, but for datasets with many variables we run into the *curse of dimensionality*. Check this [stack exchange post](http://stats.stackexchange.com/questions/65379/machine-learning-curse-of-dimensionality-explained) for an explanation of the *curse of dimensionality*. We feel this project is a good introduction to training and test sets, which are very important components not just of data science but of statistical learning overall!
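As a quick standalone illustration of why the curse of dimensionality hurts nearest-neighbor methods (this sketch is ours, not part of the project code): with random points, the gap between the nearest and the farthest neighbor shrinks as the number of dimensions grows, so "nearest" carries less and less information.

```
# Standalone sketch: distances concentrate as dimensionality grows
set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  x <- matrix(runif(500 * d), nrow = 500)   # 500 random points in d dimensions
  dists <- as.matrix(dist(x))[1, -1]        # distances from the first point to the rest
  cat(sprintf("d = %4d   nearest / farthest = %.3f\n", d, min(dists) / max(dists)))
}
```

As d increases, the nearest/farthest ratio creeps toward 1, which is why KNN works well on a four-variable dataset like iris but degrades when there are many variables.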
