last hacked on Mar 11, 2019

# Machine Learning III: Supervised Learning

## Overview

We are going to read and analyze a csv file containing the famous Wisconsin Breast Cancer Dataset and feed it into a __classifier__, an algorithm that predicts the _class_, or label, of each data point in our observations. Classifiers are particularly common in supervised learning.

# Supervised Learning

In supervised learning our data is __labeled__, meaning the trainer provides the algorithm with desired input/output pairs (the output is sometimes referred to as the __supervisory signal__) to help it generate an inferred function that can map new cases to the right label. In other words, supervised learning is learning grounded in __ground truths__, using induction to approximate the relationship between input and output parameters.
With supervised learning you want to watch out for how you balance __model complexity__ against __overfitting__. You overfit whenever your function learns to fit your training data in a way that does not _generalize to unseen data_.

## Classification vs. Regression

The regression we learned previously is an example of a forecasting algorithm, which tries to predict a trend and generate information about the future. While useful for continuous or ordered values, most business-driven questions are not structured like this. A business will generally concern itself with questions such as whether a customer with $ X $ features is the optimal target for an ad campaign, and here it needs to group each customer into a class such as "likely buyer" or "unlikely buyer". These decision-generating classification problems are more directly tied to implementation for a business, so it naturally makes sense to work in these problem spaces with a classification model.

# K-Nearest Neighbors

K-Nearest Neighbors (or kNN for short) can be used to classify or perform regressions, though it sees much more widespread adoption in classification models. With kNN the outcome is easy to interpret, for moderately sized datasets the computation is not too cost-prohibitive, and it provides us with a fair bit of predictive power. This makes it a strong tool in the data scientist's toolkit for the right application.

In kNN we calculate the distance from the point we seek to classify to every point in our dataset, and the nearest neighbors, selected according to our $ k $ parameter, vote on the label of our input. The algorithm basically works out to:

1. Read and clean data
2. Select and initialize value of $ k $
3. Iterate from 1 to $ n $, the number of training data points, and:
    1. Calculate the (Euclidean) distance between each element of training data and the test data. Other distance metrics, such as the cosine and Chebyshev distances, are occasionally used.
    2. Sort the distances in ascending order
    3. Pull the first $ k $ elements from the sorted array to generate your voters
    4. Return the most frequent vote as your class for the test data.

```python
import numpy as np
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd
from urllib.request import urlopen
```

## Imports

Pandas is a dataframe library; we also use it to read our csv file.

Numpy gives us access to fast computations and a handy array module.

Urllib has the handy `urlopen` function, which we will use to access our dataset.

SKL preprocessing can be used for data scaling on our features.

SKL model selection creates training and testing samples and splits/shuffles the data to decrease bias.

SKL neighbors provides the k-Nearest Neighbors classifier.

# Opening our dataset

We are using the UCI breast cancer data set available at . I encourage you to explore that directory if you are interested in learning more about it.

```python
data_URL = ''
indices = ["id", "clump_thickness", "uniform_cell_size", "uniform_cell_shape",
           "marginal_adhesion", "single_epi_cell_size", "bare_nuclei", "bland_chromation",
           "normal_nucleoli", "mitoses", "class"]
df = pd.read_csv(urlopen(data_URL), names=indices)
```

# Data Exploration

A healthy practice in data science is to explore your data prior to performing any major analysis. Wise generals know a bit of scouting gives you the intel to plan out major operations efficiently. A first peek at our dataframe will inform us about how to structure our training data.
```python
print(df.head())
```

            id  clump_thickness  uniform_cell_size  uniform_cell_shape  \
    0  1000025                5                  1                   1
    1  1002945                5                  4                   4
    2  1015425                3                  1                   1
    3  1016277                6                  8                   8
    4  1017023                4                  1                   1

       marginal_adhesion  single_epi_cell_size bare_nuclei  bland_chromation  \
    0                  1                     2           1                 3
    1                  5                     7          10                 3
    2                  1                     2           2                 3
    3                  1                     3           4                 3
    4                  3                     2           1                 3

       normal_nucleoli  mitoses  class
    0                1        1      2
    1                2        1      2
    2                1        1      2
    3                7        1      2
    4                1        1      2

Here we can either drop our ids or use them to populate the index of the dataframe, since `id` identifies each sample, which makes the auto-generated index redundant. Dropping the ids when the alternative is a less informative auto-generated index seems like a waste, so I will err on the side of caution and set `id` as our index.

```python
# Alternative: discard the ids entirely
# df.drop(['id'], axis=1, inplace=True)
df.set_index('id', inplace=True)
print(df.head())
```

             clump_thickness  uniform_cell_size  uniform_cell_shape  \
    id
    1000025                5                  1                   1
    1002945                5                  4                   4
    1015425                3                  1                   1
    1016277                6                  8                   8
    1017023                4                  1                   1

             marginal_adhesion  single_epi_cell_size bare_nuclei  \
    id
    1000025                  1                     2           1
    1002945                  5                     7          10
    1015425                  1                     2           2
    1016277                  1                     3           4
    1017023                  3                     2           1

             bland_chromation  normal_nucleoli  mitoses  class
    id
    1000025                 3                1        1      2
    1002945                 3                2        1      2
    1015425                 3                1        1      2
    1016277                 3                7        1      2
    1017023                 3                1        1      2

Next up, let's read over the documentation provided in the directory I previously linked to see if the authors can help us parse the material. If you ignored my previous encouragement then you might wonder how I knew the indices and their names. It's in the docs, which state quite a few useful things.

```
6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

8. Missing attribute values: 16
```

    There are 16 instances in Groups 1 to 6 that contain a single missing
    (i.e., unavailable) attribute value, now denoted by "?".

So we now know that our label is `class`, that the features are integers scaled to lie between 1 and 10, and that there are unavailable attribute values that we need to clean and convert to numeric form prior to generating our classifier's training and test sets.

```python
# Swap the '?' placeholders for a numeric sentinel value
df.replace('?', -99999, inplace=True)
# bare_nuclei was read in as text because of the '?' entries; make it numeric again
df['bare_nuclei'] = pd.to_numeric(df['bare_nuclei'])
```

    9. Class distribution:

       Benign: 458 (65.5%)
       Malignant: 241 (34.5%)

Lastly, we can see that we have about twice as many benign as malignant growths, meaning that we should pay attention to how well our classifier learns each case.

Thanks to this exploration we have a pretty solid grasp on our dataset, which may have taken considerably longer if we hadn't taken a few moments to scan the pertinent documentation. Skip this step in future projects at your peril.

# Training our Model

Just like with our regression, switching over to numpy arrays should assist with the computation of our learning system, though we need to keep in mind that the `class` label is still in the dataframe and must be separated from the features. Let's create the input and output arrays to train our system. Since we already saw that all our features share the same 1 to 10 scale, preprocessing doesn't benefit us much and can be skipped.

```python
# Features: every column except the label
X = np.array(df.drop(columns=['class']))
# Labels: 2 for benign, 4 for malignant
y = np.array(df['class'])
```

Before our system votes, let us take a moment to consider what the classifier outputs: it votes on a label for each point in our test data. To score it, the reasonable thing is to check how accurate that voting was, so we print the fraction of test points it classified correctly.
```python
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)
```

    0.9785714285714285

It seems there were a few errors in our classifier, so let's take a more detailed look at how __confident__ our classifier was in its votes. Specifically, we're curious about the cases it got wrong.

```python
# Each row: [share of votes for class 2, share of votes for class 4, true label]
all_votes = np.column_stack([clf.predict_proba(X_test), y_test])
print(all_votes)
```

```
[[1.  0.  2. ]
 [1.  0.  2. ]
 [1.  0.  2. ]
 [0.4 0.6 2. ]
 [1.  0.  2. ]
 [0.  1.  4. ]
 [0.  1.  4. ]
 [1.  0.  2. ]
 ... this goes on for quite a while
```

Let's take a peek at the specific instances where our classifier messed up.

```python
for vote in all_votes:
    # The predicted label is whichever class collected more votes
    if vote[0] > vote[1]:
        result = 2
    else:
        result = 4
    # Print only the misclassified points
    if result != vote[2]:
        print(vote, result == vote[2])
```

    [0.4 0.6 2. ] False
    [0.6 0.4 4. ] False
    [0.4 0.6 2. ] False

It looks like each time the classifier was wrong it was in fact on the decision boundary, and one vote in the other direction would have swayed the result to the correct label. Let's compare that to the number of times our classifier wasn't 100% certain.

```python
for vote in all_votes:
    if vote[0] > vote[1]:
        result = 2
    else:
        result = 4
    # Print every point where the vote was not unanimous
    if abs(vote[0] - vote[1]) < 1:
        print(vote, result == vote[2])
```

    [0.4 0.6 2. ] False
    [0.6 0.4 4. ] False
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True
    [0.8 0.2 2. ] True
    [0.4 0.6 4. ] True
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True
    [0.4 0.6 2. ] False
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True
    [0.2 0.8 4. ] True

Even on the contentious data our classifier was usually fairly certain and correct: every mistake came from a 3-to-2 vote. In the future it might therefore make sense to disregard classifications decided by a 3-to-2 split, or to perform further analysis on them to reach a conclusion.
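To see where vote shares such as 0.4 and 0.6 come from, the voting step can be sketched in plain Python. This is only a rough illustration under stated assumptions, not scikit-learn's implementation: `knn_vote_shares` and `is_contentious` are hypothetical helpers, the toy points are made up rather than taken from the breast cancer data, and `k=5` is assumed because it is the default `n_neighbors` of `KNeighborsClassifier`. With five voters every share is a multiple of 0.2, so a 0.6/0.4 result means the label was decided by a single vote.

```python
import math
from collections import Counter


def knn_vote_shares(train_points, train_labels, query, k=5):
    """Return {label: share of the k nearest neighbors' votes} for one query point."""
    # Euclidean distance from the query to every training point, paired with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance only, then keep the labels of the k closest points as voters
    distances.sort(key=lambda pair: pair[0])
    voters = [label for _, label in distances[:k]]
    counts = Counter(voters)
    return {label: count / k for label, count in counts.items()}


def is_contentious(shares, k=5):
    """True when the winning label leads the runner-up by at most one vote."""
    ranked = sorted(shares.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    lead_in_votes = round((ranked[0] - runner_up) * k)
    return lead_in_votes <= 1


# Made-up 2D points labeled like the dataset: 2 = benign, 4 = malignant
train_points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
                (5, 5), (6, 6), (7, 6), (6, 7), (7, 7)]
train_labels = [2, 2, 2, 2, 2, 4, 4, 4, 4, 4]

clear_shares = knn_vote_shares(train_points, train_labels, (1.5, 1.5))
border_shares = knn_vote_shares(train_points, train_labels, (4.2, 4.0))

print(clear_shares, is_contentious(clear_shares))
print(border_shares, is_contentious(border_shares))
```

The first query sits deep inside the benign cluster and gets a unanimous vote, while the second sits between the clusters and is decided 3 to 2, the same kind of split behind every misclassification above. Note that `math.dist` needs Python 3.8 or newer.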
# Conclusions

kNN provides an easy-to-implement and interpretable way to create a supervised learner that labels our test data. To understand its limitations it would be reasonable to implement the algorithm from scratch.

# Credits

A large amount of inspiration is based on content from the Practical Machine Learning Tutorial created by YouTuber SentDex. Playlist:

