last hacked on Mar 15, 2019

**Note**: This tutorial makes use of LaTeX. Until official support is available on this site, click the link to run a script that renders the equations: [link and **activate MathJax render engine**][1]

[1]: javascript:(function(){if(window.MathJax===undefined){var%20script%20=%20document.createElement("script");script.type%20=%20"text/javascript";script.src%20=%20"";var%20config%20=%20%27MathJax.Hub.Config({%27%20+%20%27extensions:%20["tex2jax.js"],%27%20+%20%27tex2jax:%20{%20inlineMath:%20[["$","$"],["\\\\\\\\\\\\(","\\\\\\\\\\\\)"]],%20displayMath:%20[["$$","$$"],["\\\\[","\\\\]"]],%20processEscapes:%20true%20},%27%20+%20%27jax:%20["input/TeX","output/HTML-CSS"]%27%20+%20%27});%27%20+%20%27MathJax.Hub.Startup.onload();%27;if%20(window.opera)%20{script.innerHTML%20=%20config}%20else%20{script.text%20=%20config}%20document.getElementsByTagName("head")[0].appendChild(script);(doChatJax=function(){window.setTimeout(doChatJax,1000);MathJax.Hub.Queue(["Typeset",MathJax.Hub]);})();}else{MathJax.Hub.Queue(["Typeset",MathJax.Hub]);}})();

# Machine Learning IV: K-Nearest Neighbor

## Overview

We are going to build a k-nearest neighbor (kNN) algorithm from scratch, unit test it with a toy case, and then read and analyze a CSV file containing the famous Wisconsin Breast Cancer Dataset. Since there are many more ways to implement this algorithm than our regression, it's important to start emphasizing good design, and to improve our implementations by paying attention to our dataset, our performance, and the organization of our code.

To refresh, the k-nearest neighbor workflow is:

1. Read and clean the data.
2. Select and initialize the value of $k$.
3. Iterate from 1 to $n$, the number of training data points, and:
    1. Calculate the (Euclidean) distance between each element of the training data and the test data.
    2. Sort the distances so the lowest values come first.
    3. Pull the first $k$ elements from the sorted array to generate your voters.
4.
Return the most frequent vote as the class for the test data.

Parts 1 and 2 will be handled in the main loop.

```python
# Homemade k-nearest neighbor algorithm
import numpy as np
import warnings
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter

style.use('fivethirtyeight')
```

## Imports

- NumPy gives us access to fast computations and a handy array module.
- Warnings provides a method to send warnings to the user.
- Counter is useful for simplifying the iterations we are going to take.
- PyPlot is handy for our plotting needs.

# Classifier Analysis

## Distance computation

As we start to generate our own homemade kNN classifier we notice there are two distinct components to the computation. First a series of pairwise distances is computed across all of our features, and then we run a vote among the nearest neighbors. Our kNN needs to compute distances to run the vote, so to promote reusability we should use a method of computing distance that supports n-dimensional data. Naively this means we can take the square root of the Pythagorean sum of the component-wise differences of the two vectors:

$$ D = \sqrt{\sum_{i=1}^n \left(v_i^{(1)}-v_i^{(2)}\right)^2} $$

We can do this by computing the square root of the summed squares with some handy list comprehension, wherein we span all indices.

```python
def basic_dist(pt1, pt2):
    return sqrt(sum([(pt1[i] - pt2[i])**2 for i in range(len(pt1))]))
```

Here we already run into one point where we should be hesitant: we are nesting a loop into our main computation. Another way we can manage this computation more quickly is by computing the norm of the difference between the two vectors, thereby reducing the explicit loops:

$$ D = \Delta \overline{v} = \left| \overline{v}_1 - \overline{v}_2 \right| $$

Since we are talking about performance, NumPy is a good place to start: we'll turn our data into NumPy arrays and compute the norm of the difference vector.
```python
# using numpy's norm to generate distance
def np_dist(pt1, pt2):
    return np.linalg.norm(np.array(pt1) - np.array(pt2))
```

We instantiate NumPy arrays that serve as the vectors in the norm computation. Here we reduce the dimensions of the problem by norming (by default `np.linalg.norm` computes the Euclidean 2-norm, i.e. the square root of the sum of squared components). By using a nifty little decorator I've packaged here (github url) I'm going to pull the vote out of our class to do some extra demonstrating.

## Picking $k$

With kNN fundamentally being a voting classifier, we introduce a very realistic possibility that our vote can tie if we're not careful about defining the value of $k$. To make sure the classifier doesn't tie, the best course of action is to set $k$ to an odd number larger than the number of groups:

$$ k = 2n + 1 > g $$

with $n$ an integer and $g$ the number of groups in our classifier. We compute distances and append them to an array, sorting them when we are ready to vote. The easiest way to vote is on a sorted list: since our kernel is the minimum of a monotonic function (the square root), selecting the first $k$ elements of the sorted list minimizes the distance of the votes. By counting the most common element in this collection we easily recover the vote outcome.

Finding the optimal value for $k$ can be a bit tricky. In general you can observe that as you increase $k$ your decision boundaries become smoother and coarser.

<img src="" alt="k-criteria" width="50%">

At the edge case of $k=1$ your decision boundary seems to perfectly match your data; however, if we examine the accuracy we'll notice we have overfitted our parameters, which is likely to quite hamper our algorithm's efficacy.
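The tie rule above can be illustrated with `Counter` directly; this is a small sketch with hypothetical vote lists, not data from the tutorial:

```python
from collections import Counter

# with two groups and an even k, the nearest neighbors can split evenly
tied_votes = ['r', 'r', 'k', 'k']          # k = 4, g = 2
print(Counter(tied_votes).most_common(1))  # reports a "winner" even though the vote tied

# an odd k with two groups guarantees a strict majority
odd_votes = ['r', 'k', 'r']                # k = 3 = 2n+1 > g = 2
winner, count = Counter(odd_votes).most_common(1)[0]
print(winner, count)                       # r 2
```

Note that `most_common` silently picks one of the tied groups, which is exactly why we must avoid ties by construction.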
```python
# kNN parent function
def kNN(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K can result in a tie if used with this many groups')
    distances = []
    for group in data:  # recall that data is a dictionary, so group iterates over its keys
        for feature in data[group]:
            distances.append([np_dist(feature, predict), group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_outcome = Counter(votes).most_common(1)[0][0]
    return vote_outcome
```

# Unit Testing

Time to see if there are any issues with our code via a quick test. While we could put together an intricate sampling algorithm to generate a nice decision boundary, a quicker example illustrates our classifier.

```python
# dummy data
data = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}
new_feat = [4, 6]

result = kNN(data, new_feat)
print(result)

[[plt.scatter(ii[0], ii[1], s=100, color=i) for ii in data[i]] for i in data]
# same as:
##for i in data:
##    for ii in data[i]:
##        plt.scatter(ii[0], ii[1], s=100, color=i)
plt.scatter(new_feat[0], new_feat[1], marker='*', s=100, color=result)
```

    r

<img src="">

# Revisiting the Breast Cancer Dataset

Now that we are confident about our system we can scale up to a larger model, like the breast cancer dataset from tutorial 3. We are going to need to import a few libraries to analyze the data and visualize it. Since our data has numerous features it makes sense to graph each feature against each other axis, and fortunately Pandas has an extension of matplotlib cooked into it that allows us to quickly get a high-level view of our data.
```python
from urllib.request import urlopen
import random
import pandas as pd
# pandas has a useful scatter matrix to help visualize this high-dimensional dataset;
# may also try to generate it in matplotlib
from pandas.plotting import scatter_matrix

style.use('seaborn')

data_URL = ''
indices = ["id","clump_thickness","uniform_cell_size","uniform_cell_shape",
           "marginal_adhesion","single_epi_cell_size","bare_nuclei","bland_chromation",
           "normal_nucleoli","mitoses","class"]
df = pd.read_csv(urlopen(data_URL), names=indices)
```

## Cleaning

Our first decision is whether to drop the unique identifiers or to make them our index. You will see in the following sections that keeping the index gives us a much greater degree of freedom to reference our data. After that we can examine the unique values and determine what cleaning needs to be applied.

```python
# we can either drop the patient ids or make them the index; let's make them the index
df.set_index('id', inplace=True)

for col in df:
    print(col, df[col].unique())
```

    clump_thickness [ 5 3 6 4 8 1 2 7 10 9]
    uniform_cell_size [ 1 4 8 10 2 3 7 5 6 9]
    uniform_cell_shape [ 1 4 8 10 2 3 5 6 7 9]
    marginal_adhesion [ 1 5 3 8 10 4 6 2 9 7]
    single_epi_cell_size [ 2 7 3 1 6 4 5 8 10 9]
    bare_nuclei ['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
    bland_chromation [ 3 9 1 2 4 5 7 8 6 10]
    normal_nucleoli [ 1 2 7 4 5 3 10 6 9 8]
    mitoses [ 1 5 4 2 3 7 10 8 6]
    class [2 4]

We have unknown values in our bare nuclei column, so we've reached an impasse: should we remove all unknown data, or figure out how to work with it? Recall that kNN classifies based on distance, so if we take the missing values and assign them an exceptionally far distance from the rest of our data, they essentially form a separate cluster of points that does its own classification on inbound data. So let's replace `'?'` with `-99999`. Notice the single quotes around the values in our "bare_nuclei" column.
This indicates the system is treating these values as strings as opposed to floats. In fact, our entire dataset is being treated as integers, so let's conform to that convention for the moment.

```python
df.replace('?', -99999, inplace=True)
df["bare_nuclei"] = df["bare_nuclei"].astype(int)

for col in df:
    print(col, df[col].unique())
```

    clump_thickness [ 5 3 6 4 8 1 2 7 10 9]
    uniform_cell_size [ 1 4 8 10 2 3 7 5 6 9]
    uniform_cell_shape [ 1 4 8 10 2 3 5 6 7 9]
    marginal_adhesion [ 1 5 3 8 10 4 6 2 9 7]
    single_epi_cell_size [ 2 7 3 1 6 4 5 8 10 9]
    bare_nuclei [ 1 10 2 4 3 9 7 -99999 5 8 6]
    bland_chromation [ 3 9 1 2 4 5 7 8 6 10]
    normal_nucleoli [ 1 2 7 4 5 3 10 6 9 8]
    mitoses [ 1 5 4 2 3 7 10 8 6]
    class [2 4]

## Training and Testing

Now that our data is uniform, numerical, and extrapolated as necessary, there are a few ways we can prepare the dataset for training and testing. Naively, if we are working with distances it might make sense to create a list with our data points and indices, shuffle it, then feed the test data into the algorithm while storing the exact sequence of indices so we can later feed our results back into the dataframe. To that end we can `zip` our indices and values before shuffling, splitting, and assigning. Zipping iterates over each index–value pair and creates a tuple, which we store in a list called `full_dataset`. Feeding `*full_dataset` (with the prepended `*` unpacking operator) back into `zip` inverts the first zip operation.
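The zip/unzip round trip can be sketched in isolation; the rows and patient ids here are made up purely for illustration:

```python
values = [[5, 1], [3, 4], [6, 8]]   # toy feature rows
ids = [101, 102, 103]               # toy patient ids

# pair each row with its id so a shuffle keeps them aligned
paired = list(zip(values, ids))
print(paired[0])                    # ([5, 1], 101)

# unpacking with * feeds each tuple back into zip, inverting the first zip
values_back, ids_back = map(list, zip(*paired))
print(values_back == values and ids_back == ids)  # True
```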
```python
test_size = 0.2
full_dataset = list(zip(df.values.tolist(), list(df.index)))

# swap data around
random.shuffle(full_dataset)

# split back into data and index
full_dataset, full_index = map(list, zip(*full_dataset))

train_set = {2: [], 4: []}
test_set = {2: [], 4: []}

test_data = full_dataset[:int(test_size*len(full_dataset))]
test_index = full_index[:int(test_size*len(full_dataset))]
train_data = full_dataset[int(test_size*len(full_dataset)):]
train_index = full_index[int(test_size*len(full_dataset)):]

# hard-coded dictionaries such that each key is a class and the values are features
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

print(train_set[2][0])
print(test_set[2][0])
```

    [2, 1, 1, 1, 2, 1, 3, 1, 1]
    [3, 3, 2, 1, 2, 3, 3, 1, 1]

### Thinking in dataframes

This was my first attempt at the problem. However, we should recall that the point of a dataframe is to structure our data and serve as a useful medium for manipulating the dataset. With that in mind I opted to overhaul this data-prepping stage. Instead of creating all these lists and dictionaries and zipping and unzipping indices and values, we can generate one list of indices that defines the order in which we pull values from the data, shuffle it, and take the first 20% as our test subset. This means we skip all the extra memory allocation and operations from making lists. For larger datasets this might be the difference between crashing and running smoothly.
```python
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}

full_index = list(df.index)
random.shuffle(full_index)

test_index = full_index[:int(test_size*len(full_index))]
train_index = full_index[int(test_size*len(full_index)):]

for i in list(df.loc[train_index, df.columns].values):
    train_set[i[-1]].append(i[:-1])
for i in list(df.loc[test_index, df.columns].values):
    test_set[i[-1]].append(i[:-1])
```

Now that the dataset is processed we can initialize the classifier and save our results. Here it makes sense to attach our test indices so that we can append the results back to our dataframe.

```python
# for scorekeeping
correct, total = 0, 0

# kNN time!
votes = []
for group in test_set:
    for data in test_set[group]:
        vote = kNN(train_set, data, k=3)
        votes.append(vote)
        if group == vote:
            correct += 1
        total += 1

# map each test index to its vote
votes = dict(zip(test_index, votes))
print("Accuracy:", correct/total)
```

    Accuracy: 0.9877300613496932

Our accuracy of 94-98% (depending on the shuffle) provides a relatively accurate first glimpse at the data, and is on par with the sklearn implementation of the previous section as far as accuracy is concerned, which is reassuring. The matter of relative performance will be addressed in the next module.

# Visualizing Results with Pandas Scatter Matrix

To visualize this multi-dimensional dataset it can benefit us to plot many low-resolution projections in order to take a "squint" at the results of our analysis. A scatter matrix from the pandas library is an easy-to-use, high-level implementation that gives us a matplotlib scatter plot for every pair of numerically valued columns in our dataset.
```python
df["results"] = pd.Series(votes)

# define colors by (class, result) pairs:
# (2,0): train benign, (4,0): train malignant,
# (2,2): test benign RIGHT, (2,4): test benign WRONG,
# (4,4): test malignant RIGHT, (4,2): test malignant WRONG
color_wheel = {(2,0): "#000000", (4,0): "#AA0000",
               (2,2): "#00FF00", (2,4): "#FF00FF",
               (4,4): "#0000DD", (4,2): "#FF0000"}

df["results"].fillna(0, inplace=True)
df["class"] = df["class"].astype(float).values
df["colors"] = df.apply(lambda x: color_wheel[x["class"], x["results"]], axis=1)

for col in df:
    print(col, df[col].unique())
```

    clump_thickness [ 5 3 6 4 8 1 2 7 10 9]
    uniform_cell_size [ 1 4 8 10 2 3 7 5 6 9]
    uniform_cell_shape [ 1 4 8 10 2 3 5 6 7 9]
    marginal_adhesion [ 1 5 3 8 10 4 6 2 9 7]
    single_epi_cell_size [ 2 7 3 1 6 4 5 8 10 9]
    bare_nuclei [ 1 10 2 4 3 9 7 -99999 5 8 6]
    bland_chromation [ 3 9 1 2 4 5 7 8 6 10]
    normal_nucleoli [ 1 2 7 4 5 3 10 6 9 8]
    mitoses [ 1 5 4 2 3 7 10 8 6]
    class [2. 4.]
    results [0. 2. 4.]
    colors ['#000000' '#AA0000' '#00FF00' '#0000DD' '#FF0000' '#FF00FF']

Dropping `"colors","class","results"` from the plotted columns (though we do use the colors column to color each point) lets us create our first scatter matrix.

```python
scatter_matrix(df.drop(["colors","class","results"], axis=1),
               color=df["colors"], figsize=(15, 15), diagonal='kde')
```

<img src="">

Our data isn't very interpretable in this form. With integer values for our data we've essentially masked each earlier data point whenever a later point shares the same two coordinates on a given graph. If we consider our range of 0 to 10, then an integer-only representation can have a maximum of $ 11^2 = 121 $ unique data points per graph. Our dataset has over 600 points, so we could only see 20% of our data at best, and more realistically we are seeing on the order of 10% per graph. Additionally, we have no way of reliably visualizing data density with this scheme, so our solution should address this problem too.
Worse still, the identity of every point drawn under the last value is covered, so our interpretation of the classifier is quite likely not representative of the data.

## Improve visibility with a little noise?

To resolve these issues, we can insert noise into the visualization in a process called *jittering*. By inserting noise with a fractional value, we let our points deviate by a small amount to span the rest of the figure. By using a uniform distribution with $ |\delta| \leq 0.5 $ we span the whole continuum of numbers in our figure. A uniform distribution with this jitter means that the chance that one point is suppressed by another is $ \left(\frac{1}{N}\right)^2 $ per figure and $ \left(\frac{1}{N}\right)^9 $ overall, where $N$ is the number of possible states of your random noise kernel. To be more accurate about this problem with regard to _visualization_, we could consider $N$ to be the number of possible dots whose _cross-section_ would completely fill a box encompassing our data, but that is a complex way to approach the problem that requires a lot of relatively useless fine-tuning to get the math right.
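A minimal sketch of the jitter kernel itself, assuming the uniform $[-0.5, 0.5]$ distribution described above (the helper name `jitter` is just for illustration):

```python
import random

def jitter(value, width=0.5):
    # add uniform noise so integer-valued points spread across their cell
    return value + random.uniform(-width, width)

random.seed(0)  # reproducible demo
sample = [jitter(5) for _ in range(1000)]
# every jittered point stays within half a unit of the original value,
# so neighboring integer values never overlap
print(min(sample) >= 4.5 and max(sample) <= 5.5)  # True
```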
```python
# before we add noise we can pull the cluster of unknown values back in to improve the overall visualization
df.replace(-99999, -3, inplace=True)

df_noise = pd.DataFrame().reindex_like(df)
for col in df_noise.drop(["colors","class","results"], axis=1):
    df_noise[col] = df_noise.apply(lambda x: random.uniform(-.5, +.5), axis=1)

print(df_noise.drop(["colors","class","results"], axis=1).head())
```

             clump_thickness  uniform_cell_size  uniform_cell_shape  \
    id
    1000025         0.322660          -0.055592            0.265712
    1002945        -0.108393          -0.092970            0.116542
    1015425        -0.385150           0.484096            0.148421
    1016277        -0.027083          -0.268905           -0.127269
    1017023         0.151548           0.164880           -0.385867

             marginal_adhesion  single_epi_cell_size  bare_nuclei  \
    id
    1000025           0.310932              0.380684    -0.308248
    1002945          -0.476200              0.324911     0.202803
    1015425           0.317468             -0.106422    -0.406428
    1016277          -0.391186             -0.088818     0.412015
    1017023          -0.173984             -0.189758     0.278725

             bland_chromation  normal_nucleoli   mitoses
    id
    1000025          0.068818         0.082130 -0.142768
    1002945          0.366056        -0.251566  0.383092
    1015425          0.413878        -0.490812  0.402324
    1016277         -0.056861         0.137514 -0.148469
    1017023         -0.291780         0.266494  0.263958

Lastly, we should renormalize our `'?'` data to make the `bare_nuclei` panel interpretable. By putting it on the negative axis we can still represent those points as a separate cluster of classifiers.

```python
scatter_matrix(df.drop(["colors","class","results"], axis=1) + df_noise.drop(["colors","class","results"], axis=1),
               alpha=0.99, color=df["colors"], figsize=(15, 15), diagonal='kde')
```

<img src="">

# Conclusions

While constructing the k-nearest neighbor classifier from scratch there was a chance to fine-tune the distance computation and add some parallelization. In general, implementing an algorithm presents many choices, and building some intuition for selecting methods that decrease computation time or improve memory efficiency is going to be exceptionally important as your dataset's size and complexity increase.
Developing a more robust visualization for multi-dimensional datasets was helpful for seeing higher-order data, but it requires attention to detail; the question of how to improve the efficacy of your visualization should always be prioritized, and even with the adjustments made to our scatter matrix there is much room for improvement.

# Credits

A large amount of inspiration is based on content from the Practical Machine Learning Tutorial created by YouTuber SentDex. Playlist:


keep exploring!
