# INTRO TO MACHINE LEARNING I: THE BIG IDEA

###### A BASIC RECIPE USING SCI-KIT LEARN

PYTHON, SCI-KIT LEARN

EASY

last hacked on Jul 15, 2019

## Note
This tutorial uses LaTeX. Until official support is available on this site, click the link to run a script that renders the equations: [link and **activate MathJax render engine**][1] (as a bookmark, this link is reusable on any site).

# Machine Learning I: How do machines learn?

### The short answer
You teach them or they teach themselves. Tutorial done, have a good day.

### Okay, but really
Maybe that's too short for an answer. Fundamentally, with a conventional computer program you provide the system an input, give it a set of rules to process that input, and it generates an output. In machine learning you instruct the system more implicitly: you show it examples and let it work out the rules. Arthur Samuel said machine learning is the "field of study that gives computers the ability to learn without being explicitly programmed." Since you are here for a tutorial, you are probably interested in the long way anyway, so let's get down to business.

## Overview
This is part 1 of a series of tutorials that will cover a family of basic learning techniques, focused on building learning systems with Sci-Kit Learn. When we are dealing with prebuilt libraries, we will use familiar datasets that data scientists frequent. We will also write the algorithms from scratch, since that is where most of your understanding and intuition will come from. Topics will include linear regression, K Nearest Neighbors, Support Vector Machines (SVM), flat clustering, hierarchical clustering, and neural networks. Because computers were a bit of a luxury until recently, most introductory materials in machine learning were written at a time when a theoretical pedagogy was favored. Here we'll focus instead on building intuition by building practical systems.

## Two Major Paradigms
With a learning algorithm we are generally engaging in at least one of these two paradigms:

Supervised learning: we give our learning system labeled examples, so it can be told whether its predictions are correct.
Unsupervised learning: we have our system draw inferences without us providing a labeled training set.

While we will dive further into supervised learning in part III, for now it is sufficient to know that regression is a supervised learning algorithm.

# Linear Regressions
If you are dealing with linear data, a regression lets you collapse the relation to a familiar form.

$$y_i = m x_i + b$$

Here our intercept $b$ is referred to as the bias. Based on how closely our data fits the trendline at each point, we can generate a measure of error called squared error ($SE$), which is used to compute the coefficient of determination $r^2$.

$$r^2 = 1 - \frac{SE_{\hat{y}}}{SE_{\bar{y}}}$$

Here $SE_{\hat{y}}$ is the squared error of the regression line's predictions $\hat{y}$, and $SE_{\bar{y}}$ is the squared error of simply predicting the mean $\bar{y}$ everywhere. We'll lay all that out when we make our regression from scratch. SVM will be covered later; here it suffices to say that we are going to compare our regression and its accuracy on the dataset with an alternative.

# The Code: Big Ideas
As a supervised learning model, a regression system (like all SL systems) takes _features_ and outputs _labels_ that correspond to our insights of interest. We are going to read and analyze a csv file containing the daily performance of Google stock and feed it into an algorithm that predicts the __label__ (the __response variable__ that describes the insight we are seeking) from the __features__ (also called __regressors__ or __explanatory variables__) of each observation. Regression is one of the most common forms of supervised learning.
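To make the $r^2$ formula from the Linear Regressions section concrete, here is a minimal plain-Python sketch. The data points and the helper names (`fit_line`, `squared_error`, `r_squared`) are made up for illustration and are not the from-scratch implementation this series builds later; the slope and intercept use the standard least-squares formulas.

```python
# Toy data (made up for illustration): y is roughly 2*x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Least-squares slope m and intercept (bias) b for y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    m = covariance / variance
    b = y_bar - m * x_bar
    return m, b


def squared_error(ys, predictions):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


def r_squared(xs, ys):
    """r^2 = 1 - SE(regression line) / SE(mean line)."""
    m, b = fit_line(xs, ys)
    y_hat = [m * x + b for x in xs]      # predictions from the fitted line
    y_bar = [mean(ys)] * len(ys)         # "predict the mean everywhere" baseline
    return 1 - squared_error(ys, y_hat) / squared_error(ys, y_bar)


m, b = fit_line(xs, ys)
r2 = r_squared(xs, ys)
print(f"m={m:.3f} b={b:.3f} r^2={r2:.4f}")
```

An $r^2$ close to 1 means the line explains almost all of the variation around the mean; a value near 0 means it does no better than predicting the mean.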
[1]:javascript:(function(){if(window.MathJax===undefined){var%20script%20=%20document.createElement("script");script.type%20=%20"text/javascript";script.src%20=%20"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_HTML";var%20config%20=%20%27MathJax.Hub.Config({%27%20+%20%27extensions:%20["tex2jax.js"],%27%20+%20%27tex2jax:%20{%20inlineMath:%20[["$","$"],["\\\\\\\\\\\$$","\\\\\\\\\\\$$"]],%20displayMath:%20[["$$","$$"],["\\\$","\\\$"]],%20processEscapes:%20true%20},%27%20+%20%27jax:%20["input/TeX","output/HTML-CSS"]%27%20+%20%27});%27%20+%20%27MathJax.Hub.Startup.onload();%27;if%20(window.opera)%20{script.innerHTML%20=%20config}%20else%20{script.text%20=%20config}%20document.getElementsByTagName("head")[0].appendChild(script);(doChatJax=function(){window.setTimeout(doChatJax,1000);MathJax.Hub.Queue(["Typeset",MathJax.Hub]);})();}else{MathJax.Hub.Queue(["Typeset",MathJax.Hub]);}})();

```python
# imports and setting graph style
import math
import numpy as np
import pandas as pd
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
import datetime

style.use('ggplot')
```

# Imports
- Pandas is a dataframe library; we also use it to read a csv file of Google stock prices
- Math gives us access to a general Python mathematics utility library
- Numpy gives us access to fast computations and a handy array module
- Datetime gives us access to a fully featured date and time utility

### Sci-Kit Learn
Sci-Kit Learn is a well-maintained toolkit for mining data, training with a variety of algorithms and kernels, and achieving good results in machine learning without too much effort.
- SKL preprocessing is used here for scaling our features
- SKL model_selection is used to create training and testing samples, splitting and shuffling the data to decrease bias
- SKL SVM is here to show an alternative model
- SKL LinearRegression is the regression model we train

### Matplotlib
- pyplot is a MATLAB-like plotting framework
- style allows us to quickly modify the appearance of our plots using a template

# Setting up our Dataframe
Our first order of business is to read the data from a source file. Here I'm using a csv acquired from Kaggle that describes Google's stock prices. Source is at: https://www.kaggle.com/hanumanstark/google-stock-prices/data

```python
# Read the csv, naming the columns ourselves
df = pd.read_csv('GOOGL.csv',
                 names=["Date", "Adj. Open", "Adj. High", "Adj. Low",
                        "Adj. Close", "Close", "Adj. Volume"])
# Index the rows by date
df.set_index('Date', inplace=True)
```

Here we set our index to be the date column in the csv. Now it's time to peek inside and see what we're looking at. It's always a healthy practice to explore the data a bit before rushing to create a production system.

```python
print(df.head())
```

```
             Adj. Open   Adj. High    Adj. Low  Adj. Close       Close  \
Date
2009-05-22  198.528534  199.524521  196.196198  196.946945  196.946945
2009-05-26  196.171173  202.702698  195.195190  202.382385  202.382385
2009-05-27  203.023026  206.136139  202.607605  202.982986  202.982986
2009-05-28  204.544540  206.016022  202.507507  205.405411  205.405411
2009-05-29  206.261261  208.823822  205.555557  208.823822  208.823822

            Adj. Volume
Date
2009-05-22      3433700
2009-05-26      6202700
2009-05-27      6062500
2009-05-28      5332200
2009-05-29      5291100
```

## Cleaning our Data
Cleaning data is where we put on our thinking caps and consider the implications of the data we have and the label we'd like to generate. Our data follows an upward trend, which makes the particular values less important for training. In the world of stocks you are concerned with the volatility of a stock (often on a particular day), which involves the max and min values (high and low), as well as the opening and closing prices and their change.
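As a quick worked illustration using the first row printed above (2009-05-22), the day's high-to-low range and its open-to-close move can each be expressed as a percentage of price. This is only a hand calculation on one row; the variable names are chosen for this sketch.

```python
# First row of the printout above (2009-05-22).
adj_open, adj_high, adj_low, adj_close = 198.528534, 199.524521, 196.196198, 196.946945

# Intraday range as a percentage of the close: a rough volatility measure.
hl_pct = (adj_high - adj_low) / adj_close * 100.0

# Open-to-close move as a percentage of the open: the day's overall direction.
pct_change = (adj_close - adj_open) / adj_open * 100.0

print(f"range: {hl_pct:.2f}%  change: {pct_change:.2f}%")
# prints: range: 1.69%  change: -0.80%
```

Percentages like these stay in a similar range whether the stock trades at 200 or 1,000, which is what makes them easier to scale than raw prices.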
To refine our problem and reduce the dimensionality, as well as make this data relatively well behaved and easy to scale, we can focus on percent differentials that describe the volatility and the overall trend of the day.

```python
# Keep only the adjusted columns; .copy() avoids pandas' SettingWithCopyWarning
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']].copy()
# High-low spread as a percentage of the close (volatility)
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
# Open-to-close move as a percentage of the open (trend of the day)
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
```

Taking these, and retaining our close (which encodes knowledge about the upward trend of the data), allows us to construct a label from the dataset.

```python
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
```

# Forecasting
As mentioned before, we are curious about how the price at the end of a future day can be predicted from information available today, so Adj. Close makes a natural source for the label of our supervised system. If we are in the business of predicting, then our first decision is how far we would like to peer into the future. Our label is therefore a forecast column: a copy of Adj. Close offset by the number of days ahead we want to forecast.

```python
forecast_col = 'Adj. Close'
# Replace missing values with an extreme sentinel value
df.fillna(value=-99999, inplace=True)
# Forecast 1% of the dataset's length into the future
forecast_out = int(math.ceil(0.01 * len(df)))
# Each row's label is the close forecast_out rows later
df['label'] = df[forecast_col].shift(-forecast_out)
print(0.01 * len(df), "days")
```

```
23.35 days
```

Here our predictions reach out 1% of the dataset's length, 23.35 rows, which `math.ceil` rounds up to 24 days, or roughly three weeks of predictions. We also replace all invalid fields with -99999, which prevents empty-field errors while still letting the model treat those cells as outliers.

## Training our model
Now that we are getting into the computationally demanding part, let's prep our training dataset. We are going to use numpy arrays because of how well they perform computationally.
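Before building the arrays, it may help to see what `shift(-forecast_out)` does to the last rows of the data. The sketch below mimics it on a plain Python list of made-up closing prices (no pandas involved, and a 20% horizon instead of 1% so the toy list stays short): the final `forecast_out` rows have no future price to use as a label, so they cannot be used for training.

```python
import math

closes = [100.0 + i for i in range(10)]            # made-up closing prices
forecast_out = int(math.ceil(0.2 * len(closes)))   # 20% of 10 rows -> 2 rows ahead

# Equivalent in spirit to df[forecast_col].shift(-forecast_out):
# row i is labeled with the close forecast_out rows later,
# and gets None where no future row exists.
labels = [
    closes[i + forecast_out] if i + forecast_out < len(closes) else None
    for i in range(len(closes))
]

# Rows with a label can be trained on; rows without one are what we forecast from.
trainable = [(c, l) for c, l in zip(closes, labels) if l is not None]
unlabeled = [c for c, l in zip(closes, labels) if l is None]

print(forecast_out, labels[:3], unlabeled)
```

In the tutorial's dataframe the same thing happens with NaN in place of `None`, which is why the unlabeled tail is set aside and the NaN rows are dropped before training.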
Obviously we don't want to include the label as a feature, or else our model is going to learn to find the label using the label, which doesn't make much sense for an algorithm intended to predict the label.

```python
# Features are every column except the label
X = np.array(df.drop(columns=['label']))
# Standardize each feature to zero mean and unit variance
X = preprocessing.scale(X)
```

```python
# The last forecast_out rows have no label yet: these are what we forecast from
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
print(X[range(100, 2000, 300)])
```

```
[[-1.01641427 -0.746068    0.48386614  0.99320209]
 [-0.89015962 -1.22537949  0.22712695 -0.48926729]
 [-0.86180307 -1.2500331  -0.09751171 -0.22484167]
 [-0.38346763 -0.45530186  0.97613    -0.2205003 ]
 [ 0.19602207 -0.85862944  0.47009931 -0.96193397]
 [ 0.3153021  -0.13103011  0.92403588 -0.41014582]
 [ 0.86568705  0.70003195  1.3146781  -0.62786553]]
```

Let's make sure to drop our NaN rows, which are the last `forecast_out` rows that the shift left without a label.

```python
df.dropna(inplace=True)
```

Now all we need to do is define our label so that we can partition a training set and train our model.

```python
y = np.array(df['label'])
```

When it comes to training, we need to retain a large enough test set that we can judge how accurate our model is. Here I chose a train-to-test ratio of 80:20 and saved the score reported by our model.

```python
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

```
0.9796825258049572
```

98% sounds pretty good. Note that for a linear regression this score is the coefficient of determination $r^2$, not a percentage of correct predictions. Keep in mind, too, that our data is quite monotonic and well behaved, which makes it easy for a linear regression to make predictions. Before I started preparing to make millions gaming the market, I'd relegate these simple models to paper trading and take any advice they give with a grain of salt until I had a more specialized model. A linear regression weights past data too heavily to be likely to produce useful outcomes on something as volatile as day trading.
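One way to see why a high $r^2$ on steadily rising data deserves that grain of salt is to score a forecast that does no learning at all. The sketch below uses a synthetic, made-up price series (not the Google data) and a "do-nothing" prediction that simply repeats today's price as the forecast for 24 days later; the series shape and the names `naive` and `naive_r2` are assumptions for illustration.

```python
import math

# Synthetic, steadily rising "price" series (made up; not the Google data).
prices = [100.0 + 0.5 * t + 5.0 * math.sin(t / 7.0) for t in range(1000)]
horizon = 24  # days ahead, mirroring forecast_out in this tutorial

actual = prices[horizon:]     # the price `horizon` days later
naive = prices[:-horizon]     # "prediction": just repeat today's price

# r^2 = 1 - SE(predictions) / SE(mean of the actual values)
mean_actual = sum(actual) / len(actual)
se_model = sum((a - p) ** 2 for a, p in zip(actual, naive))
se_mean = sum((a - mean_actual) ** 2 for a in actual)
naive_r2 = 1 - se_model / se_mean

print(f"r^2 of the do-nothing forecast: {naive_r2:.3f}")
```

On a series like this the do-nothing forecast already scores very close to 1, because the long-term trend dominates the variance. A baseline of this kind is a useful yardstick before reading much into any model's score on trending prices.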
Let's also compare this to another method available in Sci-Kit Learn, say the support vector machine, which will get its own full tutorial in the upcoming entries.

```python
# compare other algorithm kernels in SVM to this
for k in ['linear', 'poly', 'rbf', 'sigmoid']:
    X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X, y, test_size=0.2)
    clf2 = svm.SVR(kernel=k, gamma='scale')
    clf2.fit(X2_train, y2_train)
    # score the SVR we just trained (clf2), not the linear regression (clf)
    accuracy2 = clf2.score(X2_test, y2_test)
    print(k, accuracy2)
```

The output recorded for this tutorial was:

```
linear 0.9832107065035292
poly 0.9806614401924786
rbf 0.9834028527143208
sigmoid 0.98205710151984
```

These figures were produced by scoring `clf`, our linear regression, on each fresh split, so what they show is that the linear regression's $r^2$ holds at about 0.98 however the data is shuffled. With the loop scoring `clf2` as written above, each kernel reports its own $r^2$, and those values may differ from the ones listed here. Either way, a score this stable suggests that our linear regression isn't really doing anything particularly special to pull insights out of this system. Google stock goes up in general. Who would've figured?

# Visualizing our results
We haven't actually seen our prediction yet, and for something like this it makes much more sense to observe the variation relative to our past data in a chart. Remember that visualizing and presenting your results in a readily digestible form is about the only way that data science becomes useful for informing future decisions. So let's get to PyPlot-ing.

```python
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
```

The Forecast column will hold the y-axis values that our model generated: the `forecast_out` days of future predictions.

```python
last_date = pd.to_datetime(df.iloc[-1].name)
# print(last_date)
last_unix = last_date.timestamp()
one_day = 86400  # seconds in a day
next_unix = last_unix + one_day
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    # new row: NaN for every existing column, the prediction in 'Forecast'
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]
```

The loop above generates the dates corresponding to the forecast prices. We seed it with the last value of the index, the date of the last adjusted closing price, and iterate through the predictions, assigning each one to the next day.
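The date-stepping idea can also be sketched with `datetime.timedelta`, which avoids converting to and from Unix timestamps. This is a standalone illustration with a hypothetical last date and hypothetical forecast values, not a drop-in replacement for the dataframe loop above.

```python
import datetime

last_date = datetime.date(2019, 1, 2)      # hypothetical last date in the index
forecast_values = [101.5, 102.0, 102.4]    # hypothetical predictions

# Pair each forecast with the next calendar day, stepping with timedelta
# instead of adding 86400 seconds to a Unix timestamp.
forecast_rows = [
    (last_date + datetime.timedelta(days=offset), value)
    for offset, value in enumerate(forecast_values, start=1)
]

for day, value in forecast_rows:
    print(day.isoformat(), value)
```

Like the loop in the tutorial, this steps through calendar days, so weekends and market holidays are not skipped.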
In order to better visualize the data, I sliced it to contain approximately 1 year's worth of stock information.

```python
# Plot roughly the last year of closing prices alongside the forecast
df['Adj. Close'][-365:-1].plot()
df['Forecast'][-365:-1].plot()
plt.legend(loc=2)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
```

<img src='https://i.imgur.com/MnKozr0.png'>

# Conclusion
sklearn is an easy first step to creating a learning system, but a lot has been abstracted away. To go beyond being a script monkey, gain the intuition necessary to use algorithms effectively, and level up as a data scientist, it's high time we coded the algorithm from scratch. In the next section we'll put together a regression algorithm ourselves.

# Credits
A large amount of inspiration is pulled directly from the Practical Machine Learning Tutorial created by YouTuber SentDex. Playlist: https://www.youtube.com/watch?list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&v=OGxgnH8y2NM