last hacked on Jul 15, 2019

**Note:** This tutorial makes extensive use of LaTeX; until official support is available on this site, you may need a MathJax bookmarklet to render the equations.

# Machine Learning II: Our Homebrew Regression

With a regression, we are figuring out whether X and Y have any structural relationship. For a linear regression, this relationship takes the form of the function:

$$ y=mx+b $$

The slope $m$ and bias $b$ describe our "best fit line", and the optimal values for these parameters in a linear regression are:

$$ m=\frac{\bar{x}\cdot\bar{y}-\overline{xy}}{\overline{x}^2-\overline{x^2}} $$

$$ b=\bar{y}-m\bar{x} $$

$\bar{x}$ signifies that we are computing the mean of the operand (everything under the bar). To program this efficiently and maximize performance, let's import `mean` from `statistics` and represent our data as numpy arrays.
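Before writing the full script, here is a quick by-hand sanity check of these formulas (a sketch with made-up points lying exactly on $y=2x+1$, so the formulas should recover that slope and intercept):

```python
from statistics import mean

# made-up points lying exactly on y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

x_bar, y_bar = mean(xs), mean(ys)
xy_bar = mean(x * y for x, y in zip(xs, ys))   # mean of the products
x2_bar = mean(x * x for x in xs)               # mean of the squares

m = (x_bar * y_bar - xy_bar) / (x_bar ** 2 - x2_bar)
b = y_bar - m * x_bar
print(m, b)  # → 2.0 1.0
```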
```python
# this is our homemade linear regression script
from statistics import mean

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')
```

# Imports

We are using `mean` from the `statistics` library to analyze numpy arrays of our independent and dependent variables. Matplotlib will be used for visualizations, and the handy `ggplot` style will allow us to quickly get off the ground with styling.

# Building our Algorithm

## Regression Analysis

To build a bit of intuition, let's start by looking at the denominator, as it has fewer variables to examine. Do $\overline{x}^2$ and $\overline{x^2}$ have any definite relationship that informs us about the sign of this expression (i.e.: is it always positive/negative)? As functions of $x$ alone, if we took a second and pictured a set of data points at every integer value of $x$, the value of $\overline{x}^2 - \overline{x^2}$ would be completely independent of the $y$ values, and therefore tells us nothing about the correlation itself. This means that the denominator's sign has to be independent of the dataset. Specifically, if we use the following definition

$$ \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i $$

a convenient relation known as Jensen's inequality allows us to determine

$$ \overline{x^2} = \frac{1}{n} \sum_{i=1}^n x_i^2 \geq \frac{1}{n^2}\left( \sum_{i=1}^n x_i \right)^2 = \overline{x}^2 $$

The proof relies on showing that there is a residual term in our expansion for $\overline{x^2}$ if we extract $\overline{x}^2$ from it.
$$
\begin{align}
\sum_{i=1}^n x_i^2 &= \sum_{i=1}^n (x_i - \overline{x} + \overline{x})^2 \\
&= \sum_{i=1}^n \left( (x_i - \overline{x})^2 + 2 \overline{x} (x_i - \overline{x}) + \overline{x}^2 \right) \\
&= \sum_{i=1}^n (x_i - \overline{x})^2 + 2 \overline{x} \left( \sum_{i=1}^n x_i - n \overline{x} \right) + n \overline{x}^2 \\
&= n \overline{x}^2 + \sum_{i=1}^n (x_i - \overline{x})^2 \\
&\ge n \overline{x}^2
\end{align}
$$

Dividing by $n$, this means that $\overline{x}^2-\overline{x^2}$ is always negative (strictly speaking, non-positive; it is zero only when every $x_i$ is identical), so let's keep that in mind as we examine the far more complicated relation in the numerator: $\bar{x}\cdot\bar{y}-\overline{xy}$. To examine a few cases using our previous analysis, consider perfect positive correlation ($y=x$) and perfect negative correlation ($y=-x$). In those cases we recreate our denominator exactly (or our denominator times $-1$), and the $\overline{xy}$ term is similarly larger than the first term. In other cases this term isn't always larger, but recall that a residual term existed in the previous proof:

$$ ... + \sum_{i=1}^n (x_i - \overline{x})^2 $$

and a similar series of residual terms involving products of $x$ and $y$ exists in the general case. The terms whose independent variables are farthest from the center of our data disproportionately skew this sum even when the $y$ values are the same, so the recovered slope depends most strongly on how the edges of the dataset behave relative to its center. This all sounds just like what we would want to hear as far as analysis of a linear regression goes.

The bias $b$ easily falls out of reorganizing our linear equation: take the determined regression slope and draw a line through the mean point $(\bar{x}, \bar{y})$ until we reach the y-intercept.

## Coding our Regression

We can almost trivially generate the slope and intercept computer-side with this function.
```python
def best_fit_slope_and_intercept(xs, ys):
    m = ((mean(xs) * mean(ys)) - mean(xs * ys)) / ((mean(xs) ** 2) - mean(xs ** 2))
    b = mean(ys) - m * mean(xs)
    return m, b
```

With `m` and `b` generated, if we feed our algorithm an x we can generate a y.

```python
def predict_y(m, predict_x, b):
    return m * predict_x + b
```

## $r^2$ Error or Coefficient of Determination

### Squared Error

Our squared error is the difference between a predicted term $f(x_i)$ (which is basically just an interpolation from our trend line at $x_i$) and the actual data point $y_i$. The difference is squared both to remove sign dependence and to introduce a disproportionate penalty for larger deviations.

$$ SE = \sum_{i=1}^n (f_i - y_i)^2 $$

Frequently the mean of this sequence is extracted to generate the __mean squared error__

$$ MSE = \frac{1}{n}\sum_{i=1}^n (f_i - y_i)^2 $$

but in our case the summation of all our terms in $SE$ works well enough, since we are going to normalize the squared error in the process of computing the coefficient of determination.

### Computing $r^2$

$r^2$ tells us the deviation of the data based on the ratio of the squared error of the best fit regression line relative to the squared error of the mean of the y values (a line of slope 0 through the average y value).

$$ r^2 = 1 - \frac{SE\,\hat{y}}{SE\,\overline{y}} $$

This means that if we take our model and compare it to the mean value of our system (which could be thought of as about the worst useful approximation of our data we could manage), we are measuring proportionately how much better our approximation is. For example, the only way to recover $r^2 = 1$ is when $SE\,\hat{y}=0$, which means we have absolutely no deviations when we sum over all terms in the $f_i$'s and $y_i$'s. This corresponds to a perfect modelling of the data without any error.
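To make the ratio concrete, here is a small example with arbitrary made-up data and model predictions, computing $r^2$ directly from the two squared errors:

```python
import numpy as np

# arbitrary made-up data and model predictions
ys = np.array([1.0, 3.0, 2.0, 5.0])
ys_fit = np.array([1.2, 2.6, 3.0, 4.8])

se_fit = np.sum((ys - ys_fit) ** 2)      # squared error of the model
se_mean = np.sum((ys - ys.mean()) ** 2)  # squared error of the mean line
r_squared = 1 - se_fit / se_mean
print(round(float(r_squared), 3))  # → 0.858
```

The model's squared error is a small fraction of the mean line's, so $r^2$ lands close to 1.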
In the edge case where we recover $m=0$, our error will indicate $SE\,\hat{y}=SE\,\overline{y}$ and $r^2 = 0$, which might indicate we should look into a higher order regression. More generally, a small $r^2$ means our regression line explains little more of the variation than the mean line does, since the two squared errors are nearly the same.

### But Why $r^2$ and Least Squares for Error?

I mentioned that squared error is convenient in that it both removes signs and punishes larger deviations, however that might not be motivating enough a reason for one to accept its utility. If we wanted to harshly punish deviation we could pick a larger exponent for our error, like cubed error, and we could just take the absolute value of any error kernel we summed over. However, there is a strong statistical significance to this choice when you consider its relationship to the **normal distribution**:

$$ \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2} $$

$\mu$ and $\sigma$ are the expectation and standard deviation respectively. When independent random variables are added, the **Central Limit Theorem** tells us the normalized sum of the system converges on the normal distribution. This is the reason that when you flip a fair coin many times, the distribution of outcomes eventually converges to a "bell curve". Given how frequently the normal distribution pops up, the **Gauss–Markov theorem** shows that as long as our source of error isn't _biased_, i.e.: the expectation of our error is 0 (and the errors are uncorrelated with equal variance), the **best linear unbiased estimator** is constructed by least squares.

## Coding our Error

By now you've become a math wiz and it's just a matter of chugging through a few lines of code to define our methods. For Python noobies, the `**` operator between two numerical arguments computes a power. For square roots, `math.sqrt()` is faster though, since it's a C implementation of the square root.
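A quick scratch illustration of both spellings:

```python
import math

print(2 ** 10)       # exponentiation: 1024
print(9 ** 0.5)      # a square root computed with **: 3.0
print(math.sqrt(9))  # the C-implemented square root: 3.0
```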
**Note:** This notation requires you feed it two operands; the `**` operator applied to a single argument in a function call instead does something completely different, unpacking a mapping into keyword arguments (kwargs), which is something you will see in a future tutorial.

```python
def squared_error(ys_original, ys_line):
    return sum((ys_original - ys_line) ** 2)

def get_r_squared(ys_original, ys_line):
    ys_mean = [mean(ys_original) for y in ys_original]
    return 1 - squared_error(ys_original, ys_line) / squared_error(ys_original, ys_mean)
```

# Unit Testing our Work

Unit testing is an essential skill for any programmer: we validate that each segment of code we've prepared performs according to our designed specifications. It involves testing the cases we intend to input to guarantee our output is well-behaved, and it can also reveal edge cases that result in bugged behavior, verifying overall accuracy. We are going to feed random values prepared by a method to our previous functions and analyze whether the regression that results matches our intuition.

```python
# Our goal is to now unit test the regression to
# determine if there are any errors
import random
```

For our dummy data we should populate our set with an array of random values that roughly generate a trendline of our choosing.
```python
# unit function testing script
# NOTE: "covariance" here doesn't correspond exactly to the statistical term
# (the variance controls that); it shapes the angle of the correlation line
def dummy_data(count, variance, step=2, covariance=False):
    val = 1
    xs = []
    ys = []
    for i in range(count):
        xs.append(i)
        ys.append(val + random.randrange(-variance, variance))
        # I want step to determine slope more so than the covariance,
        # so we use normalized values for covariance
        if covariance and -1 <= covariance <= 1:
            val += step * covariance
        elif covariance:
            print("Invalid covariance given, assuming 0")
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)
```

Let's test the cases where we have a strong positive correlation, a zero-correlation system with lower variance, and a very high variance but strongly negative covariant solution. We'll generate $r^2$ and plot our graphs. A nice little shorthand we utilize is to build our function's parameters as lists; when we pass one to the function, we unpack it into positional arguments via the `*` operator.
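As a tiny illustration of that unpacking shorthand (using a throwaway function, not part of the regression script):

```python
# a throwaway function just to show * unpacking a list into positional arguments
def show_params(count, variance, step=2, covariance=False):
    return count, variance, step, covariance

fparam = [40, 10, 2, 1]
print(show_params(*fparam))  # → (40, 10, 2, 1)
```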
```python
for fparam in [[40, 10, 2, 1], [40, 5, 2, 0], [40, 40, 2, -1]]:
    xs, ys = dummy_data(*fparam)
    m, b = best_fit_slope_and_intercept(xs, ys)
    regression = [m * x + b for x in xs]
    r_squared = get_r_squared(ys, regression)
    print("r-squared:", r_squared)

    # generate a prediction based on our regression and plot it
    predict_x = 40

    # Plotting our results
    plt.scatter(xs, ys, color='#000066', label='data')
    plt.plot(xs, regression, label='regression')
    plt.scatter(predict_x, predict_y(m, predict_x, b), color='#006600', label='prediction')
    plt.legend(loc=4)
    plt.show()
```

    r-squared: 0.9476643130193614

    r-squared: 0.02054988670733826

    r-squared: 0.36420642500042966

# Conclusion

Our homemade linear regression can generate a prediction and an $r^2$ measure of error, and lets us glean new insights into what is occurring under the hood of a linear regression. For example, even when our data has significantly more variance, a large covariance can drown out the noise and let us find at least a moderate correlative effect, indicating that this algorithm is a robust first step in understanding and extrapolating from our dataset.

# Credits

A large amount of inspiration is based on content from the Practical Machine Learning tutorial created by YouTuber SentDex. Playlist:


