This project focuses on finding the <em>best</em> linear regression model to estimate the median value (in $1000's) of owner-occupied homes in the suburbs of Boston.
Understanding house prices (and related variables) is relevant to economists and policy-makers when allocating budgets for domestic purposes. Linear regression is a fundamental concept in statistical learning, from which other, more sophisticated methods arise. Although it may not be the most precise method, the statistical foundation on which this procedure is built is worth exploring.
For this project we leverage the power of R for both the statistical computations as well as the visualizations.
The data used in this project is available at the UCI Machine Learning Repository. This modeling problem is referenced in the 2014 version of the book "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
We encourage you to replicate this project and submit your own contributions. You can fork this project on GitHub.

# Environment Setup
First we load the necessary packages into our **R** environment using the `library()` method. We will need the packages `usdm` to test for *multicollinearity*, `car` to diagnose outliers, `MASS` to compute studentized residuals, `DAAG` to cross-validate our models, and `lmtest` to check for heteroskedasticity, along with `ggplot2`, `ggfortify`, and `GGally` for the visuals. It is crucial that you first install these packages using `install.packages()` if you haven't done so already.
## Installing packages
# RUN EACH OF THESE COMMANDS ON YOUR CONSOLE:
install.packages("usdm")
install.packages("car")
install.packages("MASS")
install.packages("DAAG")
install.packages("lmtest")
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("GGally")
## Loading Packages
library(usdm) # for testing collinearity
library(car) # for testing outliers
library(MASS) # for testing studentized residuals
library(DAAG) # for cross validation of model
library(lmtest) # for checking homoskedasticity/heteroskedasticity
library(ggplot2) # Use for visuals
library(ggfortify)
library(GGally)
## Loading Data
The Boston Housing dataset is very popular in the **R** community, and is easily accessible through the `MASS` package. To load this data we simply use the `data()` method. We use the `help()` method to get context on our data, and the `attach()` method to make the variable names directly accessible.
The variable `chas` takes only the values 0 and 1, so we turn it into a factor variable using the `as.factor()` method. Finally, we create our data frame using the `data.frame()` method for use in the rest of our project.
data(Boston)
help(Boston)
attach(Boston)
boston.df <- data.frame(Boston)
boston.df$chas <- as.factor(boston.df$chas)
boston.df$rad <- as.factor(boston.df$rad)
# Since rad seems to be an ordinal variable we add levels
# just to be safe!
levels(boston.df$rad) <- c(1, 2, 3, 4, 5, 6, 7, 8, 24)
str(boston.df)
Now we are ready for some exploratory analysis!
# Exploratory Analysis
We want to gain intuition about our data by figuring out which explanatory variables might be the most relevant in estimating `medv`, our response variable.
## Data Structure
Here we run our data frame through the `str()` method to see how many variables and observations we have, as well as a general feel for the data.
str(boston.df)
We can see that we have a total of 14 variables and 506 observations. We also see the data types for the different variables, as well as sample values.
### Terminal Output
Here is our output.
> str(boston.df)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9
## Scatterplot Matrices
Here we run our data frame through the `ggpairs()` method to get a feel for how our variables might be related to each other. To learn more about scatterplot matrices, visit this [r-bloggers article](https://www.r-bloggers.com/scatterplot-matrices/). For this presentation we are using `ggpairs()` from the `GGally` extension because it provides a cleaner visual than the regular `pairs()` function. We start with a subset of the variables, since using all of them at once produces a plot that is too crowded to read.
pairs <- ggpairs(boston.df,
columns = c(1, 2, 3, 4, 5, 14),
lower=list(continuous=wrap("smooth",
colour="turquoise4")),
diag=list(continuous=wrap("barDiag",
fill="turquoise4"))) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line.y = element_line(colour="gray"),
axis.line.x = element_line(colour="gray"))
pairs
<img src="http://i.imgur.com/pg7zzx9.png">
We now extend the `ggpairs()` function to the rest of the variables to get more information regarding the interaction of our variables. For that we just call up the variables which we did not include in the first iteration of `ggpairs()`:
pairs1 <- ggpairs(boston.df,
columns = c(6, 7, 8, 9, 10, 14),
lower=list(continuous=wrap("smooth",
colour="turquoise4")),
diag=list(continuous=wrap("barDiag",
fill="turquoise4"))) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line.y = element_line(colour="gray"),
axis.line.x = element_line(colour="gray"))
pairs1
<img src="http://i.imgur.com/sDFjNsr.png">
And finally we have:
pairs3 <- ggpairs(boston.df,
columns=c(11, 12, 13,14),
lower=list(continuous=wrap("smooth",
colour="turquoise4")),
diag=list(continuous=wrap("barDiag",
fill="turquoise4"))) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line.y = element_line(colour="black"),
axis.line.x = element_line(colour="black"))
pairs3
<img src="http://i.imgur.com/vOIZQ8Q.png">
It helps to click on the images to get a better picture of the correlation between variables.
Here we are looking for linear relationships between our response variable `medv` and the rest of the variables in our data frame. We can roughly infer that `rm`, `ptratio`, and `lstat` might be the most relevant in estimating `medv`. We also use the `ggpairs()` function to watch for *multicollinearity* among the explanatory variables. We do not want two highly correlated explanatory variables repeating the same information, because that risks over-fitting our model.
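As a quick numeric complement to the scatterplot matrices, we can print pairwise correlations for a handful of variables (a minimal sketch, assuming `MASS` is loaded as in the setup above; here we use the raw `Boston` data, where `rad` is still numeric):

```r
# Pairwise correlations between a few predictors and medv.
# High absolute values between predictors (e.g. tax and rad) hint at
# multicollinearity; high values with medv hint at useful predictors.
library(MASS)
data(Boston)
round(cor(Boston[, c("rm", "ptratio", "lstat", "tax", "rad", "medv")]), 2)
```

This table makes the patterns in the plots concrete before we commit to any model.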
We include more exploratory analysis using `ggplot2` to give you an idea of some of the pitfalls we might run into when using this data set, because the `ggpairs()` output shows that many of the interactions between variables are not linear, and some display distinct patterns.
## Variable Interaction
An interaction that is fairly simple to start with is between `lstat` and `tax`, since it gives us a healthy perspective on how `rad` influences the model. We use the `geom_point()` function to create a scatterplot, then color the points by `rad`, which (as we will see through the next few graphs) plays a large role in how variables interact. We assign a `ggplot()` object with the variables and then add `geom_point()` with the colour mapped to `rad`, as shown here:
taxMedv <- ggplot(boston.df,
aes(tax, lstat, colour = rad)) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray"))
taxMedv + geom_point()
<img src="http://i.imgur.com/TPOhsvB.png">
## Outliers
In this section we highlight possible outliers that might influence our linear model. We will go into more detail on this when we look at the residual diagnostics, especially *Cook's Distance* and *Leverage*! Here we plot `tax` vs `ptratio`, with the colour emphasizing `ptratio`, to show potential outliers. Since this is a two-variable plot we cannot know for sure whether these are actual outliers (our models will be multi-dimensional), but we include it to show what to look for in a lower-dimensional data set.
taxPtratio <- ggplot(boston.df,
aes(tax, ptratio, colour = ptratio))
taxPtratio + geom_point() +
ggtitle("TAX vs. PTRATIO Scatterplot") +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray"))
<img src="http://i.imgur.com/PzzlqT5.png">
## Distribution of Medv
Next we examine the distribution of `medv` to determine whether a transformation is necessary. We plot a histogram of `medv`, with a density overlay, using `ggplot2`.
medvHist <- ggplot(boston.df, aes(medv))
medvHist + geom_histogram(aes(y = ..density.., fill = ..count..),
colour = 'white', bins = 25) +
geom_density() +
scale_fill_gradient("Count", low = "black",
high = "turquoise4") +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Histogram of MEDV")
<img src="http://i.imgur.com/x34tlzw.png">
From this histogram we can see that the median price has a *right skewed distribution*! Thus we conclude that a *log transformation* will be needed.
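We can quantify the skew with a quick base-R computation (the `skewness()` helper below is our own one-liner for illustration, not from any of the loaded packages):

```r
library(MASS)
data(Boston)

# Sample skewness: positive values indicate a right-skewed distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(Boston$medv)       # clearly positive: right-skewed
skewness(log(Boston$medv))  # much closer to zero after the log transform
```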
Here we create the histogram of the *log transformation* to see the distribution!
# First we create a data frame with the log transformation
logMedv <- data.frame(log(boston.df$medv))
# Next we change the name of the column to make it easy
# for us to call in our ggplot2 visual
colnames(logMedv) <- c("logMedv")
logMedvHist <- ggplot(logMedv, aes(logMedv))
logMedvHist + geom_histogram(aes(y=..density.., fill=..count..),
colour = 'white', bins = 25) +
geom_density() +
scale_fill_gradient("Count", low = "black",
high = "turquoise4") +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Histogram of log(MEDV)")
<img src="http://i.imgur.com/0agQOKY.png">
We can see that the distribution is now much closer to symmetric, so we can proceed to building our linear model.
# Fitting Model
This is arguably the most engaging part--finding a linear regression model that best fits our data.
## Model 1: All in
For our first model we will go all in with our entire data set. That is, we will use all of our variables and observations in our data frame to predict `medv` using *linear regression* with the `lm()` method. Throughout the modeling phase, we will fine-tune our working data set using *backward selection*, which means that we will eliminate variables that are, statistically, the *least significant*.
We run our model through the `bptest()` method to check for *heteroskedasticity*. Roughly speaking, *heteroskedastic* data contains non-constant errors, which violates the assumptions for running a linear model. **In short, we do not want heteroskedastic data. Instead, we prefer homoskedastic data. Homoskedastic data contain errors that are randomly or normally distributed**.
Finally we run our model through the `vif()` method (for *variance inflation factor*) to check for *multicollinearity*. Roughly speaking, *multicollinearity* in a data set means that two or more explanatory variables are highly correlated, hence one of the variables can be linearly predicted as a function of the others. This yields problems in the context of explaining the response variable as a function of its explanatory variables. In short, we do not want *multicollinearity*.
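To make the *VIF* concrete: the VIF of predictor *j* is 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing predictor *j* on all the other predictors. Here is a hand-rolled sketch (the `vif_one()` helper is our own, for illustration only; the `usdm` and `car` implementations handle factors and model objects for us):

```r
library(MASS)
data(Boston)

# VIF for one predictor: regress it on the remaining predictors and
# convert the resulting R-squared into 1 / (1 - R^2).
vif_one <- function(df, var) {
  others <- setdiff(names(df), var)
  r2 <- summary(lm(reformulate(others, response = var), data = df))$r.squared
  1 / (1 - r2)
}
vif_one(Boston[, c("tax", "rad", "indus", "nox")], "tax")
```

A VIF near 1 means the predictor is nearly independent of the others; large values mean it is mostly redundant.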
**R tip**: when we use '.' in a model formula, we are including all remaining variables. This saves us from manually typing each variable (such as `boston.df$chas`) into our model.
# Fitting our first model
fit1 <- lm(log(medv) ~ ., data = boston.df)
summary(fit1)
bptest(fit1)
vif(fit1)
Below is our terminal output. Here we see the summary of our first model. To learn more about model summaries, visit this [r-bloggers article](https://www.r-bloggers.com/interpreting-regression-coefficient-in-r/). Notice the **Multiple R-squared of 0.7946**. In *multiple regression*, the R-squared is the square of the correlation between the response and the fitted (predicted) values. The higher our R-squared, the better our model is doing at explaining the response variable. We will need to run some diagnostics on our model before trusting our R-squared value.
### Terminal Output
> summary(fit1)
Call:
lm(formula = log(medv) ~ ., data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.73163 -0.10074 -0.01271 0.09153 0.86209
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0078476 0.2188374 18.314 < 2e-16 ***
crim -0.0102270 0.0013164 -7.769 4.75e-14 ***
zn 0.0015096 0.0005698 2.649 0.008333 **
indus 0.0027748 0.0025634 1.082 0.279575
chas1 0.1002734 0.0347642 2.884 0.004096 **
nox -0.7728738 0.1569242 -4.925 1.16e-06 ***
rm 0.0868708 0.0169635 5.121 4.39e-07 ***
age 0.0001981 0.0005327 0.372 0.710086
dis -0.0511276 0.0081318 -6.287 7.21e-10 ***
rad2 0.0833711 0.0595060 1.401 0.161838
rad3 0.1766617 0.0537784 3.285 0.001094 **
rad4 0.1002350 0.0478207 2.096 0.036595 *
rad5 0.1344828 0.0486359 2.765 0.005908 **
rad6 0.0961606 0.0589625 1.631 0.103565
rad7 0.2075191 0.0632793 3.279 0.001115 **
rad8 0.1939560 0.0600758 3.229 0.001329 **
rad24 0.3504602 0.0720382 4.865 1.55e-06 ***
tax -0.0005052 0.0001569 -3.221 0.001365 **
ptratio -0.0370069 0.0058188 -6.360 4.67e-10 ***
black 0.0004148 0.0001072 3.871 0.000123 ***
lstat -0.0291645 0.0020395 -14.300 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.189 on 485 degrees of freedom
Multiple R-squared: 0.7946, Adjusted R-squared: 0.7861
F-statistic: 93.81 on 20 and 485 DF, p-value: < 2.2e-16
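The claim that the R-squared equals the squared correlation between the response and the fitted values is easy to verify on any OLS fit with an intercept. A self-contained sketch on the built-in `mtcars` data:

```r
# For OLS with an intercept, R^2 equals cor(observed, fitted)^2.
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2_manual <- cor(mtcars$mpg, fitted(fit))^2
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```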
Below we see the terminal output from our `bptest(fit1)` script, which tests our model for *heteroskedasticity*. The null hypothesis of the *Breusch-Pagan test* is homoskedasticity, and the very small p-value leads us to reject it: there is evidence of *heteroskedasticity* in our model. The log transformation of `medv` mitigates this somewhat, but it is something to keep in mind, since heteroskedasticity makes the usual standard errors less reliable.
### Terminal Output
> bptest(fit1) # checking for homoskedasticity
studentized Breusch-Pagan test
data: fit1
BP = 77.592, df = 20, p-value = 1.002e-08
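For intuition, the studentized Breusch-Pagan statistic that `bptest()` reports is *n* times the R-squared from regressing the squared residuals on the model's predictors. A self-contained sketch on `mtcars` (compare its output against `lmtest::bptest(fit)`):

```r
# Studentized Breusch-Pagan by hand: if the squared residuals are
# predictable from the regressors, the error variance is not constant.
fit <- lm(mpg ~ wt + hp, data = mtcars)
aux <- lm(resid(fit)^2 ~ wt + hp, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
c(BP = bp_stat, p.value = p_value)
```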
Below we see the terminal output from our `vif(fit1)` script, which tells us about the *multicollinearity* of our model. *VIF* values higher than 5 are generally considered problematic. `rad24` has the highest *VIF* at 14.169, but it is a single level of the factor `rad`; among the remaining variables, `tax` has the highest *VIF* at 9.8769, so we drop it first.
### Terminal Output
> vif(fit1)
crim zn indus chas1 nox rm age
1.8119 2.4962 4.3705 1.1018 4.6729 2.0076 3.1776
dis rad2 rad3 rad4 rad5 rad6
4.1436 2.2654 2.8445 5.5092 5.8824 2.3996
rad7 rad8 rad24 tax ptratio black lstat
1.8410 2.3090 14.1690 9.8769 2.2427 1.3525 2.9976
## Model 2: Removing Collinear Variables (tax)
Now onto our second model; this time we are removing the tax variable (because of high *multicollinearity*). We use the `update()` method to recycle our first model while removing `tax`. Use `vif()` and `summary()` again to diagnose.
fit2 <- update(fit1, ~ . - tax)
vif(fit2)
summary(fit2)
We can see below that `rad24` still suggests *multicollinearity*, since its value is > 5; because we cannot remove a single level of a factor variable, we will drop `rad` entirely.
### Terminal Output
> fit2 <- update(fit1, ~ . - tax)
> vif(fit2) # Checking again for more multicollinearity
crim zn indus chas1 nox rm age
1.8116 2.4222 3.5828 1.0874 4.6638 2.0065 3.1689
dis rad2 rad3 rad4 rad5 rad6 rad7
4.1359 2.2251 2.8287 5.4929 5.8445 2.3074 1.8366
rad8 rad24
2.2972 8.3218
ptratio black lstat
2.2359 1.3516 2.9974
Below we get a summary of our second model. This time we are looking for variables that are not statistically significant, so that we can remove them. By convention, variables with a p-value larger than *.05* are considered not statistically significant. Notice the *p-value* in the terminal output below, listed as `Pr(>|t|)` for each variable. Let's start by dropping `age` and `indus` (along with `rad`, as discussed above).
### Terminal Output
> summary(fit2)
Call:
lm(formula = log(medv) ~ crim + zn + indus + chas + nox + rm +
age + dis + rad + ptratio + black + lstat, data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.73162 -0.10219 -0.01228 0.09216 0.86447
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9170473 0.2190964 17.878 < 2e-16 ***
crim -0.0101787 0.0013289 -7.659 1.02e-13 ***
zn 0.0011936 0.0005667 2.106 0.035711 *
indus -0.0007300 0.0023432 -0.312 0.755519
chas1 0.1130736 0.0348677 3.243 0.001264 **
nox -0.7951695 0.1582759 -5.024 7.12e-07 ***
rm 0.0881738 0.0171214 5.150 3.79e-07 ***
age 0.0001085 0.0005371 0.202 0.839957
dis -0.0522585 0.0082022 -6.371 4.35e-10 ***
rad2 0.1089429 0.0595399 1.830 0.067900 .
rad3 0.1895977 0.0541429 3.502 0.000505 ***
rad4 0.0918609 0.0482082 1.906 0.057305 .
rad5 0.1219047 0.0489441 2.491 0.013082 *
rad6 0.0589410 0.0583737 1.010 0.313133
rad7 0.1976450 0.0638115 3.097 0.002066 **
rad8 0.1801044 0.0604967 2.977 0.003055 **
rad24 0.2014158 0.0557378 3.614 0.000333 ***
ptratio -0.0380353 0.0058658 -6.484 2.20e-10 ***
black 0.0004239 0.0001081 3.920 0.000101 ***
lstat -0.0291049 0.0020590 -14.136 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1908 on 486 degrees of freedom
Multiple R-squared: 0.7902, Adjusted R-squared: 0.782
F-statistic: 96.34 on 19 and 486 DF, p-value: < 2.2e-16
## Model 3: Removing Non-Significant Variables (age, indus, rad)
Now we run our third model. For this iteration we have removed the `tax`, `age`, `indus`, and `rad` variables.
fit3 <- update(fit2, ~ . - age - indus - rad)
summary(fit3)
The summary of our model shows that our least significant variable, `zn`, has a p-value of *0.106*. This is slightly above the conventional cutoff but borderline, so we stop dropping variables here and move on to another set of diagnostics.
### Terminal Output
> summary(fit3)
Call:
lm(formula = log(medv) ~ crim + zn + chas + nox + rm + dis +
ptratio + black + lstat, data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.71003 -0.10155 -0.01094 0.09083 0.89539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7901595 0.1962986 19.308 < 2e-16 ***
crim -0.0083340 0.0012238 -6.810 2.83e-11 ***
zn 0.0008755 0.0005407 1.619 0.106052
chas1 0.1195298 0.0349829 3.417 0.000686 ***
nox -0.6852309 0.1302350 -5.261 2.13e-07 ***
rm 0.1061611 0.0164243 6.464 2.45e-10 ***
dis -0.0488011 0.0075982 -6.423 3.14e-10 ***
ptratio -0.0333366 0.0047273 -7.052 5.96e-12 ***
black 0.0003735 0.0001083 3.449 0.000610 ***
lstat -0.0287382 0.0019479 -14.753 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1947 on 496 degrees of freedom
Multiple R-squared: 0.7772, Adjusted R-squared: 0.7731
F-statistic: 192.2 on 9 and 496 DF, p-value: < 2.2e-16
## Diagnosing for Outliers and High-Leverage Points
Now we will run our model through `outlierTest()` to check for outliers and high-leverage points in our data. Linear regression is sensitive to *outliers* and *high-leverage points*, so we want to identify them before trusting the fit.
outlierTest(fit3, cutoff = Inf, n.max = 15)
Below we see the terminal output for the outlier test. Observations whose *studentized residuals* have an absolute value greater than 3 are considered problematic, so we remove them from the model.
### Terminal Output
> outlierTest(fit3, cutoff = Inf, n.max = 15)
rstudent unadjusted p-value Bonferonni p
413 4.828866 1.8331e-06 0.00092755
372 4.415447 1.2386e-05 0.00626750
369 4.054746 5.8286e-05 0.02949300
373 4.044845 6.0722e-05 0.03072500
402 -3.716832 2.2476e-04 0.11373000
375 3.606444 3.4192e-04 0.17301000
490 -3.579379 3.7835e-04 0.19144000
506 -3.449505 6.0959e-04 0.30845000
215 3.439297 6.3248e-04 0.32004000
401 -3.403025 7.2048e-04 0.36456000
368 3.033191 2.5469e-03 NA
410 2.991901 2.9108e-03 NA
398 -2.819016 5.0101e-03 NA
400 -2.775641 5.7179e-03 NA
366 2.702446 7.1196e-03 NA
We see that rows 413, 372, 369, 373, 402, 375, 490, 506, 215, and 401 meet the criterion for removal, as their studentized residuals have absolute values above 3.
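The same flagging can be done directly with base R's `rstudent()`. A sketch that refits model 3's formula so the snippet stands alone:

```r
library(MASS)
data(Boston)
Boston$chas <- as.factor(Boston$chas)

# Refit model 3's formula, then flag rows with |studentized residual| > 3.
fit <- lm(log(medv) ~ crim + zn + chas + nox + rm + dis + ptratio +
            black + lstat, data = Boston)
flagged <- which(abs(rstudent(fit)) > 3)
flagged
```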
## Model 4: Removing Outliers and High-Leverage Points
Next we update our data frame by removing the outlier observations, declare our fourth model, and get residual diagnostic plots using the `autoplot()` method to test our model.
boston.df <- boston.df[-c(413, 372, 369, 373, 402, 375, 490, 506, 215, 401),]
fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)
summary(fit4)
### Terminal Output
> fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)
> summary(fit4)
Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad, data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.54154 -0.09775 -0.01559 0.08727 0.66039
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.456e+00 1.693e-01 20.411 < 2e-16 ***
crim -8.426e-03 1.047e-03 -8.046 6.61e-15 ***
zn 6.405e-04 4.607e-04 1.390 0.16511
chas1 9.297e-02 3.009e-02 3.090 0.00212 **
nox -5.592e-01 1.121e-01 -4.987 8.54e-07 ***
rm 1.329e-01 1.426e-02 9.320 < 2e-16 ***
dis -4.128e-02 6.521e-03 -6.330 5.58e-10 ***
ptratio -3.229e-02 4.040e-03 -7.993 9.64e-15 ***
black 4.811e-04 9.358e-05 5.141 3.98e-07 ***
lstat -2.786e-02 1.740e-03 -16.014 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1651 on 486 degrees of freedom
Multiple R-squared: 0.829, Adjusted R-squared: 0.8259
F-statistic: 261.8 on 9 and 486 DF, p-value: < 2.2e-16
Here we get the summary of our model. It looks like `lstat` is our most significant variable, along with `rm`. Our *R-squared* is 0.829, and since it is reasonably close to 1, we can say that our model explains a large part of the variation in our response variable. Our *residual standard error* (**RSE**) is fairly small at 0.1651 (on the log scale of `medv`), which is good. Let's see if we can do better.
But first, let's take a look at a few residual diagnostic plots for our fourth model. To learn more about residual diagnostics, visit this [r-bloggers article](https://www.r-bloggers.com/model-validation-interpreting-residual-plots/). Given the curve in our *Residuals vs Fitted* plot below, we can infer that there is non-linearity in our data.
autoplot(fit4, label.size = 3, shape = 21,
data=boston.df, colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Residual Diagnostics for fit4")
<img src="http://i.imgur.com/8N31qIX.png">
For this reason we will include non-linear terms in our model in the next couple of iterations.
## Diagnosing Residuals (Checking for Normality of Residuals)
Another assumption of linear regression is normality of the residual terms. We test this using the `residuals()` and `ggplot()` methods, and fit a curve using the `geom_density()` method.
residBar <- ggplot(data=fit4, aes(residuals(fit4))) +
geom_histogram(aes(y =..density..),
col="black", fill="white") +
geom_density(col=1) +
theme(panel.background = element_rect(fill = "gray98"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour="gray"),
axis.line.x = element_line(colour="gray")) +
ggtitle("Histogram of Residuals for fit4")
residBar
<img src="http://i.imgur.com/qphAjST.png">
Here we see that the residuals of our model are approximately normally distributed, which is good, as this is a linear regression assumption.
## Model 5: Adding Non-Linear Terms
Now onto the fifth iteration of our model, with non-linear terms. Here we include an `lstat`-squared term by adding `I(lstat^2)` to the model. The residual diagnostics for model 4 suggested a non-linear relationship between the response variable and our predictors, and the exploratory analysis phase hinted at a non-linear relationship between `lstat` and `medv`.
fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
summary(fit5)
### Terminal Output
> fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
> summary(fit5)
Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2),
data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.59975 -0.09465 -0.01432 0.09241 0.64969
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.583e+00 1.701e-01 21.057 < 2e-16 ***
crim -8.997e-03 1.043e-03 -8.625 < 2e-16 ***
zn 3.381e-04 4.610e-04 0.733 0.463687
chas1 9.074e-02 2.967e-02 3.058 0.002352 **
nox -5.111e-01 1.113e-01 -4.594 5.55e-06 ***
rm 1.208e-01 1.440e-02 8.392 5.28e-16 ***
dis -4.047e-02 6.433e-03 -6.292 7.02e-10 ***
ptratio -2.979e-02 4.036e-03 -7.382 6.84e-13 ***
black 4.589e-04 9.245e-05 4.964 9.59e-07 ***
lstat -4.522e-02 4.809e-03 -9.403 < 2e-16 ***
I(lstat^2) 5.083e-04 1.316e-04 3.864 0.000127 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1628 on 485 degrees of freedom
Multiple R-squared: 0.8341, Adjusted R-squared: 0.8307
F-statistic: 243.9 on 10 and 485 DF, p-value: < 2.2e-16
In our model summary above we see that `I(lstat^2)` is indeed *statistically significant*, more so than most other explanatory variables. Our *R-squared* has increased to 0.8341 (which is good) and our **RSE** has decreased to 0.1628 (which is also good).
Given the significance of `I(lstat^2)` it may make sense to include a third-level polynomial for `lstat`. Let's try that in the next model. But first, let's take a look at our *Residuals vs Fitted* plot.
autoplot(fit5, label.size = 3, shape = 21,
data=boston.df, colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Residual Diagnostics for fit5")
<img src="http://i.imgur.com/cdbkLnS.png">
We can see that our model is doing a little better as far as linearity is concerned, compared to our previous model.
## Model 6: More Non-Linearity
Now onto model 6. Here we are adding `I(lstat^3)` to our model given the *statistical significance* of `I(lstat^2)`. We will run some diagnostics to see if this makes sense or if we have gone too far.
fit6 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + I(lstat^3), data = boston.df)
summary(fit6)
### Terminal Output
> fit6 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + I(lstat^3), data = boston.df)
> summary(fit6)
Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2) +
I(lstat^3), data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.59756 -0.09290 -0.01524 0.09406 0.64617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.639e+00 1.898e-01 19.167 < 2e-16 ***
crim -8.998e-03 1.044e-03 -8.622 < 2e-16 ***
zn 2.845e-04 4.682e-04 0.608 0.54365
chas1 8.922e-02 2.978e-02 2.996 0.00287 **
nox -5.158e-01 1.115e-01 -4.624 4.83e-06 ***
rm 1.176e-01 1.521e-02 7.732 6.18e-14 ***
dis -4.024e-02 6.446e-03 -6.243 9.42e-10 ***
ptratio -2.996e-02 4.046e-03 -7.405 5.86e-13 ***
black 4.606e-04 9.254e-05 4.977 8.98e-07 ***
lstat -5.291e-02 1.249e-02 -4.236 2.72e-05 ***
I(lstat^2) 1.026e-03 7.870e-04 1.304 0.19296
I(lstat^3) -9.954e-06 1.492e-05 -0.667 0.50497
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1629 on 484 degrees of freedom
Multiple R-squared: 0.8343, Adjusted R-squared: 0.8305
F-statistic: 221.5 on 11 and 484 DF, p-value: < 2.2e-16
Given its *p-value*, it looks like `I(lstat^3)` does not add much to the model. The *R-squared* value rises only marginally to 0.8343, while the **RSE** is essentially unchanged at 0.1629. We are not sure this is an improvement, as we have traded simplicity (or interpretability) in the model for very little gain in accuracy.
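A more formal way to judge whether the cubic term earns its keep is a nested-model F-test via `anova()`. A self-contained illustration on `mtcars` (for our own models, the analogous call would be `anova(fit5, fit6)`):

```r
# Nested-model F-test: does the extra polynomial term significantly
# reduce the residual sum of squares?
fit_a <- lm(mpg ~ wt, data = mtcars)
fit_b <- lm(mpg ~ wt + I(wt^2), data = mtcars)
anova(fit_a, fit_b)  # the Pr(>F) column tests the extra term
```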
Observing our residual diagnostics, there seems to be little difference between models 5 and 6.
autoplot(fit6, label.size = 3, shape = 21,
data=boston.df, colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Residual Diagnostics for fit6")
<img src="http://i.imgur.com/2oFqv63.png">
Now we are ready to move on to the model diagnostics stage, to determine which of our models rank best at estimating our response variable.
# Cross Validation
In this stage we determine which of our models is best at estimating our response variable while not overfitting our data. For this we use the `CVlm()` method. We will be comparing models `fit4`, `fit5`, and `fit6`.
## Cross Validation for Model 4
First we run our fourth model through the `CVlm()` method. We also get a set of attributes of our cross validation object.
par(mfrow = c(1,1))
fit4_CV <- suppressWarnings(CVlm(data = boston.df,
form.lm = formula(fit4),
m = 10,
dots = FALSE,
seed = 1,
plotit = c("Observed", "Residual"),
main = "Cross-validation for fit4",
legend.pos="topleft"))
attr(fit4_CV, 'ms')
Below is the cross validation plot of our fourth model. **R** outputs this plot when calling the `CVlm()` method. Roughly speaking, we want each of the fold symbols to be as close as possible to the lines on the plot. We call the `attr()` to get more details on the *mean squared error* of the fit.
<img src="http://i.imgur.com/8ZG4OYM.png">
Below is the `attr()` output for our model 4 cross validation object.
### Terminal Output
> attr(fit4_CV, 'ms')
[1] 0.0284
The relevant statistic we gather from the `CVlm()` and `attr()` methods is the *Mean Squared Error* (**MSE**). This value estimates the squared error our model would make on unseen data, simulated by holding out folds of our training set. Therefore, the smaller our **MSE**, the better. Our **MSE** for model 4 is 0.0284. Let's look at models 5 and 6.
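To demystify the `'ms'` attribute: it is essentially the average squared prediction error over held-out folds. A hand-rolled k-fold sketch (the `cv_mse()` helper is our own, shown on `mtcars`; `DAAG::CVlm()` will differ slightly because of its own fold assignment):

```r
# k-fold cross-validated mean squared error for an lm formula.
cv_mse <- function(form, data, k = 10, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_err <- sapply(seq_len(k), function(i) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(form, data = train)
    obs   <- model.response(model.frame(form, test))
    mean((obs - predict(fit, newdata = test))^2)
  })
  mean(fold_err)  # average of the per-fold MSEs
}
cv_mse(mpg ~ wt + hp, mtcars, k = 5)
```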
## Cross Validation for Model 5
We run our fifth model through the `CVlm()` method. We also get a set of attributes of our cross validation object.
par(mfrow = c(1,1))
fit5_CV <- suppressWarnings(CVlm(data = boston.df,
form.lm = formula(fit5),
m = 10,
dots = FALSE,
seed = 1,
plotit = c("Observed", "Residual"),
main = "Cross-validation for fit5",
legend.pos="topleft"))
attr(fit5_CV, 'ms')
Below is the cross validation plot of our fifth model, and its attributes.
<img src="http://i.imgur.com/rhDSv4j.png">
Below is the `attr()` output for our model 5 cross validation object, again specifically the *mean squared error*.
### Terminal Output
> attr(fit5_CV, 'ms')
[1] 0.0278
Our **MSE** for model 5 is 0.0278, definitely better than that of model 4.
## Cross Validation for Model 6
We run our sixth model through the `CVlm()` method. We also get a set of attributes of our cross validation object.
par(mfrow = c(1,1))
fit6_CV <- suppressWarnings(CVlm(data = boston.df,
form.lm = formula(fit6),
m = 10,
dots = FALSE,
seed = 1,
plotit = c("Observed", "Residual"),
main = "Cross-validation for fit6",
legend.pos="topleft"))
attr(fit6_CV, 'ms')
Below is the cross validation plot of our sixth model, and its attributes.
<img src="http://i.imgur.com/unSTpuG.png">
Below is the `attr()` output for our model 6 cross validation object.
### Terminal Output
> attr(fit6_CV, 'ms')
[1] 0.028
Our **MSE** for model 6 is 0.028, very close to our other models, but it seems `fit5` was our best model!
# Conclusions
With an *R-squared* value of 0.8341 (essentially tied with model 6) and the lowest cross-validated *Mean Squared Error* at 0.0278, model 5 is the winner. Let's take another look at a summary of model 5:
### Terminal Output
> fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
> summary(fit5)
Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2),
data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.5998 -0.0947 -0.0143 0.0924 0.6497
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.58e+00 1.70e-01 21.06 < 2e-16 ***
crim -9.00e-03 1.04e-03 -8.63 < 2e-16 ***
zn 3.38e-04 4.61e-04 0.73 0.46369
chas1 9.07e-02 2.97e-02 3.06 0.00235 **
nox -5.11e-01 1.11e-01 -4.59 5.6e-06 ***
rm 1.21e-01 1.44e-02 8.39 5.3e-16 ***
dis -4.05e-02 6.43e-03 -6.29 7.0e-10 ***
ptratio -2.98e-02 4.04e-03 -7.38 6.8e-13 ***
black 4.59e-04 9.24e-05 4.96 9.6e-07 ***
lstat -4.52e-02 4.81e-03 -9.40 < 2e-16 ***
I(lstat^2) 5.08e-04 1.32e-04 3.86 0.00013 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.163 on 485 degrees of freedom
Multiple R-squared: 0.834, Adjusted R-squared: 0.831
F-statistic: 244 on 10 and 485 DF, p-value: < 2e-16
Next let's visualize the residual diagnostics again.
autoplot(fit5, which = 1:6, ncol = 3, label.size = 1,
colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray"))
<img src="http://i.imgur.com/YURlRPT.png">
Model 5 suggests that `lstat`, `I(lstat^2)`, `rm`, and `ptratio` are the most relevant explanatory variables for estimating `medv`. It is worth mentioning that our residual diagnostics suggested a certain amount of non-linearity that is not addressed by our model. This means that other methods might do a better job of estimating `medv` than *multiple linear regression*.
However, `lm()` is a fundamental tool in statistical learning for estimation and prediction, and understanding this type of modeling paves the way to understanding more sophisticated models.
