**Regression Analysis of Boston Home Prices with R**

This project focuses on finding the *best* linear regression model to estimate the median value (in $1000s) of owner-occupied homes in the suburbs of Boston.

Understanding house prices (and related variables) is relevant to economists and policy-makers when allocating budgets for domestic purposes. Linear regression is a fundamental concept in statistical learning, from which more sophisticated methods arise. Although it may not be the most precise method, the statistical foundation on which it is built is worth exploring.

For this project we leverage the power of R for both the statistical computations as well as the visualizations.

The data used in this project is available at the UCI Machine Learning Repository. This modeling problem is referenced in the 2014 version of the book "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

We encourage you to replicate this project and submit your own contributions. You can fork this project on GitHub.

- Basic working knowledge of the R programming language, statistics, and economics. If you need an introduction to R, we recommend that you visit DataCamp.
- R environment. Download R. Download R Studio.

- David A. Campos
- Raul Eulogio
- Amil Khan

First we load the necessary packages into our R environment using the `library()` function. We will need the packages `usdm` to test for multicollinearity, `car` to diagnose outliers, `MASS` to compute studentized residuals, `DAAG` to cross-validate our models, and `lmtest` to check for heteroskedasticity. It is crucial that you first install these packages using `install.packages()` if you haven't done so already.

```
# RUN EACH OF THESE COMMANDS IN YOUR CONSOLE:
install.packages("usdm")
install.packages("car")
install.packages("MASS")
install.packages("DAAG")
install.packages("lmtest")
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("GGally")
```

```
library(usdm)      # for testing collinearity
library(car)       # for testing outliers
library(MASS)      # for studentized residuals
library(DAAG)      # for cross-validation of models
library(lmtest)    # for checking homoskedasticity/heteroskedasticity
library(ggplot2)   # for visuals
library(ggfortify) # for residual diagnostic plots with ggplot2
library(GGally)    # for the ggpairs scatterplot matrix
```

The Boston Housing dataset is very popular in the R community, and is easily accessible through the `MASS` package. To load this data we simply use the `data()` function. We use the `help()` function to get context on our data, and the `attach()` function to make the variable names directly accessible.

The variable `chas` appears to be a factor of 0s and 1s, so we convert it with the `as.factor()` function. Finally, we create our data frame using `data.frame()` for use in the rest of the project.

```
data(Boston)
help(Boston)
attach(Boston)
boston.df <- data.frame(Boston)
boston.df$chas <- as.factor(boston.df$chas)
boston.df$rad <- as.factor(boston.df$rad)

# Since rad seems to be an ordinal variable we add levels,
# just to be safe!
levels(boston.df$rad) <- c(1, 2, 3, 4, 5, 6, 7, 8, 24)
str(boston.df)
```

Now we are ready for some exploratory analysis!

We want to gain intuition from our data by figuring out which explanatory variables might be the most relevant in estimating `medv`, our response variable.

Here we run our data frame through the `str()` function to see how many variables and observations we have, as well as to get a general feel for the data.

`str(boston.df)`

We can see that we have a total of 14 variables and 506 observations. We also see the data types for the different variables, as well as sample values.

Here is our output:

```
> str(boston.df)
'data.frame': 506 obs. of 14 variables:
 $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87
 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num 6.58 6.42 7.18 7 7.15 ...
 $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9
 $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5
 $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2
 $ black  : num 397 397 393 395 397 ...
 $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9
```

Here we run our data frame through the `ggpairs()` function to get a feel for how our variables might be related to each other. To learn more about scatterplot matrices, visit this r-bloggers article. For this presentation we are using the `ggpairs` extension because it provides a cleaner visual than the base `pairs` function. We start off with a subsection of the variables, since plotting all of them at once produces a matrix that is not readable, and hence not usable.

```
pairs <- ggpairs(boston.df,
columns = c(1, 2, 3, 4, 5, 14),
lower=list(continuous=wrap("smooth",
colour="turquoise4")),
diag=list(continuous=wrap("barDiag",
fill="turquoise4"))) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line.y = element_line(colour="gray"),
axis.line.x = element_line(colour="gray"))
pairs
```

We now extend the `ggpairs` function to the rest of the variables to get more information about their interactions. For that we simply call up the variables which we did not include in the first iteration of `ggpairs`:

```
pairs1 <- ggpairs(boston.df,
columns = c(6, 7, 8, 9, 10, 14),
lower=list(continuous=wrap("smooth",
colour="turquoise4")),
diag=list(continuous=wrap("barDiag",
fill="turquoise4"))) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line.y = element_line(colour="gray"),
axis.line.x = element_line(colour="gray"))
pairs1
```

And finally we have:

```
pairs3 <- ggpairs(boston.df,
columns=c(11, 12, 13,14),
lower=list(continuous=wrap("smooth",
colour="turquoise4")),
diag=list(continuous=wrap("barDiag"
, fill="turquoise4"))) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line.y = element_line(colour="black"),
axis.line.x = element_line(colour="black"))
pairs3
```

It helps to click on the images to get a better picture of the correlation between variables.

Here we are looking for linear relationships between our response variable `medv` and the rest of the variables in our data frame. We can roughly infer that `rm`, `ptratio`, and `lstat` might be the most relevant in estimating `medv`. We also use the `ggpairs` function to watch for multicollinearity between explanatory variables. We do not want two highly correlated explanatory variables repeating the same information, because that can lead to over-fitting our model.

We included more exploratory analysis using ggplot2 to give you an idea of some of the setbacks/pitfalls we might run into with this data set, because from the `ggpairs` output we can see that most of the interactions between the variables are not linear, and some show distinct patterns.

An interaction that is fairly simple to start with is `lstat` vs. `tax`, since it gives us a healthy perspective on how the `rad` variable will influence the model. We use the `geom_point` function to create a scatterplot, colouring the points by the variable `rad`, which we will see through the next few graphs plays a large role in how variables interact. So we start by assigning a `ggplot` object with the variables, then add `geom_point` with the colour set to `rad`, as shown here:

```
taxMedv <- ggplot(boston.df,
aes(tax, lstat, colour = rad)) +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray"))
taxMedv + geom_point()
```

For the sake of saving space in this project we limited the number of images; we will make more use of the `facet_wrap` function later on. The plot shows that the index of **24** has an even distribution with respect to the percentage of lower status.

In this section we highlight possible outliers which might influence our linear model. We will go into more detail when we look at the residual diagnostics, especially `Cook's Distance` and `Leverage`! Here we plot `tax` vs. `ptratio`, with the colour emphasizing `ptratio`, to show outliers.

```
taxPtratio <- ggplot(boston.df,
aes(tax, ptratio, colour = ptratio))
taxPtratio + geom_point() +
ggtitle("TAX vs. PTRATIO Scatterplot") +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray"))
```

Next we examine the distribution of `medv` so we can make the appropriate transformations, if necessary. Here we plot a histogram of `medv`, again using `ggplot2`, to check whether a transformation is needed.

```
medvHist <- ggplot(boston.df, aes(medv))
medvHist + geom_histogram(aes(y = ..density.., fill = ..count..),
colour = 'white', bins = 25) +
geom_density() +
scale_fill_gradient("Count", low = "black",
high = "turquoise4") +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Histogram of MEDV")
```

From this histogram we can see that `medv` has a **right-skewed distribution**! Thus we conclude that a **log** transformation will be needed.
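A quick numeric sanity check supports what the histogram shows. This is a sketch on the raw `Boston` data: for a right-skewed distribution the mean is pulled above the median, so comparing the two on the original and log scales is informative.

```r
library(MASS)
data(Boston)

# For a right-skewed variable the mean sits above the median
c(mean = mean(Boston$medv), median = median(Boston$medv))

# The same comparison on the log scale
c(mean = mean(log(Boston$medv)), median = median(log(Boston$medv)))
```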

Here we create a histogram of the `log`-transformed variable to see the distribution!

```
# First we create a data frame with the log transformation
logMedv <- data.frame(log(boston.df$medv))
# Next we change the name of the column to make it easy
# for us to call in our ggplot2 visual
colnames(logMedv) <- c("logMedv")
logMedvHist <- ggplot(logMedv, aes(logMedv))
logMedvHist + geom_histogram(aes(y=..density.., fill=..count..),
    colour = 'white', bins = 25) +
  geom_density() +
  scale_fill_gradient("Count", low = "black",
    high = "turquoise4") +
  theme(panel.background = element_rect(fill = "gray98"),
    axis.line = element_line(colour="black"),
    axis.line.x = element_line(colour="gray"),
    axis.line.y = element_line(colour="gray")) +
  ggtitle("Histogram of log(MEDV)")
```

We can see that the distribution is now much more symmetric, so we can proceed to building our linear model.

This is arguably the most engaging part: finding a linear regression model that best fits our data.

For our first model we will go **all in** with our entire data set. That is, we will use all of our variables and observations to predict `medv` using linear regression with the `lm()` function. Throughout the modeling phase we will fine-tune our working data set using diagnostics.

We run our model through the `bptest()` function to check for heteroskedasticity. Roughly speaking, heteroskedastic data have non-constant error variance, which violates an assumption of the linear model. **In short, we do not want heteroskedastic data. Instead, we prefer homoskedastic data, whose errors have constant variance.**

Finally we run our model through the `vif()` function (for *variance inflation factor*) to check for multicollinearity. Roughly speaking, multicollinearity in a data set means that two or more *explanatory* variables are highly correlated, so one of them can be *linearly* predicted from the others. This causes problems when explaining the response variable as a function of its explanatory variables. **In short, we do not want multicollinearity.**

**R tip**: when we use `.` in a formula, we are calling all remaining variables. This saves us from manually typing each variable, such as `boston.df$chas`, into our model.

```
fit1 <- lm(log(medv) ~ ., data = boston.df)
summary(fit1)
bptest(fit1)
vif(fit1)
```

Below is our terminal output: the summary of our first model. To learn more about model summaries, visit this r-bloggers article. Notice the **Multiple R-squared** of **0.7946**. In multiple regression, the R-squared is the squared correlation between the response and the fitted (predicted) values. The higher our R-squared, the better our model fits the response variable. We will need to run some diagnostics on our model before trusting this R-squared value.
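The "squared correlation" interpretation can be verified directly. This sketch refits a model analogous to `fit1` on the raw `Boston` data (with `rad` left numeric for brevity, so its R-squared differs slightly from the 0.7946 above) and compares the reported R-squared to the squared correlation computed by hand.

```r
library(MASS)
data(Boston)

# A model analogous to fit1, on the raw data (rad left numeric here)
fit <- lm(log(medv) ~ ., data = Boston)

r2.reported <- summary(fit)$r.squared
r2.by.hand  <- cor(fitted(fit), log(Boston$medv))^2

all.equal(r2.reported, r2.by.hand)
```

For ordinary least squares with an intercept, the two quantities agree exactly.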

```
> summary(fit1)
Call:
lm(formula = log(medv) ~ ., data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.73163 -0.10074 -0.01271 0.09153 0.86209
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0078476 0.2188374 18.314 < 2e-16 ***
crim -0.0102270 0.0013164 -7.769 4.75e-14 ***
zn 0.0015096 0.0005698 2.649 0.008333 **
indus 0.0027748 0.0025634 1.082 0.279575
chas1 0.1002734 0.0347642 2.884 0.004096 **
nox -0.7728738 0.1569242 -4.925 1.16e-06 ***
rm 0.0868708 0.0169635 5.121 4.39e-07 ***
age 0.0001981 0.0005327 0.372 0.710086
dis -0.0511276 0.0081318 -6.287 7.21e-10 ***
rad2 0.0833711 0.0595060 1.401 0.161838
rad3 0.1766617 0.0537784 3.285 0.001094 **
rad4 0.1002350 0.0478207 2.096 0.036595 *
rad5 0.1344828 0.0486359 2.765 0.005908 **
rad6 0.0961606 0.0589625 1.631 0.103565
rad7 0.2075191 0.0632793 3.279 0.001115 **
rad8 0.1939560 0.0600758 3.229 0.001329 **
rad24 0.3504602 0.0720382 4.865 1.55e-06 ***
tax -0.0005052 0.0001569 -3.221 0.001365 **
ptratio -0.0370069 0.0058188 -6.360 4.67e-10 ***
black 0.0004148 0.0001072 3.871 0.000123 ***
lstat -0.0291645 0.0020395 -14.300 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.189 on 485 degrees of freedom
Multiple R-squared: 0.7946, Adjusted R-squared: 0.7861
F-statistic: 93.81 on 20 and 485 DF, p-value: < 2.2e-16
```

Below we see the terminal output from our `bptest(fit1)` call, which tests for heteroskedasticity in our model. The null hypothesis of the Breusch-Pagan test is homoskedasticity, and the p-value we obtain is very small, so the test actually finds evidence of heteroskedasticity. This is a caveat to keep in mind as we proceed: our coefficient estimates remain unbiased, but their standard errors may be unreliable.

```
> bptest(fit1) # checking for homoskedasticity
studentized Breusch-Pagan test
data: fit1
BP = 77.592, df = 20, p-value = 1.002e-08
```

Below we see the terminal output from our `vif(fit1)` call, which tells us about the multicollinearity in our model. VIF values higher than 5 are considered problematic, so we drop those variables from the model. Setting aside `rad24` (a single level of the `rad` factor), `tax` has the highest VIF at 9.8769, so we drop it first.

```
> vif(fit1)
crim zn indus chas1 nox rm age
1.8119 2.4962 4.3705 1.1018 4.6729 2.0076 3.1776
dis rad2 rad3 rad4 rad5 rad6
4.1436 2.2654 2.8445 5.5092 5.8824 2.3996
rad7 rad8 rad24 tax ptratio black lstat
1.8410 2.3090 14.1690 9.8769 2.2427 1.3525 2.9976
```
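The VIF values above can be reproduced by hand: the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on all the other predictors. Here is a sketch for `tax` on the raw `Boston` data (with `rad` left numeric for brevity, so the value differs somewhat from the factor-coded figure above); `aux` and `vif.tax` are our own names.

```r
library(MASS)
data(Boston)

# Auxiliary regression: tax on every other predictor (medv excluded)
aux <- lm(tax ~ . - medv, data = Boston)

# VIF_tax = 1 / (1 - R^2 of the auxiliary regression)
vif.tax <- 1 / (1 - summary(aux)$r.squared)
vif.tax
```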

Now onto our second model; this time we remove the `tax` variable (because of high multicollinearity). We use the `update()` function to recycle our first model while removing `tax`, then use `vif()` and `summary()` again to diagnose.

```
fit2 <- update(fit1, ~ . - tax)
vif(fit2)
summary(fit2)
```

We can see below that `rad24` still suggests multicollinearity, since its value is > 5, but since removing a single level of a factor variable is not straightforward, we will drop `rad` entirely.

```
> fit2 <- update(fit1, ~ . - tax)
> vif(fit2) # Checking again for more multicollinearity
crim zn indus chas1 nox rm age
1.8116 2.4222 3.5828 1.0874 4.6638 2.0065 3.1689
dis rad2 rad3 rad4 rad5 rad6 rad7
4.1359 2.2251 2.8287 5.4929 5.8445 2.3074 1.8366
rad8 rad24
2.2972 8.3218
ptratio black lstat
2.2359 1.3516 2.9974
```

Below we get a summary of our second model. This time we are looking for variables that are not statistically significant so we can remove them. By convention, variables with a p-value larger than **.05** are considered **not** statistically significant. Notice the p-value in the terminal output below, listed as `Pr(>|t|)` for each variable. Let's start by dropping `age` and `indus`.

```
> summary(fit2)
Call:
lm(formula = log(medv) ~ crim + zn + indus + chas + nox + rm +
age + dis + rad + ptratio + black + lstat, data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.73162 -0.10219 -0.01228 0.09216 0.86447
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9170473 0.2190964 17.878 < 2e-16 ***
crim -0.0101787 0.0013289 -7.659 1.02e-13 ***
zn 0.0011936 0.0005667 2.106 0.035711 *
indus -0.0007300 0.0023432 -0.312 0.755519
chas1 0.1130736 0.0348677 3.243 0.001264 **
nox -0.7951695 0.1582759 -5.024 7.12e-07 ***
rm 0.0881738 0.0171214 5.150 3.79e-07 ***
age 0.0001085 0.0005371 0.202 0.839957
dis -0.0522585 0.0082022 -6.371 4.35e-10 ***
rad2 0.1089429 0.0595399 1.830 0.067900 .
rad3 0.1895977 0.0541429 3.502 0.000505 ***
rad4 0.0918609 0.0482082 1.906 0.057305 .
rad5 0.1219047 0.0489441 2.491 0.013082 *
rad6 0.0589410 0.0583737 1.010 0.313133
rad7 0.1976450 0.0638115 3.097 0.002066 **
rad8 0.1801044 0.0604967 2.977 0.003055 **
rad24 0.2014158 0.0557378 3.614 0.000333 ***
ptratio -0.0380353 0.0058658 -6.484 2.20e-10 ***
black 0.0004239 0.0001081 3.920 0.000101 ***
lstat -0.0291049 0.0020590 -14.136 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1908 on 486 degrees of freedom
Multiple R-squared: 0.7902, Adjusted R-squared: 0.782
F-statistic: 96.34 on 19 and 486 DF, p-value: < 2.2e-16
```

Now we run our third model. For this iteration we have removed the `tax`, `age`, `indus`, and `rad` variables.

```
fit3 <- update(fit2, ~ . - age - indus - rad)
summary(fit3)
```

The summary of our model shows that our least significant variable is now `zn`, with a p-value of 0.106; every other predictor is significant at conventional levels, with `lstat` at < 2.2e-16. We stop dropping variables here and move on to another set of diagnostics.

```
> summary(fit3)
Call:
lm(formula = log(medv) ~ crim + zn + chas + nox + rm + dis +
ptratio + black + lstat, data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.71003 -0.10155 -0.01094 0.09083 0.89539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7901595 0.1962986 19.308 < 2e-16 ***
crim -0.0083340 0.0012238 -6.810 2.83e-11 ***
zn 0.0008755 0.0005407 1.619 0.106052
chas1 0.1195298 0.0349829 3.417 0.000686 ***
nox -0.6852309 0.1302350 -5.261 2.13e-07 ***
rm 0.1061611 0.0164243 6.464 2.45e-10 ***
dis -0.0488011 0.0075982 -6.423 3.14e-10 ***
ptratio -0.0333366 0.0047273 -7.052 5.96e-12 ***
black 0.0003735 0.0001083 3.449 0.000610 ***
lstat -0.0287382 0.0019479 -14.753 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1947 on 496 degrees of freedom
Multiple R-squared: 0.7772, Adjusted R-squared: 0.7731
F-statistic: 192.2 on 9 and 496 DF, p-value: < 2.2e-16
```

Now we will run our model through `outlierTest()` to check for outliers and high-leverage points in our data. The linear regression model operates under the assumption of **no** extreme outliers or high-leverage points.

```
outlierTest(fit3, cutoff = Inf, n.max = 15)
```

Below we see the terminal output for the outlier test. Observations whose studentized residuals have an absolute value greater than 3 are considered problematic, so we remove them from the model.

```
> outlierTest(fit3, cutoff = Inf, n.max = 15)
     rstudent unadjusted p-value Bonferonni p
413  4.828866         1.8331e-06   0.00092755
372  4.415447         1.2386e-05   0.00626750
369  4.054746         5.8286e-05   0.02949300
373  4.044845         6.0722e-05   0.03072500
402 -3.716832         2.2476e-04   0.11373000
375  3.606444         3.4192e-04   0.17301000
490 -3.579379         3.7835e-04   0.19144000
506 -3.449505         6.0959e-04   0.30845000
215  3.439297         6.3248e-04   0.32004000
401 -3.403025         7.2048e-04   0.36456000
368  3.033191         2.5469e-03           NA
410  2.991901         2.9108e-03           NA
398 -2.819016         5.0101e-03           NA
400 -2.775641         5.7179e-03           NA
366  2.702446         7.1196e-03           NA
```

We see that rows 413, 372, 369, 373, 402, 375, 490, 506, 215, and 401 meet the criteria for removal, as their studentized residuals have absolute values above 3.
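The same cutoff can be applied programmatically rather than by reading the table. This sketch rebuilds `fit3` so the snippet runs standalone: `rstudent()` returns the studentized residuals, and `which()` picks out the rows beyond the threshold (note that row 368, at 3.033, also sits just above the cutoff).

```r
library(MASS)
data(Boston)
boston.df <- data.frame(Boston)
boston.df$chas <- as.factor(boston.df$chas)

# fit3 as derived above: tax, age, indus, and rad dropped
fit3 <- lm(log(medv) ~ crim + zn + chas + nox + rm + dis +
             ptratio + black + lstat, data = boston.df)

# Flag observations with |studentized residual| > 3
suspect <- which(abs(rstudent(fit3)) > 3)
suspect
```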

Next we update our data frame by removing the outlier observations, declare our fourth model, and produce residual diagnostic plots to test our model.

```
boston.df <- boston.df[-c(413, 372, 369, 373, 402, 375, 490, 506, 215, 401),]
fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)
summary(fit4)
```

Here we get the summary of our model. It looks like `lstat` is our most significant variable, along with `rm`. Our R-squared is 0.829; since it is close to 1, our model explains a large part of the variation in our response variable. Our **residual standard error** (RSE) is fairly small at 0.1651, which is good. Let's see if we can do better.

```
> fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)
> summary(fit4)
Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad, data = boston.df)
Residuals:
Min 1Q Median 3Q Max
-0.54154 -0.09775 -0.01559 0.08727 0.66039
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.456e+00 1.693e-01 20.411 < 2e-16 ***
crim -8.426e-03 1.047e-03 -8.046 6.61e-15 ***
zn 6.405e-04 4.607e-04 1.390 0.16511
chas1 9.297e-02 3.009e-02 3.090 0.00212 **
nox -5.592e-01 1.121e-01 -4.987 8.54e-07 ***
rm 1.329e-01 1.426e-02 9.320 < 2e-16 ***
dis -4.128e-02 6.521e-03 -6.330 5.58e-10 ***
ptratio -3.229e-02 4.040e-03 -7.993 9.64e-15 ***
black 4.811e-04 9.358e-05 5.141 3.98e-07 ***
lstat -2.786e-02 1.740e-03 -16.014 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1651 on 486 degrees of freedom
Multiple R-squared: 0.829, Adjusted R-squared: 0.8259
F-statistic: 261.8 on 9 and 486 DF, p-value: < 2.2e-16
```

But first, let's take a look at a few residual diagnostic plots for our fourth model. To learn more about residual diagnostics, visit this r-bloggers article. Given the curve in our **Residuals vs Fitted** plot below, we can infer that there is non-linearity in our data.

```
autoplot(fit4, label.size = 3, shape = 21,
data=boston.df, colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Residual Diagnostics for fit4")
```

For this reason we will include non-linear terms in our model in the next couple of iterations.

Another assumption of linear regression is normality of the residual terms. We test this by plotting a histogram of the residuals with a density curve overlaid; base R offers `studres()`, `hist()`, and `lines()` for this, but here we build the plot with `ggplot2`.

```
residBar <- ggplot(data=fit4, aes(residuals(fit4))) +
geom_histogram(aes(y =..density..),
col="black", fill="white") +
geom_density(col=1) +
theme(panel.background = element_rect(fill = "gray98"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour="gray"),
axis.line.x = element_line(colour="gray")) +
ggtitle("Histogram of Residuals for fit4")
residBar
```

Here we see that the residuals of our model display a roughly normal distribution, which is good, as this is a linear regression assumption.
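The visual check can be backed up with a formal test. This is a sketch that rebuilds `fit4` so it runs standalone: base R's `shapiro.test()` takes normality as its null hypothesis, so a small p-value would argue against the impression given by the histogram.

```r
library(MASS)
data(Boston)
boston.df <- data.frame(Boston)
boston.df$chas <- as.factor(boston.df$chas)
boston.df$rad <- as.factor(boston.df$rad)

# Drop the outlier rows identified earlier, then refit fit4
boston.df <- boston.df[-c(413, 372, 369, 373, 402, 375, 490, 506, 215, 401), ]
fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)

# Null hypothesis: the residuals are normally distributed
shapiro.test(residuals(fit4))
```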

Now onto the fifth iteration of our model, with non-linear terms. Here we include an `lstat`-squared term by adding `I(lstat^2)` to the model. We inferred a non-linear relationship between the response variable and our predictors from the residual diagnostics for model 4, and we already sensed a non-linear relationship between `lstat` and `medv` in the exploratory analysis phase.

```
fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
summary(fit5)
```

In our model summary below we see that `I(lstat^2)` is indeed statistically significant, more so than most other explanatory variables. Our **R-squared** has increased to 0.8341 (which is good) and our RSE has decreased to 0.1628 (which is also good).

```
> fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
> summary(fit5)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2),
    data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59975 -0.09465 -0.01432  0.09241  0.64969

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.583e+00  1.701e-01  21.057  < 2e-16 ***
crim        -8.997e-03  1.043e-03  -8.625  < 2e-16 ***
zn           3.381e-04  4.610e-04   0.733 0.463687
chas1        9.074e-02  2.967e-02   3.058 0.002352 **
nox         -5.111e-01  1.113e-01  -4.594 5.55e-06 ***
rm           1.208e-01  1.440e-02   8.392 5.28e-16 ***
dis         -4.047e-02  6.433e-03  -6.292 7.02e-10 ***
ptratio     -2.979e-02  4.036e-03  -7.382 6.84e-13 ***
black        4.589e-04  9.245e-05   4.964 9.59e-07 ***
lstat       -4.522e-02  4.809e-03  -9.403  < 2e-16 ***
I(lstat^2)   5.083e-04  1.316e-04   3.864 0.000127 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1628 on 485 degrees of freedom
Multiple R-squared:  0.8341, Adjusted R-squared:  0.8307
F-statistic: 243.9 on 10 and 485 DF,  p-value: < 2.2e-16
```

Given the significance of `I(lstat^2)`, it may make sense to include a third-degree polynomial term for `lstat`. Let's try that in the next model. But first, let's take a look at our **Residuals vs Fitted** plot.

```
autoplot(fit5, label.size = 3, shape = 21,
data=boston.df, colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Residual Diagnostics for fit5")
```

We can see that our model is doing a little better as far as linearity is concerned, compared to our previous model.

Now onto model 6. Here we add `I(lstat^3)` to our model, given the statistical significance of `I(lstat^2)`. We will run some diagnostics to see whether this makes sense or whether we have gone too far.

```
fit6 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + I(lstat^3), data = boston.df)
summary(fit6)
```

Given its p-value, it looks like `I(lstat^3)` does not add much to the model. The multiple R-squared ticks up slightly to 0.8343, but the adjusted R-squared falls to 0.8305 and the RSE is essentially unchanged at 0.1629. This does not look like an improvement: we have traded simplicity (interpretability) in the model for essentially no gain in accuracy.

```
> fit6 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + I(lstat^3), data = boston.df)
> summary(fit6)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2) +
    I(lstat^3), data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59756 -0.09290 -0.01524  0.09406  0.64617

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.639e+00  1.898e-01  19.167  < 2e-16 ***
crim        -8.998e-03  1.044e-03  -8.622  < 2e-16 ***
zn           2.845e-04  4.682e-04   0.608  0.54365
chas1        8.922e-02  2.978e-02   2.996  0.00287 **
nox         -5.158e-01  1.115e-01  -4.624 4.83e-06 ***
rm           1.176e-01  1.521e-02   7.732 6.18e-14 ***
dis         -4.024e-02  6.446e-03  -6.243 9.42e-10 ***
ptratio     -2.996e-02  4.046e-03  -7.405 5.86e-13 ***
black        4.606e-04  9.254e-05   4.977 8.98e-07 ***
lstat       -5.291e-02  1.249e-02  -4.236 2.72e-05 ***
I(lstat^2)   1.026e-03  7.870e-04   1.304  0.19296
I(lstat^3)  -9.954e-06  1.492e-05  -0.667  0.50497
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1629 on 484 degrees of freedom
Multiple R-squared:  0.8343, Adjusted R-squared:  0.8305
F-statistic: 221.5 on 11 and 484 DF,  p-value: < 2.2e-16
```

Observing our residual diagnostics, there seems to be little difference between models 5 and 6.

```
autoplot(fit6, label.size = 3, shape = 21,
data=boston.df, colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray")) +
ggtitle("Residual Diagnostics for fit6")
```

Now we are ready to move on to the model diagnostics stage, to determine which of our models rank best at estimating our response variable.

In this stage we determine which of our models is best at estimating our response variable without overfitting the data. For this we use the `CVlm()` function to compare models 4, 5, and 6.

First we run our fourth model through the `CVlm()` function, and then pull the relevant attribute from the resulting cross-validation object.

```
par(mfrow = c(1, 1))
fit4_CV <- suppressWarnings(CVlm(data = boston.df, form.lm = formula(fit4),
    m = 10, dots = FALSE, seed = 1,
    plotit = c("Observed", "Residual"),
    main = "Cross-validation for fit4",
    legend.pos = "topleft"))
attr(fit4_CV, 'ms')
```

Below is the cross-validation plot of our fourth model, which R outputs when calling the `CVlm()` function. Roughly speaking, we want each of the fold symbols to be as close as possible to the lines on the plot. We then call `attr()` to get the mean squared error of the fit.

Below is the `attr()` output for our model 4 cross-validation object.

```
> attr(fit4_CV, 'ms')
[1] 0.0284
```

The relevant statistic we gather from `CVlm()` is the **mean squared error** (MSE). This value estimates the squared error our model would make on unseen test data, simulated by folding our training data. Therefore, the smaller the MSE, the better. Our MSE for model 4 is 0.0284. Let's look at models 5 and 6.

We run our fifth model through the `CVlm()` function, and again pull the MSE attribute from the cross-validation object.

```
par(mfrow = c(1, 1))
fit5_CV <- suppressWarnings(CVlm(data = boston.df, form.lm = formula(fit5),
    m = 10, dots = FALSE, seed = 1,
    plotit = c("Observed", "Residual"),
    main = "Cross-validation for fit5",
    legend.pos = "topleft"))
attr(fit5_CV, 'ms')
```

Below is the cross validation plot of our fifth model, and its attributes.

Below is the `attr()` output for our model 5 cross-validation object, again giving the mean squared error.

```
> attr(fit5_CV, 'ms')
[1] 0.0278
```

Our MSE for model 5 is 0.0278, definitely better than that of model 4.

We run our sixth model through the `CVlm()` function, and again pull the MSE attribute from the cross-validation object.

```
par(mfrow = c(1, 1))
fit6_CV <- suppressWarnings(CVlm(data = boston.df, form.lm = formula(fit6),
    m = 10, dots = FALSE, seed = 1,
    plotit = c("Observed", "Residual"),
    main = "Cross-validation for fit6",
    legend.pos = "topleft"))
attr(fit6_CV, 'ms')
```

Below is the cross validation plot of our sixth model, and its attributes.

Below is the `attr()` output for our model 6 cross-validation object.

```
> attr(fit6_CV, 'ms')
[1] 0.028
```

Our MSE for model 6 is 0.028, very close to the other models, but it seems **fit5** is our best model!

With an R-squared value of 0.834 and a Mean Squared Error of 0.0278, model 5 is the winner. Let's take another look at a summary of model 5:

```
> fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
> summary(fit5)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2),
    data = boston.df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.5998 -0.0947 -0.0143  0.0924  0.6497

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.58e+00   1.70e-01   21.06  < 2e-16 ***
crim        -9.00e-03   1.04e-03   -8.63  < 2e-16 ***
zn           3.38e-04   4.61e-04    0.73  0.46369
chas1        9.07e-02   2.97e-02    3.06  0.00235 **
nox         -5.11e-01   1.11e-01   -4.59  5.6e-06 ***
rm           1.21e-01   1.44e-02    8.39  5.3e-16 ***
dis         -4.05e-02   6.43e-03   -6.29  7.0e-10 ***
ptratio     -2.98e-02   4.04e-03   -7.38  6.8e-13 ***
black        4.59e-04   9.24e-05    4.96  9.6e-07 ***
lstat       -4.52e-02   4.81e-03   -9.40  < 2e-16 ***
I(lstat^2)   5.08e-04   1.32e-04    3.86  0.00013 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.163 on 485 degrees of freedom
Multiple R-squared:  0.834, Adjusted R-squared:  0.831
F-statistic: 244 on 10 and 485 DF,  p-value: < 2e-16
```
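To actually use the winning model, remember that it was fit on `log(medv)`: predictions arrive on the log scale and need `exp()` to get back to $1000s. A sketch that rebuilds `fit5`, reusing the first row of the data as a stand-in for a new suburb:

```r
library(MASS)
data(Boston)
boston.df <- data.frame(Boston)
boston.df$chas <- as.factor(boston.df$chas)
boston.df$rad <- as.factor(boston.df$rad)
boston.df <- boston.df[-c(413, 372, 369, 373, 402, 375, 490, 506, 215, 401), ]

fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2),
           data = boston.df)

# Predictions come back on the log scale; exp() returns them to $1000s.
log.pred <- predict(fit5, newdata = boston.df[1, ])
exp(log.pred)
```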

For more residual diagnostics, which can be useful in your analysis, the updated `autoplot` function (via `ggplot2` and `ggfortify`) lets us make more aesthetically pleasing visuals!

```
autoplot(fit5, which = 1:6, ncol = 3, label.size = 1,
colour='turquoise4') +
theme(panel.background = element_rect(fill = "gray98"),
axis.line = element_line(colour="black"),
axis.line.x = element_line(colour="gray"),
axis.line.y = element_line(colour="gray"))
```

Model 5 suggests that `lstat`, `I(lstat^2)`, `rm`, and `ptratio` are the most relevant explanatory variables in estimating `medv`. It is worth mentioning that our residual diagnostics suggested a certain amount of non-linearity that is not addressed by our model. This means that there may be other methods that do a better job of estimating `medv` than `lm()`.

However, `lm()` is a fundamental model in statistical learning for estimation and prediction, and understanding this type of modeling paves the way to understanding more sophisticated models.

Congratulations on getting this far! We hope you enjoyed this project. Please reach out to us here if you have any feedback or would like to publish your own project.