Modeling Home Prices (R)


Project Summary

Abstract

This project focuses on finding the best linear regression model to estimate the median value (in $1000's) of owner-occupied homes in the suburbs of Boston.

Understanding house prices (and the variables related to them) is relevant to economists and policy-makers when allocating budgets for domestic purposes. Linear regression is a fundamental concept in statistical learning, from which other, more sophisticated methods arise. Although it may not be the most precise method available, the statistical foundation on which it is built is worth exploring.

For this project we leverage the power of R for both the statistical computations as well as the visualizations.

The data used in this project is available at the UCI Machine Learning Repository. This modeling problem is referenced in the 2014 version of the book "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

We encourage you to replicate this project and submit your own contributions. You can fork this project on GitHub.

Requirements

Steps

Contributors

1. Environment Setup

First we load the necessary packages into our R environment using the library() method. We will need the packages usdm to test for multicollinearity, car to diagnose outliers, MASS for studentized residuals (it also provides the Boston data), DAAG to cross-validate our models, and lmtest to check for heteroskedasticity, along with ggplot2, ggfortify, and GGally for the visuals. It is crucial that you first install these packages using install.packages() if you haven't done so already.

Installing Packages


# RUN EACH OF THESE COMMANDS IN YOUR CONSOLE:

install.packages("usdm")
install.packages("car")
install.packages("MASS")
install.packages("DAAG")
install.packages("lmtest")
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("GGally")

Loading Packages

library(usdm) # for testing collinearity 
library(car) # for testing outliers
library(MASS) # for testing studentized residuals
library(DAAG) # for cross validation of model
library(lmtest) # for checking homoskedasticity/heteroskedasticity
library(ggplot2) # for visuals
library(ggfortify) # for ggplot2-style residual diagnostic plots (autoplot)
library(GGally) # for scatterplot matrices (ggpairs)

Loading Data

The Boston Housing dataset is very popular in the R community, and is easily accessible through the MASS package. To call this data we simply use the data() method. We use the help() method to get context on our data, and the attach() method so that the variable names can be referenced directly.

The variable chas is an indicator coded as 0 and 1, and rad is an ordinal index, so we convert both into factor variables using the as.factor() method. But first we create our data frame using data.frame() for use in the rest of the project.


data(Boston) 
help(Boston)
attach(Boston)

boston.df <- data.frame(Boston)

boston.df$chas <- as.factor(boston.df$chas)
boston.df$rad <- as.factor(boston.df$rad)

# Since rad seems to be an ordinal variable we add levels 
# just to be safe!    

levels(boston.df$rad) <- c(1, 2, 3, 4, 5, 6, 7, 8, 24)


str(boston.df)

Now we are ready for some exploratory analysis!

2. Exploratory Analysis

We want to gain intuition from our data by figuring out which explanatory variables might be the most relevant in estimating medv, our response variable.

Data Structure

Here we run our data frame through the str() method to see how many variables and observations we have, as well as a general feel for the data.

str(boston.df)

We can see that we have a total of 14 variables and 506 observations. We also see the data types for the different variables, as well as sample values.

TERMINAL OUTPUT

 # Here is our output.
> str(boston.df)
'data.frame': 506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87
 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9
        

Getting Scatterplot Matrices

Here we run our data frame through the ggpairs() method to get a feel for how our variables might be related to each other. To learn more about scatterplot matrices, visit this r-bloggers article. For this presentation we use the GGally extension ggpairs() because it provides a cleaner visual than the base pairs() function. We start with a subset of the columns, since plotting all the variables at once produces a matrix that is too crowded to read.

pairs <- ggpairs(boston.df, 
                 columns = c(1, 2, 3, 4, 5, 14),
                 lower=list(continuous=wrap("smooth",
                                            colour="turquoise4")),
                 diag=list(continuous=wrap("barDiag",
                                           fill="turquoise4")))  + 
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line.y = element_line(colour="gray"),
        axis.line.x = element_line(colour="gray"))
pairs

We now extend the ggpairs() call to more of the variables to get additional information on how they interact. For that we simply select the columns we did not include in the first iteration:

pairs1 <- ggpairs(boston.df, 
                  columns = c(6, 7, 8, 9, 10, 14),
                  lower=list(continuous=wrap("smooth", 
                                             colour="turquoise4")),
                  diag=list(continuous=wrap("barDiag", 
                                            fill="turquoise4"))) + 
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line.y = element_line(colour="gray"),
        axis.line.x = element_line(colour="gray"))
pairs1

And finally we have:

pairs3 <- ggpairs(boston.df,
                  columns=c(11, 12, 13,14), 
                  lower=list(continuous=wrap("smooth",
                                             colour="turquoise4")),
                 diag=list(continuous=wrap("barDiag"
                                           , fill="turquoise4"))) +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line.y = element_line(colour="black"),
        axis.line.x = element_line(colour="black"))
pairs3

It helps to click on the images to get a better picture of the correlation between variables.

Here we are looking for linear relationships between our response variable medv and the rest of the variables in our data frame. We can roughly infer that rm, ptratio, and lstat might be the most relevant in estimating medv. We also use the ggpairs output to watch out for multicollinearity between the explanatory variables: two highly correlated predictors essentially repeat the same information, which can lead to an over-fitted and harder-to-interpret model.

We include some additional exploratory analysis using ggplot2 to give you an idea of the pitfalls we might run into with this data set, because the ggpairs plots show that many of the pairwise relationships are not linear and display distinct patterns.
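
As a quick numeric complement to the scatterplot matrices, the sketch below inspects the correlations with medv directly (cor() only accepts numeric columns, so the factors chas and rad are excluded):

# Correlation of each numeric variable with medv, rounded for readability
num.vars <- sapply(boston.df, is.numeric)
round(cor(boston.df[, num.vars])[, "medv"], 2)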

Variable Interaction

A fairly simple interaction to start with is between lstat and tax, since it gives us a healthy perspective on how rad will influence the model. We use the geom_point() function to create a scatterplot and colour the points by the variable rad, which, as we will see through the next few graphs, plays a large role in how the variables interact. We assign a ggplot() call with the variables and then add geom_point() with the colour mapped to rad, as shown here:

taxMedv <- ggplot(boston.df, 
        aes(tax, lstat, colour = rad)) + 
theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray"))
taxMedv + geom_point()

To save space we have limited the number of images in this project, but the facet_wrap() function (sketched below) can be used to split this plot into one panel per rad level. The plot shows that the accessibility index of 24 has a fairly even distribution with respect to the percentage of lower-status population (lstat).
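
A faceted version could be produced as follows (a small sketch; the image is omitted to save space):

# One panel per rad level makes the per-index distributions easier to compare
taxMedv + geom_point() + facet_wrap(~ rad)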

Outliers

In this section we highlight possible outliers which might influence our linear model; we will go into more detail when we look at the residual diagnostics, especially Cook's Distance and leverage. Here we plot tax vs. ptratio, with the colour gradient emphasizing ptratio, to make the outliers stand out.

taxPtratio <- ggplot(boston.df, 
aes(tax, ptratio, colour = ptratio))
taxPtratio + geom_point() + 
ggtitle("TAX vs. PTRATIO Scatterplot") +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray")) 

Distribution of Medv

Next we examine the distribution of medv to decide whether a transformation is needed. We plot a histogram of medv (with a density overlay and a colour gradient by count) using ggplot2.

medvHist <- ggplot(boston.df, aes(medv))
medvHist + geom_histogram(aes(y = ..density.., fill = ..count..),
                          colour = 'white', bins = 25) + 
  geom_density() +
  scale_fill_gradient("Count", low = "black", 
    high = "turquoise4") +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray")) + 
  ggtitle("Histogram of MEDV")

From this histogram we can see that the median price has a right skewed distribution! Thus we conclude that a log transformation will be needed.
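
A quick numeric check is consistent with the visual (a small sketch): for a right-skewed variable the mean sits above the median, and summary() lets us compare the two directly.

# For medv the mean exceeds the median, one sign of right skew
summary(boston.df$medv)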

Here we create the histogram of the log transformation to see the distribution!

# First we create a data frame with the log transformation 
logMedv <- data.frame(log(boston.df$medv))

# Next we change the name of the column to make it easy
# for us to call in our ggplot2 visual
colnames(logMedv) <- c("logMedv")

logMedvHist <- ggplot(logMedv, aes(logMedv))
logMedvHist + geom_histogram(aes(y=..density.., fill=..count..), 
                             colour = 'white', bins = 25) + 
  geom_density() +
  scale_fill_gradient("Count", low = "black", 
                      high = "turquoise4") +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray")) + 
  ggtitle("Histogram of log(MEDV)")

We can see here that the distribution is much closer to symmetric, so now we can proceed to building our linear model.

3. Modeling

This is arguably the most engaging part: finding a linear regression model that best fits our data.

Model 1: All In

For our first model we will go all in with our entire data set. That is, we will use all of our variables and observations in our data frame to predict medv using linear regression with the lm() method. Throughout the modeling phase, we will fine-tune our working data set using backward selection, which means that we will eliminate variables that are, statistically, the least significant.

We run our model through the bptest() method to check for heteroskedasticity. Roughly speaking, heteroskedastic data have errors whose variance is not constant, which violates one of the assumptions behind a linear model. In short, we do not want heteroskedastic data; we prefer homoskedastic data, whose errors have constant variance across observations.

Finally we run our model through the vif() method (for variance inflation factor) to check for multicollinearity. Roughly speaking, multicollinearity in a data set means that two or more explanatory variables are highly correlated, hence one of the variables can be linearly predicted as a function of the others. This yields problems in the context of explaining the response variable as a function of its explanatory variables. In short, we do not want multicollinearity.
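
For intuition, here is a hedged sketch (not part of the original workflow) of what the VIF measures: regress one predictor, say tax, on all the other explanatory variables and apply 1 / (1 - R^2).

# VIF of tax by hand: regress tax on the other predictors (response excluded)
# and apply 1 / (1 - R^2). With factor variables involved, packaged vif()
# implementations may report generalized values, so treat this as an approximation.
tax.aux <- lm(tax ~ . - medv, data = boston.df)
1 / (1 - summary(tax.aux)$r.squared)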

R tip: when we use '.' in a formula we are including all of the remaining variables. This saves us from manually typing each variable (such as boston.df$chas) into our model.

        
fit1 <- lm(log(medv) ~ ., data = boston.df)
summary(fit1)

bptest(fit1)

vif(fit1)
        
        

Below is our terminal output, showing the summary of our first model. To learn more about model summaries, visit this r-bloggers article. Notice the Multiple R-squared of 0.7946. In multiple regression, the R-squared is the square of the correlation between the observed response and the fitted (predicted) values, so the higher the R-squared, the better our model fits the training data. We will need to run some diagnostics on our model before trusting this R-squared value.
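
As a quick sanity check of that definition (a small sketch, assuming fit1 has been fit as above), the R-squared can be recovered by hand; the full summary output follows below.

# Squared correlation between the observed log(medv) and the fitted values;
# this should reproduce the Multiple R-squared reported by summary(fit1)
cor(log(boston.df$medv), fitted(fit1))^2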

TERMINAL OUTPUT

        
> summary(fit1)

Call:
lm(formula = log(medv) ~ ., data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.73163 -0.10074 -0.01271  0.09153  0.86209 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.0078476  0.2188374  18.314  < 2e-16 ***
crim        -0.0102270  0.0013164  -7.769 4.75e-14 ***
zn           0.0015096  0.0005698   2.649 0.008333 ** 
indus        0.0027748  0.0025634   1.082 0.279575    
chas1        0.1002734  0.0347642   2.884 0.004096 ** 
nox         -0.7728738  0.1569242  -4.925 1.16e-06 ***
rm           0.0868708  0.0169635   5.121 4.39e-07 ***
age          0.0001981  0.0005327   0.372 0.710086    
dis         -0.0511276  0.0081318  -6.287 7.21e-10 ***
rad2         0.0833711  0.0595060   1.401 0.161838    
rad3         0.1766617  0.0537784   3.285 0.001094 ** 
rad4         0.1002350  0.0478207   2.096 0.036595 *  
rad5         0.1344828  0.0486359   2.765 0.005908 ** 
rad6         0.0961606  0.0589625   1.631 0.103565    
rad7         0.2075191  0.0632793   3.279 0.001115 ** 
rad8         0.1939560  0.0600758   3.229 0.001329 ** 
rad24        0.3504602  0.0720382   4.865 1.55e-06 ***
tax         -0.0005052  0.0001569  -3.221 0.001365 ** 
ptratio     -0.0370069  0.0058188  -6.360 4.67e-10 ***
black        0.0004148  0.0001072   3.871 0.000123 ***
lstat       -0.0291645  0.0020395 -14.300  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.189 on 485 degrees of freedom
Multiple R-squared:  0.7946,  Adjusted R-squared:  0.7861 
F-statistic: 93.81 on 20 and 485 DF,  p-value: < 2.2e-16
        
        

Below we see the terminal output from our bptest(fit1) script, which tests for heteroskedasticity. The null hypothesis of the Breusch-Pagan test is homoskedasticity, and the p-value we obtain is very small, so we reject that null: the errors appear to be heteroskedastic. As we mentioned earlier, this violates one of the linear regression assumptions, so we should treat the reported standard errors and p-values with some caution while we continue refining the model.

TERMINAL OUTPUT

 > bptest(fit1) # checking for homoskedasticity

  studentized Breusch-Pagan test

data:  fit1
BP = 77.592, df = 20, p-value = 1.002e-08
        
        

Below we see the terminal output from our vif(fit1) script, which tells us about the multicollinearity in our model. VIF values higher than 5 are generally considered problematic, so we drop the offending variables from the model. Among the non-factor variables, tax has the highest VIF at roughly 9.88 (rad24 is higher still, but it is a single level of the rad factor), so we drop tax first.

TERMINAL OUTPUT

        
> vif(fit1)
   crim      zn   indus   chas1     nox      rm     age     
 1.8119  2.4962  4.3705  1.1018  4.6729  2.0076  3.1776 
dis    rad2    rad3    rad4    rad5    rad6
 4.1436  2.2654  2.8445  5.5092  5.8824  2.3996 
   rad7    rad8   rad24     tax ptratio   black   lstat 
 1.8410  2.3090 14.1690  9.8769  2.2427  1.3525  2.9976             
            
        

Model 2: Removing Collinear Variable (tax)

Now onto our second model; this time we remove the tax variable because of its high multicollinearity. We use the update() method to recycle our first model while removing tax, then run vif() and summary() again to diagnose.

        
fit2 <- update(fit1, ~ . - tax)
vif(fit2)
summary(fit2)
        
        

We can see below that rad24 still suggests multicollinearity since its value is > 5, but since we cannot remove a single level of a factor variable, we will drop rad entirely.

TERMINAL OUTPUT

        
> fit2 <- update(fit1, ~ . - tax)
> vif(fit2) # Checking again for more multicollinearity 
   crim      zn   indus   chas1     nox      rm     age      
 1.8116  2.4222  3.5828  1.0874  4.6638  2.0065  3.1689  
dis    rad2    rad3    rad4    rad5    rad6    rad7
 4.1359  2.2251  2.8287  5.4929  5.8445  2.3074  1.8366
   rad8   rad24 
 2.2972  8.3218 
ptratio   black   lstat 
 2.2359  1.3516  2.9974 

        
        

Below we get a summary of our second model. This time we are looking for variables that are not statistically significant so we can remove them. By convention, variables with a p-value larger than .05 are considered not statistically significant. Notice the p-value in the terminal output below, listed as Pr(>|t|) for each variable. Let's start by dropping age and indus.

TERMINAL OUTPUT

        
> summary(fit2)

Call:
lm(formula = log(medv) ~ crim + zn + indus + chas + nox + rm + 
    age + dis + rad + ptratio + black + lstat, data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.73162 -0.10219 -0.01228  0.09216  0.86447 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.9170473  0.2190964  17.878  < 2e-16 ***
crim        -0.0101787  0.0013289  -7.659 1.02e-13 ***
zn           0.0011936  0.0005667   2.106 0.035711 *  
indus       -0.0007300  0.0023432  -0.312 0.755519    
chas1        0.1130736  0.0348677   3.243 0.001264 ** 
nox         -0.7951695  0.1582759  -5.024 7.12e-07 ***
rm           0.0881738  0.0171214   5.150 3.79e-07 ***
age          0.0001085  0.0005371   0.202 0.839957    
dis         -0.0522585  0.0082022  -6.371 4.35e-10 ***
rad2         0.1089429  0.0595399   1.830 0.067900 .  
rad3         0.1895977  0.0541429   3.502 0.000505 ***
rad4         0.0918609  0.0482082   1.906 0.057305 .  
rad5         0.1219047  0.0489441   2.491 0.013082 *  
rad6         0.0589410  0.0583737   1.010 0.313133    
rad7         0.1976450  0.0638115   3.097 0.002066 ** 
rad8         0.1801044  0.0604967   2.977 0.003055 ** 
rad24        0.2014158  0.0557378   3.614 0.000333 ***
ptratio     -0.0380353  0.0058658  -6.484 2.20e-10 ***
black        0.0004239  0.0001081   3.920 0.000101 ***
lstat       -0.0291049  0.0020590 -14.136  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1908 on 486 degrees of freedom
Multiple R-squared:  0.7902,  Adjusted R-squared:  0.782 
F-statistic: 96.34 on 19 and 486 DF,  p-value: < 2.2e-16
            
        

Model 3: Removing Non-Significant Variables (age, indus, rad)

Now we run our third model. For this iteration we have removed the tax, age, indus, and rad variables.

        
fit3 <- update(fit2, ~ . - age - indus - rad)
summary(fit3)
        
        

The summary of our model tells us that the least significant remaining variable is zn, with a p-value of about 0.11; every other variable is highly significant. We stop dropping variables here and move on to another set of diagnostics.

TERMINAL OUTPUT

        
> summary(fit3)

Call:
lm(formula = log(medv) ~ crim + zn + chas + nox + rm + dis + 
    ptratio + black + lstat, data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71003 -0.10155 -0.01094  0.09083  0.89539 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.7901595  0.1962986  19.308  < 2e-16 ***
crim        -0.0083340  0.0012238  -6.810 2.83e-11 ***
zn           0.0008755  0.0005407   1.619 0.106052    
chas1        0.1195298  0.0349829   3.417 0.000686 ***
nox         -0.6852309  0.1302350  -5.261 2.13e-07 ***
rm           0.1061611  0.0164243   6.464 2.45e-10 ***
dis         -0.0488011  0.0075982  -6.423 3.14e-10 ***
ptratio     -0.0333366  0.0047273  -7.052 5.96e-12 ***
black        0.0003735  0.0001083   3.449 0.000610 ***
lstat       -0.0287382  0.0019479 -14.753  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1947 on 496 degrees of freedom
Multiple R-squared:  0.7772,  Adjusted R-squared:  0.7731 
F-statistic: 192.2 on 9 and 496 DF,  p-value: < 2.2e-16
        
        

Diagnosing for Outliers and High-Leverage Points

Now we will run our model through outlierTest() to check for outliers and high-leverage points in our data. Linear regression is sensitive to such points: a handful of extreme observations can pull the fitted coefficients away from the bulk of the data.

        
outlierTest(fit3, cutoff = Inf, n.max = 15)
        
        

Below we see the terminal output for the outlier test. Observations whose studentized residuals exceed 3 in absolute value are generally considered problematic, so we remove them from the data before refitting.

TERMINAL OUTPUT

        
> outlierTest(fit3, cutoff = Inf, n.max = 15)
     rstudent unadjusted p-value Bonferonni p
413  4.828866         1.8331e-06   0.00092755
372  4.415447         1.2386e-05   0.00626750
369  4.054746         5.8286e-05   0.02949300
373  4.044845         6.0722e-05   0.03072500
402 -3.716832         2.2476e-04   0.11373000
375  3.606444         3.4192e-04   0.17301000
490 -3.579379         3.7835e-04   0.19144000
506 -3.449505         6.0959e-04   0.30845000
215  3.439297         6.3248e-04   0.32004000
401 -3.403025         7.2048e-04   0.36456000
368  3.033191         2.5469e-03           NA
410  2.991901         2.9108e-03           NA
398 -2.819016         5.0101e-03           NA
400 -2.775641         5.7179e-03           NA
366  2.702446         7.1196e-03           NA                
        
        

We see that rows 413, 372, 369, 373, 402, 375, 490, 506, 215, and 401 meet the criteria for removal, as their studentized residuals have absolute values above 3.
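
The same criterion can also be applied programmatically with base R's rstudent() (a small sketch; it may flag a borderline row or two beyond those listed above):

# Indices of observations with |studentized residual| > 3 in fit3
which(abs(rstudent(fit3)) > 3)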

Model 4: Removing Outliers and High-Leverage Points

Next we update our data frame by removing the outlier observations, declare our fourth model, and get residual diagnostics plots using the par() and plot() methods to test our model.

boston.df <- boston.df[-c(413, 372, 369, 373, 402, 375, 490, 506, 215, 401),]

fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)
summary(fit4)

Here we get the summary of our model. It looks like lstat is our most significant variable, along with rm. Our R-squared has risen to 0.829; the closer it is to 1, the larger the share of variation in the response our model explains. Our residual standard error (RSE) is fairly small at 0.1651, which is good. Let's see if we can do better.

TERMINAL OUTPUT

            
> fit4 <- lm(log(medv) ~ . - tax - age - indus - rad, data = boston.df)
> summary(fit4)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad, data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54154 -0.09775 -0.01559  0.08727  0.66039 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.456e+00  1.693e-01  20.411  < 2e-16 ***
crim        -8.426e-03  1.047e-03  -8.046 6.61e-15 ***
zn           6.405e-04  4.607e-04   1.390  0.16511    
chas1        9.297e-02  3.009e-02   3.090  0.00212 ** 
nox         -5.592e-01  1.121e-01  -4.987 8.54e-07 ***
rm           1.329e-01  1.426e-02   9.320  < 2e-16 ***
dis         -4.128e-02  6.521e-03  -6.330 5.58e-10 ***
ptratio     -3.229e-02  4.040e-03  -7.993 9.64e-15 ***
black        4.811e-04  9.358e-05   5.141 3.98e-07 ***
lstat       -2.786e-02  1.740e-03 -16.014  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1651 on 486 degrees of freedom
Multiple R-squared:  0.829, Adjusted R-squared:  0.8259 
F-statistic: 261.8 on 9 and 486 DF,  p-value: < 2.2e-16
            
        

But first, let's take a look at a few residual diagnostic plots for our fourth model. To learn more about residual diagnostics, visit this r-bloggers article. Given the curve in our Residuals vs Fitted plot below, we can infer that there is non-linearity in our data.

autoplot(fit4, label.size = 3, shape = 21, 
         data=boston.df, colour='turquoise4') + 
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray")) + 
  ggtitle("Residual Diagnostics for fit4")

For this reason we will include non-linear terms in our model in the next couple of iterations.

Diagnosing Residuals (Checking for Normality of Residuals)

Another assumption of linear regression is normality of the residual terms. One way to test this is with the studres() method from MASS together with base R's hist() and lines(); for the presentation we build the equivalent visual with ggplot2's geom_histogram() and geom_density().
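
A base R version of this check might look like the following sketch (studres() comes from the MASS package loaded earlier):

# Histogram of studentized residuals for fit4 with a density curve overlaid
sresid <- studres(fit4)
hist(sresid, freq = FALSE, main = "Studentized Residuals for fit4")
lines(density(sresid), col = "turquoise4")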

        
residBar <- ggplot(data=fit4, aes(residuals(fit4))) + 
  geom_histogram(aes(y =..density..), 
                 col="black", fill="white") +
  geom_density(col=1) +
  theme(panel.background = element_rect(fill = "gray98"),
        panel.grid.minor = element_blank(),
        axis.line   = element_line(colour="gray"),
        axis.line.x = element_line(colour="gray")) +
  ggtitle("Histogram of Residuals for fit4") 
residBar

        
        

Here we see that the residuals of our model display an approximately normal distribution, which is good, as this is a linear regression assumption.

Model 5: Adding Non-Linear Terms

Now onto the fifth iteration of our model, this time with a non-linear term. Here we include a squared lstat term by adding I(lstat^2) to the model. The residual diagnostics for model 4 suggested a non-linear relationship between the response and the predictors, and we could already sense a non-linear relationship between lstat and medv in the exploratory analysis phase.

        
fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
summary(fit5)
        
        

In our model summary below we see that I(lstat^2) is indeed statistically significant, more so than most other explanatory variables. Our R-squared has increased to 0.8341 (which is good) and our RSE has decreased to 0.1628 (which is also good).

TERMINAL OUTPUT

            
> fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
> summary(fit5)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2), 
    data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59975 -0.09465 -0.01432  0.09241  0.64969 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.583e+00  1.701e-01  21.057  < 2e-16 ***
crim        -8.997e-03  1.043e-03  -8.625  < 2e-16 ***
zn           3.381e-04  4.610e-04   0.733 0.463687    
chas1        9.074e-02  2.967e-02   3.058 0.002352 ** 
nox         -5.111e-01  1.113e-01  -4.594 5.55e-06 ***
rm           1.208e-01  1.440e-02   8.392 5.28e-16 ***
dis         -4.047e-02  6.433e-03  -6.292 7.02e-10 ***
ptratio     -2.979e-02  4.036e-03  -7.382 6.84e-13 ***
black        4.589e-04  9.245e-05   4.964 9.59e-07 ***
lstat       -4.522e-02  4.809e-03  -9.403  < 2e-16 ***
I(lstat^2)   5.083e-04  1.316e-04   3.864 0.000127 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1628 on 485 degrees of freedom
Multiple R-squared:  0.8341,  Adjusted R-squared:  0.8307 
F-statistic: 243.9 on 10 and 485 DF,  p-value: < 2.2e-16
            
        

Given the significance of I(lstat^2) it may make sense to include a third-level polynomial for lstat. Let's try that in the next model. But first, let's take a look at our Residuals vs Fitted plot.

autoplot(fit5, label.size = 3, shape = 21, 
         data=boston.df, colour='turquoise4') + 
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray")) + 
  ggtitle("Residual Diagnostics for fit5")

We can see that our model is doing a little better as far as linearity is concerned, compared to our previous model.

Model 6: More Non-Linearity

Now onto model 6. Here we are adding I(lstat^3) to our model given the statistical significance of I(lstat^2). We will run some diagnostics to see if this makes sense or if we have gone too far.

        
fit6 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + I(lstat^3), data = boston.df)
summary(fit6)
        
        

Given its p-value, it looks like I(lstat^3) does not add much to the model. The R-squared inches up to 0.8343, but the residual standard error ticks up slightly to 0.1629 and the adjusted R-squared falls from 0.8307 to 0.8305. It is hard to call this an improvement: we would be trading simplicity (interpretability) for essentially no gain in accuracy.

TERMINAL OUTPUT

            
> fit6 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + I(lstat^3), data = boston.df)
> summary(fit6)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2) + 
    I(lstat^3), data = boston.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59756 -0.09290 -0.01524  0.09406  0.64617 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.639e+00  1.898e-01  19.167  < 2e-16 ***
crim        -8.998e-03  1.044e-03  -8.622  < 2e-16 ***
zn           2.845e-04  4.682e-04   0.608  0.54365    
chas1        8.922e-02  2.978e-02   2.996  0.00287 ** 
nox         -5.158e-01  1.115e-01  -4.624 4.83e-06 ***
rm           1.176e-01  1.521e-02   7.732 6.18e-14 ***
dis         -4.024e-02  6.446e-03  -6.243 9.42e-10 ***
ptratio     -2.996e-02  4.046e-03  -7.405 5.86e-13 ***
black        4.606e-04  9.254e-05   4.977 8.98e-07 ***
lstat       -5.291e-02  1.249e-02  -4.236 2.72e-05 ***
I(lstat^2)   1.026e-03  7.870e-04   1.304  0.19296    
I(lstat^3)  -9.954e-06  1.492e-05  -0.667  0.50497    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1629 on 484 degrees of freedom
Multiple R-squared:  0.8343,  Adjusted R-squared:  0.8305 
F-statistic: 221.5 on 11 and 484 DF,  p-value: < 2.2e-16
            
        

Observing our residual diagnostics, there seems to be little difference between models 5 and 6.

autoplot(fit6, label.size = 3, shape = 21, 
         data=boston.df, colour='turquoise4') + 
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray")) + 
  ggtitle("Residual Diagnostics for fit6")

Now we are ready to move on to the model diagnostics stage, to determine which of our models rank best at estimating our response variable.

4. Cross Validation

In this stage we determine which of our models is best at estimating our response variable while not overfitting our data. For this we use the CVlm() method. We will be comparing models 4, 5, and 6.

Cross Validation for Model 4

First we run our fourth model through the CVlm() method. We also get a set of attributes of our cross validation object.

        
par(mfrow = c(1,1))
fit4_CV <- suppressWarnings(CVlm(data = boston.df, form.lm = formula(fit4),
                                 m = 10, dots = FALSE, seed = 1,
                                 plotit = c("Observed", "Residual"),
                                 main = "Cross-validation for fit4",
                                 legend.pos = "topleft"))
attr(fit4_CV, 'ms')
        
        

Below is the cross validation plot of our fourth model. R outputs this plot when calling the CVlm() method. Roughly speaking, we want the fold symbols to sit as close as possible to the fitted lines on the plot. We then call attr() to extract the mean squared error of the fit.

Below is the attr() output for our model 4 cross validation object.

TERMINAL OUTPUT

> attr(fit4_CV, 'ms')
[1] 0.0284 

The relevant statistic we gather from the CVlm() output is the mean squared error (MSE). It is the average squared prediction error over the held-out folds (here 10) that CVlm() carves out of our data, so the smaller the MSE, the better. Our MSE for model 4 is 0.0284 so far. Let's look at models 5 and 6.
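
For intuition (a small sketch), the in-sample mean squared error can be computed directly from the model residuals; the cross-validated MSE is typically somewhat larger because it is estimated on held-out folds:

# In-sample MSE of fit4; expect the 10-fold CV estimate above to be a bit larger
mean(residuals(fit4)^2)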

Cross Validation for Model 5

We run our fifth model through the CVlm() method. We also get a set of attributes of our cross validation object.

        
par(mfrow = c(1,1))
fit5_CV <- suppressWarnings(CVlm(data = boston.df, form.lm = formula(fit5),
                                 m = 10, dots = FALSE, seed = 1,
                                 plotit = c("Observed", "Residual"),
                                 main = "Cross-validation for fit5",
                                 legend.pos = "topleft"))
attr(fit5_CV, 'ms')
        
        

Below is the cross validation plot of our fifth model, and its attributes.

Below is the attr() output for our model 5 cross validation object, again specifically the mean squared error.

TERMINAL OUTPUT

> attr(fit5_CV, 'ms')
[1] 0.0278

Our MSE for model 5 is 0.0278, definitely better than that of model 4.

Cross Validation for Model 6

We run our sixth model through the CVlm() method. We also get a set of attributes of our cross validation object.

        
par(mfrow = c(1,1))
fit6_CV <- suppressWarnings(CVlm(data = boston.df, form.lm = formula(fit6),
                                 m = 10, dots = FALSE, seed = 1,
                                 plotit = c("Observed", "Residual"),
                                 main = "Cross-validation for fit6",
                                 legend.pos = "topleft"))
attr(fit6_CV, 'ms')

Below is the cross validation plot of our sixth model, and its attributes.

Below is the attr() output for our model 6 cross validation object.

TERMINAL OUTPUT

> attr(fit6_CV, 'ms')
[1] 0.028
 

Our MSE for model 6 is 0.028, very close to the other two, but fit5 still has the lowest MSE, making it our best model.

5. Conclusions

With an R-squared value of 0.834 and a Mean Squared Error of 0.0278, model 5 is the winner. Let's take another look at a summary of model 5:

TERMINAL OUTPUT

            
> fit5 <- lm(log(medv) ~ . - tax - age - indus - rad + I(lstat^2), data = boston.df)
> summary(fit5)

Call:
lm(formula = log(medv) ~ . - tax - age - indus - rad + I(lstat^2), 
    data = boston.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5998 -0.0947 -0.0143  0.0924  0.6497 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.58e+00   1.70e-01   21.06  < 2e-16 ***
crim        -9.00e-03   1.04e-03   -8.63  < 2e-16 ***
zn           3.38e-04   4.61e-04    0.73  0.46369    
chas1        9.07e-02   2.97e-02    3.06  0.00235 ** 
nox         -5.11e-01   1.11e-01   -4.59  5.6e-06 ***
rm           1.21e-01   1.44e-02    8.39  5.3e-16 ***
dis         -4.05e-02   6.43e-03   -6.29  7.0e-10 ***
ptratio     -2.98e-02   4.04e-03   -7.38  6.8e-13 ***
black        4.59e-04   9.24e-05    4.96  9.6e-07 ***
lstat       -4.52e-02   4.81e-03   -9.40  < 2e-16 ***
I(lstat^2)   5.08e-04   1.32e-04    3.86  0.00013 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.163 on 485 degrees of freedom
Multiple R-squared:  0.834, Adjusted R-squared:  0.831 
F-statistic:  244 on 10 and 485 DF,  p-value: < 2e-16
            
        

For more residual diagnostics that can be useful in your analysis, the autoplot() function from ggfortify (built on ggplot2) produces the full set of diagnostic plots as more aesthetically pleasing visuals:

autoplot(fit5, which = 1:6, ncol = 3, label.size = 1, 
         colour='turquoise4') + 
  theme(panel.background = element_rect(fill = "gray98"),
        axis.line   = element_line(colour="black"),
        axis.line.x = element_line(colour="gray"),
        axis.line.y = element_line(colour="gray"))

Model 5 suggests that lstat, I(lstat^2), rm, and ptratio are the most relevant explanatory variables in estimating medv. It is worth mentioning that our residual diagnostics suggested a certain amount of non-linearity that is not fully addressed by our model. This means there might be other methods that do a better job at estimating medv than lm().

However, lm() is a fundamental model in statistical learning for estimation and prediction. Understanding this type of modeling allows for the understanding of more sophisticated models.
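
Finally, here is a hedged sketch (not part of the original write-up) of how fit5 might be used to generate a prediction. The model was fit on log(medv), so we exponentiate the prediction to get back to $1000's; the example simply reuses the first row of our data frame.

# Predict log(medv) for one observation, back-transform, and compare
pred.log <- predict(fit5, newdata = boston.df[1, ])
exp(pred.log)       # predicted median value in $1000's
boston.df$medv[1]   # actual median value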

CONGRATULATIONS!

Congratulations for getting this far! We hope you enjoyed this project. Please reach out to us here if you have any feedback or would like to publish your own project.

Fork this project on GitHub.

Try this project next:


Iris Flower Classification (R)


KNN Analysis of Iris Flowers with R