This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard's CS109 course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.
Linear regression is used to model and predict continuous outcomes with normal random errors. There are nearly infinitely many types of regression models, and each is typically defined by the distribution of its prediction errors (called "residuals") and by the type of data being modeled. Logistic regression is used to model binary outcomes, whereas Poisson regression is used to predict counts. In this exercise, we'll see some examples of linear regression as well as train-test splits.
The packages we'll cover are: statsmodels, seaborn, and scikit-learn. While we don't explicitly teach statsmodels and seaborn in the Springboard workshop, those are great libraries to know.
In [1]:
# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
Given a dataset containing predictor variables $X$ and an outcome/response variable $Y$, linear regression can be used to model and predict the relationship between them.
Linear regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, or predictors) and a dependent variable $Y$. The method assumes that each predictor in $X$ is linearly related to the dependent variable $Y$. The most basic linear regression model contains a single independent variable $X$; we'll call this the simple model.
$$ Y = \beta_0 + \beta_1 X + \epsilon$$where $\epsilon$ is an unobservable random variable that adds noise to the linear relationship. In linear regression, $\epsilon$ is assumed to be normally distributed with a mean of 0. In other words, on average the error is zero: roughly as many predictions $\hat{Y}$ will fall above the true $Y$ as below it. The residuals $\epsilon$ are also assumed to be "i.i.d.": independently and identically distributed. Independence means that the residuals are not correlated -- the residual from one prediction has no effect on the residual from another prediction. Correlated errors are common in time series analysis and spatial analyses.
$\beta_0$ is the intercept of the linear model and represents the average of $Y$ when all independent variables $X$ are set to 0.
$\beta_1$ is the slope of the line associated with the regression model and represents the average effect of a one-unit increase in $X$ on $Y$.
Back to the simple model. The model in linear regression is that the conditional mean of $Y$ given the values of $X$ is a linear function of $X$.
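To make these assumptions concrete, here is a minimal simulation sketch (not part of the original lab); the intercept 1.5, slope 0.8, and noise scale 2 are arbitrary choices for illustration:

# Hypothetical illustration: simulate Y = beta_0 + beta_1 * X + epsilon
# with epsilon drawn from a normal distribution with mean 0
rng = np.random.RandomState(0)
x_sim = rng.uniform(0, 10, size=200)
eps = rng.normal(loc=0, scale=2, size=200)   # epsilon ~ N(0, 2^2)
y_sim = 1.5 + 0.8 * x_sim + eps              # arbitrary beta_0 = 1.5, beta_1 = 0.8
print(eps.mean())                            # close to 0: the errors average out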
Least squares is a method that can estimate the coefficients of a linear model by minimizing the squared residuals:
$$ \mathscr{L} = \sum_{i=1}^N \epsilon_i^2 = \sum_{i=1}^N \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^N \left(y_i - \left(\beta_0 + \beta_1 x_i\right)\right)^2 $$where $N$ is the number of observations and $\epsilon_i$ represents a residual or error, ACTUAL - PREDICTED.
We want to minimize the squared residuals and solve for $\hat{\beta_0}$ so we take the partial derivative of $\mathscr{L}$ with respect to $\hat{\beta_0}$
$ \begin{align} \frac{\partial \mathscr{L}}{\partial \hat{\beta}_0} &= \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^N \epsilon_i^2 \\ &= \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^N \left( y_i - \hat{y}_i \right)^2 \\ &= \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right)^2 \\ &= -2 \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right) \hspace{25mm} \mbox{(by chain rule)} \\ &= -2 \sum_{i=1}^N \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) \\ &= -2 \left[ \left( \sum_{i=1}^N y_i \right) - N \hat{\beta}_0 - \hat{\beta}_1 \left( \sum_{i=1}^N x_i \right) \right] \\ & 2 \left[ N \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^N x_i - \sum_{i=1}^N y_i \right] = 0 \hspace{20mm} \mbox{(set equal to 0 and solve for $\hat{\beta}_0$)} \\ & N \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^N x_i - \sum_{i=1}^N y_i = 0 \\ & N \hat{\beta}_0 = \sum_{i=1}^N y_i - \hat{\beta}_1 \sum_{i=1}^N x_i \\ & \hat{\beta}_0 = \frac{\sum_{i=1}^N y_i - \hat{\beta}_1 \sum_{i=1}^N x_i}{N} \\ & \hat{\beta}_0 = \frac{\sum_{i=1}^N y_i}{N} - \hat{\beta}_1 \frac{\sum_{i=1}^N x_i}{N} \\ & \boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}} \end{align} $
Using this new information, we can compute the estimate for $\hat{\beta}_1$ by taking the partial derivative of $\mathscr{L}$ with respect to $\hat{\beta}_1$.
$ \begin{align} \frac{\partial \mathscr{L}}{\partial \hat{\beta}_1} &= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^N \epsilon_i^2 \\ &= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^N \left( y_i - \hat{y}_i \right)^2 \\ &= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right)^2 \\ &= 2 \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right) \left( -x_i \right) \hspace{25mm}\mbox{(by chain rule)} \\ &= -2 \sum_{i=1}^N x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) \\ &= -2 \sum_{i=1}^N \left( x_i y_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2 \right) \\ &= -2 \sum_{i=1}^N \left( x_i y_i - \left( \bar{y} - \hat{\beta}_1 \bar{x} \right) x_i - \hat{\beta}_1 x_i^2 \right) \\ &= -2 \sum_{i=1}^N \left( x_i y_i - \bar{y}x_i + \hat{\beta}_1\bar{x}x_i - \hat{\beta}_1 x_i^2 \right) \\ &= -2 \left[ \sum_{i=1}^N x_i y_i - \bar{y} \sum_{i=1}^N x_i + \hat{\beta}_1\bar{x} \sum_{i=1}^N x_i - \hat{\beta}_1 \sum_{i=1}^N x_i^2 \right] \\ &= -2 \left[ \hat{\beta}_1 \left\{ \bar{x} \sum_{i=1}^N x_i - \sum_{i=1}^N x_i^2 \right\} + \left\{ \sum_{i=1}^N x_i y_i - \bar{y} \sum_{i=1}^N x_i \right\}\right] \\ & 2 \left[ \hat{\beta}_1 \left\{ \sum_{i=1}^N x_i^2 - \bar{x} \sum_{i=1}^N x_i \right\} + \left\{ \bar{y} \sum_{i=1}^N x_i - \sum_{i=1}^N x_i y_i \right\} \right] = 0 \\ & \hat{\beta}_1 = \frac{-\left( \bar{y} \sum_{i=1}^N x_i - \sum_{i=1}^N x_i y_i \right)}{\sum_{i=1}^N x_i^2 - \bar{x}\sum_{i=1}^N x_i} \\ &= \frac{\sum_{i=1}^N x_i y_i - \bar{y} \sum_{i=1}^N x_i}{\sum_{i=1}^N x_i^2 - \bar{x} \sum_{i=1}^N x_i} \\ & \boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^N x_i y_i - N \bar{x}\bar{y}}{\sum_{i=1}^N x_i^2 - N \bar{x}^2}} \end{align} $
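As a quick sanity check on the boxed formulas, here is a small sketch on made-up data (the true coefficients 2.0 and 3.0 are arbitrary); the hand-computed estimates should agree with numpy's built-in least-squares fit:

# Closed-form estimates of beta_0 and beta_1 on illustrative data
rng = np.random.RandomState(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)
n = len(x)
beta1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)
print(np.polyfit(x, y, deg=1))   # returns [slope, intercept]; should match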
The solution can be written in compact matrix notation as
$$\hat\beta = (X^T X)^{-1}X^T Y$$We wanted to show you this in case you remember linear algebra. In order for this solution to exist, we need $X^T X$ to be invertible, which requires a few extra assumptions: $X$ must be of full rank so that $X^T X$ is invertible, i.e. its columns must be linearly independent. This has a loose relationship to the variables and the observations being independent, respectively. It matters for us because having redundant features in our regression models will lead to poorly fitting (and unstable) models. We'll see an implementation of this in the extra linear regression example.
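Here is a minimal numpy sketch of the normal-equation solution on made-up data (the coefficients are arbitrary). It also shows how a duplicated column makes $X^T X$ rank-deficient, which is the redundant-feature problem mentioned above:

# Normal equations: beta_hat = (X^T X)^{-1} X^T y, solved without an explicit inverse
rng = np.random.RandomState(2)
X_mat = np.column_stack([np.ones(50), rng.randn(50, 2)])   # intercept column + 2 predictors
y_vec = X_mat @ np.array([1.0, 2.0, -0.5]) + rng.randn(50)
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)
print(beta_hat)

# Duplicating a predictor makes the columns linearly dependent:
X_bad = np.column_stack([X_mat, X_mat[:, 1]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))   # 3, even though there are 4 columns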
The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.
sklearn
This data set is available in the sklearn python module which is how we will access it today.
In [2]:
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
In [3]:
boston.keys()
Out[3]:
In [4]:
boston.data.shape
Out[4]:
In [5]:
# Print column names
print(boston.feature_names)
In [6]:
# Print description of Boston housing data set
print(boston.DESCR)
Now let's explore the data set itself.
In [7]:
bos = pd.DataFrame(boston.data)
bos.head()
Out[7]:
There are no column names in the DataFrame. Let's add those.
In [8]:
bos.columns = boston.feature_names
bos.head()
Out[8]:
Now we have a pandas DataFrame called bos containing all the data we want to use to predict Boston Housing prices. Let's create a variable called PRICE which will contain the prices. This information is contained in the target data.
In [9]:
print(boston.target.shape)
In [10]:
bos['PRICE'] = boston.target
bos.head()
Out[10]:
In [11]:
bos.describe()
Out[11]:
In [12]:
plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")
Out[12]:
Exercise: What kind of relationship do you see? e.g. positive, negative? linear? non-linear? Is there anything else strange or interesting about the data? What about outliers?
Exercise: Create scatter plots between *RM* and *PRICE*, and *PTRATIO* and *PRICE*. Label your axes appropriately using human readable labels. Tell a story about what you see.
Exercise: What are some other numeric variables of interest? Why do you think they are interesting? Plot scatterplots with these variables and *PRICE* (house price) and tell a story about what you see.
In [13]:
# your turn: describe relationship
Most of the data points have a crime rate between 0 and 20, and many have a crime rate at or near 0. The plot also shows that the highest crime rates occur in areas with lower housing prices. However, there are also areas with both low crime rates and lower housing prices. At the highest housing prices, the crime rate ranges from 0 to 20. Overall, the relationship appears non-linear.
In [14]:
# your turn: scatter plot between *RM* and *PRICE*
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average Number of Rooms per Dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between RM and Price")
Out[14]:
There is a clear positive, roughly linear relationship between the number of rooms and housing price. As the number of rooms increases, the housing price also increases. There are outliers that do not follow this trend, which is most likely due to location; it's expected that certain locations have higher or lower housing prices.
In [15]:
# your turn: scatter plot between *PTRATIO* and *PRICE*
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-Teacher Ratio by Town (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRATIO and Price")
Out[15]:
By visually inspecting this plot, it is difficult to determine a relationship. There does not seem to be a strong correlation between the pupil-teacher ratio and housing price.
In [16]:
# your turn: create some other scatter plots
Several variables would be interesting to investigate in relation to housing prices. Access to employment centres and tax rates may have important relationships with housing prices, so we will plot those next.
In [17]:
# your turn: scatter plot between *DIS* and *PRICE*
plt.scatter(bos.DIS, bos.PRICE)
plt.xlabel("Weighted Distances to Five Boston Employment Centres (DIS)")
plt.ylabel("Housing Price")
plt.title("Relationship between DIS and Price")
Out[17]:
There seems to be a logarithmic trend between the distance to employment centres and housing prices. Many of the lower-priced houses are near employment centres. Prices are typically higher as the distance increases, but not dramatically so.
In [18]:
plt.scatter(bos.TAX, bos.PRICE)
plt.xlabel("Full-Value Property-Tax Rate per $10,000 (TAX)")
plt.ylabel("Housing Price")
plt.title("Relationship between TAX and Price")
Out[18]:
It is difficult to determine a relationship from this visual. What is interesting are the data points that are stacked on the right side of the graph, around a tax rate of 670.
Seaborn is a cool Python plotting library built on top of matplotlib. It provides convenient syntax and shortcuts for many common types of plots, along with better-looking defaults.
We can also use seaborn regplot for the scatterplot above. This provides automatic linear regression fits (useful for data exploration later on). Here's one example below.
In [19]:
sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)
Out[19]:
In [20]:
plt.hist(np.log(bos.CRIM))
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()
Exercise: In the above histogram, we took the logarithm of the crime rate per capita. Repeat this histogram without taking the log. What was the purpose of taking the log? What do we gain by making this transformation? What do you now notice about this variable that is not obvious without making the transformation?
Exercise: Plot the histogram for *RM* and *PTRATIO* against each other, along with the two variables you picked in the previous section. We are looking for correlations in predictors here.
In [21]:
#your turn
In [22]:
plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()
Since most of the data is near 0 and the range of the data is large, it is difficult to compare the bars of the histogram. Rescaling the values by taking the log gives a better picture of the whole distribution.
In [23]:
plt.hist(bos.RM)
plt.hist(bos.PTRATIO)
plt.title("RM and PTRATIO")
plt.ylabel("Frequencey")
plt.legend(["RM", "PTRATIO"], loc='best')
plt.show()
In [24]:
plt.hist(bos.DIS)
plt.hist(bos.TAX)
plt.title("DIS and TAX")
plt.ylabel("Frequencey")
plt.legend(["DIS", "TAX"], loc='best')
plt.show()
There does not seem to be strong correlations in the graphs above.
Here,
$Y$ = Boston housing prices (called "target" data in python, and referred to as the dependent variable or response variable)
and
$X$ = all the other features (or independent variables, predictors or explanatory variables)
which we will use to fit a linear regression model and predict Boston housing prices. We will use the least-squares method to estimate the coefficients.
We'll use two ways of fitting a linear regression. We recommend the first, but the second is also powerful in its features.
statsmodels
Statsmodels is a great Python library for a lot of basic and inferential statistics. It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians. While we don't cover statsmodels officially in the Data Science Intensive workshop, it's a good library to have in your toolbox. Here's a quick example of what you could do with it. The version of least-squares we will use in statsmodels is called ordinary least-squares (OLS). There are many other versions of least-squares such as partial least squares (PLS) and weighted least squares (WLS).
In [25]:
# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
In [26]:
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
print(m.summary())
There is a ton of information in this output. But we'll concentrate on the coefficient table (middle table). We can interpret the RM coefficient (9.1021) by first noticing that the p-value (under P>|t|) is so small it is basically zero. This means that the number of rooms, RM, is a statistically significant predictor of PRICE. The regression coefficient for RM of 9.1021 means that, on average, each additional room is associated with an increase of $\$9,100$ in house price, net of the other variables. The confidence interval gives us a range of plausible values for this average change, about ($\$8,279, \$9,925$), definitely not chump change.
In general, the $\hat{\beta_i}, i > 0$ can be interpreted as the following: "A one unit increase in $x_i$ is associated with, on average, a $\hat{\beta_i}$ increase/decrease in $y$ net of all other variables."
On the other hand, the interpretation for the intercept, $\hat{\beta}_0$ is the average of $y$ given that all of the independent variables $x_i$ are 0.
statsmodels formulas
This formula notation will seem familiar to R users, but will take some getting used to for people coming from other languages or who are new to statistics.
The formula gives instruction for a general structure for a regression call. For statsmodels (ols or logit) calls, you need a pandas DataFrame with column names that you will add to your formula. In the example below you need a pandas DataFrame that includes the columns named (Outcome, X1, X2, ...), but you don't need to build a new DataFrame for every regression; use the same DataFrame with all these columns in it. The structure is very simple:
Outcome ~ X1
But of course we want to be able to handle more complex models; for example, multiple regression is done like this:
Outcome ~ X1 + X2 + X3
In general, a formula for an OLS multiple linear regression is
Y ~ X1 + X2 + ... + Xp
This is the very basic structure, but it should be enough to get you through the homework. Things can get much more complex. You can force statsmodels to treat variables as categorical with the C() function, call numpy functions to transform data, such as np.log for extremely skewed data, or fit a model without an intercept by including - 1 in the formula. For a quick run-down of further uses see the statsmodels help page.
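For instance, here is a hedged sketch of those formula options using the Boston data loaded above; treating CHAS as categorical and log-transforming CRIM are just illustrative choices, not part of the original lab:

# Categorical variable, numpy transform, and no-intercept fit via formulas
m_cat = ols('PRICE ~ C(CHAS) + RM', bos).fit()       # treat CHAS as categorical
m_log = ols('PRICE ~ np.log(CRIM) + RM', bos).fit()  # transform a skewed predictor
m_noint = ols('PRICE ~ RM - 1', bos).fit()           # drop the intercept
print(m_cat.params)
print(m_log.params)
print(m_noint.params)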
Let's see how our model actually fits our data. We can see below that there is a ceiling effect; we should probably look into that. Also, for large values of $Y$ we get underpredictions: most predictions fall below the 45-degree line.
Exercise: Create a scatterplot between the predicted prices, available in `m.fittedvalues` (where `m` is the fitted model) and the original prices. How does the plot look? Do you notice anything interesting or weird in the plot? Comment on what you see.
In [27]:
# your turn
In [28]:
sns.regplot(y=bos.PRICE, x=m.fittedvalues, fit_reg = True)
plt.xlabel("Predicted Prices")
plt.ylabel("Actual Prices")
plt.title("Predicted vs Actual Housing Prices")
Out[28]:
With an R-squared value of 0.484, this is roughly what you would expect the relationship between the predicted and actual values to look like. The regression line has a slope close to one, which indicates that the predictions are generally reasonable. We can also see some outliers where the actual price is much higher than the predicted price.
In [29]:
from sklearn.linear_model import LinearRegression
X = bos.drop('PRICE', axis = 1)
# This creates a LinearRegression object
lm = LinearRegression()
lm
Out[29]:
Check out the scikit-learn docs here. We have listed the main functions below. Most machine learning models in scikit-learn follow this same API: fit a model with fit, make predictions with predict, and use the appropriate scoring function score for each model.
Main functions | Description
---|---
lm.fit() | Fit a linear model
lm.predict() | Predict Y using the linear model with estimated coefficients
lm.score() | Returns the coefficient of determination ($R^2$), a measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model
In [30]:
# Look inside lm object
# lm.<tab>
Output | Description
---|---
lm.coef_ | Estimated coefficients
lm.intercept_ | Estimated intercept
In [31]:
# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)
Out[31]:
Exercise: How would you change the model to not fit an intercept term? Would you recommend not having an intercept? Why or why not? For more information on why to include or exclude an intercept, look [here](https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf).
Exercise: One of the assumptions of the linear model is that the residuals must be i.i.d. (independently and identically distributed). To satisfy this, is it enough that the residuals are normally distributed? Explain your answer.
Exercise: True or false. To use linear regression, $Y$ must be normally distributed. Explain your answer.
In [32]:
# your turn
Exercise: How would you change the model to not fit an intercept term? Would you recommend not having an intercept? Why or why not?
You can change the model to not fit an intercept term by setting the fit_intercept parameter to False. I would not recommend eliminating the intercept: forcing the fit through the origin would bias the other regression coefficients. Typically, you should always include an intercept unless you have a strong theoretical reason to believe the relationship passes through the origin.
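As a sketch, the scikit-learn version of that change might look like this (using the X and bos objects defined above):

# Fit the full model without an intercept term
lm_no_intercept = LinearRegression(fit_intercept=False)
lm_no_intercept.fit(X, bos.PRICE)
print(lm_no_intercept.intercept_)   # 0.0, since no intercept is estimated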
Exercise: One of the assumptions of the linear model is that the residuals must be i.i.d. (independently and identically distributed). To satisfy this, is it enough that the residuals are normally distributed? Explain your answer.
No. Normality alone is not enough: the residuals must also be independent of one another (e.g. not autocorrelated) and identically distributed, which includes having constant variance. A set of residuals can be normally distributed and still be correlated or have non-constant variance.
Exercise: True or false. To use linear regression, $Y$ must be normally distributed. Explain your answer.
This is false. The normality assumption for linear regression applies to the errors and not to the dependent variable.
Let's look at the estimated coefficients from the linear model using lm.intercept_ and lm.coef_.
After we have fit our linear regression model using the least squares method, we want to see what are the estimates of our coefficients $\beta_0$, $\beta_1$, ..., $\beta_{13}$:
$$ \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{13} $$
In [33]:
print('Estimated intercept coefficient: {}'.format(lm.intercept_))
In [34]:
print('Number of coefficients: {}'.format(len(lm.coef_)))
In [35]:
# The coefficients
pd.DataFrame({'features': X.columns, 'estimatedCoefficients': lm.coef_})[['features', 'estimatedCoefficients']]
Out[35]:
In [36]:
# first five predicted prices
lm.predict(X)[0:5]
Out[36]:
Exercise: Histogram: Plot a histogram of all the predicted prices. Write a story about what you see. Describe the shape, center and spread of the distribution. Are there any outliers? What might be the reason for them? Should we do anything special with them?
Exercise: Scatterplot: Let's plot the true prices compared to the predicted prices to see they disagree (we did this with `statsmodels` before).
Exercise: We have looked at fitting a linear model in both `statsmodels` and `scikit-learn`. What are the advantages and disadvantages of each based on your exploration? Based on the information provided by both packages, what advantage does `statsmodels` provide?
In [37]:
# your turn
In [38]:
plt.hist(lm.predict(X), bins='auto')
plt.xlabel("Predicted Prices")
plt.ylabel("Frequencey")
plt.show()
The predicted prices appear roughly normally distributed, with most of the prices centered around 20.
In [39]:
sns.regplot(y=bos.PRICE, x=lm.predict(X), fit_reg = True)
plt.xlabel("Predicted Prices")
plt.ylabel("Actual Prices")
plt.title("Predicted vs Actual Housing Prices")
Out[39]:
statsmodels vs scikit-learn
statsmodels syntax is more aligned with that of R. It is more powerful for statistical analysis and follows the traditional modeling workflow, where we check how well a given model fits the data and which variables affect the outcome. scikit-learn follows the machine learning tradition and is geared toward finding the best model for prediction. We can see that the regression line in the scatter plot above is closer to the actual data than in the scatter plot made from the statsmodels predicted values (though note the statsmodels fit used only RM, while the scikit-learn fit used all 13 predictors).
The partitioning of the sum-of-squares shows the variance in the predictions explained by the model and the variance that is attributed to error.
$$TSS = ESS + RSS$$The residual sum-of-squares is one of the basic ways of quantifying how much error exists in the fitted model. We will revisit this in a bit.
$$ RSS = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N \left(y_i - \left(\beta_0 + \beta_1 x_i\right)\right)^2 $$
In [40]:
print(np.sum((bos.PRICE - lm.predict(X)) ** 2))
In [41]:
print(np.sum((lm.predict(X) - np.mean(bos.PRICE)) ** 2))  # explained sum-of-squares (ESS)
The coefficient of determination, $R^2$, tells us the percentage of the variance in the response variable $Y$ that can be explained by the linear regression model.
$$ R^2 = \frac{ESS}{TSS} $$The $R^2$ value is one of the most common metrics that people use in describing the quality of a model, but it is important to note that $R^2$ increases artificially as a side-effect of increasing the number of independent variables. While $R^2$ is reported in almost all statistical packages, another metric called the adjusted $R^2$ is also provided as it takes into account the number of variables in the model, and can sometimes even be used for non-linear regression models!
$$R_{adj}^2 = 1 - \left( 1 - R^2 \right) \frac{N - 1}{N - K - 1} = R^2 - \left( 1 - R^2 \right) \frac{K}{N - K - 1} = 1 - \frac{\frac{RSS}{DF_R}}{\frac{TSS}{DF_T}}$$where $N$ is the number of observations, $K$ is the number of variables, $DF_R = N - K - 1$ is the degrees of freedom associated with the residual error, and $DF_T = N - 1$ is the degrees of freedom of the total error.
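As a quick illustration of the adjusted $R^2$ formula (using the 13-predictor lm fit from above):

# Adjusted R^2 for the full 13-predictor scikit-learn model
n_obs = len(bos)
k_pred = X.shape[1]                     # 13 predictors
r2_full = lm.score(X, bos.PRICE)
r2_adj = 1 - (1 - r2_full) * (n_obs - 1) / (n_obs - k_pred - 1)
print(r2_full, r2_adj)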
The mean squared errors are just the averages of the sum-of-squares errors over their respective degrees of freedom.
$$MSR = \frac{ESS}{K}$$$$MSE = \frac{RSS}{N-K-1}$$Remember: Notation may vary across resources, particularly the use of R and E in RSS/ESS and MSR/MSE. In some resources, E = explained and R = residual. In other resources, E = error and R = regression (explained). This is a very important distinction that requires looking at the formula to determine which naming scheme is being used. Here, MSR is the mean square for the regression (explained) and MSE is the mean squared error of the residuals.
Given the MSR and MSE, we can now determine whether or not the entire model we just fit is even statistically significant. We use an $F$-test for this. The null hypothesis is that all of the $\beta$ coefficients are zero, that is, none of them have any effect on $Y$. The alternative is that at least one $\beta$ coefficient is nonzero, but it doesn't tell us which one in a multiple regression:
$$H_0: \beta_i = 0, \mbox{for all $i$} \\ H_A: \beta_i \neq 0, \mbox{for some $i$}$$ $$F = \frac{MSR}{MSE} = \left( \frac{R^2}{1 - R^2} \right) \left( \frac{N - K - 1}{K} \right)$$Once we compute the $F$-statistic, we can use the $F$-distribution with $K$ and $N-K-1$ degrees of freedom to get a p-value.
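For example, here is a sketch of turning the overall $F$-statistic into a p-value with scipy, again using the 13-predictor lm fit from above:

# Overall F-test for the full model, with p-value from the F distribution
n_obs, k_pred = len(bos), X.shape[1]
r2_full = lm.score(X, bos.PRICE)
F_stat = (r2_full / (1 - r2_full)) * ((n_obs - k_pred - 1) / k_pred)
p_val = stats.f.sf(F_stat, k_pred, n_obs - k_pred - 1)   # P(F > F_stat)
print(F_stat, p_val)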
Warning! The $F$-statistic mentioned in this section is NOT the same as the F1-measure or F1-value discussed in Unit 7.
Let's look at the relationship between `PTRATIO` and housing price.
Exercise: Try fitting a linear regression model using only the 'PTRATIO' (pupil-teacher ratio by town) and interpret the intercept and the coefficients.
Exercise: Calculate (or extract) the $R^2$ value. What does it tell you?
Exercise: Compute the $F$-statistic. What does it tell you?
Exercise: Take a close look at the $F$-statistic and the $t$-statistic for the regression coefficient. What relationship do you notice? Note that this relationship only applies in *simple* linear regression models.
In [42]:
# your turn
In [43]:
lr = LinearRegression()
lr.fit(bos.PTRATIO.values.reshape(-1,1), bos.PRICE)
print("Coefficient:", lr.coef_[0])
print("Intercept:", lr.intercept_)
In [44]:
r_squared = lr.score(X[["PTRATIO"]], bos.PRICE)
print("R^2 =", r_squared)
The $R^2$ value is low, which tells us that PTRATIO on its own is not a strong predictor of price.
In [45]:
N = bos.PTRATIO.count()
K = 1
F = (r_squared / (1 - r_squared)) * ((N - K - 1) / K)
print("F-statistic =", F)
The F-statistic is large, which tells us that we can reject the null hypothesis: the regression coefficient for PTRATIO is statistically significant.
Fit a linear regression model using three independent variables
Exercise: Compute or extract the $F$-statistic. What does it tell you about the model?
Exercise: Compute or extract the $R^2$ statistic. What does it tell you about the model?
Exercise: Which variables in the model are significant in predicting house price? Write a story that interprets the coefficients.
In [46]:
# your turn
In [47]:
lr2 = LinearRegression()
lr2.fit(X[['CRIM', "RM", "PTRATIO"]], bos.PRICE)
Out[47]:
In [48]:
N2 = bos.PRICE.count()
K2 = 3
r_squared2 = lr2.score(X[['CRIM', 'RM', 'PTRATIO']], bos.PRICE)
F2 = (r_squared2 / (1 - r_squared2)) * ((N2 - K2 - 1) / K2)
print("F-statistic =", F2)
The F-statistic is very high, which means the regression coefficients are jointly statistically significant (at least one of them is nonzero).
In [49]:
print("R^2 =", r_squared2)
The $R^2$ value is 0.59, which tells us that CRIM, RM, and PTRATIO together can be reasonable predictors of price. They are better predictors than using PTRATIO alone.
During modeling, there will be times when we want to compare models to see which one is more predictive or fits the data better. There are many ways to compare models, but we will focus on two.
The $F$-statistic can also be used to compare two nested models, that is, two models trained on the same dataset where one of the models contains a subset of the variables of the other model. The full model contains $K$ variables and the reduced model contains a subset of these $K$ variables. This allows us to add additional variables to a base model and then test if adding the variables helped the model fit.
$$F = \frac{\left( \frac{RSS_{reduced} - RSS_{full}}{DF_{reduced} - DF_{full}} \right)}{\left( \frac{RSS_{full}}{DF_{full}} \right)}$$where $DF_x = N - K_x - 1$ where $K_x$ is the number of variables in model $x$.
Another statistic for comparing two models is AIC, which is based on the likelihood function and takes into account the number of variables in the model.
$$AIC = 2 K - 2 \log_e{L}$$where $L$ is the likelihood of the model. AIC is meaningless in the absolute sense, and is only meaningful when compared to AIC values from other models. Lower values of AIC indicate better fitting models.
statsmodels provides the AIC in its output.
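As a small check of the formula against statsmodels (using the PRICE ~ RM fit m from earlier; note that statsmodels counts the intercept as one of the $K$ parameters here):

# AIC computed by hand from the model's log-likelihood
k_params = m.df_model + 1        # slope plus the intercept
aic_manual = 2 * k_params - 2 * m.llf
print(aic_manual, m.aic)         # should agree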
Exercise: Find another variable (or two) to add to the model we built in Part 3. Compute the $F$-test comparing the two models as well as the AIC. Which model is better?
In [50]:
RSS_reduced = np.sum((bos.PRICE - lr2.predict(X[['CRIM', 'RM', 'PTRATIO']])) ** 2)
lr3 = LinearRegression()
lr3.fit(X[['CRIM', "RM", "PTRATIO", "LSTAT", "NOX"]], bos.PRICE)
RSS_full = np.sum((bos.PRICE - lr3.predict(X[['CRIM', 'RM', 'PTRATIO', "LSTAT", "NOX"]])) ** 2)
DF_reduced = bos.PRICE.count() - 3 - 1
DF_full = bos.PRICE.count() - 5 - 1
F = ((RSS_reduced - RSS_full)/(DF_reduced - DF_full)) / (RSS_full/DF_full)
print("F =", F)
In [51]:
# add_constant adds an explicit intercept column (hasconst=True alone does not add one)
m1 = sm.OLS(bos.PRICE, sm.add_constant(X[['CRIM', 'RM', 'PTRATIO']])).fit()
m2 = sm.OLS(bos.PRICE, sm.add_constant(X[['CRIM', 'RM', 'PTRATIO', "LSTAT", "NOX"]])).fit()
print("AIC with 'CRIM', 'RM', 'PTRATIO' =", m1.aic)
print("AIC with 'CRIM', 'RM', 'PTRATIO', 'LSTAT', 'NOX' =", m2.aic)
The model with more variables has a lower AIC, which indicates a better fit.
Linear regression makes several assumptions. It is always best to check that these assumptions are valid after fitting a linear regression model.
There are some other issues that are important to investigate with linear regression models.
Take the reduced model from Part 3 to answer the following exercises. Take a look at [this blog post](http://mpastell.com/2013/04/19/python_regression/) for more information on using statsmodels to construct these plots.
Exercise: Construct a fitted values versus residuals plot. What does the plot tell you? Are there any violations of the model assumptions?
Exercise: Construct a quantile plot of the residuals. What does the plot tell you?
Exercise: What are some advantages and disadvantages of the fitted vs. residual and quantile plot compared to each other?
Exercise: Identify any outliers (if any) in your model and write a story describing what these outliers might represent.
Exercise: Construct a leverage plot and identify high leverage points in the model. Write a story explaining possible reasons for the high leverage points.
Exercise: Remove the outliers and high leverage points from your model and run the regression again. How do the results change?
In [52]:
# Your turn.
In [53]:
# Construct a fitted values versus residuals plot. What does the plot tell you? Are there any violations of the model assumptions?
predicted_values = lr2.predict(X[['CRIM', 'RM', 'PTRATIO']])
residuals = bos.PRICE - predicted_values
plt.scatter(predicted_values, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
Eyeballing the plot, there does not seem to be any obvious pattern in the residuals.
In [54]:
# Construct a quantile plot of the residuals. What does the plot tell you?
import pylab
stats.probplot(residuals, dist="norm", plot=pylab)
pylab.show()
The Q-Q plot shows that the residuals are positively skewed, so they are not perfectly normal.
What are some advantages and disadvantages of the fitted vs. residual and quantile plot compared to each other?
The fitted values versus residuals plot shows the relationship between the predicted values and the residuals. It can tell you whether a linear model is appropriate, and it also reveals outliers. The quantile plot tells you how the residuals are distributed; it does not really identify outliers, but it is easier to eyeball for normality.
Identify any outliers (if any) in your model and write a story describing what these outliers might represent.
There are outliers in the model which are most likely due to differences in prices depending on location.
In [55]:
# Construct a leverage plot and identify high leverage points in the model. Write a story explaining possible reasons for the high leverage points.
from statsmodels.graphics.regressionplots import *
fitted = ols('PRICE ~ CRIM + RM + PTRATIO',bos).fit()
plot_leverage_resid2(fitted)
influence_plot(fitted)
Out[55]:
High leverage points are due to outliers in the independent variable data set.
In [56]:
# Remove the outliers and high leverage points from your model and run the regression again. How do the results change?
def reject_outliers(data, m = 2.):
    # keep only values within m median-absolute-deviations of the median
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s < m]
no_outliers_CRIM = reject_outliers(X['CRIM'].values)
no_outliers_RM = reject_outliers(X['RM'].values)
no_outliers_PTRATIO = reject_outliers(X['PTRATIO'].values)
df = bos[(bos['CRIM'].isin(no_outliers_CRIM)) & (bos['RM'].isin(no_outliers_RM)) & (bos['PTRATIO'].isin(no_outliers_PTRATIO))]
lr4 = LinearRegression()
lr4.fit(df[['CRIM', "RM", "PTRATIO"]], df.PRICE)
# score the refit model on the filtered data it was trained on
r_2 = lr4.score(df[['CRIM', "RM", "PTRATIO"]], df.PRICE)
print("R^2 =", r_2)