This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard's CS109 course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.
Linear regression is used to model and predict continuous outcomes, assuming normally distributed random errors. There are a great many types of regression models, and each is typically defined by the distribution of its prediction errors (called "residuals") and the type of data being modeled. Logistic regression is used to model binary outcomes, whereas Poisson regression is used to predict counts. In this exercise, we'll see some examples of linear regression as well as train-test splits.
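Since train-test splits are mentioned here but not demonstrated later in this walkthrough, here is a minimal, self-contained sketch of the idea using scikit-learn's train_test_split (the arrays X and y below are placeholders, not objects defined in this notebook):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Placeholder data standing in for a feature matrix and a continuous outcome
X = np.random.rand(100, 3)
y = X.dot(np.array([1.0, -2.0, 0.5])) + np.random.rand(100)
# Hold out 20% of the rows for testing and fit on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out rows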
The packages we'll cover are: statsmodels, seaborn, and scikit-learn. While we don't explicitly teach statsmodels and seaborn in the Springboard workshop, they are great libraries to know.
In [1]:
# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.
sklearn
This data set is available in the sklearn Python module, which is how we will access it today.
In [2]:
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
In [3]:
boston.keys()
Out[3]:
In [4]:
boston.data.shape
Out[4]:
In [5]:
# Print column names
print(boston.feature_names)
In [6]:
# Print description of Boston housing data set
print(boston.DESCR)
Now let's explore the data set itself.
In [7]:
bos = pd.DataFrame(boston.data)
bos.head()
Out[7]:
There are no column names in the DataFrame. Let's add those.
In [8]:
bos.columns = boston.feature_names
bos.head()
Out[8]:
Now we have a pandas DataFrame called bos containing all the data we want to use to predict Boston housing prices. Let's create a column called PRICE which will contain the prices. This information is contained in the target data.
In [9]:
print(boston.target.shape)
In [10]:
bos['PRICE'] = boston.target
bos.head()
Out[10]:
In [11]:
bos.describe()
Out[11]:
In [12]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")
Out[12]:
In [13]:
# Answers to Part 2 Exercise Set 1
# Question 1) What kind of relationship do you see? e.g. positive, negative?
# linear? non-linear? Is there anything else strange or interesting about
# the data? What about outliers?
# I see a weak negative linear relationship. Yes, the data looks interesting in
# that its distribution appears to be positively skewed and has a few outliers.
In [14]:
# Part 2 Exercise Set 1
# Question 2: Create scatter plots between *RM* and *PRICE*, and PTRATIO and PRICE.
# Label your axes appropriately using human readable labels.
# Tell a story about what you see.
# Create scatter plots between *RM* and *PRICE*
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average Number of Rooms Per Dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between RM and Price")
Out[14]:
In [15]:
# Part 2 Exercise Set 1:
# Create scatter plot between *PTRATIO* and *PRICE*
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-Teacher Ratio by Town (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRATIO and Price")
Out[15]:
In [16]:
# Question 2 continuation: a positive linear relationship appears
# to exist in the graph between average number of rooms
# per dwelling and housing price.
In [17]:
# your turn: create some other scatter plots
# scatter plot between *NOX* and *PRICE*
plt.scatter(bos.NOX, bos.PRICE)
plt.xlabel("Nitric Oxides Concentration (parts per 10 million) (NOX)")
plt.ylabel("Housing Price")
plt.title("Relationship between NOX and Price")
Out[17]:
In [18]:
# Exercise 1: What are some other numeric variables of interest? Why do you think
# they are interesting? Plot scatterplots with these variables and
# PRICE (house price) and tell a story about what you see.
# In my opinion, other variables of interest would be nitric oxides
# concentration (NOX), since it describes pollution in the area,
# and the B column (a transformation of the proportion of the Black
# population by town), which may be related to neighborhood housing prices.
# your turn: create some other scatter plots
# Create a scatter plot between *B* and *PRICE*
plt.scatter(bos.B, bos.PRICE)
plt.xlabel("1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (B)")
plt.ylabel("Housing Price")
plt.title("Relationship between B and Price")
Out[18]:
In [19]:
# your turn: create some other scatter plots
# Create a scatter plot between *DIS* and *LSTAT*
plt.scatter(bos.DIS, bos.LSTAT)
plt.xlabel("weighted distances to five Boston employment centres (DIS)")
plt.ylabel("% lower status of the population")
plt.title("Relationship between DIS and LSTAT")
Out[19]:
In [20]:
import seaborn as sns
sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)
Out[20]:
In [21]:
plt.hist(np.log(bos.CRIM))
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()
In [22]:
# Part 2 Exercise 1: In the above histogram, we took the logarithm of the crime rate per
# capita. Repeat this histogram without taking the log.
plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()
# Exercise 2 Question 1 continuation: What was the purpose of taking the log? What do we gain
# by making this transformation? What do you now notice about this variable that is not
# obvious without making the transformation?
# We usually take logarithms of variables that are multiplicatively related,
# or in other words grow exponentially. Taking the logarithm before
# plotting removes that exponential character, so any underlying pattern
# can be seen on a linear scale. In short, logging straightens out the
# trend so that a linear model can be fitted.
# Without the log transformation, the raw crime rate is so heavily
# right-skewed that almost all of its detail is compressed into the first bin.
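One way to quantify what the log transform does (a quick check, not part of the original answer) is to compare the skewness of CRIM before and after taking the log:
from scipy.stats import skew
print('Skewness of CRIM:      {}'.format(skew(bos.CRIM)))
print('Skewness of log(CRIM): {}'.format(skew(np.log(bos.CRIM))))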
In [23]:
# Part 2 Exercise 2:
# Plot the histogram for RM and PTRATIO against each other, along
# with the two variables you picked in the previous section. We
# are looking for correlations in predictors here.
import seaborn as sns
sns.set(color_codes=True)
sns.jointplot(bos.RM, bos.PTRATIO)
Out[23]:
In [24]:
# Part 2 Exercise 2 Continuation:
# Plot the histogram for the two variables you picked in
# the previous section.
import seaborn as sns
sns.set(color_codes=True)
sns.jointplot(bos.NOX, bos.PRICE)
Out[24]:
Here,
$Y$ = Boston housing prices (called "target" data in python, and referred to as the dependent variable or response variable)
and
$X$ = all the other features (or independent variables, predictors or explanatory variables)
which we will use to fit a linear regression model and predict Boston housing prices. We will use the least-squares method to estimate the coefficients.
We'll use two ways of fitting a linear regression. We recommend the first, but the second is also powerful in its features.
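For reference (not spelled out in the original lab), the least-squares estimate is the coefficient vector that minimizes the sum of squared residuals; with a column of ones in $X$ for the intercept and $X^T X$ invertible, it has the closed form
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^T \beta\right)^2 = (X^T X)^{-1} X^T Y.$$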
In [25]:
# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
In [26]:
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
print(m.summary())
Let's see how our model actually fits our data. We can see below that there is a ceiling effect, which we should probably look into. Also, for large values of $Y$ we get underpredictions; most predictions fall below the 45-degree line.
In [27]:
# Part 3 Exercise 1: Create a scatterplot between the predicted prices,
# available in m.fittedvalues (where m is the fitted model)
# and the original prices.
# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
# Create the scatter plot between predicted values and *PRICE*
plt.scatter(m.predict(), bos.PRICE)
plt.xlabel("Predicted Housing Price Based on Linear Regression")
plt.ylabel("Housing Price")
plt.title("Relationship between Predicted Price and Original Price")
Out[27]:
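To make the ceiling effect and the underprediction for large values of $Y$ easier to see, one can add a 45-degree reference line to the scatterplot above (a small sketch, not part of the original exercise; it reuses the fitted model m and the DataFrame bos from earlier cells):
plt.scatter(m.fittedvalues, bos.PRICE)
lims = [bos.PRICE.min(), bos.PRICE.max()]
plt.plot(lims, lims, linestyle="--", color="red", label="45-degree line (perfect prediction)")
plt.xlabel("Predicted Housing Price Based on Linear Regression")
plt.ylabel("Housing Price")
plt.title("Predicted vs. Original Price with 45-degree Reference Line")
plt.legend()
plt.show()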
In [28]:
from sklearn.linear_model import LinearRegression
X = bos.drop('PRICE', axis = 1)
# This creates a LinearRegression object
lm = LinearRegression()
lm
Out[28]:
In [29]:
# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)
Out[29]:
Exercise: How would you change the model to not fit an intercept term? Would you recommend not having an intercept? Why or why not? For more information on why to include or exclude an intercept, look [here](https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf).
Exercise: One of the assumptions of the linear model is that the residuals must be i.i.d. (independently and identically distributed). To satisfy this, is it enough that the residuals are normally distributed? Explain your answer.
Exercise: True or false. To use linear regression, $Y$ must be normally distributed. Explain your answer.
In [30]:
# Part 3 Exercise 2 Question:
# How would you change the model to not fit an intercept term?
# Would you recommend not having an intercept? Why or why not?
# To change the model to not fit an intercept term then
# we need to fit a linear regression through the origin (RTO).
# Using sklearn's LinearRegression function, I will have to set
# the fit_intercept parameter to False.
# As far as recommending whether to have an intercept or not,
# this would depend on the data set. Hocking (1996) and Adelman
# et al. (1994) found that a careful look at the data range
# and data size is needed. For example, if the
# data is far from the origin, then fitting through the origin
# might introduce a discontinuity relative to an otherwise linear
# function with a positive or negative intercept. If uncertain,
# one might run a couple of diagnostics. Hahn (1977)
# suggested running a fit with and without an intercept and then
# comparing the standard errors to decide whether OLS or RTO
# provides the better fit.
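As a concrete sketch of the point above (illustrative, not part of the original answer; it reuses ols, LinearRegression, bos, and X from earlier cells), an intercept-free fit looks like this in both libraries:
# statsmodels: the "- 1" in the formula removes the intercept
m_no_intercept = ols('PRICE ~ RM - 1', bos).fit()
print(m_no_intercept.params)
# scikit-learn: construct the estimator with fit_intercept=False
lm_no_intercept = LinearRegression(fit_intercept=False)
lm_no_intercept.fit(X[['RM']], bos.PRICE)
print(lm_no_intercept.coef_)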
In [31]:
print('Estimated intercept coefficient: {}'.format(lm.intercept_))
In [32]:
print('Number of coefficients: {}'.format(len(lm.coef_)))
In [33]:
# The coefficients
pd.DataFrame({'features': X.columns, 'estimatedCoefficients': lm.coef_})[['features', 'estimatedCoefficients']]
Out[33]:
In [34]:
# first five predicted prices
lm.predict(X)[0:5]
Out[34]:
In [35]:
# Part 3 Exercise Set III:
# Question 1: Histogram: Plot a histogram of all the predicted prices. Write a story
# about what you see. Describe the shape, center and spread of the distribution.
# Are there any outliers? What might be the reason for them? Should we do
# anything special with them?
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.hist(lm.predict(X))
plt.title("Linear Regression")
plt.xlabel("Predicted Prices")
plt.ylabel("Frequency")
plt.show()
# The graph appears to be roughly symmetric and bell-shaped, suggesting a
# normal distribution. The center seems to be around 20 on the x-axis,
# and the spread of the distribution runs from about -5 to 45. Yes, there
# are outliers in the form of negative predicted prices, which are
# impossible for actual house prices.
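A quick follow-up check (not in the original answer) of how many of the fitted prices are actually negative:
negative_predictions = lm.predict(X) < 0
print('Number of negative predicted prices: {}'.format(negative_predictions.sum()))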
In [36]:
# Part 3 Exercise Set III
# Question 2: Scatterplot: Let's plot the true prices compared to
# the predicted prices to see how they disagree
# (we did this with statsmodels before).
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Create the scatter plot between predicted values and *PRICE*
plt.scatter(lm.predict(X), bos.PRICE)
plt.xlabel("Predicted Housing Price Based on Linear Regression")
plt.ylabel("Housing Price")
plt.title("Relationship between Predicted Price and Original Price")
# Question 3: We have looked at fitting a linear model in both
# statsmodels and scikit-learn. What are the advantages
# and disadvantages of each based on your exploration?
# Based on the information provided by both packages,
# what advantage does statsmodels provide?
Out[36]:
In [37]:
# Residual sum of squares (RSS)
print(np.sum((bos.PRICE - lm.predict(X)) ** 2))
In [38]:
# Explained sum of squares (ESS); note that the square belongs inside the sum
print(np.sum((lm.predict(X) - np.mean(bos.PRICE)) ** 2))
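The previous two cells compute the residual sum of squares (RSS) and the explained sum of squares (ESS). As a quick consistency check (not part of the original lab), the $R^2$ reported by scikit-learn can be recovered from RSS and the total sum of squares (TSS):
rss = np.sum((bos.PRICE - lm.predict(X)) ** 2)
tss = np.sum((bos.PRICE - np.mean(bos.PRICE)) ** 2)
print('R^2 = 1 - RSS/TSS = {}'.format(1 - rss / tss))
print('lm.score(X, bos.PRICE) = {}'.format(lm.score(X, bos.PRICE)))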
In [39]:
# Part 3 Exercise Set IV:
# Question 1: Fit a linear regression model using only the Pupil-teacher
# ratio by town (PTRATIO) column and interpret the coefficients.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
import pandas as pd
lm = LinearRegression()
lm.fit(X[['PTRATIO']], bos.PRICE)
print('Estimated intercept coefficient: {}'.format(lm.intercept_))
print('Estimated PTRATIO coefficient: {}'.format(lm.coef_[0]))
print('Number of coefficients: {}'.format(len(lm.coef_)))
In [40]:
# Exercise 2: Calculate (or extract) the R2 value. What does it tell you?
lm.score(X[['PTRATIO']], bos.PRICE)
Out[40]:
In [41]:
# Exercise 3: Compute the F-statistic. What does it tell you?
m = ols('PRICE ~ PTRATIO',bos).fit()
print(m.summary())
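If you only need the numbers rather than the full summary table, the fitted statsmodels results object exposes them directly (a small sketch using the model m fitted above):
print('R-squared:          {}'.format(m.rsquared))
print('F-statistic:        {}'.format(m.fvalue))
print('Prob (F-statistic): {}'.format(m.f_pvalue))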
Fit a linear regression model using three independent variables
Exercise: Compute or extract the $F$-statistic. What does it tell you about the model?
Exercise: Compute or extract the $R^2$ statistic. What does it tell you about the model?
Exercise: Which variables in the model are significant in predicting house price? Write a story that interprets the coefficients.
In [42]:
# Part 3 Exercise Set V
# Fit a linear regression model using three independent variables
# 1) 'CRIM' (per capita crime rate by town)
# 2) 'RM' (average number of rooms per dwelling)
# 3) 'PTRATIO' (pupil-teacher ratio by town)
lm = LinearRegression()
lm.fit(X[['CRIM','RM','PTRATIO']], bos.PRICE)
Out[42]:
In [43]:
# Calculate (or extract) the R2 value.
lm.score(X[['CRIM', 'RM', 'PTRATIO']], bos.PRICE)
Out[43]:
In [44]:
# Compute the F-statistic.
m = ols('PRICE ~ CRIM + RM + PTRATIO',bos).fit()
print(m.summary())
During modeling, there will be times when we want to compare models to see which one is more predictive or fits the data better. There are many ways to compare models, but we will focus on two.
Exercise: Find another variable (or two) to add to the model we built in Part 3. Compute the $F$-test comparing the two models as well as the AIC. Which model is better?
In [45]:
# Part 4
# Find another variable (or two) to add to the model we built in Part 3.
# Compute the F-test comparing the two models as well as the AIC. Which model is better?
m = ols('PRICE ~ CRIM + RM + PTRATIO + NOX + TAX',bos).fit()
print(m.summary())
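The summary above only shows the expanded model on its own. To compare it directly with the Part 3 model, one can run a nested-model F-test and compare AIC values (a sketch; the reduced model is refit here, and a lower AIC indicates the preferred model):
from statsmodels.stats.anova import anova_lm
m_reduced = ols('PRICE ~ CRIM + RM + PTRATIO', bos).fit()
m_full = ols('PRICE ~ CRIM + RM + PTRATIO + NOX + TAX', bos).fit()
# F-test for whether NOX and TAX jointly improve the fit
print(anova_lm(m_reduced, m_full))
print('AIC (reduced): {}'.format(m_reduced.aic))
print('AIC (full):    {}'.format(m_full.aic))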
In [46]:
# Part 5 Exercise 1:
# Create a scatter plot of fitted values versus residuals
plt.scatter(m.fittedvalues, m.resid)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
Out[46]:
In [47]:
# Part 5 Exercise 2:
# Construct a quantile plot of the residuals.
from scipy import stats
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
res = stats.probplot(m.resid, plot=ax)
ax.set_title("Normal Q-Q plot of the model residuals")
plt.show()
In [48]:
# Part 5 Exercise 3:
# What are some advantages and disadvantages of the fitted vs.
# residual and quantile plot compared to each other?
# Answer: The fitted vs. residual plot is the most frequently
# used plot in residual analysis. Its advantages are that it
# helps detect non-linearity, unequal error variances,
# and outliers.