This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard's CS109 course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.
Linear regression is used to model and predict continuous outcomes, while logistic regression is used to model binary outcomes. We'll see some examples of linear regression as well as train-test splits.
The packages we'll cover are: statsmodels, seaborn, and scikit-learn. While we don't explicitly teach statsmodels and seaborn in the Springboard workshop, they are great libraries to know.
In [1]:
# special IPython command to display matplotlib plots inline in the notebook
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
Linear regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, or predictors) and a dependent variable $Y$. Given a dataset $X$ and $Y$, linear regression can be used both to predict new values of $Y$ and to model the strength of the relationship between the predictors and $Y$. The method assumes that each predictor $X$ is linearly related to the dependent variable $Y$:
$$ Y = \beta_0 + \beta_1 X + \epsilon $$
where $\epsilon$ is an unobservable random variable that adds noise to the linear relationship. This is the simplest form of linear regression (one explanatory variable); we'll call it the simple model.
$\beta_0$ is the intercept of the linear model, and $\beta_1$ is the slope (the coefficient on $X$)
Multiple linear regression is when you have more than one independent variable
Least squares is a method for estimating the coefficients of a linear model by minimizing the sum of the squared residuals:
$$ S = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 $$
where $N$ is the number of observations and $r_i$ is the residual for observation $i$.
The solution can be written in compact matrix notation as
$$\hat\beta = (X^T X)^{-1}X^T Y$$
We wanted to show you this in case you remember your linear algebra: for this solution to exist, $X^T X$ must be invertible, which requires $X$ to have full column rank. This is important for us because it means that having redundant features in our regression models will lead to poorly fitting (and unstable) models. We'll see an implementation of this in the extra linear regression example.
Note: The "hat" means it is an estimate of the coefficient.
The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.
sklearn
This data set is available in the sklearn Python module, which is how we will access it today.
In [2]:
from sklearn.datasets import load_boston
boston = load_boston()
In [3]:
boston.keys()
Out[3]:
In [4]:
boston.data.shape
Out[4]:
In [5]:
# Print column names
print(boston.feature_names)
In [6]:
# Print description of Boston housing data set
print(boston.DESCR)
Now let's explore the data set itself.
In [7]:
bos = pd.DataFrame(boston.data)
bos.head()
Out[7]:
There are no column names in the DataFrame. Let's add those.
In [8]:
bos.columns = boston.feature_names
bos.head()
Out[8]:
Now we have a pandas DataFrame called bos
containing all the data we want to use to predict Boston Housing prices. Let's create a variable called PRICE
which will contain the prices. This information is contained in the target
data.
In [9]:
print(boston.target.shape)
In [10]:
bos['PRICE'] = boston.target
bos.head()
Out[10]:
In [11]:
bos.describe()
Out[11]:
In [12]:
plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")
Out[12]:
The housing price seems to be negatively correlated with the crime rate. It looks like there is roughly a negative exponential relationship, where initial increases in crime rate greatly decrease housing price, but later increases have a less notable effect.
Your turn: Create scatter plots between RM and PRICE, and PTRATIO and PRICE. What do you notice?
In [13]:
#your turn: scatter plot between *RM* and *PRICE*
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average number of rooms per dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between Rooms and Price")
Out[13]:
There are some outliers, but there is a very clear positive linear relationship between the average number of rooms and the housing price.
In [14]:
#your turn: scatter plot between *PTRATIO* and *PRICE*
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("pupil-teacher ratio by town (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between Pupil-Teacher ratio and Price")
Out[14]:
Here the relationship does not appear as clear-cut. While there may be a weak negative correlation between pupil-teacher ratio and housing price, the variance in the data is quite large, so we are not as confident in the importance of this column as a predictor of housing price.
Your turn: What are some other numeric variables of interest? Plot scatter plots with these variables and PRICE.
In [24]:
#your turn: create some other scatter plots
plt.scatter(bos.NOX, bos.PRICE)
plt.xlabel("Nitric Oxides concentration (parts per 10 million)")
plt.ylabel("Housing Price")
plt.title("Relationship between NOX concentration and Price")
Out[24]:
Interestingly, air quality seems to affect housing price. There appears to be a negative correlation between NOX concentration and housing price. The trend looks linear for the bulk of the data, but it may also be non-linear; a comparison of two or more types of fits would be useful here.
In [18]:
#your turn: create some other scatter plots
plt.scatter(bos.DIS, bos.PRICE)
plt.xlabel("weighted distances to five Boston employment centre (DIS)")
plt.ylabel("Housing Price")
plt.title("Relationship between Distance to Employment Centres and Price")
Out[18]:
Not surprisingly, the distance to employment centres strongly affects house prices for "inner-city" towns, which have the lowest housing prices. The relationship seems to be a positive, non-linear one: once we move a bit further out, house prices are equally likely to be high or low, and past about 6 (weighted) distance units we no longer see as many high-priced outliers. This matches the anecdotal experience of many people.
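Rather than judging each of these relationships purely by eye, one quick check (not part of the original lab, just a suggestion) is the Pearson correlation of every column with PRICE:
In [ ]:
# Correlation of each feature with PRICE, sorted from most negative to most positive
print(bos.corr()['PRICE'].sort_values())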
Seaborn is a cool Python plotting library built on top of matplotlib. It provides convenient syntax and shortcuts for many common types of plots, along with better-looking defaults.
We can also use seaborn regplot for the scatterplot above. This provides automatic linear regression fits (useful for data exploration later on). Here's one example below.
In [19]:
sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)
Out[19]:
In [20]:
plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()
Your turn: Plot histograms of some of the other variables (for example RM and PTRATIO).
In [22]:
#your turn
rm = sns.distplot(bos.RM, kde=False)
rm.axes.set_title('RM')
rm.set_xlabel('Average number of rooms')
rm.set_ylabel('Frequency')
Out[22]:
In [23]:
#your turn
rm = sns.distplot(bos.PTRATIO, kde=False)
rm.axes.set_title('PTRATIO')
rm.set_xlabel('Pupil-Teacher Ratio by Town')
rm.set_ylabel('Frequency')
Out[23]:
In [25]:
#your turn
rm = sns.distplot(bos.NOX, kde=False)
rm.axes.set_title('NOX')
rm.set_xlabel('Nitric Oxides concentration (parts per 10 million)')
rm.set_ylabel('Frequency')
Out[25]:
In [27]:
#your turn
rm = sns.distplot(bos.DIS, kde=False)
rm.axes.set_title('DIS')
rm.set_xlabel('Weighted distances to five Boston employment centres (DIS)')
rm.set_ylabel('Frequency')
Out[27]:
Here, $Y$ = the Boston housing prices (also called the "target" data in Python) and $X$ = all the other features (or independent variables), which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method to estimate the coefficients.
We'll use two ways of fitting a linear regression. We recommend the first but the second is also powerful in its features.
statsmodels
Statsmodels is a great Python library for a lot of basic and inferential statistics. It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians. While we don't cover statsmodels officially in the Data Science Intensive, it's a good library to have in your toolbox. Here's a quick example of what you could do with it.
In [28]:
# Import regression modules
# ols - stands for Ordinary least squares, we'll use this
import statsmodels.api as sm
from statsmodels.formula.api import ols
In [30]:
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
print(m.summary())
There is a ton of information in this output, but we'll concentrate on the coefficient table (the middle table). First notice that the p-value for the RM coefficient (9.1021), shown under P>|t|, is essentially zero. We can interpret the coefficient as follows: if we compare two groups of towns that are identical except that one has an average of $5$ rooms per dwelling and the other an average of $6$, the average difference in house prices between the two groups is about $9.1$ (in thousands of dollars), so roughly a $\$9,100$ difference. The confidence interval gives us a range of plausible values for this difference, about ($\$8,279$, $\$9,925$), definitely not chump change.
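If you'd rather pull these numbers out programmatically than read them off the summary table, the fitted results object exposes them directly (a quick sketch; the values should match the table above):
In [ ]:
# Coefficient, p-value, and 95% confidence interval for RM from the fitted model
print(m.params['RM'])           # estimated coefficient
print(m.pvalues['RM'])          # p-value shown under P>|t|
print(m.conf_int().loc['RM'])   # 95% confidence interval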
statsmodels formulas
This formula notation will seem familiar to R users, but will take some getting used to for people coming from other languages or who are new to statistics.
The formula describes the general structure of a regression call. For statsmodels (ols or logit) calls, you need a pandas DataFrame whose column names match the names in your formula. In the example below you need a DataFrame that includes columns named Outcome, X1, X2, ..., but you don't need to build a new DataFrame for every regression; use the same DataFrame with all of these columns in it. The structure is very simple:
Outcome ~ X1
But of course we want to be able to handle more complex models; for example, multiple regression is done like this:
Outcome ~ X1 + X2 + X3
This is the very basic structure, but it should be enough to get you through the homework. Things can get much more complex; for a quick run-down of further uses see the statsmodels help page.
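As a concrete (illustrative) example with the columns already in bos, a multiple regression of price on three predictors would look like this; we'll return to the single-variable model m below:
In [ ]:
# Multiple regression with the formula interface (illustration only)
m_multi = ols('PRICE ~ RM + PTRATIO + CRIM', bos).fit()
print(m_multi.params)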
Let's see how well our model actually fits the data. We can see below that there is a ceiling effect, which we should probably look into. Also, for large values of $Y$ we get underpredictions; most predictions fall below the 45-degree line.
Your turn: Create a scatterplot between the predicted prices, available in m.fittedvalues, and the original prices. How does the plot look?
In [42]:
# your turn
scatter2 = sns.regplot(x=bos.PRICE,y=m.fittedvalues, fit_reg=False)
scatter2.set_title('Comparison of real prices and predicted prices')
scatter2.set_ylabel('Predicted Prices from a #Rooms OLS regression')
Out[42]:
In [53]:
from sklearn.linear_model import LinearRegression
X = bos.drop('PRICE', axis = 1)
# This creates a LinearRegression object
lm = LinearRegression()
lm
Out[53]:
Check out the scikit-learn documentation here. The main functions are listed below.
Main functions | Description
---|---
lm.fit() | Fit a linear model
lm.predict() | Predict Y using the linear model with estimated coefficients
lm.score() | Returns the coefficient of determination ($R^2$), a measure of how well observed outcomes are replicated by the model, as the proportion of total variation in outcomes explained by the model
In [54]:
# Look inside lm object
#lm.<tab>
Output | Description
---|---
lm.coef_ | Estimated coefficients
lm.intercept_ | Estimated intercept
In [55]:
# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)
Out[55]:
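As a quick sanity check (an extra step, not part of the original lab), lm.score() reports the $R^2$ of this fit on the training data itself:
In [ ]:
# R^2 of the 13-variable fit, evaluated on the data it was trained on
print(lm.score(X, bos.PRICE))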
Your turn: How would you change the model to not fit an intercept term? Would you recommend not having an intercept?
In [56]:
lm2 = LinearRegression(fit_intercept=False)
lm2.fit(X, bos.PRICE)
Out[56]:
In this case it doesn't make sense to force a zero intercept, because we didn't do anything to center the data. Had we subtracted the mean from our continuous variables (or performed some other centering), forcing the intercept to zero would be reasonable.
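Here is a minimal sketch (hypothetical, just to illustrate the point): if we center the predictors and the target first, a no-intercept fit recovers essentially the same slopes as the ordinary fit with an intercept.
In [ ]:
# Center each predictor and the target, then fit without an intercept.
# The slopes should closely match lm.coef_ from the fit with an intercept.
X_centered = X - X.mean()
price_centered = bos.PRICE - bos.PRICE.mean()
lm_centered = LinearRegression(fit_intercept=False)
lm_centered.fit(X_centered, price_centered)
print(lm_centered.coef_)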
Let's look at the estimated coefficients from the linear model using lm.intercept_ and lm.coef_.
After we have fit our linear regression model using the least squares method, we want to see what are the estimates of our coefficients $\beta_0$, $\beta_1$, ..., $\beta_{13}$:
$$ \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{13} $$
In [57]:
print('Estimated intercept coefficient:', lm.intercept_)
In [58]:
print('Number of coefficients:', len(lm.coef_))
In [61]:
# The coefficients
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns = ['features', 'estimatedCoefficients'])
Out[61]:
In [62]:
# first five predicted prices
lm.predict(X)[0:5]
Out[62]:
Your turn: Plot a histogram of all the predicted prices, then plot the true prices against the predicted prices (as we did with statsmodels before).
In [64]:
# your turn
predicted_prices = lm.predict(X)
prices_hist = sns.distplot(predicted_prices)
In [66]:
prices_scatter = sns.regplot(x=bos.PRICE, y=predicted_prices, fit_reg =False)
prices_scatter.set_title('Prices versus predicted prices from 13 variable regression')
prices_scatter.set_ylabel('Predicted price')
Out[66]:
Pretty good agreement. There still seems to be some underprediction once the price goes past about $\$30,000$, but there is good correlation before then. Of course, this is just by eye.
In [68]:
# Residual sum of squares (RSS) for the 13-variable model
print(np.sum((bos.PRICE - lm.predict(X)) ** 2))
In [69]:
#your turn
# Mean squared error (MSE) for the 13-variable model
print(np.mean((bos.PRICE - predicted_prices) ** 2))
In [70]:
# Fit a linear regression using only PTRATIO as the predictor
lm = LinearRegression()
lm.fit(X[['PTRATIO']], bos.PRICE)
Out[70]:
In [72]:
msePTRATIO = np.mean((bos.PRICE - lm.predict(X[['PTRATIO']])) ** 2)
print(msePTRATIO)
We can also plot the fitted linear regression line.
In [73]:
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-to-Teacher Ratio (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRATIO and Price")
plt.plot(bos.PTRATIO, lm.predict(X[['PTRATIO']]), color='blue', linewidth=3)
plt.show()
In [79]:
# your turn
# Fit a linear regression using CRIM, RM and PTRATIO as predictors
lm3 = LinearRegression()
fit3 = lm3.fit(X[['CRIM', 'RM', 'PTRATIO']], bos.PRICE)
fit3
Out[79]:
In [83]:
mse3 = np.mean((bos.PRICE-fit3.predict(X[['CRIM','RM','PTRATIO']]))**2)
print(mse3)
In [88]:
predicted_prices2 = fit3.predict(X[['CRIM','RM','PTRATIO']])
prices_scatter2 = sns.regplot(x=bos.PRICE, y=predicted_prices2, fit_reg =False)
prices_scatter2.set_title('Prices versus predicted prices from 3 variable regression')
prices_scatter2.set_ylabel('Predicted price')
Out[88]:
Let's stick with the linear regression example. One way of splitting the data is to create the training and testing data sets manually.
In [91]:
# Hold out the last 50 observations as a test set
X_train = X[:-50]
X_test = X[-50:]
Y_train = bos.PRICE[:-50]
Y_test = bos.PRICE[-50:]
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
Another way is to split the data into random train and test subsets using the function train_test_split from sklearn.model_selection (called sklearn.cross_validation in older versions of scikit-learn). Here's the documentation.
In [92]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, bos.PRICE, test_size=0.33, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
Your turn: Let's build a linear regression model using our new training data sets.
In [98]:
# your turn
lm = LinearRegression()
lm.fit(X_train, Y_train)
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
Your turn:
Calculate the mean squared error on the training data and on the test data.
Are they pretty similar or very different? What does that mean?
In [101]:
# your turn
mseTrain = np.mean((Y_train - lm.predict(X_train))**2)
mseTest = np.mean((Y_test - lm.predict(X_test))**2)
print("Calculated mean squared error using the training data: ",mseTrain)
print("Calculated mean squared error using the test data: ",mseTest)
The test error is larger than the training error. This is expected: the model does a better job of fitting the training data than it does of explaining the trends in the test data. However, the difference is not alarmingly large, so we do not expect that this is a case of overfitting.
In [102]:
plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)
plt.hlines(y = 0, xmin=0, xmax = 50)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')
Out[102]:
Your turn: Do you think this linear regression model generalizes well on the test data?
Yes, I believe it does. The residuals have a roughly random spread, showing no sign of heteroscedasticity. The test residuals cover roughly the same space as the training residuals, without a clear delineation between the two groups. Also, the MSEs calculated beforehand were of the same magnitude for both groups.
A simple extension of the train-test split is called K-fold cross-validation. Here's the procedure:
1. Split the data set into $K$ equally sized folds.
2. For each fold $k = 1, \ldots, K$: fit the model on the other $K-1$ folds and compute the prediction error on fold $k$.
3. Average the $K$ prediction errors to get an overall estimate of the prediction error.
Luckily you don't have to do this entire process by hand (for loops, etc.) every single time; scikit-learn has a very nice implementation of this, have a look at the documentation.
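For instance, here is a short sketch (assuming mean squared error is the metric we care about) using scikit-learn's cross_val_score, which runs the whole K-fold loop for us; the scorer returns negative MSE, so we flip the sign:
In [ ]:
from sklearn.model_selection import cross_val_score

# 4-fold cross-validated MSE of the full 13-variable model
neg_mse = cross_val_score(LinearRegression(), X, bos.PRICE,
                          cv=4, scoring='neg_mean_squared_error')
print(-neg_mse.mean())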
Your turn (extra credit): Implement K-Fold cross-validation using the procedure above and Boston Housing data set using $K=4$. How does the average prediction error compare to the train-test split above?
In [116]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=4)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Fit on the K-1 training folds, evaluate on the held-out fold
    lm.fit(X.iloc[train_idx], bos.PRICE.iloc[train_idx])
    predicted_prices = lm.predict(X.iloc[test_idx])
    fold_errors.append(np.mean((predicted_prices - bos.PRICE.iloc[test_idx]) ** 2))
print(np.mean(fold_errors))
In comparison to our single train-test split above, we now get a higher mean error.