Regression in Python


This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard's CS109 course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.

  • Linear Regression Models
  • Prediction using linear regression
  • Some re-sampling methods
    • Train-Test splits
    • Cross Validation

Linear regression is used to model and predict continuous outcomes, while logistic regression is used to model binary outcomes. We'll see some examples of linear regression as well as train-test splits.

The packages we'll cover are: statsmodels, seaborn, and scikit-learn. While we don't explicitly teach statsmodels and seaborn in the Springboard workshop, those are great libraries to know.




In [7]:
# special IPython command to prepare the notebook for matplotlib and other libraries
%pylab inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn

import seaborn as sns

# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")


Populating the interactive namespace from numpy and matplotlib

Part 1: Linear Regression

Purpose of linear regression


Given a dataset $X$ and $Y$, linear regression can be used to:

  • Build a predictive model to predict future values of $Y$ for new observations $X_i$ that have no $Y$ value.
  • Model the strength of the relationship between each independent variable $X_i$ and $Y$
    • Sometimes not all $X_i$ will have a relationship with $Y$
    • Need to figure out which $X_i$ contributes most information to determine $Y$
  • Linear regression is used in so many applications that we won't bother listing examples here. In many cases, it is the first-pass prediction algorithm for continuous outcomes.

A brief recap (feel free to skip if you don't care about the math)


Linear Regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, predictors) and a dependent variable $Y$. This method assumes that each predictor $X$ is linearly related to the dependent variable $Y$.

$$ Y = \beta_0 + \beta_1 X + \epsilon$$

where $\epsilon$ is an unobservable random variable that adds noise to the linear relationship. This is the simplest form of linear regression (one variable); we'll call it the simple model.

  • $\beta_0$ is the intercept of the linear model

  • $\beta_1$ is the slope, i.e. the coefficient of $X$

  • Multiple linear regression is when you have more than one independent variable

    • $X_1$, $X_2$, $X_3$, $\ldots$
$$ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$$

  • Back to the simple model. The model in linear regression is that the conditional mean of $Y$, given the values of $X$, is expressed as a linear function.
$$ y = f(x) = E(Y | X = x)$$

http://www.learner.org/courses/againstallodds/about/glossary.html

  • The goal is to estimate the coefficients (e.g. $\beta_0$ and $\beta_1$). We represent the estimates of the coefficients with a "hat" on top of the letter.
$$ \hat{\beta}_0, \hat{\beta}_1 $$
  • Once you estimate the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, you can use these to predict new values of $Y$
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$$
  • How do you estimate the coefficients?
    • There are many ways to fit a linear regression model
    • The method called least squares is one of the most common methods
    • We will discuss least squares today

Estimating $\hat\beta$: Least squares


Least squares is a method that estimates the coefficients of a linear model by minimizing the sum of squared residuals:

$$ S = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 $$

where $N$ is the number of observations.

  • We will not go into the mathematical details, but the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\beta_0 + \beta_1 x_i)$ in the model (i.e. makes the difference between the observed $y_i$ and linear model $\beta_0 + \beta_1 x_i$ as small as possible).

The solution can be written in compact matrix notation as

$$\hat\beta = (X^T X)^{-1}X^T Y$$

We wanted to show you this in case you remember your linear algebra: in order for this solution to exist, $X^T X$ must be invertible, which requires $X$ to have full column rank (among a few other assumptions). This is important for us because it means that having redundant features in our regression models will lead to poorly fitting (and unstable) models. We'll see an implementation of this in the extra linear regression example.

Note: The "hat" means it is an estimate of the coefficient.


Part 2: Boston Housing Data Set

The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.

Load the Boston Housing data set from sklearn


This data set is available in the sklearn python module which is how we will access it today.


In [1]:
from sklearn.datasets import load_boston
boston = load_boston()

In [2]:
boston.keys()


Out[2]:
['data', 'feature_names', 'DESCR', 'target']

In [3]:
boston.data.shape


Out[3]:
(506, 13)

In [4]:
# Print column names
print boston.feature_names


['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

In [5]:
# Print description of Boston housing data set
print boston.DESCR


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Now let's explore the data set itself.


In [8]:
bos = pd.DataFrame(boston.data)
bos.head()


Out[8]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

There are no column names in the DataFrame. Let's add those.


In [9]:
bos.columns = boston.feature_names
bos.head()


Out[9]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Now we have a pandas DataFrame called bos containing all the data we want to use to predict Boston Housing prices. Let's create a variable called PRICE which will contain the prices. This information is contained in the target data.


In [83]:
print boston.target.shape


(506,)

In [84]:
bos['PRICE'] = boston.target
bos.head()


Out[84]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

EDA and Summary Statistics


Let's explore this data set. First we use describe() to get basic summary statistics for each of the columns.


In [12]:
bos.describe()


Out[12]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

Scatter plots


Let's look at some scatter plots for three variables: 'CRIM', 'RM' and 'PTRATIO'.

What kind of relationship do you see? e.g. positive, negative? linear? non-linear?


In [13]:
plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")


Out[13]:
<matplotlib.text.Text at 0x11094ee10>

Your turn: Create scatter plots between RM and PRICE, and PTRATIO and PRICE. What do you notice?


In [15]:
#your turn: scatter plot between *RM* and *PRICE*

plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average Number of Rooms per Dwelling")
plt.ylabel("Housing Price")
plt.title("Relationship between No. of Rooms and Price");



In [17]:
#your turn: scatter plot between *PTRATIO* and *PRICE*

plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-Teacher Ratio by town")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRatio and Price");


Your turn: What are some other numeric variables of interest? Plot scatter plots with these variables and PRICE.


In [ ]:
#your turn: create some other scatter plots

Scatter Plots using Seaborn


Seaborn is a cool Python plotting library built on top of matplotlib. It provides convenient syntax and shortcuts for many common types of plots, along with better-looking defaults.

We can also use seaborn regplot for the scatterplot above. This provides automatic linear regression fits (useful for data exploration later on). Here's one example below.


In [18]:
sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x114423f90>

Histograms


Histograms are a useful way to visually summarize the statistical properties of numeric variables. They can give you an idea of the mean and the spread of the variables as well as outliers.


In [19]:
plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequency")
plt.show()


Your turn: Plot two separate histograms, one for RM and one for PTRATIO. Any interesting observations?


In [22]:
#your turn

plt.hist(bos.PTRATIO)
plt.title("PTRATIO")
plt.xlabel("Pupil-Teacher Ratio")
plt.ylabel("Frequency")
plt.show()


Linear regression with Boston housing data example


Here,

$Y$ = Boston housing prices (also called the "target" data in Python)

and

$X$ = all the other features (or independent variables)

which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method as the way to estimate the coefficients.

We'll use two ways of fitting a linear regression. We recommend the first, but the second is also powerful and full-featured.

Fitting Linear Regression using statsmodels


Statsmodels is a great Python library for a lot of basic and inferential statistics. It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians. While we don't cover statsmodels officially in the Data Science Intensive, it's a good library to have in your toolbox. Here's a quick example of what you could do with it.


In [23]:
# Import regression modules
# ols - stands for Ordinary least squares, we'll use this
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [24]:
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
print m.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Tue, 04 Jul 2017   Prob (F-statistic):           2.49e-74
Time:                        13:38:43   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -34.6706      2.650    -13.084      0.000       -39.877   -29.465
RM             9.1021      0.419     21.722      0.000         8.279     9.925
==============================================================================
Omnibus:                      102.585   Durbin-Watson:                   0.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              612.449
Skew:                           0.726   Prob(JB):                    1.02e-133
Kurtosis:                       8.190   Cond. No.                         58.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting coefficients

There is a ton of information in this output, but we'll concentrate on the coefficient table (the middle table). First, notice that the p-value for RM (under P>|t|) is essentially zero, so the association is highly significant. We can interpret the RM coefficient (9.1021) as follows: if we compare two groups of towns that are identical except that one averages 5 rooms per dwelling and the other averages 6, the average difference in house prices between the groups is about 9.1 (in thousands), i.e. roughly a $\$9,100$ difference. The confidence interval gives us a range of plausible values for this difference, about ($\$8,279$, $\$9,925$), definitely not chump change.
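
If you want these numbers programmatically rather than reading them off the summary table, the fitted result object exposes them directly. A small sketch using the m object fitted above:


In [ ]:
# Pull the estimates and 95% confidence intervals out of the fitted result `m`
print(m.params)      # estimated intercept and RM coefficient
print(m.conf_int())  # 95% confidence interval for each coefficient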

statsmodels formulas


This formula notation will seem familiar to R users, but it will take some getting used to for people coming from other languages or who are new to statistics.

The formula specifies the general structure of a regression call. For statsmodels (ols or logit) calls, you need a pandas DataFrame whose column names you will use in your formula. In the example below you need a pandas DataFrame that includes the columns named (Outcome, X1, X2, ...), but you don't need to build a new DataFrame for every regression; use the same DataFrame containing all of these columns. The structure is very simple:

Outcome ~ X1

But of course we want to be able to handle more complex models; for example, multiple regression is done like this:

Outcome ~ X1 + X2 + X3

This is the very basic structure but it should be enough to get you through the homework. Things can get much more complex, for a quick run-down of further uses see the statsmodels help page.
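
For instance, with the bos DataFrame from this lab, a multiple regression of price on rooms, pupil-teacher ratio, and crime rate could be written as follows (a sketch only; the name m_multi is just illustrative and this cell isn't part of the original lab):


In [ ]:
# Sketch: multiple regression via the formula interface, using columns of `bos`
m_multi = ols('PRICE ~ RM + PTRATIO + CRIM', bos).fit()
print(m_multi.summary())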

Let's see how our model actually fit our data. We can see below that there is a ceiling effect; we should probably look into that. Also, for large values of $Y$ we get underpredictions: most predictions fall below the 45-degree line.

Your turn: Create a scatter plot between the predicted prices, available in m.fittedvalues, and the original prices. How does the plot look?


In [28]:
# your turn
plt.scatter(m.fittedvalues, bos.PRICE)
plt.xlabel("Predicted Price")
plt.ylabel("Original Housing Price")
plt.title("Predicted vs. Original Prices");


Fitting Linear Regression using sklearn


In [41]:
from sklearn.linear_model import LinearRegression
X = bos.drop('PRICE', axis = 1)

# This creates a LinearRegression object
lm = LinearRegression()
lm


Out[41]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

What can you do with a LinearRegression object?


Check out the scikit-learn docs here. The main functions are listed below.

Main functions Description
lm.fit() Fit a linear model
lm.predict() Predict Y using the linear model with estimated coefficients
lm.score() Returns the coefficient of determination (R^2). A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model

What output can you get?


In [45]:
# Look inside lm object
#lm.
Output Description
lm.coef_ Estimated coefficients
lm.intercept_ Estimated intercept

Fit a linear model


The lm.fit() function estimates the coefficients of the linear regression model using least squares.


In [47]:
# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)


Out[47]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Your turn: How would you change the model to not fit an intercept term? Would you recommend not having an intercept?

Estimated intercept and coefficients

Let's look at the estimated coefficients from the linear model using lm.intercept_ and lm.coef_.

After we have fit our linear regression model using the least squares method, we want to see the estimates of our coefficients $\beta_0$, $\beta_1$, ..., $\beta_{13}$:

$$ \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{13} $$

In [48]:
print 'Estimated intercept coefficient:', lm.intercept_


Estimated intercept coefficient: 36.4911032804

In [49]:
print 'Number of coefficients:', len(lm.coef_)


Number of coefficients: 13

In [50]:
# The coefficients
pd.DataFrame(zip(X.columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])


Out[50]:
features estimatedCoefficients
0 CRIM -0.107171
1 ZN 0.046395
2 INDUS 0.020860
3 CHAS 2.688561
4 NOX -17.795759
5 RM 3.804752
6 AGE 0.000751
7 DIS -1.475759
8 RAD 0.305655
9 TAX -0.012329
10 PTRATIO -0.953464
11 B 0.009393
12 LSTAT -0.525467

Predict Prices

We can calculate the predicted prices ($\hat{Y}_i$) using lm.predict.

$$ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \ldots + \hat{\beta}_{13} X_{13} $$

In [51]:
# first five predicted prices
lm.predict(X)[0:5]


Out[51]:
array([ 30.00821269,  25.0298606 ,  30.5702317 ,  28.60814055,  27.94288232])

Your turn:

  • Histogram: Plot a histogram of all the predicted prices
  • Scatter Plot: Let's plot the true prices against the predicted prices to see where they agree and where they disagree (we did this with statsmodels before).

In [52]:
# your turn

plt.hist(lm.predict(X))
plt.title("Predicted Prices")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()
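
The histogram covers the first bullet; for the second, here is a sketch of the true-versus-predicted scatter plot (mirroring the statsmodels version earlier):


In [ ]:
# Sketch: true prices vs. prices predicted by the sklearn model fit on all 13 predictors
plt.scatter(bos.PRICE, lm.predict(X))
plt.xlabel("True Price")
plt.ylabel("Predicted Price")
plt.title("True vs. Predicted Prices");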


Residual sum of squares

Let's calculate the residual sum of squares

$$ S = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 $$

In [53]:
print np.sum((bos.PRICE - lm.predict(X)) ** 2)


11080.2762841

Mean squared error


This is simply the residual sum of squares divided by the number of observations, i.e. the mean of the squared residuals.

Your turn: Calculate the mean squared error and print it.


In [56]:
#your turn
print np.mean((bos.PRICE - lm.predict(X)) ** 2)


21.8977792177
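
Alternatively, scikit-learn provides a helper for this in its metrics module; a quick sketch using the same full-model fit:


In [ ]:
# Sketch: the same quantity via scikit-learn
from sklearn.metrics import mean_squared_error
print(mean_squared_error(bos.PRICE, lm.predict(X)))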

Relationship between PTRATIO and housing price


Try fitting a linear regression model using only 'PTRATIO' (pupil-teacher ratio by town) as the predictor.

Calculate the mean squared error.


In [63]:
lm = LinearRegression()
lm.fit(X[['PTRATIO']], bos.PRICE)
#pd.DataFrame(zip(X.columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])

print np.mean((bos.PRICE - lm.predict(X[['PTRATIO']])) ** 2)


62.6522000138

In [64]:
msePTRATIO = np.mean((bos.PRICE - lm.predict(X[['PTRATIO']])) ** 2)
print msePTRATIO


62.6522000138

We can also plot the fitted linear regression line.


In [65]:
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-to-Teacher Ratio (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRATIO and Price")

plt.plot(bos.PTRATIO, lm.predict(X[['PTRATIO']]), color='blue', linewidth=3)
plt.show()


Your turn


Try fitting a linear regression model using three independent variables:

  1. 'CRIM' (per capita crime rate by town)
  2. 'RM' (average number of rooms per dwelling)
  3. 'PTRATIO' (pupil-teacher ratio by town)

Calculate the mean squared error.


In [71]:
# your turn
lm = LinearRegression()
lm.fit(X[['CRIM']], bos.PRICE)

print np.mean((bos.PRICE - lm.predict(X[['CRIM']])) ** 2)


71.8523466653

In [73]:
plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Crime per Capita")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")

plt.plot(bos.CRIM, lm.predict(X[['CRIM']]), color='blue', linewidth=3)
plt.show()
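
The solution above uses CRIM alone; to follow the prompt with all three predictors, a sketch might look like this (the name lm_three is just illustrative):


In [ ]:
# Sketch: fit on CRIM, RM and PTRATIO together and compute the MSE
lm_three = LinearRegression()
lm_three.fit(X[['CRIM', 'RM', 'PTRATIO']], bos.PRICE)
print(np.mean((bos.PRICE - lm_three.predict(X[['CRIM', 'RM', 'PTRATIO']])) ** 2))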


Other important things to think about when fitting a linear regression model


  • **Linearity**. The dependent variable $Y$ is a linear combination of the regression coefficients and the independent variables $X$.
  • **Constant standard deviation**. The SD of the dependent variable $Y$ should be constant for different values of $X$.
    • e.g. PTRATIO
  • **Normal distribution for errors**. The $\epsilon$ term we discussed at the beginning is assumed to be normally distributed: $$ \epsilon_i \sim N(0, \sigma^2)$$ Sometimes the distribution of the responses $Y$ at a given value of $X$ may not be normal, e.g. positively or negatively skewed.
  • **Independent errors**. The observations are assumed to be obtained independently.
    • e.g. Observations across time may be correlated
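
One way to eyeball the error assumptions is to look at the residuals of the statsmodels fit from earlier. A minimal sketch, assuming the m object (PRICE ~ RM) is still in scope:


In [ ]:
# Sketch: simple residual diagnostics for the PRICE ~ RM fit
residuals = bos.PRICE - m.fittedvalues

plt.hist(residuals)                     # roughly bell-shaped if the normality assumption holds
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()

plt.scatter(m.fittedvalues, residuals)  # look for non-constant spread across fitted values
plt.axhline(0, color='black')
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()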

Part 3: Training and Test Data sets

Purpose of splitting data into Training/testing sets


Let's stick to the linear regression example:

  • We built our model with the requirement that the model fit the data well.
  • As a side-effect, the model will fit THIS dataset well. What about new data?
    • We wanted the model for predictions, right?
  • One simple solution: leave out some data (for testing) and train the model on the rest
  • This also leads directly to the idea of cross-validation, next section.

One way of doing this is to create the training and testing data sets manually.


In [95]:
X_train = X[:-50]
X_test = X[-50:]
Y_train = bos.PRICE[:-50]
Y_test = bos.PRICE[-50:]
print X_train.shape
print X_test.shape
print Y_train.shape
print Y_test.shape
X_train.head()


(456, 13)
(50, 13)
(456,)
(50,)
Out[95]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Another way is to split the data into random train and test subsets using the function train_test_split in sklearn.cross_validation. Here's the documentation.


In [98]:
#X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
#    X, bos.PRICE, test_size=0.33, random_state = 5)
print X_train.shape
print X_test.shape
print Y_train.shape
print Y_test.shape
X_train


(456, 13)
(50, 13)
(456,)
(50,)
Out[98]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10
10 0.22489 12.5 7.87 0.0 0.524 6.377 94.3 6.3467 5.0 311.0 15.2 392.52 20.45
11 0.11747 12.5 7.87 0.0 0.524 6.009 82.9 6.2267 5.0 311.0 15.2 396.90 13.27
12 0.09378 12.5 7.87 0.0 0.524 5.889 39.0 5.4509 5.0 311.0 15.2 390.50 15.71
13 0.62976 0.0 8.14 0.0 0.538 5.949 61.8 4.7075 4.0 307.0 21.0 396.90 8.26
14 0.63796 0.0 8.14 0.0 0.538 6.096 84.5 4.4619 4.0 307.0 21.0 380.02 10.26
15 0.62739 0.0 8.14 0.0 0.538 5.834 56.5 4.4986 4.0 307.0 21.0 395.62 8.47
16 1.05393 0.0 8.14 0.0 0.538 5.935 29.3 4.4986 4.0 307.0 21.0 386.85 6.58
17 0.78420 0.0 8.14 0.0 0.538 5.990 81.7 4.2579 4.0 307.0 21.0 386.75 14.67
18 0.80271 0.0 8.14 0.0 0.538 5.456 36.6 3.7965 4.0 307.0 21.0 288.99 11.69
19 0.72580 0.0 8.14 0.0 0.538 5.727 69.5 3.7965 4.0 307.0 21.0 390.95 11.28
20 1.25179 0.0 8.14 0.0 0.538 5.570 98.1 3.7979 4.0 307.0 21.0 376.57 21.02
21 0.85204 0.0 8.14 0.0 0.538 5.965 89.2 4.0123 4.0 307.0 21.0 392.53 13.83
22 1.23247 0.0 8.14 0.0 0.538 6.142 91.7 3.9769 4.0 307.0 21.0 396.90 18.72
23 0.98843 0.0 8.14 0.0 0.538 5.813 100.0 4.0952 4.0 307.0 21.0 394.54 19.88
24 0.75026 0.0 8.14 0.0 0.538 5.924 94.1 4.3996 4.0 307.0 21.0 394.33 16.30
25 0.84054 0.0 8.14 0.0 0.538 5.599 85.7 4.4546 4.0 307.0 21.0 303.42 16.51
26 0.67191 0.0 8.14 0.0 0.538 5.813 90.3 4.6820 4.0 307.0 21.0 376.88 14.81
27 0.95577 0.0 8.14 0.0 0.538 6.047 88.8 4.4534 4.0 307.0 21.0 306.38 17.28
28 0.77299 0.0 8.14 0.0 0.538 6.495 94.4 4.4547 4.0 307.0 21.0 387.94 12.80
29 1.00245 0.0 8.14 0.0 0.538 6.674 87.3 4.2390 4.0 307.0 21.0 380.23 11.98
... ... ... ... ... ... ... ... ... ... ... ... ... ...
426 12.24720 0.0 18.10 0.0 0.584 5.837 59.7 1.9976 24.0 666.0 20.2 24.65 15.69
427 37.66190 0.0 18.10 0.0 0.679 6.202 78.7 1.8629 24.0 666.0 20.2 18.82 14.52
428 7.36711 0.0 18.10 0.0 0.679 6.193 78.1 1.9356 24.0 666.0 20.2 96.73 21.52
429 9.33889 0.0 18.10 0.0 0.679 6.380 95.6 1.9682 24.0 666.0 20.2 60.72 24.08
430 8.49213 0.0 18.10 0.0 0.584 6.348 86.1 2.0527 24.0 666.0 20.2 83.45 17.64
431 10.06230 0.0 18.10 0.0 0.584 6.833 94.3 2.0882 24.0 666.0 20.2 81.33 19.69
432 6.44405 0.0 18.10 0.0 0.584 6.425 74.8 2.2004 24.0 666.0 20.2 97.95 12.03
433 5.58107 0.0 18.10 0.0 0.713 6.436 87.9 2.3158 24.0 666.0 20.2 100.19 16.22
434 13.91340 0.0 18.10 0.0 0.713 6.208 95.0 2.2222 24.0 666.0 20.2 100.63 15.17
435 11.16040 0.0 18.10 0.0 0.740 6.629 94.6 2.1247 24.0 666.0 20.2 109.85 23.27
436 14.42080 0.0 18.10 0.0 0.740 6.461 93.3 2.0026 24.0 666.0 20.2 27.49 18.05
437 15.17720 0.0 18.10 0.0 0.740 6.152 100.0 1.9142 24.0 666.0 20.2 9.32 26.45
438 13.67810 0.0 18.10 0.0 0.740 5.935 87.9 1.8206 24.0 666.0 20.2 68.95 34.02
439 9.39063 0.0 18.10 0.0 0.740 5.627 93.9 1.8172 24.0 666.0 20.2 396.90 22.88
440 22.05110 0.0 18.10 0.0 0.740 5.818 92.4 1.8662 24.0 666.0 20.2 391.45 22.11
441 9.72418 0.0 18.10 0.0 0.740 6.406 97.2 2.0651 24.0 666.0 20.2 385.96 19.52
442 5.66637 0.0 18.10 0.0 0.740 6.219 100.0 2.0048 24.0 666.0 20.2 395.69 16.59
443 9.96654 0.0 18.10 0.0 0.740 6.485 100.0 1.9784 24.0 666.0 20.2 386.73 18.85
444 12.80230 0.0 18.10 0.0 0.740 5.854 96.6 1.8956 24.0 666.0 20.2 240.52 23.79
445 0.67180 0.0 18.10 0.0 0.740 6.459 94.8 1.9879 24.0 666.0 20.2 43.06 23.98
446 6.28807 0.0 18.10 0.0 0.740 6.341 96.4 2.0720 24.0 666.0 20.2 318.01 17.79
447 9.92485 0.0 18.10 0.0 0.740 6.251 96.6 2.1980 24.0 666.0 20.2 388.52 16.44
448 9.32909 0.0 18.10 0.0 0.713 6.185 98.7 2.2616 24.0 666.0 20.2 396.90 18.13
449 7.52601 0.0 18.10 0.0 0.713 6.417 98.3 2.1850 24.0 666.0 20.2 304.21 19.31
450 6.71772 0.0 18.10 0.0 0.713 6.749 92.6 2.3236 24.0 666.0 20.2 0.32 17.44
451 5.44114 0.0 18.10 0.0 0.713 6.655 98.2 2.3552 24.0 666.0 20.2 355.29 17.73
452 5.09017 0.0 18.10 0.0 0.713 6.297 91.8 2.3682 24.0 666.0 20.2 385.09 17.27
453 8.24809 0.0 18.10 0.0 0.713 7.393 99.3 2.4527 24.0 666.0 20.2 375.87 16.74
454 9.51363 0.0 18.10 0.0 0.713 6.728 94.1 2.4961 24.0 666.0 20.2 6.68 18.71
455 4.75237 0.0 18.10 0.0 0.713 6.525 86.5 2.4358 24.0 666.0 20.2 50.92 18.13

456 rows × 13 columns
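
Note that the train_test_split call in the cell above is commented out, so the shapes and rows shown still come from the manual split. Here is a sketch of the random split (in newer scikit-learn versions the import path is sklearn.model_selection; running this would of course replace the manual split used in the cells below):


In [ ]:
# Sketch: random train/test split with scikit-learn
from sklearn.cross_validation import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, bos.PRICE, test_size=0.33, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)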

Your turn: Let's build a linear regression model using our new training data sets.

  • Fit a linear regression model to the training set
  • Predict the output on the test set

In [97]:
# your turn
lm = LinearRegression()
lm.fit(X_train[['RM']], Y_train)

print np.mean((Y_train - lm.predict(X_train[['RM']])) ** 2)

plt.scatter(X_train.RM, Y_train)
plt.xlabel("No. of Rooms")
plt.ylabel("Housing Price")
plt.title("Relationship between No. of Rooms and Price")

plt.plot(X_train.RM, lm.predict(X_train[['RM']]), color='blue', linewidth=3)
plt.show()


46.2932242688

Your turn:

Calculate the mean squared error

  • using just the test data
  • using just the training data

Are they pretty similar or very different? What does that mean?


In [96]:
# your turn
lm = LinearRegression()
lm.fit(X_test[['RM']], Y_test)

print np.mean((Y_test - lm.predict(X_test[['RM']])) ** 2)

plt.scatter(X_test.RM, Y_test)
plt.xlabel("No. of Rooms")
plt.ylabel("Housing Price")
plt.title("Relationship between No. of Rooms and Price")

plt.plot(X_test.RM, lm.predict(X_test[['RM']]), color='blue', linewidth=3)
plt.show()


12.9000419523
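
Strictly speaking, the exercise asks for the test-set error of the model fit on the training data, rather than a model refit on the test set as above. A sketch of that comparison (the name lm_rm is just illustrative):


In [ ]:
# Sketch: fit on the training split only, then compare training vs. test MSE
lm_rm = LinearRegression()
lm_rm.fit(X_train[['RM']], Y_train)

print(np.mean((Y_train - lm_rm.predict(X_train[['RM']])) ** 2))  # training MSE
print(np.mean((Y_test - lm_rm.predict(X_test[['RM']])) ** 2))    # test MSE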

Residual plots


In [92]:
plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)
plt.hlines(y = 0, xmin=0, xmax = 50)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-92-879fe1fe378b> in <module>()
----> 1 plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
      2 plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)
      3 plt.hlines(y = 0, xmin=0, xmax = 50)
      4 plt.title('Residual Plot using training (blue) and test (green) data')
      5 plt.ylabel('Residuals')

/Users/MacBookPro15/anaconda2/lib/python2.7/site-packages/sklearn/linear_model/base.pyc in predict(self, X)
    266             Returns predicted values.
    267         """
--> 268         return self._decision_function(X)
    269 
    270     _preprocess_data = staticmethod(_preprocess_data)

/Users/MacBookPro15/anaconda2/lib/python2.7/site-packages/sklearn/linear_model/base.pyc in _decision_function(self, X)
    251         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    252         return safe_sparse_dot(X, self.coef_.T,
--> 253                                dense_output=True) + self.intercept_
    254 
    255     def predict(self, X):

/Users/MacBookPro15/anaconda2/lib/python2.7/site-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
    187         return ret
    188     else:
--> 189         return fast_dot(a, b)
    190 
    191 

ValueError: shapes (456,14) and (1,) not aligned: 14 (dim 1) != 1 (dim 0)
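
The ValueError above is a shape mismatch: the most recent lm was fit on a single predictor (RM), but predict is being called here with the full feature matrix. Here is a sketch of the intended residual plot, refitting on all training features first (the name lm_full is just illustrative):


In [ ]:
# Sketch: refit on all training features before plotting residuals
lm_full = LinearRegression()
lm_full.fit(X_train, Y_train)

plt.scatter(lm_full.predict(X_train), lm_full.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(lm_full.predict(X_test), lm_full.predict(X_test) - Y_test, c='g', s=40)
plt.hlines(y=0, xmin=0, xmax=50)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')
plt.show()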

Your turn: Do you think this linear regression model generalizes well on the test data?

K-fold Cross-validation as an extension of this idea


A simple extension of the Test/train split is called K-fold cross-validation.

Here's the procedure:

  • Randomly assign your $n$ samples to one of $K$ groups; each group will have about $n/K$ samples
  • For each group $k$:
    • Fit the model (e.g. run regression) on all data excluding the $k^{th}$ group
    • Use the model to predict the outcomes in group $k$
    • Calculate your prediction error for each observation in the $k^{th}$ group (e.g. $(Y_i - \hat{Y}_i)^2$ for regression, $\mathbb{1}(Y_i \neq \hat{Y}_i)$ for logistic regression).
  • Calculate the average prediction error across all samples $Err_{CV} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2$

Luckily you don't have to do this entire process by hand (for loops, etc.) every single time; scikit-learn has a very nice implementation of this, have a look at the documentation.

Your turn (extra credit): Implement K-fold cross-validation using the procedure above on the Boston Housing data set with $K=4$. How does the average prediction error compare to the train-test split above?
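
One possible sketch of the manual procedure, using only NumPy indexing and assuming X and bos.PRICE from above are still in scope:


In [ ]:
# Sketch: manual 4-fold cross-validation following the procedure above
K = 4
n = len(bos)
rng = np.random.RandomState(0)
folds = np.array_split(rng.permutation(n), K)   # K roughly equal groups of row indices

squared_errors = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    model = LinearRegression()
    model.fit(X.iloc[train_idx], bos.PRICE.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    squared_errors.append((bos.PRICE.iloc[test_idx].values - preds) ** 2)

print(np.mean(np.concatenate(squared_errors)))  # average prediction error across all samples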


In [ ]: