Linear Regression - Case Study - part 1

A well-known case for regression with continuous features is the Boston housing dataset. It is so well known that it is distributed as part of common data science libraries, such as scikit-learn.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import linear_model
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
import statsmodels.api as sm



In [2]:
from sklearn.datasets import load_boston

In [3]:
boston = load_boston()
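
Note: in recent versions of scikit-learn, load_boston was deprecated (1.0) and then removed (1.2). If the import above fails, the raw data can still be fetched from the original StatLib source; the sketch below is essentially the recipe scikit-learn suggested at deprecation time, reproduced here as a fallback rather than as part of the original notebook (feature names then have to be set manually from the attribute list below):

import numpy as np
import pandas as pd

# Fallback for scikit-learn >= 1.2, where load_boston is no longer available.
# Each observation is spread over two physical lines in the raw StatLib file.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]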

In [4]:
print(boston.DESCR)


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Load and explore the dataset

For convenience, we load the features and the target into two pandas DataFrames, X and y, and concatenate them into a single DataFrame called df.


In [5]:
X = pd.DataFrame(boston.data)
X.columns = boston.feature_names
y = pd.DataFrame(boston.target)
y.columns = ['MEDV']
df = pd.concat([X,y], axis=1)
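
As a quick sanity check (a small addition to the original flow), we can confirm the dimensions announced in the description and the absence of missing values:

# 506 rows and 14 columns (13 features + the MEDV target); no missing values expected
print(df.shape)
print(df.isnull().sum().sum())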

The first rows of the dataset look like this, the last column MEDV being the target (the median price):


In [6]:
df.head()


Out[6]:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2

The seaborn package provides a very convenient way of plotting the distributions of the features and of the target:


In [7]:
# Plot the distribution of each of the 13 features on a 4x4 grid, plus the target MEDV (in red)
fig = plt.figure(figsize=(18,18))
for i, col in enumerate(X.columns):
    ax = fig.add_subplot(4,4,i+1)
    sns.distplot(X[col], ax=ax)
ax = fig.add_subplot(4,4,len(X.columns)+2)
sns.distplot(y.MEDV, ax=ax, color='r');
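
If you are running a more recent seaborn (0.11 or later), distplot is deprecated; histplot with kde=True is a close equivalent. A minimal alternative under that assumption:

# Same grid of distributions on seaborn >= 0.11, where distplot is deprecated
fig = plt.figure(figsize=(18,18))
for i, col in enumerate(X.columns):
    ax = fig.add_subplot(4,4,i+1)
    sns.histplot(X[col], kde=True, ax=ax)
ax = fig.add_subplot(4,4,len(X.columns)+2)
sns.histplot(y.MEDV, kde=True, ax=ax, color='r');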


We could already dig deeper into this dataset and its features, but that belongs in the next parts. For now, we will use them untouched and focus on our first simple prediction pipeline.

Our first prediction pipeline

Let's define a function that will go through the whole prediction, post-processing and evaluation, so that we can run it easily.

In this pipeline, we will:

  • split the dataset into a training and a test set

    For the moment this is done in a very simple way: the test set is simply the last 100 rows (approx. 20%) of the dataset.

  • fit a linear regression model on the training data

  • use this model to make a prediction on test data

  • compute the mean squared error between the predicted and true targets

  • plot (1) the predicted values vs. the true values, and (2) the distribution of the residuals

In the next parts of this case study we will address the limitations of this first pipeline, but it illustrates some important steps.
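
As a reminder, the mean squared error that serves as our evaluation metric here is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$

where $y_i$ are the true prices, $\hat{y}_i$ the predictions, and $n$ the size of the test set.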


In [8]:
def run_regression(X, y):
    # Split the data into training/test sets
    X_train = X.values[:-100]
    X_test = X.values[-100:]

    # Split the targets into training/test sets
    y_train = y.values[:-100]
    y_test = y.values[-100:]
    
    # Create linear regression object
    regr = linear_model.LinearRegression()

    # Train the model using the training sets
    regr.fit(X_train, y_train)
    
    # Make prediction
    y_pred = regr.predict(X_test)
    
    # Compute the mean squared error between the true and predicted targets
    mse = mean_squared_error(y_test, y_pred)
    
    # Plot outputs
    fig = plt.figure(figsize=(12,6))
    ax = fig.add_subplot(121)
    ax.scatter(y_test, y_pred,  color='black')
    ax.plot([0,40],[0,40], color='g')
    ax.set_xlim([0,40])
    ax.set_ylim([0,40])
    ax.set_xlabel('Actual value', fontsize=14)
    ax.set_ylabel('Predicted value', fontsize=14)
    ax.set_title('Mean squared error: {:.2f}'.format(mse), fontsize=16)
    
    ax2 = fig.add_subplot(122)
    sns.distplot(y_test-y_pred, ax=ax2, bins=15)
    ax2.set_xlim([-30,30])
    ax2.axvline(0, color='k', linestyle='dotted')
    ax2.set_title('Residuals distribution', fontsize=16)

Simple linear regression

The average number of rooms (RM) could be an influential feature for predicting the price. Let's fit a linear regression by ordinary least squares using this feature only:


In [9]:
run_regression(pd.DataFrame(X.RM),y)


The OLS class from the statsmodels library gives extensive insight into the characteristics of an OLS regression:


In [10]:
olsy = y.values
olsX = X.RM.values
# statsmodels does not add an intercept by default, so we add the constant column explicitly
olsX = sm.add_constant(olsX)
model = sm.OLS(olsy, olsX)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Wed, 02 Aug 2017   Prob (F-statistic):           2.49e-74
Time:                        20:33:23   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -34.6706      2.650    -13.084      0.000     -39.877     -29.465
x1             9.1021      0.419     21.722      0.000       8.279       9.925
==============================================================================
Omnibus:                      102.585   Durbin-Watson:                   0.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              612.449
Skew:                           0.726   Prob(JB):                    1.02e-133
Kurtosis:                       8.190   Cond. No.                         58.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can at least state that the simple linear regression with this feature only gives pretty poor results: the fitted line is roughly $\mathrm{MEDV} \approx -34.67 + 9.10 \cdot \mathrm{RM}$, but with an R-squared of about 0.48 it explains less than half of the variance in prices.
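
For completeness, the quantities in this summary are also available programmatically on the fitted results object, which is handy when post-processing several models; for example:

# Individual statistics from the fitted statsmodels results object
print(results.params)      # intercept (const) and slope for RM
print(results.rsquared)    # coefficient of determination (about 0.48 here)
print(results.pvalues)     # p-values of the coefficients
print(results.conf_int())  # 95% confidence intervals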

Multiple linear regression

Let's try multiple linear regression with all the features of the untouched dataset:


In [11]:
run_regression(X,y)


Once again, we can get a more detailed insight:


In [12]:
# Note: here we pass the DataFrames directly and do not add a constant,
# so (unlike the simple regression above) this model is fitted without an intercept
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.959
Model:                            OLS   Adj. R-squared:                  0.958
Method:                 Least Squares   F-statistic:                     891.1
Date:                Wed, 02 Aug 2017   Prob (F-statistic):               0.00
Time:                        20:33:23   Log-Likelihood:                -1523.8
No. Observations:                 506   AIC:                             3074.
Df Residuals:                     493   BIC:                             3129.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
CRIM          -0.0916      0.034     -2.675      0.008      -0.159      -0.024
ZN             0.0487      0.014      3.379      0.001       0.020       0.077
INDUS         -0.0038      0.064     -0.059      0.953      -0.130       0.123
CHAS           2.8564      0.904      3.160      0.002       1.080       4.633
NOX           -2.8808      3.359     -0.858      0.392      -9.481       3.720
RM             5.9252      0.309     19.168      0.000       5.318       6.533
AGE           -0.0072      0.014     -0.523      0.601      -0.034       0.020
DIS           -0.9680      0.196     -4.947      0.000      -1.352      -0.584
RAD            0.1704      0.067      2.554      0.011       0.039       0.302
TAX           -0.0094      0.004     -2.393      0.017      -0.017      -0.002
PTRATIO       -0.3924      0.110     -3.571      0.000      -0.608      -0.177
B              0.0150      0.003      5.561      0.000       0.010       0.020
LSTAT         -0.4170      0.051     -8.214      0.000      -0.517      -0.317
==============================================================================
Omnibus:                      204.050   Durbin-Watson:                   0.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1372.527
Skew:                           1.609   Prob(JB):                    9.11e-299
Kurtosis:                      10.399   Cond. No.                     8.50e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.5e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

statsmodels.OLS warns us about possible strong multicollinearity (note the large condition number). This is a serious warning: remember that the assumption of full column rank for $X$ is at the core of the derivation of the OLS estimator.
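
One common way to quantify this multicollinearity, beyond the condition number reported above, is to compute a variance inflation factor (VIF) for each feature. This is not part of the original pipeline, just a sketch using statsmodels' variance_inflation_factor:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each feature, computed on the design matrix with an intercept;
# values well above ~10 are a common (rule-of-thumb) sign of strong multicollinearity
vif_X = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(vif_X.values, i) for i in range(1, vif_X.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))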

In the next parts of this case study we will, at least, have to:

  • identify possibly erroneous values in the dataset
  • study correlations between features, and correlations between features and target
  • select and/or transform some features
  • have a smarter way of splitting the data into training and test sets: nothing guarantees that the last 100 rows we took are representative of the rest of the dataset (see the sketch below for one option)
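
As a first taste of that last point, scikit-learn provides train_test_split in sklearn.model_selection (next to the ShuffleSplit class already imported above). A minimal sketch of a shuffled, reproducible split, to be developed in the next parts:

from sklearn.model_selection import train_test_split

# Shuffled split, roughly 80% training / 20% test, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)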