Regression in Python


This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard's CS109 course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.

  • Linear Regression Models
  • Prediction using linear regression

Linear regression is used to model and predict continuous outcomes, assuming normally distributed random errors. There are nearly an infinite number of regression model types, and each is typically distinguished by the assumed distribution of its prediction errors (called "residuals"), which in turn depends on the type of outcome being modeled. Logistic regression, for example, is used to model binary outcomes, whereas Poisson regression is used to predict counts. In this exercise, we'll see some examples of linear regression as well as train-test splits.
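To make these distinctions concrete, here is a minimal, self-contained sketch (not part of the original lab) of how the three model families are specified with the statsmodels formula API; the data and variable names are synthetic and purely illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.RandomState(0)
demo = pd.DataFrame({"x": rng.normal(size=200)})
demo["y_cont"] = 2.0 * demo.x + rng.normal(size=200)             # continuous outcome
demo["y_bin"] = (demo.x + rng.normal(size=200) > 0).astype(int)  # binary outcome
demo["y_cnt"] = rng.poisson(np.exp(0.5 * demo.x))                # count outcome

lin = smf.ols("y_cont ~ x", demo).fit()      # linear regression: normal errors
log = smf.logit("y_bin ~ x", demo).fit()     # logistic regression: binary outcome
poi = smf.poisson("y_cnt ~ x", demo).fit()   # Poisson regression: counts

All three share the same formula syntax; only the assumed error distribution changes.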

The packages we'll cover are: statsmodels, seaborn, and scikit-learn. While we don't explicitly teach statsmodels and seaborn in the Springboard workshop, those are great libraries to know.



In [1]:
# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn

import seaborn as sns

# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")

Part 2: Exploratory Data Analysis for Linear Relationships

The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.

Load the Boston Housing data set from sklearn


This data set is available in the sklearn Python module, which is how we will access it today.


In [2]:
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()

In [3]:
boston.keys()


Out[3]:
dict_keys(['DESCR', 'feature_names', 'data', 'target'])

In [4]:
boston.data.shape


Out[4]:
(506, 13)

In [5]:
# Print column names
print(boston.feature_names)


['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

In [6]:
# Print description of Boston housing data set
print(boston.DESCR)


Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Now let's explore the data set itself.


In [7]:
bos = pd.DataFrame(boston.data)
bos.head()


Out[7]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

There are no column names in the DataFrame. Let's add those.


In [8]:
bos.columns = boston.feature_names
bos.head()


Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Now we have a pandas DataFrame called bos containing all the data we want to use to predict Boston housing prices. Let's add a column called PRICE to hold the prices themselves; this information is stored in the target array.


In [9]:
print(boston.target.shape)


(506,)

In [10]:
bos['PRICE'] = boston.target
bos.head()


Out[10]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

EDA and Summary Statistics


Let's explore this data set. First we use describe() to get basic summary statistics for each of the columns.


In [11]:
bos.describe()


Out[11]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

Scatterplots


Let's look at some scatter plots for three variables: 'CRIM' (per capita crime rate), 'RM' (number of rooms) and 'PTRATIO' (pupil-to-teacher ratio in schools).


In [12]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")


Out[12]:
<matplotlib.text.Text at 0x22f95cf2208>

In [13]:
# Answers to Part 2 Exercise Set 1
# Question 1) What kind of relationship do you see? e.g. positive, negative?
#             linear? non-linear? Is there anything else strange or interesting about
#             the data? What about outliers?

#             I see a weak negative, roughly linear relationship. The data is also interesting in
#             that the crime rate's distribution appears strongly positively skewed, with a few outliers.

In [14]:
# Part 2 Exercise Set 1 
# Question 2: Create scatter plots between *RM* and *PRICE*, and PTRATIO and PRICE.
#             Label your axes appropriately using human readable labels.
#             Tell a story about what you see.

# Create scatter plots between *RM* and *PRICE*
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average Number of Rooms Per Dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between RM and Price")


Out[14]:
<matplotlib.text.Text at 0x22f960e5080>

In [15]:
# Part 2 Exercise Set 1:
# Create scatter plot between *PTRATIO* and *PRICE*
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-Teacher Ratio by Town (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRATIO and Price")


Out[15]:
<matplotlib.text.Text at 0x22f9614e7f0>

In [16]:
# Question 2 continuation: the graph suggests a positive linear 
#     relationship between the average number of rooms per 
#     dwelling and housing price.

In [17]:
# your turn: create some other scatter plots
# scatter plot between *NOX* and *PRICE*
plt.scatter(bos.NOX, bos.PRICE)
plt.xlabel("Nitric Oxides Concentration (parts per 10 million) (NOX)")
plt.ylabel("Housing Price")
plt.title("Relationship between NOX and Price")


Out[17]:
<matplotlib.text.Text at 0x22f961b5860>

In [18]:
# Exercise 1: What are some other numeric variables of interest? Why do you think
#    they are interesting? Plot scatterplots with these variables and
#    PRICE (house price) and tell a story about what you see.

#    In my opinion, other variables of interest would be nitric oxides
#    concentration, since it captures pollution in the area.
#    Another is the B column (a transformation of the proportion of
#    Black residents by town), which may be related to housing prices.

# your turn: create some other scatter plots
# Create a scatter plot between *B* and *PRICE*
plt.scatter(bos.B, bos.PRICE)
plt.xlabel("1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (B)")
plt.ylabel("Housing Price")
plt.title("Relationship between B and Price")


Out[18]:
<matplotlib.text.Text at 0x22f9621a710>

In [19]:
# your turn: create some other scatter plots
# Create a scatter plot between *DIS* and *LSTAT*
plt.scatter(bos.DIS, bos.LSTAT)
plt.xlabel("weighted distances to five Boston employment centres (DIS)")
plt.ylabel("% lower status of the population")
plt.title("Relationship between DIS and LSTAT")


Out[19]:
<matplotlib.text.Text at 0x22f964d1860>

In [20]:
import seaborn as sns
sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x22f964c3550>

Histograms



In [21]:
plt.hist(np.log(bos.CRIM))
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()



In [22]:
# Part 2 Exercise 1: In the above histogram, we took the logarithm of the crime rate per 
# capita. Repeat this histogram without taking the log.

plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()

# Exercise 2 Question 1 continuation: What was the purpose of taking the log? What do we gain 
#      by making this transformation? What do you now notice about this variable that is not 
#      obvious without making the transformation? 

#      We usually take logarithms of variables that span several orders of magnitude or
#      behave multiplicatively (e.g. grow exponentially). Taking the log before plotting
#      compresses the long right tail and removes that multiplicative character, so any
#      underlying pattern can be examined on a roughly linear scale; in short, logging
#      straightens out the trend so that a linear model can sensibly be fitted.

#      Without the transformation, the raw histogram is dominated by a single bar near zero:
#      the crime rate is heavily right-skewed, and the structure of the bulk of the data
#      only becomes visible after taking the log.
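As an aside (not part of the original exercise), instead of transforming the values one could keep the raw crime rates and use a logarithmic x-axis with log-spaced bins; a small sketch, assuming the bos DataFrame defined above:

# Log-spaced bins spanning the observed range of CRIM
bins = np.logspace(np.log10(bos.CRIM.min()), np.log10(bos.CRIM.max()), 30)
plt.hist(bos.CRIM, bins=bins)
plt.xscale("log")
plt.title("CRIM")
plt.xlabel("Crime rate per capita (log-scaled axis)")
plt.ylabel("Frequency")
plt.show()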



In [23]:
# Part 2 Exercise 2:
#          Plot the histogram for RM and PTRATIO against each other, along 
#          with the two variables you picked in the previous section. We 
#          are looking for correlations in predictors here.

import seaborn as sns
sns.set(color_codes=True)

sns.jointplot(bos.RM, bos.PTRATIO)


Out[23]:
<seaborn.axisgrid.JointGrid at 0x22f967d1278>

In [24]:
# Part 2 Exercise 2 Continuation:
#          Plot the histogram for the two variables you picked in 
#          the previous section. 

import seaborn as sns
sns.set(color_codes=True)

sns.jointplot(bos.NOX, bos.PRICE)


Out[24]:
<seaborn.axisgrid.JointGrid at 0x22f966a08d0>
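Since this exercise is about correlations among predictors, another quick check (a sketch, not part of the original answer) is to render the full correlation matrix as a heatmap:

plt.figure(figsize=(12, 9))
sns.heatmap(bos.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Pairwise correlations of the Boston housing variables")
plt.show()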

Part 3: Linear Regression with Boston Housing Data Example


Here,

$Y$ = Boston housing prices (called "target" data in Python, and referred to as the dependent variable or response variable)

and

$X$ = all the other features (or independent variables, predictors or explanatory variables)

which we will use to fit a linear regression model and predict Boston housing prices. We will use the least-squares method to estimate the coefficients.

We'll use two ways of fitting a linear regression: statsmodels and scikit-learn. We recommend the first, but the second is also powerful in its features.
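Before reaching for the libraries, it may help to recall what least squares actually computes: the coefficients that minimize the residual sum of squares, which (once an intercept column is added to $X$) have the closed-form solution $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. A minimal numpy sketch for the single-predictor case, assuming the bos DataFrame from above:

import numpy as np

# Design matrix with an explicit intercept column and the single predictor RM
X_design = np.column_stack([np.ones(len(bos)), bos.RM.values])
y = bos.PRICE.values

# np.linalg.lstsq solves the least-squares problem (numerically more stable
# than forming (X'X)^{-1} X'y directly)
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # should be close to the intercept (about -34.67) and slope (about 9.10) reported below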


In [25]:
# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [26]:
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
print(m.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Fri, 10 Mar 2017   Prob (F-statistic):           2.49e-74
Time:                        11:26:19   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -34.6706      2.650    -13.084      0.000       -39.877   -29.465
RM             9.1021      0.419     21.722      0.000         8.279     9.925
==============================================================================
Omnibus:                      102.585   Durbin-Watson:                   0.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              612.449
Skew:                           0.726   Prob(JB):                    1.02e-133
Kurtosis:                       8.190   Cond. No.                         58.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Let's see how well our model actually fits the data. We can see below that there is a ceiling effect, which we should probably look into. Also, for large values of $Y$ we get under-predictions: most predictions fall below the 45-degree line.


In [27]:
# Part 3 Exercise 1: Create a scatterplot between the predicted prices,
# available in m.fittedvalues (where m is the fitted model)
# and the original prices.

# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols

# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()

# Create the scatter plot between predicted values and *PRICE*
plt.scatter(m.predict(), bos.PRICE)
plt.xlabel("Predicted Housing Price Based on Linear Regression")
plt.ylabel("Housing Price")
plt.title("Relationship between Predicted Price and Original Price")


Out[27]:
<matplotlib.text.Text at 0x22f981a6630>
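To make the under-prediction at high prices easier to see, one could overlay the 45-degree reference line on the same scatter plot; a small sketch, reusing the fitted model m from above:

plt.scatter(m.fittedvalues, bos.PRICE)
plt.plot([0, 50], [0, 50], "r--", label="perfect prediction (y = x)")
plt.xlabel("Predicted Housing Price Based on Linear Regression")
plt.ylabel("Housing Price")
plt.title("Predicted vs. Original Price with 45-degree reference line")
plt.legend()
plt.show()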

Fitting Linear Regression using sklearn


In [28]:
from sklearn.linear_model import LinearRegression
X = bos.drop('PRICE', axis = 1)

# This creates a LinearRegression object
lm = LinearRegression()
lm


Out[28]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [29]:
# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)


Out[29]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Part 3 Checkup Exercise Set II

Exercise: How would you change the model to not fit an intercept term? Would you recommend not having an intercept? Why or why not? For more information on why to include or exclude an intercept, look [here](https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf).

Exercise: One of the assumptions of the linear model is that the residuals must be i.i.d. (independently and identically distributed). To satisfy this, is it enough that the residuals are normally distributed? Explain your answer.

Exercise: True or false. To use linear regression, $Y$ must be normally distributed. Explain your answer.


In [30]:
# Part 3 Exercise 2 Question:
#   How would you change the model to not fit an intercept term?
#   Would you recommend not having an intercept? Why or why not?

#   To change the model so that it does not fit an intercept term,
#   we need to fit a regression through the origin (RTO).
#   With sklearn's LinearRegression, this means setting the
#   fit_intercept parameter to False (a short sketch follows this cell).

#   As far as recommending whether to have an intercept or not,
#   this depends on the data set. Hocking (1996) and Adelman
#   et al. (1994) found that the range and size of the data need
#   careful consideration. For example, if the data lie far from
#   the origin, forcing the fit through the origin can
#   misrepresent what is otherwise a linear relationship with a
#   nonzero intercept. If uncertain, one can run some
#   diagnostics: Hahn (1977) suggested fitting the model with
#   and without an intercept and comparing the standard errors
#   to decide whether OLS or RTO provides the better fit.
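For reference, a minimal sketch of both ways of dropping the intercept mentioned above: scikit-learn's fit_intercept flag and the '- 1' term in a statsmodels formula (the names lm_rto and m_rto are illustrative).

from sklearn.linear_model import LinearRegression

# scikit-learn: regression through the origin
lm_rto = LinearRegression(fit_intercept=False)
lm_rto.fit(X, bos.PRICE)
print(lm_rto.intercept_)   # 0.0 by construction

# statsmodels: '- 1' (or '+ 0') in the formula removes the intercept term
m_rto = ols('PRICE ~ RM - 1', bos).fit()
print(m_rto.params)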

In [31]:
print('Estimated intercept coefficient: {}'.format(lm.intercept_))


Estimated intercept coefficient: 36.491103280361116

In [32]:
print('Number of coefficients: {}'.format(len(lm.coef_)))


Number of coefficients: 13

In [33]:
# The coefficients
pd.DataFrame({'features': X.columns, 'estimatedCoefficients': lm.coef_})[['features', 'estimatedCoefficients']]


Out[33]:
features estimatedCoefficients
0 CRIM -0.107171
1 ZN 0.046395
2 INDUS 0.020860
3 CHAS 2.688561
4 NOX -17.795759
5 RM 3.804752
6 AGE 0.000751
7 DIS -1.475759
8 RAD 0.305655
9 TAX -0.012329
10 PTRATIO -0.953464
11 B 0.009393
12 LSTAT -0.525467

Predict Prices

We can calculate the predicted prices ($\hat{Y}_i$) using lm.predict.

$$ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \ldots + \hat{\beta}_{13} X_{13} $$
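As a quick sanity check on this formula, the predictions are just the intercept plus each row of $X$ dotted with the coefficient vector; a small sketch using the lm fitted above:

import numpy as np

manual = lm.intercept_ + X.values @ lm.coef_   # beta_0 + X * beta, row by row
print(np.allclose(manual, lm.predict(X)))      # expect True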

In [34]:
# first five predicted prices
lm.predict(X)[0:5]


Out[34]:
array([ 30.00821269,  25.0298606 ,  30.5702317 ,  28.60814055,  27.94288232])

In [35]:
# Part 3 Exercise Set III:
#    Question 1: Histogram: Plot a histogram of all the predicted prices. Write a story 
#        about what you see. Describe the shape, center and spread of the distribution. 
#        Are there any outliers? What might be the reason for them? Should we do
#        anything special with them?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

plt.hist(lm.predict(X))
plt.title("Linear Regression")
plt.xlabel("Predicted Prices")
plt.ylabel("Frequency")
plt.show()

#  The histogram appears roughly symmetric and bell-shaped, suggesting a normal 
#  distribution. The center is around 20 on the x-axis, and the spread of the
#  distribution runs from about -5 to 45. Yes, there are outliers, in the form
#  of negative predicted prices.



In [36]:
# Part 3 Exercise Set III
#    Question 2: Scatterplot: Let's plot the true prices compared to
#       the predicted prices to see how they disagree 
#      (we did this with statsmodels before).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Create the scatter plot between predicted values and *PRICE*
plt.scatter(lm.predict(X), bos.PRICE)
plt.xlabel("Predicted Housing Price Based on Linear Regression")
plt.ylabel("Housing Price")
plt.title("Relationship between Predicted Price and Original Price")

#   Question 3: We have looked at fitting a linear model in both
#          statsmodels and scikit-learn. What are the advantages 
#          and disadvantages of each based on your exploration? 
#          Based on the information provided by both packages, 
#          what advantage does statsmodels provide?


Out[36]:
<matplotlib.text.Text at 0x22f98543048>

In [37]:
print(np.sum((bos.PRICE - lm.predict(X)) ** 2))


11080.276284149877

In [38]:
# Note the parentheses: this sums (Yhat_i - mean) first and then squares the sum,
# which is essentially zero for a model with an intercept; see the decomposition
# sketch after this cell.
print(np.sum(lm.predict(X) - np.mean(bos.PRICE)) ** 2)


2.61725115691e-23
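A sketch of the usual sum-of-squares decomposition and its link to $R^2$, assuming lm is still the full 13-predictor fit from above:

RSS = np.sum((bos.PRICE - lm.predict(X)) ** 2)           # residual sum of squares
ESS = np.sum((lm.predict(X) - bos.PRICE.mean()) ** 2)    # explained sum of squares
TSS = np.sum((bos.PRICE - bos.PRICE.mean()) ** 2)        # total sum of squares
print(np.isclose(TSS, RSS + ESS))                        # expect True for OLS with an intercept
print(1 - RSS / TSS, ESS / TSS)                          # both equal R^2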

In [39]:
# Part 3 Exercise Set IV:

# Question 1: Fit a linear regression model using only the Pupil-teacher
# ratio by town (PTRATIO) column and interpret the coefficients.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
import pandas as pd

lm = LinearRegression()
lm.fit(X[['PTRATIO']], bos.PRICE)

print('Estimated intercept coefficient: {}'.format(lm.intercept_))
print('Number of coefficients: {}'.format(len(lm.coef_)))


Estimated intercept coefficient: 62.344627474832706
Number of coefficients: 1
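The slope itself lives in lm.coef_; printing it (a small addition, not in the original cell) makes the interpretation concrete: each one-unit increase in the pupil-teacher ratio is associated with a drop of roughly $2,157 in median home value, since prices are recorded in $1000's.

print('PTRATIO coefficient: {}'.format(lm.coef_[0]))   # about -2.157, matching the statsmodels summary below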

In [40]:
# Exercise 2: Calculate (or extract) the R2 value. What does it tell you?
lm.score(X[['PTRATIO']], bos.PRICE)


Out[40]:
0.257847318009223

In [41]:
# Exercise 3: Compute the F-statistic. What does it tell you?
m = ols('PRICE ~ PTRATIO',bos).fit()
print(m.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.258
Model:                            OLS   Adj. R-squared:                  0.256
Method:                 Least Squares   F-statistic:                     175.1
Date:                Fri, 10 Mar 2017   Prob (F-statistic):           1.61e-34
Time:                        11:26:23   Log-Likelihood:                -1764.8
No. Observations:                 506   AIC:                             3534.
Df Residuals:                     504   BIC:                             3542.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     62.3446      3.029     20.581      0.000        56.393    68.296
PTRATIO       -2.1572      0.163    -13.233      0.000        -2.477    -1.837
==============================================================================
Omnibus:                       92.924   Durbin-Watson:                   0.725
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              191.444
Skew:                           1.001   Prob(JB):                     2.68e-42
Kurtosis:                       5.252   Cond. No.                         160.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Part 3 Checkup Exercise Set V

Fit a linear regression model using three independent variables

  1. 'CRIM' (per capita crime rate by town)
  2. 'RM' (average number of rooms per dwelling)
  3. 'PTRATIO' (pupil-teacher ratio by town)

Exercise: Compute or extract the $F$-statistic. What does it tell you about the model?

Exercise: Compute or extract the $R^2$ statistic. What does it tell you about the model?

Exercise: Which variables in the model are significant in predicting house price? Write a story that interprets the coefficients.


In [42]:
# Part 3 Exercise Set V
# Fit a linear regression model using three independent variables
#   1) 'CRIM' (per capita crime rate by town) 
#   2) 'RM' (average number of rooms per dwelling)
#   3) 'PTRATIO' (pupil-teacher ratio by town)
lm = LinearRegression()
lm.fit(X[['CRIM','RM','PTRATIO']], bos.PRICE)


Out[42]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [43]:
# Calculate (or extract) the R2 value.
lm.score(X[['CRIM', 'RM', 'PTRATIO']], bos.PRICE)


Out[43]:
0.59341415511369788

In [44]:
# Compute the F-statistic.
m = ols('PRICE ~ CRIM + RM + PTRATIO',bos).fit()
print(m.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.593
Model:                            OLS   Adj. R-squared:                  0.591
Method:                 Least Squares   F-statistic:                     244.2
Date:                Fri, 10 Mar 2017   Prob (F-statistic):           1.09e-97
Time:                        11:26:23   Log-Likelihood:                -1612.6
No. Observations:                 506   AIC:                             3233.
Df Residuals:                     502   BIC:                             3250.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -3.3066      4.038     -0.819      0.413       -11.240     4.627
CRIM          -0.2021      0.032     -6.301      0.000        -0.265    -0.139
RM             7.3816      0.402     18.360      0.000         6.592     8.171
PTRATIO       -1.0742      0.133     -8.081      0.000        -1.335    -0.813
==============================================================================
Omnibus:                      233.249   Durbin-Watson:                   0.828
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2002.374
Skew:                           1.802   Prob(JB):                         0.00
Kurtosis:                      12.054   Cond. No.                         311.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Part 4: Comparing Models

During modeling, there will be times when we want to compare models to see which one is more predictive or fits the data better. There are many ways to compare models, but we will focus on two.

Part 4 Checkup Exercises

Exercise: Find another variable (or two) to add to the model we built in Part 3. Compute the $F$-test comparing the two models as well as the AIC. Which model is better?


In [45]:
# Part 4
# Find another variable (or two) to add to the model we built in Part 3. 
# Compute the F-test comparing the two models as well as the AIC. Which model is better?
m = ols('PRICE ~ CRIM + RM + PTRATIO + NOX + TAX',bos).fit()
print(m.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.616
Model:                            OLS   Adj. R-squared:                  0.612
Method:                 Least Squares   F-statistic:                     160.2
Date:                Fri, 10 Mar 2017   Prob (F-statistic):          2.22e-101
Time:                        11:26:24   Log-Likelihood:                -1598.3
No. Observations:                 506   AIC:                             3209.
Df Residuals:                     500   BIC:                             3234.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.8739      4.522      1.299      0.195        -3.011    14.759
CRIM          -0.1254      0.037     -3.435      0.001        -0.197    -0.054
RM             6.9133      0.402     17.186      0.000         6.123     7.704
PTRATIO       -1.0277      0.141     -7.275      0.000        -1.305    -0.750
NOX          -11.8933      3.077     -3.866      0.000       -17.938    -5.849
TAX           -0.0019      0.002     -0.763      0.446        -0.007     0.003
==============================================================================
Omnibus:                      293.231   Durbin-Watson:                   0.799
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3217.757
Skew:                           2.327   Prob(JB):                         0.00
Kurtosis:                      14.444   Cond. No.                     8.47e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.47e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
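The exercise asks for an explicit comparison, which the summary above doesn't perform by itself. A hedged sketch of the nested-model F-test (statsmodels' anova_lm) and the AIC comparison, refitting both models so the snippet is self-contained; the names m_base and m_ext are illustrative:

from statsmodels.stats.anova import anova_lm

m_base = ols('PRICE ~ CRIM + RM + PTRATIO', bos).fit()              # Part 3 model
m_ext = ols('PRICE ~ CRIM + RM + PTRATIO + NOX + TAX', bos).fit()   # extended model

# Nested-model F-test: a small p-value means the extra terms improve the fit
print(anova_lm(m_base, m_ext))

# AIC: lower is better (the summaries above show 3233 vs. 3209)
print('AIC base: {:.1f}   AIC extended: {:.1f}'.format(m_base.aic, m_ext.aic))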

In [46]:
# Part 5 Exercise 1:
# Create a scatter plot of fitted values versus residuals
plt.scatter(m.fittedvalues, m.resid)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")


Out[46]:
<matplotlib.text.Text at 0x22f98592dd8>

In [47]:
# Part 5 Exercise 2:
# Construct a quantile plot of the residuals.
from scipy import stats
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
res = stats.probplot(m.resid, plot=ax)   # normal Q-Q plot of the model residuals
ax.set_title("Normal Q-Q plot of the residuals")
plt.show()



In [48]:
# Part 5 Exercise 3:
#     What are some advantages and disadvantages of the fitted vs. 
#     residual and quantile plot compared to each other?
# Answer: The fitted vs. residual plot is the most commonly used
#         residual-analysis plot. Its advantages are that it can
#         reveal non-linearity, unequal error variances
#         (heteroscedasticity), and outliers. The quantile (Q-Q)
#         plot, on the other hand, is better suited to checking
#         whether the residuals are approximately normally
#         distributed, but it says little about non-linearity
#         or heteroscedasticity.
#         unequal error variances and outliers.