# Regression in Python

This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard's CS109 course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.

• Linear Regression Models
• Prediction using linear regression
• Some re-sampling methods
• Train-Test splits
• Cross Validation

Linear regression is used to model and predict continuous outcomes while logistic regression is used to model binary outcomes. We'll see some examples of linear regression as well as Train-test splits.

The packages we'll cover are: statsmodels, seaborn, and scikit-learn. While we don't explicitly teach statsmodels and seaborn in the Springboard workshop, those are great libraries to know.



In :

# special IPython command to prepare the notebook for matplotlib and other libraries
%pylab inline

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn

import seaborn as sns

# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")




Populating the interactive namespace from numpy and matplotlib



# Part 1: Linear Regression

### Purpose of linear regression

Given a dataset $X$ and $Y$, linear regression can be used to:

• Build a predictive model to predict future values of $Y$ from new values of the $X_i$.
• Model the strength of the relationship between each independent variable $X_i$ and $Y$
• Sometimes not all $X_i$ will have a relationship with $Y$
• Need to figure out which $X_i$ contributes most information to determine $Y$
• Linear regression is used in so many applications that we won't list examples here. It is, in many cases, the first-pass prediction algorithm for continuous outcomes.

### A brief recap (feel free to skip if you don't care about the math)

Linear Regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, or predictors) and a dependent variable $Y$. This method assumes that the dependent variable $Y$ is linearly related to each predictor $X$.

$$Y = \beta_0 + \beta_1 X + \epsilon$$

where $\epsilon$ is an unobservable random variable that adds noise to the linear relationship. This is the simplest form of linear regression (one variable); we'll call this the simple model.

• $\beta_0$ is the intercept of the linear model

• Multiple linear regression is when you have more than one independent variable

• $X_1$, $X_2$, $X_3$, $\ldots$
$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$$

• Back to the simple model. The model in linear regression is that the conditional mean of $Y$ given the values in $X$ is expressed as a linear function.
$$y = f(x) = E(Y | X = x)$$

• The goal is to estimate the coefficients (e.g. $\beta_0$ and $\beta_1$). We represent the estimates of the coefficients with a "hat" on top of the letter.
$$\hat{\beta}_0, \hat{\beta}_1$$
• Once you estimate the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, you can use these to predict new values of $Y$
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$$
• How do you estimate the coefficients?
• There are many ways to fit a linear regression model
• The method called least squares is one of the most common methods
• We will discuss least squares today

#### Estimating $\hat\beta$: Least squares

Least squares is a method that estimates the coefficients of a linear model by minimizing the sum of squared residuals:

$$S = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2$$

where $N$ is the number of observations.

• We will not go into the mathematical details, but the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\beta_0 + \beta_1 x_i)$ in the model (i.e. makes the difference between the observed $y_i$ and linear model $\beta_0 + \beta_1 x_i$ as small as possible).

The solution can be written in compact matrix notation as

$$\hat\beta = (X^T X)^{-1}X^T Y$$

We wanted to show you this in case you remember linear algebra. For this solution to exist we need $X^T X$ to be invertible, which requires $X$ to be full rank (along with a few other assumptions). This is important for us because it means that having redundant features in our regression models will lead to poorly fitting (and unstable) models. We'll see an implementation of this in the extra linear regression example.

Note: The "hat" means it is an estimate of the coefficient.
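As a quick illustration (not part of the lab itself), here's a minimal NumPy sketch of the normal-equation solution above on synthetic data; the toy coefficients and variable names are our own, chosen to mirror the math.

```python
import numpy as np

# Synthetic one-variable data with known true coefficients
# (beta0 = 2, beta1 = 3) plus Gaussian noise.
rng = np.random.RandomState(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then x
X = np.column_stack([np.ones(n), x])

# Normal-equation estimate beta_hat = (X^T X)^{-1} X^T Y
# (fine for a demo; lstsq below is the numerically preferred route)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq solves the same least-squares problem more stably
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # close to [2., 3.]
```

With a well-conditioned design matrix both routes agree; with redundant (collinear) features, $X^T X$ becomes singular and the normal-equation inverse blows up, which is exactly the instability mentioned above.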

# Part 2: Boston Housing Data Set

The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.

## Load the Boston Housing data set from sklearn

This data set is available in the sklearn python module which is how we will access it today.



In :

from sklearn.datasets import load_boston

boston = load_boston()

In :

boston.keys()




Out:

dict_keys(['target', 'DESCR', 'data', 'feature_names'])




In :

boston.data.shape




Out:

(506, 13)




In :

# Print column names
print(boston.feature_names)




['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']




In :

# Print description of Boston housing data set
print(boston.DESCR)




Boston House Prices dataset

Notes
------
Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)



Now let's explore the data set itself.



In :

bos = pd.DataFrame(boston.data)
bos.head()




Out:

         0     1     2    3      4      5     6       7    8      9     10      11    12
0  0.00632  18.0  2.31  0.0  0.538  6.575  65.2  4.0900  1.0  296.0  15.3  396.90  4.98
1  0.02731   0.0  7.07  0.0  0.469  6.421  78.9  4.9671  2.0  242.0  17.8  396.90  9.14
2  0.02729   0.0  7.07  0.0  0.469  7.185  61.1  4.9671  2.0  242.0  17.8  392.83  4.03
3  0.03237   0.0  2.18  0.0  0.458  6.998  45.8  6.0622  3.0  222.0  18.7  394.63  2.94
4  0.06905   0.0  2.18  0.0  0.458  7.147  54.2  6.0622  3.0  222.0  18.7  396.90  5.33



There are no column names in the DataFrame. Let's add those.



In :

bos.columns = boston.feature_names
bos.head()




Out:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33



Now we have a pandas DataFrame called bos containing all the data we want to use to predict Boston Housing prices. Let's create a variable called PRICE which will contain the prices. This information is contained in the target data.



In :

print(boston.target.shape)




(506,)




In :

bos['PRICE'] = boston.target
bos.head()




Out:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2



## EDA and Summary Statistics

Let's explore this data set. First we use describe() to get basic summary statistics for each of the columns.



In :

bos.describe()




Out:

             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT       PRICE
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.647423   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000



### Scatter plots

Let's look at some scatter plots for three variables: 'CRIM', 'RM' and 'PTRATIO'.

What kind of relationship do you see? e.g. positive, negative? linear? non-linear?



In :

plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")




Out:

<matplotlib.text.Text at 0x7f81cc9cdda0>



The housing price seems to have a negative correlation with crime rate. There is roughly a negative exponential relationship, where initial increases in crime rate greatly decrease housing price, but later increases have a less notable effect.

Your turn: Create scatter plots between RM and PRICE, and PTRATIO and PRICE. What do you notice?



In :

#your turn: scatter plot between *RM* and *PRICE*
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average number of rooms per dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between Rooms and Price")




Out:

<matplotlib.text.Text at 0x7f81c9cff160>



There are some outliers, but there is a very clear positive linear relationship between the average number of rooms and the housing price.



In :

#your turn: scatter plot between *PTRATIO* and *PRICE*
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("pupil-teacher ratio by town (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between Pupil-Teacher ratio and Price")




Out:

<matplotlib.text.Text at 0x7f81ca1379e8>



Here the relationship does not appear as clear-cut. While there may be a weak negative correlation between pupil-teacher ratio and housing price, the variance in the data is quite large, so we are less confident in this column as a predictor of housing price.

Your turn: What are some other numeric variables of interest? Plot scatter plots with these variables and PRICE.



In :

#your turn: create some other scatter plots
plt.scatter(bos.NOX, bos.PRICE)
plt.xlabel("Nitric Oxides concentration (parts per 10 million)")
plt.ylabel("Housing Price")
plt.title("Relationship between NOX concentration and Price")




Out:

<matplotlib.text.Text at 0x7f81c33bc860>



Interestingly, air quality seems to affect housing price: there is a negative correlation between NOX concentration and housing price. The trend looks linear for the bulk of the data, but it may be non-linear as well. A comparison of two or more types of fits would be useful here.



In :

#your turn: create some other scatter plots
plt.scatter(bos.DIS, bos.PRICE)
plt.xlabel("weighted distances to five Boston employment centre (DIS)")
plt.ylabel("Housing Price")
plt.title("Relationship between Distance to Employment Centres and Price")




Out:

<matplotlib.text.Text at 0x7f81c9fd66d8>



Not surprisingly, distance to employment centers strongly affects house prices for "inner-city" towns, which have the lowest housing prices. The relationship seems to be a positive non-linear one: once we move a bit further out, house prices are about equally likely to be high or low, and past roughly 6 weighted distance units we no longer see as many high-priced outliers. This matches the anecdotal experience of many people.

### Scatter Plots using Seaborn

Seaborn is a cool Python plotting library built on top of matplotlib. It provides convenient syntax and shortcuts for many common types of plots, along with better-looking defaults.

We can also use seaborn regplot for the scatterplot above. This provides automatic linear regression fits (useful for data exploration later on). Here's one example below.



In :

sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)




Out:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81c9f89748>



### Histograms



In :

plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequency")
plt.show()






• Plot histograms for RM and PTRATIO, along with the two variables you picked in the previous section.


In :

rm = sns.distplot(bos.RM, kde=False)
rm.axes.set_title('RM')
rm.set_xlabel('Average number of rooms')
rm.set_ylabel('Frequency')




Out:

<matplotlib.text.Text at 0x7f81c80ae1d0>




In :

rm = sns.distplot(bos.PTRATIO, kde=False)
rm.axes.set_title('PTRATIO')
rm.set_xlabel('Pupil-Teacher Ratio by Town')
rm.set_ylabel('Frequency')




Out:

<matplotlib.text.Text at 0x7f81c9ed81d0>




In :

rm = sns.distplot(bos.NOX, kde=False)
rm.axes.set_title('NOX')
rm.set_xlabel('Nitric Oxides concentration (parts per 10 million)')
rm.set_ylabel('Frequency')




Out:

<matplotlib.text.Text at 0x7f81c330e7b8>




In :

rm = sns.distplot(bos.DIS, kde=False)
rm.axes.set_title('DIS')
rm.set_xlabel('Weighted distances to five Boston employment centre (DIS)')
rm.set_ylabel('Frequency')




Out:

<matplotlib.text.Text at 0x7f81c3255ba8>



## Linear regression with Boston housing data example

Here,

$Y$ = boston housing prices (also called "target" data in python)

and

$X$ = all the other features (or independent variables)

which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method as the way to estimate the coefficients.

We'll use two ways of fitting a linear regression. We recommend the first, but the second is also powerful in its features.

### Fitting Linear Regression using statsmodels

Statsmodels is a great Python library for a lot of basic and inferential statistics. It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians. While we don't cover statsmodels officially in the Data Science Intensive, it's a good library to have in your toolbox. Here's a quick example of what you could do with it.



In :

# Import regression modules
# ols - stands for Ordinary least squares, we'll use this
import statsmodels.api as sm
from statsmodels.formula.api import ols




In :

# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM',bos).fit()
print(m.summary())




OLS Regression Results
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.484
Method:                 Least Squares   F-statistic:                     471.8
Date:                Tue, 01 Nov 2016   Prob (F-statistic):           2.49e-74
Time:                        21:52:54   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -34.6706      2.650    -13.084      0.000       -39.877   -29.465
RM             9.1021      0.419     21.722      0.000         8.279     9.925
==============================================================================
Omnibus:                      102.585   Durbin-Watson:                   0.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              612.449
Skew:                           0.726   Prob(JB):                    1.02e-133
Kurtosis:                       8.190   Cond. No.                         58.4
==============================================================================

Warnings:
 Standard Errors assume that the covariance matrix of the errors is correctly specified.



There is a ton of information in this output, but we'll concentrate on the coefficient table (the middle table). We can interpret the RM coefficient (9.1021) by first noticing that the p-value (under P>|t|) is essentially zero. We can interpret the coefficient as follows: if we compare two groups of towns, one where the average number of rooms is say $5$ and another that is identical except that the average is $6$ rooms, then the average difference in house prices between the two groups is about $9.1$ (in thousands), so about a $\$9,100$ difference. The confidence interval gives us a range of plausible values for this difference, about ($\$8,279$, $\$9,925$), definitely not chump change.

#### statsmodels formulas

This formula notation will seem familiar to R users, but it can take some getting used to for people coming from other languages or new to statistics.

The formula gives the general structure of a regression call. For statsmodels (ols or logit) calls, you need a pandas DataFrame whose column names appear in your formula. In the example below you need a DataFrame that includes columns named Outcome, X1, X2, and so on, but you don't need to build a new DataFrame for every regression; use the same DataFrame with all of these columns in it. The structure is very simple:

Outcome ~ X1

But of course we want to be able to handle more complex models; for example, multiple regression is done like this:

Outcome ~ X1 + X2 + X3

This is the very basic structure, but it should be enough to get you through the homework. Things can get much more complex; for a quick run-down of further uses see the statsmodels help page.

Let's see how our model actually fit our data. We can see below that there is a ceiling effect, which we should probably look into. Also, for large values of $Y$ we get underpredictions; most predictions fall below the 45-degree line.
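To make the formula notation concrete, here's a small sketch (our own toy data and column names, not the lab's) of what a formula like Outcome ~ X1 + X2 corresponds to under the hood: a design matrix with an intercept column of ones plus the named predictor columns, solved by least squares.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with columns matching 'Outcome ~ X1 + X2';
# the true coefficients are intercept 1, X1 slope 2, X2 slope -3.
rng = np.random.RandomState(1)
df = pd.DataFrame({
    'X1': rng.normal(size=200),
    'X2': rng.normal(size=200),
})
df['Outcome'] = (1.0 + 2.0 * df['X1'] - 3.0 * df['X2']
                 + rng.normal(scale=0.1, size=200))

# The formula expands to this design matrix: an intercept column
# of ones, then the named predictors, in order.
X = np.column_stack([np.ones(len(df)), df['X1'], df['X2']])
beta, *_ = np.linalg.lstsq(X, df['Outcome'], rcond=None)

print(beta.round(2))  # close to [1., 2., -3.]
```

This is the same least-squares machinery statsmodels runs for you; the formula string just saves you from building the design matrix by hand.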
Your turn: Create a scatter plot between the predicted prices, available in m.fittedvalues, and the original prices. How does the plot look?



In :

# your turn
scatter2 = sns.regplot(x=bos.PRICE, y=m.fittedvalues, fit_reg=False)
scatter2.set_title('Comparison of real prices and predicted prices')
scatter2.set_ylabel('Predicted Prices from a #Rooms OLS regression')




Out:

<matplotlib.text.Text at 0x7f81c141dd68>



### Fitting Linear Regression using sklearn



In :

from sklearn.linear_model import LinearRegression

X = bos.drop('PRICE', axis = 1)

# This creates a LinearRegression object
lm = LinearRegression()
lm




Out:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



#### What can you do with a LinearRegression object?

Check out the scikit-learn docs here. We have listed the main functions below.

| Main functions | Description |
| --- | --- |
| lm.fit() | Fit a linear model |
| lm.predict() | Predict Y using the linear model with estimated coefficients |
| lm.score() | Returns the coefficient of determination (R^2), a measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model |

#### What output can you get?



In :

# Look inside lm object
# lm.<tab>

| Output | Description |
| --- | --- |
| lm.coef_ | Estimated coefficients |
| lm.intercept_ | Estimated intercept |

### Fit a linear model

The lm.fit() function estimates the coefficients of the linear regression using least squares.



In :

# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)




Out:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



Your turn: How would you change the model to not fit an intercept term? Would you recommend not having an intercept?



In :

lm2 = LinearRegression(fit_intercept=False)
lm2.fit(X, bos.PRICE)




Out:

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)



In this case, it doesn't make sense to force a zero intercept. We didn't do anything to center the data.
Had we subtracted the mean from our continuous variables (or performed some other centering method), then it would make sense to force a zero intercept.

### Estimated intercept and coefficients

Let's look at the estimated coefficients from the linear model using lm.intercept_ and lm.coef_.

After we have fit our linear regression model using the least squares method, we want to see the estimates of our coefficients $\beta_0$, $\beta_1$, ..., $\beta_{13}$:

$$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{13}$$



In :

print('Estimated intercept coefficient:', lm.intercept_)




Estimated intercept coefficient: 36.4911032804



In :

print('Number of coefficients:', len(lm.coef_))




Number of coefficients: 13



In :

# The coefficients
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns = ['features', 'estimatedCoefficients'])




Out:

   features  estimatedCoefficients
0      CRIM              -0.107171
1        ZN               0.046395
2     INDUS               0.020860
3      CHAS               2.688561
4       NOX             -17.795759
5        RM               3.804752
6       AGE               0.000751
7       DIS              -1.475759
8       RAD               0.305655
9       TAX              -0.012329
10  PTRATIO              -0.953464
11        B               0.009393
12    LSTAT              -0.525467



### Predict Prices

We can calculate the predicted prices ($\hat{Y}_i$) using lm.predict.

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \ldots + \hat{\beta}_{13} X_{13}$$


In :

# first five predicted prices
lm.predict(X)[0:5]




Out:

array([ 30.00821269,  25.0298606 ,  30.5702317 ,  28.60814055,  27.94288232])



• Histogram: Plot a histogram of all the predicted prices
• Scatter Plot: Let's plot the true prices compared to the predicted prices to see where they disagree (we did this with statsmodels before).


In :

predicted_prices = lm.predict(X)
prices_hist = sns.distplot(predicted_prices)







In :

prices_scatter = sns.regplot(x=bos.PRICE, y=predicted_prices, fit_reg =False)
prices_scatter.set_title('Prices versus predicted prices from 13 variable regression')
prices_scatter.set_ylabel('Predicted price')




Out:

<matplotlib.text.Text at 0x7f81ab6be358>
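The agreement seen in the scatter plot above can be quantified with lm.score(), the coefficient of determination (R^2) listed in the functions table earlier. Here's a small self-contained sketch on synthetic data (since we're outside the notebook session) showing that score() matches the 1 - SS_residual / SS_total definition; in the notebook itself you would call lm.score(X, bos.PRICE).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic multi-variable data with known coefficients plus noise
rng = np.random.RandomState(2)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

lm = LinearRegression().fit(X, y)
r2 = lm.score(X, y)

# The same quantity by hand: 1 - SS_residual / SS_total
resid = y - lm.predict(X)
r2_manual = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(round(r2, 3), round(r2_manual, 3))  # the two values agree
```

Note that R^2 computed on the training data is optimistic; train-test splits (mentioned in the outline at the top) give a more honest estimate of predictive performance.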