Regression Week 2: Multiple Regression (Interpretation)

In this notebook, we will use data on house sales in King County, Seatle to predict prices using multiple regression. The goal of this notebook is to explore multiple regression and feature engineering. You will:

Use DataFrames to do some feature engineering
Use sklearn to compute the regression weights (coefficients/parameters)
Given the regression weights, predictors and outcome write a function to compute the Residual Sum of Squares (RSS)
Look at coefficients and interpret their meanings
Evaluate multiple models via RSS

Importing Libraries



In [55]:

    
import os
import zipfile
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Unzipping files with house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.



In [56]:

    
# Put files in current direction into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]



In [57]:

    
# Filenames of unzipped files
unzip_files = ['kc_house_train_data.csv','kc_house_test_data.csv', 'kc_house_data.csv']



In [58]:

    
# If upzipped file not in files_list, unzip the file
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        unzipping = zipfile.ZipFile(zip_file)
        unzipping.extractall()
        unzipping.close

Loading Sales data, Sales Training data, and Sales Test data



In [59]:

    
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 
              'sqft_living15':float, 'grade':int, 'yr_renovated':int, 
              'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 
              'floors':str, 'condition':int, 'lat':float, 'date':str, 
              'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}



In [60]:

    
# Loading sales data, sales training data, and test_data into DataFrames
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype = dtype_dict)



In [61]:

    
# Looking at head of training data DataFrame
train_data.head()









    Out[61]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      grade
      sqft_above
      sqft_basement
      yr_built
      yr_renovated
      zipcode
      lat
      long
      sqft_living15
      sqft_lot15
    
  
  
    
      0
      7129300520
      20141013T000000
      221900
      3
      1.00
      1180
      5650
      1
      0
      0
      ...
      7
      1180
      0
      1955
      0
      98178
      47.5112
      -122.257
      1340
      5650
    
    
      1
      6414100192
      20141209T000000
      538000
      3
      2.25
      2570
      7242
      2
      0
      0
      ...
      7
      2170
      400
      1951
      1991
      98125
      47.7210
      -122.319
      1690
      7639
    
    
      2
      5631500400
      20150225T000000
      180000
      2
      1.00
      770
      10000
      1
      0
      0
      ...
      6
      770
      0
      1933
      0
      98028
      47.7379
      -122.233
      2720
      8062
    
    
      3
      2487200875
      20141209T000000
      604000
      4
      3.00
      1960
      5000
      1
      0
      0
      ...
      7
      1050
      910
      1965
      0
      98136
      47.5208
      -122.393
      1360
      5000
    
    
      4
      1954400510
      20150218T000000
      510000
      3
      2.00
      1680
      8080
      1
      0
      0
      ...
      8
      1680
      0
      1987
      0
      98074
      47.6168
      -122.045
      1800
      7503
    
  

5 rows × 21 columns

Learning a multiple regression model

Now, learn a multiple regression model predicting 'price' based on the following features: example_features = ['sqft_living', 'bedrooms', 'bathrooms'] on training data. First, let's plot the data for these features.



In [62]:

    
plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.')
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.show()



In [63]:

    
plt.figure(figsize=(12,8))

plt.subplot(1, 2, 1)
plt.plot(train_data['bedrooms'], train_data['price'],'.')
plt.xlabel('# Bedrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)

plt.subplot(1, 2, 2)
plt.plot(train_data['bathrooms'], train_data['price'],'.')
plt.xlabel('# Bathrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)

plt.show()

Now, creating a list of the features we are interested in, the feature matrix, and the output vector.



In [64]:

    
example_features = ['sqft_living', 'bedrooms', 'bathrooms']
X_multi_lin_reg = train_data[example_features]
y_multi_lin_reg = train_data['price']

Creating a Linear Regression Object for Sklearn library and using the feature matrix and output vector to perform linear regression.



In [65]:

    
example_model = LinearRegression()
example_model.fit(X_multi_lin_reg, y_multi_lin_reg)









    Out[65]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fitted the model we can extract the regression weights (coefficients):



In [66]:

    
# printing the intercept and coefficients
print example_model.intercept_
print example_model.coef_









    



87912.865815
[   315.40669062 -65081.88711588   6942.16598637]



In [67]:

    
# Putting the intercept and weights from the multiple linear regression into a Series
example_weight_summary = pd.Series( [example_model.intercept_] + list(example_model.coef_),
                                  index = ['intercept'] + example_features )
print example_weight_summary









    



intercept      87912.865815
sqft_living      315.406691
bedrooms      -65081.887116
bathrooms       6942.165986
dtype: float64

Making Predictions

Recall that once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above:



In [68]:

    
example_predictions = example_model.predict(X_multi_lin_reg)
print example_predictions[0] # should be close to 271789.505878









    



271789.26538

Compute RSS

Now that we can make predictions given the model, let's write a function to compute the RSS of the model.



In [69]:

    
def get_residual_sum_of_squares(model, data, outcome):
    # - data holds the data points with the features (columns) we are interested in performing a linear regression fit
    # - model holds the linear regression model obtained from fitting to the data
    # - outcome is the y, the observed house price for each data point
    
    # By using the model and applying predict on the data, we return a numpy array which holds
    # the predicted outcome (house price) from the linear regression model
    model_predictions = model.predict(data)

    # Computing the residuals between the predicted house price and the actual house price for each data point
    residuals = outcome - model_predictions

    # To get RSS, square the residuals and add them up
    RSS = sum(residuals*residuals)

    return(RSS)

Create some new features

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms), we can also consider transformations of existing features e.g. the log of the squarefeet or even "interaction" features such as the product of bedrooms and bathrooms.

Create the following 4 new features as column in both TEST and TRAIN data:

bedrooms_squared = bedrooms*bedrooms
bed_bath_rooms = bedrooms*bathrooms
log_sqft_living = log(sqft_living)
lat_plus_long = lat + long



In [70]:

    
# Creating new 'bedrooms_squared' feature
train_data['bedrooms_squared'] = train_data['bedrooms']*train_data['bedrooms']
test_data['bedrooms_squared'] = test_data['bedrooms']*test_data['bedrooms']
# Creating new 'bed_bath_rooms' feature
train_data['bed_bath_rooms'] = train_data['bedrooms']*train_data['bathrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms']*test_data['bathrooms']
# Creating new 'log_sqft_living' feature
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
# Creating new 'lat_plus_long' feature
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']



In [71]:

    
# Displaying head of train_data DataFrame and test_data DataFrame to verify that new features are present
train_data.head()
test_data.head()









    Out[71]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      yr_renovated
      zipcode
      lat
      long
      sqft_living15
      sqft_lot15
      bedrooms_squared
      bed_bath_rooms
      log_sqft_living
      lat_plus_long
    
  
  
    
      0
      0114101516
      20140528T000000
      310000
      3
      1.0
      1430
      19901
      1.5
      0
      0
      ...
      0
      98028
      47.7558
      -122.229
      1780
      12697
      9
      3.0
      7.265430
      -74.4732
    
    
      1
      9297300055
      20150124T000000
      650000
      4
      3.0
      2950
      5000
      2
      0
      3
      ...
      0
      98126
      47.5714
      -122.375
      2140
      4000
      16
      12.0
      7.989560
      -74.8036
    
    
      2
      1202000200
      20141103T000000
      233000
      3
      2.0
      1710
      4697
      1.5
      0
      0
      ...
      0
      98002
      47.3048
      -122.218
      1030
      4705
      9
      6.0
      7.444249
      -74.9132
    
    
      3
      8562750320
      20141110T000000
      580500
      3
      2.5
      2320
      3980
      2
      0
      0
      ...
      0
      98027
      47.5391
      -122.070
      2580
      3980
      9
      7.5
      7.749322
      -74.5309
    
    
      4
      7589200193
      20141110T000000
      535000
      3
      1.0
      1090
      3000
      1.5
      0
      0
      ...
      0
      98117
      47.6889
      -122.375
      1570
      5080
      9
      3.0
      6.993933
      -74.6861
    
  

5 rows × 25 columns

Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
bedrooms times bathrooms gives what's called an "interaction" feature. It is large when both of them are large.
Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.
Adding latitude to longitude is totally non-sensical but we will do it anyway (you'll see why)

Quiz Question: What is the mean (arithmetic average) value of your 4 new features on TEST data? (round to 2 digits)



In [72]:

    
print "Mean of Test data 'bedrooms_squared' feature: %.2f " % np.mean(test_data['bedrooms_squared'].values)
print "Mean of Test data 'bed_bath_rooms' feature: %.2f " % np.mean(test_data['bed_bath_rooms'].values)
print "Mean of Test data 'log_sqft_living' feature: %.2f " % np.mean(test_data['log_sqft_living'].values)
print "Mean of Test data 'lat_plus_long' feature: %.2f " % np.mean(test_data['lat_plus_long'].values)









    



Mean of Test data 'bedrooms_squared' feature: 12.45 
Mean of Test data 'bed_bath_rooms' feature: 7.50 
Mean of Test data 'log_sqft_living' feature: 7.55 
Mean of Test data 'lat_plus_long' feature: -74.65

Learning Multiple Models

Now we will learn the weights for three (nested) models for predicting house prices. The first model will have the fewest features the second model will add one more feature and the third will add a few more:

Model 1: squarefeet, # bedrooms, # bathrooms, latitude & longitude
Model 2: add bedrooms*bathrooms
Model 3: Add log squarefeet, bedrooms squared, and the (nonsensical) latitude + longitude



In [73]:

    
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

Now that you have the features, learn the weights for the three different models for predicting target = 'price' and look at the value of the weights/coefficients:



In [74]:

    
# Creating a LinearRegression Object for Model 1 and learning the multiple linear regression model
model_1 = LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
# Creating a LinearRegression Object for Model 2 and learning the multiple linear regression model
model_2 = LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
# Creating a LinearRegression Object for Model 3 and learning the multiple linear regression model
model_3 = LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])









    Out[74]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now, Examine/extract each model's coefficients:



In [75]:

    
# Putting the Model 1 intercept and weights from the multiple linear regression for the 3 models into a Series
model_1_summary = pd.Series( [model_1.intercept_] + list(model_1.coef_),
                                  index = ['intercept'] + model_1_features , name='Model 1 Coefficients' )
print model_1_summary









    



intercept     -69075726.792570
sqft_living         312.258646
bedrooms         -59586.533154
bathrooms         15706.742083
lat              658619.263931
long            -309374.351268
Name: Model 1 Coefficients, dtype: float64



In [76]:

    
# Putting the Model 2 intercept and weights from the multiple linear regression for the 3 models into a Series
model_2_summary = pd.Series( [model_2.intercept_] + list(model_2.coef_),
                                  index = ['intercept'] + model_2_features , name='Model 2 Coefficients' )
print model_2_summary









    



intercept        -66867968.871079
sqft_living            306.610053
bedrooms           -113446.368070
bathrooms           -71461.308293
lat                 654844.629503
long               -294298.969138
bed_bath_rooms       25579.652001
Name: Model 2 Coefficients, dtype: float64



In [77]:

    
# Putting the Model 3 intercept and weights from the multiple linear regression for the 3 models into a Series
model_3_summary = pd.Series( [model_3.intercept_] + list(model_3.coef_),
                                  index = ['intercept'] + model_3_features , name='Model 3 Coefficients' )
print model_3_summary









    



intercept          -62036084.986098
sqft_living              529.422820
bedrooms               34514.229578
bathrooms              67060.781319
lat                   534085.610867
long                 -406750.710861
bed_bath_rooms         -8570.504395
bedrooms_squared       -6788.586670
log_sqft_living      -561831.484076
lat_plus_long         127334.900006
Name: Model 3 Coefficients, dtype: float64

Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 1?



In [78]:

    
print "Positive: ", model_1_summary['bathrooms']









    



Positive:  15706.7420827

Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 2?



In [79]:

    
print "Negative: ", model_2_summary['bathrooms']









    



Negative:  -71461.3082928

Think about what this means:

In model 2, the new 'bed_bath_rooms' feature causes the house price to be over-estimated. Thus, the 'bathrooms' turns to negative to better agree with the observed values for the house prices.

Comparing multiple models

Now that you've learned three models and extracted the model weights we want to evaluate which model is best.

First use your functions from earlier to compute the RSS on TRAINING Data for each of the three models.



In [80]:

    
# Compute the RSS on TRAINING data for each of the three models and record the values:
rss_model_1_train = get_residual_sum_of_squares(model_1, train_data[model_1_features], train_data['price'])
rss_model_2_train = get_residual_sum_of_squares(model_2, train_data[model_2_features], train_data['price'])
rss_model_3_train = get_residual_sum_of_squares(model_3, train_data[model_3_features], train_data['price'])
print "RSS for Model 1 Training Data: ", rss_model_1_train
print "RSS for Model 2 Training Data: ", rss_model_2_train
print "RSS for Model 3 Training Data: ", rss_model_3_train









    



RSS for Model 1 Training Data:  9.6787996305e+14
RSS for Model 2 Training Data:  9.58419635074e+14
RSS for Model 3 Training Data:  9.0343645505e+14

Quiz Question: Which model (1, 2 or 3) has lowest RSS on TRAINING Data? Is this what you expected?

Model 3 has the lowest RSS on the Training Data. This is expected since Model 3 has the most features.

Now compute the RSS on on TEST data for each of the three models.



In [81]:

    
# Compute the RSS on TESTING data for each of the three models and record the values:
rss_model_1_test = get_residual_sum_of_squares(model_1, test_data[model_1_features], test_data['price'])
rss_model_2_test = get_residual_sum_of_squares(model_2, test_data[model_2_features], test_data['price'])
rss_model_3_test = get_residual_sum_of_squares(model_3, test_data[model_3_features], test_data['price'])
print "RSS for Model 1 Test Data: ", rss_model_1_test
print "RSS for Model 2 Test Data: ", rss_model_2_test
print "RSS for Model 3 T Data: ", rss_model_3_test









    



RSS for Model 1 Test Data:  2.25500469795e+14
RSS for Model 2 Test Data:  2.23377462976e+14
RSS for Model 3 T Data:  2.59236319207e+14

Quiz Question: Which model (1, 2 or 3) has lowest RSS on TESTING Data? Is this what you expected?Think about the features that were added to each model from the previous.

Model 2 has the lowest RSS on the Test Data.



In [ ]:

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15
0	7129300520	20141013T000000	221900	3	1.00	1180	5650	1	...	7	1180	0	1955	0	98178	47.5112	-122.257	1340	5650
1	6414100192	20141209T000000	538000	3	2.25	2570	7242	2	...	7	2170	400	1951	1991	98125	47.7210	-122.319	1690	7639
2	5631500400	20150225T000000	180000	2	1.00	770	10000	1	...	6	770	0	1933	0	98028	47.7379	-122.233	2720	8062
3	2487200875	20141209T000000	604000	4	3.00	1960	5000	1	...	7	1050	910	1965	0	98136	47.5208	-122.393	1360	5000
4	1954400510	20150218T000000	510000	3	2.00	1680	8080	1	...	8	1680	0	1987	0	98074	47.6168	-122.045	1800	7503

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	...	zipcode	lat	long	sqft_living15	sqft_lot15	bedrooms_squared	bed_bath_rooms	log_sqft_living	lat_plus_long
0	0114101516	20140528T000000	310000	3	1.0	1430	19901	1.5	0	...	98028	47.7558	-122.229	1780	12697	9	3.0	7.265430	-74.4732
1	9297300055	20150124T000000	650000	4	3.0	2950	5000	2	3	...	98126	47.5714	-122.375	2140	4000	16	12.0	7.989560	-74.8036
2	1202000200	20141103T000000	233000	3	2.0	1710	4697	1.5	0	...	98002	47.3048	-122.218	1030	4705	9	6.0	7.444249	-74.9132
3	8562750320	20141110T000000	580500	3	2.5	2320	3980	2	0	...	98027	47.5391	-122.070	2580	3980	9	7.5	7.749322	-74.5309
4	7589200193	20141110T000000	535000	3	1.0	1090	3000	1.5	0	...	98117	47.6889	-122.375	1570	5080	9	3.0	6.993933	-74.6861