Regression Week 2: Multiple Regression (Interpretation)

In this notebook, we will use data on house sales in King County, Washington (the region containing Seattle) to predict prices using multiple regression. The goal of this notebook is to explore multiple regression and feature engineering. You will:

  • Use DataFrames to do some feature engineering
  • Use sklearn to compute the regression weights (coefficients/parameters)
  • Given the regression weights, predictors, and outcome, write a function to compute the Residual Sum of Squares (RSS)
  • Look at coefficients and interpret their meanings
  • Evaluate multiple models via RSS

Importing Libraries


In [55]:
import os
import zipfile
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Unzipping files with house sales data

The dataset consists of house sales in King County, the region where the city of Seattle, WA is located.


In [56]:
# Put files in the current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]

In [57]:
# Filenames of unzipped files
unzip_files = ['kc_house_train_data.csv','kc_house_test_data.csv', 'kc_house_data.csv']

In [58]:
# If an unzipped file is not in files_list, unzip the corresponding archive
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        unzipping = zipfile.ZipFile(zip_file)
        unzipping.extractall()
        unzipping.close()
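
For reference, the same loop can be written with a with-statement, which closes the archive automatically even if extraction fails; a minimal sketch of the same logic:

# Equivalent unzip loop using a context manager (closes the archive automatically)
for filename in unzip_files:
    if filename not in files_list:
        with zipfile.ZipFile(filename + '.zip') as archive:
            archive.extractall()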

Loading Sales data, Sales Training data, and Sales Test data


In [59]:
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 
              'sqft_living15':float, 'grade':int, 'yr_renovated':int, 
              'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 
              'floors':str, 'condition':int, 'lat':float, 'date':str, 
              'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [60]:
# Loading sales data, sales training data, and test_data into DataFrames
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype = dtype_dict)

In [61]:
# Looking at head of training data DataFrame
train_data.head()


Out[61]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900 3 1.00 1180 5650 1 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000 3 2.25 2570 7242 2 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000 2 1.00 770 10000 1 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000 4 3.00 1960 5000 1 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000 3 2.00 1680 8080 1 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

Learning a multiple regression model

Now, let's learn a multiple regression model that predicts 'price' from the features example_features = ['sqft_living', 'bedrooms', 'bathrooms'] on the training data. First, let's plot the data for these features.


In [62]:
plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.')
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.show()



In [63]:
plt.figure(figsize=(12,8))

plt.subplot(1, 2, 1)
plt.plot(train_data['bedrooms'], train_data['price'],'.')
plt.xlabel('# Bedrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)

plt.subplot(1, 2, 2)
plt.plot(train_data['bathrooms'], train_data['price'],'.')
plt.xlabel('# Bathrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)

plt.show()


Now, creating a list of the features we are interested in, the feature matrix, and the output vector.


In [64]:
example_features = ['sqft_living', 'bedrooms', 'bathrooms']
X_multi_lin_reg = train_data[example_features]
y_multi_lin_reg = train_data['price']

Creating a scikit-learn LinearRegression object and fitting it with the feature matrix and output vector.


In [65]:
example_model = LinearRegression()
example_model.fit(X_multi_lin_reg, y_multi_lin_reg)


Out[65]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fitted the model, we can extract the regression weights (coefficients):


In [66]:
# printing the intercept and coefficients
print example_model.intercept_
print example_model.coef_


87912.865815
[   315.40669062 -65081.88711588   6942.16598637]

In [67]:
# Putting the intercept and weights from the multiple linear regression into a Series
example_weight_summary = pd.Series( [example_model.intercept_] + list(example_model.coef_),
                                  index = ['intercept'] + example_features )
print example_weight_summary


intercept      87912.865815
sqft_living      315.406691
bedrooms      -65081.887116
bathrooms       6942.165986
dtype: float64

Making Predictions

Recall that once a model is built, we can use the .predict() method to find the predicted values for data we pass in. For example, using the example model above:


In [68]:
example_predictions = example_model.predict(X_multi_lin_reg)
print example_predictions[0] # should be close to 271789.505878


271789.26538
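
As a cross-check (not part of the original assignment), the first prediction can be reconstructed by hand from the intercept and weights; a small sketch assuming the fitted example_model and the X_multi_lin_reg feature matrix from above:

# Reconstructing the first prediction as intercept + dot(weights, features)
first_row = X_multi_lin_reg.iloc[0].values
manual_prediction = example_model.intercept_ + np.dot(example_model.coef_, first_row)
print manual_prediction  # should match example_predictions[0]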

Compute RSS

Now that we can make predictions given the model, let's write a function to compute the RSS of the model: the sum of squared residuals, RSS = Σ_i (y_i − ŷ_i)², where ŷ_i is the model's predicted price for data point i.


In [69]:
def get_residual_sum_of_squares(model, data, outcome):
    # - data holds the data points with the features (columns) we are interested in performing a linear regression fit
    # - model holds the linear regression model obtained from fitting to the data
    # - outcome is the y, the observed house price for each data point
    
    # By using the model and applying predict on the data, we return a numpy array which holds
    # the predicted outcome (house price) from the linear regression model
    model_predictions = model.predict(data)

    # Computing the residuals between the predicted house price and the actual house price for each data point
    residuals = outcome - model_predictions

    # To get RSS, square the residuals and add them up
    RSS = (residuals ** 2).sum()

    return RSS
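
As a quick usage example (an extra check, not one of the original notebook outputs), we can apply the function to the example model and the data it was fit to:

# Example usage: RSS of example_model on its training data
rss_example_train = get_residual_sum_of_squares(example_model, X_multi_lin_reg, y_multi_lin_reg)
print rss_example_train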

Create some new features

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms), we can also consider transformations of existing features, e.g. the log of the squarefeet, or even "interaction" features such as the product of bedrooms and bathrooms.

Create the following 4 new features as columns in both TEST and TRAIN data:

  • bedrooms_squared = bedrooms*bedrooms
  • bed_bath_rooms = bedrooms*bathrooms
  • log_sqft_living = log(sqft_living)
  • lat_plus_long = lat + long

In [70]:
# Creating new 'bedrooms_squared' feature
train_data['bedrooms_squared'] = train_data['bedrooms']*train_data['bedrooms']
test_data['bedrooms_squared'] = test_data['bedrooms']*test_data['bedrooms']
# Creating new 'bed_bath_rooms' feature
train_data['bed_bath_rooms'] = train_data['bedrooms']*train_data['bathrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms']*test_data['bathrooms']
# Creating new 'log_sqft_living' feature
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
# Creating new 'lat_plus_long' feature
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

In [71]:
# Displaying the head of the test_data DataFrame to verify that the new features are present
# (only the last expression in a cell is rendered, so train_data.head() is evaluated but not shown)
train_data.head()
test_data.head()


Out[71]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... yr_renovated zipcode lat long sqft_living15 sqft_lot15 bedrooms_squared bed_bath_rooms log_sqft_living lat_plus_long
0 0114101516 20140528T000000 310000 3 1.0 1430 19901 1.5 0 0 ... 0 98028 47.7558 -122.229 1780 12697 9 3.0 7.265430 -74.4732
1 9297300055 20150124T000000 650000 4 3.0 2950 5000 2 0 3 ... 0 98126 47.5714 -122.375 2140 4000 16 12.0 7.989560 -74.8036
2 1202000200 20141103T000000 233000 3 2.0 1710 4697 1.5 0 0 ... 0 98002 47.3048 -122.218 1030 4705 9 6.0 7.444249 -74.9132
3 8562750320 20141110T000000 580500 3 2.5 2320 3980 2 0 0 ... 0 98027 47.5391 -122.070 2580 3980 9 7.5 7.749322 -74.5309
4 7589200193 20141110T000000 535000 3 1.0 1090 3000 1.5 0 0 ... 0 98117 47.6889 -122.375 1570 5080 9 3.0 6.993933 -74.6861

5 rows × 25 columns

  • Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
  • Bedrooms times bathrooms gives what's called an "interaction" feature. It is large when both of them are large.
  • Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values (see the short numeric demonstration after this list).
  • Adding latitude to longitude is totally nonsensical, but we will do it anyway (you'll see why).
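
A tiny numeric illustration of the squaring and log transformations (not part of the original notebook):

# Squaring stretches the high end; log compresses it
print np.array([1, 2, 3, 4]) ** 2                     # [ 1  4  9 16]
print np.log(np.array([500., 1000., 4000., 8000.]))   # equal ratios give equal log gaps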

Quiz Question: What is the mean (arithmetic average) value of your 4 new features on TEST data? (round to 2 digits)


In [72]:
print "Mean of Test data 'bedrooms_squared' feature: %.2f " % np.mean(test_data['bedrooms_squared'].values)
print "Mean of Test data 'bed_bath_rooms' feature: %.2f " % np.mean(test_data['bed_bath_rooms'].values)
print "Mean of Test data 'log_sqft_living' feature: %.2f " % np.mean(test_data['log_sqft_living'].values)
print "Mean of Test data 'lat_plus_long' feature: %.2f " % np.mean(test_data['lat_plus_long'].values)


Mean of Test data 'bedrooms_squared' feature: 12.45 
Mean of Test data 'bed_bath_rooms' feature: 7.50 
Mean of Test data 'log_sqft_living' feature: 7.55 
Mean of Test data 'lat_plus_long' feature: -74.65 

Learning Multiple Models

Now we will learn the weights for three (nested) models for predicting house prices. The first model will have the fewest features, the second model will add one more feature, and the third will add a few more:

  • Model 1: squarefeet, # bedrooms, # bathrooms, latitude & longitude
  • Model 2: Add bedrooms*bathrooms
  • Model 3: Add log squarefeet, bedrooms squared, and the (nonsensical) latitude + longitude

In [73]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

Now that you have the features, learn the weights for the three different models for predicting target = 'price' and look at the value of the weights/coefficients:


In [74]:
# Creating a LinearRegression Object for Model 1 and learning the multiple linear regression model
model_1 = LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
# Creating a LinearRegression Object for Model 2 and learning the multiple linear regression model
model_2 = LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
# Creating a LinearRegression Object for Model 3 and learning the multiple linear regression model
model_3 = LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])


Out[74]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now, examine each model's coefficients:


In [75]:
# Putting the Model 1 intercept and weights from the multiple linear regression into a Series
model_1_summary = pd.Series( [model_1.intercept_] + list(model_1.coef_),
                                  index = ['intercept'] + model_1_features , name='Model 1 Coefficients' )
print model_1_summary


intercept     -69075726.792570
sqft_living         312.258646
bedrooms         -59586.533154
bathrooms         15706.742083
lat              658619.263931
long            -309374.351268
Name: Model 1 Coefficients, dtype: float64

In [76]:
# Putting the Model 2 intercept and weights from the multiple linear regression into a Series
model_2_summary = pd.Series( [model_2.intercept_] + list(model_2.coef_),
                                  index = ['intercept'] + model_2_features , name='Model 2 Coefficients' )
print model_2_summary


intercept        -66867968.871079
sqft_living            306.610053
bedrooms           -113446.368070
bathrooms           -71461.308293
lat                 654844.629503
long               -294298.969138
bed_bath_rooms       25579.652001
Name: Model 2 Coefficients, dtype: float64

In [77]:
# Putting the Model 3 intercept and weights from the multiple linear regression into a Series
model_3_summary = pd.Series( [model_3.intercept_] + list(model_3.coef_),
                                  index = ['intercept'] + model_3_features , name='Model 3 Coefficients' )
print model_3_summary


intercept          -62036084.986098
sqft_living              529.422820
bedrooms               34514.229578
bathrooms              67060.781319
lat                   534085.610867
long                 -406750.710861
bed_bath_rooms         -8570.504395
bedrooms_squared       -6788.586670
log_sqft_living      -561831.484076
lat_plus_long         127334.900006
Name: Model 3 Coefficients, dtype: float64

Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 1?


In [78]:
print "Positive: ", model_1_summary['bathrooms']


Positive:  15706.7420827

Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 2?


In [79]:
print "Negative: ", model_2_summary['bathrooms']


Negative:  -71461.3082928

Think about what this means:

In Model 2, 'bathrooms' also enters the model through the 'bed_bath_rooms' interaction feature, and the two features are strongly correlated by construction. The positive interaction weight absorbs much of the bathroom effect, so the standalone 'bathrooms' coefficient turns negative to compensate. This is a reminder that individual coefficients cannot be interpreted in isolation when features are correlated.
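
One way to see this collinearity (an extra check, not part of the original assignment) is to look at the correlation between the two features on the training data:

# Correlation between 'bathrooms' and the interaction feature
print train_data[['bathrooms', 'bed_bath_rooms']].corr()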

Comparing multiple models

Now that you've learned three models and extracted the model weights we want to evaluate which model is best.

First, use your function from earlier to compute the RSS on TRAINING data for each of the three models.


In [80]:
# Compute the RSS on TRAINING data for each of the three models and record the values:
rss_model_1_train = get_residual_sum_of_squares(model_1, train_data[model_1_features], train_data['price'])
rss_model_2_train = get_residual_sum_of_squares(model_2, train_data[model_2_features], train_data['price'])
rss_model_3_train = get_residual_sum_of_squares(model_3, train_data[model_3_features], train_data['price'])
print "RSS for Model 1 Training Data: ", rss_model_1_train
print "RSS for Model 2 Training Data: ", rss_model_2_train
print "RSS for Model 3 Training Data: ", rss_model_3_train


RSS for Model 1 Training Data:  9.6787996305e+14
RSS for Model 2 Training Data:  9.58419635074e+14
RSS for Model 3 Training Data:  9.0343645505e+14

Quiz Question: Which model (1, 2 or 3) has lowest RSS on TRAINING Data? Is this what you expected?

Model 3 has the lowest RSS on the Training Data. This is expected: the models are nested, so adding features can never increase the training RSS, since least squares could always set the new weights to zero and recover the smaller model's fit.

Now compute the RSS on TEST data for each of the three models.


In [81]:
# Compute the RSS on TESTING data for each of the three models and record the values:
rss_model_1_test = get_residual_sum_of_squares(model_1, test_data[model_1_features], test_data['price'])
rss_model_2_test = get_residual_sum_of_squares(model_2, test_data[model_2_features], test_data['price'])
rss_model_3_test = get_residual_sum_of_squares(model_3, test_data[model_3_features], test_data['price'])
print "RSS for Model 1 Test Data: ", rss_model_1_test
print "RSS for Model 2 Test Data: ", rss_model_2_test
print "RSS for Model 3 T Data: ", rss_model_3_test


RSS for Model 1 Test Data:  2.25500469795e+14
RSS for Model 2 Test Data:  2.23377462976e+14
RSS for Model 3 Test Data:  2.59236319207e+14

Quiz Question: Which model (1, 2 or 3) has the lowest RSS on TESTING Data? Is this what you expected? Think about the features that were added to each model from the previous one.

Model 2 has the lowest RSS on the Test Data. Model 3 fits the training data best but generalizes worse: its extra features (including the redundant 'lat_plus_long') let it overfit the training set.
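
As a small convenience check (assuming the RSS variables computed above), the best model can also be selected programmatically:

# Index of the model with the lowest RSS on the test data (1-based)
test_rss = [rss_model_1_test, rss_model_2_test, rss_model_3_test]
print "Best model on test data: Model %d" % (np.argmin(test_rss) + 1)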

