In this notebook, we will use data on house sales in King County, Seattle to predict prices using multiple regression. The goal of this notebook is to explore multiple regression and feature engineering. You will:
In [55]:
import os
import zipfile
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
In [56]:
# Collect the names of the files in the current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]
In [57]:
# Filenames of unzipped files
unzip_files = ['kc_house_train_data.csv','kc_house_test_data.csv', 'kc_house_data.csv']
In [58]:
# If an unzipped file is not in files_list, extract it from its .zip archive
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        # The context manager closes the archive once extraction finishes
        with zipfile.ZipFile(zip_file) as unzipping:
            unzipping.extractall()
In [59]:
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms': float, 'waterfront': int, 'sqft_above': int,
              'sqft_living15': float, 'grade': int, 'yr_renovated': int,
              'price': float, 'bedrooms': float, 'zipcode': str,
              'long': float, 'sqft_lot15': float, 'sqft_living': float,
              'floors': str, 'condition': int, 'lat': float, 'date': str,
              'sqft_basement': int, 'yr_built': int, 'id': str, 'sqft_lot': int, 'view': int}
In [60]:
# Loading sales data, sales training data, and test_data into DataFrames
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype = dtype_dict)
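The effect of passing an explicit dtype can be seen on a small example. The sketch below uses a made-up two-row CSV (the values and the `csv_text` name are illustrative assumptions, not taken from the King County files) to show how an identifier column loses its leading zero when pandas infers the type, and keeps it when the column is read as a string.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, used only to illustrate the effect of dtype
csv_text = "id,zipcode,price\n0123456789,98178,221900\n0987654321,98125,538000\n"

# Without a dtype, pandas infers integers and drops the leading zero of 'id'
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit dtype, identifier-like columns stay as strings
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={'id': str, 'zipcode': str, 'price': float})

print(inferred['id'].iloc[0])
print(typed['id'].iloc[0])
```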
In [61]:
# Looking at head of training data DataFrame
train_data.head()
Out[61]:
Next, we learn a multiple regression model that predicts 'price' from the features example_features = ['sqft_living', 'bedrooms', 'bathrooms'], using the training data. First, let's plot the data for these features.
In [62]:
plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.')
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.show()
In [63]:
plt.figure(figsize=(12,8))
plt.subplot(1, 2, 1)
plt.plot(train_data['bedrooms'], train_data['price'],'.')
plt.xlabel('# Bedrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.subplot(1, 2, 2)
plt.plot(train_data['bathrooms'], train_data['price'],'.')
plt.xlabel('# Bathrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.show()
Now we create a list of the features we are interested in, then build the feature matrix and the output vector.
In [64]:
example_features = ['sqft_living', 'bedrooms', 'bathrooms']
X_multi_lin_reg = train_data[example_features]
y_multi_lin_reg = train_data['price']
We create a LinearRegression object from the scikit-learn library and fit it to the feature matrix and output vector.
In [65]:
example_model = LinearRegression()
example_model.fit(X_multi_lin_reg, y_multi_lin_reg)
Out[65]:
Now that we have fitted the model we can extract the regression weights (coefficients):
In [66]:
# Print the intercept and coefficients
print(example_model.intercept_)
print(example_model.coef_)
In [67]:
# Put the intercept and weights from the multiple linear regression into a Series
example_weight_summary = pd.Series([example_model.intercept_] + list(example_model.coef_),
                                   index=['intercept'] + example_features)
print(example_weight_summary)
In [68]:
example_predictions = example_model.predict(X_multi_lin_reg)
print(example_predictions[0])  # should be close to 271789.505878
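A prediction from a fitted linear model is the intercept plus the weighted sum of the feature values. The sketch below checks this on synthetic data (the feature ranges and the price rule are made up for illustration and are not the King County data), comparing a manual computation against `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for three housing features (all values are made up)
rng = np.random.RandomState(0)
X = rng.uniform(low=[500, 1, 1], high=[4000, 6, 4], size=(50, 3))
y = (100000 + 250 * X[:, 0] - 30000 * X[:, 1] + 15000 * X[:, 2]
     + rng.normal(0, 1000, 50))

model = LinearRegression().fit(X, y)

# Manual prediction: intercept + sum over features of (weight * feature value)
manual = model.intercept_ + X.dot(model.coef_)

print(np.allclose(manual, model.predict(X)))
```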
Now that we can make predictions given the model, let's write a function to compute the RSS of the model.
In [69]:
def get_residual_sum_of_squares(model, data, outcome):
    # - model: the fitted linear regression model
    # - data: the data points, restricted to the feature columns the model was fitted on
    # - outcome: y, the observed house price for each data point

    # Applying predict to the data returns a numpy array holding
    # the predicted outcome (house price) for each data point
    model_predictions = model.predict(data)
    # Compute the residuals between the observed and the predicted house prices
    residuals = outcome - model_predictions
    # To get RSS, square the residuals and add them up
    RSS = np.sum(residuals * residuals)
    return RSS
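As a quick sanity check of the RSS arithmetic, the sketch below computes it by hand for three made-up observations. The helper `residual_sum_of_squares` is a hypothetical variant that takes predictions directly rather than a model, so the numbers can be verified without fitting anything.

```python
import numpy as np

def residual_sum_of_squares(predictions, outcome):
    """Sum of squared differences between observed and predicted values."""
    residuals = np.asarray(outcome) - np.asarray(predictions)
    return float(np.sum(residuals ** 2))

# Made-up observed and predicted prices
observed = [200.0, 300.0, 400.0]
predicted = [210.0, 290.0, 400.0]

rss = residual_sum_of_squares(predicted, observed)
print(rss)  # (-10)^2 + 10^2 + 0^2 = 200.0
```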
Although we often think of multiple regression as including multiple different features (e.g., number of bedrooms, square feet, and number of bathrooms), we can also consider transformations of existing features, e.g., the log of the square feet, or even "interaction" features such as the product of bedrooms and bathrooms.
Create the following 4 new features as columns in both the TEST and TRAIN data:
In [70]:
# Creating new 'bedrooms_squared' feature
train_data['bedrooms_squared'] = train_data['bedrooms']*train_data['bedrooms']
test_data['bedrooms_squared'] = test_data['bedrooms']*test_data['bedrooms']
# Creating new 'bed_bath_rooms' feature
train_data['bed_bath_rooms'] = train_data['bedrooms']*train_data['bathrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms']*test_data['bathrooms']
# Creating new 'log_sqft_living' feature
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
# Creating new 'lat_plus_long' feature
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']
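Because the same four transformations are applied to both DataFrames, they can also be wrapped in a single function so that the training and test sets cannot drift apart. This is a minimal sketch, assuming the column names used above; `add_engineered_features` and the `toy` DataFrame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the four derived columns added."""
    out = df.copy()
    out['bedrooms_squared'] = out['bedrooms'] ** 2
    out['bed_bath_rooms'] = out['bedrooms'] * out['bathrooms']
    out['log_sqft_living'] = np.log(out['sqft_living'])
    out['lat_plus_long'] = out['lat'] + out['long']
    return out

# Made-up rows with the columns the function needs
toy = pd.DataFrame({'bedrooms': [3.0, 4.0], 'bathrooms': [1.0, 2.5],
                    'sqft_living': [1180.0, 2570.0],
                    'lat': [47.5112, 47.7210], 'long': [-122.257, -122.319]})
toy = add_engineered_features(toy)
print(toy[['bedrooms_squared', 'bed_bath_rooms']])
```

The same call could then be applied to each of the two DataFrames in turn.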
In [71]:
# Displaying head of train_data DataFrame and test_data DataFrame to verify that new features are present
train_data.head()
test_data.head()
Out[71]:
Quiz Question: What is the mean (arithmetic average) value of your 4 new features on TEST data? (round to 2 digits)
In [72]:
print("Mean of Test data 'bedrooms_squared' feature: %.2f" % np.mean(test_data['bedrooms_squared'].values))
print("Mean of Test data 'bed_bath_rooms' feature: %.2f" % np.mean(test_data['bed_bath_rooms'].values))
print("Mean of Test data 'log_sqft_living' feature: %.2f" % np.mean(test_data['log_sqft_living'].values))
print("Mean of Test data 'lat_plus_long' feature: %.2f" % np.mean(test_data['lat_plus_long'].values))
Now we will learn the weights for three (nested) models for predicting house prices. The first model will have the fewest features, the second model will add one more feature, and the third will add a few more:
In [73]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
Now that you have the features, learn the weights for the three different models for predicting target = 'price' and look at the value of the weights/coefficients:
In [74]:
# Creating a LinearRegression Object for Model 1 and learning the multiple linear regression model
model_1 = LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
# Creating a LinearRegression Object for Model 2 and learning the multiple linear regression model
model_2 = LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
# Creating a LinearRegression Object for Model 3 and learning the multiple linear regression model
model_3 = LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
Out[74]:
Now, examine each model's coefficients:
In [75]:
# Put the Model 1 intercept and weights into a Series
model_1_summary = pd.Series([model_1.intercept_] + list(model_1.coef_),
                            index=['intercept'] + model_1_features, name='Model 1 Coefficients')
print(model_1_summary)
In [76]:
# Put the Model 2 intercept and weights into a Series
model_2_summary = pd.Series([model_2.intercept_] + list(model_2.coef_),
                            index=['intercept'] + model_2_features, name='Model 2 Coefficients')
print(model_2_summary)
In [77]:
# Put the Model 3 intercept and weights into a Series
model_3_summary = pd.Series([model_3.intercept_] + list(model_3.coef_),
                            index=['intercept'] + model_3_features, name='Model 3 Coefficients')
print(model_3_summary)
Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 1?
In [78]:
print("Positive:", model_1_summary['bathrooms'])
Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 2?
In [79]:
print("Negative:", model_2_summary['bathrooms'])
Think about what this means:
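In model 2, the new 'bed_bath_rooms' feature is the product of 'bedrooms' and 'bathrooms', so it is correlated with 'bathrooms' and takes over part of its effect. The predicted price change for one additional bathroom is now the 'bathrooms' weight plus the 'bed_bath_rooms' weight times the number of bedrooms, so the negative 'bathrooms' weight does not by itself mean that an extra bathroom lowers the predicted price.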
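To make the role of the interaction term concrete, here is a small sketch on synthetic, noise-free data. The price rule and its weights (-20000 for bathrooms, 50000 for the product) are made up for illustration and are not the weights learned from the King County data. The fitted 'bathrooms' weight is negative, yet the effect of one more bathroom in a 3-bedroom house is positive.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
bedrooms = rng.randint(1, 6, size=200).astype(float)
bathrooms = rng.choice([1.0, 1.5, 2.0, 2.5, 3.0], size=200)

# Made-up, noise-free price rule with a negative 'bathrooms' weight
price = 10000 - 20000 * bathrooms + 50000 * bedrooms * bathrooms

# Feature matrix: bedrooms, bathrooms, and their product (interaction)
X = np.column_stack([bedrooms, bathrooms, bedrooms * bathrooms])
model = LinearRegression().fit(X, price)
w_bed, w_bath, w_bed_bath = model.coef_

# Effect of one extra bathroom in a 3-bedroom house:
# the 'bathrooms' weight plus the interaction weight times the bedroom count
marginal_effect = w_bath + w_bed_bath * 3

print(w_bath < 0, marginal_effect > 0)
```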
First, use the function from earlier to compute the RSS on the TRAINING data for each of the three models.
In [80]:
# Compute the RSS on TRAINING data for each of the three models and record the values:
rss_model_1_train = get_residual_sum_of_squares(model_1, train_data[model_1_features], train_data['price'])
rss_model_2_train = get_residual_sum_of_squares(model_2, train_data[model_2_features], train_data['price'])
rss_model_3_train = get_residual_sum_of_squares(model_3, train_data[model_3_features], train_data['price'])
print("RSS for Model 1 Training Data:", rss_model_1_train)
print("RSS for Model 2 Training Data:", rss_model_2_train)
print("RSS for Model 3 Training Data:", rss_model_3_train)
Quiz Question: Which model (1, 2 or 3) has lowest RSS on TRAINING Data? Is this what you expected?
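Model 3 has the lowest RSS on the Training Data. This is expected: the models are nested, so each larger model contains the smaller one as a special case, and a least-squares fit with additional features can never have a higher training RSS.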
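The behaviour of the training RSS under added features can be checked on synthetic data. In the sketch below (all data are randomly generated and the helper `training_rss` is a hypothetical name), a column of pure noise is appended to the feature matrix, and the training RSS still does not go up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(2)
X_small = rng.normal(size=(100, 2))
y = 3.0 * X_small[:, 0] - 1.0 * X_small[:, 1] + rng.normal(size=100)

# Append a pure-noise column that has nothing to do with y
X_big = np.column_stack([X_small, rng.normal(size=100)])

def training_rss(X, y):
    """Fit least squares on (X, y) and return the RSS on the same data."""
    model = LinearRegression().fit(X, y)
    return np.sum((y - model.predict(X)) ** 2)

rss_small = training_rss(X_small, y)
rss_big = training_rss(X_big, y)

print(rss_big <= rss_small)
```

A lower training RSS here says nothing about how the larger model performs on unseen data, which is why the test RSS is computed separately.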
Now compute the RSS on the TEST data for each of the three models.
In [81]:
# Compute the RSS on TESTING data for each of the three models and record the values:
rss_model_1_test = get_residual_sum_of_squares(model_1, test_data[model_1_features], test_data['price'])
rss_model_2_test = get_residual_sum_of_squares(model_2, test_data[model_2_features], test_data['price'])
rss_model_3_test = get_residual_sum_of_squares(model_3, test_data[model_3_features], test_data['price'])
print("RSS for Model 1 Test Data:", rss_model_1_test)
print("RSS for Model 2 Test Data:", rss_model_2_test)
print("RSS for Model 3 Test Data:", rss_model_3_test)
Quiz Question: Which model (1, 2 or 3) has the lowest RSS on TESTING Data? Is this what you expected? Think about the features that were added to each model from the previous one.
Model 2 has the lowest RSS on the Test Data.
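RSS is a sum over rows, so it grows with the size of the data set. Comparing models on the same data with RSS is fine, but comparing a training RSS with a test RSS requires normalising by the number of rows first, for example via the root mean squared error. The sketch below uses made-up RSS values and row counts; `rmse_from_rss` is a hypothetical helper.

```python
import numpy as np

def rmse_from_rss(rss, n_rows):
    """Root mean squared error: the square root of RSS divided by the row count."""
    return np.sqrt(rss / n_rows)

# Made-up values: the RSS differs by a factor of 4, but so does the row count
rss_train, n_train = 4.0e6, 100
rss_test, n_test = 1.0e6, 25

rmse_train = rmse_from_rss(rss_train, n_train)
rmse_test = rmse_from_rss(rss_test, n_test)
print(rmse_train)  # sqrt(4.0e6 / 100) = 200.0
print(rmse_test)   # sqrt(1.0e6 / 25) = 200.0
```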