# Regression Week 3: Assessing Fit (polynomial regression)

In this notebook, we will compare different regression models in order to assess which model fits best. We will be using polynomial regression as a means to examine this topic. In particular, you will:

• Write a function to take a Series and a degree and return a DataFrame where each column is the Series to a polynomial value up to the total degree e.g. degree = 3 then column 1 is the Series column 2 is the Series squared and column 3 is the Series cubed
• Use matplotlib to visualize polynomial regressions
• Use matplotlib to visualize the same polynomial degree on different subsets of the data
• Use a validation set to select a polynomial degree
• Assess the final fit using test data

## Importing Libraries

``````

In [107]:

import os
import zipfile
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

``````

## Unzipping files with house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

``````

In [108]:

# Put files in current direction into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]

``````
``````

In [109]:

# Filenames of unzipped files
unzip_files = ['kc_house_data.csv','wk3_kc_house_set_1_data.csv', 'wk3_kc_house_set_2_data.csv',
'wk3_kc_house_set_3_data.csv', 'wk3_kc_house_set_4_data.csv', 'wk3_kc_house_test_data.csv',
'wk3_kc_house_train_data.csv', 'wk3_kc_house_valid_data.csv']

``````
``````

In [110]:

# If upzipped file not in files_list, unzip the file
for filename in unzip_files:
if filename not in files_list:
zip_file = filename + '.zip'
unzipping = zipfile.ZipFile(zip_file)
unzipping.extractall()
unzipping.close

``````

## Basics of apply function for Pandas DataFrames

Next we're going to write a polynomial function that takes an Series and a maximal degree and returns an DataFrame with columns containing the Series to all the powers up to the maximal degree.

The easiest way to apply a power to a Series is to use the .apply() and lambda x: functions. For example to take the example array and compute the third power we can do as follows: (note running this cell the first time may take longer than expected since it loads graphlab)

``````

In [111]:

tmp = pd.Series([1.0, 2.0, 3.0])
tmp_cubed = tmp.apply(lambda x: x**3)
print tmp
print tmp_cubed

``````
``````

0    1
1    2
2    3
dtype: float64
0     1
1     8
2    27
dtype: float64

``````

We can create an empty DataFrame using pd.DataFrame() and then add any columns to it with ex_dframe['column_name'] = value. For example we create an empty DataFrame and make the column 'power_1' to be the first power of tmp (i.e. tmp itself).

``````

In [112]:

ex_dframe = pd.DataFrame()
ex_dframe['power_1'] = tmp
print ex_dframe
print type(ex_dframe)

``````
``````

power_1
0        1
1        2
2        3
<class 'pandas.core.frame.DataFrame'>

``````

## Polynomial_dataframe function

Using the hints above complete the following function to create an DataFrame consisting of the powers of a Series up to a specific degree:

``````

In [113]:

def polynomial_dataframe(feature, degree): # feature is pandas.Series type
# assume that degree >= 1
# initialize the dataframe:
poly_dataframe = pd.DataFrame()
# and set poly_dataframe['power_1'] equal to the passed feature
poly_dataframe['power_1'] = feature

# first check if degree > 1
if degree > 1:
# then loop over the remaining degrees:
for power in range(2, degree+1):
# first we'll give the column a name:
name = 'power_' + str(power)
# assign poly_dataframe[name] to be feature^power; use apply(*)
poly_dataframe[name] = poly_dataframe['power_1'].apply(lambda x: x**power)
return poly_dataframe

``````

To test your function consider the smaller tmp variable and what you would expect the outcome of the following call:

``````

In [114]:

tmp = pd.Series([1.0, 2.0, 3.0])
print polynomial_dataframe(tmp, 3)

``````
``````

power_1  power_2  power_3
0        1        1        1
1        2        4        8
2        3        9       27

``````

## Visualizing polynomial regression

Let's use matplotlib to visualize what a polynomial regression looks like on some real data. First, let's load house sales data

``````

In [115]:

# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,
'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float,
'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int,
'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

``````
``````

In [116]:

sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)

``````

As in Week 3, we will use the sqft_living variable. For plotting purposes (connecting the dots), you'll need to sort by the values of sqft_living. For houses with identical square footage, we break the tie by their prices.

``````

In [117]:

sales = sales.sort_values(['sqft_living', 'price'])

``````
``````

Out[117]:

sqft_living
price

19452
290
142000

15381
370
276000

860
380
245000

18379
384
265000

4868
390
228000

``````

Let's start with a degree 1 polynomial using 'sqft_living' (i.e. a line) to predict 'price' and plot what it looks like.

``````

In [118]:

poly1_data = polynomial_dataframe(sales['sqft_living'], 1)
poly1_data['price'] = sales['price'] # add price to the data since it's the target

``````
``````

Out[118]:

power_1
price

19452
290
142000

15381
370
276000

860
380
245000

18379
384
265000

4868
390
228000

``````

Creating feature matrix and output vector to perform linear reggression with Sklearn

``````

In [119]:

# Note: Must pass list of features to feature matrix X_feat_model_1 for sklearn to work
X_feat_model_1 = poly1_data[ ['power_1'] ]
y_output_model_1 = poly1_data['price']

``````
``````

In [120]:

model_1 = LinearRegression()
model_1.fit(X_feat_model_1, y_output_model_1)

``````
``````

Out[120]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````

Let's look at the intercept and weight before we plot.

``````

In [121]:

print model_1.intercept_
print model_1.coef_

``````
``````

-43580.7430945
[ 280.6235679]

``````

Now, plotting the data and the line learned by linear regression

``````

In [122]:

plt.figure(figsize=(8,6))
plt.plot(poly1_data['power_1'],poly1_data['price'],'.', label= 'House Price Data')
plt.hold(True)
plt.plot(poly1_data['power_1'], model_1.predict(X_feat_model_1), '-' , label= 'Linear Regression Model')
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('Simple Linear Regression', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````

We can see, not surprisingly, that the predicted values all fall on a line, specifically the one with slope 280 and intercept -43579. What if we wanted to plot a second degree polynomial?

``````

In [123]:

poly2_data = polynomial_dataframe(sales['sqft_living'], 2)
my_features = list(poly2_data) # Get col_names of DataFrame and put in list
poly2_data['price'] = sales['price'] # add price to the data since it's the target

``````
``````

In [124]:

# Creating feature matrix and output vector to perform regression w/ sklearn.
X_feat_model_2 = poly2_data[my_features]
y_output_model_2 = poly2_data['price']

``````
``````

In [125]:

# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
model_2 = LinearRegression()
model_2.fit(X_feat_model_2, y_output_model_2)

``````
``````

Out[125]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````

Let's look at the intercept and weights before we plot.

``````

In [126]:

print model_2.intercept_
print model_2.coef_

``````
``````

199222.279305
[  6.79940947e+01   3.85812609e-02]

``````
``````

In [127]:

plt.figure(figsize=(8,6))
plt.plot(poly2_data['power_1'],poly2_data['price'],'.', label= 'House Price Data')
plt.hold(True)
plt.plot(poly2_data['power_1'], model_2.predict(X_feat_model_2), '-' , label= 'Regression Model')
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('2nd Degree Polynomial Regression', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````

The resulting model looks like half a parabola. Try on your own to see what the cubic looks like:

``````

In [128]:

poly3_data = polynomial_dataframe(sales['sqft_living'], 3)
my_features = list(poly3_data) # Get col_names of DataFrame and put in list
poly3_data['price'] = sales['price'] # add price to the data since it's the target

``````
``````

In [129]:

# Creating feature matrix and output vector to perform regression w/ sklearn.
X_feat_model_3 = poly3_data[my_features]
y_output_model_3 = poly3_data['price']

``````
``````

In [130]:

# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
model_3 = LinearRegression()
model_3.fit(X_feat_model_3, y_output_model_3)

``````
``````

Out[130]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````

Looking at intercept and weights before plotting

``````

In [131]:

print model_3.intercept_
print model_3.coef_

``````
``````

336819.748221
[ -9.01819864e+01   8.70465089e-02  -3.84055260e-06]

``````
``````

In [132]:

plt.figure(figsize=(8,6))
plt.plot(poly3_data['power_1'],poly3_data['price'],'.', label= 'House Price Data')
plt.hold(True)
plt.plot(poly3_data['power_1'], model_3.predict(X_feat_model_3), '-' , label= 'Regression Model')
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('3rd Degree Polynomial Regression', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````

Now try a 15th degree polynomial:

``````

In [133]:

poly15_data = polynomial_dataframe(sales['sqft_living'], 15)
my_features = list(poly15_data) # Get col_names of DataFrame and put in list
poly15_data['price'] = sales['price'] # add price to the data since it's the target

``````
``````

In [134]:

# Creating feature matrix and output vector to perform regression w/ sklearn.
X_feat_model_15 = poly15_data[my_features]
y_output_model_15 = poly15_data['price']

``````
``````

In [135]:

# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
model_15 = LinearRegression()
model_15.fit(X_feat_model_15, y_output_model_15)

``````
``````

Out[135]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````

Looking at intercept and weights before plotting

``````

In [136]:

print model_15.intercept_
print model_15.coef_

``````
``````

537116.329638
[  4.56404164e-91   1.42711538e-50   1.72764646e-55   3.01685526e-60
1.30077358e-74   2.68575522e-71   2.26147568e-67   1.85900299e-63
1.47144116e-59   1.09771012e-55   7.43509038e-52   4.23015578e-48
1.61618577e-44  -2.49283826e-48   9.59718336e-53]

``````
``````

In [137]:

plt.figure(figsize=(8,6))
plt.plot(poly15_data['power_1'],poly15_data['price'],'.', label= 'House Price Data')
plt.hold(True)
plt.plot(poly15_data['power_1'], model_15.predict(X_feat_model_15), '-' , label= 'Regression Model')
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('15th Degree Polynomial Regression', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````

What do you think of the 15th degree polynomial? Do you think this is appropriate? If we were to change the data do you think you'd get pretty much the same curve? Let's take a look.

# Changing the data and re-learning

We're going to split the sales data into four subsets of roughly equal size. Then you will estimate a 15th degree polynomial model on all four subsets of the data. Print the coefficients (you should use .print_rows(num_rows = 16) to view all of them) and plot the resulting fit (as we did above). The quiz will ask you some questions about these results.

``````

In [138]:

``````

Making 4 dataframes with 15 features and a price column

``````

In [139]:

(poly15_set_1, poly15_set_2) = ( polynomial_dataframe(set_1['sqft_living'], 15) , polynomial_dataframe(set_2['sqft_living'], 15) )
(poly15_set_3, poly15_set_4) = ( polynomial_dataframe(set_3['sqft_living'], 15) , polynomial_dataframe(set_4['sqft_living'], 15) )

( poly15_set_1['price'], poly15_set_2['price'] ) = ( set_1['price'] , set_2['price'] )
( poly15_set_3['price'], poly15_set_4['price'] ) = ( set_3['price'] , set_4['price'] )

my_features = list(poly15_set_1) # Put DataFrame col_names in a list. All dataframes have same col_names

``````
``````

In [140]:

( X_feat_set_1, X_feat_set_2 ) = ( poly15_set_1[my_features], poly15_set_2[my_features] )
( X_feat_set_3, X_feat_set_4 ) = ( poly15_set_3[my_features], poly15_set_4[my_features] )

( y_output_set_1, y_output_set_2 ) = ( poly15_set_1['price'], poly15_set_2['price'] )
( y_output_set_3, y_output_set_4 ) = ( poly15_set_3['price'], poly15_set_4['price'] )

``````
``````

In [141]:

# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
model_deg_15_set_1 = LinearRegression()
model_deg_15_set_2 = LinearRegression()
model_deg_15_set_3 = LinearRegression()
model_deg_15_set_4 = LinearRegression()

model_deg_15_set_1.fit(X_feat_set_1, y_output_set_1)
model_deg_15_set_2.fit(X_feat_set_2, y_output_set_2)
model_deg_15_set_3.fit(X_feat_set_3, y_output_set_3)
model_deg_15_set_4.fit(X_feat_set_4, y_output_set_4)

``````
``````

Out[141]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````
``````

In [142]:

plt.figure(figsize=(8,6))
plt.plot(poly15_data['power_1'],poly15_data['price'],'.', label= 'House Price Data')
plt.hold(True)
plt.plot(poly15_set_1['power_1'], model_deg_15_set_1.predict(X_feat_set_1), '-' , label= 'Model 1')
plt.plot(poly15_set_2['power_1'], model_deg_15_set_2.predict(X_feat_set_2), '-' , label= 'Model 2')
plt.plot(poly15_set_3['power_1'], model_deg_15_set_3.predict(X_feat_set_3), '-' , label= 'Model 3')
plt.plot(poly15_set_4['power_1'], model_deg_15_set_4.predict(X_feat_set_4), '-' , label= 'Model 4')
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('4 Different 15th Deg. Polynomial Regr. Models', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````

Quiz Question: Is the sign (positive or negative) for power_15 the same in all four models?

``````

In [143]:

power_15_coeff = [ model_deg_15_set_1.coef_[-1], model_deg_15_set_2.coef_[-1], model_deg_15_set_3.coef_[-1], model_deg_15_set_4.coef_[-1] ]
print power_15_coeff
print
if all(i > 0 for i in power_15_coeff):
print 'Sign the SAME (Positive) for all 4 models'
elif all(i < 0 for i in power_15_coeff):
print 'Sign the SAME (Negative) for all 4 models'
else:
print 'Sign NOT the same for all 4 models'

``````
``````

[1.3117216099014014e-87, 8.8062799992843061e-75, 1.1139318003215489e-85, 5.0630590615135435e-74]

Sign the SAME (Positive) for all 4 models

``````

Quiz Question: (True/False) the plotted fitted lines look the same in all four plots

Fits for 4 different models look very different

# Selecting a Polynomial Degree

Whenever we have a "magic" parameter like the degree of the polynomial there is one well-known way to select these parameters: validation set. (We will explore another approach in week 4).

We now load sales dataset split 3-way into training set, test set, and validation set:

``````

In [144]:

train_data = pd.read_csv('wk3_kc_house_train_data.csv', dtype = dtype_dict)
valid_data = pd.read_csv('wk3_kc_house_valid_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('wk3_kc_house_test_data.csv', dtype = dtype_dict)

``````
``````

In [145]:

# Sorting the Training Data for Plotting
train_data = train_data.sort_values(['sqft_living', 'price'])

``````
``````

Out[145]:

sqft_living
price

8840
290
142000

6982
370
276000

8338
384
265000

2228
390
228000

9631
390
245000

``````

Next you should write a loop that does the following:

• For degree in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] (to get this in python type range(1, 15+1))
• Build an DAtaFrame of polynomial data of train_data['sqft_living'] at the current degree
• Add train_data['price'] to the polynomial DataFrame
• Learn a polynomial regression model to sqft vs price with that degree on TRAIN data
• Compute the RSS on VALIDATION data (here you will want to use .predict()) for that degree and you will need to make a polynomial DataFrame using validation data.
``````

In [146]:

# poly_deg_dict is a dict which holds poly_features dataframes. key_list is a list of keys for the dicts.
# The keys in key_list are of the form 'poly_deg_i', where i refers to the ith polynomial
poly_deg_dict = {}
key_list = []

# X_feat_dict is a dict with all the feature matrices and y_output_dict is a dict with all the output vectors
X_feat_dict = {}
y_output_dict = {}

# model_poly_deg is a dict which holds all the regression models for the ith polynomial fit
model_poly_deg = {}

# Looping over polynomial features from 1-15
for i in range(1, 15+1, 1):

# Defining key-name and appending key_name to the key_list
key_poly_deg = 'poly_deg_' + str(i)
key_list.append(key_poly_deg)

# Entering each dataframe returned from polynomial_dataframe function into a dict
# Then, saving col_names into a list to do regression w/ these features. Then, adding price column to dataframe
poly_deg_dict[key_poly_deg] = polynomial_dataframe(train_data['sqft_living'], i)
feat_poly_fit = list(poly_deg_dict[key_poly_deg])
poly_deg_dict[key_poly_deg]['price'] = train_data['price']

# Adding feature matrix and output_vector into dicts
X_feat_dict[key_poly_deg] = poly_deg_dict[key_poly_deg][feat_poly_fit]
y_output_dict[key_poly_deg] = poly_deg_dict[key_poly_deg]['price']

# Adding regression models to dicts
model_poly_deg[key_poly_deg] = LinearRegression()
model_poly_deg[key_poly_deg].fit( X_feat_dict[key_poly_deg], y_output_dict[key_poly_deg] )

``````
``````

In [147]:

plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.', label= 'House Price Data')
plt.hold(True)

for i in range(0,5):
leg_label = 'Deg. ' + str(i+1)
plt.plot( poly_deg_dict[key_list[i]]['power_1'], model_poly_deg[key_list[i]].predict(X_feat_dict[key_list[i]]), '-', label = leg_label )

plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('Degree 1-5 Polynomial Regression Models', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````
``````

In [148]:

plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.', label= 'House Price Data')
plt.hold(True)

for i in range(5,10):
leg_label = 'Deg. ' + str(i+1)
plt.plot( poly_deg_dict[key_list[i]]['power_1'], model_poly_deg[key_list[i]].predict(X_feat_dict[key_list[i]]), '-', label = leg_label )

plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('Degree 6-10 Polynomial Regression Models', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````
``````

In [149]:

plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.', label= 'House Price Data')
plt.hold(True)

for i in range(10,15):
leg_label = 'Deg. ' + str(i+1)
plt.plot( poly_deg_dict[key_list[i]]['power_1'], model_poly_deg[key_list[i]].predict(X_feat_dict[key_list[i]]), '-', label = leg_label )

plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price (\$)', fontsize=16)
plt.title('Degree 11-15 Polynomial Regression Models', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

``````
``````

``````

Quiz Question: Which degree (1, 2, …, 15) had the lowest RSS on Validation data?

Now that you have chosen the degree of your polynomial using validation data, compute the RSS of this model on TEST data. Report the RSS on your quiz.

First, sorting validation data in case of plotting

``````

In [150]:

valid_data = valid_data.sort_values(['sqft_living', 'price'])

``````

Now, building a function to compute the RSS

``````

In [151]:

def get_residual_sum_of_squares(model, data, outcome):
# - data holds the data points with the features (columns) we are interested in performing a linear regression fit
# - model holds the linear regression model obtained from fitting to the data
# - outcome is the y, the observed house price for each data point

# By using the model and applying predict on the data, we return a numpy array which holds
# the predicted outcome (house price) from the linear regression model
model_predictions = model.predict(data)

# Computing the residuals between the predicted house price and the actual house price for each data point
residuals = outcome - model_predictions

``````

Now, creating a list of tuples with the values (RSS_deg_i , i). Finding min of list will give min RSS_val and and degree of polynomial

``````

In [152]:

# First, clearing empty list which will hold tuples

# Looping over polynomial features from 1-15
for i in range(1, 15+1, 1):

# Creating dataframe w/ additional features on the validation data. Then, putting these features into a list
valid_data_poly = polynomial_dataframe(valid_data['sqft_living'], i)
feat_val_poly = list(valid_data_poly)

# Using get_residual_sum_of_squares to compute RSS. Using the key_list[i-1] since index starts at 0.
# Each entry of key_list[i-1] contains the key we want for the dict of regression models

``````
``````

In [153]:

``````
``````

Polynomial Degree with lowest RSS from validation set:  6

``````

Quiz Question: what is the RSS on TEST data for the model with the degree selected from Validation data?

First, sorting test data in case of plotting

``````

In [154]:

test_data = test_data.sort_values(['sqft_living', 'price'])

``````

Now, finding RSS of polynomial degree 6 on TEST data

``````

In [155]:

# Creating dataframe w/ additional features on the test data. Then, putting these features into a list
test_data_poly_6 = polynomial_dataframe(test_data['sqft_living'], 6)
feat_val_poly_6 = list(test_data_poly_6)

``````
``````

In [156]:

``````
``````

RSS on Test data for Degree 6 Polynomial:  1.35225117491e+14

``````
``````

In [ ]:

``````