In this notebook, we will cover estimating multiple regression weights via gradient descent. You will: convert a DataFrame into a NumPy feature matrix, write functions to compute predictions and the cost-function derivative, implement gradient descent, use it to fit two models on house-sales data, and compare their predictions and RSS on test data.
In [35]:
import os
import zipfile
from math import sqrt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
The dataset contains house sales from King County, the region where Seattle, WA is located.
In [36]:
# Put files in the current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]
In [37]:
# Filenames of unzipped files
unzip_files = ['kc_house_train_data.csv','kc_house_test_data.csv', 'kc_house_data.csv']
In [38]:
# If an unzipped file is not in files_list, unzip the corresponding archive
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        unzipping = zipfile.ZipFile(zip_file)
        unzipping.extractall()
        unzipping.close()
Loading Sales data, Sales training data, and Sales test data
In [39]:
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,
'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,
'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str,
'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int,
'id':str, 'sqft_lot':int, 'view':int}
In [40]:
# Loading sales data, sales training data, and test_data into DataFrames
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype = dtype_dict)
In [41]:
# Looking at head of training data DataFrame
train_data.head()
Out[41]:
Now, we will write a function that accepts a DataFrame, a list of features (e.g. ['sqft_living', 'bedrooms']), and a target feature (e.g. 'price'). This function will return two things: a feature matrix (a 2D NumPy array whose first column is a constant column of 1.0s, followed by the columns of the requested features) and an output array (a NumPy array with the values of the target feature).
In [42]:
def get_numpy_data(input_df, features, output):
    input_df['constant'] = 1.0 # Adding column 'constant' to input DataFrame with all values = 1.0
    features = ['constant'] + features # Adding 'constant' to the list of features
    feature_matrix = input_df.as_matrix(columns=features) # Convert DataFrame w/ columns in features list to np.ndarray
    output_array = input_df[output].values # Convert column with output feature into np.array
    return(feature_matrix, output_array)
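As a quick sanity check (a sketch, not part of the assignment), we can call the helper on the training data and confirm that the feature matrix has one row per house and one column per requested feature plus the constant column:
In [ ]:
(example_matrix, example_output) = get_numpy_data(train_data, ['sqft_living'], 'price')
print example_matrix.shape, example_output.shape
print example_matrix[0] # First row: [constant, sqft_living of the first house]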
Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0], and we wanted to compute the predicted output 1.0*1.0 + 1.0*1180.0 = 1181.0; this is the dot product between these two arrays. If they're numpy arrays, we can use np.dot() to compute it. With this in mind, write the predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights.
In [43]:
def predict_output(feature_matrix, weights):
    predictions = np.dot(feature_matrix, weights)
    return predictions
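As a quick check against the worked example above (a sketch, not part of the assignment), the prediction for a single row [1.0, 1180.0] with weights [1.0, 1.0] should come out to 1181.0:
In [ ]:
print predict_output(np.array([[1., 1180.]]), np.array([1., 1.])) # Expect [ 1181.]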
The cost function is the sum over the data points of the squared difference between an observed output and a predicted output. We can write the squared difference between the observed output and predicted output for a single point as follows:
(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)^2
Where we have k features and a constant. So the derivative with respect to weight w[i] is:
2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)*[feature_i]
The term inside the parentheses is just the error (the difference between the prediction and the output). So we can rewrite this as:
2*error*[feature_i]
That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.
With this in mind, complete the following derivative function, which computes the derivative with respect to a weight given the values of the corresponding feature (over all data points) and the errors (over all data points).
In [44]:
def feature_derivative(errors, feature):
    derivative = 2.0*np.dot(errors, feature)
    return derivative
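As a quick check (a sketch using the helpers defined above): with weights of all zeros the predictions are all zero, so the errors equal the negative of the output, and the derivative with respect to the constant feature should equal -2 times the sum of the output:
In [ ]:
(check_matrix, check_output) = get_numpy_data(train_data, ['sqft_living'], 'price')
check_errors = predict_output(check_matrix, np.array([0., 0.])) - check_output # All-zero weights
print feature_derivative(check_errors, check_matrix[:, 0]) # Derivative w.r.t. the constant feature
print -2.0*check_output.sum() # Should match the line above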
Now, we will write a function that performs gradient descent. The basic steps are as follows. Given a starting point, we update the current weights by moving in the negative gradient direction (since we're trying to minimize the cost function).
The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring that the magnitude (length) of the gradient vector be smaller than a fixed 'tolerance'.
In the gradient descent function below, for each step of the descent we update the weight for each feature before computing the stopping criterion.
In [45]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    weights = np.array(initial_weights) # Initializing the weights to be the initial weights
    while not converged:
        predictions = predict_output(feature_matrix, weights) # Finding predicted output w/ weights and feature_matrix
        errors = predictions - output # Computing error of predicted output and actual output for each data point
        gradient_sum_squares = 0 # Initialize the gradient sum of squares
        # While we haven't reached the tolerance, update the weight of each feature
        for i in range(len(weights)): # Loop over each weight
            der_feat_i = feature_derivative(errors, feature_matrix[:,i]) # Cost function derivative for feature i
            gradient_sum_squares += der_feat_i**2.0 # Add derivative^2 to grad. magnitude (for assessing convergence)
            weights[i] = weights[i] - step_size*der_feat_i # Update weight[i] by subtracting step_size * derivative
        # Compute square root of gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
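Before using it on the house data, here is a quick sanity check on a tiny synthetic data set (a sketch with made-up numbers, not part of the assignment): for y = 2 + 3*x the solver should recover weights close to [2., 3.]:
In [ ]:
toy_features = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]]) # Constant column plus one feature
toy_output = np.array([2., 5., 8., 11.]) # Exactly 2 + 3*x
print regression_gradient_descent(toy_features, toy_output, np.array([0., 0.]), 1e-2, 1e-6) # Expect approx. [2., 3.]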
A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves the product of an error and a feature, the gradient itself will be very large, because the features are large (square feet) and the output is large (prices). So while you might expect 'tolerance' to be small, small is only relative to the size of the features.
For similar reasons, the step size will be much smaller than you might expect, but this is because the gradient has such large values.
Model 1: Running gradient descent with 1 feature: 'sqft_living'. The goal is to predict 'price'. The following cell sets the feature list, the output feature, the feature matrix and output vector, the initial weights, the step_size, and the tolerance for the first model.
In [46]:
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
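To see why the tolerance and step size above have these scales (a rough check, not part of the assignment), we can compute the first-iteration gradient magnitude with the initial weights; it should come out many orders of magnitude larger than the tolerance:
In [ ]:
scale_errors = predict_output(simple_feature_matrix, initial_weights) - output
scale_grad = [feature_derivative(scale_errors, simple_feature_matrix[:, i]) for i in range(2)]
print sqrt(sum(g**2.0 for g in scale_grad)) # First-iteration gradient magnitude vs. tolerance = 2.5e7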
Next, run gradient descent with the above parameters to determine the weights of each feature.
In [47]:
weights_model_1 = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
Quiz Question: What is the value of the weight for sqft_living -- the second element of weights_model_1 (rounded to 1 decimal place)?
In [48]:
round(weights_model_1[1], 1)
Out[48]:
Use the newly estimated weights and the predict_output() function to compute the predictions on the TEST data.
In [49]:
# First, from test data, create a feature matrix and output vector
(test_model_1_feature_matrix, test_output_model_1) = get_numpy_data(test_data, simple_features, my_output)
Now, compute predictions using test_model_1_feature_matrix and weights from above.
In [50]:
test_model_1_predictions = predict_output(test_model_1_feature_matrix, weights_model_1)
Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?
In [51]:
round(test_model_1_predictions[0] , 0)
Out[51]:
Now, with the predictions on test data, compute the RSS (Residual Sum of Squares) on the test data set.
In [52]:
RSS_test_model_1 = sum( (test_model_1_predictions-test_data['price'].values)**2.0 )
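For the comparison with model 2 at the end of the notebook, we can also print this value:
In [ ]:
print RSS_test_model_1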
Now, we use the Python sklearn library to find the weights and compare them to those from the gradient descent algorithm.
In [53]:
# Creating feature matrix with 'sqft_living' feature and output vector with 'price' feature
X_model_1 = train_data[ ['sqft_living'] ]
y_model_1 = train_data['price']
In [54]:
# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
lin_reg_model_1 = LinearRegression()
lin_reg_model_1.fit(X_model_1, y_model_1)
Out[54]:
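Before plotting, we can compare the two sets of weights numerically (mirroring the comparison done for model 2 further below); they should be in good agreement, as the figure below also shows:
In [ ]:
print 'Grad Desc Weights = Intercept: %.3e, sqft_living feat: %.3e' % (weights_model_1[0], weights_model_1[1])
print 'Sklearn Weights = Intercept: %.3e, sqft_living feat: %.3e' % (lin_reg_model_1.intercept_, lin_reg_model_1.coef_[0])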
In [55]:
# Creating x-vector for plotting. Then, defining line with weights from gradient descent and sklearn
x_vect_simple_reg = np.arange(0,14000+1,1)
y_model_1_grad_desc = weights_model_1[0] + weights_model_1[1]*x_vect_simple_reg
y_model_1_sklearn = lin_reg_model_1.intercept_ + lin_reg_model_1.coef_[0]*x_vect_simple_reg
The figure below shows that the weights from gradient descent and the sklearn library are in good agreement.
In [56]:
plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.',label= 'House Price Data')
plt.plot(x_vect_simple_reg, y_model_1_grad_desc, label= 'Gradient Descent')
plt.plot(x_vect_simple_reg, y_model_1_sklearn, '--' , label= 'Sklearn Library')
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=18)
plt.ylabel('House Price ($)', fontsize=18)
plt.title('Simple Linear Regression', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()
Now, we will use two features. First, we will plot the house price versus each of these two features.
In [57]:
plt.figure(figsize=(12,8))
plt.subplot(1, 2, 1)
plt.plot(train_data['sqft_living'], train_data['price'],'.')
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.subplot(1, 2, 2)
plt.plot(train_data['sqft_living15'], train_data['price'],'.')
plt.xlabel('Ave. ft^2 of 15 nearest neighbors', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.show()
The following cell sets the parameters for a second model:
In [58]:
# sqft_living15 is the average square footage of the 15 nearest neighbors.
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9
Use the parameters above to determine the weights for model 2.
In [59]:
weights_model_2_mulp_reg = regression_gradient_descent(feature_matrix, output,
initial_weights, step_size, tolerance)
Use the newly determined weights and the predict_output function to compute the predictions on the TEST data.
In [60]:
# First, creating the feature matrix and output array. Then, calculating predictions for the test data set.
(test_2_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)
test_2_feat_predictions = predict_output(test_2_feature_matrix, weights_model_2_mulp_reg)
Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?
In [61]:
round(test_2_feat_predictions[0] , 0)
Out[61]:
What is the actual price for the 1st house in the test data set?
In [62]:
test_data['price'][0]
Out[62]:
Now, we again use the Python sklearn library to find the weights and compare them to those from the gradient descent algorithm.
In [63]:
# Creating feature matrix with ['sqft_living', 'sqft_living15'] features and output vector with 'price' feature
X_model_2 = train_data[ ['sqft_living', 'sqft_living15'] ]
y_model_2 = train_data['price']
In [64]:
# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
lin_reg_model_2 = LinearRegression()
lin_reg_model_2.fit(X_model_2, y_model_2)
Out[64]:
In [65]:
print 'Grad Desc Weights = Intercept: %.3e, sqft_living feat: %.3e, sqft_living15 feat: %.3e' % (weights_model_2_mulp_reg[0], weights_model_2_mulp_reg[1], weights_model_2_mulp_reg[2])
print 'Sklearn Weights = Intercept: %.3e, sqft_living feat: %.3e, sqft_living15 feat: %.3e' % (lin_reg_model_2.intercept_, lin_reg_model_2.coef_[0], lin_reg_model_2.coef_[1])
The weights from gradient descent and sklearn are in good agreement.
Quiz Question: Which estimate was closer to the true price for the 1st house on the Test data set, model 1 or model 2?
In [66]:
diff_model_1_house_1_price = abs(test_model_1_predictions[0] - test_data['price'][0])
diff_model_2_house_1_price = abs(test_2_feat_predictions[0] - test_data['price'][0])
if diff_model_1_house_1_price < diff_model_2_house_1_price:
    print 'Model 1 closer to true price for 1st house'
else:
    print 'Model 2 closer to true price for 1st house'
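For reference, we can also print the two absolute differences computed above:
In [ ]:
print diff_model_1_house_1_price, diff_model_2_house_1_price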
Now, use your predictions and the output to compute the RSS for model 2 on TEST data.
In [67]:
RSS_test_model_2 = sum( (test_2_feat_predictions-test_data['price'].values)**2.0 )
print RSS_test_model_2
Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?
In [68]:
if RSS_test_model_1 < RSS_test_model_2:
    print 'RSS lower for Model 1'
else:
    print 'RSS lower for Model 2'