In this notebook, we will cover estimating multiple regression weights via gradient descent. You will: convert a DataFrame into a NumPy feature matrix, write functions to compute predictions and the cost-function derivative, implement gradient descent, use it to fit two models on house-sales data, and compare their predictions and RSS on test data.
In [35]:
import os
import zipfile
from math import sqrt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
The dataset contains house sales from King County, the region where Seattle, WA is located.
In [36]:
# Put files in the current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]
In [37]:
# Filenames of unzipped files
unzip_files = ['kc_house_train_data.csv','kc_house_test_data.csv', 'kc_house_data.csv']
In [38]:
# If an unzipped file is not in files_list, unzip the corresponding archive
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        unzipping = zipfile.ZipFile(zip_file)
        unzipping.extractall()
        unzipping.close()
Loading Sales data, Sales training data, and Sales test data
In [39]:
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,
'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,
'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str,
'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int,
'id':str, 'sqft_lot':int, 'view':int}
In [40]:
# Loading sales data, sales training data, and test_data into DataFrames
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype = dtype_dict)
In [41]:
# Looking at head of training data DataFrame
train_data.head()
Out[41]:
Now, we will write a function that accepts a DataFrame, a list of features (e.g. ['sqft_living', 'bedrooms']), and a target feature (e.g. 'price'). This function will return two things: a feature matrix (a 2D NumPy array whose first column is a constant column of 1.0s, followed by the columns of the requested features) and an output array (a NumPy array with the values of the target feature).
In [42]:
def get_numpy_data(input_df, features, output):
    input_df['constant'] = 1.0 # Adding column 'constant' to input DataFrame with all values = 1.0
    features = ['constant'] + features # Adding 'constant' to the list of features
    feature_matrix = input_df.as_matrix(columns=features) # Convert DataFrame w/ columns in features list to np.ndarray
    output_array = input_df[output].values # Convert column with output feature into np.array
    return(feature_matrix, output_array)
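As a quick sanity check (a sketch, not part of the assignment), we can call the helper on the training data and confirm that the feature matrix has one row per house and one column per requested feature plus the constant column:
In [ ]:
(example_matrix, example_output) = get_numpy_data(train_data, ['sqft_living'], 'price')
print example_matrix.shape, example_output.shape
print example_matrix[0] # First row: [constant, sqft_living of the first house]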
Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0], and we wanted to compute the predicted output 1.0*1.0 + 1.0*1180.0 = 1181.0; this is the dot product between these two arrays. If they're numpy arrays, we can use np.dot() to compute it. With this in mind, write the predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights.
In [43]:
def predict_output(feature_matrix, weights):
    predictions = np.dot(feature_matrix, weights)
    return predictions
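As a quick check against the worked example above (a sketch, not part of the assignment), the prediction for a single row [1.0, 1180.0] with weights [1.0, 1.0] should come out to 1181.0:
In [ ]:
print predict_output(np.array([[1., 1180.]]), np.array([1., 1.])) # Expect [ 1181.]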
The cost function is the sum over the data points of the squared difference between an observed output and a predicted output. We can write the squared difference between the observed output and predicted output for a single point as follows:
(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)^2
Where we have k features and a constant. So the derivative with respect to weight w[i] is:
2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)*[feature_i]
The term inside the parentheses is just the error (the difference between the prediction and the output). So we can rewrite this as:
2*error*[feature_i]
That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.
With this in mind, complete the following derivative function, which computes the derivative with respect to a weight given the values of the corresponding feature (over all data points) and the errors (over all data points).
In [44]:
def feature_derivative(errors, feature):
    derivative = 2.0*np.dot(errors, feature)
    return derivative
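As a quick check (a sketch using the helpers defined above): with weights of all zeros the predictions are all zero, so the errors equal the negative of the output, and the derivative with respect to the constant feature should equal -2 times the sum of the output:
In [ ]:
(check_matrix, check_output) = get_numpy_data(train_data, ['sqft_living'], 'price')
check_errors = predict_output(check_matrix, np.array([0., 0.])) - check_output # All-zero weights
print feature_derivative(check_errors, check_matrix[:, 0]) # Derivative w.r.t. the constant feature
print -2.0*check_output.sum() # Should match the line above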
Now, we will write a function that performs gradient descent. The basic steps are as follows. Given a starting point, we update the current weights by moving in the negative gradient direction (since we're trying to minimize the cost function).
The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring that the magnitude (length) of the gradient vector be smaller than a fixed 'tolerance'.
In the gradient descent function below, for each step of the descent we update the weight for each feature before computing the stopping criterion.
In [45]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    weights = np.array(initial_weights) # Initializing the weights to be the initial weights
    while not converged:
        predictions = predict_output(feature_matrix, weights) # Finding predicted output w/ weights and feature_matrix
        errors = predictions - output # Computing error of predicted output and actual output for each data point
        gradient_sum_squares = 0 # Initialize the gradient sum of squares
        # While we haven't reached the tolerance, update the weight of each feature
        for i in range(len(weights)): # Loop over each weight
            der_feat_i = feature_derivative(errors, feature_matrix[:,i]) # Cost function derivative for feature i
            gradient_sum_squares += der_feat_i**2.0 # Add derivative^2 to grad. magnitude (for assessing convergence)
            weights[i] = weights[i] - step_size*der_feat_i # Update weight[i] by subtracting step_size * derivative
        # Compute square root of gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
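Before using it on the house data, here is a quick sanity check on a tiny synthetic data set (a sketch with made-up numbers, not part of the assignment): for y = 2 + 3*x the solver should recover weights close to [2., 3.]:
In [ ]:
toy_features = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]]) # Constant column plus one feature
toy_output = np.array([2., 5., 8., 11.]) # Exactly 2 + 3*x
print regression_gradient_descent(toy_features, toy_output, np.array([0., 0.]), 1e-2, 1e-6) # Expect approx. [2., 3.]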
A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves the product of an error and a feature, the gradient itself will be very large, because the features are large (square feet) and the output is large (prices). So while you might expect 'tolerance' to be small, small is only relative to the size of the features.
For similar reasons, the step size will be much smaller than you might expect, but this is because the gradient has such large values.
Model 1: Running gradient descent with 1 feature: 'sqft_living'. The goal is to predict 'price'. The following cell sets the feature list, the output feature, the feature matrix and output vector, the initial weights, the step_size, and the tolerance for the first model.
In [46]:
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
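To see why the tolerance and step size above have these scales (a rough check, not part of the assignment), we can compute the first-iteration gradient magnitude with the initial weights; it should come out many orders of magnitude larger than the tolerance:
In [ ]:
scale_errors = predict_output(simple_feature_matrix, initial_weights) - output
scale_grad = [feature_derivative(scale_errors, simple_feature_matrix[:, i]) for i in range(2)]
print sqrt(sum(g**2.0 for g in scale_grad)) # First-iteration gradient magnitude vs. tolerance = 2.5e7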
Next, run gradient descent with the above parameters to determine the weights of each feature.
In [47]:
weights_model_1 = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
Quiz Question: What is the value of the weight for sqft_living -- the second element of weights_model_1 (rounded to 1 decimal place)?
In [48]:
round(weights_model_1[1], 1)
Out[48]:
Use the newly estimated weights and the predict_output() function to compute the predictions on the TEST data.
In [49]:
# First, from test data, create a feature matrix and output vector
(test_model_1_feature_matrix, test_output_model_1) = get_numpy_data(test_data, simple_features, my_output)
Now, compute predictions using test_model_1_feature_matrix and weights from above.
In [50]:
test_model_1_predictions = predict_output(test_model_1_feature_matrix, weights_model_1)
Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?
In [51]:
round(test_model_1_predictions[0] , 0)
Out[51]:
Now, with the predictions on test data, compute the RSS (Residual Sum of Squares) on the test data set.
In [52]:
RSS_test_model_1 = sum( (test_model_1_predictions-test_data['price'].values)**2.0 )
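For the comparison with model 2 at the end of the notebook, we can also print this value:
In [ ]:
print RSS_test_model_1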
Now, we use the Python sklearn library to find the weights and compare them to those from the gradient descent algorithm.
In [53]:
# Creating feature matrix with 'sqft_living' feature and output vector with 'price' feature
X_model_1 = train_data[ ['sqft_living'] ]
y_model_1 = train_data['price']
In [54]:
# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
lin_reg_model_1 = LinearRegression()
lin_reg_model_1.fit(X_model_1, y_model_1)
Out[54]:
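Before plotting, we can compare the two sets of weights numerically (mirroring the comparison done for model 2 further below); they should be in good agreement, as the figure below also shows:
In [ ]:
print 'Grad Desc Weights = Intercept: %.3e, sqft_living feat: %.3e' % (weights_model_1[0], weights_model_1[1])
print 'Sklearn Weights = Intercept: %.3e, sqft_living feat: %.3e' % (lin_reg_model_1.intercept_, lin_reg_model_1.coef_[0])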
In [55]:
# Creating x-vector for plotting. Then, defining line with weights from gradient descent and sklearn
x_vect_simple_reg = np.arange(0,14000+1,1)
y_model_1_grad_desc = weights_model_1[0] + weights_model_1[1]*x_vect_simple_reg
y_model_1_sklearn = lin_reg_model_1.intercept_ + lin_reg_model_1.coef_[0]*x_vect_simple_reg
The figure below shows that the weights from gradient descent and the sklearn library are in good agreement.
In [56]:
plt.figure(figsize=(8,6))
plt.plot(train_data['sqft_living'], train_data['price'],'.',label= 'House Price Data')
plt.plot(x_vect_simple_reg, y_model_1_grad_desc, label= 'Gradient Descent')
plt.plot(x_vect_simple_reg, y_model_1_sklearn, '--' , label= 'Sklearn Library')
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=18)
plt.ylabel('House Price ($)', fontsize=18)
plt.title('Simple Linear Regression', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()
Now, we will use two features. First, we will plot the house price versus each of these two features.
In [57]:
plt.figure(figsize=(12,8))
plt.subplot(1, 2, 1)
plt.plot(train_data['sqft_living'], train_data['price'],'.')
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.subplot(1, 2, 2)
plt.plot(train_data['sqft_living15'], train_data['price'],'.')
plt.xlabel('Ave. ft^2 of 15 nearest neighbors', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.show()
The following cell sets the parameters for a second model:
In [58]:
# sqft_living15 is the average square footage of the 15 nearest neighbors.
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9
Use the parameters above to determine the weights for model 2.
In [59]:
weights_model_2_mulp_reg = regression_gradient_descent(feature_matrix, output,
initial_weights, step_size, tolerance)
Use the newly determined weights and the predict_output function to compute the predictions on the TEST data.
In [60]:
# First, creating the feature matrix and output array. Then, calculating predictions for the test data set.
(test_2_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)
test_2_feat_predictions = predict_output(test_2_feature_matrix, weights_model_2_mulp_reg)
Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?
In [61]:
round(test_2_feat_predictions[0] , 0)
Out[61]:
What is the actual price for the 1st house in the test data set?
In [62]:
test_data['price'][0]
Out[62]:
Now, we again use the Python sklearn library to find the weights and compare them to those from the gradient descent algorithm.
In [63]:
# Creating feature matrix with ['sqft_living', 'sqft_living15'] features and output vector with 'price' feature
X_model_2 = train_data[ ['sqft_living', 'sqft_living15'] ]
y_model_2 = train_data['price']
In [64]:
# Creating a LinearRegression Object. Then, performing linear regression on feature matrix and output vector
lin_reg_model_2 = LinearRegression()
lin_reg_model_2.fit(X_model_2, y_model_2)
Out[64]:
In [65]:
print 'Grad Desc Weights = Intercept: %.3e, sqft_living feat: %.3e, sqft_living15 feat: %.3e' % (weights_model_2_mulp_reg[0], weights_model_2_mulp_reg[1], weights_model_2_mulp_reg[2])
print 'Sklearn Weights = Intercept: %.3e, sqft_living feat: %.3e, sqft_living15 feat: %.3e' % (lin_reg_model_2.intercept_, lin_reg_model_2.coef_[0], lin_reg_model_2.coef_[1])
The weights from gradient descent and sklearn are in good agreement.
Quiz Question: Which estimate was closer to the true price for the 1st house on the Test data set, model 1 or model 2?
In [66]:
diff_model_1_house_1_price = abs(test_model_1_predictions[0] - test_data['price'][0])
diff_model_2_house_1_price = abs(test_2_feat_predictions[0] - test_data['price'][0])
if diff_model_1_house_1_price < diff_model_2_house_1_price:
    print 'Model 1 closer to true price for 1st house'
else:
    print 'Model 2 closer to true price for 1st house'
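For reference, we can also print the two absolute differences computed above:
In [ ]:
print diff_model_1_house_1_price, diff_model_2_house_1_price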
Now, use your predictions and the output to compute the RSS for model 2 on TEST data.
In [67]:
RSS_test_model_2 = sum( (test_2_feat_predictions-test_data['price'].values)**2.0 )
print RSS_test_model_2
Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?
In [68]:
if RSS_test_model_1 < RSS_test_model_2:
    print 'RSS lower for Model 1'
else:
    print 'RSS lower for Model 2'