Regression Week 4: Ridge Regression (gradient descent)

In this notebook, you will implement ridge regression via gradient descent. You will:

  • Convert an SFrame into a Numpy array
  • Write a Numpy function to compute the derivative of the regression weights with respect to a single feature
  • Write gradient descent function to compute the regression weights given an initial weight vector, step size, tolerance, and L2 penalty

Fire up graphlab create

Make sure you have the latest version of GraphLab Create (>= 1.7)


In [1]:
import numpy as np
import pandas as pd
import sys
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline

Load in house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.


In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [3]:
sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)


---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-3-0b9bbe787aaa> in <module>()
      1 sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)
----> 2 train_data = pd.read_csv('kc_house_train_data.csv', dtype=dtype_dict)
      3 test_data = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)

/Users/viktorp/anaconda/envs/dato-env/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    496                     skip_blank_lines=skip_blank_lines)
    497 
--> 498         return _read(filepath_or_buffer, kwds)
    499 
    500     parser_f.__name__ = name

/Users/viktorp/anaconda/envs/dato-env/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    273 
    274     # Create the parser.
--> 275     parser = TextFileReader(filepath_or_buffer, **kwds)
    276 
    277     if (nrows is not None) and (chunksize is not None):

/Users/viktorp/anaconda/envs/dato-env/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    588             self.options['has_index_names'] = kwds['has_index_names']
    589 
--> 590         self._make_engine(self.engine)
    591 
    592     def _get_options_with_defaults(self, engine):

/Users/viktorp/anaconda/envs/dato-env/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    729     def _make_engine(self, engine='c'):
    730         if engine == 'c':
--> 731             self._engine = CParserWrapper(self.f, **self.options)
    732         else:
    733             if engine == 'python':

/Users/viktorp/anaconda/envs/dato-env/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1101         kwds['allow_leading_cols'] = self.index_col is not False
   1102 
-> 1103         self._reader = _parser.TextReader(src, **kwds)
   1104 
   1105         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)()

pandas/parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)()

IOError: File kc_house_train_data.csv does not exist

If we want to do any "feature engineering" like creating new features or adjusting existing ones we should do this directly using the SFrames as seen in the first notebook of Week 2. For this notebook, however, we will work with the existing features.

Import useful functions from previous notebook

As in Week 2, we convert the SFrame into a 2D Numpy array. Copy and paste get_num_data() from the second notebook of Week 2.


In [ ]:
print(sales['sqft_living'].values.dtype)

In [ ]:
def get_numpy_data(dataset, features, output_name):
    dataset['constant'] = 1
    output = dataset[[output_name]].values
    return (dataset[['constant'] + features].values.reshape((len(output), len(features) + 1)), output.reshape((len(output), 1)))

In [ ]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') 
print(example_features[:5])
print(example_output[:5])

Also, copy and paste the predict_output() function to compute the predictions for an entire matrix of features given the matrix and the weights:


In [ ]:
def predict_output(X, w):
    return X.dot(w)

Computing the Derivative

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output, plus the L2 penalty term.

Cost(w)
= SUM[ (prediction - output)^2 ]
+ l2_penalty*(w[0]^2 + w[1]^2 + ... + w[k]^2).

Since the derivative of a sum is the sum of the derivatives, we can take the derivative of the first part (the RSS) as we did in the notebook for the unregularized case in Week 2 and add the derivative of the regularization part. As we saw, the derivative of the RSS with respect to w[i] can be written as:

2*SUM[ error*[feature_i] ].

The derivative of the regularization term with respect to w[i] is:

2*l2_penalty*w[i].

Summing both, we get

2*SUM[ error*[feature_i] ] + 2*l2_penalty*w[i].

That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself, plus 2*l2_penalty*w[i].

We will not regularize the constant. Thus, in the case of the constant, the derivative is just twice the sum of the errors (without the 2*l2_penalty*w[0] term).

Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors, plus 2*l2_penalty*w[i].

With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points). To decide when to we are dealing with the constant (so we don't regularize it) we added the extra parameter to the call feature_is_constant which you should set to True when computing the derivative of the constant and False otherwise.


In [ ]:
def feature_derivative_ridge(errors, feature, weight, l2_penalty, feature_is_constant):
    # If feature_is_constant is True, derivative is twice the dot product of errors and feature
    derivative = 2*feature.T.dot(errors)
    # Otherwise, derivative is twice the dot product plus 2*l2_penalty*weight
    if not feature_is_constant:
        derivative += 2*l2_penalty*weight
    return derivative

To test your feature derivartive run the following:


In [ ]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') 
my_weights = np.array([1., 10.], dtype=np.float16).reshape((2,1))
test_predictions = predict_output(example_features, my_weights) 
errors = test_predictions - example_output # prediction errors

In [ ]:
# next two lines should print the same values
print(feature_derivative_ridge(errors, example_features[:,1].reshape((len(example_features[:,1]), 1)), my_weights[1], 1, False))
print(np.sum(errors*example_features[:,1].reshape((len(example_features[:,1]), 1)))*2+20.)

In [ ]:
# next two lines should print the same values
print(feature_derivative_ridge(errors, example_features[:,0].reshape((len(example_features[:,0]), 1)), my_weights[0], 1, True))
print(np.sum(errors)*2.)

Gradient Descent

Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of increase and therefore the negative gradient is the direction of decrease and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. Unlike in Week 2, this time we will set a maximum number of iterations and take gradient steps until we reach this maximum number. If no maximum number is supplied, the maximum should be set 100 by default. (Use default parameter values in Python.)

With this in mind, complete the following gradient descent function below using your derivative function above. For each step in the gradient descent, we update the weight for each feature before computing our stopping criteria.


In [ ]:
def ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations=100):
    weights = np.array(initial_weights).reshape((len(initial_weights), 1)) # make sure it's a numpy array
    print('feature_matrix: %s' % (feature_matrix[:5,:]))
    iteration = 0
    while iteration < max_iterations:
    #while not reached maximum number of iterations:
        # compute the predictions based on feature_matrix and weights using your predict_output() function
        predictions = predict_output(feature_matrix, weights)
        # compute the errors as predictions - output
        errors = predictions - output
        old_weights = np.copy(weights)
        for i in range(len(weights)): # loop over each weight
            # Recall that feature_matrix[:,i] is the feature column associated with weights[i]
            # compute the derivative for weight[i].
            #(Remember: when i=0, you are computing the derivative of the constant!)
            derivative = feature_derivative_ridge(errors, feature_matrix[:, i], old_weights[i,0], l2_penalty, i == 0)
            # subtract the step size times the derivative from the current weight
            weights[i,0] -= step_size * derivative
        iteration += 1
    return weights

Visualizing effect of L2 penalty

The L2 penalty gets its name because it causes weights to have small L2 norms than otherwise. Let's see how large weights get penalized. Let us consider a simple model with 1 feature:


In [ ]:
simple_features = ['sqft_living']
my_output = 'price'

In this part, we will only use 'sqft_living' to predict 'price'. Use the get_numpy_data function to get a Numpy versions of your data with only this feature, for both the train_data and the test_data.


In [ ]:
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
(simple_test_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)

Let's set the parameters for our optimization:


In [ ]:
initial_weights = np.array([0., 0.])
step_size = 1e-12
max_iterations=1000
l2_penalty = 0

First, let's consider no regularization. Set the l2_penalty to 0.0 and run your ridge regression algorithm to learn the weights of your model. Call your weights:

simple_weights_0_penalty

we'll use them later.


In [ ]:
simple_weights_0_penalty = ridge_regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)

Next, let's consider high regularization. Set the l2_penalty to 1e11 and run your ridge regression algorithm to learn the weights of your model. Call your weights:

simple_weights_high_penalty

we'll use them later.


In [ ]:
l2_penalty = 1e11
simple_weights_high_penalty = ridge_regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)

In [ ]:
print(simple_weights_0_penalty)
print(simple_weights_high_penalty)

This code will plot the two learned models. (The blue line is for the model with no regularization and the red line is for the one with high regularization.)


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(simple_feature_matrix,output,'k.',
         simple_feature_matrix,predict_output(simple_feature_matrix, simple_weights_0_penalty),'b-',
        simple_feature_matrix,predict_output(simple_feature_matrix, simple_weights_high_penalty),'r-')

Compute the RSS on the TEST data for the following three sets of weights:

  1. The initial weights (all zeros)
  2. The weights learned with no regularization
  3. The weights learned with high regularization

Which weights perform best?


In [ ]:
(test_features, test_output) = get_numpy_data(test_data, ['sqft_living'], 'price')

In [ ]:
no_regularization_prediction = predict_output(test_features, simple_weights_0_penalty)
test_errors = no_regularization_prediction - test_output
RSS_no_penalty = test_errors.T.dot(test_errors)
print(RSS_no_penalty)

In [ ]:
high_regularization_prediction = predict_output(test_features, simple_weights_high_penalty)
test_errors = high_regularization_prediction - test_output
RSS_high_penalty = test_errors.T.dot(test_errors)
print(RSS_high_penalty)

QUIZ QUESTIONS

  1. What is the value of the coefficient for sqft_living that you learned with no regularization, rounded to 1 decimal place? What about the one with high regularization?
  2. Comparing the lines you fit with the with no regularization versus high regularization, which one is steeper?
  3. What are the RSS on the test data for each of the set of weights above (initial, no regularization, high regularization)?

In [ ]:
print(simple_weights_0_penalty[1,0])
print(simple_weights_high_penalty[1,0])

Running a multiple regression with L2 penalty

Let us now consider a model with 2 features: ['sqft_living', 'sqft_living15'].

First, create Numpy versions of your training and test data with these two features.


In [ ]:
model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. 
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
(test_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)

We need to re-inialize the weights, since we have one extra parameter. Let us also set the step size and maximum number of iterations.


In [ ]:
initial_weights = np.array([0.0,0.0,0.0])
step_size = 1e-12
max_iterations = 1000

First, let's consider no regularization. Set the l2_penalty to 0.0 and run your ridge regression algorithm to learn the weights of your model. Call your weights:

multiple_weights_0_penalty


In [ ]:
l2_penalty=0.0
multiple_weights_0_penalty = ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)

Next, let's consider high regularization. Set the l2_penalty to 1e11 and run your ridge regression algorithm to learn the weights of your model. Call your weights:

multiple_weights_high_penalty


In [ ]:
l2_penalty=1e11
multiple_weights_high_penalty = ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)

Compute the RSS on the TEST data for the following three sets of weights:

  1. The initial weights (all zeros)
  2. The weights learned with no regularization
  3. The weights learned with high regularization

Which weights perform best?


In [ ]:
all_zeros_weights = np.array([[0],[0],[0]])
test_predictions_all_zeros = predict_output(test_feature_matrix, all_zeros_weights)
test_errors = test_predictions_all_zeros - test_output
RSS_all_zeros_penalty = test_errors.T.dot(test_errors)
print(RSS_all_zeros_penalty)

In [ ]:
test_predictions_no = predict_output(test_feature_matrix, multiple_weights_0_penalty)
test_errors = test_predictions_no - test_output
RSS_no_penalty = test_errors.T.dot(test_errors)
print(RSS_no_penalty)

In [ ]:
test_predictions_high = predict_output(test_feature_matrix, multiple_weights_high_penalty)
test_errors = test_predictions_high - test_output
RSS_high_penalty = test_errors.T.dot(test_errors)
print(RSS_high_penalty)

Predict the house price for the 1st house in the test set using the no regularization and high regularization models. (Remember that python starts indexing from 0.) How far is the prediction from the actual price? Which weights perform best for the 1st house?


In [ ]:
print(test_predictions_no[0] - test_output[0])

In [ ]:
print(test_predictions_high[0] - test_output[0])

QUIZ QUESTIONS

  1. What is the value of the coefficient for sqft_living that you learned with no regularization, rounded to 1 decimal place? What about the one with high regularization?
  2. What are the RSS on the test data for each of the set of weights above (initial, no regularization, high regularization)?
  3. We make prediction for the first house in the test set using two sets of weights (no regularization vs high regularization). Which weights make better prediction for that particular house?

In [ ]:
print(multiple_weights_0_penalty[1])
print(multiple_weights_high_penalty[1])

In [ ]:
RSS_no_penalty[0][0]

In [ ]:
RSS_high_penalty[0,0]

Estimating 1 assignment


In [4]:
sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)

In [5]:
sales = sales.sort(['sqft_living','price'])


/Users/viktorp/anaconda/envs/dato-env/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  if __name__ == '__main__':

In [ ]: