Regression Week 2: Multiple Regression (gradient descent)

In the first notebook we explored multiple regression using graphlab create. Now we will use graphlab along with numpy to solve for the regression weights with gradient descent.

In this notebook we will cover estimating multiple regression weights via gradient descent. You will:

  • Add a constant column of 1's to a graphlab SFrame to account for the intercept
  • Convert an SFrame into a Numpy array
  • Write a predict_output() function using Numpy
  • Write a numpy function to compute the derivative of the regression cost function with respect to a single weight (feature)
  • Write a gradient descent function to compute the regression weights given an initial weight vector, step size, and tolerance
  • Use the gradient descent function to estimate regression weights for multiple features

Fire up graphlab create

Make sure you have the latest version of graphlab (>= 1.7)


In [1]:
import graphlab

Load in house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.


In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1469501576.log
This non-commercial license of GraphLab Create for academic use is assigned to neo20iitkgp@gmail.com and will expire on July 05, 2017.

If we want to do any "feature engineering" like creating new features or adjusting existing ones we should do this directly using the SFrames as seen in the other Week 2 notebook. For this notebook, however, we will work with the existing features.

Convert to Numpy Array

Although SFrames offer a number of benefits to users (especially when using Big Data and built-in graphlab functions), in order to understand the details of the implementation of algorithms it's important to work with a library that allows for direct (and optimized) matrix operations. Numpy is a Python solution for working with matrices (or any multi-dimensional "array").

Recall that the predicted value given the weights and the features is just the dot product between the feature and weight vector. Similarly, if we put all of the features row-by-row in a matrix then the predicted value for all the observations can be computed by right multiplying the "feature matrix" by the "weight vector".
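
For example, with a tiny made-up feature matrix (one row per observation, first column the constant) and a weight vector, a single np.dot() call produces one prediction per row; this is just a sketch with toy numbers:

import numpy as np
toy_features = np.array([[1., 1180.],
                         [1., 2570.]])   # rows are observations: [constant, sqft_living]
toy_weights = np.array([1., 1.])         # [intercept, slope]
print np.dot(toy_features, toy_weights)  # [ 1181.  2571.] -- one prediction per observation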

First we need to take the SFrame of our data and convert it into a 2D numpy array (also called a matrix). One way to do this is graphlab's built-in .to_dataframe(), which converts the SFrame into a Pandas (another Python library) dataframe; we can then use Pandas' .as_matrix() to convert the dataframe into a numpy matrix.
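
As a rough sketch of both routes (the .as_matrix() call assumes an older pandas release where that method still exists; the cells below simply use .to_numpy() instead):

# via pandas -- assumes your pandas version still provides DataFrame.as_matrix()
example_matrix = sales[['sqft_living', 'bedrooms']].to_dataframe().as_matrix()
# or directly, with GraphLab Create >= 1.7.1
example_matrix = sales[['sqft_living', 'bedrooms']].to_numpy()
print example_matrix.shape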


In [3]:
print sales.head()
import numpy as np # note this allows us to refer to numpy as np instead


+------------+---------------------------+-----------+----------+-----------+
|     id     |            date           |   price   | bedrooms | bathrooms |
+------------+---------------------------+-----------+----------+-----------+
| 7129300520 | 2014-10-13 00:00:00+00:00 |  221900.0 |   3.0    |    1.0    |
| 6414100192 | 2014-12-09 00:00:00+00:00 |  538000.0 |   3.0    |    2.25   |
| 5631500400 | 2015-02-25 00:00:00+00:00 |  180000.0 |   2.0    |    1.0    |
| 2487200875 | 2014-12-09 00:00:00+00:00 |  604000.0 |   4.0    |    3.0    |
| 1954400510 | 2015-02-18 00:00:00+00:00 |  510000.0 |   3.0    |    2.0    |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000.0 |   4.0    |    4.5    |
| 1321400060 | 2014-06-27 00:00:00+00:00 |  257500.0 |   3.0    |    2.25   |
| 2008000270 | 2015-01-15 00:00:00+00:00 |  291850.0 |   3.0    |    1.5    |
| 2414600126 | 2015-04-15 00:00:00+00:00 |  229500.0 |   3.0    |    1.0    |
| 3793500160 | 2015-03-12 00:00:00+00:00 |  323000.0 |   3.0    |    2.5    |
+------------+---------------------------+-----------+----------+-----------+
+-------------+----------+--------+------------+------+-----------+-------+------------+
| sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above |
+-------------+----------+--------+------------+------+-----------+-------+------------+
|    1180.0   |   5650   |   1    |     0      |  0   |     3     |   7   |    1180    |
|    2570.0   |   7242   |   2    |     0      |  0   |     3     |   7   |    2170    |
|    770.0    |  10000   |   1    |     0      |  0   |     3     |   6   |    770     |
|    1960.0   |   5000   |   1    |     0      |  0   |     5     |   7   |    1050    |
|    1680.0   |   8080   |   1    |     0      |  0   |     3     |   8   |    1680    |
|    5420.0   |  101930  |   1    |     0      |  0   |     3     |   11  |    3890    |
|    1715.0   |   6819   |   2    |     0      |  0   |     3     |   7   |    1715    |
|    1060.0   |   9711   |   1    |     0      |  0   |     3     |   7   |    1060    |
|    1780.0   |   7470   |   1    |     0      |  0   |     3     |   7   |    1050    |
|    1890.0   |   6560   |   2    |     0      |  0   |     3     |   7   |    1890    |
+-------------+----------+--------+------------+------+-----------+-------+------------+
+---------------+----------+--------------+---------+-------------+
| sqft_basement | yr_built | yr_renovated | zipcode |     lat     |
+---------------+----------+--------------+---------+-------------+
|       0       |   1955   |      0       |  98178  | 47.51123398 |
|      400      |   1951   |     1991     |  98125  | 47.72102274 |
|       0       |   1933   |      0       |  98028  | 47.73792661 |
|      910      |   1965   |      0       |  98136  |   47.52082  |
|       0       |   1987   |      0       |  98074  | 47.61681228 |
|      1530     |   2001   |      0       |  98053  | 47.65611835 |
|       0       |   1995   |      0       |  98003  | 47.30972002 |
|       0       |   1963   |      0       |  98198  | 47.40949984 |
|      730      |   1960   |      0       |  98146  | 47.51229381 |
|       0       |   2003   |      0       |  98038  | 47.36840673 |
+---------------+----------+--------------+---------+-------------+
+---------------+---------------+-----+
|      long     | sqft_living15 | ... |
+---------------+---------------+-----+
| -122.25677536 |     1340.0    | ... |
|  -122.3188624 |     1690.0    | ... |
| -122.23319601 |     2720.0    | ... |
| -122.39318505 |     1360.0    | ... |
| -122.04490059 |     1800.0    | ... |
| -122.00528655 |     4760.0    | ... |
| -122.32704857 |     2238.0    | ... |
| -122.31457273 |     1650.0    | ... |
| -122.33659507 |     1780.0    | ... |
|  -122.0308176 |     2390.0    | ... |
+---------------+---------------+-----+
[10 rows x 21 columns]

Now we will write a function that will accept an SFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and a target feature (e.g. 'price') and will return two things:

  • A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
  • A numpy array containing the values of the output

With this in mind, complete the following function (where there's an empty line you should write a line of code that does what the comment above indicates)

Please note you will need GraphLab Create version at least 1.7.1 in order for .to_numpy() to work!


In [4]:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    features_sframe = data_sframe[features]
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    output_sarray = data_sframe[output]
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

For testing let's use the 'sqft_living' feature and a constant as our features and price as our output:


In [5]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print example_features[0,:] # this accesses the first row of the data the ':' indicates 'all columns'
print example_output[0] # and the corresponding output


[  1.00000000e+00   1.18000000e+03]
221900.0

Predicting output given regression weights

Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0] and we wanted to compute the predicted output 1.0*1.0 + 1.0*1180.0 = 1181.0. This is the dot product between these two arrays. If they're numpy arrays we can use np.dot() to compute this:


In [6]:
my_weights = np.array([1., 1.]) # the example weights
my_features = example_features[0,] # we'll use the first data point
predicted_value = np.dot(my_features, my_weights)
print predicted_value


1181.0

np.dot() also works when dealing with a matrix and a vector. Recall that the predictions for all the observations are just the RIGHT (as in weights on the right) dot product between the feature matrix and the weights vector. With this in mind, finish the following predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights:


In [7]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array
    # create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix, weights)
    return(predictions)

If you want to test your code run the following cell:


In [8]:
test_predictions = predict_output(example_features, my_weights)
print test_predictions[0] # should be 1181.0
print test_predictions[1] # should be 2571.0


1181.0
2571.0

Computing the Derivative

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)^2

Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:

2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)* [feature_i]

The term inside the parentheses is just the error (difference between prediction and output). So we can re-write this as:

2*error*[feature_i]

That is, the derivative for the weight of feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant feature, this is just twice the sum of the errors!

Recall that the sum of the products of the entries of two vectors is just their dot product. Therefore the derivative for the weight of feature_i is just two times the dot product between the values of feature_i and the current errors.
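
As a quick sanity check with made-up numbers, the dot-product form matches summing 2*error*feature term by term:

errors_toy = np.array([1., -2., 3.])
feature_toy = np.array([10., 20., 30.])
print 2 * np.dot(errors_toy, feature_toy)                      # 2*(10 - 40 + 90) = 120.0
print sum(2 * e * f for e, f in zip(errors_toy, feature_toy))  # same: 120.0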

With this in mind, complete the following derivative function, which computes the derivative with respect to a weight given the values of the corresponding feature (over all data points) and the errors (over all data points).


In [9]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2 * np.dot(errors, feature)
    return derivative

To test your feature derivative run the following:


In [10]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_output(example_features, my_weights) 
# just like SFrames, two numpy arrays can be elementwise subtracted with '-': 
errors = test_predictions - example_output # prediction errors; in this case they are just -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)
print derivative
print -np.sum(example_output)*2 # should be the same as derivative


-23345850022.0
-23345850022.0

Gradient Descent

Now we will write a function that performs gradient descent. The basic premise is simple: given a starting point, we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of increase, so the negative gradient is the direction of decrease, and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum, which we define by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.
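
Concretely, the stopping check compares the Euclidean norm of the gradient against the tolerance; the square root of the sum of squared partial derivatives accumulated in the loop below is the same quantity np.linalg.norm would give (made-up gradient values, just for illustration):

from math import sqrt
g = np.array([3., 4., 12.])              # a made-up gradient vector
print sqrt(sum(g_i ** 2 for g_i in g))   # 13.0
print np.linalg.norm(g)                  # also 13.0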

With this in mind, complete the following gradient descent function below using your derivative function above. For each step of the gradient descent we update the weight for each feature before checking our stopping criterion.


In [11]:
from math import sqrt # recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)

In [12]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False 
    weights = np.array(initial_weights) # make sure it's a numpy array
    count = 0 
    while not converged:
        print 'weights in ',count ,'iteration is: ', weights
        # compute the predictions based on feature_matrix and weights using your predict_output() function
        predictions = predict_output(feature_matrix, weights)
        # compute the errors as predictions - output
        errors = predictions - output
        gradient_sum_squares = 0 # initialize the gradient sum of squares
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): # loop over each weight
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative = feature_derivative(errors, feature_matrix[:, i])
            # add the squared value of the derivative to the gradient sum of squares (for assessing convergence)
            gradient_sum_squares = gradient_sum_squares + (derivative * derivative)
            # subtract the step size times the derivative from the current weight
            weights[i] = weights[i] - (step_size * derivative) 
        # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = sqrt(gradient_sum_squares)
#         print 'gradient_magnitude: ', gradient_magnitude , 'and tolerance: ', tolerance
        if gradient_magnitude < tolerance:
            converged = True
        count  = count + 1
    return(weights)

A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves a product of an error and a feature, the gradient itself will be very large, because the features are large (square feet) and the output is large (prices). So while you might expect "tolerance" to be small, small is only relative to the size of the features.

For similar reasons the step size will be much smaller than you might expect, but this is only because the gradient has such large values.

Running the Gradient Descent as Simple Regression

First let's split the data into training and test data.


In [13]:
train_data,test_data = sales.random_split(.8,seed=0)

Although the gradient descent function is designed for multiple regression, since the constant is now a feature we can use it to estimate the parameters of a simple regression on square feet. The following cell sets up the feature_matrix, output, initial weights and step size for the first model:


In [14]:
# let's test out the gradient descent
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
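
If you want to see why a tolerance of 2.5e7 still counts as "small", the following optional check (a sketch using the variables defined in the cell above) prints the gradient magnitude at the initial weights; it comes out far larger than the tolerance:

initial_errors = predict_output(simple_feature_matrix, initial_weights) - output
initial_gradient = np.array([feature_derivative(initial_errors, simple_feature_matrix[:, i])
                             for i in range(len(initial_weights))])
print np.linalg.norm(initial_gradient)  # several orders of magnitude above the 2.5e7 tolerance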

Next run your gradient descent with the above parameters.


In [15]:
simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
print simple_weights


weights in  0 iteration is:  [ -4.70000000e+04   1.00000000e+00]
weights in  1 iteration is:  [-46999.85779866    354.86068685]
weights in  2 iteration is:  [-46999.894732      262.96853711]
weights in  3 iteration is:  [-46999.88514683    286.83150776]
weights in  4 iteration is:  [-46999.88764179    280.6346632 ]
weights in  5 iteration is:  [-46999.88699974    282.24388793]
weights in  6 iteration is:  [-46999.88717231    281.82599715]
weights in  7 iteration is:  [-46999.88713334    281.93451693]
weights in  8 iteration is:  [-46999.88714931    281.90633602]
weights in  9 iteration is:  [-46999.88715101    281.91365417]
weights in  10 iteration is:  [-46999.88715641    281.91175376]
weights in  11 iteration is:  [-46999.88716085    281.91224727]
[-46999.88716555    281.91211912]

How do your weights compare to those achieved in week 1 (don't expect them to be exactly the same)?

Quiz Question: What is the value of the weight for sqft_living -- the second element of ‘simple_weights’ (rounded to 1 decimal place)?

Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first):


In [16]:
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)

Now compute your predictions using test_simple_feature_matrix and your weights from above.


In [17]:
predictions = predict_output(test_simple_feature_matrix, simple_weights)
print predictions


[ 356134.44317093  784640.86422788  435069.83652353 ...,  663418.65300782
  604217.10799338  240550.4743332 ]

Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?


In [18]:
print predictions[0]


356134.443171

Now that you have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. Recall that RSS is the sum of the squared errors (difference between prediction and output).


In [19]:
rss = ((predictions - test_output) ** 2).sum()
print rss


2.75400047593e+14

Running a multiple regression

Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:


In [20]:
model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. 
my_output = 'price'
(feature_matrix, multi_output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

Use the above parameters to estimate the model weights. Record these values for your quiz.


In [21]:
multi_weights = regression_gradient_descent(feature_matrix, multi_output, initial_weights, step_size, tolerance)
print multi_weights


weights in  0 iteration is:  [ -1.00000000e+05   1.00000000e+00   1.00000000e+00]
weights in  1 iteration is:  [-99999.91164747    217.89658253    196.92903735]
weights in  2 iteration is:  [-99999.94015235    153.06856217    133.50564933]
weights in  3 iteration is:  [-99999.93238686    175.5618987     150.5856632 ]
weights in  4 iteration is:  [-99999.93584555    170.91513236    142.75832416]
weights in  5 iteration is:  [-99999.93579816    174.63083049    142.71624204]
weights in  6 iteration is:  [-99999.93681401    175.69887247    140.31809353]
weights in  7 iteration is:  [-99999.93747645    177.53542398    138.7079085 ]
weights in  8 iteration is:  [-99999.93822543    179.08218133    136.90918121]
weights in  9 iteration is:  [-99999.93892505    180.66860703    135.2234583 ]
weights in  10 iteration is:  [-99999.93961798    182.19370677    133.55591875]
weights in  11 iteration is:  [-99999.94029152    183.68996794    131.9347091 ]
weights in  12 iteration is:  [-99999.94095012    185.14844983    130.34986642]
weights in  13 iteration is:  [-99999.94159289    186.57303406    128.803277  ]
weights in  14 iteration is:  [-99999.94222059    187.96359899    127.29318056]
weights in  15 iteration is:  [-99999.94283347    189.32123894    125.81897573]
weights in  16 iteration is:  [-99999.9434319     190.64664604    124.3797289 ]
weights in  17 iteration is:  [-99999.94401623    191.94061268    122.97463609]
weights in  18 iteration is:  [-99999.94458678    193.20387626    121.60287906]
weights in  19 iteration is:  [-99999.94514389    194.4371679     120.26366932]
weights in  20 iteration is:  [-99999.94568787    195.6411979     118.95623389]
weights in  21 iteration is:  [-99999.94621904    196.81666079    117.67981911]
weights in  22 iteration is:  [-99999.9467377     197.96423428    116.43368892]
weights in  23 iteration is:  [-99999.94724414    199.08458011    115.21712478]
weights in  24 iteration is:  [-99999.94773866    200.17834427    114.0294252 ]
weights in  25 iteration is:  [-99999.94822154    201.24615747    112.86990534]
weights in  26 iteration is:  [-99999.94869306    202.28863541    111.73789659]
weights in  27 iteration is:  [-99999.94915348    203.30637921    110.63274621]
weights in  28 iteration is:  [-99999.94960306    204.29997572    109.55381696]
weights in  29 iteration is:  [-99999.95004208    205.26999787    108.5004867 ]
weights in  30 iteration is:  [-99999.95047077    206.21700498    107.47214807]
weights in  31 iteration is:  [-99999.95088938    207.14154312    106.46820811]
weights in  32 iteration is:  [-99999.95129815    208.0441454     105.48808792]
weights in  33 iteration is:  [-99999.95169731    208.92533227    104.53122236]
weights in  34 iteration is:  [-99999.9520871     209.78561183    103.59705967]
weights in  35 iteration is:  [-99999.95246773    210.62548015    102.6850612 ]
weights in  36 iteration is:  [-99999.95283942    211.4454215     101.79470108]
weights in  37 iteration is:  [-99999.95320238    212.24590869    100.92546591]
weights in  38 iteration is:  [-99999.95355683    213.02740327    100.07685447]
weights in  39 iteration is:  [ -9.99999539e+04   2.13790356e+02   9.92483774e+01]
weights in  40 iteration is:  [ -9.99999542e+04   2.14535206e+02   9.84395571e+01]
weights in  41 iteration is:  [ -9.99999546e+04   2.15262384e+02   9.76499271e+01]
weights in  42 iteration is:  [ -9.99999549e+04   2.15972309e+02   9.68790320e+01]
weights in  43 iteration is:  [ -9.99999552e+04   2.16665390e+02   9.61264275e+01]
weights in  44 iteration is:  [ -9.99999555e+04   2.17342027e+02   9.53916794e+01]
weights in  45 iteration is:  [ -9.99999558e+04   2.18002609e+02   9.46743643e+01]
weights in  46 iteration is:  [ -9.99999561e+04   2.18647519e+02   9.39740683e+01]
weights in  47 iteration is:  [ -9.99999564e+04   2.19277127e+02   9.32903878e+01]
weights in  48 iteration is:  [ -9.99999567e+04   2.19891797e+02   9.26229285e+01]
weights in  49 iteration is:  [ -9.99999569e+04   2.20491883e+02   9.19713055e+01]
weights in  50 iteration is:  [ -9.99999572e+04   2.21077731e+02   9.13351431e+01]
weights in  51 iteration is:  [ -9.99999575e+04   2.21649679e+02   9.07140745e+01]
weights in  52 iteration is:  [ -9.99999577e+04   2.22208057e+02   9.01077416e+01]
weights in  53 iteration is:  [ -9.99999580e+04   2.22753187e+02   8.95157947e+01]
weights in  54 iteration is:  [ -9.99999582e+04   2.23285382e+02   8.89378925e+01]
weights in  55 iteration is:  [ -9.99999585e+04   2.23804951e+02   8.83737017e+01]
weights in  56 iteration is:  [ -9.99999587e+04   2.24312193e+02   8.78228972e+01]
weights in  57 iteration is:  [ -9.99999589e+04   2.24807399e+02   8.72851612e+01]
weights in  58 iteration is:  [ -9.99999591e+04   2.25290856e+02   8.67601836e+01]
weights in  59 iteration is:  [ -9.99999594e+04   2.25762842e+02   8.62476619e+01]
weights in  60 iteration is:  [ -9.99999596e+04   2.26223630e+02   8.57473004e+01]
weights in  61 iteration is:  [ -9.99999598e+04   2.26673485e+02   8.52588106e+01]
weights in  62 iteration is:  [ -9.99999600e+04   2.27112667e+02   8.47819108e+01]
weights in  63 iteration is:  [ -9.99999602e+04   2.27541428e+02   8.43163262e+01]
weights in  64 iteration is:  [ -9.99999604e+04   2.27960017e+02   8.38617881e+01]
weights in  65 iteration is:  [ -9.99999605e+04   2.28368674e+02   8.34180345e+01]
weights in  66 iteration is:  [ -9.99999607e+04   2.28767635e+02   8.29848096e+01]
weights in  67 iteration is:  [ -9.99999609e+04   2.29157130e+02   8.25618635e+01]
weights in  68 iteration is:  [ -9.99999611e+04   2.29537384e+02   8.21489523e+01]
weights in  69 iteration is:  [ -9.99999612e+04   2.29908616e+02   8.17458380e+01]
weights in  70 iteration is:  [ -9.99999614e+04   2.30271040e+02   8.13522881e+01]
weights in  71 iteration is:  [ -9.99999616e+04   2.30624865e+02   8.09680757e+01]
weights in  72 iteration is:  [ -9.99999617e+04   2.30970295e+02   8.05929792e+01]
weights in  73 iteration is:  [ -9.99999619e+04   2.31307529e+02   8.02267824e+01]
weights in  74 iteration is:  [ -9.99999620e+04   2.31636762e+02   7.98692740e+01]
weights in  75 iteration is:  [ -9.99999622e+04   2.31958183e+02   7.95202480e+01]
weights in  76 iteration is:  [ -9.99999623e+04   2.32271979e+02   7.91795031e+01]
weights in  77 iteration is:  [ -9.99999625e+04   2.32578329e+02   7.88468429e+01]
weights in  78 iteration is:  [ -9.99999626e+04   2.32877410e+02   7.85220754e+01]
weights in  79 iteration is:  [ -9.99999628e+04   2.33169396e+02   7.82050134e+01]
weights in  80 iteration is:  [ -9.99999629e+04   2.33454454e+02   7.78954742e+01]
weights in  81 iteration is:  [ -9.99999630e+04   2.33732748e+02   7.75932791e+01]
weights in  82 iteration is:  [ -9.99999631e+04   2.34004439e+02   7.72982541e+01]
weights in  83 iteration is:  [ -9.99999633e+04   2.34269685e+02   7.70102289e+01]
weights in  84 iteration is:  [ -9.99999634e+04   2.34528637e+02   7.67290374e+01]
weights in  85 iteration is:  [ -9.99999635e+04   2.34781445e+02   7.64545176e+01]
weights in  86 iteration is:  [ -9.99999636e+04   2.35028254e+02   7.61865111e+01]
weights in  87 iteration is:  [ -9.99999637e+04   2.35269208e+02   7.59248635e+01]
weights in  88 iteration is:  [ -9.99999638e+04   2.35504445e+02   7.56694237e+01]
weights in  89 iteration is:  [ -9.99999639e+04   2.35734101e+02   7.54200446e+01]
weights in  90 iteration is:  [ -9.99999640e+04   2.35958308e+02   7.51765824e+01]
weights in  91 iteration is:  [ -9.99999641e+04   2.36177195e+02   7.49388966e+01]
weights in  92 iteration is:  [ -9.99999642e+04   2.36390889e+02   7.47068502e+01]
weights in  93 iteration is:  [ -9.99999643e+04   2.36599512e+02   7.44803094e+01]
weights in  94 iteration is:  [ -9.99999644e+04   2.36803186e+02   7.42591436e+01]
weights in  95 iteration is:  [ -9.99999645e+04   2.37002027e+02   7.40432252e+01]
weights in  96 iteration is:  [ -9.99999646e+04   2.37196151e+02   7.38324298e+01]
weights in  97 iteration is:  [ -9.99999647e+04   2.37385669e+02   7.36266357e+01]
weights in  98 iteration is:  [ -9.99999648e+04   2.37570690e+02   7.34257244e+01]
weights in  99 iteration is:  [ -9.99999649e+04   2.37751321e+02   7.32295800e+01]
weights in  100 iteration is:  [ -9.99999650e+04   2.37927667e+02   7.30380894e+01]
weights in  101 iteration is:  [ -9.99999650e+04   2.38099828e+02   7.28511421e+01]
weights in  102 iteration is:  [ -9.99999651e+04   2.38267905e+02   7.26686304e+01]
weights in  103 iteration is:  [ -9.99999652e+04   2.38431994e+02   7.24904490e+01]
weights in  104 iteration is:  [ -9.99999653e+04   2.38592190e+02   7.23164952e+01]
weights in  105 iteration is:  [ -9.99999654e+04   2.38748585e+02   7.21466687e+01]
weights in  106 iteration is:  [ -9.99999654e+04   2.38901269e+02   7.19808715e+01]
weights in  107 iteration is:  [ -9.99999655e+04   2.39050331e+02   7.18190081e+01]
weights in  108 iteration is:  [ -9.99999656e+04   2.39195856e+02   7.16609851e+01]
weights in  109 iteration is:  [ -9.99999656e+04   2.39337928e+02   7.15067114e+01]
weights in  110 iteration is:  [ -9.99999657e+04   2.39476629e+02   7.13560980e+01]
weights in  111 iteration is:  [ -9.99999658e+04   2.39612040e+02   7.12090582e+01]
weights in  112 iteration is:  [ -9.99999658e+04   2.39744237e+02   7.10655071e+01]
weights in  113 iteration is:  [ -9.99999659e+04   2.39873298e+02   7.09253618e+01]
weights in  114 iteration is:  [ -9.99999660e+04   2.39999297e+02   7.07885418e+01]
weights in  115 iteration is:  [ -9.99999660e+04   2.40122307e+02   7.06549679e+01]
weights in  116 iteration is:  [ -9.99999661e+04   2.40242398e+02   7.05245633e+01]
weights in  117 iteration is:  [ -9.99999661e+04   2.40359639e+02   7.03972527e+01]
weights in  118 iteration is:  [ -9.99999662e+04   2.40474099e+02   7.02729627e+01]
weights in  119 iteration is:  [ -9.99999662e+04   2.40585843e+02   7.01516216e+01]
weights in  120 iteration is:  [ -9.99999663e+04   2.40694936e+02   7.00331595e+01]
weights in  121 iteration is:  [ -9.99999663e+04   2.40801441e+02   6.99175081e+01]
weights in  122 iteration is:  [ -9.99999664e+04   2.40905418e+02   6.98046007e+01]
weights in  123 iteration is:  [ -9.99999664e+04   2.41006929e+02   6.96943721e+01]
weights in  124 iteration is:  [ -9.99999665e+04   2.41106031e+02   6.95867588e+01]
weights in  125 iteration is:  [ -9.99999665e+04   2.41202782e+02   6.94816989e+01]
weights in  126 iteration is:  [ -9.99999666e+04   2.41297237e+02   6.93791315e+01]
weights in  127 iteration is:  [ -9.99999666e+04   2.41389451e+02   6.92789978e+01]
weights in  128 iteration is:  [ -9.99999667e+04   2.41479477e+02   6.91812398e+01]
weights in  129 iteration is:  [ -9.99999667e+04   2.41567368e+02   6.90858013e+01]
weights in  130 iteration is:  [ -9.99999668e+04   2.41653173e+02   6.89926272e+01]
weights in  131 iteration is:  [ -9.99999668e+04   2.41736942e+02   6.89016637e+01]
weights in  132 iteration is:  [ -9.99999668e+04   2.41818723e+02   6.88128585e+01]
weights in  133 iteration is:  [ -9.99999669e+04   2.41898565e+02   6.87261603e+01]
weights in  134 iteration is:  [ -9.99999669e+04   2.41976511e+02   6.86415192e+01]
weights in  135 iteration is:  [ -9.99999670e+04   2.42052609e+02   6.85588862e+01]
weights in  136 iteration is:  [ -9.99999670e+04   2.42126901e+02   6.84782138e+01]
weights in  137 iteration is:  [ -9.99999670e+04   2.42199430e+02   6.83994555e+01]
weights in  138 iteration is:  [ -9.99999671e+04   2.42270239e+02   6.83225658e+01]
weights in  139 iteration is:  [ -9.99999671e+04   2.42339367e+02   6.82475005e+01]
weights in  140 iteration is:  [ -9.99999671e+04   2.42406855e+02   6.81742161e+01]
weights in  141 iteration is:  [ -9.99999672e+04   2.42472742e+02   6.81026705e+01]
weights in  142 iteration is:  [ -9.99999672e+04   2.42537066e+02   6.80328225e+01]
weights in  143 iteration is:  [ -9.99999672e+04   2.42599864e+02   6.79646316e+01]
weights in  144 iteration is:  [ -9.99999673e+04   2.42661171e+02   6.78980587e+01]
weights in  145 iteration is:  [ -9.99999673e+04   2.42721024e+02   6.78330653e+01]
weights in  146 iteration is:  [ -9.99999673e+04   2.42779457e+02   6.77696140e+01]
weights in  147 iteration is:  [ -9.99999674e+04   2.42836504e+02   6.77076681e+01]
weights in  148 iteration is:  [ -9.99999674e+04   2.42892197e+02   6.76471920e+01]
weights in  149 iteration is:  [ -9.99999674e+04   2.42946569e+02   6.75881508e+01]
weights in  150 iteration is:  [ -9.99999674e+04   2.42999650e+02   6.75305103e+01]
weights in  151 iteration is:  [ -9.99999675e+04   2.43051473e+02   6.74742375e+01]
weights in  152 iteration is:  [ -9.99999675e+04   2.43102065e+02   6.74192998e+01]
weights in  153 iteration is:  [ -9.99999675e+04   2.43151457e+02   6.73656656e+01]
weights in  154 iteration is:  [ -9.99999675e+04   2.43199678e+02   6.73133040e+01]
weights in  155 iteration is:  [ -9.99999676e+04   2.43246754e+02   6.72621846e+01]
weights in  156 iteration is:  [ -9.99999676e+04   2.43292713e+02   6.72122782e+01]
weights in  157 iteration is:  [ -9.99999676e+04   2.43337582e+02   6.71635558e+01]
weights in  158 iteration is:  [ -9.99999676e+04   2.43381387e+02   6.71159895e+01]
weights in  159 iteration is:  [ -9.99999677e+04   2.43424152e+02   6.70695517e+01]
weights in  160 iteration is:  [ -9.99999677e+04   2.43465902e+02   6.70242157e+01]
weights in  161 iteration is:  [ -9.99999677e+04   2.43506662e+02   6.69799554e+01]
weights in  162 iteration is:  [ -9.99999677e+04   2.43546455e+02   6.69367452e+01]
weights in  163 iteration is:  [ -9.99999678e+04   2.43585303e+02   6.68945602e+01]
weights in  164 iteration is:  [ -9.99999678e+04   2.43623230e+02   6.68533761e+01]
weights in  165 iteration is:  [ -9.99999678e+04   2.43660257e+02   6.68131692e+01]
weights in  166 iteration is:  [ -9.99999678e+04   2.43696405e+02   6.67739162e+01]
weights in  167 iteration is:  [ -9.99999678e+04   2.43731696e+02   6.67355946e+01]
weights in  168 iteration is:  [ -9.99999679e+04   2.43766150e+02   6.66981822e+01]
weights in  169 iteration is:  [ -9.99999679e+04   2.43799786e+02   6.66616574e+01]
weights in  170 iteration is:  [ -9.99999679e+04   2.43832624e+02   6.66259992e+01]
weights in  171 iteration is:  [ -9.99999679e+04   2.43864682e+02   6.65911871e+01]
weights in  172 iteration is:  [ -9.99999679e+04   2.43895981e+02   6.65572010e+01]
weights in  173 iteration is:  [ -9.99999679e+04   2.43926536e+02   6.65240212e+01]
weights in  174 iteration is:  [ -9.99999680e+04   2.43956367e+02   6.64916286e+01]
weights in  175 iteration is:  [ -9.99999680e+04   2.43985490e+02   6.64600046e+01]
weights in  176 iteration is:  [ -9.99999680e+04   2.44013922e+02   6.64291309e+01]
weights in  177 iteration is:  [ -9.99999680e+04   2.44041679e+02   6.63989897e+01]
weights in  178 iteration is:  [ -9.99999680e+04   2.44068778e+02   6.63695637e+01]
weights in  179 iteration is:  [ -9.99999680e+04   2.44095233e+02   6.63408358e+01]
weights in  180 iteration is:  [ -9.99999681e+04   2.44121061e+02   6.63127896e+01]
weights in  181 iteration is:  [ -9.99999681e+04   2.44146277e+02   6.62854088e+01]
weights in  182 iteration is:  [ -9.99999681e+04   2.44170894e+02   6.62586776e+01]
weights in  183 iteration is:  [ -9.99999681e+04   2.44194927e+02   6.62325806e+01]
weights in  184 iteration is:  [ -9.99999681e+04   2.44218389e+02   6.62071029e+01]
weights in  185 iteration is:  [ -9.99999681e+04   2.44241295e+02   6.61822296e+01]
weights in  186 iteration is:  [ -9.99999681e+04   2.44263658e+02   6.61579465e+01]
weights in  187 iteration is:  [ -9.99999682e+04   2.44285490e+02   6.61342395e+01]
weights in  188 iteration is:  [ -9.99999682e+04   2.44306804e+02   6.61110951e+01]
weights in  189 iteration is:  [ -9.99999682e+04   2.44327612e+02   6.60884997e+01]
weights in  190 iteration is:  [ -9.99999682e+04   2.44347927e+02   6.60664404e+01]
weights in  191 iteration is:  [ -9.99999682e+04   2.44367759e+02   6.60449046e+01]
weights in  192 iteration is:  [ -9.99999682e+04   2.44387121e+02   6.60238797e+01]
weights in  193 iteration is:  [ -9.99999682e+04   2.44406024e+02   6.60033536e+01]
weights in  194 iteration is:  [ -9.99999682e+04   2.44424478e+02   6.59833146e+01]
weights in  195 iteration is:  [ -9.99999683e+04   2.44442494e+02   6.59637510e+01]
weights in  196 iteration is:  [ -9.99999683e+04   2.44460083e+02   6.59446515e+01]
weights in  197 iteration is:  [ -9.99999683e+04   2.44477255e+02   6.59260053e+01]
weights in  198 iteration is:  [ -9.99999683e+04   2.44494019e+02   6.59078014e+01]
weights in  199 iteration is:  [ -9.99999683e+04   2.44510385e+02   6.58900295e+01]
weights in  200 iteration is:  [ -9.99999683e+04   2.44526363e+02   6.58726792e+01]
weights in  201 iteration is:  [ -9.99999683e+04   2.44541962e+02   6.58557406e+01]
weights in  202 iteration is:  [ -9.99999683e+04   2.44557191e+02   6.58392038e+01]
weights in  203 iteration is:  [ -9.99999683e+04   2.44572059e+02   6.58230594e+01]
weights in  204 iteration is:  [ -9.99999684e+04   2.44586573e+02   6.58072981e+01]
weights in  205 iteration is:  [ -9.99999684e+04   2.44600744e+02   6.57919107e+01]
weights in  206 iteration is:  [ -9.99999684e+04   2.44614578e+02   6.57768884e+01]
weights in  207 iteration is:  [ -9.99999684e+04   2.44628084e+02   6.57622226e+01]
weights in  208 iteration is:  [ -9.99999684e+04   2.44641269e+02   6.57479047e+01]
weights in  209 iteration is:  [ -9.99999684e+04   2.44654142e+02   6.57339265e+01]
weights in  210 iteration is:  [ -9.99999684e+04   2.44666709e+02   6.57202799e+01]
weights in  211 iteration is:  [ -9.99999684e+04   2.44678978e+02   6.57069572e+01]
weights in  212 iteration is:  [ -9.99999684e+04   2.44690956e+02   6.56939505e+01]
weights in  213 iteration is:  [ -9.99999684e+04   2.44702650e+02   6.56812524e+01]
weights in  214 iteration is:  [ -9.99999685e+04   2.44714066e+02   6.56688557e+01]
weights in  215 iteration is:  [ -9.99999685e+04   2.44725212e+02   6.56567530e+01]
weights in  216 iteration is:  [ -9.99999685e+04   2.44736093e+02   6.56449375e+01]
weights in  217 iteration is:  [ -9.99999685e+04   2.44746716e+02   6.56334023e+01]
weights in  218 iteration is:  [ -9.99999685e+04   2.44757087e+02   6.56221409e+01]
weights in  219 iteration is:  [ -9.99999685e+04   2.44767211e+02   6.56111466e+01]
weights in  220 iteration is:  [ -9.99999685e+04   2.44777096e+02   6.56004132e+01]
weights in  221 iteration is:  [ -9.99999685e+04   2.44786746e+02   6.55899344e+01]
weights in  222 iteration is:  [ -9.99999685e+04   2.44796167e+02   6.55797043e+01]
weights in  223 iteration is:  [ -9.99999685e+04   2.44805364e+02   6.55697168e+01]
weights in  224 iteration is:  [ -9.99999685e+04   2.44814344e+02   6.55599664e+01]
weights in  225 iteration is:  [ -9.99999685e+04   2.44823110e+02   6.55504473e+01]
weights in  226 iteration is:  [ -9.99999686e+04   2.44831668e+02   6.55411540e+01]
weights in  227 iteration is:  [ -9.99999686e+04   2.44840023e+02   6.55320812e+01]
weights in  228 iteration is:  [ -9.99999686e+04   2.44848180e+02   6.55232237e+01]
weights in  229 iteration is:  [ -9.99999686e+04   2.44856144e+02   6.55145764e+01]
weights in  230 iteration is:  [ -9.99999686e+04   2.44863918e+02   6.55061342e+01]
weights in  231 iteration is:  [ -9.99999686e+04   2.44871508e+02   6.54978924e+01]
weights in  232 iteration is:  [ -9.99999686e+04   2.44878918e+02   6.54898460e+01]
weights in  233 iteration is:  [ -9.99999686e+04   2.44886152e+02   6.54819906e+01]
weights in  234 iteration is:  [ -9.99999686e+04   2.44893215e+02   6.54743216e+01]
weights in  235 iteration is:  [ -9.99999686e+04   2.44900110e+02   6.54668345e+01]
weights in  236 iteration is:  [ -9.99999686e+04   2.44906841e+02   6.54595251e+01]
weights in  237 iteration is:  [ -9.99999686e+04   2.44913413e+02   6.54523891e+01]
weights in  238 iteration is:  [ -9.99999686e+04   2.44919828e+02   6.54454224e+01]
weights in  239 iteration is:  [ -9.99999686e+04   2.44926092e+02   6.54386209e+01]
weights in  240 iteration is:  [ -9.99999687e+04   2.44932207e+02   6.54319809e+01]
weights in  241 iteration is:  [ -9.99999687e+04   2.44938177e+02   6.54254984e+01]
weights in  242 iteration is:  [ -9.99999687e+04   2.44944005e+02   6.54191697e+01]
weights in  243 iteration is:  [ -9.99999687e+04   2.44949695e+02   6.54129912e+01]
weights in  244 iteration is:  [ -9.99999687e+04   2.44955249e+02   6.54069593e+01]
weights in  245 iteration is:  [ -9.99999687e+04   2.44960673e+02   6.54010705e+01]
weights in  246 iteration is:  [ -9.99999687e+04   2.44965967e+02   6.53953214e+01]
weights in  247 iteration is:  [ -9.99999687e+04   2.44971136e+02   6.53897087e+01]
weights in  248 iteration is:  [ -9.99999687e+04   2.44976182e+02   6.53842291e+01]
weights in  249 iteration is:  [ -9.99999687e+04   2.44981108e+02   6.53788796e+01]
weights in  250 iteration is:  [ -9.99999687e+04   2.44985918e+02   6.53736570e+01]
weights in  251 iteration is:  [ -9.99999687e+04   2.44990613e+02   6.53685583e+01]
weights in  252 iteration is:  [ -9.99999687e+04   2.44995197e+02   6.53635806e+01]
weights in  253 iteration is:  [ -9.99999687e+04   2.44999673e+02   6.53587210e+01]
weights in  254 iteration is:  [ -9.99999687e+04   2.45004042e+02   6.53539767e+01]
weights in  255 iteration is:  [ -9.99999687e+04   2.45008307e+02   6.53493450e+01]
weights in  256 iteration is:  [ -9.99999688e+04   2.45012471e+02   6.53448231e+01]
weights in  257 iteration is:  [ -9.99999688e+04   2.45016537e+02   6.53404086e+01]
weights in  258 iteration is:  [ -9.99999688e+04   2.45020506e+02   6.53360988e+01]
weights in  259 iteration is:  [ -9.99999688e+04   2.45024380e+02   6.53318912e+01]
weights in  260 iteration is:  [ -9.99999688e+04   2.45028163e+02   6.53277835e+01]
weights in  261 iteration is:  [ -9.99999688e+04   2.45031856e+02   6.53237732e+01]
weights in  262 iteration is:  [ -9.99999688e+04   2.45035462e+02   6.53198581e+01]
weights in  263 iteration is:  [ -9.99999688e+04   2.45038982e+02   6.53160359e+01]
weights in  264 iteration is:  [ -9.99999688e+04   2.45042418e+02   6.53123043e+01]
weights in  265 iteration is:  [ -9.99999688e+04   2.45045773e+02   6.53086613e+01]
weights in  266 iteration is:  [ -9.99999688e+04   2.45049048e+02   6.53051047e+01]
weights in  267 iteration is:  [ -9.99999688e+04   2.45052246e+02   6.53016326e+01]
weights in  268 iteration is:  [ -9.99999688e+04   2.45055368e+02   6.52982427e+01]
weights in  269 iteration is:  [ -9.99999688e+04   2.45058415e+02   6.52949334e+01]
weights in  270 iteration is:  [ -9.99999688e+04   2.45061391e+02   6.52917025e+01]
weights in  271 iteration is:  [ -9.99999688e+04   2.45064295e+02   6.52885483e+01]
weights in  272 iteration is:  [ -9.99999688e+04   2.45067131e+02   6.52854689e+01]
weights in  273 iteration is:  [ -9.99999688e+04   2.45069900e+02   6.52824626e+01]
[ -9.99999688e+04   2.45072603e+02   6.52795277e+01]

Use your newly estimated weights and the predict_output function to compute the predictions on the TEST data. Don't forget to create a numpy array for these features from the test set first!


In [23]:
(test_multi_feature_matrix, multi_output) = get_numpy_data(test_data, model_features, my_output)
multi_predictions = predict_output(test_multi_feature_matrix, multi_weights)
print multi_predictions


[ 366651.41203656  762662.39786164  386312.09499712 ...,  682087.39928241
  585579.27865729  216559.20396617]

Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?


In [24]:
print multi_predictions[0]


366651.412037

What is the actual price for the 1st house in the test data set?


In [26]:
test_data['price'][0]


Out[26]:
310000.0

Quiz Question: Which estimate was closer to the true price for the 1st house on the Test data set, model 1 or model 2?
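
One quick way to answer: compare the absolute errors of the two predictions against the actual price (a sketch using the predictions and multi_predictions arrays computed above):

actual_price = test_data['price'][0]
print 'model 1 absolute error:', abs(predictions[0] - actual_price)
print 'model 2 absolute error:', abs(multi_predictions[0] - actual_price)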

Now use your predictions and the output to compute the RSS for model 2 on TEST data.


In [27]:
print 'prediction from first model is $356134 and prediction from 2nd model is $366651'


prediction from first model is $356134 and prediction from 2nd model is $366651

Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?


In [28]:
rss = ((multi_predictions - multi_output) ** 2).sum()
print rss


2.70263446465e+14

In [ ]:
print 'RSS from first model is 2.75400047593e+14 and RSS from 2nd model is 2.70263446465e+14'