Regression Week 4: Ridge Regression (interpretation)

In this notebook, we will run ridge regression multiple times with different L2 penalties to see which one produces the best fit. We will revisit the example of polynomial regression as a means to see the effect of L2 regularization. In particular, we will:

Use Sklearn to run polynomial regression
Use matplotlib to visualize polynomial regressions
Use Sklearn to run polynomial regression with L2 penalty
Use matplotlib to visualize polynomial regressions under L2 regularization
Choose best L2 penalty using cross-validation.
Assess the final fit using test data.

In the next programming assignment for this module, you will implement your own ridge regression learning algorithm using gradient descent.

Importing Libraries



In [50]:

    
import os
import zipfile
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Unzipping files with house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.



In [51]:

    
# Put files in current direction into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]



In [52]:

    
# Filenames of unzipped files
unzip_files = ['kc_house_data.csv','wk3_kc_house_set_1_data.csv', 'wk3_kc_house_set_2_data.csv',
               'wk3_kc_house_set_3_data.csv', 'wk3_kc_house_set_4_data.csv', 'wk3_kc_house_test_data.csv',
               'wk3_kc_house_train_data.csv', 'wk3_kc_house_valid_data.csv', 'wk3_kc_house_train_valid_shuffled.csv']



In [53]:

    
# If upzipped file not in files_list, unzip the file
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        unzipping = zipfile.ZipFile(zip_file)
        unzipping.extractall()
        unzipping.close

Polynomial regression, revisited

We build on the material from Week 3, where we wrote the function to produce an SFrame with columns containing the powers of a given input. Copy and paste the function polynomial_sframe from Week 3:



In [54]:

    
def polynomial_dataframe(feature, degree): # feature is pandas.Series type
    # assume that degree >= 1
    # initialize the dataframe:
    poly_dataframe = pd.DataFrame()
    # and set poly_dataframe['power_1'] equal to the passed feature
    poly_dataframe['power_1'] = feature
    
    # first check if degree > 1
    if degree > 1:
        # then loop over the remaining degrees:
        for power in range(2, degree+1):
            # first we'll give the column a name:
            name = 'power_' + str(power)
            # assign poly_dataframe[name] to be feature^power; use apply(*)
            poly_dataframe[name] = poly_dataframe['power_1'].apply(lambda x: x**power) 
    return poly_dataframe

Let's use matplotlib to visualize what a polynomial regression looks like on the house data.



In [55]:

    
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 
              'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 
              'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}



In [56]:

    
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)

As in Week 3, we will use the sqft_living variable. For plotting purposes (connecting the dots), you'll need to sort by the values of sqft_living. For houses with identical square footage, we break the tie by their prices.



In [57]:

    
sales = sales.sort_values(['sqft_living', 'price'])
sales[['sqft_living', 'price']].head()









    Out[57]:






  
    
      
      sqft_living
      price
    
  
  
    
      19452
      290
      142000
    
    
      15381
      370
      276000
    
    
      860
      380
      245000
    
    
      18379
      384
      265000
    
    
      4868
      390
      228000

Plotting the data we are working with



In [58]:

    
plt.figure(figsize=(8,6))
plt.plot(sales['sqft_living'], sales['price'],'.')
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.title('King County, Seattle House Price Data', fontsize=18)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

Let us revisit the 15th-order polynomial model using the 'sqft_living' input. Generate polynomial features up to degree 15 using polynomial_dataframe() and fit a model with these features. When fitting the model, use an L2 penalty of 1e-5:



In [59]:

    
# Bulding dataframe with 15 polynomial features
poly15_data = polynomial_dataframe(sales['sqft_living'], 15)



In [60]:

    
l2_small_penalty = 1e-5



In [61]:

    
model = linear_model.Ridge(alpha=l2_small_penalty, normalize=True)
model.fit(poly15_data, sales['price'])









    Out[61]:





Ridge(alpha=1e-05, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, solver='auto', tol=0.001)

Note: When we have so many features and so few data points, the solution can become highly numerically unstable, which can sometimes lead to strange unpredictable results. Thus, rather than using no regularization, we will introduce a tiny amount of regularization (l2_penalty=1e-5) to make the solution numerically stable. (In lecture, we discussed the fact that regularization can also help with numerical stability, and here we are seeing a practical example.)

With the L2 penalty specified above, fit the model and print out the learned weights.

QUIZ QUESTION: What's the learned value for the coefficient of feature power_1?



In [62]:

    
print 'Weight for power_1 feature is: %.2f' % (model.coef_[0])









    



Weight for power_1 feature is: 174.43

Observe overfitting

Recall from Week 3 that the polynomial fit of degree 15 changed wildly whenever the data changed. In particular, when we split the sales data into four subsets and fit the model of degree 15, the result came out to be very different for each subset. The model had a high variance. We will see in a moment that ridge regression reduces such variance. But first, we must reproduce the experiment we did in Week 3.

First, split the data into split the sales data into four subsets of roughly equal size and call them set_1, set_2, set_3, and set_4. Use .random_split function and make sure you set seed=0.



In [63]:

    
set_1 = pd.read_csv('wk3_kc_house_set_1_data.csv', dtype=dtype_dict)
set_2 = pd.read_csv('wk3_kc_house_set_2_data.csv', dtype=dtype_dict)
set_3 = pd.read_csv('wk3_kc_house_set_3_data.csv', dtype=dtype_dict)
set_4 = pd.read_csv('wk3_kc_house_set_4_data.csv', dtype=dtype_dict)

Next, fit a 15th degree polynomial on set_1, set_2, set_3, and set_4, using 'sqft_living' to predict prices. Print the weights and make a plot of the resulting model.



In [64]:

    
# Putting data and keys in a list for looping
data_list = [set_1, set_2, set_3, set_4]
key_list = ['set_1', 'set_2', 'set_3', 'set_4']



In [65]:

    
# model_poly_deg is a dict which holds all the regression models for the ith polynomial fit
poly_15_dframe_dict = {}
models_poly_15_dict = {}



In [66]:

    
# First, learn models with a really small L2 penalty
l2_small_penalty = 1e-9



In [67]:

    
# Looping over polynomial features from 1-15
for key, dframe in zip(key_list, data_list):
    
    # Entering each dataframe returned from polynomial_dataframe function into a dict
    # Then, saving col_names into a list to do regression w/ these features. Then, adding price column to dataframe
    poly_15_dframe_dict[key] = polynomial_dataframe(dframe['sqft_living'], 15)
    
    # Adding regression models to dicts
    models_poly_15_dict[key] = linear_model.Ridge(alpha=l2_small_penalty, normalize=True)
    models_poly_15_dict[key].fit( poly_15_dframe_dict[key], dframe['price'] )

Plotting the data and the 4 different 15 degree polynomials we learned from the data.



In [68]:

    
plt.figure(figsize=(8,6))
plt.plot(sales['sqft_living'], sales['price'],'.', label= 'House Price Data')
plt.hold(True)
#
for i, key in enumerate(key_list):
    leg_label = 'Model ' + str(i+1)
    plt.plot( poly_15_dframe_dict[key]['power_1'], models_poly_15_dict[key].predict(poly_15_dframe_dict[key]), '-', label = leg_label )
#             
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.title('4 Diff. 15th Deg. Polynomial Regr. Models, Small L2 Penalty', fontsize=16)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

The four curves should differ from one another a lot, as should the coefficients you learned.

QUIZ QUESTION: For the models learned in each of these training sets, what are the smallest and largest values you learned for the coefficient of feature power_1? (For the purpose of answering this question, negative numbers are considered "smaller" than positive numbers. So -5 is smaller than -3, and -3 is smaller than 5 and so forth.)



In [69]:

    
power_l_coeff_list = []
for key in key_list:
    power_l_coeff_list.append( models_poly_15_dict[key].coef_[0] )



In [70]:

    
print 'Smallest power_1 weight with small L2 penalty is: %.2f' %( min(power_l_coeff_list))
print 'Largest  power_1 weight with small L2 penalty is: %.2f' %( max(power_l_coeff_list))









    



Smallest power_1 weight with small L2 penalty is: -755.40
Largest  power_1 weight with small L2 penalty is: 1119.45

Ridge regression comes to rescue

Generally, whenever we see weights change so much in response to change in data, we believe the variance of our estimate to be large. Ridge regression aims to address this issue by penalizing "large" weights. (Weights of model15 looked quite small, but they are not that small because 'sqft_living' input is in the order of thousands.)

With the argument l2_penalty=1e5, fit a 15th-order polynomial model on set_1, set_2, set_3, and set_4. Other than the change in the l2_penalty parameter, the code should be the same as the experiment above.



In [71]:

    
# model_poly_deg is a dict which holds all the regression models for the ith polynomial fit
poly_15_dframe_dict = {}
models_poly_15_dict = {}



In [72]:

    
# Re-learn models with a large L2 penalty
l2_large_penalty=1.23e2



In [73]:

    
# Looping over polynomial features from 1-15
for key, dframe in zip(key_list, data_list):
    
    # Entering each dataframe returned from polynomial_dataframe function into a dict
    # Then, saving col_names into a list to do regression w/ these features. Then, adding price column to dataframe
    poly_15_dframe_dict[key] = polynomial_dataframe(dframe['sqft_living'], 15)
    
    # Adding regression models to dicts
    models_poly_15_dict[key] = linear_model.Ridge(alpha=l2_large_penalty, normalize=True)
    models_poly_15_dict[key].fit( poly_15_dframe_dict[key], dframe['price'] )



In [74]:

    
plt.figure(figsize=(8,6))
plt.plot(sales['sqft_living'], sales['price'],'.', label= 'House Price Data')
plt.hold(True)
#
for i, key in enumerate(key_list):
    leg_label = 'Model ' + str(i+1)
    plt.plot( poly_15_dframe_dict[key]['power_1'], models_poly_15_dict[key].predict(poly_15_dframe_dict[key]), '-', label = leg_label )
#             
plt.hold(False)
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.title('4 Diff. 15th Deg. Polynomial Regr. Models, Large L2 Penalty', fontsize=16)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

These curves should vary a lot less, now that you applied a high degree of regularization.

QUIZ QUESTION: For the models learned with the high level of regularization in each of these training sets, what are the smallest and largest values you learned for the coefficient of feature power_1? (For the purpose of answering this question, negative numbers are considered "smaller" than positive numbers. So -5 is smaller than -3, and -3 is smaller than 5 and so forth.)



In [75]:

    
power_l_coeff_list = []
for key in key_list:
    power_l_coeff_list.append( models_poly_15_dict[key].coef_[0] )



In [76]:

    
print 'Smallest power_1 weight with large L2 penalty is: %.2f' %( min(power_l_coeff_list))
print 'Largest  power_1 weight with large L2 penalty is: %.2f' %( max(power_l_coeff_list))









    



Smallest power_1 weight with large L2 penalty is: 2.09
Largest  power_1 weight with large L2 penalty is: 2.33

Selecting an L2 penalty via cross-validation

Just like the polynomial degree, the L2 penalty is a "magic" parameter we need to select. We could use the validation set approach as we did in the last module, but that approach has a major disadvantage: it leaves fewer observations available for training. Cross-validation seeks to overcome this issue by using all of the training set in a smart way.

We will implement a kind of cross-validation called k-fold cross-validation. The method gets its name because it involves dividing the training set into k segments of roughtly equal size. Similar to the validation set method, we measure the validation error with one of the segments designated as the validation set. The major difference is that we repeat the process k times as follows:

Set aside segment 0 as the validation set, and fit a model on rest of data, and evalutate it on this validation set
Set aside segment 1 as the validation set, and fit a model on rest of data, and evalutate it on this validation set
...
Set aside segment k-1 as the validation set, and fit a model on rest of data, and evalutate it on this validation set

After this process, we compute the average of the k validation errors, and use it as an estimate of the generalization error. Notice that all observations are used for both training and validation, as we iterate over segments of data.

To estimate the generalization error well, it is crucial to shuffle the training data before dividing them into segments. We reserve 10% of the data as the test set and shuffle the remainder.



In [77]:

    
train_valid_shuffled = pd.read_csv('wk3_kc_house_train_valid_shuffled.csv', dtype=dtype_dict)
test = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)

Once the data is shuffled, we divide it into equal segments. Each segment should receive n/k elements, where n is the number of observations in the training set and k is the number of segments. Since the segment 0 starts at index 0 and contains n/k elements, it ends at index (n/k)-1. The segment 1 starts where the segment 0 left off, at index (n/k). With n/k elements, the segment 1 ends at index (n*2/k)-1. Continuing in this fashion, we deduce that the segment i starts at index (n*i/k) and ends at (n*(i+1)/k)-1.

With this pattern in mind, we write a short loop that prints the starting and ending indices of each segment, just to make sure you are getting the splits right.



In [78]:

    
n = len(train_valid_shuffled)
k = 10 # 10-fold cross-validation

for i in xrange(k):
    start = (n*i)/k
    end = (n*(i+1))/k-1
    print i, (start, end)









    



0 (0, 1938)
1 (1939, 3878)
2 (3879, 5817)
3 (5818, 7757)
4 (7758, 9697)
5 (9698, 11636)
6 (11637, 13576)
7 (13577, 15515)
8 (15516, 17455)
9 (17456, 19395)

Let us familiarize ourselves with array slicing with DataFrames. To extract a continuous slice from a DataFrame, use colon in square brackets. For instance, the following cell extracts rows 0 to 9 of train_valid_shuffled. Notice that the first index (0) is included in the slice but the last index (10) is omitted.



In [79]:

    
train_valid_shuffled[0:10] # rows 0 to 9









    Out[79]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      grade
      sqft_above
      sqft_basement
      yr_built
      yr_renovated
      zipcode
      lat
      long
      sqft_living15
      sqft_lot15
    
  
  
    
      0
      2780400035
      20140505T000000
      665000
      4
      2.50
      2800
      5900
      1
      0
      0
      ...
      8
      1660
      1140
      1963
      0
      98115
      47.6809
      -122.286
      2580
      5900
    
    
      1
      1703050500
      20150321T000000
      645000
      3
      2.50
      2490
      5978
      2
      0
      0
      ...
      9
      2490
      0
      2003
      0
      98074
      47.6298
      -122.022
      2710
      6629
    
    
      2
      5700002325
      20140605T000000
      640000
      3
      1.75
      2340
      4206
      1
      0
      0
      ...
      7
      1170
      1170
      1917
      0
      98144
      47.5759
      -122.288
      1360
      4725
    
    
      3
      0475000510
      20141118T000000
      594000
      3
      1.00
      1320
      5000
      1
      0
      0
      ...
      7
      1090
      230
      1920
      0
      98107
      47.6674
      -122.365
      1700
      5000
    
    
      4
      0844001052
      20150128T000000
      365000
      4
      2.50
      1904
      8200
      2
      0
      0
      ...
      7
      1904
      0
      1999
      0
      98010
      47.3107
      -122.001
      1560
      12426
    
    
      5
      2781280290
      20150427T000000
      305000
      3
      2.50
      1610
      3516
      2
      0
      0
      ...
      8
      1610
      0
      2006
      0
      98055
      47.4491
      -122.188
      1610
      3056
    
    
      6
      2214800630
      20141105T000000
      239950
      3
      2.25
      1560
      8280
      2
      0
      0
      ...
      7
      1560
      0
      1979
      0
      98001
      47.3393
      -122.259
      1920
      8120
    
    
      7
      2114700540
      20141021T000000
      366000
      3
      2.50
      1320
      4320
      1
      0
      0
      ...
      6
      660
      660
      1918
      0
      98106
      47.5327
      -122.347
      1190
      4200
    
    
      8
      2596400050
      20140730T000000
      375000
      3
      1.00
      1960
      7955
      1
      0
      0
      ...
      7
      1260
      700
      1963
      0
      98177
      47.7641
      -122.364
      1850
      8219
    
    
      9
      4140900050
      20150126T000000
      440000
      4
      1.75
      2180
      10200
      1
      0
      2
      ...
      8
      2000
      180
      1966
      0
      98028
      47.7638
      -122.270
      2590
      10445
    
  

10 rows × 21 columns

Now let us extract individual segments with array slicing. Consider the scenario where we group the houses in the train_valid_shuffled dataframe into k=10 segments of roughly equal size, with starting and ending indices computed as above. Extract the fourth segment (segment 3) and assign it to a variable called validation4.



In [80]:

    
i = 3
start_ind = (n*3)/k
end_ind = (n*(3+1))/k-1
validation4 = train_valid_shuffled[ start_ind : end_ind + 1]

To verify that we have the right elements extracted, run the following cell, which computes the average price of the fourth segment. When rounded to nearest whole number, the average should be $536,234.



In [81]:

    
print int(round(validation4['price'].mean(), 0))

After designating one of the k segments as the validation set, we train a model using the rest of the data. To choose the remainder, we slice (0:start) and (end+1:n) of the data and paste them together. SFrame has append() method that pastes together two disjoint sets of rows originating from a common dataset. For instance, the following cell pastes together the first and last two rows of the train_valid_shuffled dataframe.



In [82]:

    
n = len(train_valid_shuffled)
first_two = train_valid_shuffled[0:2]
last_two = train_valid_shuffled[n-2:n]
print first_two.append(last_two)









    



               id             date    price  bedrooms  bathrooms  sqft_living  \
0      2780400035  20140505T000000   665000         4       2.50         2800   
1      1703050500  20150321T000000   645000         3       2.50         2490   
19394  4139480190  20140916T000000  1153000         3       3.25         3780   
19395  7237300290  20150326T000000   338000         5       2.50         2400   

       sqft_lot floors  waterfront  view     ...      grade  sqft_above  \
0          5900      1           0     0     ...          8        1660   
1          5978      2           0     0     ...          9        2490   
19394     10623      1           0     1     ...         11        2650   
19395      4496      2           0     0     ...          7        2400   

       sqft_basement  yr_built  yr_renovated  zipcode      lat     long  \
0               1140      1963             0    98115  47.6809 -122.286   
1                  0      2003             0    98074  47.6298 -122.022   
19394           1130      1999             0    98006  47.5506 -122.101   
19395              0      2004             0    98042  47.3692 -122.126   

       sqft_living15  sqft_lot15  
0               2580        5900  
1               2710        6629  
19394           3850       11170  
19395           1880        4319  

[4 rows x 21 columns]

Extract the remainder of the data after excluding fourth segment (segment 3) and assign the subset to train4.



In [83]:

    
i = 3
k = 10
n = len(train_valid_shuffled)
start_ind = (n*3)/k
end_ind = (n*(3+1))/k-1
train4 = train_valid_shuffled[0:start].append(train_valid_shuffled[end+1:n])

To verify that we have the right elements extracted, run the following cell, which computes the average price of the data with fourth segment excluded. When rounded to nearest whole number, the average should be $539,450.



In [84]:

    
print int(round(train4['price'].mean(), 0))

Now we are ready to implement k-fold cross-validation. Write a function that computes k validation errors by designating each of the k segments as the validation set. It accepts as parameters (i) k, (ii) l2_penalty, (iii) dataframe, (iv) name of output column (e.g. price) and (v) list of feature names. The function returns the average validation error using k segments as validation sets.

For each i in [0, 1, ..., k-1]:
- Compute starting and ending indices of segment i and call 'start' and 'end'
- Form validation set by taking a slice (start:end+1) from the data.
- Form training set by appending slice (end+1:n) to the end of slice (0:start).
- Train a linear model using training set just formed, with a given l2_penalty
- Compute validation error using validation set just formed



In [85]:

    
def k_fold_cross_validation(k, l2_penalty, data, output_vals):
    
    # Defining n as the number of observations and an empty list to store the k cross_validation errors
    n = len(data)
    cv_error_list = []
    
    # Looping to compute k slices. Computing the array index to get the kth_slice.
    for i in range(k):
        
        # Getting the starting and ending index of the kth slice
        start = (n*i)/k
        end = (n*(i+1))/k-1
        
        # Using start and end to split data into cross-validation and training set
        cv_set = data[start: end + 1]
        training_set = data[0:start].append(data[end+1:n])
        
        # Using the training data to create a linear regression model
        model_train_data = linear_model.Ridge(alpha=l2_penalty, normalize=True)
        model_train_data.fit( data, output_vals )
        
        # Computing np.array with predictions from the model we learn
        predictions = model_train_data.predict(data)
        
        # Computing the error on the cross-validation set
        RSS_cv_set = sum( (predictions - output_vals)**2 )
        
        cv_error_list.append(RSS_cv_set)
     
    # Return the average validation error
    return sum(cv_error_list)/float(len(cv_error_list))

Once we have a function to compute the average validation error for a model, we can write a loop to find the model that minimizes the average validation error. Write a loop that does the following:

We will again be aiming to fit a 15th-order polynomial model using the sqft_living input
For l2_penalty in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this in Python, you can use this Numpy function: np.logspace(1, 7, num=13).)
- Run 10-fold cross-validation with l2_penalty
Report which L2 penalty produced the lowest average validation error.

Note: since the degree of the polynomial is now fixed to 15, to make things faster, you should generate polynomial features in advance and re-use them throughout the loop. Make sure to use train_valid_shuffled when generating polynomial features!



In [86]:

    
l2_penalty_list = np.logspace(3, 9, num=26)



In [87]:

    
poly_15_dframe = polynomial_dataframe(train_valid_shuffled['sqft_living'], 15)
output_values = train_valid_shuffled['price']



In [88]:

    
l2_RSS_list = []
for l2_pen in l2_penalty_list:
    
    RSS_error = k_fold_cross_validation(10, l2_pen, poly_15_dframe, output_values)
    l2_RSS_list.append( (RSS_error, l2_pen) )

QUIZ QUESTIONS: What is the best value for the L2 penalty according to 10-fold validation?



In [89]:

    
print 'Minimum value for RSS error is  : %.2e' %min(l2_RSS_list)[0]
print 'L2 penalty for this RSS error is: %.2e' %min(l2_RSS_list)[1]









    



Minimum value for RSS error is  : 2.65e+15
L2 penalty for this RSS error is: 1.00e+03

You may find it useful to plot the k-fold cross-validation errors you have obtained to better understand the behavior of the method.



In [90]:

    
# Putting all L2 penalties and RSS errors for plotting
L2_plot_list = []
RSS_plot_list = []

for entry in l2_RSS_list:
    L2_plot_list.append(entry[1])
    RSS_plot_list.append(entry[0])



In [91]:

    
# Plot the l2_penalty values in the x axis and the cross-validation error in the y axis.
# Using plt.xscale('log') will make your plot more intuitive.

plt.figure(figsize=(8,6))
plt.plot(L2_plot_list, RSS_plot_list,'-')
plt.xscale('log')
#
plt.xlabel('L2 penalty ' + r'$(\lambda)$', fontsize=16)
plt.ylabel('Average Cross-Validation RSS', fontsize=16)
plt.title('Cross-Validation RSS vs. L2 Penalty', fontsize=16)
#
plt.show()

Once you found the best value for the L2 penalty using cross-validation, it is important to retrain a final model on all of the training data using this value of l2_penalty. This way, your final model will be trained on the entire dataset.



In [92]:

    
min_L2_pen = min(l2_RSS_list)[1]

QUIZ QUESTION: Using the best L2 penalty found above, train a model using all training data. What is the RSS on the TEST data of the model you learn with this L2 penalty?



In [93]:

    
# Loading the training set data and defining the dataframe w/ 15 polynomial features
train_data = pd.read_csv('wk3_kc_house_train_data.csv', dtype = dtype_dict)
train_data = train_data.sort_values(['sqft_living', 'price'])
poly_15_train_data = polynomial_dataframe(train_data['sqft_living'], 15)



In [94]:

    
# Training a linear regression model with L2 penalty that gave the smallest RSS error
model_train_data = linear_model.Ridge(alpha=min_L2_pen, normalize=True)
model_train_data.fit( poly_15_train_data, train_data['price'] )









    Out[94]:





Ridge(alpha=1000.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, solver='auto', tol=0.001)



In [95]:

    
# Now, loading the test data and defining the dataframe w/ 15 polynomial features
test_data = pd.read_csv('wk3_kc_house_test_data.csv', dtype = dtype_dict)
poly_15_test_data = polynomial_dataframe(test_data['sqft_living'], 15)



In [96]:

    
# Using the weights learning from the training data to calculate the predictions on the test data
predictions = model_train_data.predict(poly_15_test_data)



In [97]:

    
# Computing the RSS on the test data
RSS_test_set = sum( (predictions - test_data['price'])**2 )



In [98]:

    
print 'RSS on test data with min_L2_pen: %.2e' %(RSS_test_set)









    



RSS on test data with min_L2_pen: 2.84e+14



In [ ]:

	sqft_living	price
19452	290	142000
15381	370	276000
860	380	245000
18379	384	265000
4868	390	228000

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	...	grade	sqft_above	sqft_basement	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	2780400035	20140505T000000	665000	4	2.50	2800	5900	1	0	...	8	1660	1140	1963	98115	47.6809	-122.286	2580	5900
1	1703050500	20150321T000000	645000	3	2.50	2490	5978	2	0	...	9	2490	0	2003	98074	47.6298	-122.022	2710	6629
2	5700002325	20140605T000000	640000	3	1.75	2340	4206	1	0	...	7	1170	1170	1917	98144	47.5759	-122.288	1360	4725
3	0475000510	20141118T000000	594000	3	1.00	1320	5000	1	0	...	7	1090	230	1920	98107	47.6674	-122.365	1700	5000
4	0844001052	20150128T000000	365000	4	2.50	1904	8200	2	0	...	7	1904	0	1999	98010	47.3107	-122.001	1560	12426
5	2781280290	20150427T000000	305000	3	2.50	1610	3516	2	0	...	8	1610	0	2006	98055	47.4491	-122.188	1610	3056
6	2214800630	20141105T000000	239950	3	2.25	1560	8280	2	0	...	7	1560	0	1979	98001	47.3393	-122.259	1920	8120
7	2114700540	20141021T000000	366000	3	2.50	1320	4320	1	0	...	6	660	660	1918	98106	47.5327	-122.347	1190	4200
8	2596400050	20140730T000000	375000	3	1.00	1960	7955	1	0	...	7	1260	700	1963	98177	47.7641	-122.364	1850	8219
9	4140900050	20150126T000000	440000	4	1.75	2180	10200	1	2	...	8	2000	180	1966	98028	47.7638	-122.270	2590	10445