In this notebook, we will use LASSO to select features. You will:
- Run LASSO with different L1 penalties.
- Choose the best L1 penalty using a validation set.
- Choose the best L1 penalty using a validation set, with an additional constraint on the number of nonzero weights.
In the second notebook, you will implement your own LASSO solver, using coordinate descent.
In [33]:
import os
import zipfile
from math import log, sqrt
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
In [34]:
# Put files in current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]
In [35]:
# Filenames of unzipped files
unzip_files = ['kc_house_data.csv', 'wk3_kc_house_train_data.csv',
               'wk3_kc_house_test_data.csv', 'wk3_kc_house_valid_data.csv']
In [36]:
# If unzipped file not in files_list, unzip the file
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        unzipping = zipfile.ZipFile(zip_file)
        unzipping.extractall()
        unzipping.close()
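An equivalent, slightly safer pattern (a minor style alternative, not required by the assignment) uses a context manager so the archive is closed even if extraction fails:
for filename in unzip_files:
    if filename not in files_list:
        # the archive is closed automatically when the block exits
        with zipfile.ZipFile(filename + '.zip') as archive:
            archive.extractall()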
In [37]:
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int,
'sqft_living15':float, 'grade':int, 'yr_renovated':int,
'price':float, 'bedrooms':float, 'zipcode':str,
'long':float, 'sqft_lot15':float, 'sqft_living':float,
'floors':float, 'condition':int, 'lat':float, 'date':str,
'sqft_basement':int, 'yr_built':int, 'id':str,
'sqft_lot':int, 'view':int}
In [38]:
sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)
As in Week 2, we consider features that are some transformations of inputs.
In [39]:
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']
sales['floors_square'] = sales['floors']*sales['floors']
Let us fit a model with all the features available, plus the features we just created above.
In [40]:
all_features = ['bedrooms', 'bedrooms_square',
'bathrooms',
'sqft_living', 'sqft_living_sqrt',
'sqft_lot', 'sqft_lot_sqrt',
'floors', 'floors_square',
'waterfront', 'view', 'condition', 'grade',
'sqft_above',
'sqft_basement',
'yr_built', 'yr_renovated']
Using the entire house dataset, learn regression weights using an L1 penalty of 5e2. Make sure to add "normalize=True" when creating the Lasso object.
In [41]:
model_all = linear_model.Lasso(alpha=5e2, normalize=True) # set parameters
model_all.fit(sales[all_features], sales['price']) # learn weights
Out[41]:
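A note on compatibility: newer scikit-learn releases removed the normalize argument from Lasso. If the call above raises an error in your environment, one workaround (an approximation, not an exact equivalent, since standardizing features changes the effective penalty scale) is to scale the features explicitly before fitting:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Hypothetical alternative: standardize features, then fit Lasso without `normalize`
model_all_alt = make_pipeline(StandardScaler(), linear_model.Lasso(alpha=5e2))
# model_all_alt.fit(sales[all_features], sales['price'])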
In [42]:
print model_all.coef_
Note that a majority of the weights have been set to zero. So by setting an L1 penalty that's large enough, we are performing a subset selection.
QUIZ QUESTION: For the model_all model, which of the features have been chosen, i.e. what features had non-zero weights?
In [43]:
for feat, weight in zip(all_features, model_all.coef_):
    if weight != 0.0:
        print feat + ':', weight
To find a good L1 penalty, we will explore multiple values using a validation set. Let us do a three-way split into train, validation, and test sets:
In [44]:
testing = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)
training = pd.read_csv('wk3_kc_house_train_data.csv', dtype=dtype_dict)
validation = pd.read_csv('wk3_kc_house_valid_data.csv', dtype=dtype_dict)
Make sure to create the 4 features as we did above:
In [45]:
testing['sqft_living_sqrt'] = testing['sqft_living'].apply(sqrt)
testing['sqft_lot_sqrt'] = testing['sqft_lot'].apply(sqrt)
testing['bedrooms_square'] = testing['bedrooms']*testing['bedrooms']
testing['floors_square'] = testing['floors']*testing['floors']
training['sqft_living_sqrt'] = training['sqft_living'].apply(sqrt)
training['sqft_lot_sqrt'] = training['sqft_lot'].apply(sqrt)
training['bedrooms_square'] = training['bedrooms']*training['bedrooms']
training['floors_square'] = training['floors']*training['floors']
validation['sqft_living_sqrt'] = validation['sqft_living'].apply(sqrt)
validation['sqft_lot_sqrt'] = validation['sqft_lot'].apply(sqrt)
validation['bedrooms_square'] = validation['bedrooms']*validation['bedrooms']
validation['floors_square'] = validation['floors']*validation['floors']
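The same four features can also be added with a short loop over the three splits (a sketch of an equivalent, less repetitive alternative to the cell above):
for df in (testing, training, validation):
    # add the transformed features to each DataFrame in place
    df['sqft_living_sqrt'] = df['sqft_living'].apply(sqrt)
    df['sqft_lot_sqrt'] = df['sqft_lot'].apply(sqrt)
    df['bedrooms_square'] = df['bedrooms'] * df['bedrooms']
    df['floors_square'] = df['floors'] * df['floors']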
Next, we write a loop that does the following (a compact sketch of the whole loop appears right after this list; the subsequent cells carry out the same steps one at a time):
- For each l1_penalty in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this in Python, type np.logspace(1, 7, num=13)):
  - Fit a regression model with that l1_penalty on TRAIN data. In scikit-learn, the L1 penalty is passed as the alpha argument of linear_model.Lasso (there is no separate l2_penalty parameter).
  - Compute the RSS on VALIDATION data (using .predict()) for that l1_penalty.
- Report which l1_penalty produced the lowest RSS on validation data.
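A minimal sketch of this search, assuming the training and validation DataFrames defined above (best_rss and best_l1 are illustrative names, not used elsewhere in the notebook):
best_rss, best_l1 = None, None
for l1_penalty in np.logspace(1, 7, num=13):
    model = linear_model.Lasso(alpha=l1_penalty, normalize=True)
    model.fit(training[all_features], training['price'])
    # residual sum of squares on the validation split
    rss = ((validation['price'] - model.predict(validation[all_features]))**2.0).sum()
    if best_rss is None or rss < best_rss:
        best_rss, best_l1 = rss, l1_penalty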
In [46]:
l1_pen_val = np.logspace(1, 7, num=13)
Creating a dictionary to store the regression models for each L1 penalty. The key of the dictionary will be the index into the l1_pen_val array, converted to a string
In [47]:
models_diff_l1 = {}
Creating a regression model for each L1 penalty
In [48]:
for i in range(len(l1_pen_val)):
    key_val = str(i)
    models_diff_l1[key_val] = linear_model.Lasso(alpha=l1_pen_val[i], normalize=True) # set parameters
    models_diff_l1[key_val].fit(training[all_features], training['price']) # learn weights
Making a function to compute the RSS on the validation data
In [49]:
def RSS_val(output_vals, predictions):
    RSS_error = sum( (output_vals - predictions)**2.0 )
    return RSS_error
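As a quick sanity check of the function (a toy example, not part of the assignment): with observed values [1, 2] and predictions [0, 0], the RSS is 1 + 4 = 5.
assert RSS_val(np.array([1.0, 2.0]), np.array([0.0, 0.0])) == 5.0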
Making a list to store tuples of the form (RSS value for an L1 penalty, index into the L1 penalty array)
In [50]:
RSS_L1_vals = []
In this loop, we use each regression model to calculate the predicted output values. We then use the predicted values and the observed output values to calculate the RSS error, and append the result to RSS_L1_vals.
In [51]:
for i in range(len(l1_pen_val)):
    key_val = str(i)
    pred_vals = models_diff_l1[key_val].predict(validation[all_features])
    RSS = RSS_val(validation['price'], pred_vals)
    RSS_L1_vals.append( (RSS, i) )
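Since matplotlib and seaborn are already imported, an optional way to inspect the trend is a log-scale plot of validation RSS against the L1 penalty (a quick visualization sketch, not required for the quiz):
rss_values = [rss for rss, i in RSS_L1_vals]   # RSS_L1_vals is already ordered by index
plt.plot(l1_pen_val, rss_values, marker='o')
plt.xscale('log')
plt.xlabel('L1 penalty (alpha)')
plt.ylabel('Validation RSS')
plt.show()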
QUIZ QUESTIONS
Q1. What was the best value for the l1_penalty?
In [52]:
print l1_pen_val[ min(RSS_L1_vals)[1] ]
print '%.2e' % ( min(RSS_L1_vals)[0] )
QUIZ QUESTION Also, using this value of L1 penalty, how many nonzero weights do you have?
In [53]:
print ( np.count_nonzero(models_diff_l1[ str(min(RSS_L1_vals)[1]) ].coef_) +
np.count_nonzero(models_diff_l1[ str(min(RSS_L1_vals)[1]) ].intercept_) )
What if we absolutely wanted to limit ourselves to, say, 7 nonzero weights? In this section, you are going to implement a simple, two-phase procedure to achieve this goal (the key building block, counting a model's nonzero weights, is sketched right after this list):
1. Explore a large range of l1_penalty values to find a narrow region of l1_penalty values where models are likely to have the desired number of non-zero weights.
2. Further explore the narrow region you found to find a good value for l1_penalty that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for l1_penalty.
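Both phases hinge on counting a fitted model's nonzero weights. A minimal helper sketch (count_nonzero_weights is an illustrative name, not defined elsewhere in the notebook; the cells below inline the same computation, counting the intercept as one of the weights):
def count_nonzero_weights(model):
    # nonzero coefficients plus the intercept, matching the convention used below
    return np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)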
In [54]:
max_nonzeros = 7
In [55]:
l1_penalty_values = np.logspace(1, 4, num=20)
Now, implement a loop that searches through this space of possible l1_penalty values:
- For each l1_penalty in np.logspace(1, 4, num=20):
  - Fit a regression model with that l1_penalty on TRAIN data (again alpha=l1_penalty and normalize=True).
  - Count the model's nonzero weights, including the intercept.
Creating lists to store the L1 penalties for models with fewer nonzero weights than max_nonzeros and for models with more nonzero weights than max_nonzeros
In [56]:
list_l1_pen_n_less_nmax = []
list_l1_pen_n_larger_nmax = []
Creating a regression model for each L1 penalty. Then, finding the number of non-zero weights for each regression model. If the number of non-zero weights is larger or smaller than max_nonzeros, store the corresponding L1 penalty in the appropriate list
In [57]:
for i in range(len(l1_penalty_values)):
    mod_diff_l1_n7 = linear_model.Lasso(alpha=l1_penalty_values[i], normalize=True) # set parameters
    mod_diff_l1_n7.fit(training[all_features], training['price']) # learn weights
    non_0_weights = ( np.count_nonzero(mod_diff_l1_n7.coef_) +
                      np.count_nonzero(mod_diff_l1_n7.intercept_) )
    if non_0_weights < max_nonzeros:
        list_l1_pen_n_less_nmax.append(l1_penalty_values[i])
    if non_0_weights > max_nonzeros:
        list_l1_pen_n_larger_nmax.append(l1_penalty_values[i])
Out of this large range, we want to find the two ends of our desired narrow range of l1_penalty. At one end, we will have l1_penalty values that have too few non-zeros, and at the other end, we will have an l1_penalty that has too many non-zeros.
More formally, find:
- The largest l1_penalty that has more non-zeros than max_nonzeros (if we pick a penalty smaller than this value, we will definitely have too many non-zero weights). Store this value in the variable l1_penalty_min (we will use it later).
- The smallest l1_penalty that has fewer non-zeros than max_nonzeros (if we pick a penalty larger than this value, we will definitely have too few non-zero weights). Store this value in the variable l1_penalty_max (we will use it later).
QUIZ QUESTIONS
What values did you find for l1_penalty_min and l1_penalty_max?
In [58]:
l1_penalty_min = max(list_l1_pen_n_larger_nmax)
l1_penalty_max = min(list_l1_pen_n_less_nmax)
print 'l1_penalty_min: ', round(l1_penalty_min,0)
print 'l1_penalty_max: ', round(l1_penalty_max,0)
In [59]:
l1_penalty_values = np.linspace(l1_penalty_min,l1_penalty_max,20)
We now explore this narrow range:
- For each l1_penalty in np.linspace(l1_penalty_min, l1_penalty_max, 20):
  - Fit a regression model with that l1_penalty on TRAIN data (alpha=l1_penalty, normalize=True).
  - Measure the RSS of the learned model on the VALIDATION set.
Find the model that has the lowest RSS on the VALIDATION set and has sparsity equal to max_nonzeros.
Creating a list to store RSS values when the number of non-zero weights is equal to max_nonzeros
In [60]:
RSS_L1_vals_ref = []
Creating a regression model for each L1 penalty. If the number of non-zero weights is equal to max_nonzeros, storing the RSS on the validation set and the index of this L1 penalty in the l1_penalty_values list
In [61]:
for i in range(len(l1_penalty_values)):
    mod_diff_l1_ref = linear_model.Lasso(alpha=l1_penalty_values[i], normalize=True) # set parameters
    mod_diff_l1_ref.fit(training[all_features], training['price']) # learn weights
    non_0_weights = ( np.count_nonzero(mod_diff_l1_ref.coef_) +
                      np.count_nonzero(mod_diff_l1_ref.intercept_) )
    if non_0_weights == max_nonzeros:
        pred_vals = mod_diff_l1_ref.predict(validation[all_features])
        RSS = RSS_val(validation['price'], pred_vals)
        RSS_L1_vals_ref.append( (RSS, i) )
QUIZ QUESTIONS
Q1. What value of l1_penalty in our narrow range has the lowest RSS on the VALIDATION set and has sparsity equal to max_nonzeros?
In [62]:
print round( l1_penalty_values[ min(RSS_L1_vals_ref)[1] ] , 0 )
Q2. What features in this model have non-zero coefficients?
Re-learning the model with this L1 penalty
In [63]:
best_L1_index = min(RSS_L1_vals_ref)[1]
mod_diff_l1_ref = linear_model.Lasso(alpha=l1_penalty_values[ best_L1_index ], normalize=True) # set parameters
mod_diff_l1_ref.fit(training[all_features], training['price']) # learn weights
Out[63]:
Printing the features with non-zero weights and the values of the weights.
In [64]:
if mod_diff_l1_ref.intercept_ != 0:
    print 'intercept: %.2e' % (mod_diff_l1_ref.intercept_)
for feat, weight in zip(all_features, mod_diff_l1_ref.coef_):
    if weight != 0.0:
        print feat + ':', weight
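As an optional sanity check (not part of the quiz), you can confirm that this re-learned model indeed has max_nonzeros nonzero weights, using the same counting convention as above:
print ( np.count_nonzero(mod_diff_l1_ref.coef_) +
        np.count_nonzero(mod_diff_l1_ref.intercept_) )  # should match max_nonzeros (7)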
In [ ]: