Regression Week 5: Feature Selection and LASSO (Interpretation)

In this notebook, you will use LASSO to select features, building on a pre-implemented solver for LASSO (using GraphLab Create, though you can use other solvers). You will:

  • Run LASSO with different L1 penalties.
  • Choose the best L1 penalty using a validation set.
  • Choose the best L1 penalty using a validation set, with an additional constraint on the size of the subset.

In the second notebook, you will implement your own LASSO solver, using coordinate descent.

Fire up GraphLab Create


In [1]:
import graphlab

Load in house sales data

The dataset contains house sales in King County, the region where the city of Seattle, WA is located.


In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1476930985.log

Create new features


In [3]:
sales.head()


Out[3]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
7129300520 2014-10-13 00:00:00+00:00 221900.0 3.0 1.0 1180.0 5650 1 0
6414100192 2014-12-09 00:00:00+00:00 538000.0 3.0 2.25 2570.0 7242 2 0
5631500400 2015-02-25 00:00:00+00:00 180000.0 2.0 1.0 770.0 10000 1 0
2487200875 2014-12-09 00:00:00+00:00 604000.0 4.0 3.0 1960.0 5000 1 0
1954400510 2015-02-18 00:00:00+00:00 510000.0 3.0 2.0 1680.0 8080 1 0
7237550310 2014-05-12 00:00:00+00:00 1225000.0 4.0 4.5 5420.0 101930 1 0
1321400060 2014-06-27 00:00:00+00:00 257500.0 3.0 2.25 1715.0 6819 2 0
2008000270 2015-01-15 00:00:00+00:00 291850.0 3.0 1.5 1060.0 9711 1 0
2414600126 2015-04-15 00:00:00+00:00 229500.0 3.0 1.0 1780.0 7470 1 0
3793500160 2015-03-12 00:00:00+00:00 323000.0 3.0 2.5 1890.0 6560 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 1180 0 1955 0 98178 47.51123398
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 6 770 0 1933 0 98028 47.73792661
0 5 7 1050 910 1965 0 98136 47.52082
0 3 8 1680 0 1987 0 98074 47.61681228
0 3 11 3890 1530 2001 0 98053 47.65611835
0 3 7 1715 0 1995 0 98003 47.30972002
0 3 7 1060 0 1963 0 98198 47.40949984
0 3 7 1050 730 1960 0 98146 47.51229381
0 3 7 1890 0 2003 0 98038 47.36840673
long sqft_living15 sqft_lot15
-122.25677536 1340.0 5650.0
-122.3188624 1690.0 7639.0
-122.23319601 2720.0 8062.0
-122.39318505 1360.0 5000.0
-122.04490059 1800.0 7503.0
-122.00528655 4760.0 101930.0
-122.32704857 2238.0 6819.0
-122.31457273 1650.0 9711.0
-122.33659507 1780.0 8113.0
-122.0308176 2390.0 7570.0
[10 rows x 21 columns]

As in Week 2, we consider features that are some transformations of inputs.


In [4]:
from math import log, sqrt
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']

# In the dataset, 'floors' was defined with type string, 
# so we'll convert them to float, before creating a new feature.
sales['floors'] = sales['floors'].astype(float) 
sales['floors_square'] = sales['floors']*sales['floors']
  • Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
  • On the other hand, taking the square root of sqft_living will decrease the separation between a big house and a small house. The owner may not be exactly twice as happy about getting a house that is twice as big (see the quick numeric check below).
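
A quick numeric check makes both effects concrete: squaring stretches the gap between 1 and 4 bedrooms, while the square root compresses the gap between a 1000 and a 2000 sqft house (doubling the size multiplies the feature by only about 1.41). A minimal sketch:

In [ ]:
# Squaring amplifies large bedroom counts: the gap between 1 and 4 becomes 1 vs 16.
print 1**2, 4**2
# The square root compresses living area: doubling sqft raises the feature by only ~41%.
print sqrt(1000.0), sqrt(2000.0), sqrt(2000.0) / sqrt(1000.0)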

Learn regression weights with L1 penalty

Let us fit a model with all the features available, plus the features we just created above.


In [5]:
all_features = ['bedrooms', 'bedrooms_square',
            'bathrooms',
            'sqft_living', 'sqft_living_sqrt',
            'sqft_lot', 'sqft_lot_sqrt',
            'floors', 'floors_square',
            'waterfront', 'view', 'condition', 'grade',
            'sqft_above',
            'sqft_basement',
            'yr_built', 'yr_renovated']

Applying an L1 penalty requires adding an extra parameter (l1_penalty) to the linear regression call in GraphLab Create. (Other tools may have separate implementations of LASSO.) Note that it's important to set l2_penalty=0 to ensure we don't introduce an additional L2 penalty.


In [6]:
model_all = graphlab.linear_regression.create(sales, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=1e10)


Linear regression:
--------------------------------------------------------
Number of examples          : 21613
Number of features          : 17
Number of unpacked features : 17
Number of coefficients    : 18
Starting Accelerated Gradient (FISTA)
--------------------------------------------------------
+-----------+----------+-----------+--------------+--------------------+---------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+-----------+--------------+--------------------+---------------+
Tuning step size. First iteration could take longer than subsequent iterations.
| 1         | 2        | 0.000002  | 1.378619     | 6962915.603493     | 426631.749026 |
| 2         | 3        | 0.000002  | 1.410414     | 6843144.200219     | 392488.929838 |
| 3         | 4        | 0.000002  | 1.451617     | 6831900.032123     | 385340.166783 |
| 4         | 5        | 0.000002  | 1.483224     | 6847166.848958     | 384842.383767 |
| 5         | 6        | 0.000002  | 1.516477     | 6869667.895833     | 385998.458623 |
| 6         | 7        | 0.000002  | 1.554943     | 6847177.773672     | 380824.455891 |
+-----------+----------+-----------+--------------+--------------------+---------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.

Find what features had non-zero weight.


In [9]:
model_all.get('coefficients')[model_all.get('coefficients')['value'] > 0.0]


Out[9]:
name index value stderr
(intercept) None 274873.05595 None
bathrooms None 8468.53108691 None
sqft_living None 24.4207209824 None
sqft_living_sqrt None 350.060553386 None
grade None 842.068034898 None
sqft_above None 20.0247224171 None
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.

Note that a majority of the weights have been set to zero. So by setting an L1 penalty that's large enough, we are performing a subset selection.

QUIZ QUESTION: According to this list of weights, which of the features have been chosen?
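
Since LASSO weights can be negative as well as positive, a safer filter for the selected features compares against zero with != rather than >. A minimal sketch, reusing the coefficients SFrame from the model above:

In [ ]:
# List every feature with a nonzero weight; != 0.0 also catches negative weights.
coefficients = model_all.get('coefficients')
nonzero = coefficients[coefficients['value'] != 0.0]
nonzero.print_rows(num_rows=20)   # show all rows rather than just the head
print "Number of nonzero weights:", nonzero.num_rows()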

Selecting an L1 penalty

To find a good L1 penalty, we will explore multiple values using a validation set. Let us do a three-way split into train, validation, and test sets:

  • Split our sales data into 2 sets: training and test
  • Further split our training data into two sets: train, validation

Be very careful that you use seed = 1 to ensure you get the same answer!


In [10]:
(training_and_validation, testing) = sales.random_split(.9,seed=1) # initial train/test split
(training, validation) = training_and_validation.random_split(0.5, seed=1) # split training into train and validate

Next, we write a loop that does the following:

  • For l1_penalty in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this in Python, type np.logspace(1, 7, num=13).)
    • Fit a regression model with a given l1_penalty on TRAIN data. Specify l1_penalty=l1_penalty and l2_penalty=0. in the parameter list.
    • Compute the RSS on VALIDATION data (here you will want to use .predict()) for that l1_penalty
  • Report which l1_penalty produced the lowest RSS on validation data.

When you call linear_regression.create() make sure you set validation_set = None.

Note: you can turn off the print out of linear_regression.create() with verbose = False


In [15]:
validation_rss_avg_list = []
best_l1_penalty = 1
min_rss = float("inf")
import numpy as np
for l1_penalty in np.logspace(1, 7, num=13):
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=l1_penalty, verbose=False)
    
    # find validation error
    prediction = model.predict(validation[all_features])
    error = prediction - validation['price']
    error_squared = error * error
    rss = error_squared.sum()
    print "L1 penalty " + str(l1_penalty) + " validation rss = " + str(rss)
    
    if (rss < min_rss):
        min_rss = rss
        best_l1_penalty = l1_penalty
    validation_rss_avg_list.append(rss)


print "Best L1 penalty " + str(best_l1_penalty) + " validation rss = " + str(min_rss)
validation_rss_avg_list


L1 penalty 10.0 validation rss = 6.25766285142e+14
L1 penalty 31.6227766017 validation rss = 6.25766285362e+14
L1 penalty 100.0 validation rss = 6.25766286058e+14
L1 penalty 316.227766017 validation rss = 6.25766288257e+14
L1 penalty 1000.0 validation rss = 6.25766295212e+14
L1 penalty 3162.27766017 validation rss = 6.25766317206e+14
L1 penalty 10000.0 validation rss = 6.25766386761e+14
L1 penalty 31622.7766017 validation rss = 6.25766606749e+14
L1 penalty 100000.0 validation rss = 6.25767302792e+14
L1 penalty 316227.766017 validation rss = 6.25769507644e+14
L1 penalty 1000000.0 validation rss = 6.25776517727e+14
L1 penalty 3162277.66017 validation rss = 6.25799062845e+14
L1 penalty 10000000.0 validation rss = 6.25883719085e+14
Best L1 penalty 10.0 validation rss = 6.25766285142e+14
Out[15]:
[625766285142460.5,
 625766285362394.4,
 625766286057885.1,
 625766288257224.4,
 625766295212186.1,
 625766317206080.8,
 625766386760658.0,
 625766606749278.5,
 625767302791634.9,
 625769507643885.8,
 625776517727024.5,
 625799062845466.6,
 625883719085425.0]

In [16]:
np.logspace(1, 7, num=13)


Out[16]:
array([  1.00000000e+01,   3.16227766e+01,   1.00000000e+02,
         3.16227766e+02,   1.00000000e+03,   3.16227766e+03,
         1.00000000e+04,   3.16227766e+04,   1.00000000e+05,
         3.16227766e+05,   1.00000000e+06,   3.16227766e+06,
         1.00000000e+07])

QUIZ QUESTIONS

  1. What was the best value for the l1_penalty?
  2. What is the RSS on TEST data of the model with the best l1_penalty?

In [17]:
best_l1_penalty


Out[17]:
10.0

In [18]:
model_best = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=best_l1_penalty, verbose=False)
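
To answer the second quiz question, we can evaluate model_best on the held-out testing split in the same way the validation RSS was computed above; a minimal sketch:

In [ ]:
# RSS of the best model on the TEST set.
test_prediction = model_best.predict(testing[all_features])
test_error = test_prediction - testing['price']
test_rss = (test_error * test_error).sum()
print "Test RSS = " + str(test_rss)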

QUIZ QUESTION: Also, using this value of L1 penalty, how many nonzero weights do you have?


In [20]:
len(model_best.get('coefficients')[model_best.get('coefficients')['value'] > 0.0])


Out[20]:
18
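
Equivalently, the .nnz() method on the coefficient values counts the nonzero weights directly (the intercept is included in this count); this is the same trick the hint further below relies on:

In [ ]:
# Count nonzero coefficients (intercept included) without filtering the SFrame.
print model_best['coefficients']['value'].nnz()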

Limit the number of nonzero weights

What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive "a rule of thumb" --- an interpretable model that has only a few features in it.

In this section, you are going to implement a simple, two-phase procedure to achieve this goal:

  1. Explore a large range of l1_penalty values to find a narrow region of l1_penalty values where models are likely to have the desired number of non-zero weights.
  2. Further explore the narrow region you found to find a good value for l1_penalty that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for l1_penalty.

In [21]:
max_nonzeros = 7

Exploring the larger range of values to find a narrow range with the desired sparsity

Let's define a wide range of possible l1_penalty_values:


In [22]:
l1_penalty_values = np.logspace(8, 10, num=20)

Now, implement a loop that searches through this space of possible l1_penalty values:

  • For l1_penalty in np.logspace(8, 10, num=20):
    • Fit a regression model with a given l1_penalty on TRAIN data. Specify l1_penalty=l1_penalty and l2_penalty=0. in the parameter list. When you call linear_regression.create() make sure you set validation_set = None
    • Extract the weights of the model and count the number of nonzeros. Save the number of nonzeros to a list.
      • Hint: model['coefficients']['value'] gives you an SArray with the parameters you learned. If you call the method .nnz() on it, you will find the number of non-zero parameters!

In [23]:
nnz_list = []
for l1_penalty in np.logspace(8, 10, num=20):
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=l1_penalty, verbose=False)
    
    # extract number of nnz
    nnz = model['coefficients']['value'].nnz()
    
    print "L1 penalty " + str(l1_penalty) + " : # nnz = " + str(nnz)

    nnz_list.append(nnz)


nnz_list


L1 penalty 100000000.0 : # nnz = 18
L1 penalty 127427498.57 : # nnz = 18
L1 penalty 162377673.919 : # nnz = 18
L1 penalty 206913808.111 : # nnz = 18
L1 penalty 263665089.873 : # nnz = 17
L1 penalty 335981828.628 : # nnz = 17
L1 penalty 428133239.872 : # nnz = 17
L1 penalty 545559478.117 : # nnz = 17
L1 penalty 695192796.178 : # nnz = 17
L1 penalty 885866790.41 : # nnz = 16
L1 penalty 1128837891.68 : # nnz = 15
L1 penalty 1438449888.29 : # nnz = 15
L1 penalty 1832980710.83 : # nnz = 13
L1 penalty 2335721469.09 : # nnz = 12
L1 penalty 2976351441.63 : # nnz = 10
L1 penalty 3792690190.73 : # nnz = 6
L1 penalty 4832930238.57 : # nnz = 5
L1 penalty 6158482110.66 : # nnz = 3
L1 penalty 7847599703.51 : # nnz = 1
L1 penalty 10000000000.0 : # nnz = 1
Out[23]:
[18, 18, 18, 18, 17, 17, 17, 17, 17, 16, 15, 15, 13, 12, 10, 6, 5, 3, 1, 1]

Out of this large range, we want to find the two ends of our desired narrow range of l1_penalty: at the small end of the range, the l1_penalty values produce too many non-zeros, and at the large end, too few.

More formally, find:

  • The largest l1_penalty that has more non-zeros than max_nonzeros (if we pick a penalty smaller than this value, we will definitely have too many non-zero weights)
    • Store this value in the variable l1_penalty_min (we will use it later)
  • The smallest l1_penalty that has fewer non-zeros than max_nonzeros (if we pick a penalty larger than this value, we will definitely have too few non-zero weights)
    • Store this value in the variable l1_penalty_max (we will use it later)

Hint: there are many ways to do this, e.g.:

  • Programmatically within the loop above
  • Creating a list with the number of non-zeros for each value of l1_penalty and inspecting it to find the appropriate boundaries (see the sketch after the next cell).

In [25]:
l1_penalty_min = 2976351441.63
l1_penalty_max = 3792690190.73
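
The same boundaries can also be derived programmatically from the nnz counts collected above; a minimal sketch, using the fact that nnz_list is aligned with the penalties from np.logspace(8, 10, num=20):

In [ ]:
# Largest penalty that still yields MORE than max_nonzeros nonzero weights,
# and smallest penalty that yields FEWER than max_nonzeros nonzero weights.
penalties = np.logspace(8, 10, num=20)
l1_penalty_min = max(p for p, nnz in zip(penalties, nnz_list) if nnz > max_nonzeros)
l1_penalty_max = min(p for p, nnz in zip(penalties, nnz_list) if nnz < max_nonzeros)
print l1_penalty_min, l1_penalty_max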

QUIZ QUESTIONS

What values did you find for l1_penalty_min and l1_penalty_max?

Exploring the narrow range of values to find the solution with the right number of non-zeros that has the lowest RSS on the validation set

We will now explore the narrow region of l1_penalty values we found:


In [26]:
l1_penalty_values = np.linspace(l1_penalty_min,l1_penalty_max,20)

  • For l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):
    • Fit a regression model with a given l1_penalty on TRAIN data. Specify l1_penalty=l1_penalty and l2_penalty=0. in the parameter list. When you call linear_regression.create() make sure you set validation_set = None
    • Measure the RSS of the learned model on the VALIDATION set

Find the model that has the lowest RSS on the VALIDATION set and has sparsity equal to max_nonzeros.


In [29]:
nnz_list = []
validation_rss_avg_list = []
best_l1_penalty = 1
min_rss = float("inf")
import numpy as np
for l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=l1_penalty, verbose=False)
    
    # find validation error
    prediction = model.predict(validation[all_features])
    error = prediction - validation['price']
    error_squared = error * error
    rss = error_squared.sum()
    print "L1 penalty " + str(l1_penalty) + " validation rss = " + str(rss)
    
    # extract number of nnz
    nnz = model['coefficients']['value'].nnz()
    
    print "L1 penalty " + str(l1_penalty) + " : # nnz = " + str(nnz)

    nnz_list.append(nnz)
    
    print "----------------------------------------------------------"
    
    if (nnz == max_nonzeros and rss < min_rss):
        min_rss = rss
        best_l1_penalty = l1_penalty
    validation_rss_avg_list.append(rss)

print "Best L1 penalty " + str(best_l1_penalty) + " validation rss = " + str(min_rss)


L1 penalty 2976351441.63 validation rss = 9.66925692362e+14
L1 penalty 2976351441.63 : # nnz = 10
----------------------------------------------------------
L1 penalty 3019316638.95 validation rss = 9.74019450085e+14
L1 penalty 3019316638.95 : # nnz = 10
----------------------------------------------------------
L1 penalty 3062281836.27 validation rss = 9.81188367942e+14
L1 penalty 3062281836.27 : # nnz = 10
----------------------------------------------------------
L1 penalty 3105247033.59 validation rss = 9.89328342459e+14
L1 penalty 3105247033.59 : # nnz = 10
----------------------------------------------------------
L1 penalty 3148212230.91 validation rss = 9.98783211266e+14
L1 penalty 3148212230.91 : # nnz = 10
----------------------------------------------------------
L1 penalty 3191177428.24 validation rss = 1.00847716702e+15
L1 penalty 3191177428.24 : # nnz = 10
----------------------------------------------------------
L1 penalty 3234142625.56 validation rss = 1.01829878055e+15
L1 penalty 3234142625.56 : # nnz = 10
----------------------------------------------------------
L1 penalty 3277107822.88 validation rss = 1.02824799221e+15
L1 penalty 3277107822.88 : # nnz = 10
----------------------------------------------------------
L1 penalty 3320073020.2 validation rss = 1.03461690923e+15
L1 penalty 3320073020.2 : # nnz = 8
----------------------------------------------------------
L1 penalty 3363038217.52 validation rss = 1.03855473594e+15
L1 penalty 3363038217.52 : # nnz = 8
----------------------------------------------------------
L1 penalty 3406003414.84 validation rss = 1.04323723787e+15
L1 penalty 3406003414.84 : # nnz = 8
----------------------------------------------------------
L1 penalty 3448968612.16 validation rss = 1.04693748875e+15
L1 penalty 3448968612.16 : # nnz = 7
----------------------------------------------------------
L1 penalty 3491933809.48 validation rss = 1.05114762561e+15
L1 penalty 3491933809.48 : # nnz = 7
----------------------------------------------------------
L1 penalty 3534899006.8 validation rss = 1.05599273534e+15
L1 penalty 3534899006.8 : # nnz = 7
----------------------------------------------------------
L1 penalty 3577864204.12 validation rss = 1.06079953176e+15
L1 penalty 3577864204.12 : # nnz = 7
----------------------------------------------------------
L1 penalty 3620829401.45 validation rss = 1.0657076895e+15
L1 penalty 3620829401.45 : # nnz = 6
----------------------------------------------------------
L1 penalty 3663794598.77 validation rss = 1.06946433543e+15
L1 penalty 3663794598.77 : # nnz = 6
----------------------------------------------------------
L1 penalty 3706759796.09 validation rss = 1.07350454959e+15
L1 penalty 3706759796.09 : # nnz = 6
----------------------------------------------------------
L1 penalty 3749724993.41 validation rss = 1.07763277558e+15
L1 penalty 3749724993.41 : # nnz = 6
----------------------------------------------------------
L1 penalty 3792690190.73 validation rss = 1.08186759232e+15
L1 penalty 3792690190.73 : # nnz = 6
----------------------------------------------------------
Best L1 penalty 3448968612.16 validation rss = 1.04693748875e+15

QUIZ QUESTIONS

  1. What value of l1_penalty in our narrow range has the lowest RSS on the VALIDATION set and has sparsity equal to max_nonzeros?
  2. What features in this model have non-zero coefficients?

In [31]:
model_best = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, 
                                              l2_penalty=0., l1_penalty=best_l1_penalty, verbose=False)
model_best.get('coefficients')[model_best.get('coefficients')['value'] > 0.0]


Out[31]:
name index value stderr
(intercept) None 222253.192544 None
bedrooms None 661.722717782 None
bathrooms None 15873.9572593 None
sqft_living None 32.4102214513 None
sqft_living_sqrt None 690.114773313 None
grade None 2899.42026975 None
sqft_above None 30.0115753022 None
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
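
As a final sanity check, the selected model should have exactly max_nonzeros nonzero weights, with the intercept counting toward that total; a minimal sketch:

In [ ]:
# Verify the sparsity of the final model: expect max_nonzeros (here, 7).
print model_best['coefficients']['value'].nnz()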

In [ ]: