The Lasso

StatML: Lecture 5

Prof. James Sharpnack

  • Some content and images are from "The Elements of Statistical Learning" by Hastie, Tibshirani, Friedman
  • Reading ESL Chapter 3

Recall Convex Optimization

Def A function $f : \mathbb R^p \to \mathbb R$ is convex if for any $0 \le \alpha \le 1$, $x_0, x_1 \in \mathbb R^p$, $$ f(\alpha x_0 + (1 - \alpha) x_1) \le \alpha f(x_0) + (1 - \alpha) f(x_1). $$

For convex functions, local minima are global minima

Recall 1st Order Condition. If $f$ is differentiable, then it is convex if and only if $$ f(x) \ge f(x_0) + \nabla f(x_0)^\top (x - x_0), \quad \forall x, x_0, $$ and when $\nabla f(x_0) = 0$ this gives $$ f(x) \ge f(x_0), \quad \forall x, $$ so any fixed point of gradient descent is a global minimum (for convex, differentiable $f$).
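A concrete instance (added here for reference): for the least squares objective $f(\beta) = \frac 12 \| y - X \beta \|^2$, $$ \nabla f(\beta) = - X^\top (y - X \beta), $$ so any stationary point solves the normal equations $X^\top X \beta = X^\top y$ and, by convexity, is a global minimizer.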

Subdifferential

Def. $g(x_0) \in \mathbb R^p$ is a subgradient of $f$ at $x_0$ if
$$ f(x) \ge f(x_0) + g(x_0)^\top (x - x_0), \forall x. $$ The set of all subgradients at $x_0$ is called the subdifferential, denoted $\partial f(x_0)$.

$x_0$ is a global minimum if and only if $0 \in \partial f(x_0)$.
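Example (added for concreteness): for $f(x) = |x|$ on $\mathbb R$, $$ \partial f(x) = \begin{cases} \{+1\}, & x > 0, \\ [-1, 1], & x = 0, \\ \{-1\}, & x < 0, \end{cases} $$ so $0 \in \partial f(0)$ and $x = 0$ is the global minimum even though $f$ is not differentiable there. This fact drives the soft thresholding operator used below.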

Wavelet denoising

Soft thresholding is commonly used for orthonormal bases.

  • Suppose that we have a vector $y_1,\ldots, y_T$ (like a time series).
  • And we want to reconstruct $y$ with $W \beta$ where $\beta$ has a small sum of absolute values $\sum_i |\beta_i|$
  • $W$ is $T \times T$ and $W W^\top = W^\top W = I$ (orthonormal full rank design)

Want to minimize $$ \frac 12 \sum_{i=1}^T (y - W \beta)_i^2 + \lambda \sum_{i=1}^T |\beta_i|. $$
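Because $W$ is orthonormal, the objective separates across coordinates (a derivation step added here for completeness). Writing $z = W^\top y$, $$ \frac 12 \| y - W \beta \|^2 + \lambda \sum_{i=1}^T |\beta_i| = \frac 12 \|y\|^2 - \frac 12 \|z\|^2 + \sum_{i=1}^T \left( \frac 12 (z_i - \beta_i)^2 + \lambda |\beta_i| \right), $$ and each scalar problem is minimized by the soft thresholding operator $$ \hat \beta_i = S_\lambda(z_i) = {\rm sign}(z_i) \, (|z_i| - \lambda)_+, $$ which is exactly what the soft function defined below implements.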


In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [8]:
## Explore Turkish stock exchange dataset

tse = pd.read_excel('../../data/data_akbilgic.xlsx',skiprows=1)

tse = tse.rename(columns={'ISE':'TLISE','ISE.1':'USDISE'})

In [9]:
def const_wave(T,a,b):
    ## Haar-type wavelet supported on indices a..b-1: piecewise constant, zero mean, unit norm
    wave = np.zeros(T)
    s1 = (b-a) // 2                    # length of the positive piece
    s2 = (b-a) - s1                    # length of the negative piece
    norm_C = (s1*s2 / (s1+s2))**0.5    # makes the wavelet have unit Euclidean norm
    wave[a:a+s1] = norm_C / s1
    wave[a+s1:b] = -norm_C / s2
    return wave

In [10]:
def _const_wave_basis(T,a,b):
    if b-a < 2:
        return []
    wave_basis = []
    wave_basis.append(const_wave(T,a,b))
    mid_pt = a + (b-a)//2
    wave_basis += _const_wave_basis(T,a,mid_pt)
    wave_basis += _const_wave_basis(T,mid_pt,b)
    return wave_basis

In [11]:
def const_wave_basis(T,a,b):
    father = np.ones(T) / T**0.5
    return [father] + _const_wave_basis(T,a,b)

In [12]:
# Construct discrete Haar wavelet basis
T,p = tse.shape
wave_basis = const_wave_basis(T,0,T)
W = np.array(wave_basis).T

In [13]:
_ = plt.plot(W[:,:3])



In [14]:
def soft(y,lamb):
    ## elementwise soft thresholding: sign(y) * max(|y| - lamb, 0)
    pos_part = (y - lamb) * (y > lamb)
    neg_part = (y + lamb) * (y < -lamb)
    return pos_part + neg_part
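
A quick sanity check of soft (added, not in the original notebook): values inside $[-\lambda, \lambda]$ are set to exactly zero, and everything else is shrunk toward zero by $\lambda$.

print(soft(np.array([-2., -0.5, 0.3, 1.5]), 1.0))  # expect [-1., 0., 0., 0.5]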

In [15]:
## Volatility seems most interesting
## will construct local measure of volatility
## remove rolling window estimate (local centering)
## square the residuals

tse = tse.set_index('date')
tse_trem = tse - tse.rolling("7D").mean()
tse_vol = tse_trem**2.

## Make wavelet transformation and soft threshold

tse_wave = W.T @ tse_vol.values
lamb = .001
tse_soft = soft(tse_wave,lamb)
tse_rec = W @ tse_soft
tse_den = tse_vol.copy()
tse_den.iloc[:,:] = tse_rec
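
As an added check (illustrative, not part of the original analysis), we can look at how much sparsity the threshold induces; since $W$ is orthonormal this is exactly the fraction of zero entries in $\hat \beta$.

## fraction of wavelet coefficients zeroed out by the soft threshold
print((tse_soft == 0).mean())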

In [16]:
_ = tse_vol.plot(subplots=True,figsize=(10,10))



In [17]:
_ = tse_den.plot(subplots=True,figsize=(10,10))


Wavelet reconstruction

We can reconstruct the sequence as $$ \hat y = W \hat \beta. $$ The objective is a likelihood term plus an L1 penalty term, $$ \frac 12 \sum_{i=1}^T (y - W \beta)_i^2 + \lambda \sum_{i=1}^T |\beta_i|. $$

The L1 penalty "forces" some $\beta_i = 0$, inducing sparsity


In [18]:
plt.plot(tse_soft[:,4])
high_idx = np.where(np.abs(tse_soft[:,4]) > .0001)[0]
print(high_idx)


[  0   2  48  63 344 345]

In [19]:
fig, axs = plt.subplots(len(high_idx) + 1,1)
for i, idx in enumerate(high_idx):
    axs[i].plot(W[:,idx])
plt.plot(tse_den['FTSE'],c='r')


Out[19]:
[<matplotlib.lines.Line2D at 0x7f17b67b3090>]

Non-orthogonal design

The objective, a likelihood term plus an L1 penalty term, $$ \frac 12 \sum_{i=1}^T (y - X \beta)_i^2 + \lambda \sum_{j=1}^p |\beta_j|, $$ does not have a closed-form solution when $X$ is not orthogonal.

  • it is convex
  • it is non-smooth (recall $|x|$)
  • has tuning parameter $\lambda$

Compare to best subset selection (NP-hard): $$ \min_\beta \frac 12 \sum_{i=1}^T (y - X \beta)_i^2 \quad \textrm{s.t.} \quad \| \beta \|_0 = |{\rm supp}(\beta)| \le s. $$

Image of Lasso solution

Solving the Lasso

The lasso can be written in regularized form, $$ \min_\beta \frac 12 \sum_{i=1}^T (y - X \beta)_i^2 + \lambda \sum_{j=1}^p |\beta_j|, $$ or in constrained form, $$ \min_\beta \frac 12 \sum_{i=1}^T (y - X \beta)_i^2, \quad \textrm{s.t.} \quad \sum_{j=1}^p |\beta_j| \le C. $$

  • For every $\lambda$ there is a $C$ such that the regularized form and constrained form have the same argmin
  • This correspondence is data dependent (see the small numerical sketch below)
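
A small numerical sketch of this correspondence (illustrative only; the toy data, the value 0.1, and the variable names are made up, and sklearn's alpha plays the role of $\lambda$):

## for a fixed lambda, the matching constraint level C in the constrained
## form is the L1 norm of the regularized-form solution
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
Xs, ys = rng.randn(50, 10), rng.randn(50)          # toy data
lasso = linear_model.Lasso(alpha=0.1).fit(Xs, ys)  # regularized form at lambda = 0.1
C_equiv = np.abs(lasso.coef_).sum()                # constrained form with C = ||beta_hat||_1 has the same argmin
print(C_equiv)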

Exercise 5.1. Solving the Lasso

A quadratic program (QP) is any convex optimization of the form $$ \min \beta^\top Q \beta + \beta^\top a \quad \textrm{ s.t. } A\beta \le c $$ where $Q$ is positive semi-definite.

Show that the lasso in constrained form is a QP. (Hint: write $\beta = \beta_+ - \beta_-$ where $\beta_{+,j} = \beta_{j} 1\{ \beta_j > 0\}$ and $\beta_{-,j} = - \beta_{j} 1\{ \beta_j < 0\}$).

Solution to 5.1

The objective is certainly quadratic: $$ \frac 12 \sum_{i=1}^T (y - X \beta)_i^2 = \frac 12 \beta^\top (X^\top X) \beta - \beta^\top (X^\top y) + \frac 12 \|y\|^2, $$ and we know that $X^\top X$ is PSD because $a^\top X^\top X a = \| X a\|^2 \ge 0$ for all $a$.

What about $\| \beta \|_1$?
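
One way to finish (a sketch following the hint, not the only formulation): write $\beta = \beta_+ - \beta_-$ with $\beta_+, \beta_- \ge 0$, so $\|\beta\|_1 \le \mathbf 1^\top (\beta_+ + \beta_-)$ with equality at the optimum. The constrained lasso then becomes $$ \min_{\beta_+ \ge 0, \beta_- \ge 0} \frac 12 (\beta_+ - \beta_-)^\top X^\top X (\beta_+ - \beta_-) - (\beta_+ - \beta_-)^\top X^\top y \quad \textrm{s.t.} \quad \mathbf 1^\top (\beta_+ + \beta_-) \le C, $$ a quadratic form in the stacked variable $(\beta_+, \beta_-)$ with PSD matrix $\begin{pmatrix} X^\top X & -X^\top X \\ -X^\top X & X^\top X \end{pmatrix}$, subject to linear inequality constraints, i.e. a QP.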

Solving the lasso

For a single $\lambda$ (or $C$ in the constrained form) we can solve the lasso with many specialized methods:

  • quadratic program solver
  • proximal gradient
  • alternating direction method of multipliers

but $\lambda$ is a tuning parameter. Options

  1. Construct a grid of $\lambda$ and solve each lasso
  2. Solve for all $\lambda$ values at once with a path algorithm (see the sketch below)
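
A small sklearn sketch of the two options (illustrative only; toy data and a made-up grid):

## Option 1: solve the lasso on a grid of lambda values
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
Xs, ys = rng.randn(50, 10), rng.randn(50)
coefs_grid = [linear_model.Lasso(alpha=lam).fit(Xs, ys).coef_
              for lam in np.logspace(-3, 0, 20)]

## Option 2: compute the entire path at once (LARS with the lasso modification)
alphas, active, coefs_path = linear_model.lars_path(Xs, ys, method="lasso")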

Active sets and why lasso works better

  • Let $\hat \beta_\lambda$ be the $\hat \beta$ at tuning parameter $\lambda$.
  • Define $\mathcal A_\lambda = {\rm supp}(\hat \beta_\lambda)$ the non-zero elements of $\hat \beta_\lambda$.
    1. As $\lambda \rightarrow \infty$, $|\mathcal A_\lambda| = 0$
    2. At $\lambda = 0$, $|\mathcal A_\lambda| = p$ (when the OLS solution has full support)

Forward greedy selection only adds elements to the active set and never removes them; the lasso path, by contrast, can both add and remove elements.

Exercise 5.2.1

Verify 1 and 2 above.

Lasso Path

  1. Start at $\lambda = +\infty, \hat \beta = 0$.
  2. Decrease $\lambda$ until $\hat \beta_{j_1} \ne 0$, $\mathcal A \gets \{j_1\}$. (Hitting event)
  3. Continue decreasing $\lambda$ updating $\mathcal A$ with hitting and leaving events
  • $x_{j_1}$ is the predictor variable most correlated with $y$
  • Hitting events are when element is added to $\mathcal A$
  • Leaving events are when element is removed from $\mathcal A$
  • $\hat \beta_{\lambda,j}$ is piecewise linear, continuous, as a function of $\lambda$
  • knots are at "hitting" and "leaving" events

from sklearn.org

Least Angle Regression (LAR)

  1. Standardize predictors and start with residual $r = y - \bar y$, $\hat \beta = 0$
  2. Find $x_j$ most correlated with $r$
  3. Move $\beta_j$ in the direction of $x_j^\top r$ until the residual is more correlated with another $x_k$
  4. Move $\beta_j,\beta_k$ in the direction of their joint OLS coefficients of $r$ on $(x_j,x_k)$ until some other competitor $x_l$ has as much correlation with the current residual
  5. Continue until all predictors have been entered.

Exercise 5.2.2

How do we know that LAR does not always give us the Lasso solution?

Lasso modification

4.5 If a non-zero coefficient drops to 0, then remove it from the active set and recompute the joint OLS direction restricted to the remaining active set.

from ESL
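
A small sklearn sketch of the effect of the modification (illustrative only; toy data, not the data behind the figure): run LARS with and without the lasso modification and compare. The two paths agree until a coefficient would cross zero; only the lasso-modified path drops it from the active set (a leaving event).

## LAR path vs. lasso-modified LAR path
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
Xs, ys = rng.randn(50, 10), rng.randn(50)
alphas_lar, _, coefs_lar = linear_model.lars_path(Xs, ys, method="lar")
alphas_lasso, _, coefs_lasso = linear_model.lars_path(Xs, ys, method="lasso")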


In [20]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, model_selection, linear_model

%matplotlib inline

In [21]:
## Modified from the github repo: https://github.com/JWarmenhoven/ISLR-python 
## which is based on the book by James et al. Intro to Statistical Learning.

df = pd.read_csv('../../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 20 columns):
AtBat        263 non-null int64
Hits         263 non-null int64
HmRun        263 non-null int64
Runs         263 non-null int64
RBI          263 non-null int64
Walks        263 non-null int64
Years        263 non-null int64
CAtBat       263 non-null int64
CHits        263 non-null int64
CHmRun       263 non-null int64
CRuns        263 non-null int64
CRBI         263 non-null int64
CWalks       263 non-null int64
League       263 non-null object
Division     263 non-null object
PutOuts      263 non-null int64
Assists      263 non-null int64
Errors       263 non-null int64
Salary       263 non-null float64
NewLeague    263 non-null object
dtypes: float64(1), int64(16), object(3)
memory usage: 43.1+ KB

In [22]:
## Simulate a dataset for lasso

n=100
p=1000
X = np.random.randn(n,p)
X = preprocessing.scale(X)

In [23]:
## Subselect true active set

sprob = 0.02
Sbool = np.random.rand(p) < sprob
s = np.sum(Sbool)
print("Number of non-zero's: {}".format(s))


Number of non-zero's: 20

In [24]:
## Construct beta and y

mu = 100.
beta = np.zeros(p)
beta[Sbool] = mu * np.random.randn(s)

eps = np.random.randn(n)
y = X.dot(beta) + eps

Exercise 5.3

  • Run the lasso using linear_model.lars_path with the lasso modification (see docstring with ?linear_model.lars_path)
  • Plot the lasso coefficients that are learned as a function of lambda. You should have a plot with the x-axis being lambda and the y-axis being the coefficient value, with $p=1000$ lines plotted. Highlight the $s$ coefficients that are truly non-zero by plotting them in red.

In [25]:
?linear_model.lars_path

In [26]:
## Answer to exercise 5.3
## Run lars with lasso mod, find active set

larper = linear_model.lars_path(X,y,method="lasso")
S = set(np.where(Sbool)[0])

def plot_it():
    for j in S:
        _ = plt.plot(larper[0],larper[2][j,:],'r')
    for j in set(range(p)) - S:
        _ = plt.plot(larper[0],larper[2][j,:],'k',linewidth=.75)
    _ = plt.title('Lasso path for simulated data')
    _ = plt.xlabel('lambda')
    _ = plt.ylabel('Coef')

In [27]:
plot_it()



In [28]:
## Hitters dataset

df = pd.read_csv('../../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 20 columns):
AtBat        263 non-null int64
Hits         263 non-null int64
HmRun        263 non-null int64
Runs         263 non-null int64
RBI          263 non-null int64
Walks        263 non-null int64
Years        263 non-null int64
CAtBat       263 non-null int64
CHits        263 non-null int64
CHmRun       263 non-null int64
CRuns        263 non-null int64
CRBI         263 non-null int64
CWalks       263 non-null int64
League       263 non-null object
Division     263 non-null object
PutOuts      263 non-null int64
Assists      263 non-null int64
Errors       263 non-null int64
Salary       263 non-null float64
NewLeague    263 non-null object
dtypes: float64(1), int64(16), object(3)
memory usage: 43.1+ KB

In [29]:
df.head()


Out[29]:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
Player
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A

In [30]:
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
dummies.info()
print(dummies.head())


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 6 columns):
League_A       263 non-null uint8
League_N       263 non-null uint8
Division_E     263 non-null uint8
Division_W     263 non-null uint8
NewLeague_A    263 non-null uint8
NewLeague_N    263 non-null uint8
dtypes: uint8(6)
memory usage: 3.6+ KB
                   League_A  League_N  Division_E  Division_W  NewLeague_A  \
Player                                                                       
-Alan Ashby               0         1           0           1            0   
-Alvin Davis              1         0           0           1            1   
-Andre Dawson             0         1           1           0            0   
-Andres Galarraga         0         1           1           0            0   
-Alfredo Griffin          1         0           0           1            1   

                   NewLeague_N  
Player                          
-Alan Ashby                  1  
-Alvin Davis                 0  
-Andre Dawson                1  
-Andres Galarraga            1  
-Alfredo Griffin             0  

In [31]:
y = df.Salary

# Drop the column with the independent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 19 columns):
AtBat          263 non-null float64
Hits           263 non-null float64
HmRun          263 non-null float64
Runs           263 non-null float64
RBI            263 non-null float64
Walks          263 non-null float64
Years          263 non-null float64
CAtBat         263 non-null float64
CHits          263 non-null float64
CHmRun         263 non-null float64
CRuns          263 non-null float64
CRBI           263 non-null float64
CWalks         263 non-null float64
PutOuts        263 non-null float64
Assists        263 non-null float64
Errors         263 non-null float64
League_N       263 non-null uint8
Division_W     263 non-null uint8
NewLeague_N    263 non-null uint8
dtypes: float64(16), uint8(3)
memory usage: 35.7+ KB

In [32]:
X.head(5)


Out[32]:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors League_N Division_W NewLeague_N
Player
-Alan Ashby 315.0 81.0 7.0 24.0 38.0 39.0 14.0 3449.0 835.0 69.0 321.0 414.0 375.0 632.0 43.0 10.0 1 1 1
-Alvin Davis 479.0 130.0 18.0 66.0 72.0 76.0 3.0 1624.0 457.0 63.0 224.0 266.0 263.0 880.0 82.0 14.0 0 1 0
-Andre Dawson 496.0 141.0 20.0 65.0 78.0 37.0 11.0 5628.0 1575.0 225.0 828.0 838.0 354.0 200.0 11.0 3.0 1 0 1
-Andres Galarraga 321.0 87.0 10.0 39.0 42.0 30.0 2.0 396.0 101.0 12.0 48.0 46.0 33.0 805.0 40.0 4.0 1 0 1
-Alfredo Griffin 594.0 169.0 4.0 74.0 51.0 35.0 11.0 4408.0 1133.0 19.0 501.0 336.0 194.0 282.0 421.0 25.0 0 1 0

Exercise 5.4

You should cross-validate to select the lambda just like any other tuning parameter. Sklearn gives you the option of using its fast cross-validation routine via linear_model.LassoCV (see the documentation). You can create a leave-one-out cross-validator with model_selection.LeaveOneOut and then pass it to LassoCV with the cv argument. Do this, and see what the returned fit and selected lambda are.


In [39]:
## Answer to 5.4
## Fit the lasso and cross-validate, increased max_iter to achieve convergence
loo = model_selection.LeaveOneOut()
looiter = loo.split(X)
hitlasso = linear_model.LassoCV(cv=looiter,max_iter=2000) 
hitlasso.fit(X,y)


Out[39]:
LassoCV(alphas=None, copy_X=True,
        cv=<generator object BaseCrossValidator.split at 0x7f17afebdcd0>,
        eps=0.001, fit_intercept=True, max_iter=2000, n_alphas=100, n_jobs=None,
        normalize=False, positive=False, precompute='auto', random_state=None,
        selection='cyclic', tol=0.0001, verbose=False)

In [34]:
print("The selected lambda value is {:.2f}".format(hitlasso.alpha_))


The selected lambda value is 540.66

In [35]:
hitlasso.coef_


Out[35]:
array([-0.        ,  1.49586273,  0.        ,  0.        ,  0.        ,
        1.13979587, -0.        , -0.33734233,  0.82306877,  0.        ,
        0.79947238,  0.65458696, -0.03331169,  0.27575248,  0.13166923,
       -0.        ,  0.        , -0.        ,  0.        ])

We can also compare this to the selected model from forward stagewise regression:

[-0.21830515,  0.38154135,  0.        ,  0.        ,  0.        ,
        0.16139123,  0.        ,  0.        ,  0.        ,  0.        ,
        0.09994524,  0.56696569, -0.16872682,  0.16924078,  0.        ,
        0.        ,  0.        , -0.19429699,  0.        ]

This is not exactly the same model; the two fits differ in the inclusion or exclusion of AtBat, CAtBat, CHits, Assists, and Division_W.


In [36]:
bforw = np.array([-0.21830515,  0.38154135,  0.        ,  0.        ,  0.        ,
        0.16139123,  0.        ,  0.        ,  0.        ,  0.        ,
        0.09994524,  0.56696569, -0.16872682,  0.16924078,  0.        ,
        0.        ,  0.        , -0.19429699,  0.        ])

In [37]:
print(", ".join(X.columns[(hitlasso.coef_ != 0.) != (bforw != 0.)]))


AtBat, CAtBat, CHits, Assists, Division_W