Recall 1st Order Condition. If f is differentiable then it is convex if $$ f(x) \ge f(x_0) + \nabla f(x_0)^\top (x - x_0), \forall x,x_0 $$ and when $\nabla f(x_0) = 0$ then $$ f(x) \ge f(x_0), \forall x $$ so any fixed point of gradient descent is a global min (for convex, differentiable f)
Soft thresholding is commonly used for orthonormal bases.
Want to minimize $$ \frac 12 \sum_{i=1}^T (y - W \beta)_i^2 + \lambda \sum_{i=1}^T |\beta_i|. $$
In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [8]:
## Explore Turkish stock exchange dataset
tse = pd.read_excel('../../data/data_akbilgic.xlsx',skiprows=1)
tse = tse.rename(columns={'ISE':'TLISE','ISE.1':'USDISE'})
In [9]:
def const_wave(T,a,b):
wave = np.zeros(T)
s1 = (b-a) // 2
s2 = (b-a) - s1
norm_C = (s1*s2 / (s1+s2))**0.5
wave[a:a+s1] = norm_C / s1
wave[a+s1:b] = -norm_C / s2
return wave
In [10]:
def _const_wave_basis(T,a,b):
if b-a < 2:
return []
wave_basis = []
wave_basis.append(const_wave(T,a,b))
mid_pt = a + (b-a)//2
wave_basis += _const_wave_basis(T,a,mid_pt)
wave_basis += _const_wave_basis(T,mid_pt,b)
return wave_basis
In [11]:
def const_wave_basis(T,a,b):
father = np.ones(T) / T**0.5
return [father] + _const_wave_basis(T,a,b)
In [12]:
# Construct discrete Haar wavelet basis
T,p = tse.shape
wave_basis = const_wave_basis(T,0,T)
W = np.array(wave_basis).T
In [13]:
_ = plt.plot(W[:,:3])
In [14]:
def soft(y,lamb):
pos_part = (y - lamb) * (y > lamb)
neg_part = (y + lamb) * (y < -lamb)
return pos_part + neg_part
In [15]:
## Volatility seems most interesting
## will construct local measure of volatility
## remove rolling window estimate (local centering)
## square the residuals
tse = tse.set_index('date')
tse_trem = tse - tse.rolling("7D").mean()
tse_vol = tse_trem**2.
## Make wavelet transformation and soft threshold
tse_wave = W.T @ tse_vol.values
lamb = .001
tse_soft = soft(tse_wave,lamb)
tse_rec = W @ tse_soft
tse_den = tse_vol.copy()
tse_den.iloc[:,:] = tse_rec
In [16]:
_ = tse_vol.plot(subplots=True,figsize=(10,10))
In [17]:
_ = tse_den.plot(subplots=True,figsize=(10,10))
In [18]:
plt.plot(tse_soft[:,4])
high_idx = np.where(np.abs(tse_soft[:,4]) > .0001)[0]
print(high_idx)
In [19]:
fig, axs = plt.subplots(len(high_idx) + 1,1)
for i, idx in enumerate(high_idx):
axs[i].plot(W[:,idx])
plt.plot(tse_den['FTSE'],c='r')
Out[19]:
Compare to best subset selection (NP-hard): $$ \min \frac 12 \sum_{i=1}^T (y - X \beta)_i^2. $$ for $$ \| \beta \|_0 = |{\rm supp}(\beta)| < s. $$
The lasso can be written in regularized form, $$ \min \frac 12 \sum_{i=1}^T (y - X \beta)_i^2 + \lambda \sum_{i=1}^T |\beta_i|, $$ or in constrained form, $$ \min \frac 12 \sum_{i=1}^T (y - X \beta)_i^2, \quad \textrm{s.t.} \sum_{i=1}^T |\beta_i| \le C, $$
A quadratic program (QP) is any convex optimization of the form $$ \min \beta^\top Q \beta + \beta^\top a \quad \textrm{ s.t. } A\beta \le c $$ where $Q$ is positive semi-definite.
Show that the lasso in constrained form is a QP. (Hint: write $\beta = \beta_+ - \beta_-$ where $\beta_{+,j} = \beta_{j} 1\{ \beta_j > 0\}$ and $\beta_{-,j} = - \beta_{j} 1\{ \beta_j < 0\}$).
Solution to 5.1
The objective is certainly quadratic... $$ \frac 12 \sum_{i=1}^T (y - X \beta)_i^2 = \frac 12 \beta^\top (X^\top X) \beta - \beta^\top (X^\top y) + C $$ and we know that $X^\top X$ is PSD because $a^\top X^\top X a = \| X a\|^2 \ge 0$.
What about $\| \beta \|_1$?
For a single $\lambda$ (or $C$ in constrained form) can solve the lasso with many specialized methods
but $\lambda$ is a tuning parameter. Options
Forward greedy selection only adds elements to the active set, does not remove elements.
Verify 1 and 2 above.
How do we know that LAR does not give us the Lasso solution?
In [20]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, model_selection, linear_model
%matplotlib inline
In [21]:
## Modified from the github repo: https://github.com/JWarmenhoven/ISLR-python
## which is based on the book by James et al. Intro to Statistical Learning.
df = pd.read_csv('../../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()
In [22]:
## Simulate a dataset for lasso
n=100
p=1000
X = np.random.randn(n,p)
X = preprocessing.scale(X)
In [23]:
## Subselect true active set
sprob = 0.02
Sbool = np.random.rand(p) < sprob
s = np.sum(Sbool)
print("Number of non-zero's: {}".format(s))
In [24]:
## Construct beta and y
mu = 100.
beta = np.zeros(p)
beta[Sbool] = mu * np.random.randn(s)
eps = np.random.randn(n)
y = X.dot(beta) + eps
linear_model.lars_path
with the lasso modification (see docstring with ?linear_model.lars_path)
In [25]:
?linear_model.lars_path
In [26]:
## Answer to exercise 5.3
## Run lars with lasso mod, find active set
larper = linear_model.lars_path(X,y,method="lasso")
S = set(np.where(Sbool)[0])
def plot_it():
for j in S:
_ = plt.plot(larper[0],larper[2][j,:],'r')
for j in set(range(p)) - S:
_ = plt.plot(larper[0],larper[2][j,:],'k',linewidth=.75)
_ = plt.title('Lasso path for simulated data')
_ = plt.xlabel('lambda')
_ = plt.ylabel('Coef')
In [27]:
plot_it()
In [28]:
## Hitters dataset
df = pd.read_csv('../../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()
In [29]:
df.head()
Out[29]:
In [30]:
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
dummies.info()
print(dummies.head())
In [31]:
y = df.Salary
# Drop the column with the independent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()
In [32]:
X.head(5)
Out[32]:
You should cross-validate to select the lambda just like any other tuning parameter. Sklearn gives you the option of using their fast cross-validation script via linear_model.LassoCV
, see the documentation. You can create a leave-one-out cross validator with model_selection.LeaveOneOut
then pass this to LassoCV
with the cv
argument. Do this, and see what the returned fit and selected lambda are.
In [39]:
## Answer to 5.4
## Fit the lasso and cross-validate, increased max_iter to achieve convergence
loo = model_selection.LeaveOneOut()
looiter = loo.split(X)
hitlasso = linear_model.LassoCV(cv=looiter,max_iter=2000)
hitlasso.fit(X,y)
Out[39]:
In [34]:
print("The selected lambda value is {:.2f}".format(hitlasso.alpha_))
In [35]:
hitlasso.coef_
Out[35]:
We can also compare this to the selected model from forward stagewise regression:
[-0.21830515, 0.38154135, 0. , 0. , 0. ,
0.16139123, 0. , 0. , 0. , 0. ,
0.09994524, 0.56696569, -0.16872682, 0.16924078, 0. ,
0. , 0. , -0.19429699, 0. ]
This is not exactly the same model with differences in the inclusion or exclusion of AtBat, HmRun, Runs, RBI, Years, CHmRun, Errors, League_N, Division_W, NewLeague_N
In [36]:
bforw = [-0.21830515, 0.38154135, 0. , 0. , 0. ,
0.16139123, 0. , 0. , 0. , 0. ,
0.09994524, 0.56696569, -0.16872682, 0.16924078, 0. ,
0. , 0. , -0.19429699, 0. ]
In [37]:
print(", ".join(X.columns[(hitlasso.coef_ != 0.) != (bforw != 0.)]))