Recall the first-order condition: a differentiable $f$ is convex if and only if $$ f(x) \ge f(x_0) + \nabla f(x_0)^\top (x - x_0), \quad \forall x, x_0. $$ In particular, when $\nabla f(x_0) = 0$ this gives $$ f(x) \ge f(x_0), \quad \forall x, $$ so any fixed point of gradient descent is a global minimum (for convex, differentiable $f$).
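A quick numerical sanity check of the first-order inequality for the convex function $f(x) = \|x\|_2^2$, whose gradient is $2x$:
In [ ]:
import numpy as np

# Check f(x) >= f(x0) + grad f(x0)^T (x - x0) for f(x) = ||x||^2 (gradient 2x),
# at a random pair of points.
rng = np.random.default_rng(0)
x, x0 = rng.standard_normal(5), rng.standard_normal(5)
f = lambda z: np.sum(z ** 2)
grad = lambda z: 2 * z
print(f(x) >= f(x0) + grad(x0) @ (x - x0))   # True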
Soft thresholding is commonly used with orthonormal bases. We want to minimize $$ \frac 12 \sum_{i=1}^T (y - W \beta)_i^2 + \lambda \sum_{i=1}^T |\beta_i|, $$ where the columns of $W$ form an orthonormal basis (here, a discrete Haar wavelet basis).
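When $W^\top W = I$, substituting $z = W^\top y$ shows the objective decouples coordinate-wise: $$ \frac 12 \| y - W \beta \|_2^2 + \lambda \sum_{i=1}^T |\beta_i| = \frac 12 \|y\|_2^2 - \frac 12 \|z\|_2^2 + \sum_{i=1}^T \left( \frac 12 (z_i - \beta_i)^2 + \lambda |\beta_i| \right), $$ and each one-dimensional problem is minimized by soft thresholding, $\hat\beta_i = \mathrm{sign}(z_i)\,(|z_i| - \lambda)_+$.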
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [17]:
## Explore Turkish stock exchange dataset
tse = pd.read_excel('../../data/data_akbilgic.xlsx',skiprows=1)
tse = tse.rename(columns={'ISE':'TLISE','ISE.1':'USDISE'})
In [5]:
def const_wave(T, a, b):
    """Piecewise-constant (Haar-type) wavelet supported on [a, b)."""
    wave = np.zeros(T)
    s1 = (b - a) // 2
    s2 = (b - a) - s1
    norm_C = (s1 * s2 / (s1 + s2)) ** 0.5   # chosen so the wavelet has unit norm
    wave[a:a + s1] = norm_C / s1            # positive on the first half-interval
    wave[a + s1:b] = -norm_C / s2           # negative on the second half-interval
    return wave
In [6]:
def _const_wave_basis(T, a, b):
    """Recursively generate wavelets by halving [a, b) until the intervals are too small."""
    if b - a < 2:
        return []
    wave_basis = []
    wave_basis.append(const_wave(T, a, b))
    mid_pt = a + (b - a) // 2
    wave_basis += _const_wave_basis(T, a, mid_pt)
    wave_basis += _const_wave_basis(T, mid_pt, b)
    return wave_basis
In [7]:
def const_wave_basis(T, a, b):
    """Constant 'father' wavelet plus the recursively generated Haar-type wavelets."""
    father = np.ones(T) / T ** 0.5
    return [father] + _const_wave_basis(T, a, b)
In [9]:
# Construct discrete Haar wavelet basis
T,p = tse.shape
wave_basis = const_wave_basis(T,0,T)
W = np.array(wave_basis).T
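A quick sanity check that the constructed basis is orthonormal, so the wavelet transform below is just a rotation of the data:
In [ ]:
# W should be square with W^T W = I (up to floating-point error).
print(W.shape)
print(np.allclose(W.T @ W, np.eye(T)))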
In [13]:
_ = plt.plot(W[:,:3])
In [14]:
def soft(y, lamb):
    """Soft-thresholding operator: zero out |y| <= lamb, shrink the rest toward 0 by lamb."""
    pos_part = (y - lamb) * (y > lamb)
    neg_part = (y + lamb) * (y < -lamb)
    return pos_part + neg_part
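A small example of the operator at $\lambda = 1$: entries with magnitude at most 1 are zeroed and the rest are shrunk toward zero by 1.
In [ ]:
# Soft thresholding at lambda = 1.
print(soft(np.array([-3., -0.5, 0., 0.5, 3.]), 1.))   # [-2.  0.  0.  0.  2.]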
In [22]:
## Volatility seems most interesting:
## construct a local measure of volatility by removing a rolling
## window estimate of the mean (local centering) and squaring the residuals.
if 'date' in tse.columns:
    tse = tse.set_index('date')   # a DatetimeIndex is needed for the "7D" rolling window
tse_trem = tse - tse.rolling("7D").mean()
tse_vol = tse_trem ** 2.
## Wavelet transform, soft threshold, and reconstruct
tse_wave = W.T @ tse_vol.values
lamb = .001
tse_soft = soft(tse_wave, lamb)
tse_rec = W @ tse_soft
tse_den = tse_vol.copy()
tse_den.iloc[:, :] = tse_rec
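A quick look at how aggressively the threshold sparsifies the wavelet coefficients (fraction of nonzero coefficients per series):
In [ ]:
# Soft thresholding should zero out the vast majority of coefficients.
print((tse_soft != 0).mean(axis=0))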
In [23]:
_ = tse_vol.plot(subplots=True,figsize=(10,10))
In [24]:
_ = tse_den.plot(subplots=True,figsize=(10,10))
In [60]:
## Inspect the FTSE volatility series in the wavelet domain
ftse_col = tse_vol.columns.get_loc('FTSE')
plt.plot(tse_soft[:, ftse_col])
high_idx = np.where(np.abs(tse_soft[:, ftse_col]) > .0001)[0]
print(high_idx)
In [61]:
fig, axs = plt.subplots(len(high_idx) + 1, 1)
for i, idx in enumerate(high_idx):
    axs[i].plot(W[:, idx])                # wavelets with large surviving coefficients
axs[-1].plot(tse_den['FTSE'], c='r')      # denoised FTSE volatility, for comparison
Out[61]:
Compare to best subset selection (NP-hard): $$ \min_\beta \ \frac 12 \sum_{i=1}^n (y - X \beta)_i^2 \quad \textrm{s.t.} \quad \| \beta \|_0 = |{\rm supp}(\beta)| \le s. $$
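To see why this is combinatorial, here is a brute-force sketch (illustrative only; the helper `best_subset` is not part of the analysis above and is feasible only for very small $p$) that enumerates every support of size at most $s$ and fits least squares on each:
In [ ]:
from itertools import combinations

def best_subset(X, y, s):
    """Exhaustive best-subset search: try every support of size <= s and keep
    the one with the smallest residual sum of squares. The number of candidate
    supports grows combinatorially in p, hence the NP-hardness in general."""
    n, p = X.shape
    best_rss, best_S = np.inf, ()
    for k in range(1, s + 1):
        for S in combinations(range(p), k):
            XS = X[:, list(S)]
            beta_S = np.linalg.lstsq(XS, y, rcond=None)[0]
            resid = y - XS @ beta_S
            rss = resid @ resid
            if rss < best_rss:
                best_rss, best_S = rss, S
    return best_S, best_rss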
The lasso can be written in regularized form, $$ \min_\beta \ \frac 12 \sum_{i=1}^n (y - X \beta)_i^2 + \lambda \sum_{j=1}^p |\beta_j|, $$ or in constrained form, $$ \min_\beta \ \frac 12 \sum_{i=1}^n (y - X \beta)_i^2 \quad \textrm{s.t.} \quad \sum_{j=1}^p |\beta_j| \le C. $$
A quadratic program (QP) is a convex optimization problem of the form $$ \min_\beta \ \beta^\top Q \beta + \beta^\top a \quad \textrm{s.t.} \quad A\beta \le c, $$ where $Q$ is positive semi-definite.
Claim: the lasso (constrained form) is a QP.
Expanding the squared error gives a quadratic in $\beta$, $$ \frac 12 \sum_{i=1}^n (y - X \beta)_i^2 = \frac 12 \beta^\top (X^\top X) \beta - \beta^\top (X^\top y) + \frac 12 y^\top y, $$ but what about $\| \beta \|_1$?
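The standard reformulation splits $\beta$ into positive and negative parts, $\beta = \beta^+ - \beta^-$ with $\beta^+, \beta^- \ge 0$, so that $$ \| \beta \|_1 \le \sum_{j=1}^p (\beta^+_j + \beta^-_j), $$ with equality whenever $\min(\beta^+_j, \beta^-_j) = 0$, which always holds at the optimum. Replacing $\|\beta\|_1 \le C$ by $\sum_{j=1}^p (\beta^+_j + \beta^-_j) \le C$ therefore gives linear constraints in the $2p$ nonnegative variables $(\beta^+, \beta^-)$, and the constrained lasso is indeed a QP.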
For a single $\lambda$ (or $C$ in the constrained form) the lasso can be solved with many specialized methods, but $\lambda$ is a tuning parameter. Options include computing the entire regularization path (e.g., with LARS, used below) or selecting $\lambda$ by cross-validation over a grid of values (as LassoCV does later in this notebook).
Forward greedy selection only adds elements to the active set and never removes them; a minimal sketch is given below.
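A minimal sketch of forward greedy selection (the helper `forward_greedy` is illustrative, not part of the analysis in this notebook): at each step it adds the column most correlated with the current residual and refits least squares on the active set.
In [ ]:
def forward_greedy(X, y, s):
    """Greedily grow an active set of size s: at each step add the column most
    correlated with the current residual, then refit least squares on the
    active set. Columns are never removed once added."""
    n, p = X.shape
    active, resid = [], y.copy()
    beta_A = np.zeros(0)
    for _ in range(s):
        scores = np.abs(X.T @ resid)
        scores[active] = -np.inf           # do not re-select active columns
        active.append(int(np.argmax(scores)))
        beta_A = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
        resid = y - X[:, active] @ beta_A
    return active, beta_A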
In [66]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, model_selection, linear_model
%matplotlib inline
In [65]:
## Modified from the github repo: https://github.com/JWarmenhoven/ISLR-python
## which is based on the book by James et al. Intro to Statistical Learning.
df = pd.read_csv('../../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()
In [90]:
## Simulate a dataset for lasso
n=100
p=1000
X = np.random.randn(n,p)
X = preprocessing.scale(X)
In [91]:
## Subselect true active set
sprob = 0.02
Sbool = np.random.rand(p) < sprob
s = np.sum(Sbool)
print("Number of non-zero's: {}".format(s))
In [92]:
## Construct beta and y
mu = 100.
beta = np.zeros(p)
beta[Sbool] = mu * np.random.randn(s)
eps = np.random.randn(n)
y = X.dot(beta) + eps
In [94]:
## Run LARS with the lasso modification; larper = (alphas, active, coefs)
larper = linear_model.lars_path(X, y, method="lasso")
S = set(np.where(Sbool)[0])   # indices of the true active set

def plot_it():
    alphas, coefs = larper[0], larper[2]
    for j in S:                              # truly nonzero coefficients in red
        _ = plt.plot(alphas, coefs[j, :], 'r')
    for j in set(range(p)) - S:              # truly zero coefficients in black
        _ = plt.plot(alphas, coefs[j, :], 'k', linewidth=.75)
    _ = plt.title('Lasso path for simulated data')
    _ = plt.xlabel('lambda')
    _ = plt.ylabel('Coef')
In [95]:
plot_it()
In [77]:
## Hitters dataset
df = pd.read_csv('../../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()
In [78]:
df.head()
Out[78]:
In [79]:
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
dummies.info()
print(dummies.head())
In [96]:
y = df.Salary
# Drop the response variable (Salary) and the categorical columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()
In [81]:
X.head(5)
Out[81]:
In [83]:
## Choose the lasso penalty by leave-one-out cross-validation
loo = model_selection.LeaveOneOut()
looiter = loo.split(X)
hitlasso = linear_model.LassoCV(cv=looiter)
hitlasso.fit(X, y)
Out[83]:
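To see how the penalty was chosen, we can plot the leave-one-out error curve using `LassoCV`'s `alphas_` and `mse_path_` attributes:
In [ ]:
# Average leave-one-out MSE across the grid of candidate penalties,
# with the selected value marked by a dashed line.
plt.plot(hitlasso.alphas_, hitlasso.mse_path_.mean(axis=1))
plt.axvline(hitlasso.alpha_, c='r', linestyle='--')
plt.xlabel('alpha')
plt.ylabel('LOO MSE')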
In [84]:
print("The selected lambda value is {:.2f}".format(hitlasso.alpha_))
In [85]:
hitlasso.coef_
Out[85]:
We can also compare this to the selected model from forward stagewise regression:
[-0.21830515, 0.38154135, 0. , 0. , 0. ,
0.16139123, 0. , 0. , 0. , 0. ,
0.09994524, 0.56696569, -0.16872682, 0.16924078, 0. ,
0. , 0. , -0.19429699, 0. ]
This is not exactly the same model: the two fits differ in whether they include AtBat, HmRun, Runs, RBI, Years, CHmRun, Errors, League_N, Division_W, and NewLeague_N.
In [87]:
bforw = [-0.21830515, 0.38154135, 0. , 0. , 0. ,
0.16139123, 0. , 0. , 0. , 0. ,
0.09994524, 0.56696569, -0.16872682, 0.16924078, 0. ,
0. , 0. , -0.19429699, 0. ]
In [98]:
print(", ".join(X.columns[(hitlasso.coef_ != 0.) != (bforw != 0.)]))