Modified from the github repo: https://github.com/JWarmenhoven/ISLR-python which is based on the book by James et al. Intro to Statistical Learning.
In [1]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression, lars_path, Lasso, LassoCV
%matplotlib inline
In [2]:
n=100
p=1000
X = np.random.randn(n,p)
X = scale(X)
In [3]:
sprob = 0.02
Sbool = np.random.rand(p) < sprob
s = np.sum(Sbool)
print("Number of non-zero's: {}".format(s))
In [4]:
mu = 100.
beta = np.zeros(p)
beta[Sbool] = mu * np.random.randn(s)
In [5]:
eps = np.random.randn(n)
y = X.dot(beta) + eps
In [6]:
larper = lars_path(X,y,method="lasso")
In [7]:
S = set(np.where(Sbool)[0])
In [8]:
for j in S:
_ = plt.plot(larper[0],larper[2][j,:],'r')
for j in set(range(p)) - S:
_ = plt.plot(larper[0],larper[2][j,:],'k',linewidth=.5)
_ = plt.title('Lasso path for simulated data')
_ = plt.xlabel('lambda')
_ = plt.ylabel('Coef')
Let's load the dataset from the previous lab.
In [14]:
# In R, I exported the dataset from package 'ISLR' to a csv file.
df = pd.read_csv('../data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()
In [15]:
df.head()
Out[15]:
In [16]:
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
dummies.info()
print(dummies.head())
In [21]:
y = df.Salary
# Drop the column with the independent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()
In [18]:
X.head(5)
Out[18]:
Exercise Compare the previous methods to the Lasso on this dataset. Tune $\lambda$ and compare the LOO risk to other methods (ridge, forward selection, etc.)
The following is a fast implementation of the lasso path cross-validated using LOO.
In [25]:
loo = LeaveOneOut()
looiter = loo.split(X)
hitlasso = LassoCV(cv=looiter)
hitlasso.fit(X,y)
Out[25]:
In [32]:
print("The selected lambda value is {:.2f}".format(hitlasso.alpha_))
The following is the fitted coefficient vector for this chosen lambda.
In [30]:
hitlasso.coef_
Out[30]:
In [36]:
np.mean(hitlasso.mse_path_[hitlasso.alphas_ == hitlasso.alpha_])
Out[36]:
The above is the MSE for the selected model. The best performance for ridge regression was roughly 120,000, so this does not outperform ridge. We can also compare this to the selected model from forward stagewise regression:
[-0.21830515, 0.38154135, 0. , 0. , 0. ,
0.16139123, 0. , 0. , 0. , 0. ,
0.09994524, 0.56696569, -0.16872682, 0.16924078, 0. ,
0. , 0. , -0.19429699, 0. ]
This is not exactly the same model with differences in the inclusion or exclusion of AtBat, HmRun, Runs, RBI, Years, CHmRun, Errors, League_N, Division_W, NewLeague_N
In [41]:
bforw = [-0.21830515, 0.38154135, 0. , 0. , 0. ,
0.16139123, 0. , 0. , 0. , 0. ,
0.09994524, 0.56696569, -0.16872682, 0.16924078, 0. ,
0. , 0. , -0.19429699, 0. ]
In [44]:
print(", ".join(X.columns[(hitlasso.coef_ != 0.) != (bforw != 0.)]))