The Lasso

Modified from the GitHub repo https://github.com/JWarmenhoven/ISLR-python, which is based on the book An Introduction to Statistical Learning by James et al.


In [2]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale 
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression, lars_path, Lasso, LassoCV

%matplotlib inline

In [3]:
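# Simulate a high-dimensional design: n observations, p >> n features
n = 100
p = 1000
X = np.random.randn(n, p)
# Standardize each column to zero mean and unit variance
X = scale(X)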

In [74]:
# Each coefficient is non-zero independently with probability sprob
sprob = 0.02
Sbool = np.random.rand(p) < sprob
s = np.sum(Sbool)
print("Number of non-zeros: {}".format(s))


Number of non-zeros: 25

In [75]:
# Non-zero coefficients are Gaussian with standard deviation mu
mu = 100.
beta = np.zeros(p)
beta[Sbool] = mu * np.random.randn(s)

In [76]:
# Linear model with standard Gaussian noise
eps = np.random.randn(n)
y = X.dot(beta) + eps

In [77]:
# lars_path returns (alphas, active, coefs); coefs has one row per feature
larper = lars_path(X, y, method="lasso")

In [81]:
S = set(np.where(Sbool)[0])

In [92]:
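# Coefficient paths of the truly non-zero variables, in red
for j in S:
    _ = plt.plot(larper[0], larper[2][j, :], 'r')
# Coefficient paths of the truly zero variables, in thin black
for j in set(range(p)) - S:
    _ = plt.plot(larper[0], larper[2][j, :], 'k', linewidth=.5)
_ = plt.title('Lasso path for simulated data')
_ = plt.xlabel('lambda')
_ = plt.ylabel('Coef')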
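
The plot shows the paths, but it can also help to count how much of the true support the Lasso picks up. The following is a minimal sketch of that check, not part of the original notebook. It regenerates data along the same lines as the cells above, with a fixed seed and the names rng, support_mask, true_support and selected introduced here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import lars_path
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
n, p = 100, 1000
X = scale(rng.standard_normal((n, p)))

# Sparse coefficient vector: each entry is non-zero with probability 0.02
support_mask = rng.random(p) < 0.02
beta = np.zeros(p)
beta[support_mask] = 100.0 * rng.standard_normal(support_mask.sum())
y = X @ beta + rng.standard_normal(n)

# lars_path returns (alphas, active, coefs); coefs has shape (p, len(alphas))
alphas, active, coefs = lars_path(X, y, method="lasso")

true_support = set(np.flatnonzero(support_mask))
# Variables with a non-zero coefficient at the last point of the path
selected = set(np.flatnonzero(coefs[:, -1]))

print("true support size:", len(true_support))
print("selected at end of path:", len(selected))
print("true variables recovered:", len(true_support & selected))
```

Replacing coefs[:, -1] with another column gives the same comparison at a larger penalty, since each column of coefs corresponds to one entry of alphas.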


Hitters dataset

Let's load the dataset from the previous lab.


In [2]:
# In R, I exported the dataset from package 'ISLR' to a csv file.
df = pd.read_csv('Data/Hitters.csv', index_col=0).dropna()
df.index.name = 'Player'
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 20 columns):
AtBat        263 non-null int64
Hits         263 non-null int64
HmRun        263 non-null int64
Runs         263 non-null int64
RBI          263 non-null int64
Walks        263 non-null int64
Years        263 non-null int64
CAtBat       263 non-null int64
CHits        263 non-null int64
CHmRun       263 non-null int64
CRuns        263 non-null int64
CRBI         263 non-null int64
CWalks       263 non-null int64
League       263 non-null object
Division     263 non-null object
PutOuts      263 non-null int64
Assists      263 non-null int64
Errors       263 non-null int64
Salary       263 non-null float64
NewLeague    263 non-null object
dtypes: float64(1), int64(16), object(3)
memory usage: 43.1+ KB

In [3]:
df.head()


Out[3]:
                   AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  CHits  \
Player                                                                          
-Alan Ashby          315    81      7    24   38     39     14    3449    835   
-Alvin Davis         479   130     18    66   72     76      3    1624    457   
-Andre Dawson        496   141     20    65   78     37     11    5628   1575   
-Andres Galarraga    321    87     10    39   42     30      2     396    101   
-Alfredo Griffin     594   169      4    74   51     35     11    4408   1133   

                   CHmRun  CRuns  CRBI  CWalks League Division  PutOuts  \
Player                                                                    
-Alan Ashby            69    321   414     375      N        W      632   
-Alvin Davis           63    224   266     263      A        W      880   
-Andre Dawson         225    828   838     354      N        E      200   
-Andres Galarraga      12     48    46      33      N        E      805   
-Alfredo Griffin       19    501   336     194      A        W      282   

                   Assists  Errors  Salary NewLeague  
Player                                                
-Alan Ashby             43      10   475.0         N  
-Alvin Davis            82      14   480.0         A  
-Andre Dawson           11       3   500.0         N  
-Andres Galarraga       40       4    91.5         N  
-Alfredo Griffin       421      25   750.0         A  

In [4]:
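# dtype='float64' keeps the dummies numeric; newer pandas versions default to bool
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']], dtype='float64')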
dummies.info()
print(dummies.head())


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 6 columns):
League_A       263 non-null float64
League_N       263 non-null float64
Division_E     263 non-null float64
Division_W     263 non-null float64
NewLeague_A    263 non-null float64
NewLeague_N    263 non-null float64
dtypes: float64(6)
memory usage: 14.4+ KB
                   League_A  League_N  Division_E  Division_W  NewLeague_A  \
Player                                                                       
-Alan Ashby             0.0       1.0         0.0         1.0          0.0   
-Alvin Davis            1.0       0.0         0.0         1.0          1.0   
-Andre Dawson           0.0       1.0         1.0         0.0          0.0   
-Andres Galarraga       0.0       1.0         1.0         0.0          0.0   
-Alfredo Griffin        1.0       0.0         0.0         1.0          1.0   

                   NewLeague_N  
Player                          
-Alan Ashby                1.0  
-Alvin Davis               0.0  
-Andre Dawson              1.0  
-Andres Galarraga          1.0  
-Alfredo Griffin           0.0  

In [5]:
y = df.Salary

# Drop the column with the dependent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X.info()


<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, -Alan Ashby to -Willie Wilson
Data columns (total 19 columns):
AtBat          263 non-null float64
Hits           263 non-null float64
HmRun          263 non-null float64
Runs           263 non-null float64
RBI            263 non-null float64
Walks          263 non-null float64
Years          263 non-null float64
CAtBat         263 non-null float64
CHits          263 non-null float64
CHmRun         263 non-null float64
CRuns          263 non-null float64
CRBI           263 non-null float64
CWalks         263 non-null float64
PutOuts        263 non-null float64
Assists        263 non-null float64
Errors         263 non-null float64
League_N       263 non-null float64
Division_W     263 non-null float64
NewLeague_N    263 non-null float64
dtypes: float64(19)
memory usage: 41.1+ KB

In [6]:
X.head(5)


Out[6]:
                   AtBat   Hits  HmRun  Runs   RBI  Walks  Years  CAtBat  \
Player                                                                     
-Alan Ashby        315.0   81.0    7.0  24.0  38.0   39.0   14.0  3449.0   
-Alvin Davis       479.0  130.0   18.0  66.0  72.0   76.0    3.0  1624.0   
-Andre Dawson      496.0  141.0   20.0  65.0  78.0   37.0   11.0  5628.0   
-Andres Galarraga  321.0   87.0   10.0  39.0  42.0   30.0    2.0   396.0   
-Alfredo Griffin   594.0  169.0    4.0  74.0  51.0   35.0   11.0  4408.0   

                    CHits  CHmRun  CRuns   CRBI  CWalks  PutOuts  Assists  \
Player                                                                      
-Alan Ashby         835.0    69.0  321.0  414.0   375.0    632.0     43.0   
-Alvin Davis        457.0    63.0  224.0  266.0   263.0    880.0     82.0   
-Andre Dawson      1575.0   225.0  828.0  838.0   354.0    200.0     11.0   
-Andres Galarraga   101.0    12.0   48.0   46.0    33.0    805.0     40.0   
-Alfredo Griffin   1133.0    19.0  501.0  336.0   194.0    282.0    421.0   

                   Errors  League_N  Division_W  NewLeague_N  
Player                                                        
-Alan Ashby          10.0       1.0         1.0          1.0  
-Alvin Davis         14.0       0.0         1.0          0.0  
-Andre Dawson         3.0       1.0         0.0          1.0  
-Andres Galarraga     4.0       1.0         0.0          1.0  
-Alfredo Griffin     25.0       0.0         1.0          0.0  

Exercise: Compare the previous methods to the Lasso on this dataset. Tune $\lambda$ and compare the leave-one-out (LOO) risk to that of the other methods (ridge, forward selection, etc.).
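
One possible starting point for the tuning step is sketched below. It is only an outline under stated assumptions, not a reference solution: the helper loo_risk, the grid alpha_grid and the synthetic X_demo, y_demo are hypothetical names introduced here, and the synthetic data stands in for the Hitters X and y so that the block runs on its own. Note that scikit-learn calls the penalty alpha, and its Lasso objective is (1/(2n))||y - Xw||^2 + alpha*||w||_1, so alpha plays the role of $\lambda$ up to that scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def loo_risk(X, y, alpha):
    """Leave-one-out mean squared error of a Lasso fit on standardized features."""
    # Scaling inside the pipeline means it is refit on each training fold
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10000))
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    return -scores.mean()


# Synthetic stand-in data (hypothetical; swap in the Hitters X and y)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((60, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.standard_normal(60)

# Log-spaced grid of penalties; the range is an arbitrary choice
alpha_grid = np.logspace(-3, 1, 10)
risks = np.array([loo_risk(X_demo, y_demo, a) for a in alpha_grid])
best_alpha = alpha_grid[risks.argmin()]

print("best alpha:", best_alpha)
print("LOO risk at best alpha:", risks.min())
```

The same loo_risk pattern works for ridge or any other estimator by swapping the last pipeline step, which keeps the comparison between methods on the same footing.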