The basic idea of regularisation is to penalise or shrink large coefficients in a regression model. This can improve the bias/variance trade-off and can help with model selection by automatically reducing the influence of irrelevant features (or, in the case of the lasso, removing them altogether).
The shrinkage methods we will see here are ridge regression and the lasso.
The Credit data set records Balance (average credit card debt for a number of individuals) as well as several quantitative predictors: age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating). In addition to these quantitative variables, we also have four qualitative variables: gender, student (student status), married (marital status), and ethnicity (Caucasian, African American or Asian).
In [4]:
import pandas as pd
# load up the Credit dataset
#
data = pd.read_csv("../datasets/credit.csv", index_col=0)
In [5]:
data.shape
Out[5]:
In [6]:
data.columns
Out[6]:
In [7]:
data.head()
Out[7]:
In [8]:
data.describe()
Out[8]:
In [9]:
data.info()
In [10]:
y = data['Balance'].copy() # our target
X = data.copy()
X.drop(['Balance'], axis=1, inplace=True)
Suppose that we wish to investigate differences in credit card balance between males and females, ignoring the other variables for the moment.
If a qualitative predictor has only two levels then we can simply create an indicator or dummy variable that takes on two possible numerical values.
We can use the pandas map method to map Female to 1 and Male to 0.
Be careful to map the labels exactly as they appear in the data, including any leading spaces (here the male label is stored as ' Male').
In [11]:
X.loc[1, 'Gender']
Out[11]:
In [12]:
X.Gender = X.Gender.map({'Female':1, ' Male':0})
X.Gender.head()
Out[12]:
The decision to code Female as 1 and Male as 0 is arbitrary and has no effect on the fitted values, but a different coding would change the value and interpretation of the coefficients.
We now use the new dummy variable as a predictor in a regression model.
In [13]:
from sklearn import linear_model
#
# : Create the linear regression model
#
model = linear_model.LinearRegression()
In [14]:
model.fit(X[['Gender']], y)
model.intercept_
Out[14]:
In [15]:
model.coef_
Out[15]:
This results in the model:
Balance = 509.7 if male
Balance = 509.7 + 19.7 if female
The value 19.7 can be interpreted as the average difference in credit card balance between females and males: females are estimated to carry $19.70 of additional debt, for a total of 509.70 + 19.70 = $529.40.
However, without checking its p-value or confidence interval (outside the scope of this notebook) we cannot say anything about the statistical significance of this difference.
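As a quick sanity check (a small sketch using separate model instances), we can compare the fitted coefficient with the raw group means, and verify that an alternative coding of the dummy changes the coefficient but not the fitted values:
In [ ]:
import numpy as np
# With a 0/1 dummy, the intercept should equal the mean balance of the males
# (coded 0) and the coefficient the female-male difference in group means.
group_means = y.groupby(X['Gender']).mean()
print(group_means)                                  # index 0 = male, 1 = female
print(group_means.loc[1] - group_means.loc[0])      # should match model.coef_[0]
# The coding itself is arbitrary: recoding Male as -1 changes the coefficient
# but leaves the fitted values unchanged.
alt = X[['Gender']].replace({0: -1})
model_alt = linear_model.LinearRegression().fit(alt, y)
print(model_alt.coef_)                              # a different slope...
print(np.allclose(model.predict(X[['Gender']]), model_alt.predict(alt)))  # ...same fit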
When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can instead split the categorical variable into several binary variables, one for each level.
For example, for the ethnicity variable we create three dummy variables, using the pandas get_dummies() function.
In [16]:
etno = pd.get_dummies(X['Ethnicity']) # creates extra columns
X = X.join(etno)
X.head()
Out[16]:
Since the dummy variables are added as new columns - rather than mapped onto the existing variable as we did for Gender - we can now drop the original Ethnicity variable:
In [17]:
X.drop(['Ethnicity'], axis=1, inplace=True)
In [18]:
features = X[['Asian', 'Caucasian', 'African American']]
model.fit(features, y)
model.intercept_
Out[18]:
In [19]:
model.coef_
Out[19]:
We see that the estimated balance for the African American category is $531.00 (= 520.60 + 10.40); the Asian category is estimated to carry $18.70 less debt than the African American category, and the Caucasian category $12.50 less.
Again, without p-values or confidence intervals we have no statistical evidence for these differences.
First we convert the other two categorical variables:
In [20]:
X.Student = X.Student.map({'Yes':1, 'No':0})
X.Married = X.Married.map({'Yes':1, 'No':0})
X.head()
Out[20]:
Now we implement a naive version of Forward Stepwise Selection:
start from the null model and, at each step, add the feature that gives the best score (here measured by the R^2 of the fit).
In [21]:
def find_best_feature(X):
    n_features = X.shape[1]
    max_score = 0
    for i in range(n_features):
        feature = X.columns[i]
        X_subset = X[[feature]]
        model.fit(X_subset, y)
        score = model.score(X_subset, y)  # R^2 metric
        if score > max_score:
            max_score = score
            max_feature = feature
    return (max_score, max_feature)

find_best_feature(X)
Out[21]:
In [22]:
n_features = X.shape[1]
scores = [0]
X_selection = X.copy()
X_subsets = pd.DataFrame(index=X.index)
for i in range(n_features):
    (_, next_best_feature) = find_best_feature(X_selection)
    print('Next best feature = {}'.format(next_best_feature))
    X_subset = X_selection[[next_best_feature]]
    X_subsets[next_best_feature] = X_subset
    model.fit(X_subsets, y)
    scores.append(model.score(X_subsets, y))
    X_selection.drop([next_best_feature], axis=1, inplace=True)
print(scores)
The first selected features are Rating, Limit, Income and Student.
With just these four features, the R^2 score is already around 0.95.
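As a quick check (a small sketch; it assumes the four column names above and uses a fresh estimator, so the stepwise results are left untouched):
In [ ]:
best4 = X[['Rating', 'Limit', 'Income', 'Student']]
model4 = linear_model.LinearRegression().fit(best4, y)
model4.score(best4, y)   # R^2 using only the four selected features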
In [23]:
import matplotlib.pyplot as plt
In [24]:
plt.title("Score versus features")
plt.xlabel('Number of features')
plt.ylabel('R squared')
plt.plot(scores)
Out[24]:
The first four features are the best predictors; the remaining ones add little value.
As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e. overfitting.
There are essentially two ways to counter overfitting: selecting a subset of the predictors, or shrinking (regularising) the coefficient estimates.
The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularises the coefficient estimates.
In [25]:
model.fit(X, y)  # re-fit the least squares model on all predictors, in column order
predictors = X.columns
coef = pd.Series(model.coef_, index=predictors).sort_values()
coef
Out[25]:
In [26]:
coef.plot(kind='bar', title="Model coefficients")
Out[26]:
The ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \;=\; \text{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2$$

As with least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small. However, the second term, $\lambda \sum_{j} \beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficient estimates. Selecting a good value for $\lambda$ is critical; cross-validation is used for this.
Scikit-learn provides the Ridge class to fit a ridge regression model.
Let's see an example using lambda = 0.05 (called alpha in scikit-learn).
In [27]:
from sklearn.linear_model import Ridge
## training the model
ridgeReg = Ridge(alpha=0.05)
ridgeReg.fit(X, y)
Out[27]:
In [28]:
ridgeReg.score(X, y)
Out[28]:
In [29]:
ridgeReg.coef_
Out[29]:
As you can see, the coefficients have been shrunk, and Cards no longer stands out.
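To make the shrinkage penalty concrete, we can recompute the two pieces of the ridge objective by hand (a minimal sketch; note that scikit-learn's Ridge does not penalise the intercept):
In [ ]:
import numpy as np
rss = np.sum((y - ridgeReg.predict(X)) ** 2)     # residual sum of squares
penalty = 0.05 * np.sum(ridgeReg.coef_ ** 2)     # lambda * sum of squared coefficients
print(rss, penalty, rss + penalty)               # the quantity ridge minimises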
The standard least squares coefficient estimates are scale equivariant: multiplying Xj by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c.
In other words, regardless of how the jth predictor is scaled, the product $X_j \hat{\beta}_j$ will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors.
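We can illustrate this with a small experiment (a sketch; it assumes the income column is named 'Income' and uses an arbitrary alpha of 100): rescaling a single predictor simply rescales its least squares coefficient, while the ridge coefficients can change in a less predictable way.
In [ ]:
from sklearn import linear_model
from sklearn.linear_model import Ridge
X_resc = X.copy()
X_resc['Income'] = X_resc['Income'] * 1000       # e.g. dollars instead of thousands of dollars
ols_a = linear_model.LinearRegression().fit(X, y)
ols_b = linear_model.LinearRegression().fit(X_resc, y)
idx = list(X.columns).index('Income')
print(ols_a.coef_[idx], ols_b.coef_[idx])        # OLS: the coefficient is simply divided by 1000
ridge_a = Ridge(alpha=100).fit(X, y)
ridge_b = Ridge(alpha=100).fit(X_resc, y)
print(ridge_a.coef_)                             # ridge: the coefficients do not just rescale
print(ridge_b.coef_)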
In [30]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)
In [31]:
import numpy as np
lambdas = [0.01, 1, 100, 10000]
coefficients = np.zeros(shape=(12, len(lambdas)))
i = 0
for l in lambdas:
    ridgeReg = Ridge(alpha=l)
    ridgeReg.fit(X, y)
    coefficients[:, i] = ridgeReg.coef_
    i += 1
In [32]:
fig, ax = plt.subplots()
plt.title("Ridge regularisation")
plt.xlabel("lambda")
plt.ylabel("standardised coefficients")
styles = ['-', '--', '-.', ':']
labels = ['0.01', '', '1', '', '100', '', '1e4', '']
ax.set_xticklabels(labels)  # custom x labels
for i in range(0, 12):
    s = styles[i % len(styles)]
    if i < 3 or i == 7:
        plt.plot(coefficients[i], label=predictors[i], linestyle=s)
    else:
        plt.plot(coefficients[i])
plt.legend(loc='best')
Out[32]:
Each curve corresponds to the ridge regression coefficient estimate for one of the variables, plotted as a function of λ.
At the left side of the plot, λ is essentially zero, and so the corresponding ridge coefficient estimates are the same as the usual least squares estimates. But as λ increases, the ridge coefficient estimates shrink towards zero. When λ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors.
In this plot, the Income, Limit, Rating and Student variables are labelled and drawn with distinct line styles, since these are the variables with the largest coefficient estimates.
Ridge regression does have one obvious disadvantage.
Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
The penalty λ will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero.
This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.
For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all of the predictors.
Increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.
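A quick check on this (a small sketch reusing the largest lambda from the plot above):
In [ ]:
import numpy as np
big_ridge = Ridge(alpha=1e4).fit(X, y)
print(np.sum(big_ridge.coef_ == 0))     # typically 0: no coefficient is exactly zero
print(np.abs(big_ridge.coef_).min())    # small, but still non-zero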
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage.
In [33]:
from sklearn.linear_model import Lasso
lassoReg = Lasso(alpha=0.3)
lassoReg.fit(X, y)
Out[33]:
In [34]:
lassoReg.score(X, y)
Out[34]:
In statistical parlance, the lasso uses an L1 penalty instead of an L2 penalty.
The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.
As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression. The lasso yields sparse models — that is, models that involve only a subset of the variables.
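We can verify this on the lasso fit above by counting how many coefficients are exactly zero (a quick check; with alpha = 0.3 the number may still be small):
In [ ]:
import numpy as np
print(np.sum(lassoReg.coef_ == 0))           # how many coefficients are exactly zero
pd.Series(lassoReg.coef_, index=predictors)  # which predictors survive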
In [35]:
lambdas = [0.01, 1, 10, 50, 100, 200, 500, 1000]
coefficients = np.zeros(shape=(12, len(lambdas)))
i = 0
for l in lambdas:
    lassoReg = Lasso(alpha=l)
    lassoReg.fit(X, y)
    coefficients[:, i] = lassoReg.coef_
    i += 1
In [36]:
fig, ax = plt.subplots()
plt.title("Lasso Regularisation")
plt.xlabel("lambda")
plt.ylabel("standardised coefficients")
styles = ['-', '--', '-.', ':']
labels = ['0.01', '1', '10', '50', '100', '200', '500', '1000']
ax.set_xticklabels(labels)  # custom x labels
for i in range(0, 12):
    s = styles[i % len(styles)]
    if i < 3 or i == 7:
        plt.plot(coefficients[i], label=predictors[i], linestyle=s)
    else:
        plt.plot(coefficients[i])
plt.legend(loc='best')
Out[36]:
When λ = 0, the lasso simply gives the least squares fit and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.
However, in between these two extremes, the ridge regression and lasso models are quite different from each other, as you can see from the two plots.
Both in ridge regression and the lasso, the tuning parameter serves to control the relative impact of the shrinkage penalty on the regression coefficient estimates.
Selecting a good value for λ is critical; cross-validation is used for this.
Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values and compute the cross-validation error for each value of λ. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
In [38]:
from sklearn.model_selection import cross_val_score, KFold
lambdas = np.linspace(500, 0.01, num=50)
scoresCV = []
for l in lambdas:
    lassoReg = Lasso(alpha=l)
    scoreCV = cross_val_score(lassoReg, X, y, scoring='neg_mean_squared_error',
                              cv=KFold(n_splits=10, shuffle=True, random_state=1))
    # neg_mean_squared_error returns the negative MSE, so flip the sign
    scoresCV.append(-np.mean(scoreCV))
In [39]:
plt.title("Lambda tuning for Lasso Regularisation")
plt.xlabel("lambda")
plt.ylabel("Cross-Validation error (MSE)")
plt.plot(lambdas,scoresCV)
Out[39]:
The plot displays how the cross-validation error changes with different values of lambda.
The point at which the cross-validation error is smallest is for very small values of lambda, close to zero.
Thanks to the scikit-learn class LassoCV, which applies cross-validation to the lasso, we can find the value of lambda that minimises the cross-validation error:
In [40]:
from sklearn.linear_model import LassoCV
lassoCV = LassoCV()
lassoCV.fit(X,y)
Out[40]:
In [41]:
lassoCV.alpha_
Out[41]:
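LassoCV also re-fits the lasso on all of the observations using the selected alpha, so we can directly inspect which predictors remain in the final model:
In [ ]:
pd.Series(lassoCV.coef_, index=predictors).sort_values()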