Regularisation

The basic idea of regularisation is to penalise, i.e. shrink, the large coefficients of a regression model. This can improve the bias/variance trade-off and can help with model selection by shrinking the coefficients of irrelevant features towards zero.

The shrinkage methods that we will see here are ridge regression and the lasso.

An example: predict credit balance

The Credit data set records Balance (average credit card debt for a number of individuals) as well as several quantitative predictors: age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating). In addition to these quantitative variables, we also have four qualitative variables: gender, student (student status), married (marital status), and ethnicity (Caucasian, African American or Asian).


In [4]:
import pandas as pd

# load up the Credit dataset 
#
data = pd.read_csv("../datasets/credit.csv", index_col=0)

In [5]:
data.shape


Out[5]:
(400, 11)

In [6]:
data.columns


Out[6]:
Index(['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Gender',
       'Student', 'Married', 'Ethnicity', 'Balance'],
      dtype='object')

In [7]:
data.head()


Out[7]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
3 104.593 7075 514 4 71 11 Male No No Asian 580
4 148.924 9504 681 3 36 11 Female No No Asian 964
5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331

In [8]:
data.describe()


Out[8]:
Income Limit Rating Cards Age Education Balance
count 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000
mean 45.218885 4735.600000 354.940000 2.957500 55.667500 13.450000 520.015000
std 35.244273 2308.198848 154.724143 1.371275 17.249807 3.125207 459.758877
min 10.354000 855.000000 93.000000 1.000000 23.000000 5.000000 0.000000
25% 21.007250 3088.000000 247.250000 2.000000 41.750000 11.000000 68.750000
50% 33.115500 4622.500000 344.000000 3.000000 56.000000 14.000000 459.500000
75% 57.470750 5872.750000 437.250000 4.000000 70.000000 16.000000 863.000000
max 186.634000 13913.000000 982.000000 9.000000 98.000000 20.000000 1999.000000

In [9]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 1 to 400
Data columns (total 11 columns):
Income       400 non-null float64
Limit        400 non-null int64
Rating       400 non-null int64
Cards        400 non-null int64
Age          400 non-null int64
Education    400 non-null int64
Gender       400 non-null object
Student      400 non-null object
Married      400 non-null object
Ethnicity    400 non-null object
Balance      400 non-null int64
dtypes: float64(1), int64(6), object(4)
memory usage: 37.5+ KB

Prepare training data: features and target


In [10]:
y = data['Balance'].copy() # our target
X = data.copy()
X.drop(['Balance'], axis=1, inplace=True)

Predict balance from gender (variable with two levels)

Suppose that we wish to investigate differences in credit card balance between males and females, ignoring the other variables for the moment.
If a qualitative predictor has only two levels then we can simply create an indicator or dummy variable that takes on two possible numerical values.

We can use the pandas map method to map Female to 1 and Male to 0.
Be careful that the labels need to match exactly, including any leading spaces.


In [11]:
X.loc[1, 'Gender']


Out[11]:
' Male'

In [12]:
X.Gender = X.Gender.map({'Female':1, ' Male':0})
X.Gender.head()


Out[12]:
1    0
2    1
3    0
4    1
5    0
Name: Gender, dtype: int64

The decision to code females as 1 and males as 0 is arbitrary and has no effect on the fitted values, but a different coding would change the values and the interpretation of the intercept and coefficient.

We now use the new dummy variable as a predictor in a regression model.


In [13]:
from sklearn import linear_model

#
# : Create the linear regression model 
#
model = linear_model.LinearRegression()

In [14]:
model.fit(X[['Gender']], y)
model.intercept_


Out[14]:
509.80310880829012

In [15]:
model.coef_


Out[15]:
array([ 19.73312308])

This results in the model:
Balance ≈ 509.80 if male
Balance ≈ 509.80 + 19.73 if female
So 19.73 can be interpreted as the average difference in credit card balance between females and males (females are estimated to carry about $19.73 in additional debt, for a total of 509.80 + 19.73 = $529.53).
However, without checking the p-value or confidence interval of this coefficient (outside the scope of this notebook) we cannot say anything about the statistical significance of this difference.
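
As a quick sanity check, we can ask the fitted model to predict at the two dummy values (0 = male, 1 = female); the results should match the intercept and intercept + coefficient above.

In [ ]:
# predictions for Gender = 0 (male) and Gender = 1 (female);
# expected: roughly 509.80 and 529.53
model.predict([[0], [1]])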

Predict balance from ethnicity (variable with 3 levels)

When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation we create several binary (dummy) variables, one for each level.
For example, for the ethnicity variable we create three dummy variables, using the pandas get_dummies() function.


In [16]:
etno = pd.get_dummies(X['Ethnicity']) # creates extra columns
X = X.join(etno)
X.head()


Out[16]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity African American Asian Caucasian
1 14.891 3606 283 2 34 11 0 No Yes Caucasian 0.0 0.0 1.0
2 106.025 6645 483 3 82 15 1 Yes Yes Asian 0.0 1.0 0.0
3 104.593 7075 514 4 71 11 0 No No Asian 0.0 1.0 0.0
4 148.924 9504 681 3 36 11 1 No No Asian 0.0 1.0 0.0
5 55.882 4897 357 2 68 16 0 No Yes Caucasian 0.0 0.0 1.0

Since the dummy variables are added as new columns - rather than replacing the values of the existing column, as we did for Gender - we can now drop the original Ethnicity variable:


In [17]:
X.drop(['Ethnicity'], axis=1, inplace=True)

In [18]:
features = X[['Asian', 'Caucasian', 'African American']]

model.fit(features, y)
model.intercept_


Out[18]:
520.60373764246071

In [19]:
model.coef_


Out[19]:
array([ -8.29001215,  -2.10625021,  10.39626236])

We see that the estimated balance for the African American category is $531.00 = $520.60 + $10.40; the Asian category is estimated to carry $18.69 less debt than the African American category (-8.29 - 10.40), and the Caucasian category $12.50 less (-2.11 - 10.40). Note that because the three dummies sum to one they are collinear with the intercept, so only the differences between categories are meaningful here.

Again, we say nothing here about statistical significance.
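
As an illustration (not part of the original analysis), we could instead treat African American as the baseline level and keep only the other two dummies; the intercept then estimates the African American balance directly and each coefficient is the difference from that baseline.

In [ ]:
# hypothetical alternative coding with African American as the baseline level;
# expected roughly: intercept 531.00, Asian -18.69, Caucasian -12.50
baseline_model = linear_model.LinearRegression()
baseline_model.fit(X[['Asian', 'Caucasian']], y)
baseline_model.intercept_, baseline_model.coef_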

Predict from all variables

First we convert the other two categorical variables:


In [20]:
X.Student = X.Student.map({'Yes':1, 'No':0})
X.Married = X.Married.map({'Yes':1, 'No':0})
X.head()


Out[20]:
Income Limit Rating Cards Age Education Gender Student Married African American Asian Caucasian
1 14.891 3606 283 2 34 11 0 0 1 0.0 0.0 1.0
2 106.025 6645 483 3 82 15 1 1 1 0.0 1.0 0.0
3 104.593 7075 514 4 71 11 0 0 0 0.0 1.0 0.0
4 148.924 9504 681 3 36 11 1 0 0 0.0 1.0 0.0
5 55.882 4897 357 2 68 16 0 0 1 0.0 0.0 1.0

Now we implement a naive version of forward stepwise selection:
start from the null model and, at each step, add the feature that gives the best score (here the R^2 metric).


In [21]:
def find_best_feature(X):
    """Return (score, name) of the single best feature in X,
    fitting the global `model` against the target `y`."""
    n_features = X.shape[1]
    max_score = 0
    max_feature = None
    for i in range(n_features):
        feature = X.columns[i]
        X_subset = X[[feature]]          # single-column DataFrame
        model.fit(X_subset, y)
        score = model.score(X_subset, y) # R^2 metric
        if score > max_score:
            max_score = score
            max_feature = feature

    return (max_score, max_feature)

find_best_feature(X)


Out[21]:
(0.74584841805850366, 'Rating')

In [22]:
n_features = X.shape[1]
scores = [0]
X_selection = X.copy()
X_subsets = pd.DataFrame(index=X.index)

for i in range(n_features):
    (_, next_best_feature) = find_best_feature(X_selection)
    print('Next best feature = {}'.format(next_best_feature))
    X_subset = X_selection[[next_best_feature]]
    X_subsets[next_best_feature] = X_subset

    model.fit(X_subsets, y)
    scores.append(model.score(X_subsets, y))
    
    X_selection.drop([next_best_feature], axis=1, inplace=True)

print(scores)


Next best feature = Rating
Next best feature = Limit
Next best feature = Income
Next best feature = Student
Next best feature = Cards
Next best feature = Gender
Next best feature = African American
Next best feature = Asian
Next best feature = Education
Next best feature = Married
Next best feature = Caucasian
Next best feature = Age
[0, 0.74584841805850366, 0.74594279610140901, 0.87623894562628635, 0.95218750444038314, 0.95416059706123491, 0.95429516950679683, 0.95443993317642362, 0.9544769728962863, 0.95454588366259818, 0.9545969672853537, 0.9545969672853537, 0.95510156336517582]

The first selected features are Rating, Limit, Income and Student.
With just these 4 features, the R^2 score is already about 0.95.


In [23]:
import matplotlib.pyplot as plt

In [24]:
plt.title("Score versus features")
plt.xlabel('Number of features') 
plt.ylabel('R squared')
plt.plot(scores)


Out[24]:
[<matplotlib.lines.Line2D at 0x11b905550>]

The first 4 features are the best predictors; the remaining ones do not add much value.

As we add more and more parameters to our model its complexity increases, which results in increasing variance and decreasing bias, i.e. overfitting.

Basically there are two ways to overcome overfitting:

  • Reduce the model complexity (fewer features: use a selection algorithm such as the one above and keep only a subset of the features).
  • Regularisation (what we are going to see next).

Regularisation

The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularises the coefficient estimates.


In [25]:
# re-fit the least squares model on all predictors so that the coefficients
# line up with the column names of X
model.fit(X, y)
predictors = X.columns
coef = pd.Series(model.coef_, predictors).sort_values()
coef


Out[25]:
Gender              -10.653248
African American     -8.970401
Married              -8.533901
Income               -7.803102
Education            -1.098855
Age                  -0.613909
Limit                 0.190907
Rating                1.136527
Caucasian             1.136624
Asian                 7.833778
Cards                17.724484
Student             425.747360
dtype: float64

In [26]:
coef.plot(kind='bar', title="Model coefficients")


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b917518>

We can see that the coefficient of Student is much higher compared to the rest of the coefficients, so the credit balance of a person appears to be driven mostly by this feature (keep in mind, though, that raw coefficients also depend on the scale of each predictor).

Ridge Regression

The ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimise

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2}$$

As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.

  • However, the second term, $\lambda \sum_{j} \beta_j^{2}$, called a shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero.
  • The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates.
  • Selecting a good value for $\lambda$ is critical; cross-validation is used for this.
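
To make the objective concrete, here is a minimal sketch of the ridge solution in closed form (an illustration only, assuming centred / standardised inputs and no penalty on the intercept); scikit-learn's Ridge solves the same problem with more robust numerical methods.

In [ ]:
import numpy as np

def ridge_coefficients(X_mat, y_vec, lam):
    # closed-form ridge solution: beta = (X^T X + lambda * I)^(-1) X^T y
    n_features = X_mat.shape[1]
    A = X_mat.T @ X_mat + lam * np.eye(n_features)
    return np.linalg.solve(A, X_mat.T @ y_vec)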

Scikit-learn provides the Ridge class to fit a ridge regression model.
Let's see an example using lambda = 0.05 (the parameter is called alpha in scikit-learn).


In [27]:
from sklearn.linear_model import Ridge

## training the model
ridgeReg = Ridge(alpha=0.05)

ridgeReg.fit(X, y)


Out[27]:
Ridge(alpha=0.05, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [28]:
ridgeReg.score(X, y)


Out[28]:
0.95510140982209057

In [29]:
ridgeReg.coef_


Out[29]:
array([ -7.80274243e+00,   1.90788433e-01,   1.13823650e+00,
         1.77125942e+01,  -6.14351757e-01,  -1.09412802e+00,
        -1.06276339e+01,   4.25138458e+02,  -8.56581700e+00,
        -8.97095448e+00,   7.84581065e+00,   1.12514383e+00])

With such a small alpha (and unstandardised predictors) the coefficients are only slightly shrunk compared with least squares, and Student still dominates. To see a real shrinkage effect we need larger values of alpha, and we should first standardise the predictors.

Standardise

The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$.
In other words, regardless of how the jth predictor is scaled, $X_j \hat{\beta}_j$ will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardising the predictors.
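
Concretely, StandardScaler rescales each predictor by subtracting its mean and dividing by its standard deviation:

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of the jth predictor.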


In [30]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

How lambda affects the coefficients

Let's plot the coefficients values based on the lambda parameter.


In [31]:
import numpy as np

lambdas = [0.01, 1, 100, 10000]
coefficients = np.zeros(shape=(12, len(lambdas)))
i=0
for l in lambdas:
    ridgeReg = Ridge(alpha=l)
    ridgeReg.fit(X, y)
    coefficients[:,i] = ridgeReg.coef_
    i += 1

In [32]:
fig, ax = plt.subplots()

plt.title("Ridge regularisation")
plt.xlabel("lambda")
plt.ylabel("standardised coefficients")
styles=['-','--','-.',':']

labels = ['0.01','','1','', '100','', '1e4','']
ax.set_xticklabels(labels) # custom x labels

for i in range(0,12):
    s = styles[i % len(styles)]
    if i<3 or i==7:
        plt.plot(coefficients[i], label=predictors[i], linestyle=s)
    else:
        plt.plot(coefficients[i])

plt.legend(loc='best')


Out[32]:
<matplotlib.legend.Legend at 0x11bda2e10>

Each curve corresponds to the ridge regression coefficient estimate for one of the variables, plotted as a function of λ.
At the left side of the plot, λ is essentially zero, and so the corresponding ridge coefficient estimates are the same as the usual least squares estimates. But as λ increases, the ridge coefficient estimates shrink towards zero. When λ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors.
In this plot, the income, limit, rating, and student variables are displayed in distinct colours, since these are the variables with the largest coefficient estimates.

Ridge regression does have one obvious disadvantage.
Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
The penalty λ will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero.
This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.
For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all of the predictors.
Increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in the exclusion of any of the variables. The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage.

The lasso
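
The lasso coefficient estimates $\hat{\beta}^L$ minimise a similar objective, with the squared (L2) ridge penalty replaced by an absolute-value (L1) penalty:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$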


In [33]:
from sklearn.linear_model import Lasso

lassoReg = Lasso(alpha=0.3)

lassoReg.fit(X, y)


Out[33]:
Lasso(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [34]:
lassoReg.score(X, y)


Out[34]:
0.95509337268728689

In statistical parlance, the lasso uses an L1 penalty instead of an L2 penalty. The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.
As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression. The lasso yields sparse models — that is, models that involve only a subset of the variables.
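
To see this sparsity in practice, we can refit with a larger value of alpha (alpha = 10 here is an arbitrary, purely illustrative choice) and list the predictors whose coefficients are set exactly to zero:

In [ ]:
# illustrative only: a large enough alpha should zero out some coefficients
# on the standardised predictors
lasso_big = Lasso(alpha=10)
lasso_big.fit(X, y)
coef_big = pd.Series(lasso_big.coef_, index=predictors)
coef_big[coef_big == 0].index.tolist()  # features dropped by the lasso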


In [35]:
lambdas = [0.01, 1, 10, 50, 100, 200, 500, 1000]
coefficients = np.zeros(shape=(12, len(lambdas)))
i=0
for l in lambdas:
    lassoReg = Lasso(alpha=l)
    lassoReg.fit(X, y)
    coefficients[:,i] = lassoReg.coef_
    i += 1

In [36]:
fig, ax = plt.subplots()

plt.title("Lasso Regularisation")
plt.xlabel("lambda")
plt.ylabel("standardised coefficients")
styles=['-','--','-.',':']

labels = ['0.01','1','10','50', '100', '200','500','1000']

ax.set_xticklabels(labels) # custom x labels

for i in range(0,12):
    s = styles[i % len(styles)]
    if i<3 or i==7:
        plt.plot(coefficients[i], label=predictors[i], linestyle=s)
    else:
        plt.plot(coefficients[i])

plt.legend(loc='best')


Out[36]:
<matplotlib.legend.Legend at 0x11bf29eb8>

When λ = 0, the lasso simply gives the least squares fit and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.
However, in between these two extremes, the ridge regression and lasso models are quite different from each other, as you can see from the two plots.

Both in ridge regression and the lasso, the tuning parameter serves to control the relative impact of the shrinkage penalty on the regression coefficient estimates.
Selecting a good value for λ is critical; cross-validation is used for this.

Selecting the Tuning Parameter lambda

Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values and compute the cross-validation error for each value of λ. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.


In [38]:
from sklearn.model_selection import cross_val_score, KFold

lambdas = np.linspace(500, 0.01, num=50)
scoresCV = []
for l in lambdas:
    lassoReg = Lasso(alpha=l)
    # cross_val_score re-fits the estimator on each training fold itself,
    # so no separate fit is needed here
    scoreCV = cross_val_score(lassoReg, X, y, scoring='neg_mean_squared_error',
                              cv=KFold(n_splits=10, shuffle=True,
                                       random_state=1))
    scoresCV.append(np.mean(scoreCV))

In [39]:
plt.title("Lambda tuning for Lasso Regularisation")
plt.xlabel("lambda")
plt.ylabel("Cross-Validation error (MSE)")

plt.plot(lambdas,scoresCV)


Out[39]:
[<matplotlib.lines.Line2D at 0x11c0b86a0>]

The plot shows how the cross-validation score changes with lambda (note that scoring='neg_mean_squared_error' returns the negative MSE, so higher values mean a smaller error). The error is smallest for very small values of lambda, close to zero.
Thanks to the scikit-learn class LassoCV, which applies cross-validation to the lasso, we can compute the exact value of lambda that minimises the error:


In [40]:
from sklearn.linear_model import LassoCV

lassoCV = LassoCV()
lassoCV.fit(X,y)


Out[40]:
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)

In [41]:
lassoCV.alpha_


Out[41]:
0.39656269957368379
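
The fitted LassoCV object has already been re-fit on the whole data set using this selected alpha, so we can inspect the resulting coefficients and R^2 score directly (a quick check, reusing the predictors index from above):

In [ ]:
pd.Series(lassoCV.coef_, index=predictors).sort_values()

In [ ]:
lassoCV.score(X, y)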

In [ ]: