This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Applying Lasso and Ridge Regression

A common problem in machine learning is that an algorithm might work really well on the training set, but when applied to unseen data it makes a lot of mistakes. You can see how this is problematic, since often we are most interested in how a model generalizes to new data. Some algorithms (such as decision trees) are more susceptible to this phenomenon than others, but even linear regression can be affected.

This phenomenon is also known as overfitting, and we will talk about it extensively in Chapter 5, Using Decision Trees to Make a Medical Diagnosis, and Chapter 11, Selecting the Right Model with Hyperparameter Tuning.

A common technique for reducing overfitting is called regularization, which involves adding an extra penalty term to the cost function that is independent of all feature values and depends only on the weights.

The two most commonly used regularizers are as follows:

  • L1 regularization: This adds a term to the scoring function that is proportional to the sum of all absolute weight values. In other words, it is based on the L1 norm of the weight vector (also known as the rectilinear distance, snake distance, or Manhattan distance). The resulting algorithm is also known as Lasso regression.
  • L2 regularization: This adds a term to the scoring function that is proportional to the sum of all squared weight values. In other words, it is based on the L2 norm of the weight vector (also known as the Euclidean distance). Since the L2 norm involves a squaring operation, it punishes strong outliers in the weight vector much harder than the L1 norm. The resulting algorithm is also known as ridge regression. (A short NumPy sketch of both penalty terms follows this list.)
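
To make the difference between the two penalties concrete, here is a small NumPy sketch; the weight vector is made up for illustration and is not taken from the book. Note how the single large weight dominates the L2 penalty:

import numpy as np

# Hypothetical weight vector, for illustration only
w = np.array([0.5, -2.0, 0.1, 3.0])

l1_penalty = np.sum(np.abs(w))  # L1 norm: 0.5 + 2.0 + 0.1 + 3.0 = 5.6 (Lasso)
l2_penalty = np.sum(w ** 2)     # squared L2 norm: 0.25 + 4.0 + 0.01 + 9.0 = 13.26 (ridge)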

The procedure is exactly the same as the preceding one, but we replace the initialization command to load either a Lasso or a Ridge object. Specifically, we have to replace the following command:

In [6]: linreg = linear_model.LinearRegression()

For the Lasso regression algorithm, we would change the preceding line of code to the following:

In [6]: lassoreg = linear_model.Lasso()

For the ridge regression algorithm, we would change the preceding line of code to the following:

In [6]: ridgereg = linear_model.Ridge()
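
Both Lasso and Ridge also accept a regularization strength parameter, alpha, which defaults to 1.0. The value 0.5 below is only an example setting, not a recommendation from the book:

In [6]: lassoreg = linear_model.Lasso(alpha=0.5)

In [6]: ridgereg = linear_model.Ridge(alpha=0.5)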

I encourage you to test these two algorithms on the Boston dataset in place of conventional linear regression. You can use the code below.

How does the generalization error (In [12]) change? How does the prediction plot (In [14]) change? Do you see any improvements in performance?
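
If you prefer to compare all three estimators in one go rather than swapping the line by hand, a minimal sketch along the following lines should work. It reuses the same dataset and train/test split as the cells below, and the exact scores you see will depend on your scikit-learn version:

from sklearn import datasets, linear_model, metrics, model_selection

# Same data and split as in the notebook cells below
boston = datasets.load_boston()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    boston.data, boston.target, test_size=0.1, random_state=42
)

# Fit each model on the identical split and report the test-set MSE
for name, reg in [('linear', linear_model.LinearRegression()),
                  ('lasso', linear_model.Lasso()),
                  ('ridge', linear_model.Ridge())]:
    reg.fit(X_train, y_train)
    print(name, metrics.mean_squared_error(y_test, reg.predict(X_test)))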

Loading the dataset


In [1]:
import numpy as np
import cv2

from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import linear_model

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16})

The Boston dataset is included in Scikit-Learn's example datasets


In [2]:
boston = datasets.load_boston()

Inspect the boston object:

  • DESCR: Get a description of the data
  • data: The actual data, <num_samples x num_features>
  • feature_names: The names of the features
  • target: The target values (the median house prices), <num_samples x 1>
  • target_names: The names of the class labels (not present for this regression dataset, as the dir output below confirms)

In [3]:
dir(boston)


Out[3]:
['DESCR', 'data', 'feature_names', 'target']

In [4]:
boston.data.shape


Out[4]:
(506, 13)

In [5]:
boston.target.shape


Out[5]:
(506,)

Training the model

This is the line you want to replace with either Lasso or Ridge.


In [6]:
ridgereg = linear_model.Ridge()

In [7]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    boston.data, boston.target, test_size=0.1, random_state=42
)

In [8]:
ridgereg.fit(X_train, y_train)


Out[8]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [9]:
metrics.mean_squared_error(y_train, ridgereg.predict(X_train))


Out[9]:
22.926424500454317

In [10]:
ridgereg.score(X_train, y_train)


Out[10]:
0.73533535350875179

Testing the model


In [11]:
y_pred = ridgereg.predict(X_test)

In [12]:
metrics.mean_squared_error(y_test, y_pred)


Out[12]:
14.785888941249883

In [13]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, linewidth=3, label='ground truth')
plt.plot(y_pred, linewidth=3, label='predicted')
plt.legend(loc='best')
plt.xlabel('test data points')
plt.ylabel('target value')


Out[13]:
<matplotlib.text.Text at 0x7f1bf042a080>

In [14]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, y_pred, 'o')
plt.plot([-10, 60], [-10, 60], 'k--')
plt.axis([-10, 60, -10, 60])
plt.xlabel('ground truth')
plt.ylabel('predicted')

scorestr = r'R$^2$ = %.3f' % ridgereg.score(X_test, y_test)
errstr = 'MSE = %.3f' % metrics.mean_squared_error(y_test, y_pred)
plt.text(-5, 50, scorestr, fontsize=12)
plt.text(-5, 45, errstr, fontsize=12)


Out[14]:
<matplotlib.text.Text at 0x7f1becafb0f0>