A common problem in machine learning is that an algorithm might work really well on the training set, but when applied to unseen data it makes a lot of mistakes. You can see how this is problematic, since often we are most interested in how a model generalizes to new data. Some algorithms (such as decision trees) are more susceptible to this phenomenon than others, but even linear regression can be affected.
This phenomenon is also known as overfitting, and we will talk about it extensively in Chapter 5, Using Decision Trees to Make a Medical Diagnosis, and Chapter 11, Selecting the Right Model with Hyperparameter Tuning.
A common technique for reducing overfitting is called regularization, which involves adding an extra penalty term to the cost function that depends only on the model weights, not on the feature values.
The two most commonly used regularizers are as follows:
- L1 regularization (used by lasso regression), which penalizes the sum of the absolute weight values
- L2 regularization (used by ridge regression), which penalizes the sum of the squared weight values
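As a quick sketch of what these penalty terms look like in code (the weight vector `w` and the strength `alpha` below are made-up illustration values, not taken from the chapter's model):

```python
import numpy as np

# Illustrative weight vector of a fitted linear model
w = np.array([0.5, -2.0, 0.0, 3.0])

# Regularization strength (scikit-learn's default for Ridge and Lasso is 1.0)
alpha = 0.1

# L1 penalty (lasso): alpha times the sum of absolute weights
l1_penalty = alpha * np.sum(np.abs(w))

# L2 penalty (ridge): alpha times the sum of squared weights
l2_penalty = alpha * np.sum(w ** 2)

print(l1_penalty)  # 0.55
print(l2_penalty)  # 1.325
```

Note how the L1 penalty grows linearly with each weight while the L2 penalty grows quadratically; this is why lasso tends to drive small weights all the way to zero, whereas ridge merely shrinks them.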
The procedure is exactly the same as the preceding one, but we replace the initialization command to load either a Lasso or a Ridge object. Specifically, we have to replace the following command:
In [6]: linreg = linear_model.LinearRegression()
For the Lasso regression algorithm, we would change the preceding line of code to the following:
In [6]: lassoreg = linear_model.Lasso()
For the ridge regression algorithm, we would change the preceding line of code to the following:
In [6]: ridgereg = linear_model.Ridge()
I encourage you to test these two algorithms on the Boston dataset in place of conventional linear regression. You can use the code below.
How does the generalization error (In [12]) change? How does the prediction plot (In [14]) change? Do you see any improvements in performance?
In [1]:
import numpy as np
import cv2
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import linear_model
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16})
The Boston dataset is included in Scikit-Learn's example datasets
In [2]:
boston = datasets.load_boston()
Inspect the boston object:
- DESCR: Get a description of the data
- data: The actual data, <num_samples x num_features>
- feature_names: The names of the features
- target: The class labels, <num_samples x 1>
- target_names: The names of the class labels
In [3]:
dir(boston)
Out[3]:
In [4]:
boston.data.shape
Out[4]:
(506, 13)
In [5]:
boston.target.shape
Out[5]:
(506,)
In [6]:
ridgereg = linear_model.Ridge()
In [7]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
boston.data, boston.target, test_size=0.1, random_state=42
)
In [8]:
ridgereg.fit(X_train, y_train)
Out[8]:
In [9]:
metrics.mean_squared_error(y_train, ridgereg.predict(X_train))
Out[9]:
In [10]:
ridgereg.score(X_train, y_train)
Out[10]:
In [11]:
y_pred = ridgereg.predict(X_test)
In [12]:
metrics.mean_squared_error(y_test, y_pred)
Out[12]:
In [13]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, linewidth=3, label='ground truth')
plt.plot(y_pred, linewidth=3, label='predicted')
plt.legend(loc='best')
plt.xlabel('test data points')
plt.ylabel('target value')
Out[13]:
In [14]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, y_pred, 'o')
plt.plot([-10, 60], [-10, 60], 'k--')
plt.axis([-10, 60, -10, 60])
plt.xlabel('ground truth')
plt.ylabel('predicted')
scorestr = r'R$^2$ = %.3f' % ridgereg.score(X_test, y_test)
errstr = 'MSE = %.3f' % metrics.mean_squared_error(y_test, y_pred)
plt.text(-5, 50, scorestr, fontsize=12)
plt.text(-5, 45, errstr, fontsize=12)
Out[14]:
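One possible way to carry out the suggested comparison is to fit all three regressors on the same train-test split and compare their test-set mean squared errors side by side. The sketch below is only an illustration: since `load_boston` has been removed from recent scikit-learn releases, it substitutes a synthetic regression problem from `make_regression`; if your scikit-learn version still provides the Boston dataset, swap in `boston.data` and `boston.target` instead.

```python
from sklearn import datasets, linear_model, metrics, model_selection

# Synthetic stand-in for the Boston dataset (13 features, as in the chapter)
X, y = datasets.make_regression(n_samples=500, n_features=13,
                                n_informative=5, noise=10.0,
                                random_state=42)

# Same split parameters as in the chapter's notebook
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Fit plain, ridge, and lasso regression and report the test MSE of each
for name, reg in [('linear', linear_model.LinearRegression()),
                  ('ridge', linear_model.Ridge()),
                  ('lasso', linear_model.Lasso())]:
    reg.fit(X_train, y_train)
    mse = metrics.mean_squared_error(y_test, reg.predict(X_test))
    print('%s: test MSE = %.3f' % (name, mse))
```

On this synthetic problem with few informative features, lasso's tendency to zero out irrelevant weights often gives it a small edge; on the real Boston data your mileage may vary, which is exactly what the exercise asks you to find out.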