by Alejandro Correa Bahnsen and Jesus Solano
version 1.5, January 2019
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Rick Muller, Sandia National Laboratories (https://github.com/justmarkham)
What is overfitting?
How does overfitting occur?
What is the impact of overfitting?
Linear models can overfit if you include "irrelevant features", meaning features that are unrelated to the response. Why?
Because the model will learn a coefficient for every feature you include, regardless of whether that feature carries signal or noise.
This is especially a problem when p (the number of features) is close to n (the number of observations), because such a model will naturally have high variance.
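To see this concretely, here is a minimal sketch (synthetic data, not part of the original notebook) in which appending pure-noise columns lets ordinary least squares fit the training set better while generalizing worse:

# sketch (synthetic data): adding pure-noise features tends to lower training RMSE
# while raising test RMSE
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
rng = np.random.RandomState(1)
n = 100
X_signal = rng.randn(n, 2)                              # two informative features
y = X_signal @ np.array([2.0, -1.0]) + rng.randn(n) * 0.5
for n_noise in [0, 50]:
    X = np.hstack([X_signal, rng.randn(n, n_noise)])    # append n_noise irrelevant columns
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_tr, y_tr)
    print(n_noise,
          np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))),
          np.sqrt(mean_squared_error(y_te, model.predict(X_te))))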
Linear models can overfit if the included features are highly correlated with one another. Why?
From the scikit-learn documentation:
"...coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance."
Our goal is to find the optimal model complexity, so regularization is useful whenever we believe our model is too complex.
For a regularized linear regression model, we minimize the sum of the RSS (residual sum of squares) and a "penalty term" that penalizes coefficient size.
Ridge regression (or "L2 regularization") minimizes: $$\text{RSS} + \alpha \sum_{j=1}^p \beta_j^2$$
Lasso regression (or "L1 regularization") minimizes: $$\text{RSS} + \alpha \sum_{j=1}^p |\beta_j|$$
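To make the two objectives concrete, here is a minimal sketch (not part of the original notebook) that evaluates them for a given coefficient vector:

# sketch: evaluate the ridge and lasso objectives for a given coefficient vector beta
import numpy as np

def rss(X, y, beta):
    # residual sum of squares for a linear model with coefficients beta
    return np.sum((y - X @ beta) ** 2)

def ridge_objective(X, y, beta, alpha):
    return rss(X, y, beta) + alpha * np.sum(beta ** 2)      # L2 penalty

def lasso_objective(X, y, beta, alpha):
    return rss(X, y, beta) + alpha * np.sum(np.abs(beta))   # L1 penalty

(For reference, scikit-learn's Lasso scales the RSS term by 1/(2n), so its alpha is not numerically identical to the alpha in the formula above; Ridge uses the unscaled RSS.)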
A larger alpha (towards the left of each diagram) results in more regularization:
Source code for the diagrams: Lasso regression and Ridge regression
Should features be standardized?
How should you choose between Lasso regression and Ridge regression?
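On the first of these questions: both penalties act on coefficient magnitudes, so features on very different scales are penalized unevenly, and scaling the features first is generally recommended. The cells below rely on the normalize=True argument, which newer scikit-learn versions no longer accept; a common alternative (a sketch, assuming a current scikit-learn) is to standardize inside a Pipeline:

# sketch: standardize features before a penalized model using a Pipeline
# (an alternative to the normalize=True argument used in the cells below)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
# usage (once X_train/y_train are defined): ridge_pipe.fit(X_train, y_train)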
In this diagram (Figure 6.7 of An Introduction to Statistical Learning):
In this case, $\hat\beta$ is not within the blue constraint region. Thus, we need to move $\hat\beta$ until it intersects the blue region, while increasing the RSS as little as possible.
From page 222 of An Introduction to Statistical Learning:
The ellipses that are centered around $\hat\beta$ represent regions of constant RSS. In other words, all of the points on a given ellipse share a common value of the RSS. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. Equations (6.8) and (6.9) indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region.
Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In Figure 6.7, the intersection occurs at $\beta_1 = 0$, and so the resulting model will only include $\beta_2$.
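A quick empirical companion to this geometric picture (a synthetic sketch, not part of the original notebook): at comparable penalty strengths, Lasso sets many coefficients exactly to zero while Ridge only shrinks them.

# sketch (synthetic data): Lasso zeroes out many coefficients, Ridge merely shrinks them
import numpy as np
from sklearn.linear_model import Lasso, Ridge
rng = np.random.RandomState(1)
X = rng.randn(200, 20)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(200)   # only the first two features matter
lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
ridge_coef = Ridge(alpha=10).fit(X, y).coef_
print('coefficients exactly zero (Lasso):', np.sum(lasso_coef == 0))
print('coefficients exactly zero (Ridge):', np.sum(ridge_coef == 0))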
In [1]:
# read in the dataset
import pandas as pd
url = 'https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/communities.data'
crime = pd.read_csv(url, header=None, na_values=['?'])
crime.head()
Out[1]:
In [2]:
# examine the response variable
crime[127].describe()
Out[2]:
In [3]:
# remove categorical features
crime.drop([0, 1, 2, 3, 4], axis=1, inplace=True)
In [4]:
# remove rows with any missing values
crime.dropna(inplace=True)
In [5]:
# check the shape
crime.shape
Out[5]:
In [6]:
# define X and y
X = crime.drop(127, axis=1)
y = crime[127]
In [7]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [8]:
# build a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
Out[8]:
In [9]:
# examine the coefficients
print(linreg.coef_)
In [10]:
# make predictions
y_pred = linreg.predict(X_test)
In [11]:
# calculate RMSE
from sklearn import metrics
import numpy as np
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [12]:
# alpha=0 is equivalent to linear regression (scikit-learn may warn and suggest using LinearRegression instead)
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=0, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [13]:
# try alpha=0.1
ridgereg = Ridge(alpha=0.1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [14]:
# examine the coefficients
print(ridgereg.coef_)
In [15]:
# create an array of alpha values
alpha_range = 10.**np.arange(-2, 3)
alpha_range
Out[15]:
In [16]:
# select the best alpha with RidgeCV
from sklearn.linear_model import RidgeCV
ridgeregcv = RidgeCV(alphas=alpha_range, normalize=True, scoring='neg_mean_squared_error')
ridgeregcv.fit(X_train, y_train)
ridgeregcv.alpha_
Out[16]:
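Roughly speaking, RidgeCV scores each candidate alpha by cross-validation and keeps the best one; with its default cv=None it uses an efficient form of leave-one-out cross-validation. The loop below (a sketch, using 5-fold CV instead, so the numbers will not match exactly) makes the idea explicit:

# sketch: score each alpha by 5-fold cross-validation, as RidgeCV does conceptually
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
for alpha in alpha_range:
    scores = cross_val_score(Ridge(alpha=alpha, normalize=True), X_train, y_train,
                             cv=5, scoring='neg_mean_squared_error')
    print(alpha, -scores.mean())   # lower mean squared error is better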
In [17]:
# predict method uses the best alpha value
y_pred = ridgeregcv.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [18]:
# try alpha=0.001 and examine coefficients
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.001, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
In [19]:
# try alpha=0.01 and examine coefficients
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
In [20]:
# calculate RMSE (for alpha=0.01)
y_pred = lassoreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [42]:
# select the best alpha with LassoCV
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LassoCV
lassoregcv = LassoCV(n_alphas=100, normalize=True, random_state=1, cv=5)
lassoregcv.fit(X_train, y_train)
lassoregcv.alpha_
Out[42]:
In [43]:
# examine the coefficients
print(lassoregcv.coef_)
In [44]:
# predict method uses the best alpha value
y_pred = lassoregcv.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [45]:
# read in the dataset
url = 'https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/wine.data'
wine = pd.read_csv(url, header=None)
wine.head()
Out[45]:
In [25]:
# examine the response variable
wine[0].value_counts()
Out[25]:
In [26]:
# define X and y
X = wine.drop(0, axis=1)
y = wine[0]
In [27]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [40]:
# build an (essentially) unregularized logistic regression model
# (C is the inverse of regularization strength, so a very large C makes the penalty negligible)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9, solver='liblinear', multi_class='auto')
logreg.fit(X_train, y_train)
Out[40]:
In [29]:
# examine the coefficients
print(logreg.coef_)
In [30]:
# generate predicted probabilities
y_pred_prob = logreg.predict_proba(X_test)
print(y_pred_prob)
In [31]:
# calculate log loss
print(metrics.log_loss(y_test, y_pred_prob))
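As a sanity check on what this metric computes (a minimal sketch; the columns of y_pred_prob follow logreg.classes_), multiclass log loss is the average negative log-probability that the model assigned to each true class:

# sketch: compute multiclass log loss by hand and compare with metrics.log_loss above
import numpy as np
class_index = {c: i for i, c in enumerate(logreg.classes_)}
true_idx = np.array([class_index[c] for c in y_test])
p_true = y_pred_prob[np.arange(len(y_test)), true_idx]   # probability assigned to the true class
print(-np.mean(np.log(p_true)))                          # should match the value above (up to probability clipping)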
In [32]:
# standardize X_train and X_test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = X_train.astype(float)
X_test = X_test.astype(float)
scaler.fit(X_train)  # fit the scaler on the training set only, to avoid leaking test-set information
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [37]:
# try C=0.1 with L1 penalty
logreg = LogisticRegression(C=0.1, penalty='l1', solver='liblinear', multi_class='auto')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)
In [38]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))
In [39]:
# try C=0.1 with L2 penalty
logreg = LogisticRegression(C=0.1, penalty='l2', solver='liblinear', multi_class='auto')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)
In [36]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))
Advantages of regularized linear models:
Disadvantages of regularized linear models: