In [ ]:
%pylab inline
import numpy as np
import pylab as pl

In depth with linear models

Linear models are useful when little data is available. In addition, they form a good case study of regularization.

Linear models for regression

The most standard linear model is the 'ordinary least squares regression', often simply called 'linear regression'. When the number of features is large compared to the number of samples, it becomes ill-posed and overfits.

Let us generate a simple simulation to see the behavior of these models.


In [ ]:
# Simulate a regression problem: 1000 samples, 50 informative features, plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 50))
beta = rng.normal(size=50)
y = np.dot(X, beta) + 4 * rng.normal(size=1000)

# Note: in recent scikit-learn versions, the cross_validation module has been
# replaced by model_selection
from sklearn import linear_model, cross_validation

def plot_learning_curve(estimator):
    """Plot the test-set explained variance as a function of the training set size."""
    scores = list()
    train_sizes = np.linspace(50, 250, 20).astype(int)
    for train_size in train_sizes:
        test_scores = cross_validation.cross_val_score(estimator, X, y,
                        cv=cross_validation.ShuffleSplit(train_size=train_size,
                                                         test_size=500, n=1000,
                                                         random_state=0))
        scores.append(test_scores)
    pl.plot(train_sizes, np.mean(scores, axis=1),
            label=estimator.__class__.__name__)
    pl.ylim(0, 1)
    pl.ylabel('Explained variance on test set')
    pl.xlabel('Training set size')
    pl.legend(loc='best')


plot_learning_curve(linear_model.LinearRegression())

As we can see, the ordinary linear regression is not defined if there are fewer training samples than features. In the presence of noise, it does poorly as long as the number of samples is not several times the number of features.

The LinearRegression is then overfitting: it is fitting the noise. We need to regularize.

The Ridge estimator is a simple regularization of the ordinary LinearRegression. In particular, it has the benefit of being no more computationally expensive than the ordinary least squares estimate.


In [ ]:
plot_learning_curve(linear_model.LinearRegression())
plot_learning_curve(linear_model.Ridge())

We can see that in the low-sample limit, the Ridge estimator performs much better than the unregularized model.

The regularization of the Ridge is a shrinkage: the coefficients learned are biased towards zero. Too much bias is not beneficial, but with very few samples we need more bias.
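
To illustrate this shrinkage (a quick sketch with arbitrary alpha values, using the simulated data above), we can fit the Ridge with increasing amounts of regularization and watch the average coefficient magnitude decrease:


In [ ]:
# Sketch: the larger alpha, the more the Ridge coefficients are shrunk towards zero
# (alpha values are arbitrary; we fit on a small subset to mimic the low-sample setting)
for alpha in (0.01, 1., 100., 10000.):
    ridge = linear_model.Ridge(alpha=alpha).fit(X[:100], y[:100])
    print('alpha=%8g    mean |coef|: %.3f' % (alpha, np.abs(ridge.coef_).mean()))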

The amount of regularization is set via the alpha parameter of the Ridge. Tuning it is critical for performance. We can set it automatically by cross-validation using the RidgeCV estimator:


In [ ]:
plot_learning_curve(linear_model.LinearRegression())
plot_learning_curve(linear_model.Ridge())
plot_learning_curve(linear_model.RidgeCV())

The Lasso estimator is useful to impose sparsity on the coefficients. In other words, it is to be preferred if we believe that many of the features are not relevant.

Let us create such a situation with a new simulation where only 10 out of the 50 features are relevant:


In [ ]:
# Keep only the first 10 coefficients: the other 40 features become irrelevant
beta[10:] = 0
y = np.dot(X, beta) + 4*rng.normal(size=1000)
plot_learning_curve(linear_model.RidgeCV())
plot_learning_curve(linear_model.Lasso())

We can see that the Lasso estimator performs, in our case, better than the Ridge when there is a small number of training observations. However, when there are a lot of observations, the Lasso under-performs. Indeed, the variance-reducing effect of the regularization is less critical in this situation, and the bias it introduces becomes too detrimental.

As with any estimator, we should tune the regularization parameter to get the best predictions. For this purpose, we can use the LassoCV object. Note that fitting it is significantly more computationally costly than the RidgeCV. To speed it up, we reduce the number of values explored for the alpha parameter.


In [ ]:
plot_learning_curve(linear_model.RidgeCV())
plot_learning_curve(linear_model.LassoCV(n_alphas=20))

ElasticNet sits in between Lasso and Ridge. It has a tuning parameter, l1_ratio, that controls this behavior: when set to 0, ElasticNet is a Ridge; when set to 1, it is a Lasso. It is useful when the coefficients are not that sparse: the sparser the coefficients, the higher we should set l1_ratio. Note that l1_ratio can also be set by cross-validation (see the sketch after the next cell), although we won't do it in the comparison below to limit computational cost.


In [ ]:
plot_learning_curve(linear_model.RidgeCV())
plot_learning_curve(linear_model.ElasticNetCV(l1_ratio=.8, n_alphas=20))
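
For reference, here is a minimal sketch of letting ElasticNetCV also choose l1_ratio by cross-validation; the candidate values below are arbitrary, and the extra grid search is why we leave it out of the comparison above.


In [ ]:
# Sketch: ElasticNetCV can search over several l1_ratio values as well as alpha.
# The candidate values are arbitrary; this is more costly than fixing l1_ratio.
enet_cv = linear_model.ElasticNetCV(l1_ratio=[.5, .8, .95], n_alphas=20)
enet_cv.fit(X[:200], y[:200])
print('best l1_ratio: %s, best alpha: %s' % (enet_cv.l1_ratio_, enet_cv.alpha_))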

Practical tips:

  • Different algorithms for the same problem: The Lasso model can be solved with different algorithms: the Lasso estimator implements a coordinate-descent approach, while the LassoLars estimator implements the LARS algorithm. Depending on the situation, one may be faster than the other (see the sketch after this list).

  • Normalization: as with all estimators, it can be useful to normalize your data. The linear models have a 'normalize' option that normalizes the data, runs the model, and rescales the coefficients so that they are suitable for non-normalized data. Note that, at the time of writing, Lasso and LassoLars have different default settings for normalization.
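
As a quick sketch of these two points (the alpha value is arbitrary), we can solve the same Lasso problem with both estimators; any difference in the result can come from the solver and from the normalization defaults mentioned above.


In [ ]:
# Sketch: the same Lasso problem solved by two algorithms (alpha is arbitrary).
# lasso_cd uses coordinate descent, lasso_lars uses LARS; note that their
# default normalization settings may differ, which can change the result.
lasso_cd = linear_model.Lasso(alpha=.5).fit(X, y)
lasso_lars = linear_model.LassoLars(alpha=.5).fit(X, y)
print('non-zero coefs -- Lasso: %d, LassoLars: %d'
      % (np.sum(lasso_cd.coef_ != 0), np.sum(lasso_lars.coef_ != 0)))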

Exercise: Find the best linear regression prediction on the diabetes dataset, that is available in the scikit-learn datasets.


In [ ]:
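
One possible starting point (not a full solution; the models compared and the default cross-validation are arbitrary choices):


In [ ]:
# A possible starting point for the exercise: compare cross-validated scores
# of a few linear models on the diabetes data (the choices here are arbitrary).
from sklearn import datasets
diabetes = datasets.load_diabetes()
for model in (linear_model.LinearRegression(), linear_model.RidgeCV(),
              linear_model.LassoCV()):
    scores = cross_validation.cross_val_score(model, diabetes.data, diabetes.target)
    print('%s: %.3f' % (model.__class__.__name__, scores.mean()))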

Linear models for classification

In general, a 2-class classification task can be seen as a regression problem with an output variable y taking values in {0, 1}. Let us simulate a simple example and apply linear regression to it.


In [ ]:
# Generate data
np.random.seed(0)
X = np.random.normal(size=100)
y = (X > 0).astype(float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=100)
X = X[:, np.newaxis]
pl.plot(X, y, 'o')

# Fit a linear regression to it
ols = linear_model.LinearRegression()
ols.fit(X, y)
X_test = np.linspace(-4, 10, 100)[:, np.newaxis]
pl.plot(X_test, ols.predict(X_test))

We can see that a linear fit is clearly not a good model for a variable taking values in {0, 1}.

A better approach is to fit a sigmoid: p(y=1 | x) = 1 / (1 + exp(-(w·x + b))). This model is called logistic regression; the sigmoid can be seen as coming from a probabilistic model of 'y'.

In scikit-learn, some estimators have a 'predict_proba' method that gives the probability of each class under the learned model. Let us use the predict_proba method of the LogisticRegression object to plot the fit:


In [ ]:
# First plot the data and the ordinary least squares regression
pl.plot(X, y, 'o')
pl.plot(X_test, ols.predict(X_test))

# Plot a logistic regression
log_reg = linear_model.LogisticRegression(C=1e5)
log_reg.fit(X, y)
pl.plot(X_test, log_reg.predict_proba(X_test)[:, 1])

Regularization: the logistic regression suffers from overfitting in the presence of many features and must be regularized. The 'C' parameter controls that regularization: large C values give lightly regularized models, while small C values give strongly regularized models.

Similar to the Ridge/Lasso separation, you can set the 'penalty' parameter to 'l1' to enforce sparsity of the coefficients.
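
As a quick sketch (the data, parameter values, and names X_clf and y_clf below are made up for illustration; recent scikit-learn versions may require a solver that supports the l1 penalty, such as 'liblinear'):


In [ ]:
# Sketch: l2- vs l1-penalized logistic regression on a synthetic problem where
# only the first 5 of 30 features are relevant (all settings are arbitrary).
X_clf = rng.normal(size=(200, 30))
y_clf = (np.dot(X_clf[:, :5], np.ones(5)) > 0).astype(int)
log_l2 = linear_model.LogisticRegression(C=.1, penalty='l2').fit(X_clf, y_clf)
log_l1 = linear_model.LogisticRegression(C=.1, penalty='l1').fit(X_clf, y_clf)
print('non-zero coefficients -- l2: %d, l1: %d'
      % (np.sum(log_l2.coef_ != 0), np.sum(log_l1.coef_ != 0)))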

Multi-class: multi-class classification with the LogisticRegression object is tackled using a one-vs-all (one-vs-rest) approach.

Exercise: Use LogisticRegression to classify digits.


In [ ]:
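
One possible starting point (not a full solution; the C value is arbitrary and worth tuning):


In [ ]:
# A possible starting point for the exercise (C is arbitrary and worth tuning)
from sklearn import datasets
digits = datasets.load_digits()
clf = linear_model.LogisticRegression(C=.01)
scores = cross_validation.cross_val_score(clf, digits.data, digits.target)
print('mean accuracy: %.3f' % scores.mean())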