Exercise: Cross Validation and Model Selection


In [ ]:
%matplotlib inline
import numpy as np

This exercise covers cross-validation of regression models on the Diabetes dataset. The diabetes data consists of 10 physiological variables (age, sex, body mass index, blood pressure, and six blood serum measurements) measured on 442 patients, along with a quantitative measure of disease progression one year after baseline:


In [ ]:
from sklearn.datasets import load_diabetes
data = load_diabetes()
X, y = data.data, data.target

In [ ]:
print(X.shape)

In [ ]:
print(y.shape)

Here we'll be fitting two regularized linear models: Ridge Regression, which uses $\ell_2$ regularization, and Lasso Regression, which uses $\ell_1$ regularization.
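
The two models differ only in their penalty term. In the form used in the scikit-learn documentation, the objectives are (shown here for reference)

$$\min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2 \qquad \text{(Ridge)}$$

$$\min_w \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1 \qquad \text{(Lasso)}$$

where $\alpha \geq 0$ controls the strength of the regularization.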


In [ ]:
from sklearn.linear_model import Ridge, Lasso

We'll first fit each model with its default hyper-parameters to establish a baseline, using the cross-validation score to measure goodness-of-fit.


In [ ]:
from sklearn.model_selection import cross_val_score

for Model in [Ridge, Lasso]:
    model = Model()
    print(Model.__name__, cross_val_score(model, X, y).mean())

We see that for the default hyper-parameter values, Lasso outperforms Ridge. But is this the case for the optimal hyperparameters of each model?

Exercise: Basic Hyperparameter Optimization

Here, spend some time writing a function that computes the cross-validation score as a function of alpha, the regularization strength for Lasso and Ridge. We'll choose 30 values of alpha, logarithmically spaced between 0.001 and 0.1:


In [ ]:
alphas = np.logspace(-3, -1, 30)

# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator
# as a function of alpha.  Which is more difficult to tune?

Solution


In [ ]:
%load solutions/06B_basic_grid_search.py
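
For reference, here is a minimal sketch of one way to approach this (the loaded solution file may differ in details such as the number of folds or the plotting style); it assumes the `X`, `y`, and `alphas` defined earlier in the notebook:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Compute the mean 3-fold cross-validation score for each alpha
# and plot one curve per model.
for Model in [Ridge, Lasso]:
    scores = [cross_val_score(Model(alpha=alpha), X, y, cv=3).mean()
              for alpha in alphas]
    plt.plot(alphas, scores, label=Model.__name__)

plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('mean cross-validation score')
plt.legend(loc='best')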

Because searching a grid of hyper-parameters is such a common task, scikit-learn provides meta-estimators to automate it. We'll explore these in more depth later in the tutorial, but for now it is worth seeing how GridSearchCV works:


In [ ]:
from sklearn.model_selection import GridSearchCV

GridSearchCV is constructed with an estimator, as well as a dictionary of parameter values to be searched. We can find the optimal parameters this way:


In [ ]:
for Model in [Ridge, Lasso]:
    gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)
    print(Model.__name__, gscv.best_params_)
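
Beyond best_params_, the fitted GridSearchCV object also exposes the best mean cross-validation score and an estimator refit on the full dataset. A small illustrative check, using Ridge as an example:


In [ ]:
gscv = GridSearchCV(Ridge(), dict(alpha=alphas), cv=3).fit(X, y)
print(gscv.best_score_)      # mean cross-validated score for the best alpha
print(gscv.best_estimator_)  # a Ridge instance refit on the full data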

For some models within scikit-learn, cross-validation can be performed more efficiently on large datasets. In this case, a cross-validated version of the particular model is included. The cross-validated versions of Ridge and Lasso are RidgeCV and LassoCV, respectively. The grid search on these estimators can be performed as follows:


In [ ]:
from sklearn.linear_model import RidgeCV, LassoCV
for Model in [RidgeCV, LassoCV]:
    model = Model(alphas=alphas, cv=3).fit(X, y)
    print(Model.__name__, model.alpha_)

We see that the results match those returned by GridSearchCV.

Exercise: Learning Curves

Here we'll apply learning curves to the diabetes data. The questions to answer are:

  • Given the optimal models above, which is over-fitting and which is under-fitting the data?
  • To obtain better results, would you invest time and effort in gathering more training samples, or gathering more attributes for each sample? Recall the previous discussion of reading learning curves.

You can follow the process used in the previous notebook to plot the learning curves. A good metric to use is the mean_squared_error, which we'll import below:


In [ ]:
from sklearn.metrics import mean_squared_error
# define a function that computes the learning curve (i.e. mean_squared_error as a function
# of training set size, for both training and test sets) and plot the result

Solution


In [ ]:
%load solutions/06B_learning_curves.py
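
For reference, one possible sketch of such a learning-curve plot is below (the loaded solution may take a different approach); the train/test split, the number of steps, and the alpha values passed to the models are illustrative choices, not prescribed by the exercise:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curve(model, X, y, n_steps=20):
    """Plot training and test mean squared error vs. training-set size."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    train_sizes = np.linspace(20, len(X_train), n_steps).astype(int)
    train_err, test_err = [], []
    for n in train_sizes:
        model.fit(X_train[:n], y_train[:n])
        train_err.append(mean_squared_error(y_train[:n], model.predict(X_train[:n])))
        test_err.append(mean_squared_error(y_test, model.predict(X_test)))
    plt.plot(train_sizes, train_err, label='training error')
    plt.plot(train_sizes, test_err, label='test error')
    plt.xlabel('training set size')
    plt.ylabel('mean squared error')
    plt.title(model.__class__.__name__)
    plt.legend(loc='best')

# NOTE: alpha=0.05 is a placeholder; substitute the optimal values found above.
plot_learning_curve(Ridge(alpha=0.05), X, y)
plt.figure()
plot_learning_curve(Lasso(alpha=0.05), X, y)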