Notes on Chapter 6: Best Practices for Model Evaluation and Hyperparameter Tuning

So far, Raschka has introduced us to a world of different models for classifying data and compressing features. But how do we know how well any given model is performing, and how can we figure out how to improve performance when it's bad?

In Chapter 6, we explore many different techniques for improving model performance. Broadly, our focus includes:

  • Estimating model performance
  • Diagnosing common problems
  • Fine-tuning models by adjusting hyperparameters
  • Getting familiar with different performance metrics

The specific techniques we'll cover are:

  1. Data-processing pipelines: chaining algorithms together
  2. Cross-validation: robust measures of performance
  3. Learning and validation curves: measuring bias and variance
  4. Grid search: tuning hyperparameters
  5. Nested cross-validation: selecting good algorithms
  6. Performance metrics: different ways of judging "good" and "bad" models

Data-processing pipelines

Many preprocessing techniques (like PCA) find parameters that must be reused across all training and testing datasets in order to produce sensible results. To help standardize this procedure, we can build pipelines that record our transformation steps and allow us to reuse them across training, testing, and validation sets (or even on new samples from the same population). Pipelines also encourage an object-oriented approach to model-building.

In scikit-learn, we can use the Pipeline class to construct a pipeline:


In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression())])

Note that a Pipeline exposes the standard fit(), predict(), and (where the final step supports it) fit_transform() methods, but in order for it to work correctly, every step before the final estimator must be a transformer, i.e., implement both fit() and transform() so the pipeline can chain them via fit_transform().
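As a quick sketch of how the pipeline gets used (assuming X_train, y_train, X_test, and y_test are arrays we've already prepared; the variable names are just placeholders):

# Fitting the pipeline runs StandardScaler.fit_transform on X_train and
# passes the scaled data to LogisticRegression.fit
pipeline.fit(X_train, y_train)

# At prediction time, the already-fitted scaler transforms X_test before
# the classifier predicts
y_pred = pipeline.predict(X_test)
print('Test accuracy: %.3f' % pipeline.score(X_test, y_test))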

Cross-validation

Raschka covers two types of cross-validation, or techniques for measuring how well a model generalizes: the holdout method and $k$-fold cross-validation.

Note that both of these techniques are ways of partitioning a dataset into training/testing/validation sets for analysis; the performance on those sets still has to be measured with some specific metric. Cross-validation doesn't describe the metric used to quantify a model's performance; it simply describes how to partition a dataset so that we get the most mileage out of whatever metric we choose.

Holdout cross-validation

Holdout is the most basic kind of cross-validation. The intuition is that we split a dataset into three partitions: training, testing, and validation. We use the training dataset for training the model; the validation dataset for getting feedback on how well our model is performing on data it hasn't seen before, and iterating on it; and the testing set for verifying our performance once we're confident our model is performing well.

Why use two different test partitions (validation and testing)? Well, we need a validation partition to get feedback on how our model is doing, but after iterating many times we can't rely on this partition to accurately measure the model's performance. By testing against the validation partition over and over, we effectively incorporate it into our model's training: not by literally learning parameters from the validation data, but by repeatedly tuning the model toward it through the human act of model selection. In order to get a final, unbiased estimate of our model's performance, we need to reserve ("hold out") one partition outside of the training process completely.
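A minimal sketch of a holdout split with scikit-learn's train_test_split (the 60/20/20 proportions and the X/y variable names are just assumptions for illustration):

from sklearn.model_selection import train_test_split

# Carve out the final test set first (20% of the data); we won't touch it
# again until the very end
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Split the remainder into training and validation sets
# (0.25 of the remaining 80% = 20% of the original data)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1)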

$k$-fold cross-validation

Holdout cross-validation has a major drawback: the randomness of the partitioning can have unintended consequences for measuring a model's generalizability. If we accidentally partition the dataset in a strange way, such that (for example) most samples of class A end up in the training partition and most samples of class B end up in the testing partition, we're going to get mysteriously poor results.

$k$-fold cross-validation attempts to overcome this drawback by partitioning the dataset multiple times, retraining the model and measuring performance each time in order to get a good sense of the "average" performance of the model. The intuition is to split the dataset (without replacement) into $k$ equally sized partitions (or "folds") and then repeat the holdout method $k$ times: on each iteration, $k-1$ of the folds together serve as the training set and the remaining fold serves as the validation set. We save the performance metric from each iteration for the end, when we average all $k$ scores together.

Optionally, we can use a slightly modified form known as stratified $k$-fold cross-validation to further improve the robustness of our estimate. In stratified $k$-fold, each fold is constructed so that it preserves the class proportions of the full dataset, so every fold is roughly representative of the overall class distribution. This yields more reliable estimates, especially when the classes are imbalanced.

Note that $k$-fold cross-validation only splits the data into training/validation folds, so we need to first partition our dataset into training and test sets and then perform the cross-validation only on the training dataset, holding the test set out for a final, unbiased evaluation.
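A sketch of $k$-fold cross-validation on the training data, reusing the pipeline from above (X_train and y_train are assumed to come from an earlier holdout split):

from sklearn.model_selection import cross_val_score

# With an integer cv and a classifier, scikit-learn uses stratified
# k-fold under the hood; we get one accuracy score back per fold
scores = cross_val_score(estimator=pipeline, X=X_train, y=y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))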

Learning and validation curves

As we've heard time and time again, training a model involves dealing with a tradeoff between bias (underfitting) and variance (overfitting). Models with high bias tend to score well below the desired accuracy on both training and test data; models with high variance, on the other hand, often perform well on training data but poorly when applied to testing data.

How can we tell when a model is suffering from high bias or high variance? And how do we know what to do about it? Learning and validation curves offer two related techniques for addressing these questions.

In both cases, we perform training, testing, and validation many different times, tweaking slightly different variables - either sample size or hyperparameters, depending on the particular technique - on each iteration. Then, we plot the change in accuracy as the variables change.

Learning curves

Learning curves seek to answer the question: will collecting more data improve model performance? By using sample size as the independent variable, we can use the plot to see how the model behaves as sample sizes increase.
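A sketch using scikit-learn's learning_curve function (the pipeline and the X_train/y_train arrays are assumptions carried over from earlier):

import numpy as np
from sklearn.model_selection import learning_curve

# Train on 10%, 20%, ..., 100% of the training data, scoring each size
# with 10-fold cross-validation
train_sizes, train_scores, valid_scores = learning_curve(
    estimator=pipeline, X=X_train, y=y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

# Plotting train_scores.mean(axis=1) and valid_scores.mean(axis=1) against
# train_sizes shows whether collecting more data is likely to help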

Validation curves

Validation curves are identical to learning curves, with the difference that the independent variable is a model hyperparameter instead of dataset sample size. By varying the hyperparameter, we can get a sense of a good range in which we can define it.
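The corresponding sketch with validation_curve, varying (for example) the inverse-regularization parameter C of the logistic regression step in our pipeline (the step name 'clf' and the parameter range are assumptions):

from sklearn.model_selection import validation_curve

# Vary C for the 'clf' step of the pipeline, scoring each value with
# 10-fold cross-validation
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, valid_scores = validation_curve(
    estimator=pipeline, X=X_train, y=y_train,
    param_name='clf__C', param_range=param_range, cv=10)

# Plotting the mean scores against param_range suggests a good range for C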

The gist of learning and validation curves is very simple. We'll experiment with them in the exercises, since they're quite intuitive.

Grid search

Grid search is a method for finding optimal combinations of hyperparameters. As far as model-tuning techniques go, it's extremely expensive, but helpful for doing initial exploration.

The intuition is to set ranges (as arrays) of different hyperparameters and the possible values they can take, and then perform a brute-force, exhaustive search to find the best one (that is, test every single possible combination). Again, scikit-learn provides a class for defining reproducible grid searches.
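A minimal GridSearchCV sketch over the pipeline from earlier (the parameter values searched here are just illustrative assumptions):

from sklearn.model_selection import GridSearchCV

# Exhaustively evaluate every candidate value of C, scoring each with
# 10-fold cross-validation on the training data
param_grid = {'clf__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
gs = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                  scoring='accuracy', cv=10)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)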

Randomized search: a cheaper alternative

To improve efficiency, we can make use of a cheaper alternative to grid search called randomized search that will also test combinations of hyperparameters. Instead of performing an exhaustive search, the goal of randomized search is to draw parameter combinations from different sampling distributions with a preset budget. This way, we can define exactly how hard we want our search to look for a good combination.
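A sketch with RandomizedSearchCV, drawing C from a log-uniform distribution with a fixed budget of 20 candidates (the distribution and the budget are arbitrary choices for illustration):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# n_iter is the budget: only 20 parameter settings are sampled from the
# distribution, no matter how large the search space is
param_distributions = {'clf__C': loguniform(0.001, 100.0)}
rs = RandomizedSearchCV(estimator=pipeline,
                        param_distributions=param_distributions,
                        n_iter=20, scoring='accuracy', cv=10,
                        random_state=1)
rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.best_params_)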

Nested cross-validation

So far, we've mostly considered techniques for tuning hyperparameters (and sample sizes). But how should we go about selecting the right model in the first place? How do we know to use SVM over logistic regression, or to make use of a random forest?

In an ideal world, we'd have a strong theory of the phenomenon underlying our data, and we'd choose the best model to fit that theory. In the real world, however, sometimes we just want to try stuff and see what sticks. Nested cross-validation is a principled way of accomplishing that.

Intuition

The basic idea behind nested cross-validation is that we'll perform $k$-fold cross-validation within a $k$-fold cross-validation. We'll use the inner folds to test a bunch of different hyperparameters, either with a grid search or a randomized search, while using the outer folds to test different model types. By using a second set of inner folds to test hyperparameters, we can get a good sense of how a model performs "on average" for the task at hand.

In scikit-learn, we can implement nested cross-validation by using an instance of the GridSearchCV class for the inner loop, and the cross_val_score method for the outer loop.
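A sketch of that pattern: GridSearchCV handles the inner folds and cross_val_score wraps it for the outer folds (the grid and the 5x2 fold counts are illustrative assumptions):

from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner loop: 2-fold grid search over C for the logistic regression pipeline
gs = GridSearchCV(estimator=pipeline,
                  param_grid={'clf__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
                  scoring='accuracy', cv=2)

# Outer loop: 5-fold cross-validation; each outer fold reruns the inner
# grid search from scratch, so the final score reflects the whole tuning
# procedure rather than one lucky hyperparameter setting
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print('Nested CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))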

Performance metrics

Up until this point, we've relied on a single metric to judge the performance of a model, whether on training, testing, or validation sets: model accuracy, or the proportion of the model's predictions that turn out to be correct.

In this section, we'll take a look at some more sophisticated metrics we can use to judge model performance: confusion matrices, the true and false positive rates, precision, recall, and the F1 score.

Confusion matrices

Confusion matrices are a fundamental structure for specifying the ways in which a model is doing well or doing poorly. They also conveniently help us get acquainted with the concept of true and false positives/negatives, which underlie all of the metrics we consider in this section.

True/false and positive/negative form two different axes on which we can judge a model's performance. When a prediction is true, the prediction matches the actual value, and the reverse for false; when a prediction is positive, on the other hand, it simply means the model has assigned the sample to one class (the "positive" class) rather than the other (the "negative" class). Hence, a prediction can be both positive and false at the same time (as when a model predicts it will rain tomorrow and it doesn't, a false positive), and in the same way it can be both negative and true (when the model predicts it won't rain tomorrow, and it turns out to be right).

We can represent these axes (true/false and positive/negative) as a matrix — or, as here, a table:

             predict +             predict -
actual +     true positive (TP)    false negative (FN)
actual -     false positive (FP)   true negative (TN)

Then, we can use the quadrants to record the scores that our model gets. For the sake of an example:

             predict +    predict -
actual +     500 (TP)     3 (FN)
actual -     5 (FP)       450 (TN)

Looks like this model's doing pretty well!
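scikit-learn can compute this table for us with confusion_matrix (assuming y_test and y_pred come from an earlier fit/predict step on a binary 0/1 problem with 1 as the positive class). Note that scikit-learn sorts the labels, so the negative class appears in the first row and column:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes; with labels
# [0, 1] the layout is [[TN, FP], [FN, TP]]
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)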

Note that we can express our current favorite metric, accuracy, in terms of true/false positives/negatives:

$$ accuracy := \frac{TP + TN}{FP + FN + TP + TN} $$

We can also define error in these terms, as the proportion of incorrect predictions: $$ error := \frac{FP + FN}{FP + FN + TP + TN} = 1 - accuracy $$
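Plugging the example confusion matrix from above into these formulas:

$$ accuracy = \frac{500 + 450}{5 + 3 + 500 + 450} = \frac{950}{958} \approx 0.992, \qquad error = 1 - 0.992 = 0.008 $$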

Two more terms will be helpful for us. The true positive rate (TPR) measures the proportion of all positive samples that get correctly predicted by the model. Formally:

$$ TPR := \frac{TP}{P} = \frac{TP}{TP + FN} $$

The denominator that defines $P$ was a little confusing to me at first. We need the sum of true positives and false negatives because both represent positive samples (since "false negatives" are actually positive).

Likewise, the false positive rate (FPR) measures the opposite, that is, the proportion of all negative samples that get misclassified as positive:

$$ FPR := \frac{FP}{N} = \frac{FP}{TN + FP} $$
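Using the example confusion matrix from above again:

$$ TPR = \frac{500}{500 + 3} \approx 0.994, \qquad FPR = \frac{5}{450 + 5} \approx 0.011 $$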

TPR and FPR are particularly useful in application domains that are disproportionately affected by misclassification. Medical diagnosis is a particularly dramatic example: when we're considering a treatment that is very risky, we'll probably care a lot more about lowering the false positive rate, since treatment could unnecessarily endanger a healthy patient. When considering a safe treatment for a life-threatening disease, on the other hand, we'll be much more interested in increasing the true positive rate, since we want to make sure that we identify as many sick people as possible.

Precision, recall, and F1 scores

Recall is a fancy name for the true positive rate. That is:

$$ REC := TPR = \frac{TP}{TP + FN} $$

Precision, on the other hand, measures the proportion of all predicted positives that are correct. Note the altered denominator:

$$ PRE := \frac{TP}{TP + FP} $$

Another way of thinking about these two metrics:

  • precision measures the fraction of retrieved instances that are relevant
  • recall measures the fraction of relevant instances that are retrieved

As you might be able to tell from this near-tongue-twister, precision and recall are often inversely related: improving precision often reduces recall, and vice versa. As a perverse example, think about an easy way to get perfect recall: just label every sample as positive! If we do this, however, we'll (usually) get really bad precision, since we'll probably wind up with a lot of false positives (and hence a big denominator).

To balance precision and recall in the real world, we often use a combination of the two called the F1 score:

$$ F1 := 2 \cdot \frac{PRE \cdot REC}{PRE + REC} $$

The leading $2$ scales the range of the metric to $[0, 1]$. Can you see why?
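Finally, a quick sketch of computing these metrics with scikit-learn (again assuming y_test and y_pred are binary label arrays from an earlier fit/predict step):

from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('Recall:    %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1:        %.3f' % f1_score(y_true=y_test, y_pred=y_pred))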