Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!
Returning to our $k$-NN classifier, we find that we have only one hyperparameter to tune: $k$. Typically, you would have many more parameters to experiment with, but the $k$-NN algorithm is simple enough for us to implement grid search manually.
Before we get started, we need to split the dataset into training and test sets, as we have done before. Here we choose a 75-25 split:
In [1]:
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target
In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=37
)
Then the goal is to loop over all possible values of $k$. As we do this, we want to keep track of the best accuracy we observed as well as the value for $k$ that gave rise to this result:
In [3]:
best_acc = 0
best_k = 0
Grid search then looks like an outer loop around the entire train and test procedure. After calculating the accuracy on the test set (acc), we compare it to the best accuracy found so far (best_acc). If the new value is better, we update our bookkeeping variables and move on to the next iteration:
In [4]:
import cv2
from sklearn.metrics import accuracy_score
for k in range(1, 20):
    # Train a k-NN classifier with the current value of k
    knn = cv2.ml.KNearest_create()
    knn.setDefaultK(k)
    knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)

    # Score it on the test set and keep the best result seen so far
    _, y_test_hat = knn.predict(X_test)
    acc = accuracy_score(y_test, y_test_hat)
    if acc > best_acc:
        best_acc = acc
        best_k = k
When we are done, we can have a look at the best accuracy and the $k$ that produced it:
In [5]:
best_acc, best_k
Out[5]:
It turns out that we can get 97.4% accuracy using $k=1$.
How would you do this when you have more than one hyperparameter? Refer to the book to find the answer to this one (p.318).
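As a rough illustration only (not the book's solution), one common approach with more than one hyperparameter is to loop over every combination of candidate values, for example with itertools.product. The sketch below uses scikit-learn's KNeighborsClassifier (introduced later in this excerpt) because it exposes a second real hyperparameter, weights; the candidate values are assumptions chosen for the sketch, not taken from the book:

# Hypothetical sketch: manual grid search over two hyperparameters.
# The candidate values below are illustrative assumptions.
from itertools import product
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

param_candidates = {
    'n_neighbors': list(range(1, 20)),
    'weights': ['uniform', 'distance'],  # a second hyperparameter of k-NN
}
best_acc, best_params = 0.0, None
for k, w in product(param_candidates['n_neighbors'],
                    param_candidates['weights']):
    model = KNeighborsClassifier(n_neighbors=k, weights=w)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_acc, best_params = acc, {'n_neighbors': k, 'weights': w}

The bookkeeping is the same as before; the only difference is that the loop now walks over the Cartesian product of all candidate values.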
Following our best practice of splitting the data into training and test sets, we might be tempted to tell people that we have found a model that performs with 97.4% accuracy on the dataset. However, our result might not necessarily generalize to new data. The argument is the same as the one we made earlier in the book to justify the train-test split: we need an independent dataset for evaluation.
However, when we implemented grid search in the last section, we used the test set to evaluate the outcome of the grid search and to update the hyperparameter $k$. This means we can no longer use the test set to evaluate the final model! Any model choices made based on the test set accuracy would leak information from the test set into the model.
One way to resolve this problem is to split the data again and introduce what is known as a validation set. The validation set is different from the training and test sets and is used exclusively for selecting the best parameters of the model. It is good practice to do all exploratory analysis and model selection on this validation set and to keep a separate test set, which is only used for the final evaluation.
In other words, we should end up splitting the data into three different sets: a training set to build the model, a validation set to select the parameters of the model, and a test set to evaluate the performance of the final model.
In practice, the three-way split is achieved in two steps. First, split the data into two chunks: one that contains the training and validation sets, and another that contains the test set:
In [6]:
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, random_state=37
)
In [7]:
X_trainval.shape
Out[7]:
Second, split X_trainval again into proper training and validation sets:
In [8]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=37
)
In [9]:
X_train.shape
Out[9]:
Then we repeat the manual grid search from the preceding code, but this time, we will use the validation set to find the best $k$:
In [10]:
best_acc = 0.0
best_k = 0
for k in range(1, 20):
    knn = cv2.ml.KNearest_create()
    knn.setDefaultK(k)
    knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
    _, y_valid_hat = knn.predict(X_valid)
    acc = accuracy_score(y_valid, y_valid_hat)
    if acc >= best_acc:
        best_acc = acc
        best_k = k
best_acc, best_k
Out[10]:
We now find that a 100% validation score (best_acc) can be achieved with $k=7$ (best_k)!
However, recall that this score might be overly optimistic. To find out how well the model really performs, we need to test it on held-out data from the test set.
In order to arrive at our final model, we can use the value for $k$ we found during grid search and re-train the model on both the training and validation data. This way, we use as much data as possible to build the model while still honoring the train-test split principle.
This means we should retrain the model on X_trainval, which contains both the training and validation sets, and score it on the test set:
In [11]:
knn = cv2.ml.KNearest_create()
knn.setDefaultK(best_k)
knn.train(X_trainval, cv2.ml.ROW_SAMPLE, y_trainval)
_, y_test_hat = knn.predict(X_test)
accuracy_score(y_test, y_test_hat), best_k
Out[11]:
With this procedure, we find a formidable score of 94.7% accuracy on the test set. Because we honored the train-test split principle, we can now be sure that this is the performance we can expect from the classifier when applied to novel data. It is not as high as the 100% accuracy reported during validation, but it is still a very good score!
One potential danger of the grid search we just implemented is that the outcome might be relatively sensitive to how exactly we split the data. After all, we might have accidentally chosen a split that put most of the easy-to-classify data points in the test set, resulting in an overly optimistic score. Although we would be happy at first, as soon as we tried the model on some new held-out data, we would find that the actual performance of the classifier is much lower than expected.
Instead, we can combine grid search with cross-validation. This way, the data is split multiple times into training and validation sets, and cross-validation is performed at every step of the grid search to evaluate every parameter combination.
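To make the idea concrete before turning to the built-in implementation, here is a minimal hand-rolled sketch (not the book's code) that scores every candidate $k$ with 5-fold cross-validation on the training+validation data, using scikit-learn's cross_val_score and the KNeighborsClassifier introduced below:

# Illustrative sketch: each candidate k is evaluated by 5-fold
# cross-validation instead of a single train-validation split.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

best_mean_score, best_k = 0.0, None
for k in range(1, 20):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_trainval, y_trainval, cv=5)
    if scores.mean() > best_mean_score:
        best_mean_score, best_k = scores.mean(), k

This is essentially what the scikit-learn class described next does for us, with additional conveniences such as refitting the best model at the end.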
Because grid search with cross-validation is such a commonly used method for hyperparameter tuning, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator.
We can specify all the parameters we want GridSearchCV to search over by using a dictionary. Every entry of the dictionary should be of the form {name: values}, where name is a string that should be equivalent to the parameter name usually passed to the classifier, and values is a list of values to try.
For example, in order to search for the best value of the parameter n_neighbors of the KNeighborsClassifier class, we would design the parameter dictionary as follows:
In [12]:
param_grid = {'n_neighbors': range(1, 20)}
Here, we are searching for the best $k$ in the range [1, 19].
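Incidentally, this dictionary is also the natural place to handle more than one hyperparameter: adding further entries makes GridSearchCV try every combination. For instance (an illustrative assumption, not part of the book's example), we could also search over the classifier's weights parameter:

param_grid_2 = {'n_neighbors': range(1, 20),
                'weights': ['uniform', 'distance']}  # illustrative extra entry

In the remainder of this section, we stick to the single-parameter grid defined above.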
We then need to pass the parameter grid as well as the classifier (KNeighborsClassifier) to the GridSearchCV object:
In [13]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
Then we can train the classifier using the fit method. In return, scikit-learn will inform us about all the parameters used in the grid search:
In [14]:
grid_search.fit(X_trainval, y_trainval)
Out[14]:
This will allow us to find the best validation score and the corresponding value for $k$:
In [15]:
grid_search.best_score_, grid_search.best_params_
Out[15]:
We thus get a validation score of 96.4% for $k=3$. Since grid search with cross-validation is more robust than our earlier procedure, we would expect the validation scores to be more realistic than the 100% accuracy we found before.
However, from the previous section, we know that this score might still be overly optimistic, so we need to score the classifier on the test set instead:
In [16]:
grid_search.score(X_test, y_test)
Out[16]:
And to our surprise, the test score is even better.
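As a closing aside (not from the book's code): because GridSearchCV refits the best parameter combination on all of the data passed to fit (refit=True is the default), the fitted grid_search object can be used directly to make predictions on new data, and the refit classifier is available as its best_estimator_ attribute:

# Using the refit best model for new predictions (refit=True is the default)
y_test_pred = grid_search.predict(X_test)   # delegates to the best estimator
final_model = grid_search.best_estimator_   # the refit KNeighborsClassifier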