KNN Parameter Tuning

In Segmentation: KNN, we perform KNN classification of pixels as crop or non-crop. One parameter in the KNN classifier is the number of neighbors (the K in KNN). To choose a value for this parameter, we perform cross-validation and pick the k that yields the highest accuracy score. In this notebook, we demonstrate that cross-validation, using the training data X (feature values) and y (classifications) that were generated in Segmentation: KNN. The selected k value is then fed back into Segmentation: KNN to create the KNN classifier that predicts each pixel's crop/non-crop designation.

In this notebook, we find that increasing the number of neighbors from 3 to 9 improves accuracy only marginally while also increasing run time. Therefore, we will use the smallest number of neighbors: 3.
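
To make the cross-validation step concrete, here is a minimal, self-contained sketch of scoring a single candidate k with scikit-learn's cross_val_score (the grid search below automates this over several values of k). The X_demo and y_demo arrays are hypothetical placeholders, not the real training data:


In [ ]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: 100 samples with 4 features each, binary labels
X_demo = np.random.rand(100, 4)
y_demo = np.random.randint(0, 2, size=100)

# Mean 3-fold cross-validated accuracy for k=3
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                         X_demo, y_demo, cv=3, scoring='accuracy')
print(scores.mean())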


In [1]:
from __future__ import print_function

import os

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier as KNN

First, we load the data that was saved in Segmentation: KNN.


In [2]:
# Load the training data saved in Segmentation: KNN
def load_cross_val_data(datafile):
    npzfile = np.load(datafile)
    X = npzfile['X']  # per-pixel feature values
    y = npzfile['y']  # crop/non-crop labels
    return X, y

datafile = os.path.join('data', 'knn_cross_val', 'xy_file.npz')
X, y = load_cross_val_data(datafile)
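
As a quick sanity check, we can inspect the loaded arrays (the exact dimensions depend on what was saved in Segmentation: KNN):


In [ ]:
# X holds one row of feature values per pixel sample;
# y holds the matching crop/non-crop labels
print('X shape:', X.shape)
print('y shape:', y.shape)
print('classes:', np.unique(y))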

Next, we perform a grid search over the number of neighbors, looking for the value that corresponds to the highest accuracy.


In [ ]:
# Candidate neighbor counts: 3, 5, 7, 9
tuned_parameters = {'n_neighbors': range(3, 11, 2)}

# Grid search with 3-fold cross-validation over the number of neighbors
clf = GridSearchCV(KNN(n_neighbors=3),
                   tuned_parameters,
                   cv=3,
                   verbose=10)
clf.fit(X, y)

print("Best parameters set found on development set:\n")
print(clf.best_params_)

print("Grid scores on development set:\n")

means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
res_params = clf.cv_results_['params']
for mean, std, params in zip(means, stds, res_params):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))


Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] n_neighbors=3 ...................................................

It turns out that increasing the number of neighbors from 3 to 9 improves accuracy only marginally while also increasing run time. Therefore, we will use the smallest number of neighbors: 3.
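
As a final step, the selected k is fed back into Segmentation: KNN. The cell below is a minimal sketch of that hand-off: it refits a KNN classifier with the chosen k on all of the training data (GridSearchCV also refits clf.best_estimator_ on the full data by default, so that estimator could be used directly). The new_pixels array is a hypothetical stand-in for pixel features prepared in Segmentation: KNN.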


In [ ]:
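# Build the final classifier with the chosen k, refit on all of the
# training data, mirroring the hand-off to Segmentation: KNN
best_k = clf.best_params_['n_neighbors']  # expected to be 3
final_clf = KNN(n_neighbors=best_k).fit(X, y)

# new_pixels: hypothetical (n_pixels, n_features) array of pixel
# feature values; predict returns a crop/non-crop label per pixel
# predictions = final_clf.predict(new_pixels)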