In Segmentation: KNN, we perform KNN classification of pixels as crop or non-crop. One parameter of the KNN classifier is the number of neighbors (the k in KNN). To determine the best value for this parameter, we perform cross-validation and pick the k that yields the highest accuracy score. In this notebook, we demonstrate that cross-validation, using the training data X (values) and y (classifications) that were generated in Segmentation: KNN. The chosen k value is then fed back into Segmentation: KNN to create the KNN classifier that is used to predict each pixel's crop/non-crop designation.
In this notebook, we find that increasing the number of neighbors from 3 to 9 increases accuracy only marginally, while it also increases run time. Therefore, we will use the smallest number of neighbors: 3.
In [1]:
    
from __future__ import print_function
import os
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN
    
First, we load the data that was saved in Segmentation: KNN.
In [2]:
    
# Load the cross-validation data saved in Segmentation: KNN
def load_cross_val_data(datafile):
    npzfile = np.load(datafile)
    X = npzfile['X']  # feature values
    y = npzfile['y']  # crop/non-crop classifications
    return X, y

datafile = os.path.join('data', 'knn_cross_val', 'xy_file.npz')
X, y = load_cross_val_data(datafile)
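For reference, an archive like `xy_file.npz` can be produced with `np.savez`, using the same keyword names (`X`, `y`) that `load_cross_val_data` reads back. A minimal sketch, with small illustrative arrays standing in for the real training data:

```python
import numpy as np

# Illustrative stand-ins for the real training data:
# X holds per-pixel feature values, y the crop (1) / non-crop (0) labels.
X_demo = np.array([[0.2, 0.4], [0.8, 0.1], [0.5, 0.9]])
y_demo = np.array([0, 1, 1])

# Saving with keyword arguments X= and y= stores the arrays under
# the keys 'X' and 'y' in the .npz archive.
np.savez('xy_file.npz', X=X_demo, y=y_demo)

# Round-trip check: the loaded arrays match what was saved.
npzfile = np.load('xy_file.npz')
print(npzfile['X'].shape, npzfile['y'].shape)
```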
    
Next, we perform a grid search over the number of neighbors, looking for the value that corresponds to the highest accuracy.
In [ ]:
    
# Search over odd neighbor counts: 3, 5, 7, 9
tuned_parameters = {'n_neighbors': range(3, 11, 2)}
clf = GridSearchCV(KNN(n_neighbors=3),
                   tuned_parameters,
                   cv=3,
                   verbose=10)
clf.fit(X, y)

print("Best parameters set found on development set:\n")
print(clf.best_params_)

print("Grid scores on development set:\n")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
res_params = clf.cv_results_['params']
for mean, std, params in zip(means, stds, res_params):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
    
    
It turns out that increasing the number of neighbors from 3 to 9 increases accuracy only marginally, while it also increases run time. Therefore, we will use the smallest number of neighbors: 3.
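To illustrate how the chosen k feeds back into Segmentation: KNN, here is a minimal sketch of fitting the final classifier with k = 3 and predicting on new pixels. The arrays are synthetic stand-ins for the real pixel data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN

# Synthetic stand-in data: two well-separated clusters of "pixels".
X_train = np.array([[0.1, 0.1], [0.2, 0.1], [0.1, 0.2],   # non-crop (0)
                    [0.9, 0.9], [0.8, 0.9], [0.9, 0.8]])  # crop (1)
y_train = np.array([0, 0, 0, 1, 1, 1])

# Use the k selected by the grid search above.
classifier = KNN(n_neighbors=3)
classifier.fit(X_train, y_train)

# Predict crop/non-crop designations for new pixel values.
X_new = np.array([[0.15, 0.15], [0.85, 0.85]])
print(classifier.predict(X_new))  # → [0 1]
```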