In the previous post we took a look at Logistic Regression on a simple classification problem. Let's ask how a non-linear method would perform on a similar task.


In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split, KFold, cross_val_score

digits = datasets.load_digits()

# split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split( digits.data, 
                                                    digits.target,
                                                    test_size=0.33 )
# use default number of neighbors (5)
nbor = neighbors.KNeighborsClassifier()

# cross-validation scores over 30 folds of the training set
k_fold = KFold( n_splits=30 )
scores = cross_val_score( nbor, X_train, y_train, cv=k_fold )

fig, ax1 = plt.subplots( figsize=(10, 6) )    
ax1.plot( range( len( scores ) ), scores, 'b-' )

ax1.set_ylabel( 'CV score' )
ax1.set_xlabel( 'Fold' )

m0 = np.average( scores )
sigma = np.std( scores )

ax1.axhline( min( 1.0, m0 + sigma ) , linestyle='--', color='.5' )
ax1.axhline( max( 0.0, m0 - sigma ), linestyle='--', color='.5' )

plt.show()


It looks like there is a danger that the nearest neighbors classifier overfits the data: the cross-validation score varies noticeably from fold to fold. One way to probe this is to vary the number of neighbors, as sketched below. After that, let's see what the test phase shows.
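
As a rough check (a minimal sketch, reusing the X_train / y_train split from above; the values of n_neighbors are picked arbitrarily for illustration), one could look at how the mean cross-validation score changes with the number of neighbors:

# mean CV accuracy for a few values of n_neighbors
for k in ( 1, 3, 5, 10, 20 ):
    clf = neighbors.KNeighborsClassifier( n_neighbors=k )
    k_scores = cross_val_score( clf, X_train, y_train, cv=5 )
    print( 'k={0}: mean CV score {1:.3f} (+/- {2:.3f})'.format(
        k, k_scores.mean(), k_scores.std() ) )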


In [2]:
# fit the predictor
nbor.fit( X_train, y_train )

print( 'Test score:{0}'.format( nbor.score( X_test, y_test ) ) )


Test score:0.983164983165

Not bad compared to Logistic Regression; a quick comparison on the same split is sketched below.
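
This is only a sketch, not the exact Logistic Regression setup from the previous post: default solver settings are assumed, and max_iter is raised just to avoid convergence warnings on this dataset.

# Logistic Regression on the same train/test split, default settings
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression( max_iter=1000 )
logreg.fit( X_train, y_train )
print( 'Logistic Regression test score:{0}'.format( logreg.score( X_test, y_test ) ) )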