In the previous post we took a look at Logistic Regression on a simple classification problem. Let's ask how a non-linear method, k-nearest neighbors, would perform on a similar task.
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split, KFold, cross_val_score
digits = datasets.load_digits()
# split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split( digits.data,
                                                     digits.target,
                                                     test_size=0.33 )
# use default number of neighbors (5)
nbor = neighbors.KNeighborsClassifier()
# scores over different folds
k_fold = KFold( n_splits=30 )
scores = cross_val_score( nbor, X_train, y_train, cv=k_fold )
fig, ax1 = plt.subplots( figsize=(10, 6) )
ax1.plot( range( len( scores ) ), scores, 'b-' )
ax1.set_ylabel( 'CV score' )
ax1.set_xlabel( 'Fold' )
# dashed lines mark one standard deviation around the mean CV score
m0 = np.average( scores )
sigma = np.std( scores )
ax1.axhline( min( 1.0, m0 + sigma ), linestyle='--', color='.5' )
ax1.axhline( max( 0.0, m0 - sigma ), linestyle='--', color='.5' )
plt.show()
Looks like there is a danger that the nearest neighbors classifier could overfit the data. Let's see what the testing phase shows.
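One quick way to probe this is to repeat the cross-validation for several values of n_neighbors and compare the mean scores; a very small k tends to overfit while a very large k underfits. A minimal sketch (the particular k values here are an arbitrary choice, not part of the original run):

ks = [ 1, 3, 5, 7, 11, 15, 21 ]
for k in ks:
    clf = neighbors.KNeighborsClassifier( n_neighbors=k )
    # mean cross-validation accuracy over the same 30 folds for each k
    k_scores = cross_val_score( clf, X_train, y_train, cv=k_fold )
    print( 'k={0}: mean CV score {1:.3f} +/- {2:.3f}'.format( k, k_scores.mean(), k_scores.std() ) )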
In [2]:
# fit the predictor
nbor.fit( X_train, y_train )
print( 'Test score: {0}'.format( nbor.score( X_test, y_test ) ) )
Not bad compared to Logistic Regression.
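To make that comparison concrete, one could score scikit-learn's LogisticRegression on the exact same split; a sketch, where max_iter is raised only so the default solver converges on digits (an assumption, not a tuned setting):

from sklearn.linear_model import LogisticRegression
# fit on the same train/test split used for the neighbors classifier
logreg = LogisticRegression( max_iter=1000 )
logreg.fit( X_train, y_train )
print( 'Logistic Regression test score: {0}'.format( logreg.score( X_test, y_test ) ) )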