The ATM Camera

We are tasked with building a smart ATM camera that can distinguish between dollar notes and checks. We want to make sure that dollars are not classified as checks, and that checks are not classified as dollars.

We are given a set of 87 images of checks and dollars, each of which has been scaled to 322 x 137 pixels, with 3 color channels per pixel.

Download imag.pix.npy from https://drive.google.com/open?id=0B-8JzeE81nw8MFBCRDZFdDhacGc and move this file to the data folder.



In [1]:

    
# Import all the required libraries
# %matplotlib inline is an IPython magic: plots open inline in the notebook instead of in a separate window
%matplotlib inline 
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

from PIL import Image



In [17]:

    
c0=sns.color_palette()[0]
c1=sns.color_palette()[1]
c2=sns.color_palette()[2]



In [21]:

    
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

def points_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.1, psize=10, zfunc=False, predicted=False):
    # Build a 100 x 100 grid covering all points, padded by 0.5 on each side
    X=np.concatenate((Xtr, Xte))
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    grid = np.c_[xx.ravel(), yy.ravel()]

    # Evaluate the classifier on the grid: either a custom function of the
    # two class probabilities (zfunc) or the hard class prediction
    if zfunc:
        proba = clf.predict_proba(grid)
        Z=zfunc(proba[:, 0], proba[:, 1])
    else:
        Z = clf.predict(grid)
    ZZ = Z.reshape(xx.shape)
    if mesh:
        ax.pcolormesh(xx, yy, ZZ, cmap=colorscale, alpha=alpha)
    # Color the points by predicted class or by true label
    if predicted:
        showtr = clf.predict(Xtr)
        showte = clf.predict(Xte)
    else:
        showtr = ytr
        showte = yte
    # training points (circles)
    ax.scatter(Xtr[:, 0], Xtr[:, 1], c=showtr-1, cmap=cdiscrete, s=psize, alpha=alpha,edgecolor="k")
    # and testing points (squares)
    ax.scatter(Xte[:, 0], Xte[:, 1], c=showte-1, cmap=cdiscrete, alpha=alpha, marker="s", s=psize+10)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    return ax,xx,yy



In [22]:

    
def points_plot_prob(ax, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold, ccolor=cm, psize=10, alpha=0.1):
    # Scatter the points colored by predicted class, without the class mesh
    ax,xx,yy = points_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=False, colorscale=colorscale, cdiscrete=cdiscrete, psize=psize, alpha=alpha, predicted=True)
    # Probability of class 1 at every grid point
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)
    # Filled contours for the probability surface, with labelled contour lines on top
    ax.contourf(xx, yy, Z, cmap=ccolor, alpha=.2)
    cs2 = ax.contour(xx, yy, Z, cmap=ccolor, alpha=.6)
    ax.clabel(cs2, fmt = '%2.1f', colors = 'k', fontsize=14)
    return ax



In [3]:

    
data=np.load("data/imag.pix.npy")
y=np.load("data/imag.lbl.npy")
STANDARD_SIZE = (322, 137)#standardized pixels in image.
data.shape, y.shape









    Out[3]:





((87, 132342), (87,))

We display some of the images that we have:



In [4]:

    
def get_image(mat):
    size = STANDARD_SIZE[0]*STANDARD_SIZE[1]*3
    r,g,b = mat[0:size:3], mat[1:size:3],mat[2:size:3]
    rgbArray = np.zeros((STANDARD_SIZE[1],STANDARD_SIZE[0], 3), 'uint8')#3 channels
    rgbArray[..., 0] = r.reshape((STANDARD_SIZE[1], STANDARD_SIZE[0]))
    # channel order in the output array is R, G, B
    rgbArray[..., 1] = g.reshape((STANDARD_SIZE[1], STANDARD_SIZE[0]))
    rgbArray[..., 2] = b.reshape((STANDARD_SIZE[1], STANDARD_SIZE[0]))
    return rgbArray

def display_image(mat):
    with sns.axes_style("white"):
        plt.imshow(get_image(mat))
        plt.xticks([])
        plt.yticks([])

Image of a CHECK



In [5]:

    
display_image(data[5])

Image of a U.S. Dollar



In [6]:

    
display_image(data[50])

The curse of dimensionality: Feature engineering

The first thing that we notice is that we have a great many features: to be precise, $322 \times 137 \times 3 = 132342$ of them, which matches the second dimension of data.shape above. Having too many features can lead to overfitting.

Another way to look at this problem is the following: we have 87 data points but 132342 features; that is, far more features than data points. Thus there is a high chance that a few attributes will correlate with $y$ purely coincidentally! (Having lots of images, or "big data", helps in combating overfitting.)
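To see how easily coincidental correlations arise, here is a minimal sketch on made-up data (not the ATM images). It assumes Python 3 with a recent NumPy; the names y_fake and max_abs_correlation are hypothetical. Both the labels and the features are pure noise, yet with enough features one of them ends up noticeably correlated with the labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 87          # same number of samples as our image set

# Labels and features are independent noise: no feature truly predicts y_fake.
y_fake = rng.integers(0, 2, size=n_samples).astype(float)

def max_abs_correlation(n_features):
    """Largest |Pearson correlation| between y_fake and any of n_features noise columns."""
    F = rng.normal(size=(n_samples, n_features))
    Fc = (F - F.mean(axis=0)) / F.std(axis=0)
    yc = (y_fake - y_fake.mean()) / y_fake.std()
    corr = Fc.T.dot(yc) / n_samples
    return np.abs(corr).max()

few = max_abs_correlation(10)
many = max_abs_correlation(20000)
print("10 noise features:    max |corr| = %.2f" % few)
print("20000 noise features: max |corr| = %.2f" % many)
```

The more columns we search through, the larger the best accidental correlation becomes, even though nothing in the data is informative.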

We will engage in some a-priori feature selection that will reduce the dimensionality of the problem. The idea we'll use here is something called Principal Components Analysis, or PCA.

PCA is an unsupervised learning technique. The basic idea behind PCA is to rotate the coordinate axes of the feature space. We first find the direction in which the data varies the most. We set up one coordinate axis along this direction, which is called the first principal component. We then look for a perpendicular direction in which the data varies the second most. This is the second principal component. The diagram illustrates this process. There are as many principal components as the feature dimension: all we have done is a rotation.

(diagram taken from http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues which also has nice discussions)
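The rotation picture can be reproduced on a small scale. The sketch below is an illustration on made-up 2-D points, not the notebook's data, and it uses a plain NumPy SVD rather than sklearn. It builds a cloud stretched along a direction 30 degrees from the x-axis and recovers that direction as the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data: stretched along one direction, then rotated by 30 degrees.
raw = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
points = raw.dot(R.T)

# PCA: center the data, then take the right singular vectors as the new axes.
centered = points - points.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt                        # rows are the principal directions
rotated = centered.dot(components.T)   # coordinates in the rotated frame

explained_variance = S**2 / (len(points) - 1)
ratio = explained_variance / explained_variance.sum()

print("first principal direction:", components[0])
print("explained variance ratio: ", ratio)
```

Up to sign, the first direction should come out close to (cos 30°, sin 30°), and the total variance is unchanged by the rotation; only its distribution across the axes changes.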

How does this then achieve feature selection? We decide on a threshold of variation; once the variation in a particular direction falls below a certain number, we get rid of all the coordinate axes after that principal component. For example, if the variation falls below 10% after the third axis, and we decide that 10% is an acceptable cutoff, we remove all dimensions from the fourth onwards. In other words, we take our higher-dimensional problem and project it onto a 3-dimensional subspace.
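The cutoff rule can be written down in a few lines. This is a sketch with made-up ratios chosen to match the 10% example; the function name n_components_to_keep is hypothetical.

```python
def n_components_to_keep(variance_ratios, cutoff):
    """Count leading components whose explained-variance ratio is at least `cutoff`.

    Assumes `variance_ratios` is sorted in decreasing order, as PCA produces.
    """
    kept = 0
    for ratio in variance_ratios:
        if ratio < cutoff:
            break
        kept += 1
    return kept

# Made-up ratios for illustration: the fourth component drops below 10%.
example_ratios = [0.50, 0.25, 0.12, 0.06, 0.04, 0.03]
print(n_components_to_keep(example_ratios, cutoff=0.10))  # prints 3
```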

These two ideas illustrate one of the most important reasons that learning is even feasible: we believe that most datasets, in either their unsupervised form $\{\v{x}\}$ or their supervised form $\{y, \v{x}\}$, live on a lower-dimensional subspace. If we can find this subspace, we can then hope to find a method which respectively separates or fits the data.

Here we'll continue to focus on PCA. We'll reduce our dimensionality from 132342 to 60. We choose 60 as a large a priori number: we don't know if the variation in the data will have gone below a reasonable threshold by then. Notice that we use fit_transform in the sklearn API, which takes the original 87 rows x 132342 columns data matrix and transforms it into an 87 x 60 data matrix X.



In [7]:

    
from sklearn.decomposition import PCA
pca = PCA(n_components=60)
X = pca.fit_transform(data)
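What fit_transform does to the shape of the data can be sketched with plain NumPy. This is an illustration on a small random matrix, not sklearn's actual implementation, and the projected coordinates it produces agree with a PCA only up to the sign of each component. The sketch also shows a consequence of having fewer samples than features: after centering, at most n_samples - 1 directions can carry any variance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the image matrix: few rows (samples), many columns (features).
n_samples, n_features, n_keep = 8, 50, 3
toy = rng.normal(size=(n_samples, n_features))

mean = toy.mean(axis=0)
centered = toy - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

components = Vt[:n_keep]                  # shape (n_keep, n_features)
reduced = centered.dot(components.T)      # shape (n_samples, n_keep)
print(toy.shape, "->", reduced.shape)

# Centering removes one degree of freedom, so at most n_samples - 1
# directions carry any variance, however many features there are.
n_nonzero = int(np.sum(S > 1e-10 * S[0]))
print("directions with nonzero variance:", n_nonzero)
```

By the same argument, our 87 images can span at most 86 directions of nonzero variance, so asking for 60 components is well within what the data can support.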



In [33]:

    
print(pca.explained_variance_ratio_.sum())









    



0.942517727403

The explained variance ratio pca.explained_variance_ratio_ tells us how much of the variation in the features is explained by each of these 60 components. When we sum it over the components, we see that 94% is explained: good enough to go down to a 60-dimensional space from a 132342-dimensional one!

We can see the individual variances (in percent) as we increase the dimensionality:



In [9]:

    
pca.explained_variance_ratio_*100









    Out[9]:





array([ 35.92596698,   6.29318801,   4.10778347,   3.11950952,
         2.81695972,   2.28831619,   2.10127948,   1.87404972,
         1.73264635,   1.53023757,   1.42159632,   1.31839387,
         1.24701501,   1.16381767,   1.09958246,   1.06073096,
         1.00742709,   0.98023627,   0.96055703,   0.91535041,
         0.90185445,   0.85212337,   0.83673085,   0.79691683,
         0.75488955,   0.72504809,   0.70810497,   0.67965443,
         0.6608896 ,   0.64768067,   0.62733756,   0.59472293,
         0.58291162,   0.57446687,   0.57261714,   0.55241183,
         0.5383977 ,   0.53333525,   0.51647796,   0.49326173,
         0.4853282 ,   0.47736271,   0.47200811,   0.45590127,
         0.4431691 ,   0.43974382,   0.43387969,   0.42625863,
         0.42155544,   0.40831131,   0.40490267,   0.39283743,
         0.38829814,   0.38018663,   0.37326643,   0.36088464,
         0.35867847,   0.34800511,   0.3387236 ,   0.3279939 ])

The first dimension accounts for about 36% of the variation, the second about 6%, and it goes steadily down from there.
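A running total makes the decay easier to read. The sketch below copies the first ten percentages from the output above (rounded to three decimals) and asks how many components are needed to reach half of the total variation; the variable names are hypothetical.

```python
import numpy as np

# First ten percentages copied from the output above (rounded to 3 decimals).
first_ten = np.array([35.926, 6.293, 4.108, 3.120, 2.817,
                      2.288, 2.101, 1.874, 1.733, 1.530])

cumulative = np.cumsum(first_ten)
print(np.round(cumulative, 1))

# Smallest number of components whose cumulative share reaches 50%.
n_for_half = int(np.searchsorted(cumulative, 50.0) + 1)
print(n_for_half)  # prints 5
```

So the first five components already carry more than half of the variation, while the remaining 55 are needed to get from there to 94%.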

Let us create a dataframe with these 60 features labelled pc1,pc2...,pc60 and the labels of the sample:



In [12]:

    
df = pd.DataFrame({"y":y, "label":np.where(y==1, "check", "dollar")})
for i in range(pca.explained_variance_ratio_.shape[0]):
    df["pc%i" % (i+1)] = X[:,i]
df.head()









    Out[12]:






  
    
      
	label	y	pc1	pc2	pc3	pc4	pc5	pc6	pc7	pc8	pc9	pc10	pc11	pc12	pc13	pc14	pc15	pc16	pc17	pc18	pc19	pc20	pc21	pc22	pc23	pc24	pc25	pc26	pc27	pc28	pc29	pc30	pc31	pc32	pc33	pc34	pc35	pc36	pc37	pc38	pc39	pc40	pc41	pc42	pc43	pc44	pc45	pc46	pc47	pc48	pc49	pc50	pc51	pc52	pc53	pc54	pc55	pc56	pc57	pc58	pc59	pc60
0	check	1	-22536.362571	-2428.343881	-2133.778654	-328.315167	-1065.244935	79.751089	-425.044522	622.473470	-2490.235234	-858.625406	-1072.739717	277.338468	-1117.632799	713.232053	-788.152163	415.330332	-196.426333	472.402955	-428.260040	-609.615315	-589.374363	-267.977782	-985.580554	955.838548	1038.101443	571.434103	-38.025535	-111.444917	-515.419989	548.684126	-651.430312	76.420204	-728.238531	1263.213011	-570.152301	-132.408164	490.011465	294.191539	-248.829449	48.852969	-476.402972	790.310302	527.419954	-425.422321	479.078909	-249.856538	-14.054258	295.129589	639.375064	-413.840231	-645.843829	104.977674	286.880096	227.000565	204.690312	362.838823	-367.031617	-105.476193	-252.647043	-145.515006
1	check	1	-22226.658684	-709.255622	-288.828952	-1300.623807	-792.079844	217.399929	1076.812657	-2115.197190	-875.369816	-1124.906377	-343.366448	-43.206764	547.400851	358.287158	-1762.085941	-371.745415	812.608885	293.981164	-939.109267	156.749465	-1122.593221	508.925711	-816.864442	720.657425	288.744718	-116.338984	830.702406	1152.818226	-355.451356	122.820742	-867.620263	576.392497	323.060849	171.307213	-691.497496	-660.592823	-114.819655	-186.334422	-133.453951	-381.891612	82.586705	40.544842	731.949079	22.936331	321.830227	-6.799914	28.684726	418.281904	-1158.224617	228.609238	-442.277211	-152.162484	138.813885	-628.020400	502.167440	351.173132	-232.657107	203.933576	-157.606857	-447.203684
2	check	1	-17364.260784	-4252.027339	793.802272	-1362.051898	-374.229879	3142.094522	2514.186949	1443.906396	-121.040048	-419.881696	-2083.146966	1717.481411	-723.126194	-1239.879985	-310.764048	-1059.301700	1386.250730	-799.640055	-2600.094450	-880.422727	3382.583043	-394.433198	-3410.545813	1337.285715	-4754.132060	1036.925349	2211.351463	-689.978524	1380.291957	-2163.528410	-640.334232	1696.781395	1779.175882	-3148.445702	-899.679519	2311.309173	-3028.425497	3045.493507	843.163883	269.238279	1506.726842	-3167.784697	3762.513396	1573.499583	812.144537	583.253509	-907.626091	-1021.615449	184.965036	-2451.420219	-2778.376143	-1922.169910	585.094215	2346.379958	230.420305	2689.840976	-139.418108	-535.682281	-1302.156457	1386.339072
3	check	1	-22847.256199	-900.592736	577.202127	-180.356175	-733.177677	-107.825722	1441.702751	-1069.341036	844.183935	-1389.694656	1470.681163	-724.495531	-578.435922	-260.814973	932.540647	167.770069	-301.431753	870.203161	-183.653629	1229.524413	557.588908	-178.598619	-311.478128	-373.256561	-918.619516	189.651744	1269.960852	73.422847	-286.175630	-444.791500	-140.480841	-66.494644	96.347819	852.703020	-167.296771	642.997115	375.239801	307.053860	172.091857	-364.574305	1443.536640	-803.972919	-960.156695	594.051478	470.853357	-507.618837	-805.532873	-309.484195	1101.745182	-1472.021544	-114.038992	49.401849	61.067276	426.506747	200.244924	-700.370855	-542.567106	-747.487327	930.468082	1245.172927
4	check	1	-15868.672595	3259.490950	506.805248	-919.648698	2166.530434	-981.609217	-1669.780548	-391.114114	1735.622091	163.576534	-1627.094770	-556.535502	-862.434238	1721.205351	944.995298	-1134.071084	2141.223921	765.901060	-1095.785318	599.646903	-29.198194	80.216536	-501.665227	1099.406955	338.501535	-1224.331844	-714.722336	-726.264946	-149.598651	-918.305686	1043.412945	-3608.676418	2089.919279	-3369.743018	-1000.493284	-2117.886268	1910.882115	2126.027672	2554.394477	-509.937422	40.276456	1330.193810	1220.960654	-1290.077375	-270.174689	-896.813375	407.272956	834.520226	-3192.721918	1480.233620	-761.412351	-2585.523271	-337.350658	-1573.808273	2537.182171	-2789.449590	1469.724265	-2280.157676	-1974.021486	-625.579753

Let's see what these principal components look like:



In [13]:

    
def normit(a):
    # Rescale to the 0..255 range that a uint8 image can hold (256 would wrap around to 0)
    a=(a - a.min())/(a.max() -a.min())
    a=a*255
    return np.round(a)
def getNC(pc, j):
    size=322*137*3
    r=pc.components_[j][0:size:3]
    g=pc.components_[j][1:size:3]
    b=pc.components_[j][2:size:3]
    r=normit(r)
    g=normit(g)
    b=normit(b)
    return r,g,b
def display_component(pc, j):
    r,g,b = getNC(pc,j)
    rgbArray = np.zeros((137,322,3), 'uint8')
    rgbArray[..., 0] = r.reshape(137,322)
    rgbArray[..., 1] = g.reshape(137,322)
    rgbArray[..., 2] = b.reshape(137,322)
    plt.imshow(rgbArray)
    plt.xticks([])
    plt.yticks([])



In [14]:

    
display_component(pca,0)



In [15]:

    
display_component(pca,1)

We take the first two principal components and immediately notice in the diagram below that they are enough to separate out the checks and the dollars. Indeed the first component itself seems to be mostly enough. We can look at the image of the first component and speculate that the medallion in the middle of the dollars probably contributes to this.



In [18]:

    
colors = [c0, c2]
for label, color in zip(df['label'].unique(), colors):
    mask = df['label']==label
    plt.scatter(df[mask]['pc1'], df[mask]['pc2'], c=color, label=label)
plt.legend()









    Out[18]:





<matplotlib.legend.Legend at 0x7f46f80c3210>

Classifying in a reduced feature space with kNN

Implicit in the notion of classification is the idea that samples close to each other in feature space share a label. kNN is a very simple algorithm that uses this idea directly. The basic notion is this: if most of the samples in some area of the feature space belong to one class rather than the other, we label that part of the feature space as "belonging" to that class. This process divides the feature space into class-based regions. Then, given a point in feature space, we find which region it is in and thus its class.

The way kNN does this is to ask for the k nearest neighbors of the new sample in the training set. To answer this question, one must define a distance in the feature space (note that this distance is different from the error or risk measures we have seen earlier). The usual choice is the Euclidean distance. Because taking the square root does not change which neighbors are nearest, we can work with its square, the sum of the squared differences of each feature value between two samples:

$$D(s_1,s_2) = \sum_f (x_{f1} - x_{f2})^2.$$

Once we have a distance measure, we can sort the training samples by their distance from the current sample. Then we choose the $k$ closest ones, where $k$ is an odd number such as 1, 3, 5, ..., 19 (with two classes, an odd $k$ avoids tied votes). We now see how many of these $k$ "nearest neighbors" belong to one class or the other, and choose the majority class amongst those neighbors as our sample's class.

The training process thus simply consists of memorizing the data, perhaps using a database to aid in the fast lookup of the $k$ nearest training set neighbors of any point in feature space. Notice that this process divides feature space into regions of one class or the other, since one can simply ask what the $k$ nearest neighbors in the training set are of any given point in feature space. Also notice that since classification happens via a majority "voting" scheme, we also know the probability that a point in feature space belongs to a class, as estimated by the fraction of $k$ nearest neighbors to that point in the desired class.
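The whole procedure fits in a few lines of NumPy. The following is a sketch on a tiny made-up training set, not sklearn's implementation; the function names knn_predict_proba and knn_predict are hypothetical. It uses the squared distance $D$ defined above, a majority vote for the class, and the fraction of neighbors for the probability.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x_new, k):
    """Fraction of the k nearest training points that have label 1."""
    # Squared Euclidean distance from x_new to every training point.
    dists = ((X_train - x_new) ** 2).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k nearest neighbors (an odd k avoids ties)."""
    return int(knn_predict_proba(X_train, y_train, x_new, k) > 0.5)

# Tiny made-up training set: class 0 near the origin, class 1 near (5, 5).
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

x_new = np.array([4.0, 4.0])
print(knn_predict(X_train, y_train, x_new, k=3))        # prints 1
print(knn_predict_proba(X_train, y_train, x_new, k=5))  # prints 0.6
```

With $k=3$ all three neighbors of (4, 4) are class 1; with $k=5$ two class-0 points join the vote, so the estimated probability drops to 3/5 while the predicted class stays the same.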

Thanks to sklearn's simple API, the classifier is really simple to write:



In [19]:

    
from sklearn.neighbors import KNeighborsClassifier
# train_test_split lives in sklearn.model_selection since version 0.18
from sklearn.model_selection import train_test_split
ys=df['y'].astype(int).values
subdf=df[['pc1','pc2']]
# standardize each feature to zero mean and unit variance
subdfstd=(subdf - subdf.mean())/subdf.std()
Xs=subdfstd.values
def classify(X,y, nbrs, plotit=True, train_size=0.6):
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf= KNeighborsClassifier(nbrs)
    clf=clf.fit(Xtrain, ytrain)
    #in sklearn accuracy can be found by using "score". It predicts and then gets the accuracy
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    if plotit:
        print("Accuracy on training data: %0.2f" % (training_accuracy))
        print("Accuracy on test data:     %0.2f" % (test_accuracy))
        plt.figure()
        ax=plt.gca()
        points_plot(ax, Xtrain, Xtest, ytrain, ytest, clf, alpha=0.3, psize=20)
    return nbrs, training_accuracy, test_accuracy

Let's see what happens when we choose $k=1$. On the training set, the 1NN classifier memorizes the training data. It will predict perfectly on the training set, and won't do too badly on the test set, especially deep in the regions of feature space where one or the other class dominates. This is because even one neighbor might be enough in those regions. However, the same classifier will do badly on test points near the classification boundaries, because there you need more than one neighbor to decide the class with any certainty.

As you might expect, the regions of feature space classified one way or the other (blue is check, red is dollar) are quite jagged and mottled. Since we are choosing just one neighbor, we fit to the noise in the region rather than the trend. We are overfitting.



In [23]:

    
classify(Xs,ys,1)









    



Accuracy on training data: 1.00
Accuracy on test data:     0.91






    Out[23]:





(1, 1.0, 0.91428571428571426)
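The perfect training accuracy is structural rather than a sign of a good model: every training point is its own nearest neighbor. The sketch below, on made-up data whose labels are pure noise, shows 1NN still scoring 1.0 on its training set (assuming no two training points coincide). The helper one_nn_predict is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up, deliberately uninformative data: labels are unrelated to the features.
X_noise = rng.normal(size=(40, 2))
y_noise = rng.integers(0, 2, size=40)

def one_nn_predict(X_train, y_train, X_query):
    """Label each query point with the label of its single nearest training point."""
    # Pairwise squared distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

train_acc = (one_nn_predict(X_noise, y_noise, X_noise) == y_noise).mean()
print(train_acc)  # prints 1.0: every training point is its own nearest neighbor

# Fresh noise as a held-out set: no value is claimed here, since there is
# nothing to learn, the accuracy is whatever chance delivers.
X_new = rng.normal(size=(40, 2))
y_new = rng.integers(0, 2, size=40)
test_acc = (one_nn_predict(X_noise, y_noise, X_new) == y_new).mean()
print(test_acc)
```

This is why the training accuracy of a 1NN classifier tells us nothing, and why the test accuracy is the number to watch.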

If we choose too large a number for $k$, such as 50, we wander too far from our original sample, and thus we average over a large part of the feature space. This leads to a very biased classification, one that depends on where our sample is but extends far out from there. Our neighborhood may even cover the entire feature space, in which case we simply predict the majority class.

In terms of probabilities, such an underfit case gives us the base rate classifier. Imagine $k=N$. Then the probability is just the fraction of training set examples in a given class. Say this number for the blue class is 0.4 (that is, we have uneven class memberships in the training set). Now, on any random test set, if we use the classifier which says "classify all as red", we will be correct, on average, 60% of the time if the test set and training sets are representative of the population of samples. Any classifier we create must do a better job than this!
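The arithmetic of the base rate classifier can be checked directly. This is a sketch with made-up labels in the 40/60 proportion used above (1 for blue, 0 for red); none of these arrays come from the ATM data.

```python
import numpy as np

# Made-up training labels with uneven classes: 40% blue (1), 60% red (0).
y_train = np.array([1] * 40 + [0] * 60)

# With k = N the neighbor set is the whole training set for any query point,
# so the predicted probability is just the class fraction.
p_blue = y_train.mean()
prediction = int(p_blue > 0.5)
print(p_blue, prediction)  # prints 0.4 0

# Accuracy of "always predict red" on a test set with the same class mix.
y_test = np.array([1] * 8 + [0] * 12)
baseline_accuracy = (y_test == prediction).mean()
print(baseline_accuracy)  # prints 0.6
```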

Error against complexity (k), and cross-validation



In [36]:

    
fits={}
for k in np.arange(1,45,1):
    fits[k]=[]
    for i in range(200):
        fits[k].append(classify(Xs, ys,k, False))
nbrs=np.arange(1,45,1)
fmeanstr = np.array([1.-np.mean([t[1] for t in fits[e]]) for e in nbrs])
fmeanste = np.array([1.-np.mean([t[2] for t in fits[e]]) for e in nbrs])
fstdsstr = np.array([np.std([t[1] for t in fits[e]]) for e in nbrs])
fstdsste = np.array([np.std([t[2] for t in fits[e]]) for e in nbrs])



In [37]:

    
plt.gca().invert_xaxis()
plt.plot(nbrs, fmeanstr, color=c0, label="training");
plt.fill_between(nbrs, fmeanstr - fstdsstr, fmeanstr+fstdsstr, color=c0, alpha=0.3)
plt.plot(nbrs, fmeanste, color=c1, label="testing");
plt.fill_between(nbrs, fmeanste - fstdsste, fmeanste+fstdsste, color=c1, alpha=0.5)

plt.legend();

We plot the test error and training errors against the number of neighbors $k$ . Here $k$ serves as a complexity parameter, with small $k$ being more "wiggly" in the classification of neighborhoods and large $k$ oversmoothing the classification. Notice that we plot $k$ reversed on the x-axis so as to go from lower complexity to higher complexity. As expected, the training error drops with complexity, but the test error starts going back up. There is a large range of $k$ from 25 to 5, in which the fit is as good as it gets!

Setting up some code

Let's set up some code for classification using cross-validation so that we can easily run classification models in scikit-learn. We first set up a function cv_optimize which takes a classifier clf, a grid of hyperparameters (such as a complexity parameter or regularization parameter) implemented as a dictionary parameters, a training set (as a samples x features array) Xtrain, and a set of labels ytrain. The code splits the training set into n_folds parts and carries out a cross-validation for us, using each part in turn as the validation section and the remaining parts as the training section. It prints the best value of the parameters, and returns the best classifier to us.



In [27]:

    
# GridSearchCV lives in sklearn.model_selection since version 0.18
from sklearn.model_selection import GridSearchCV
def cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=5):
    # try every parameter combination with n_folds-fold cross-validation
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(Xtrain, ytrain)
    print("BEST PARAMS {}".format(gs.best_params_))
    best = gs.best_estimator_
    return best

We then use this best classifier to fit the entire training set. This is done inside the do_classify function, which takes a dataframe indf as input. It takes the columns in the list featurenames as the features used to train the classifier. The column targetname sets the target. The classification is done by setting those samples for which targetname has value target1val to the value 1, and all others to 0. We split the dataframe into 80% training and 20% testing by default, standardizing the dataset if desired. We then train the model on the training set using cross-validation. Having obtained the best classifier using cv_optimize, we retrain on the entire training set and calculate the training and testing accuracy, which we print. We return the split data and the trained classifier.



In [25]:

    
from sklearn.model_selection import train_test_split
def do_classify(clf, parameters, indf, featurenames, targetname, target1val, standardize=False, train_size=0.8):
    subdf=indf[featurenames]
    if standardize:
        subdfstd=(subdf - subdf.mean())/subdf.std()
    else:
        subdfstd=subdf
    X=subdfstd.values
    y=(indf[targetname].values==target1val)*1
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf = cv_optimize(clf, parameters, Xtrain, ytrain)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data:     %0.2f" % (test_accuracy))
    return clf, Xtrain, ytrain, Xtest, ytest

Cross-Validation

Let's repeat what we have been doing so far and carry out a cross-validation. We're of course now training on an even smaller set, so our results will be a bit different from the diagram above. We plot the results in the diagram below. The results are fairly stable and correspond to our intuition that the first principal component basically separates the data.



In [28]:

    
bestcv, Xtrain, ytrain, Xtest, ytest = do_classify(KNeighborsClassifier(), {"n_neighbors": range(1,40,2)}, df, ['pc1','pc2'], 'label', 'check' )









    



BEST PARAMS {'n_neighbors': 5}
Accuracy on training data: 0.94
Accuracy on test data:     0.94
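The search behind the BEST PARAMS line can be sketched by hand to show the fold mechanics. This is a simplified stand-in, not what GridSearchCV does internally: it shuffles the samples into plain (unstratified) folds, uses a small NumPy kNN, and runs on made-up two-cluster data instead of the (pc1, pc2) features. All names in it are hypothetical.

```python
import numpy as np

def knn_accuracy(X_tr, y_tr, X_va, y_va, k):
    """Accuracy of a majority-vote kNN classifier on a validation fold."""
    d2 = ((X_va[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_tr[nearest].mean(axis=1)
    return ((votes > 0.5).astype(int) == y_va).mean()

def cv_choose_k(X, y, candidate_ks, n_folds=5, seed=0):
    """Return (best_k, mean validation accuracy per k) from n_folds-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_acc = {}
    for k in candidate_ks:
        scores = []
        for i in range(n_folds):
            # fold i is held out for validation, the rest is used for training
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            scores.append(knn_accuracy(X[train_idx], y[train_idx],
                                       X[val_idx], y[val_idx], k))
        mean_acc[k] = float(np.mean(scores))
    best_k = max(mean_acc, key=mean_acc.get)
    return best_k, mean_acc

# Made-up two-cluster data standing in for the (pc1, pc2) features.
rng = np.random.default_rng(4)
X_toy = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(2, 1, size=(30, 2))])
y_toy = np.array([0] * 30 + [1] * 30)

best_k, mean_acc = cv_choose_k(X_toy, y_toy, candidate_ks=[1, 3, 5, 7, 9])
print("best k:", best_k)
```

Each candidate $k$ is scored only on data it was not trained on, and the winner is the $k$ with the highest average validation accuracy across the folds.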



In [29]:

    
plt.figure()
ax=plt.gca()
points_plot(ax, Xtrain, Xtest, ytrain, ytest, bestcv, alpha=0.5, psize=20);

We can plot the probability contours as well.



In [30]:

    
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, bestcv, alpha=0.5, psize=20);

Evaluation

To evaluate this classifier we use a confusion matrix, which shows the number of false positives and false negatives it produces on the test set.



In [32]:

    
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(ytest, bestcv.predict(Xtest))









    Out[32]:





array([[12,  0],
       [ 1,  5]])
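To read the matrix, recall that scikit-learn puts true classes in rows and predicted classes in columns, in sorted label order. Since do_classify encodes "check" as 1 and everything else as 0, row and column 0 are dollars and row and column 1 are checks. The sketch below copies the counts from the output above and derives the usual summary numbers; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes:
# index 0 is "dollar" and index 1 is "check".
cm_counts = np.array([[12, 0],
                      [1, 5]])

tn, fp = cm_counts[0]   # true dollars: predicted dollar, predicted check
fn, tp = cm_counts[1]   # true checks:  predicted dollar, predicted check

accuracy = (tp + tn) / cm_counts.sum()
precision_check = tp / (tp + fp)   # of predicted checks, how many were checks
recall_check = tp / (tp + fn)      # of true checks, how many were found

print("accuracy  %.2f" % accuracy)         # prints accuracy  0.94
print("precision %.2f" % precision_check)  # prints precision 1.00
print("recall    %.2f" % recall_check)     # prints recall    0.83
```

On this test split no dollar was mistaken for a check, and one of the six checks was mistaken for a dollar; the accuracy of 17/18 agrees with the 0.94 printed earlier.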
