Code and format heavily adapted from greilliam.

Simulated Classification

  1. State assumptions
  2. Formally define classification/regression problem
  3. Provide algorithm for solving problem (including choosing hyperparameters as appropriate)
  4. Sample data from a simulation setting inspired by your data (from both null and alternative as defined before)
  5. Compute accuracy
  6. Plot accuracy vs. sample size in simulation
  7. Apply method directly on real data
  8. Explain the degree to which you believe the result and why

Step 1: State assumptions

Histogram Equalization:

\begin{align}
F(k) = \left\lfloor (L-1) \sum_{n=0}^{k} p_n \right\rfloor, \qquad k = 0, 1, \ldots, L-1, \qquad L = 256
\end{align}
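For reference, a minimal numpy sketch of this transfer function, assuming `p` is the normalized intensity histogram (the 4-level toy example is made up for illustration):

import numpy as np

def equalize_map(p, L=256):
    """Histogram-equalization transfer function: F(k) = floor((L-1) * cumsum(p)[k])."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalize counts to probabilities
    return np.floor((L - 1) * np.cumsum(p)).astype(int)

# toy example: a skewed 4-bin histogram with L = 4 gray levels
p = np.array([0.5, 0.3, 0.15, 0.05])
print(equalize_map(p, L=4))              # [1 2 2 3]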

\begin{align}
F(x,y) &\mid \text{Histogram Data} \\
F_{x|0} &= \mathrm{Norm}(\mu_0) \\
F_{x|1} &= \mathrm{Norm}(\mu_1) \\
F_{x|2} &= \mathrm{Norm}(\mu_2) \\
\mu_0 &\neq \mu_1 \neq \mu_2
\end{align}

Additionally, we assume the elements correspond to features 1-4.
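As a minimal sketch of what sampling under these assumptions could look like (the specific means, class sizes, and number of bins here are made up for illustration; the actual simulation appears in Steps 4 & 5 below):

import numpy as np

rng = np.random.RandomState(0)
BINS = 32            # number of histogram bins, matching the setting used later

def sample_classes(means, n_per_class=10, bins=BINS):
    """Draw n_per_class feature vectors from N(mean, 1) for each class mean."""
    X = np.vstack([rng.normal(m, 1.0, (n_per_class, bins)) for m in means])
    y = np.repeat(np.arange(len(means)), n_per_class)
    return X, y

# alternative: distinct class means (mu0 != mu1 != mu2)
X_alt, y_alt = sample_classes([0.0, 0.5, 1.0])
# null: all class means equal, so there is no signal to classify
X_null, y_null = sample_classes([0.5, 0.5, 0.5])
print(X_alt.shape)   # (30, 32)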

Step 2: Formally define classification/regression problem

\begin{align}
X &= \{\mu_0, \mu_1, \mu_2\} \\
Y &= \{0, 1, 2\}
\end{align}

Classification is applied to reduce the estimated error. The objective is to minimize the error $E[\ell] = \sum_i \Theta(\hat{Y}_i \neq Y_i)$, where $\Theta$ indicates a misclassification.
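As a concrete illustration of this objective, the empirical 0-1 error simply counts label mismatches (the labels below are hypothetical):

import numpy as np

y_true = np.array([0, 1, 2, 1, 0, 2])    # hypothetical true labels
y_hat  = np.array([0, 2, 2, 1, 1, 2])    # hypothetical predictions

zero_one_error = np.mean(y_hat != y_true)  # fraction of misclassified samples
print(zero_one_error)                      # 2 mismatches out of 6 -> ~0.33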

Step 3: Provide algorithm for solving problem (including choosing hyperparameters as appropriate)

"Machine Learning" is usually classification or prediction

  • predictive is subject specific

CLARITY brains have $(X, Y) \sim_{iid} F_{XY}$, which is some joint distribution

  • X is subject
  • Y is $\{0, 1\}$
  • Function g(x) spits out a class label (thus g is a classifier function)
  • $\mathcal{G} = \{g : \mathbb{R} \to \{0, 1\}\}$, the set of maps from the reals to $\{0, 1\}$
    • Classifier takes a single x
      • E.g., "predict 1 if $x > k$, else 0" is one classifier
      • The best classifier is chosen via statistical decision theory
        • Need to define a loss function that tells us how wrong we are
        • We need to choose classifier that minimizes loss
        • $g^* = \underset{g \in \mathcal{G}}{\operatorname{argmin}}\ E[\ell(g(X), Y)]$
        • Squared error is a good option $(g(x)-y)^2$
          • Problem is that $(0-1)^2 = (1-0)^2$, so you cannot tell on which side of "wrong" you are
        • Absolute error is $|g(x)-y|$
        • Zero one error
          • If $g(x)=y$ then $l=0$
          • If $g(x) \neq y$ then $l=1$
    • If L is the set of loss functions
      • $L = \{\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+\}$
        • Here we are finding which scores are the best
    • Definitions: voxels, priors, Bayes' rule: $F_{XY}(x,y) = F_{X|Y}(x|y)\,F_Y(y) = F_{Y|X}(y|x)\,F_X(x)$
      • $F_{X|Y=y} = N(\mu_y, 1)$
      • $F_Y = \mathrm{Bern}(\pi)$
      • Next we need to fit the joint distribution
        • After fitting, the Bayes optimal rule with the estimates plugged in is called the Bayes plug-in
        • MLE (for the Gaussian case, equivalent to minimizing squared error)
        • Sample $n_{train}$ training samples $(x_i, y_i) \sim_{iid} F_{XY}$, $i \in [n_{train}]$
        • Estimate the classifier parameters $\theta$
          • Sample test data $(x_i, y_i) \sim_{iid} F_{XY}$, $i \in [n_{test}]$
- The best classifier is called the Bayes optimal
  - $g^*(x) = \underset{y}{\operatorname{argmax}}\ F_{X|Y}(x \mid y)$
    - Use the posterior if the priors are not equal
      - $F_{X,Y}(x, y) = F_{X|Y}(x \mid y)\,F_Y(y)$
    - Compute the argmax over $y \in \mathcal{Y}$
      - e.g., let $P(Y=0) = 0.99$ and $P(Y=1) = 0.01$ (a minimal sketch of this plug-in rule follows this list)
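A minimal sketch of the Bayes plug-in rule under the assumptions above (unit-variance Gaussian class conditionals with unequal priors); the means, priors, and test points are made up for illustration:

import numpy as np
from scipy.stats import norm

# plug-in estimates (here simply assumed values): class-conditional means and priors
mu_hat    = np.array([0.0, 1.0])        # estimated class means, unit variance assumed
prior_hat = np.array([0.99, 0.01])      # unequal priors, as in the example above

def bayes_plugin(x):
    """Return argmax_y of F_{x|y}(x | y) * F_y(y)."""
    scores = norm.pdf(x, loc=mu_hat, scale=1.0) * prior_hat
    return np.argmax(scores)

# with a prior this skewed toward class 0, all four test points are labeled 0
print([bayes_plugin(x) for x in (-1.0, 0.5, 1.0, 3.0)])

With a prior this skewed, the plug-in rule labels everything 0; this is the sense in which the priors matter when computing the argmax.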

- Next we need to reflect on chance-level accuracy, which will almost definitely happen if you use a regular loss function
  - Use histogram instead of image data
    - Classifier list:
      - LDA
        - Assumes the class variances are the same
        - Introduced by Fisher
        - Finds optimal linear classifier (optimal line) under the assumptions that we have made
          - Advantages: Very interpretable, Very fast, Linear
      - Random Forest
        - Decision tree thresholds are created
        - Choose a loss function and then try to do a greedy search
        - Find the optimal thresholds to maximize purity
        - Adjust thresholds to maximize purity, so that most of one group ends up on one side of the split and the other groups on the other
        - Random Forest fits decision trees on subsets of your data; since each tree is noisy and can overfit, averaging over many different classifiers is much more effective
        - This is an ensemble method (see the sketch after this list)
          - Every single classifier sits at a different point on the bias-variance tradeoff, so averaging them gives a more consistent result
      - SVM
      - Logistic
      - Neural Network
        - Uses linear algebra, runs on GPU
        - Takes in more information and is very useful for computer vision techniques
        - Natively does the classification
      - KNN
        - K nearest neighbor 
        - Specify k a priori and classify each point by its k nearest neighbors (by distance)
        - Assuming k is big enough, it will always converge regardless of the setting
        - Doesn't care about the distributions; it is universally consistent
      - QDA
        - Quadratic discriminant analysis
        - The optimal decision boundary is curved (quadratic)
        - Covariance matrices
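To illustrate the ensemble point made for Random Forest above, here is a minimal sketch comparing one deep decision tree against a forest of averaged trees on made-up Gaussian data; exact scores will depend on the sklearn version:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# made-up 2-class data: weakly separated Gaussians in 5 dimensions
def sample(n_per_class):
    X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, 5)),
                   rng.normal(0.7, 1.0, (n_per_class, 5))])
    y = np.append(np.zeros(n_per_class), np.ones(n_per_class))
    return X, y

X_train, y_train = sample(100)
X_test, y_test = sample(100)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# the averaged ensemble typically generalizes better than one deep, overfit tree
print("single tree test accuracy  : %0.2f" % tree.score(X_test, y_test))
print("random forest test accuracy: %0.2f" % forest.score(X_test, y_test))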

Classification Techniques:

  • K Nearest Neighbors (k = 3), otherwise default parameters
  • Support Vector Machine, linear kernel, C = 0.5, otherwise default
  • Random Forest, max depth = 5, 10 estimators, max features = 1, otherwise default
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis

We ran into errors with QDA, so it was ignored in the remainder of the assignment.


In [1]:
import os
PATH="/Users/david/Desktop/CourseWork/TheArtOfDataScience/claritycontrol/code/scripts/" # use your own path
os.chdir(PATH)

import clarity as cl  # I wrote this module for easier operations on data
import clarity.resources as rs
import csv,gc  # garbage memory collection :)

import numpy as np
import matplotlib.pyplot as plt
import jgraph as ig
%matplotlib inline

# settings for histogram
BINS=32 # histogram bins
RANGE=(10.0,300.0)

Histogram data preparation

Skip this step if the data are already baked; just eat them!

This step and the next load the raw image datasets (which are pretty large) and extract histograms from them. The raw datasets are not included in the repository because they are too large, but you can skip these two steps and use the histogram data already generated in the repository.

  1. Set a suitable data value range so the histogram captures the majority of each dataset.

In [2]:
for token in rs.TOKENS:
    c = cl.Clarity(token)
    fname = rs.HIST_DATA_PATH+token+".csv"
    hist, bin_edges = c.loadImg().getHistogram(bins=BINS,range=RANGE,density=False)
    np.savetxt(fname,hist,delimiter=',')
    print fname,"saved."
    del c
    gc.collect()


Image Loaded: ../data/raw/Cocaine174.img
../data/hist/Cocaine174.csv saved.
Image Loaded: ../data/raw/Cocaine175.img
../data/hist/Cocaine175.csv saved.
Image Loaded: ../data/raw/Cocaine178.img
../data/hist/Cocaine178.csv saved.
Image Loaded: ../data/raw/Control181.img
../data/hist/Control181.csv saved.
Image Loaded: ../data/raw/Control182.img
../data/hist/Control182.csv saved.
Image Loaded: ../data/raw/Control189.img
../data/hist/Control189.csv saved.
Image Loaded: ../data/raw/Control239.img
../data/hist/Control239.csv saved.
Image Loaded: ../data/raw/Control258.img
../data/hist/Control258.csv saved.
Image Loaded: ../data/raw/Fear187.img
../data/hist/Fear187.csv saved.
Image Loaded: ../data/raw/Fear197.img
../data/hist/Fear197.csv saved.
Image Loaded: ../data/raw/Fear199.img
../data/hist/Fear199.csv saved.
Image Loaded: ../data/raw/Fear200.img
../data/hist/Fear200.csv saved.

Scale data


In [3]:
import numpy as np
import clarity.resources as rs
features = np.empty(shape=(1,BINS))
for token in rs.TOKENS:
    fname = rs.HIST_DATA_PATH+token+".csv"
    data = np.loadtxt(fname,delimiter=',')
    features = np.vstack([features,data])
features = features[1:,]
minc = np.min(features)
maxc = np.max(features)
features = (features-minc)/(maxc-minc)
print features
np.savetxt(rs.HIST_DATA_PATH+"features.csv",features,delimiter=',')


[[  3.02680054e-08   5.44824096e-08   2.05822436e-07   5.87199304e-07
    1.83726793e-06   6.61658597e-06   2.59366538e-05   1.53237831e-04
    5.70115436e-03   4.51718743e-01   1.00000000e+00   1.35286291e-01
    8.49485827e-02   7.86618544e-02   7.35613535e-02   6.39701853e-02
    5.84338221e-02   4.16227176e-02   3.39392057e-02   2.77592599e-02
    2.26080025e-02   1.85846279e-02   1.54385957e-02   1.29580479e-02
    1.10633858e-02   9.58203931e-03   8.43465187e-03   7.51167143e-03
    6.77831398e-03   6.13509768e-03   5.61766007e-03   5.66908239e-03]
 [  5.75092102e-08   1.63447229e-07   3.29921258e-07   7.65780536e-07
    2.70898648e-06   6.04452067e-06   2.48560860e-05   1.17918095e-04
    3.74111033e-03   2.92376673e-01   5.84049388e-01   1.55948727e-01
    1.33041235e-01   1.10344462e-01   8.33999686e-02   5.94000858e-02
    4.60053584e-02   2.86529742e-02   2.09942094e-02   1.57893807e-02
    1.21508849e-02   9.69784168e-03   7.96022813e-03   6.71293811e-03
    5.74635963e-03   4.98288249e-03   4.34869513e-03   3.79060457e-03
    3.33360309e-03   2.91697611e-03   2.57125192e-03   2.48431616e-03]
 [  1.51340027e-08   3.66242865e-07   2.99653253e-07   4.78234485e-07
    1.74949071e-06   5.71157261e-06   2.15841146e-05   1.47459669e-04
    4.75858749e-03   3.65928895e-01   5.59457624e-01   9.14320651e-02
    7.04332368e-02   6.77734601e-02   6.61138926e-02   6.02482195e-02
    5.87921771e-02   4.49651953e-02   3.86687359e-02   3.28632205e-02
    2.74471362e-02   2.28967502e-02   1.90871041e-02   1.59889346e-02
    1.34611687e-02   1.14588645e-02   9.82884766e-03   8.54004507e-03
    7.50793030e-03   6.61872592e-03   5.89751502e-03   5.80544883e-03]
 [  3.63216064e-08   6.96164123e-08   2.30036841e-07   5.81145703e-07
    3.06826770e-05   5.18793612e-06   1.32725204e-05   7.93142812e-05
    2.81480040e-03   2.27715179e-01   3.88426713e-01   4.86363048e-02
    3.30274910e-02   3.20018384e-02   3.28574847e-02   3.29833875e-02
    3.59319392e-02   3.01235029e-02   2.75277914e-02   2.42873199e-02
    2.08061179e-02   1.76493012e-02   1.49466922e-02   1.27984447e-02
    1.11091631e-02   9.75884987e-03   8.67802177e-03   7.79015826e-03
    7.11130139e-03   6.48077343e-03   5.96284851e-03   6.07441638e-03]
 [  9.08040161e-09   3.08733655e-07   9.08040161e-08   2.72412048e-07
    1.01700498e-06   3.18722096e-06   1.19074333e-05   7.10420354e-05
    2.30429719e-03   1.88612606e-01   3.07727452e-01   6.41034917e-02
    6.12421027e-02   6.40454376e-02   6.18823649e-02   5.43280337e-02
    4.93737908e-02   3.35422378e-02   2.49486728e-02   1.82312308e-02
    1.34255372e-02   1.02467338e-02   8.07892109e-03   6.57214043e-03
    5.49152724e-03   4.67500239e-03   4.07299598e-03   3.58319912e-03
    3.20458875e-03   2.88046684e-03   2.60133832e-03   2.60068453e-03]
 [  6.35628113e-08   1.48313226e-07   2.84519250e-07   7.99075341e-07
    3.58675864e-06   8.91090078e-06   2.40842519e-05   1.47432427e-04
    4.54237405e-03   3.57057751e-01   5.48353403e-01   1.44579656e-01
    1.03936638e-01   7.37345530e-02   4.85449711e-02   3.20946643e-02
    2.47973903e-02   1.60047890e-02   1.22824932e-02   9.62169040e-03
    7.62616016e-03   6.16654008e-03   5.07482156e-03   4.24220018e-03
    3.58879265e-03   3.05773746e-03   2.62668777e-03   2.26323562e-03
    1.96884900e-03   1.70269032e-03   1.49130765e-03   1.42861050e-03]
 [  6.96164123e-08   1.18045221e-07   3.72296466e-07   7.89994940e-07
    1.88872353e-06   6.16861949e-06   2.43203423e-05   1.57578263e-04
    5.39432759e-03   4.22281610e-01   6.68446346e-01   1.20708070e-01
    8.04814670e-02   6.31055404e-02   5.22423896e-02   4.33928239e-02
    3.99923679e-02   2.89120350e-02   2.35104855e-02   1.89281002e-02
    1.51471511e-02   1.22824206e-02   1.01160638e-02   8.46310077e-03
    7.12237039e-03   6.07994029e-03   5.20926296e-03   4.50099768e-03
    3.92865392e-03   3.42491864e-03   3.02073785e-03   2.94322755e-03]
 [  2.11876038e-08   8.17236145e-08   2.14902838e-07   5.81145703e-07
    3.54741023e-06   8.78982876e-06   2.38723758e-05   1.79274369e-04
    5.95812368e-03   4.64601472e-01   6.34477707e-01   8.05992186e-02
    6.43304381e-02   6.39353075e-02   6.14997047e-02   5.54860483e-02
    5.27806062e-02   3.82177487e-02   3.09508758e-02   2.50019384e-02
    2.01594753e-02   1.64956997e-02   1.36656594e-02   1.14503289e-02
    9.70391646e-03   8.28194071e-03   7.10695793e-03   6.15469016e-03
    5.40964926e-03   4.74080806e-03   4.20901132e-03   4.10615458e-03]
 [  1.24098822e-07   2.17929639e-07   3.60189264e-07   6.78003320e-07
    1.89175033e-06   5.66919740e-06   2.02674564e-05   1.37934327e-04
    4.31221613e-03   3.23488629e-01   4.72462055e-01   9.56688382e-02
    7.78883222e-02   7.01059247e-02   5.98414599e-02   4.71343913e-02
    4.03152942e-02   2.79142745e-02   2.27308120e-02   1.87224594e-02
    1.53413506e-02   1.26528919e-02   1.04781902e-02   8.81702456e-03
    7.56055728e-03   6.58286136e-03   5.78614087e-03   5.12447016e-03
    4.59966533e-03   4.10518298e-03   3.68900698e-03   3.64269088e-03]
 [  5.75092102e-08   9.68576172e-08   1.78581232e-07   5.44824096e-07
    1.59815068e-06   5.24241853e-06   1.89144765e-05   1.22225232e-04
    4.28175139e-03   3.32084395e-01   5.21520490e-01   8.18890443e-02
    6.44157788e-02   5.91305219e-02   5.30068656e-02   4.60999217e-02
    4.42667762e-02   3.33409010e-02   2.82204747e-02   2.37420660e-02
    1.98499335e-02   1.66562594e-02   1.39673678e-02   1.17270996e-02
    9.87607785e-03   8.36676982e-03   7.17462508e-03   6.21617658e-03
    5.47604213e-03   4.82562809e-03   4.30427986e-03   4.25544546e-03]
 [  1.48313226e-07   1.75554431e-07   4.08618072e-07   9.05013360e-07
    2.22772519e-06   7.56700134e-06   2.32034529e-05   1.40591858e-04
    4.59383571e-03   3.52431895e-01   5.96727000e-01   1.51405639e-01
    1.13602783e-01   8.30500614e-02   5.40600529e-02   3.30028922e-02
    2.26985584e-02   1.32255959e-02   9.49274264e-03   7.10924921e-03
    5.44678507e-03   4.31920502e-03   3.49370873e-03   2.88800357e-03
    2.42458830e-03   2.04189175e-03   1.74202965e-03   1.48730017e-03
    1.27906537e-03   1.10062943e-03   9.60428024e-04   9.17032785e-04]
 [  0.00000000e+00   1.33179224e-07   1.33179224e-07   2.87546051e-07
    1.38930145e-06   2.94507692e-06   9.83407494e-06   6.18950442e-05
    2.06246794e-03   1.63011485e-01   2.58921671e-01   4.43054648e-02
    3.46486605e-02   3.27022038e-02   3.08151722e-02   2.81701844e-02
    2.80819501e-02   2.16081807e-02   1.83370235e-02   1.52709412e-02
    1.25290171e-02   1.02423268e-02   8.38137413e-03   6.85885411e-03
    5.64271290e-03   4.68338360e-03   3.92759151e-03   3.32991342e-03
    2.88653255e-03   2.52064382e-03   2.24798660e-03   2.22261293e-03]]

Setup Step


In [4]:
from sklearn import cross_validation
from sklearn.cross_validation import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

%matplotlib inline

np.random.seed(12345678)  # for reproducibility, set random seed

# Cocaine = ["Cocaine174","Cocaine175","Cocaine178"]
# Control = ["Control181","Control182","Control189","Control239","Control258"]
# Fear = ["Fear187","Fear197","Fear199","Fear200"]

features = np.loadtxt(rs.HIST_DATA_PATH+"features.csv",delimiter=',')
temp_mu = np.mean(features,axis=1)
temp_std = np.std(features,axis=1)

mu = [np.mean(temp_mu[0:3]),np.mean(temp_mu[3:8]),np.mean(temp_mu[8:12])]
std = [np.mean(temp_std[0:3]),np.mean(temp_std[3:8]),np.mean(temp_std[8:12])]
print mu
print std
std=[1,1,1]  # note: the empirical stds printed above are replaced with unit variance for the simulation

# define number of subjects per class
S = np.array((9, 21, 30, 39, 45, 63, 81, 96, 108, 210, 333))

names = ["Nearest Neighbors", "Linear SVM", "Random Forest",
         "Linear Discriminant Analysis", "Quadratic Discriminant Analysis"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    LinearDiscriminantAnalysis()]
#     QuadraticDiscriminantAnalysis()]


[0.056280388503550105, 0.042498942485078253, 0.038805754482162662]
[0.13726273492760607, 0.1031691019233711, 0.093076772034756644]

Steps 4 & 5: Sample data from a setting similar to the real data and record classification accuracy


In [5]:
accuracy = np.zeros((len(S), len(classifiers), 2), dtype=np.dtype('float64'))
for idx1, s in enumerate(S):
    s0=s/3
    s1=s/3
    s2=s/3
    
    x0 = np.random.normal(mu[0],std[0],(s0,BINS))
    x1 = np.random.normal(mu[1],std[1],(s1,BINS))
    x2 = np.random.normal(mu[2],std[2],(s2,BINS))
    X = x0
    X = np.vstack([X,x1])
    X = np.vstack([X,x2])
    y = np.append(np.append(np.zeros(s0), np.ones(s1)),np.ones(s2)*2)
    for idx2, cla in enumerate(classifiers):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
        clf = cla.fit(X_train, y_train)
        loo = LeaveOneOut(len(X))
        scores = cross_validation.cross_val_score(clf, X, y, cv=loo)
        accuracy[idx1, idx2,] = [scores.mean(), scores.std()]
        print("Accuracy of %s: %0.2f (+/- %0.2f)" % (names[idx2], scores.mean(), scores.std() * 2))
    
print accuracy


Accuracy of Nearest Neighbors: 0.33 (+/- 0.94)
Accuracy of Linear SVM: 0.44 (+/- 0.99)
Accuracy of Random Forest: 0.11 (+/- 0.63)
Accuracy of Linear Discriminant Analysis: 0.22 (+/- 0.83)
Accuracy of Nearest Neighbors: 0.24 (+/- 0.85)
Accuracy of Linear SVM: 0.24 (+/- 0.85)
Accuracy of Random Forest: 0.14 (+/- 0.70)
/usr/local/lib/python2.7/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
/usr/local/lib/python2.7/site-packages/sklearn/discriminant_analysis.py:453: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
Accuracy of Linear Discriminant Analysis: 0.19 (+/- 0.79)
Accuracy of Nearest Neighbors: 0.30 (+/- 0.92)
Accuracy of Linear SVM: 0.27 (+/- 0.88)
Accuracy of Random Forest: 0.40 (+/- 0.98)
Accuracy of Linear Discriminant Analysis: 0.37 (+/- 0.96)
Accuracy of Nearest Neighbors: 0.28 (+/- 0.90)
Accuracy of Linear SVM: 0.28 (+/- 0.90)
Accuracy of Random Forest: 0.26 (+/- 0.87)
Accuracy of Linear Discriminant Analysis: 0.33 (+/- 0.94)
Accuracy of Nearest Neighbors: 0.27 (+/- 0.88)
Accuracy of Linear SVM: 0.27 (+/- 0.88)
Accuracy of Random Forest: 0.40 (+/- 0.98)
Accuracy of Linear Discriminant Analysis: 0.36 (+/- 0.96)
Accuracy of Nearest Neighbors: 0.44 (+/- 0.99)
Accuracy of Linear SVM: 0.37 (+/- 0.96)
Accuracy of Random Forest: 0.32 (+/- 0.93)
Accuracy of Linear Discriminant Analysis: 0.30 (+/- 0.92)
Accuracy of Nearest Neighbors: 0.32 (+/- 0.93)
Accuracy of Linear SVM: 0.26 (+/- 0.88)
Accuracy of Random Forest: 0.22 (+/- 0.83)
Accuracy of Linear Discriminant Analysis: 0.40 (+/- 0.98)
Accuracy of Nearest Neighbors: 0.32 (+/- 0.94)
Accuracy of Linear SVM: 0.25 (+/- 0.87)
Accuracy of Random Forest: 0.32 (+/- 0.94)
Accuracy of Linear Discriminant Analysis: 0.24 (+/- 0.85)
Accuracy of Nearest Neighbors: 0.29 (+/- 0.90)
Accuracy of Linear SVM: 0.47 (+/- 1.00)
Accuracy of Random Forest: 0.28 (+/- 0.90)
Accuracy of Linear Discriminant Analysis: 0.38 (+/- 0.97)
Accuracy of Nearest Neighbors: 0.27 (+/- 0.88)
Accuracy of Linear SVM: 0.38 (+/- 0.97)
Accuracy of Random Forest: 0.35 (+/- 0.96)
Accuracy of Linear Discriminant Analysis: 0.37 (+/- 0.96)
Accuracy of Nearest Neighbors: 0.39 (+/- 0.97)
Accuracy of Linear SVM: 0.34 (+/- 0.94)
Accuracy of Random Forest: 0.35 (+/- 0.95)
Accuracy of Linear Discriminant Analysis: 0.32 (+/- 0.93)
[[[ 0.33333333  0.47140452]
  [ 0.44444444  0.49690399]
  [ 0.11111111  0.31426968]
  [ 0.22222222  0.41573971]]

 [[ 0.23809524  0.42591771]
  [ 0.23809524  0.42591771]
  [ 0.14285714  0.34992711]
  [ 0.19047619  0.39267673]]

 [[ 0.3         0.45825757]
  [ 0.26666667  0.44221664]
  [ 0.4         0.48989795]
  [ 0.36666667  0.48189441]]

 [[ 0.28205128  0.44999817]
  [ 0.28205128  0.44999817]
  [ 0.25641026  0.43665093]
  [ 0.33333333  0.47140452]]

 [[ 0.26666667  0.44221664]
  [ 0.26666667  0.44221664]
  [ 0.4         0.48989795]
  [ 0.35555556  0.47868132]]

 [[ 0.44444444  0.49690399]
  [ 0.36507937  0.48145241]
  [ 0.31746032  0.4654882 ]
  [ 0.3015873   0.45894706]]

 [[ 0.32098765  0.46685606]
  [ 0.25925926  0.43822813]
  [ 0.22222222  0.41573971]
  [ 0.39506173  0.48886395]]

 [[ 0.32291667  0.46759116]
  [ 0.25        0.4330127 ]
  [ 0.32291667  0.46759116]
  [ 0.23958333  0.42682919]]

 [[ 0.28703704  0.45237902]
  [ 0.47222222  0.4992278 ]
  [ 0.27777778  0.44790321]
  [ 0.37962963  0.48529473]]

 [[ 0.26666667  0.44221664]
  [ 0.38095238  0.48562091]
  [ 0.35238095  0.47771186]
  [ 0.36666667  0.48189441]]

 [[ 0.38738739  0.48715336]
  [ 0.33633634  0.47245551]
  [ 0.34834835  0.47644703]
  [ 0.31831832  0.46582375]]]

Step 6: Plot Accuracy versus N


In [6]:
plt.errorbar(S, accuracy[:,0,0], yerr = accuracy[:,0,1], hold=True, label=names[0])
plt.errorbar(S, accuracy[:,1,0], yerr = accuracy[:,1,1], color='green', hold=True, label=names[1])
plt.errorbar(S, accuracy[:,2,0], yerr = accuracy[:,2,1], color='red', hold=True, label=names[2])
plt.errorbar(S, accuracy[:,3,0], yerr = accuracy[:,3,1], color='black', hold=True, label=names[3])
# plt.errorbar(S, accuracy[:,4,0], yerr = accuracy[:,4,1], color='brown', hold=True, label=names[4])
plt.xscale('log')
plt.xlabel('number of samples')
plt.ylabel('accuracy')
plt.title('Accuracy of classification under simulated data')
plt.axhline(1, color='red', linestyle='--')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()


Step 7: Apply technique to data


In [7]:
y=np.array([0,0,0,1,1,1,1,1,2,2,2,2])
features = np.loadtxt(rs.HIST_DATA_PATH+"features.csv",delimiter=',')

In [8]:
accuracy=np.zeros((len(classifiers),2))
for idx, cla in enumerate(classifiers):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, y, test_size=0.4, random_state=0)
    clf = cla.fit(X_train, y_train)
    loo = LeaveOneOut(len(features))
    scores = cross_validation.cross_val_score(clf, features, y, cv=loo)
    accuracy[idx,] = [scores.mean(), scores.std()]
    print("Accuracy of %s: %0.2f (+/- %0.2f)" % (names[idx], scores.mean(), scores.std() * 2))


Accuracy of Nearest Neighbors: 0.08 (+/- 0.55)
Accuracy of Linear SVM: 0.25 (+/- 0.87)
Accuracy of Random Forest: 0.25 (+/- 0.87)
Accuracy of Linear Discriminant Analysis: 0.25 (+/- 0.87)

Step 8: Reflect on result

Our results are highly unsatisfactory, with very low accuracy rates and very large error bars. This is roughly what we expected, however, because our data in their current form are rather unsuited for this kind of analysis. We will make some statistical reconfigurations in order to bring the analysis to a more satisfactory state.