Code and format heavily adapted from greilliam.
Histogram Equalization
\begin{align}
F(k) = \left\lfloor (L-1) \sum_{n=0}^{k} p_n \right\rfloor, \qquad k = 0, 1, \ldots, L-1, \qquad L = 256
\end{align}
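To make the mapping concrete, here is a minimal NumPy sketch of the transfer function above; it is not part of the analysis pipeline and assumes an 8-bit grayscale image stored as an integer array named img.

import numpy as np

def equalize_histogram(img, L=256):
    # empirical intensity probabilities p_n over the L gray levels
    hist, _ = np.histogram(img.ravel(), bins=L, range=(0, L))
    p = hist / float(img.size)
    # transfer function F(k) = floor((L-1) * cumulative sum of p_n)
    F = np.floor((L - 1) * np.cumsum(p)).astype(np.uint8)
    # remap every pixel through the lookup table
    return F[img]

img = np.random.randint(0, 256, size=(4, 4)).astype(np.uint8)  # toy image
print(equalize_histogram(img, L=256))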
For $F(x,y)$ given the histogram data:
\begin{align}
F_{X|0} &= \mathrm{Norm}(\mu_0) \\
F_{X|1} &= \mathrm{Norm}(\mu_1) \\
F_{X|2} &= \mathrm{Norm}(\mu_2) \\
\mu_0 &\neq \mu_1 \neq \mu_2
\end{align}
Additionally, we assume the elements correspond to features 1-4.
Classification is applied to reduce the estimated error. The objective is to minimize the error $E[\ell] = \sum_i \Theta(\hat{Y}_i \neq Y_i)$, where $\Theta$ indicates that subject $i$ was misclassified.
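Under 0-1 loss this is just the count (or fraction) of misclassified subjects. A tiny illustrative check, with made-up label vectors y_true and y_pred:

import numpy as np

y_true = np.array([0, 0, 1, 1, 2, 2])  # true labels Y_i (made up)
y_pred = np.array([0, 1, 1, 1, 2, 0])  # predicted labels Y_hat_i (made up)
errors = (y_pred != y_true)            # Theta(Y_hat_i != Y_i) per subject
print(errors.sum())                    # total errors: 2 misclassified
print(errors.mean())                   # error rate: 2/6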
"Machine Learning" is usually classification or prediction
CLARITY brains have $(X, Y) \overset{iid}{\sim} F_{XY}$, some joint distribution
- The best classifier is called the Bayes optimal classifier
    - $g^*(x) = \arg\max_y F_{X=x \mid Y=y}$
    - Use the posterior instead if the priors are not equal
    - $F_{X=x, Y=y} = F_{X \mid Y} F_Y$
    - Compute the argmax over $y \in \mathcal{Y}$
    - e.g., let $P(Y=0) = 0.99$ and $P(Y=1) = 0.01$
- Next we need to beat chance-level accuracy, which a classifier will almost certainly hit by default if you use a regular loss function
- Use histograms instead of raw image data
- Classifier list:
    - LDA
        - Assumes the class variances are the same
        - Introduced by Fisher
        - Finds the optimal linear classifier (the optimal separating line) under the assumptions we have made
        - Advantages: very interpretable, very fast, linear
    - Random Forest
        - Decision tree thresholds are created by choosing a loss function and then doing a greedy search
        - Find the optimal thresholds to maximize purity
        - Thresholds are adjusted so that most of one group falls on one side of the split and the other groups fall on the other (see the purity sketch after this list)
        - Random Forest fits decision trees on subsets of your data; each individual tree is noisy and can overfit, so averaging over many different trees is much more effective
        - This is an ensemble method
        - Every individual classifier sits at a different point on the bias-variance tradeoff, so averaging over all of them gives a more consistent result
    - SVM
    - Logistic regression
    - Neural Network
        - Uses linear algebra, runs on GPUs
        - Takes in more information and is very useful for computer vision techniques
        - Natively does the classification
    - KNN
        - K nearest neighbors
        - Specify k a priori and assign each point the label of its k nearest neighbors by distance
        - Assuming k is big enough, it will always converge, irrespective of the distribution
        - Doesn't care about the distributions, but it is universally consistent
    - QDA
        - Quadratic discriminant analysis
        - The optimal discriminant boundary is curved
        - The class covariance matrices are allowed to differ
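To make the purity idea concrete, here is a minimal sketch of a greedy search for a single split threshold that minimizes the weighted Gini impurity of a one-dimensional feature. The arrays feature and labels are made up for illustration; real decision trees repeat this search recursively over many features.

import numpy as np

def gini(labels):
    # Gini impurity 1 - sum_c p_c^2; 0 means the split side is perfectly pure
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(len(labels))
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    # greedy search over candidate thresholds, keeping the lowest weighted impurity
    best_t, best_score = None, np.inf
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / float(len(labels))
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy example: two groups that separate cleanly around 0.5
feature = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(feature, labels))  # threshold 0.3 gives impurity 0.0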
Classification Techniques:
- K Nearest Neighbors (k=3), otherwise default parameters
- Support Vector Machine, linear kernel, C=0.5, otherwise default
- Random Forest, max_depth=5, n_estimators=10, max_features=1, otherwise default
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis

We ran into errors with QDA, so it was excluded from the rest of the assignment.
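We did not track down the exact cause of the QDA errors, but with 32 histogram bins and only 3-5 subjects per class the per-class covariance estimates cannot be full rank, which QDA requires. If we revisit it, one possible mitigation (sketched here, not something we ran) is scikit-learn's reg_param regularization:

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# reg_param shrinks each per-class covariance estimate toward the identity,
# keeping it invertible even when a class has fewer subjects than features
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
# usage would mirror the other classifiers, e.g. qda.fit(X_train, y_train)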
In [1]:
import os
PATH="/Users/david/Desktop/CourseWork/TheArtOfDataScience/claritycontrol/code/scripts/" # use your own path
os.chdir(PATH)
import clarity as cl # I wrote this module for easier operations on data
import clarity.resources as rs
import csv,gc # garbage memory collection :)
import numpy as np
import matplotlib.pyplot as plt
import jgraph as ig
%matplotlib inline
# settings for histogram
BINS=32 # histogram bins
RANGE=(10.0,300.0)
Skip this step if the data are already baked, just eat them!
This and the next step load the raw image datasets (which are quite large) and extract histograms from them. The datasets are not included in the repository because they are too large, but you can skip these two steps and use the histogram data already generated in the repository.
In [2]:
for token in rs.TOKENS:
    c = cl.Clarity(token)
    fname = rs.HIST_DATA_PATH+token+".csv"
    hist, bin_edges = c.loadImg().getHistogram(bins=BINS,range=RANGE,density=False)
    np.savetxt(fname,hist,delimiter=',')
    print fname,"saved."
    del c
    gc.collect()
In [3]:
import numpy as np
import clarity.resources as rs
features = np.empty(shape=(1,BINS))  # placeholder row, dropped after stacking
for token in rs.TOKENS:
    fname = rs.HIST_DATA_PATH+token+".csv"
    data = np.loadtxt(fname,delimiter=',')
    features = np.vstack([features,data])
features = features[1:,]  # drop the placeholder row
minc = np.min(features)
maxc = np.max(features)
features = (features-minc)/(maxc-minc)  # min-max normalize to [0,1]
print features
np.savetxt(rs.HIST_DATA_PATH+"features.csv",features,delimiter=',')
In [4]:
from sklearn import cross_validation
from sklearn.cross_validation import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
%matplotlib inline
np.random.seed(12345678) # for reproducibility, set random seed
# Cocaine = ["Cocaine174","Cocaine175","Cocaine178"]
# Control = ["Control181","Control182","Control189","Control239","Control258"]
# Fear = ["Fear187","Fear197","Fear199","Fear200"]
features = np.loadtxt(rs.HIST_DATA_PATH+"features.csv",delimiter=',')
temp_mu = np.mean(features,axis=1)
temp_std = np.std(features,axis=1)
mu = [np.mean(temp_mu[0:3]),np.mean(temp_mu[3:8]),np.mean(temp_mu[8:12])]
std = [np.mean(temp_std[0:3]),np.mean(temp_std[3:8]),np.mean(temp_std[8:12])]
print mu
print std
std=[1,1,1]  # override the estimated stds with unit variance for the simulation
# total number of simulated subjects at each sample size (split evenly across the 3 classes)
S = np.array((9, 21, 30, 39, 45, 63, 81, 96, 108, 210, 333))
names = ["Nearest Neighbors", "Linear SVM", "Random Forest",
"Linear Discriminant Analysis", "Quadratic Discriminant Analysis"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
LinearDiscriminantAnalysis()]
# QuadraticDiscriminantAnalysis()]
In [5]:
accuracy = np.zeros((len(S), len(classifiers), 2), dtype=np.dtype('float64'))
for idx1, s in enumerate(S):
    s0=s/3
    s1=s/3
    s2=s/3
    x0 = np.random.normal(mu[0],std[0],(s0,BINS))
    x1 = np.random.normal(mu[1],std[1],(s1,BINS))
    x2 = np.random.normal(mu[2],std[2],(s2,BINS))
    X = x0
    X = np.vstack([X,x1])
    X = np.vstack([X,x2])
    y = np.append(np.append(np.zeros(s0), np.ones(s1)),np.ones(s2)*2)
    for idx2, cla in enumerate(classifiers):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
        clf = cla.fit(X_train, y_train)
        loo = LeaveOneOut(len(X))
        scores = cross_validation.cross_val_score(clf, X, y, cv=loo)
        accuracy[idx1, idx2,] = [scores.mean(), scores.std()]
        print("Accuracy of %s: %0.2f (+/- %0.2f)" % (names[idx2], scores.mean(), scores.std() * 2))
print accuracy
In [6]:
plt.errorbar(S, accuracy[:,0,0], yerr = accuracy[:,0,1], hold=True, label=names[0])
plt.errorbar(S, accuracy[:,1,0], yerr = accuracy[:,1,1], color='green', hold=True, label=names[1])
plt.errorbar(S, accuracy[:,2,0], yerr = accuracy[:,2,1], color='red', hold=True, label=names[2])
plt.errorbar(S, accuracy[:,3,0], yerr = accuracy[:,3,1], color='black', hold=True, label=names[3])
# plt.errorbar(S, accuracy[:,4,0], yerr = accuracy[:,4,1], color='brown', hold=True, label=names[4])
plt.xscale('log')
plt.xlabel('number of samples')
plt.ylabel('accuracy')
plt.title('Accuracy of classification under simulated data')
plt.axhline(1, color='red', linestyle='--')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
In [7]:
y=np.array([0,0,0,1,1,1,1,1,2,2,2,2])  # class labels: 3 Cocaine, 5 Control, 4 Fear
features = np.loadtxt(rs.HIST_DATA_PATH+"features.csv",delimiter=',')
In [8]:
accuracy=np.zeros((len(classifiers),2))
for idx, cla in enumerate(classifiers):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, y, test_size=0.4, random_state=0)
    clf = cla.fit(X_train, y_train)
    loo = LeaveOneOut(len(features))
    scores = cross_validation.cross_val_score(clf, features, y, cv=loo)
    accuracy[idx,] = [scores.mean(), scores.std()]
    print("Accuracy of %s: %0.2f (+/- %0.2f)" % (names[idx], scores.mean(), scores.std() * 2))
Our results are highly unsatisfactory, with very low accuracy and very large error bars. However, this is roughly what we expected, since the raw histogram data are not well suited to this kind of analysis. We will apply some statistical transformations to bring the data into a state that can be analyzed more satisfactorily.