Text materials and formatting are heavily borrowed from grelliam.
We extract histogram data from the raw CLARITY scan data, where the subjects were exposed to different experimental conditions.
For independent histograms, check that the off-diagonal covariance is approximately 0.
$x_i \stackrel{iid}{\sim} F$
$(x_1, x_2, \ldots, x_n) \sim F = \prod_{i=1}^n F_i$
$F_i = F_j, \quad \forall i, j$
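As a quick illustration of what this criterion looks like when it holds, here is a minimal sketch on synthetic iid data (not part of the original analysis); the off-diagonal entries of the sample covariance should be small:
import numpy as np

rng = np.random.RandomState(0)
sim = rng.normal(size=(12, 32))  # 12 synthetic "subjects" sampled iid
sim_cov = np.cov(sim)  # 12x12 covariance across subjects
off_diag = sim_cov - np.diag(np.diag(sim_cov))
print(np.abs(off_diag).max())  # small relative to the unit diagonal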
For identically distributed histograms, check the optimal number of mixture components and see whether it is 1.
$F = \prod_{j=1}^J F_j, \quad J < n$
$F_i = \sum_{j=1}^J w_j F_j(\theta), \quad \sum_{j=1}^J w_j = 1$
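Likewise, a minimal sketch of the clustering criterion (assuming scikit-learn's GaussianMixture; the synthetic data are purely illustrative): samples drawn from a single distribution should have their BIC minimized at one component.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
one_dist = rng.normal(size=(200, 2))  # every sample from the same Gaussian
bics = [GaussianMixture(n_components=k, random_state=0).fit(one_dist).bic(one_dist)
        for k in range(1, 6)]
print(np.argmin(bics) + 1)  # expect 1: a single component fits best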
In [1]:
import os
PATH = "/Users/david/Desktop/CourseWork/TheArtOfDataScience/claritycontrol/code/scripts/"  # use your own path
os.chdir(PATH)
import clarity as cl  # module written for easier operations on the CLARITY data
import clarity.resources as rs
import csv, gc  # csv I/O and manual garbage collection for the large volumes
import numpy as np
import matplotlib.pyplot as plt
import jgraph as ig
%matplotlib inline

# settings for the histograms
BINS = 32  # number of histogram bins
RANGE = (10.0, 300.0)  # intensity range covered by the bins
In [2]:
# Load the feature matrix X: one row per subject, one column per histogram bin
X = np.loadtxt("../data/hist/features.csv", delimiter=',')
print(X.shape)

# Load the class labels y: the experimental condition of each subject
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2])
In [3]:
vectorized = X
covar = np.cov(vectorized)  # covariance across subjects

plt.figure(figsize=(7, 7))
plt.imshow(covar)
plt.title('Covariance of CLARITY histogram dataset')
plt.colorbar()
plt.show()

diag = covar.diagonal() * np.eye(covar.shape[0])  # on-diagonal part
hollow = covar - diag                             # off-diagonal (hollow) part
d_det = np.linalg.det(diag)
h_det = np.linalg.det(hollow)

plt.figure(figsize=(11, 8))
plt.subplot(121)
plt.imshow(diag)
plt.clim([0, np.max(covar)])
plt.title('Determinant of on-diagonal: ' + str(d_det))
plt.subplot(122)
plt.imshow(hollow)
plt.clim([0, np.max(covar)])
plt.title('Determinant of off-diagonal: ' + str(h_det))
plt.show()

print("Ratio of on- and off-diagonal determinants: " + str(d_det / h_det))
From the above, we conclude that the assumption that the histograms are independent is most likely true: the covariance matrix is dominated by its on-diagonal entries, and the off-diagonal (hollow) component contributes comparatively little.
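A hollow determinant is hard to interpret in isolation, so one hedged baseline (an addition to the original analysis) is to compute the same ratio for synthetic data of the same shape that are independent by construction, and compare magnitudes:
rng = np.random.RandomState(0)
fake = rng.normal(size=X.shape)  # independent data, same shape as X
fake_cov = np.cov(fake)
fake_diag = fake_cov.diagonal() * np.eye(fake_cov.shape[0])
fake_hollow = fake_cov - fake_diag
print(np.linalg.det(fake_diag) / np.linalg.det(fake_hollow))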
In [4]:
import sklearn.mixture

i = np.arange(1, 13)  # candidate numbers of mixture components (we have 12 subjects)
print(i)

bic = np.array(())
for idx in i:
    print("Fitting and evaluating model with " + str(idx) + " clusters.")
    # GaussianMixture replaces the deprecated sklearn.mixture.GMM
    gmm = sklearn.mixture.GaussianMixture(n_components=idx, max_iter=1000, covariance_type='diag')
    gmm.fit(vectorized)
    bic = np.append(bic, gmm.bic(vectorized))

plt.figure(figsize=(7, 7))
plt.plot(i, 1.0 / bic)
plt.title('Inverse BIC')
plt.ylabel('score')
plt.xlabel('number of clusters')
plt.show()

print(bic)
From the above, we observe that our data most likely were not sampled identically from one distribution. This is an odd shape for a BIC curve, so more investigation is needed: the curve implies that the more clusters, the better.
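One hedged follow-up (not in the original analysis): with only 12 subjects and 32 features, BIC can keep improving simply because each added component memorizes a few subjects. Refitting on standardized features makes the scale sensitivity of the diagonal-covariance model easier to see:
from sklearn.preprocessing import scale

scaled = scale(vectorized)  # zero-mean, unit-variance features
for k in range(1, 7):
    g = sklearn.mixture.GaussianMixture(n_components=k, covariance_type='diag',
                                        random_state=0).fit(scaled)
    print(k, g.bic(scaled))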
In [5]:
vect = X.T  # transpose: rows are now histogram bins, columns are subjects
covar = np.cov(vect)  # covariance across bins

plt.figure(figsize=(7, 7))
plt.imshow(covar)
plt.title('Covariance of CLARITY histogram dataset')
plt.colorbar()
plt.show()

diag = covar.diagonal() * np.eye(covar.shape[0])
hollow = covar - diag
d_sum = np.sum(diag)
h_sum = np.sum(hollow)

plt.figure(figsize=(11, 8))
plt.subplot(121)
plt.imshow(diag)
plt.clim([0, np.max(covar)])
plt.title('Sum of on-diagonal: ' + str(d_sum))
plt.subplot(122)
plt.imshow(hollow)
plt.clim([0, np.max(covar)])
plt.title('Sum of off-diagonal: ' + str(h_sum))
plt.show()

print("Ratio of on- and off-diagonal covariance sums: " + str(d_sum / h_sum))
The histogram bins are not independent of one another, because the ratio of on- to off-diagonal covariance sums is relatively small.
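To see where that dependence lives (an illustrative addition): if the dependence is mostly local smoothness of the histograms, neighboring bins should be strongly correlated.
corr = np.corrcoef(vect)  # 32x32 correlation matrix across bins
adjacent = np.diag(corr, k=1)  # correlation of each bin with its right neighbor
print(adjacent.min(), adjacent.mean())  # values near 1 indicate strongly dependent neighbors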
In [7]:
import scipy.stats as ss

# Fraction of non-empty bins per subject (the original divided by 64,
# but the feature matrix has vectorized.shape[1] = 32 bins)
prob = 1.0 * np.sum(1.0 * (vectorized > 0), 1) / vectorized.shape[1]

# Simple linear regression of class label on bin-occupancy probability
vals = ss.linregress(prob, y)  # slope, intercept, r, p, stderr
m = vals[0]  # slope
c = vals[1]  # intercept

def comp_value(m, c, data):
    return m * data + c

resi = np.array(())
for idx, subj in enumerate(y):
    temp = comp_value(m, c, prob[idx])
    resi = np.append(resi, subj - temp)

plt.figure(figsize=(7, 7))
plt.scatter(prob, resi)
plt.title('Residual assignment error')
plt.xlabel('bin occupancy probability')
plt.ylabel('error')
plt.show()
From the above, the residuals appear unstructured with respect to the bin occupancy probability, so this assumption might be true.
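As one further hedged diagnostic (an addition, not part of the original analysis), correlating the absolute residuals with the predictor probes for heteroscedasticity, which a scatter plot can hide at this sample size:
rho, p = ss.spearmanr(prob, np.abs(resi))  # rank correlation of |residual| with the predictor
print("Spearman rho of |residual| vs. occupancy probability: %0.3f (p = %0.3f)" % (rho, p))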