HW6 Testing Assumptions

Heavily borrowed text materials and formatting from grelliam

Testing Assumptions

  1. State assumptions
  2. Check assumptions (with figures)
    1. residuals
    2. correlations
    3. # of modes

Step 1: State assumptions

We extract histogram data from the raw CLARITY scans of subjects under different conditions.

  1. We assume that histograms are sampled according to: $x_i \stackrel{iid}{\sim} F$. That is, the histograms are assumed to be both independent and identically distributed.
  2. We assume that the data points are independent: $F_{X|0}=Norm(\mu_{0},\sigma_{0})^{V\times V}$.
  3. We assume there is a class conditional difference across conditions={Control, Cocaine, Fear}.
  4. In addition, we assume that other differences between subjects, such as gender and age, have no or only limited effect on the data. (We cannot test this, because we do not have access to that information.)

Step 2: Check assumptions

For independent histograms, check that off diagonal covariance is approximately 0.
$x_i \stackrel{iid}{\sim} F$
$(x_1, x_2, ..., x_n) \sim F = \prod_i^n F_i$
$F_i = F_j, \forall i,j$
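
The link between independence and the off-diagonal covariance check can be spelled out as follows. This is a standard identity rather than something stated in the original notes, and zero covariance is only a necessary condition for independence, not a sufficient one:

```latex
x_i \perp x_j \;\Rightarrow\; \mathrm{Cov}(x_i, x_j) = \mathbb{E}[x_i x_j] - \mathbb{E}[x_i]\,\mathbb{E}[x_j] = 0, \quad i \neq j
```

So a covariance matrix with clearly nonzero off-diagonal entries is evidence against independence, while near-zero entries are merely consistent with it.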

For identical histograms, check the optimal number of clusters and see if that is 1.
$F = \prod_j^J F_j, J < n$
$\sum_j^J w_jF_j(\theta)$


In [1]:
import os
PATH="/Users/david/Desktop/CourseWork/TheArtOfDataScience/claritycontrol/code/scripts/" # use your own path
os.chdir(PATH)

import clarity as cl  # I wrote this module for easier operations on data
import clarity.resources as rs
import csv,gc  # garbage memory collection :)

import numpy as np
import matplotlib.pyplot as plt
import jgraph as ig
%matplotlib inline

# settings for histogram
BINS=32 # histogram bins
RANGE=(10.0,300.0)

Histogram data preparation and Scale data

If you haven't done this before, please refer to the homework on histogram data preparation and data scaling.

Setup Step


In [2]:
# Load X
X = np.loadtxt("../data/hist/features.csv",delimiter=',')
print(X.shape)

# Load Y
y = np.array([0,0,0,1,1,1,1,1,2,2,2,2])


(12, 32)

Independent Histogram Assumption


In [3]:
vectorized = X
covar = np.cov(vectorized)

plt.figure(figsize=(7,7))
plt.imshow(covar)
plt.title('Covariance of Clarity Histograms datasets')
plt.colorbar()
plt.show()

diag = covar.diagonal()*np.eye(covar.shape[0])
hollow = covar-diag
d_det = np.linalg.det(diag)
h_det = np.linalg.det(hollow)

plt.figure(figsize=(11,8))
plt.subplot(121)
plt.imshow(diag)
plt.clim([0, np.max(covar)])
plt.title('Determinant of on-diagonal: ' + str(d_det))
plt.subplot(122)
plt.imshow(hollow)
plt.clim([0, np.max(covar)])
plt.title('Determinant of off-diagonal: ' + str(h_det))
plt.show()

print("Ratio of on- and off-diagonal determinants: " + str(d_det/h_det))


Ratio of on- and off-diagonal determinants: -0.104579653134

From the above, we conclude that the assumption that the histograms are independent is most likely true, because the cross-histogram covariance matrix is not strongly influenced by its off-diagonal components.
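
The determinant of the hollow (off-diagonal) matrix can be negative or close to zero, which makes a determinant ratio hard to interpret. As a hedged alternative, and not the method used in the cell above, the on- and off-diagonal parts could be compared by their Frobenius norms. The sketch below runs on synthetic data only; the helper name `offdiag_ratio` and the random inputs are assumptions made for illustration.

```python
import numpy as np

def offdiag_ratio(samples):
    """Return ||off-diagonal||_F / ||diagonal||_F of the row covariance matrix."""
    covar = np.cov(samples)             # each row is treated as one variable
    diag = np.diag(np.diag(covar))      # keep only the diagonal entries
    hollow = covar - diag               # keep only the off-diagonal entries
    return np.linalg.norm(hollow) / np.linalg.norm(diag)

rng = np.random.RandomState(0)
independent = rng.normal(size=(12, 5000))   # 12 unrelated synthetic rows
shared = rng.normal(size=(1, 5000))
dependent = independent + 3.0 * shared      # every row shares one strong component

ratio_independent = offdiag_ratio(independent)
ratio_dependent = offdiag_ratio(dependent)
print(ratio_independent, ratio_dependent)
```

A ratio near zero is what independent rows would produce; a ratio around or above one means the off-diagonal entries carry as much weight as the variances.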

Identical Histogram Assumption


In [4]:
import sklearn.mixture
i = np.arange(1, 13)  # candidate numbers of clusters: 1..12
print(i)
bic = np.array(())
for idx in i:
    print("Fitting and evaluating model with " + str(idx) + " clusters.")
    # GMM is the legacy scikit-learn class (removed in 0.20 in favor of GaussianMixture)
    gmm = sklearn.mixture.GMM(n_components=idx,n_iter=1000,covariance_type='diag')
    gmm.fit(vectorized)
    bic = np.append(bic, gmm.bic(vectorized))
plt.figure(figsize=(7,7))
plt.plot(i, 1.0/bic)  # note: this plots the reciprocal of BIC, not BIC itself
plt.title('BIC')
plt.ylabel('score')
plt.xlabel('number of clusters')
plt.show()
print(bic)


[ 1  2  3  4  5  6  7  8  9 10 11 12]
Fitting and evaluating model with 1 clusters.
Fitting and evaluating model with 2 clusters.
Fitting and evaluating model with 3 clusters.
Fitting and evaluating model with 4 clusters.
Fitting and evaluating model with 5 clusters.
Fitting and evaluating model with 6 clusters.
Fitting and evaluating model with 7 clusters.
Fitting and evaluating model with 8 clusters.
Fitting and evaluating model with 9 clusters.
Fitting and evaluating model with 10 clusters.
Fitting and evaluating model with 11 clusters.
Fitting and evaluating model with 12 clusters.
[-1639.72072733 -1509.33333076 -1366.11704279 -1217.79273549 -1065.97939288
  -908.97767055  -752.11847439  -591.68896016  -433.19008793  -271.65646483
  -110.19222953    51.32818139]

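From the above, we observe that our data were most likely not sampled identically from one distribution. This is an odd shape for a BIC curve, so more investigation is needed. Read at face value, this curve implies that the larger the number of clusters, the better. Two caveats apply: the plot shows 1/BIC rather than BIC itself, and scikit-learn defines BIC so that lower values are better, while the printed values rise steadily with the number of clusters. With only 12 histograms and up to 12 components, this reading should therefore be treated with caution.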
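
For readers on a recent scikit-learn, where `sklearn.mixture.GMM` is no longer available, the same model-selection idea can be sketched with `GaussianMixture` (`max_iter` takes the place of `n_iter`). This is a minimal illustration on synthetic two-cluster data, not a rerun of the analysis above; the data, the candidate range, and the variable names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic stand-in data: two well-separated groups of 2-D points.
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(100, 2)),
])

candidates = list(range(1, 6))
bic_scores = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type='diag',
                          max_iter=1000, random_state=0)
    gmm.fit(data)
    bic_scores.append(gmm.bic(data))  # lower BIC means a better fit/complexity trade-off

# Pick the candidate with the smallest BIC.
best_k = candidates[int(np.argmin(bic_scores))]
print(best_k, bic_scores)
```

Selecting with `argmin` on the raw BIC values avoids the sign and scale problems that come with plotting a reciprocal.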

Independent Histogram Data Points Assumption


In [5]:
vect = X.T
covar = np.cov(vect)

plt.figure(figsize=(7,7))
plt.imshow(covar)
plt.title('Covariance of Clarity Histogram dataset')
plt.colorbar()
plt.show()

diag = covar.diagonal()*np.eye(covar.shape[0])  # on-diagonal part only
hollow = covar-diag  # off-diagonal part only
d_sum = np.sum(diag)
h_sum = np.sum(hollow)

plt.figure(figsize=(11,8))
plt.subplot(121)
plt.imshow(diag)
plt.clim([0, np.max(covar)])
plt.title('Sum of on-diagonal: ' + str(d_sum))
plt.subplot(122)
plt.imshow(hollow)
plt.clim([0, np.max(covar)])
plt.title('Sum of off-diagonal: ' + str(h_sum))
plt.show()

print("Ratio of on- and off-diagonal covariance sums: " + str(d_sum/h_sum))


Ratio of on- and off-diagonal covariance sums: 0.539128118403

The histogram data points (bins) are not independent of one another, because the ratio of on- to off-diagonal covariance is relatively small.
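
Sums of raw covariances depend on the scale of each bin, so a scale-free summary can be a useful cross-check. The sketch below, which is an assumption-laden illustration and not part of the original analysis, computes the mean absolute off-diagonal correlation with `np.corrcoef` on synthetic data; the helper name `mean_abs_offdiag_corr` and the array sizes are hypothetical.

```python
import numpy as np

def mean_abs_offdiag_corr(features):
    """Mean |correlation| over all distinct pairs of rows (variables)."""
    corr = np.corrcoef(features)
    mask = ~np.eye(corr.shape[0], dtype=bool)   # select off-diagonal entries
    return np.abs(corr[mask]).mean()

rng = np.random.RandomState(1)
n_bins, n_subjects = 32, 200                    # hypothetical sizes
independent_bins = rng.normal(size=(n_bins, n_subjects))
trend = rng.normal(size=(1, n_subjects))
coupled_bins = independent_bins + 2.0 * trend   # all bins share one signal

low = mean_abs_offdiag_corr(independent_bins)
high = mean_abs_offdiag_corr(coupled_bins)
print(low, high)
```

Values close to zero are consistent with independent bins, while values approaching one indicate that the bins move together.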

Class Conditional Histogram Probability Assumption


In [7]:
import scipy.stats as ss

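# fraction of nonzero bins per subject (each row has vectorized.shape[1] bins)
prob = np.sum(vectorized > 0, axis=1) / float(vectorized.shape[1])

vals = ss.linregress(prob, y)  # regress the class label on that fraction
m = vals[0]  # slope
c = vals[1]  # intercept

def comp_value(m, c, data):
    # label predicted by the fitted line
    return m*data + c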

resi = np.array(())
for idx, subj in enumerate(y):
    temp = comp_value(m, c, prob[idx])
    resi = np.append(resi, subj - temp)
    
plt.figure(figsize=(7,7))
plt.scatter(prob, resi)
plt.title('Residual assignment error')
plt.xlabel('edge probability')
plt.ylabel('error')
plt.show()


From the above, the residual error shows no apparent dependence on the edge probability (here computed from the number of nonzero histogram bins per subject). Thus, this assumption might be true.
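
One point worth keeping in mind when reading such a residual plot: residuals from an ordinary least-squares line average to zero and are linearly uncorrelated with the predictor by construction, so only nonlinear structure or unequal spread in the plot is informative. The sketch below illustrates this with synthetic inputs; the `predictor` values are made up, and only the label layout mirrors `y` above.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.RandomState(2)
# Hypothetical stand-in for the per-subject predictor.
predictor = rng.uniform(0.2, 1.0, size=12)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2])

fit = ss.linregress(predictor, labels)
# Vectorized residuals: observed label minus the fitted line.
residuals = labels - (fit.slope * predictor + fit.intercept)

mean_residual = residuals.mean()
# Inner product of residuals with the centered predictor (zero for a least-squares fit).
cross_term = np.dot(residuals, predictor - predictor.mean())
print(mean_residual, cross_term)
```

Both printed quantities are zero up to floating-point error regardless of the data, which is why they cannot by themselves confirm the class-conditional assumption.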