Revised version: instead of considering the synapse values alone, we now consider synapses/unmasked, per Greg's explanation of the unmasked variable.
Call a row vector from our data set $X_i=(x_i, y_i, z_i, u_i, s_i)$, where the positions correspond to those in the original data set. Each row vector represents a 'bin' of pixels; let $N$ be the number of rows (bins) in the data. We will look at the number of synapses per bin and ask whether it follows a uniform distribution, conditioned on unmasked.
The number of synapses per bin follows a multinomial distribution, where each bin's synapse probability is conditioned on that bin's unmasked value.
We will test whether the number of synapses is distributed uniformly across the bins; in other words, does $X$ follow a multinomial distribution in which each cell has equal probability?
$H_0: \textrm{ all cells have equal probability }$
$H_A: \textrm{ cells do not have equal probability }$
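One way to make this precise: since conditioning on unmasked makes each bin's synapse probability proportional to its unmasked value, the hypotheses can be restated in terms of per-bin probabilities $\pi_i$ as
$$H_0: \pi_i = \frac{u_i}{\sum_{j=1}^{N} u_j} \textrm{ for all } i \qquad H_A: \pi_i \neq \frac{u_i}{\sum_{j=1}^{N} u_j} \textrm{ for some } i$$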
We'll use Pearson's Chi Squared Test to determine whether to reject the null. First, define $\bar \pi$ to be the average synaptic density (synapses/unmasked) across all bins. Let $E_i$, the expected number of synapses at bin $i$, be $E_i=\bar \pi u_i$, where $u_i$ is the unmasked value at that bin, and let $X_i$ be the observed number of synapses there. Our test statistic is $$ T = \sum_{i = 1}^{N} \frac{(X_i - E_i)^2}{E_i} $$
which approximately follows a chi-squared distribution with $N-1$ degrees of freedom. Therefore, given a significance level $\alpha$, we can use the inverse CDF of the chi-squared distribution to determine a critical value; when $T$ exceeds the critical value, we reject the null.
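As a minimal sketch of that decision rule (the bin count `N_bins` below is a placeholder for illustration, not a value from the data):
In [ ]:
# sketch: critical value from the inverse CDF of the chi-squared distribution
from scipy import stats

alpha = .025 # significance level, matching the analysis below
N_bins = 1000 # hypothetical bin count; in practice use the number of rows
critical_value = stats.chi2.ppf(1 - alpha, N_bins - 1) # (1 - alpha) quantile
print critical_value # reject the null when T exceeds this value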
In [2]:
import numpy as np
import matplotlib.pyplot as plt
import itertools
import urllib2
import scipy.stats as stats
%matplotlib inline
np.set_printoptions(precision=3, threshold=1000000, suppress=True)
np.random.seed(1)
alpha = .025
url = ('https://raw.githubusercontent.com/Upward-Spiral-Science'
'/data/master/syn-density/output.csv')
data = urllib2.urlopen(url)
csv = np.genfromtxt(data, delimiter=",")[1:] # don't want first row (labels)
num_samples = 150 # how many different sized samples to draw
N = np.sum(csv[:, -1]) # total number of synapses across all bins
L = np.unique(csv[:, 2]) # list of labels
print L.shape
# sample sizes to iterate over
sample_sizes = np.logspace(1.0, 5.0, num=num_samples, base=10.0)
print sample_sizes
In [23]:
# simulated sampling under the null
repeats = 100 # how many repetitions per sample size
pi_null = np.array([1.0/float(len(L))]*len(L)) # pi vector under the null (uniform probs)
power_null = [] # empirical rejection rate per sample size (type I error under the null)
for s in sample_sizes:
    power = 0
    n = int(s) # multinomial sampling requires an integer sample size
    E_i = pi_null*n # expected count per label
    for r in xrange(repeats):
        null_data = np.random.multinomial(n, pi_null)
        chi_sq = stats.chisquare(null_data, E_i)
        p_value = chi_sq[1]
        # can we reject the null hypothesis?
        if p_value < alpha:
            power += 1
    power_null.append(float(power)/float(repeats))
In [24]:
# simulated sampling under the alternate
repeats = 100 # how many repetitions per sample size
power_alt = []
pi_alt = np.random.rand(len(L)) # create a pi vector (random probabilities)
pi_alt = pi_alt/np.sum(pi_alt) # normalize
for s in sample_sizes:
    power = 0
    n = int(s) # multinomial sampling requires an integer sample size
    E_i = pi_null*n # expected counts are still computed under the null
    for r in xrange(repeats):
        alt_data = np.random.multinomial(n, pi_alt) # use alternate pi vector to generate data
        chi_sq = stats.chisquare(alt_data, E_i)
        p_value = chi_sq[1]
        # can we reject the null hypothesis?
        if p_value < alpha:
            power += 1
    power_alt.append(float(power)/float(repeats))
In [27]:
plt.scatter(sample_sizes, power_null, label='null', s=4)
plt.scatter(sample_sizes, power_alt, color='green', label='alt', s=4)
plt.xlabel('sample size')
plt.xscale('log')
plt.ylabel('power')
plt.axhline(alpha, color='red', linestyle='--', label='alpha')
plt.legend(loc=5)
plt.show()
In [11]:
from __future__ import division
csv = csv[np.where(csv[:, -2] != 0)] # drop bins with unmasked == 0 to avoid division by zero
X = csv[:, -1]
density = csv[:, -1]/csv[:,-2]
# get average density (probability)
avg = np.average(density)
# expected values are the average density multiplied by unmasked per bin
E = csv[:, -2]*avg
print X[:50]
print E[:50]
print stats.chisquare(X, E)
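As a follow-up sketch (not in the original cell), the statistic above can be checked against the critical value, mirroring the decision rule described earlier:
In [ ]:
# apply the decision rule: compare the statistic to the critical value
T, p = stats.chisquare(X, E) # same test as above, unpacked
critical_value = stats.chi2.ppf(1 - alpha, len(X) - 1) # (1 - alpha) quantile
print T > critical_value # True -> reject the null at level alpha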