In [3]:
# Read in data
import pandas as pd

df = pd.read_csv('../output.csv')
print("Data types:")
print(df.dtypes)
The data set is a table consisting of the following columns: "cx", "cy", "cz", "unmasked", and "synapses". There are 61776 rows, each corresponding to a cortical volume, called a "bin" henceforth.
"cx", "cy", and "cz" denote the unique location of the bin. "synapses" is an integer count of the number of synapses found within the bin. Each bin comprises many individual voxels of the EM image. A synapse could technically be found at any given voxel; however, a subset of these voxels was pre-determined by the experimenters to contain material that is not synaptic (e.g. cell bodies). These voxels were considered "masked". Thus the "unmasked" voxels comprise the subset of each bin in which a synapse may reside.
We found no "bad" values in the data set: no NaNs, Infs, or nonsensical values. All numbers were integers, as expected, and the synapse and unmasked counts were nonnegative.
Shown is a histogram of synapse count across all bins of the dataset.
To get a sense of outliers, we boxplotted synapse count.
Shown are the marginal distributions of synapse count along the cortical dimensions x, y, and z.
Shown are the heatmaps of synapse count.
At this point, we realized the significance of the unmasked data. We incorporated unmasked considerations into our subsequent exploratory analysis.
Shown is a box plot of the unmasked count:
Shown is a joint plot of synapse vs. unmasked count.
At this point we did two things with the data: 1) We defined the "weighted" synapse count of a bin as the raw synapse count divided by the unmasked count. 2) We discarded all bins with fewer than 50% of their voxels unmasked.
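The two steps above can be sketched as follows. The toy frame and the `MAX_VOXELS` per-bin voxel total are illustrative assumptions (the real total is not stated here); only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for output.csv; column names match the dataset,
# the values are illustrative only.
df = pd.DataFrame({
    'cx': [0, 39, 78],
    'cy': [0, 0, 0],
    'cz': [0, 0, 0],
    'unmasked': [120000, 90000, 10000],
    'synapses': [300, 150, 5],
})

# 1) Weighted synapse count: raw count normalized by unmasked voxels.
df['weighted'] = df['synapses'] / df['unmasked']

# 2) Keep only bins with at least 50% of their voxels unmasked.
#    MAX_VOXELS is an assumed per-bin voxel total, used only for illustration.
MAX_VOXELS = 150000
filtered = df[df['unmasked'] >= 0.5 * MAX_VOXELS]
```

Normalizing by the unmasked count before filtering keeps the two steps independent: the weighting corrects for partial masking, while the 50% cutoff drops bins too masked for the correction to be trustworthy.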
Shown is the distribution of weighted synapses over all bins that met the 50% unmasked requirement:
We decided to test whether the weighted synapse count follows a Poisson distribution.
We simulated data from Poisson as well as geometric distributions over a range of sample sizes. We tested each simulated data set under the null hypothesis that it was drawn from a Poisson distribution. Thus, data generated by a Poisson process was considered the "null model", while data generated by a geometric distribution was considered the "alternative model".
Shown is the power of our statistical test over a range of sample sizes.
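A minimal sketch of such a power simulation, assuming a dispersion-style test statistic (sample variance over sample mean, which is 1 in expectation for a Poisson) with the critical value itself estimated by Monte Carlo under the null; the statistic, λ, and sample sizes here are illustrative choices, not necessarily those used in the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

def dispersion_stat(x):
    # Variance-to-mean ratio; near 1 for Poisson data.
    return x.var(ddof=1) / x.mean()

def power_at(n, lam=4.0, n_sim=500, alpha=0.05):
    # Null model: Poisson(lam). Critical value from the null's
    # Monte Carlo distribution of the statistic.
    null_stats = np.array([dispersion_stat(rng.poisson(lam, n))
                           for _ in range(n_sim)])
    crit = np.quantile(null_stats, 1 - alpha)
    # Alt model: geometric matched to the same mean. numpy's geometric
    # is supported on {1, 2, ...} with mean 1/p, so p = 1/(lam + 1)
    # and subtracting 1 gives mean lam on {0, 1, ...}.
    alt_stats = np.array([dispersion_stat(rng.geometric(1 / (lam + 1), n) - 1)
                          for _ in range(n_sim)])
    # Power: fraction of alternative data sets that reject the null.
    return (alt_stats > crit).mean()

powers = {n: power_at(n) for n in (10, 50, 200)}
```

The geometric alternative is over-dispersed (variance λ(λ+1) vs mean λ), so the statistic drifts above the null critical value and the estimated power rises toward 1 with sample size.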
We applied our test to our weighted synapse data by setting $\lambda$ to the observed mean of weighted synapse count.
Shown is a comparison of the observed vs. expected weighted synapse counts:
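The observed-vs-expected comparison can be sketched like this; the synthetic `counts` array is a hypothetical stand-in for the (integer-valued) weighted synapse data, and only the method (λ from the observed mean, expected counts from the Poisson pmf) mirrors the text:

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(1)

# Hypothetical stand-in for the filtered, integer-valued counts;
# the real analysis uses the weighted synapse data from output.csv.
counts = rng.poisson(3.0, 1000)

lam = counts.mean()              # lambda set to the observed mean
ks = np.arange(counts.max() + 1)

# Poisson pmf evaluated directly: exp(-lam) * lam^k / k!
pmf = np.array([exp(-lam) * lam**int(k) / factorial(int(k)) for k in ks])

observed = np.bincount(counts, minlength=len(ks))
expected = pmf * len(counts)     # expected count per value under Poisson
```

Plotting `observed` against `expected` over `ks` then gives the comparison described above.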
We divided the bins according to Z layer, and attempted to classify the Z-layers based on weighted synapse count.
We then randomly chose a small grid and used its $X$ and $Y$ position to predict whether the grid belongs to the high- or low-density class.
$N$ is the number of synapses (observations)
Classification methods:
Shown is an example of a 5-by-5 grid of bins.
We average the weighted synapse density across the 25 bins to produce a single "grid mean".
We randomly sample 1000 grids from each Z-layer and visualize the distribution of grid means
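The grid-sampling step can be sketched as follows, assuming each Z-layer has been pivoted into a 2-D array of weighted densities indexed by the (cx, cy) bins; the array here is random illustrative data, not the real layer:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weighted-density image for a single Z-layer; the real
# layer would come from pivoting the (cx, cy) bins of the dataset.
layer = rng.random((40, 50))

def grid_mean(layer, size=5):
    # Pick a random size-by-size window of bins and average it.
    i = rng.integers(0, layer.shape[0] - size + 1)
    j = rng.integers(0, layer.shape[1] - size + 1)
    return layer[i:i + size, j:j + size].mean()

# 1000 random 5x5 grids -> 1000 grid means for this layer.
grid_means = np.array([grid_mean(layer) for _ in range(1000)])
```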
We then took the means of the grid-means of each Z-layer (to have 11 means total), and performed K-means clustering to isolate two groups: one "high-response" group and one "low-response" group.
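Since the clustering is on 11 scalar values, a minimal 1-D two-means loop suffices to illustrate it (a library K-means call would do the same job); the `layer_means` values are hypothetical, chosen only to reproduce an 8-vs-3 split:

```python
import numpy as np

# Hypothetical per-Z-layer means standing in for the 11 observed values.
layer_means = np.array([0.80, 0.90, 0.85, 0.95, 0.90, 0.88, 0.82, 0.91,
                        1.40, 1.50, 1.45])

# K-means with k=2 in one dimension: initialize centers at the extremes,
# then alternate assignment and center updates.
centers = np.array([layer_means.min(), layer_means.max()])
for _ in range(20):
    labels = np.abs(layer_means[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([layer_means[labels == k].mean() for k in (0, 1)])

low = np.flatnonzero(labels == 0)   # "low-response" layers
high = np.flatnonzero(labels == 1)  # "high-response" layers
```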
The low-response group comprised 8 Z-layers, while the high-response group comprised 3 Z-layers.
Shown is the distribution of grid-means of each group
It is immediately obvious that the two groups overlap substantially in grid means, so classification is not expected to be very successful.
Accuracy of Nearest Neighbors: 0.75 (+/- 0.01)
Accuracy of Linear SVM: 0.80 (+/- 0.01)
Accuracy of Random Forest: 0.80 (+/- 0.01)
Accuracy of Linear Discriminant Analysis: 0.80 (+/- 0.01)
Accuracy of Quadratic Discriminant Analysis: 0.80 (+/- 0.01)
The five classifiers tested had accuracies between 75-80%, which is better than chance. However, these numbers are at or only slightly above the maximum prior probability of 73%. This means that our classifiers are only slightly better than always choosing the class with the maximum prior, assuming we trust the priors. Taking the observed synapse density into account provides little added information, which is not surprising given the large overlap in observed densities between the two classes. If we wanted to distinguish between similar populations of Z-layers but with different priors from another dataset, our accuracy would decrease accordingly.
In [5]:
import numpy as np

isNan = df.isnull()
isInf = np.isinf(df)
isNeg = df < 0

print("Number of nan values by column:")
print(isNan.sum(), "\n")
print("Number of rows with nan values:", isNan.any(1).sum(), "\n")
print("Number of inf values by column:")
print(isInf.sum(), "\n")
print("Number of rows with inf values:", isInf.any(1).sum(), "\n")
print("Number of negative values by column:")
print(isNeg.sum(), "\n")
print("Number of rows with negative values:", isNeg.any(1).sum(), "\n")
In [6]:
nSyn = df['synapses'].sum()
print("There are", nSyn, "total synapses in the data.")
In [7]:
nBins = df['synapses'].count()
print("There are", nBins, "total 3D bins.")