In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.lines as mlines
from LECA.plotting import histLinePlot
%matplotlib inline
In [2]:
nodestats = pd.read_csv("nodeStats_HUMAN.csv",index_col=0,na_values=[None])
nodestats.head()
Out[2]:
In [3]:
nodestats.corr("spearman")
Out[3]:
In [4]:
nodestats["NodeError"].hist(bins=50,color='grey')
plt.ylabel("Number of Genes")
plt.xlabel("Avg. Node Error Between Algorithms")
Out[4]:
Note the positive skew. The bimodality statistic is defined as the average node error between the "old" and "young" groups of algorithms minus the average node error within those groups. The positive skew therefore suggests that this grouping captures a true clustering with respect to node error. Genes falling below zero here are bimodal with respect to some other grouping, but they are clearly in the minority.
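To make the definition concrete, here is a minimal sketch of a between-minus-within score for a single gene. It is not the code that produced nodeStats_HUMAN.csv. It assumes that node error between two algorithms is the absolute difference of their age estimates (as integer node indices) and that within-group pairs from both groups are pooled before averaging; the function name `bimodality_score` and the example ages are hypothetical.

```python
from itertools import combinations, product


def bimodality_score(old_ages, young_ages):
    """Mean between-group distance minus mean within-group distance.

    Assumption: distance between two algorithms is the absolute
    difference of their age estimates for this gene.
    """
    # Every pairing of one "old" algorithm with one "young" algorithm
    between = [abs(a - b) for a, b in product(old_ages, young_ages)]
    # Every pairing of two algorithms from the same group, pooled
    within = [
        abs(a - b)
        for group in (old_ages, young_ages)
        for a, b in combinations(group, 2)
    ]
    return sum(between) / len(between) - sum(within) / len(within)


# Hypothetical gene where the two groups disagree with each other
split_gene = bimodality_score([7, 7, 6], [1, 2, 1])
# Hypothetical gene where the disagreement cuts across the two groups
mixed_gene = bimodality_score([1, 7, 1], [7, 1, 7])
```

Under these assumptions the first gene scores above zero (the old/young grouping explains the disagreement) and the second scores below zero (some other grouping explains it).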
In [5]:
nodestats["Bimodality"].hist(bins=50,color='grey')
plt.ylabel("Number of Genes")
plt.xlabel("Bimodality")
#plt.savefig("Polarization_distribution.svg")
In [6]:
# Genes split between other groupings have a bimodality score < 0
# (greater within-group difference than between-group difference)
len(nodestats[nodestats["Bimodality"] < 0])/float(len(nodestats))
Out[6]:
In [7]:
# Bimodal or neutral genes (score greater than or equal to 0)
len(nodestats[nodestats["Bimodality"] >= 0])/float(len(nodestats))
Out[7]:
The binning is done by a module I wrote, histLinePlot. It builds a dataframe with the mean, standard deviation, and variance of the bimodality statistic in each bin of the node error histogram. The plots below visualize these statistics. The clear takeaway is that genes with more node error are more bimodal with respect to the "old" and "young" algorithms. There are therefore systematic differences between these algorithms that make determining a true age very difficult for a substantial subset of genes.
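For readers without the LECA package, the per-bin summary can be approximated with plain pandas. This is a sketch of the idea only, not the implementation of `histLinePlot.getLineScoreStats`: the function name `binned_score_stats` and the toy data are hypothetical, the column names are pandas defaults ('mean', 'std', 'var') rather than the module's, and indexing the result by bin midpoint is an assumption.

```python
import pandas as pd


def binned_score_stats(df, score_col, bin_col, bins=50):
    """Summarize score_col within equal-width bins of bin_col."""
    # Assign each row to an equal-width bin along bin_col
    binned = pd.cut(df[bin_col], bins=bins)
    # Aggregate the score per bin; observed=True skips empty bins
    stats = df.groupby(binned, observed=True)[score_col].agg(["mean", "std", "var"])
    # Index by bin midpoint so the result can be plotted over a histogram
    stats.index = [interval.mid for interval in stats.index]
    return stats


# Toy data (hypothetical values, two bins for readability)
demo = pd.DataFrame({
    "NodeError": [0, 1, 2, 3, 4, 5, 6, 7],
    "Bimodality": [0, 0, 1, 1, 2, 2, 3, 3],
})
demo_stats = binned_score_stats(demo, "Bimodality", "NodeError", bins=2)
```

The resulting frame has one row per non-empty bin, so its index and any one column can be passed directly to `ax.plot`.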
In [8]:
%%capture
stats = histLinePlot.getLineScoreStats(nodestats,"Bimodality","NodeError")
In [9]:
stats.head()
Out[9]:
In [10]:
fig,ax1 = plt.subplots()
nodestats["NodeError"].hist(bins=50,color='grey')
ax2 = ax1.twinx()
ax2.plot(stats.index,stats['mean'],'black',label="Avg Bimodality")
ax1.set_ylabel("Number of Genes")
ax1.set_xlabel("Avg. Node Error Between Algorithms")
ax2.set_ylabel("Average Bimodality")
plt.legend()
plt.savefig("nodeError-polarization_correlation.svg")
In [11]:
fig,ax1 = plt.subplots()
nodestats["NodeError"].hist(bins=50,color='grey')
ax2 = ax1.twinx()
ax2.plot(stats.index,stats['var'],'black',label="Var Bimodality")
ax1.set_ylabel("Number of Genes")
ax1.set_xlabel("Avg. Node Error Between Algorithms")
ax2.set_ylabel("Variance Bimodality")
plt.legend()
#plt.savefig("nodeError-polarization_correlation.svg")
Out[11]:
In [12]:
fig,ax1 = plt.subplots()
nodestats["NodeError"].hist(bins=50,color='grey')
ax2 = ax1.twinx()
ax2.plot(stats.index,stats['stanDev.'],'black',label="stdev Bimodality")
ax1.set_ylabel("Number of Genes")
ax1.set_xlabel("Avg. Node Error Between Algorithms")
ax2.set_ylabel("Std. Dev. Bimodality")
plt.legend()
#plt.savefig("nodeError-polarization_correlation.svg")
Out[12]: