False positives and false negatives

This notebook explores the two sources of systematic error that we identify and trim in our datasets.



In [1]:

    
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

False positives

A false positive is an algorithm that, for a given gene, infers an outsized number of losses for that gene's orthogroup.

For each gene, my programs output the mean number of taxa that the algorithms inferred to have lost the orthogroup, and the variance of this number across algorithms.

They also identify algorithms that infer losses in an outsized number of taxa (more than 2 standard deviations above the mean). Any such algorithms are listed in the outliers column.
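The outlier rule described above can be sketched in a few lines. This is only an illustration, not the program that produced `lossStats_HUMAN.csv`: the algorithm names and loss counts are made up, `flag_outliers` is a hypothetical helper, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev

# Hypothetical loss counts for one gene: algorithm name -> number of taxa
# inferred to have lost the orthogroup (illustrative values, not real data).
loss_counts = {
    "AlgA": 3, "AlgB": 4, "AlgC": 2, "AlgD": 3, "AlgE": 5,
    "AlgF": 3, "AlgG": 4, "AlgH": 2, "AlgI": 3, "AlgJ": 25,
}

def flag_outliers(counts, n_sd=2.0):
    """Return (mean, variance, algorithms more than n_sd SDs above the mean)."""
    values = list(counts.values())
    mu = mean(values)
    sd = stdev(values)  # sample SD (n - 1 denominator); an assumption
    outliers = [name for name, v in counts.items() if v > mu + n_sd * sd]
    return mu, sd ** 2, outliers

mu, var, outliers = flag_outliers(loss_counts)
print(mu, var, outliers)
```

Here only `AlgJ` exceeds the threshold, so it is the one algorithm that would appear in the outliers column for this gene.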



In [2]:

    
stats2 = pd.read_csv("lossStats_HUMAN.csv",index_col=0)



In [3]:

    
# Genes with no outlier algorithms have a missing "outliers" value; mark them with 0.
stats2 = stats2.fillna({"outliers": 0})
stats2.head()









    Out[3]:






  
    
      
             mean   variance      outliers
Q8TEA1  28.538462  92.269231             0
A6NIH7  21.461538  25.769231             0
Q96HJ5   3.230769  18.192308  PANTHER8_all
O94913   7.076923  25.243590   Hieranoid_2
P37837  18.230769  47.858974             0

Let's look at the distributions of the mean and variance.



In [8]:

    
ax = stats2["variance"].hist(bins=50,color='grey')
ax.set_title("Variance histogram, all genes")
ax.set_ylabel("Number of genes")

#plt.savefig("variance_histogram.svg")









    Out[8]:





<matplotlib.text.Text at 0xa61640c>



In [9]:

    
stats_outliers = stats2[stats2["outliers"] != 0]
ax = stats_outliers["variance"].hist(bins=50,color='grey')
ax.set_title("Variance, genes with outliers")
ax.set_ylabel("Number of genes")









    Out[9]:





<matplotlib.text.Text at 0xa578c0c>



In [11]:

    
ax = stats2["mean"].hist(bins=50,color='grey')
ax.set_title("Histogram of mean values, all genes")
ax.set_ylabel("Number of genes")
#plt.savefig("mean_histogram.svg")









    Out[11]:





<matplotlib.text.Text at 0xa9b9fec>



In [12]:

    
ax = stats_outliers["mean"].hist(bins=50,color='grey')
ax.set_title("Mean, genes with outliers")
ax.set_ylabel("Number of genes")









    Out[12]:





<matplotlib.text.Text at 0xa7e7e4c>

Count number of outliers for each gene



In [13]:

    
stats2['numOutliers'] = stats2['outliers'].map(lambda x: len(x.split(" ")) if x != 0 else 0)
stats2.head()









    Out[13]:






  
    
      
             mean   variance      outliers  numOutliers
Q8TEA1  28.538462  92.269231             0            0
A6NIH7  21.461538  25.769231             0            0
Q96HJ5   3.230769  18.192308  PANTHER8_all            1
O94913   7.076923  25.243590   Hieranoid_2            1
P37837  18.230769  47.858974             0            0



In [14]:

    
stats2["numOutliers"].value_counts()









    Out[14]:





0    11774
1     7910
2      768
dtype: int64

Get number of false positives (outliers) for each algorithm



In [15]:

    
FalsePos = pd.Series([db for row in stats2["outliers"] for db in str(row).split()]).value_counts()
FalsePos = FalsePos[FalsePos.index != '0'] # don't care about these
FalsePos









    Out[15]:





PANTHER8_all         2006
PhylomeDB            1805
Hieranoid_2          1309
EnsemblCompara_v2     971
Metaphors             810
RSD                   630
OMA_Pairs             529
OMA_Groups            439
EggNOG                224
PANTHER8_LDO          224
InParanoidCore        216
InParanoid            195
Orthoinspector         88
dtype: int64

False Negatives

A false negative is an algorithm that, for a given gene, oversplits a co-ortholog group. See the paper for an in-depth description.

My programs output a file that, for each gene, says whether or not each algorithm was found to oversplit.



In [16]:

    
ldos = pd.read_csv("HUMAN_LDO_results.csv",index_col=0)
ldos.head()









    Out[16]:






  
    
      
            OMA_Groups  PANTHER8_LDO  InParanoidCore  Orthoinspector    RSD  Hieranoid_2  OMA_Pairs  EnsemblCompara_v2  Metaphors  InParanoid  EggNOG  PANTHER8_all
A0A0A0MS98       False         False           False           False   True        False      False              False      False       False     NaN           NaN
A0A0A0MSL8        True           NaN            True            True    NaN         True       True                NaN       True        True    True           NaN
A0A0B4J1T7       False           NaN           False           False  False        False      False              False      False       False     NaN           NaN
A0A0B4J1V8       False           NaN           False             NaN    NaN          NaN       True                NaN        NaN       False     NaN           NaN
A0A0B4J207        True           NaN             NaN             NaN    NaN          NaN        NaN                NaN        NaN         NaN     NaN           NaN

Get number of false negatives for each algorithm



In [18]:

    
# Count the genes flagged True (oversplit) for each algorithm; NaN entries are not counted.
FalseNeg = ldos.eq(True).sum().sort_values(ascending=False)
FalseNeg









    Out[18]:





PANTHER8_LDO         5692
RSD                  5628
OMA_Pairs            5032
OMA_Groups           4676
InParanoidCore       4496
Orthoinspector       4292
InParanoid           3462
EggNOG               3460
Hieranoid_2          2993
EnsemblCompara_v2    2766
Metaphors            1693
PANTHER8_all         1104
Name: True, dtype: int64

Combine counts of false-negatives and false-positives for each algorithm



In [19]:

    
dbs = ["InParanoid","InParanoidCore","OMA_Groups","OMA_Pairs","PANTHER8_LDO","RSD","EggNOG","Orthoinspector",
       "Hieranoid_2","EnsemblCompara_v2","Metaphors","PhylomeDB","PANTHER8_all"]
errors = pd.DataFrame({"FalsePositive":FalsePos,"FalseNegative":FalseNeg})
errors = errors.reindex(dbs)
errors.head()









    Out[19]:






  
    
      
                FalseNegative  FalsePositive
InParanoid               3462            195
InParanoidCore           4496            216
OMA_Groups               4676            439
OMA_Pairs                5032            529
PANTHER8_LDO             5692            224



In [20]:

    
# errors.to_csv("errors_byDatabase.csv")

Plot counts of errors for each algorithm



In [21]:

    
width = .35
fig, ax1 = plt.subplots()
errors["FalseNegative"].plot(kind='bar', ax=ax1, color='grey', width=width, position=1)
ax1.set_ylabel("Number of genes (false negative)")

# Second y-axis so the much smaller false-positive counts stay readable.
ax2 = ax1.twinx()
errors["FalsePositive"].plot(kind='bar', ax=ax2, color='black', width=width, position=0)
ax2.set_ylabel("Number of genes (false positive)")

ax1.yaxis.grid(False)
ax2.yaxis.grid(False)
ax1.xaxis.grid(False)
ax2.xaxis.grid(False)

#plt.savefig("errors_byDatabase.svg")

Proportional error by database

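Error counts normalized by database: each count is divided by its column total, and the false-negative and false-positive shares are then summed and rescaled to sum to 1.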
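Written as a formula, the normalization in the next cell amounts to the following. The symbols are notation introduced here for illustration: $FN_i$ and $FP_i$ are the false-negative and false-positive counts for algorithm $i$, $s_i$ corresponds to the `sumErrors` column, and $n_i$ to the `normSum` column.

```latex
s_i = \frac{FN_i}{\sum_j FN_j} + \frac{FP_i}{\sum_j FP_j},
\qquad
n_i = \frac{s_i}{\sum_j s_j}
```

Each of the two shares sums to 1 over all algorithms, so $\sum_j s_j$ would be 2 if every algorithm had both counts available.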



In [23]:

    
normErrors = errors/errors.sum()
normErrors["sumErrors"] = normErrors["FalseNegative"] + normErrors["FalsePositive"]
normErrors["normSum"] = normErrors["sumErrors"]/normErrors["sumErrors"].sum()
normErrors.sum()









    Out[23]:





FalseNegative    1.000000
FalsePositive    1.000000
sumErrors        1.808914
normSum          1.000000
dtype: float64
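The `sumErrors` total of 1.808914 rather than 2 appears to come from PhylomeDB: it has a false-positive count in Out[15] but no entry in Out[18], so its `FalseNegative` value is missing after the reindex and its `sumErrors` is NaN. The sketch below re-derives the total in plain Python from the counts printed above. It assumes pandas' default behaviour of skipping NaN when summing, and it is a check on the arithmetic, not a replacement for the cell.

```python
import math

# Per-algorithm counts copied from Out[15] (false positives) and Out[18] (false negatives).
false_pos = {
    "PANTHER8_all": 2006, "PhylomeDB": 1805, "Hieranoid_2": 1309,
    "EnsemblCompara_v2": 971, "Metaphors": 810, "RSD": 630, "OMA_Pairs": 529,
    "OMA_Groups": 439, "EggNOG": 224, "PANTHER8_LDO": 224,
    "InParanoidCore": 216, "InParanoid": 195, "Orthoinspector": 88,
}
false_neg = {  # PhylomeDB has no false-negative count
    "PANTHER8_LDO": 5692, "RSD": 5628, "OMA_Pairs": 5032, "OMA_Groups": 4676,
    "InParanoidCore": 4496, "Orthoinspector": 4292, "InParanoid": 3462,
    "EggNOG": 3460, "Hieranoid_2": 2993, "EnsemblCompara_v2": 2766,
    "Metaphors": 1693, "PANTHER8_all": 1104,
}

fp_total = sum(false_pos.values())
fn_total = sum(false_neg.values())

sum_errors = {}
for db in false_pos:
    fp_share = false_pos[db] / fp_total
    fn_share = false_neg[db] / fn_total if db in false_neg else math.nan
    sum_errors[db] = fp_share + fn_share  # NaN if either share is missing

# Mimic a NaN-skipping sum: ignore algorithms whose combined share is NaN.
total = sum(v for v in sum_errors.values() if not math.isnan(v))
print(round(total, 6))  # 1.808914
```

The missing amount, 2 - 1.808914, equals PhylomeDB's false-positive share (1805 / 9446). Under the same reading, PhylomeDB's `normSum` is also NaN, so it would not get a bar in the plot that follows.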



In [24]:

    
normErrors["normSum"].plot(kind='bar',color='grey')

#plt.savefig("totalErrors.svg")









    Out[24]:





<matplotlib.axes.AxesSubplot at 0xa2d6b8c>


