Investigating different sources of error

This notebook investigates the different sources of error in age inference. Besides the "bimodality" that we discuss at length in the manuscript, we also looked at whether protein length, number of domains, and evolutionary rate correlated with node error, our main error statistic
We also investigate whether protein length and evolutionary rate affect the age call, with shorter, more highly evolving proteins coming out younger than they should due to limited sensitivity in the inital search used to identify homologs



In [1]:

    
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
from itertools import combinations
%matplotlib inline



In [56]:

    
## Suppress copious FutureWarnings from pandas

import warnings
warnings.filterwarnings('ignore')

Read in the different files with error-source information



In [6]:

    
# Protein lengths are from Uniprot proteome. Number of domains was inferred with hmmer. Similarities here are derived 
# from pairwise alignments (needleman-wunsch) between one-to-one human and mouse orthologs, defined by PhylomeDB. 
# We look atHuman-Yeast pairs below, but chose to use Mouse here because there are more one-to-ones.

lengths = pd.read_csv("lengthMapping.csv",index_col=0,names=["protLength"]).sort_index()
numDomains = pd.read_csv("numDomains.csv",index_col=0).sort_index()
mouse_human = pd.read_csv("h-m_similarities.csv",index_col=0,names=["Hs-Mm_sim"]).sort_index()



In [7]:

    
nodeStats = pd.read_csv("../../nodeStats/nodeStats_HUMAN.csv",index_col=0).sort_index()[["Bimodality","NodeError"]]



In [8]:

    
lossTaxa = pd.read_csv("../../Errors/lossStats_HUMAN.csv",index_col=0).sort_index()[["mean"]]
lossTaxa.columns = ["meanLossTaxa"]



In [9]:

    
fracOverSplit = pd.read_csv("../../Errors/HUMAN_LDO_summary.csv",index_col=0).sort_index()
fracOverSplit.columns = ["fracOverSplit"]

Make concatenated dataframe



In [55]:

    
all_stats = pd.concat([lengths,numDomains,mouse_human,nodeStats,lossTaxa,fracOverSplit],axis=1)
all_stats.dropna(inplace=True)
all_stats = all_stats.convert_objects(convert_numeric=True)
all_stats.columns = ["Protein Length","# Domains","Hs-Mm_evoSim","Bimodality",
                     "Node Error","Avg. Loss Taxa","% Oversplit"]
all_stats.head()









    Out[55]:






  
    
      
      Protein Length
      # Domains
      Hs-Mm_evoSim
      Bimodality
      Node Error
      Avg. Loss Taxa
      % Oversplit
    
  
  
    
      A0AV02
      714.0
      2.0
      86.7
      1.428571
      2.769231
      13.307692
      0.555556
    
    
      A0AV96
      593.0
      4.0
      96.3
      7.000000
      4.769231
      10.153846
      1.000000
    
    
      A0AVI4
      362.0
      1.0
      94.2
      0.007937
      0.282051
      2.461538
      0.000000
    
    
      A0AVK6
      867.0
      2.0
      88.3
      2.861111
      4.512821
      13.769231
      0.500000
    
    
      A0AVT1
      1052.0
      4.0
      94.5
      0.337302
      2.487179
      19.923077
      1.000000

Scatterplot and correlation analysis

Only bimodality correlates strongly with the node error statistic



In [26]:

    
fig, axes = plt.subplots(3,3,sharey=False)
index = 0
for col in all_stats.columns:
    axis = axes.flat[index]
    all_stats.plot(ax=axis,kind='scatter',x=col,y='Node Error',color='grey',title=col)
    axis.set_ylabel('')
    axis.set_xticklabels([])
    index += 1
    
#fig.savefig("errorScatterPlot.png",dpi=200)



In [27]:

    
# Get Spearman rank correlations between all variables

correlations = all_stats.corr("spearman")
correlations









    Out[27]:






  
    
      
      Protein Length
      # Domains
      Hs-Mm_evoSim
      Bimodality
      Node Error
      Avg. Loss Taxa
      % Oversplit
    
  
  
    
      Protein Length
      1.000000
      0.327528
      -0.061458
      0.069116
      0.098448
      0.071032
      0.074736
    
    
      # Domains
      0.327528
      1.000000
      0.099203
      0.041373
      0.107270
      0.079636
      0.223168
    
    
      Hs-Mm_evoSim
      -0.061458
      0.099203
      1.000000
      -0.029975
      -0.134729
      0.078357
      -0.039889
    
    
      Bimodality
      0.069116
      0.041373
      -0.029975
      1.000000
      0.685699
      0.061470
      0.063460
    
    
      Node Error
      0.098448
      0.107270
      -0.134729
      0.685699
      1.000000
      0.342641
      0.061165
    
    
      Avg. Loss Taxa
      0.071032
      0.079636
      0.078357
      0.061470
      0.342641
      1.000000
      0.002736
    
    
      % Oversplit
      0.074736
      0.223168
      -0.039889
      0.063460
      0.061165
      0.002736
      1.000000

Plot absolute Spearman rank correlations with Node Error



In [28]:

    
absCorrs = correlations.applymap(np.abs)

absCorrs.drop("Node Error",axis=1).ix["Node Error",].plot(
    kind='bar',color='grey',ylim=(0,1),title="Correlation with Node Error")
plt.ylabel("Absolute Spearman corr. coef.")

#plt.savefig("correlation_barPlot.svg")









    Out[28]:





<matplotlib.text.Text at 0x7fd5529b1110>



In [29]:

    
# correlations.to_csv("errorCorrelations.csv")

Human-yeast evolutionary rates and node errors

Now we'll take a quick look at yeast-human similarities. We used mouse above because there are many one-to-one orthologs, but the tradeoff is a small dynamic range. Human-yeast similarities have a stronger correlation with node error than human-mouse.



In [43]:

    
yeast_human = pd.read_csv("h-y_similarities.csv",index_col=0,names=["Hs-Sc_sim"]).sort_index()



In [57]:

    
error_yeastSims = pd.concat([nodeStats,yeast_human],axis=1).dropna()[["Hs-Sc_sim","NodeError"]]
error_yeastSims = error_yeastSims.convert_objects(convert_numeric=True)
print "Number of alignments: %d" % len(error_yeastSims)









    



Number of alignments: 892



In [52]:

    
error_yeastSims.corr()









    Out[52]:






  
    
      
      Hs-Sc_sim
      NodeError
    
  
  
    
      Hs-Sc_sim
      1.000000
      -0.357516
    
    
      NodeError
      -0.357516
      1.000000



In [53]:

    
error_yeastSims.plot(kind='scatter',y="NodeError",x="Hs-Sc_sim",color='grey',title="Yeast-Human Distances Scatter")

#plt.savefig("Yeast-Human_dists_scatter.png",dpi=600)









    Out[53]:





<matplotlib.axes.AxesSubplot at 0x7fd5529c63d0>

Effect of protein lengths and evolutionary distances on age-call.

Does the limited sensitivity of homology search algorithms affect the age-call? Prediction of Moyers and Zhang (2015,2016) is that quickly evolving proteins and short proteins will be called at a younger age incorrectly.
Takeaway from the below is that protein length and evolutionary rate may affect some of the proteins in the "Mammal" age class. So we cannot rule out that some of these are false negatives due to limited sensitivity in the initial homology search.

First we'll take a look at the distribution of human-mouse one-to-one alignment similarities. Most are highly similar.



In [59]:

    
mouse_human.hist(bins=50,color='grey')









    Out[59]:





array([[<matplotlib.axes.AxesSubplot object at 0x7fd552499650>]], dtype=object)

Distribution of protein lengths. Most are short



In [62]:

    
lengths.hist(bins=50,color='grey')
plt.yscale("log")

Add in the mode-ages



In [36]:

    
conAges = pd.read_csv("../../Main/main_HUMAN.csv",index_col=0)["modeAge"]
conAges.head()









    Out[36]:





A0A075B6G5    Eumetazoa
A0A075B6R3     Mammalia
A0A0A0MR89    Eumetazoa
A0A0A0MS98     Mammalia
A0A0A0MSJ3    Eukaryota
Name: modeAge, dtype: object



In [37]:

    
statsAges = all_stats.join(conAges).sort("modeAge")
statsAges.head()









    Out[37]:






  
    
      
      Protein Length
      # Domains
      Hs-Mm_evoSim
      Bimodality
      Node Error
      Avg. Loss Taxa
      % Oversplit
      modeAge
    
  
  
    
      A1Z1Q3
      448
      1
      77.8
      5.376984
      7.256410
      29.923077
      0.8
      Cellular_organisms
    
    
      A2A3L6
      582
      16
      43.2
      0.452381
      4.076923
      20.307692
      1.0
      Cellular_organisms
    
    
      A2RTX5
      802
      4
      91.1
      2.960317
      4.871795
      19.538462
      0.4
      Cellular_organisms
    
    
      A2RU49
      373
      1
      89.6
      -0.305556
      2.974359
      13.692308
      1.0
      Cellular_organisms
    
    
      A6NDG6
      321
      4
      94.1
      0.003968
      1.974359
      23.846154
      0.0
      Cellular_organisms



In [44]:

    
ages = ["Cellular_organisms","Euk+Bac","Euk_Archaea","Eukaryota",
        "Opisthokonta","Eumetazoa","Vertebrata","Mammalia"]

ageLengths = {}

for i in ages:
    df = statsAges[statsAges["modeAge"] == i]
    assert len(df) > 50 # just checkin'
    ageLengths[i] = df["Protein Length"].mean()

Average protein lengths in each age category. No strong correlation with age, but Mammals may be enriched for small proteins.



In [45]:

    
ax = pd.Series(ageLengths,index=ages).plot(color='black')
plt.xticks(rotation=70)
ax.set_ylabel("Average Protein Length")
ax.set_title("Average Protein Lengths in each Age Category")

#plt.savefig("lengthsXage.svg")









    Out[45]:





<matplotlib.text.Text at 0xd0cbfac>

Average similarity in each age class. Mammals definitely enriched for more quickly evolving proteins. Some of these may be older than "Mammal" and are false-negatives.



In [62]:

    
ageSims = {}

for i in ages:
    df = statsAges[statsAges["modeAge"] == i]
    assert len(df) > 50 # just checkin'
    ageSims[i] = df["Hs-Mm_evoSim"].mean()
    
ax = pd.Series(ageSims,index=ages).plot(color='black')
plt.xticks(rotation=70)
ax.set_ylabel("Average Hs-Mm Similarity")
ax.set_title("Hs-Mm Evolutionary Similarity in each Age Category")

#plt.savefig("Hs-Mm_similarityXage.svg")









    Out[62]:





<matplotlib.text.Text at 0xdd5dd2c>

Double-check that Mammal age class has significantly lower similarities in its proteins than Vertebrate class



In [56]:

    
# Create series for Mann-Whitney U-test

ageSimsSets = {}

for i in ages:
    df = statsAges[statsAges["modeAge"] == i]
    assert len(df) > 50 # just checkin'
    ageSimsSets[i] = df["Hs-Mm_evoSim"]

ageSimsSets["Mammalia"].head()









    Out[56]:





A1L4H1    70.1
A2RU30    71.3
A4D2G3    89.7
A6NCF6    61.6
A6NCV1    93.9
Name: Hs-Mm_evoSim, dtype: float64



In [103]:

    
## Mann-Whitney U-test

f,p = stats.mannwhitneyu(ageSimsSets["Vertebrata"],ageSimsSets["Mammalia"])
print "p-val Mammals and Vertebrate classes have same mean: %s" % p









    



p-val Mammals and Vertebrate classes have same mean: 1.68683397453e-30



In [104]:

    
ageCatSimsDF = pd.DataFrame(ageSimsSets,columns=ages)

Box plot of similarities



In [105]:

    
ax = ageCatSimsDF.boxplot(sym=".")
plt.ylim(20,110)
plt.xticks(rotation=70)









    Out[105]:





(array([1, 2, 3, 4, 5, 6, 7, 8]), <a list of 8 Text xticklabel objects>)



In [ ]:

	Protein Length	# Domains	Hs-Mm_evoSim	Bimodality	Node Error	Avg. Loss Taxa	% Oversplit
A0AV02	714.0	2.0	86.7	1.428571	2.769231	13.307692	0.555556
A0AV96	593.0	4.0	96.3	7.000000	4.769231	10.153846	1.000000
A0AVI4	362.0	1.0	94.2	0.007937	0.282051	2.461538	0.000000
A0AVK6	867.0	2.0	88.3	2.861111	4.512821	13.769231	0.500000
A0AVT1	1052.0	4.0	94.5	0.337302	2.487179	19.923077	1.000000

	Protein Length	# Domains	Hs-Mm_evoSim	Bimodality	Node Error	Avg. Loss Taxa	% Oversplit
Protein Length	1.000000	0.327528	-0.061458	0.069116	0.098448	0.071032	0.074736
# Domains	0.327528	1.000000	0.099203	0.041373	0.107270	0.079636	0.223168
Hs-Mm_evoSim	-0.061458	0.099203	1.000000	-0.029975	-0.134729	0.078357	-0.039889
Bimodality	0.069116	0.041373	-0.029975	1.000000	0.685699	0.061470	0.063460
Node Error	0.098448	0.107270	-0.134729	0.685699	1.000000	0.342641	0.061165
Avg. Loss Taxa	0.071032	0.079636	0.078357	0.061470	0.342641	1.000000	0.002736
% Oversplit	0.074736	0.223168	-0.039889	0.063460	0.061165	0.002736	1.000000

	Protein Length	# Domains	Hs-Mm_evoSim	Bimodality	Node Error	Avg. Loss Taxa	% Oversplit	modeAge
A1Z1Q3	448	1	77.8	5.376984	7.256410	29.923077	0.8	Cellular_organisms
A2A3L6	582	16	43.2	0.452381	4.076923	20.307692	1.0	Cellular_organisms
A2RTX5	802	4	91.1	2.960317	4.871795	19.538462	0.4	Cellular_organisms
A2RU49	373	1	89.6	-0.305556	2.974359	13.692308	1.0	Cellular_organisms
A6NDG6	321	4	94.1	0.003968	1.974359	23.846154	0.0	Cellular_organisms