Statistics libraries

A number of python lybraries are widely used for statistical computation. Here is a small list:

As Python comes with battery included, it has its own statistics functions.
For basic statistics one need not go further than Pandas, a module designed precisely for managing data that comes in a any tabular form.
numpy/scipy.stats is a submodule dealing exclusively with statistics.
The scikits stack, useful for many research fields, many of which have their own statistics submodules. Scikit-learn is leaned upon machine learning, but some of its modules, like the generalized linear models are part of some statistics textbooks.
Statsmodels While still in beta, this is probably the most convenient library for the programming agnostic. We did not use it because practicing with scipy is more important for this short course.
PyMC is targeted towards Bayesian statistics and MCMC estimation, but one can use scikit-learn instead. This also goes for a number of other machine learning related small packages, but hey the Python world is free for your taking, and sometimes they hold hidden gems not found in the main libraries.
A popular way of doing statistics in Python is also running R statistics packages from withing Python, using Rpy2. We will get to that in another chapter.

Statistical data structures with Pandas: Series and Dataframes

Languages developed primarily for statistics are using rich data structures that are available also in Python. One of the most important modules to use in Pandas. Let us have a look at Pandas API index.

http://pandas.pydata.org/pandas-docs/dev/api.html

Do you recognize some of these structures and can you name equivalent functionality from other languages like R or tabular computing programs like Excel? Pandas can do everything Excel does and then some more, like SQL interogation and advanced statistics through its scipy/numpy modules integration and matplotlib visualization. Moreover, for those with a background into R, Pandas has a similar synthax and can be used interchangeably with R in Python itself through the rpy2 module. We will investigate Rpy in the Python extension chapter, now we will focus on Pandas.

http://pandas.pydata.org/pandas-docs/dev/comparison_with_r.html

http://pandas.pydata.org/pandas-docs/dev/r_interface.html

Next is a small hands-on introduction, with tips on creation, selection, filtering and grouping.

TODO: add small pivoting example http://stackoverflow.com/questions/19961490/construct-pandas-dataframe-from-list-of-tuples



In [2]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df2 = pd.DataFrame({'number' : 1.,
                    'dates' : pd.date_range('20150720',periods=4),
                    'floats' : pd.Series(0.25,index=list(range(4)),dtype='float32'),
                    'integers' : np.array([3] * 4,dtype='int32'),
                    'cathegories' : pd.Categorical(["test","train","test","train"]),
                    'string' : 'foo' })

print(df2)









    



  cathegories      dates  floats  integers  number string
0        test 2015-07-20    0.25         3     1.0    foo
1       train 2015-07-21    0.25         3     1.0    foo
2        test 2015-07-22    0.25         3     1.0    foo
3       train 2015-07-23    0.25         3     1.0    foo



In [15]:

    
df1.sort(columns='B')



In [7]:

    
# Description

df2.dtypes
#try tab completion, column labels can be easily retrieved
#df2.
df2.head(2)
df1.describe()
df1.T
#sort by axis
print(df1.sort_index(axis=0, ascending=False))
#sort by values
print(df1.sort(columns='B'))









    



             A         B         C         D         F
10.0 -0.312356  1.970467 -0.043003  0.159953  0.882028
9.0  -0.111822  2.404463 -0.766551 -0.366889 -0.534442
2.0   1.681985  0.112361  1.344568  0.830033 -0.335651
             A         B         C         D         F
2.0   1.681985  0.112361  1.344568  0.830033 -0.335651
10.0 -0.312356  1.970467 -0.043003  0.159953  0.882028
9.0  -0.111822  2.404463 -0.766551 -0.366889 -0.534442






    



/home/sergiun/programs/anaconda3/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:12: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)



In [6]:

    
df1.T



In [13]:

    
# Selection and slicing. And filtering.
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
# selection by labels
#df1
#df1['A']
#df1.A
#print(df1[0:2])
df1.loc[2:10,['A','B']]
#print df1.loc[2,'A']



In [23]:

    
df1.loc[2:10,['A','B']]



In [19]:

    
# selection by position
#df1
#df1.iloc[2]
#df1.iloc[0:2,0:3]
#df1.iloc[[1,2,],[0,2]]
print(df1)
df1.iloc[1,1]









    



             A         B         C         D         F
9.0   1.489061  0.233563 -1.425374 -1.935047  0.314065
2.0   0.776222  1.519814 -0.838998  0.398798 -1.709546
10.0  0.385811  0.044610 -1.041533  1.266857  1.133820






    Out[19]:





1.5198138664523637



In [21]:

    
# filter by boolean test
#print df1
#print df1[df1.A > 0.1]
#print df1[df1 > 0.1]
print(df1[(df1['A'] > 0.1) & (df1['B']>0.2)])









    



            A         B         C         D         F
9.0  1.489061  0.233563 -1.425374 -1.935047  0.314065
2.0  0.776222  1.519814 -0.838998  0.398798 -1.709546



In [22]:

    
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
# filter by values
df3 = df1.copy()
df3['mask']=['one','one','two']
df3['test'] = [1, 2, 3]
print(df3)
print(df3[df3['mask'].isin(['one'])])
#df1.dropna()

df1['test2'] = df3['test']
#print df1









    



             A         B         C         D         F mask  test
9.0   0.860532 -0.016485  2.227145  0.446174 -0.964432  one     1
2.0   0.963801 -1.618940 -1.333987 -0.398223 -0.244665  one     2
10.0 -0.766295 -1.518846  1.198762 -0.071410 -0.568117  two     3
            A         B         C         D         F mask  test
9.0  0.860532 -0.016485  2.227145  0.446174 -0.964432  one     1
2.0  0.963801 -1.618940 -1.333987 -0.398223 -0.244665  one     2



In [31]:

    
# Setting values by label, position and numpy array

df1.loc[:,'A'] = 0
df1.iat[0,1] = 5
df1.loc[:,'D'] = np.array([10] * len(df1))
print df1









    



    A         B         C   D         F
9   0  5.000000  0.659238  10  0.268067
2   0  0.068065  0.537079  10 -0.930164
10  0  0.277067 -0.435075  10  0.519541



In [41]:

    
# Categories and grouping
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df3 = df1.copy()
df3['mask']=['one', 'one','two']
df3["vtype"] = df3["mask"].astype("category")
print df3
print df3["vtype"]
df3["vtype"].cat.categories = ["nice", "nasty"]
print df3
print df3.groupby("vtype").size()









    



           A         B         C         D         F mask vtype
9  -0.969339 -1.247162 -0.172333 -1.604282 -0.720619  one   one
2  -1.096683  0.984533  0.228007  1.707215  0.884176  one   one
10 -0.208973 -0.147277 -0.031264 -0.524877 -0.172902  two   two
9     one
2     one
10    two
Name: vtype, dtype: category
Categories (2, object): [one < two]
           A         B         C         D         F mask  vtype
9  -0.969339 -1.247162 -0.172333 -1.604282 -0.720619  one   nice
2  -1.096683  0.984533  0.228007  1.707215  0.884176  one   nice
10 -0.208973 -0.147277 -0.031264 -0.524877 -0.172902  two  nasty
vtype
nice     2
nasty    1
dtype: int64



In [28]:

    
# Basic operations
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df1['diff'] = df1.A - df1.B
df1['A'].map(lambda x : 10 * x)
print df1.applymap(np.sqrt)
print df1.apply(np.sqrt, axis = 1)
print df1.apply(np.sum, axis = 1)









    



           A         B         C         D        F      diff
9   0.866943  0.542561  1.018572       NaN      NaN  0.676178
2   0.643135  0.705118  0.953981  0.264569      NaN       NaN
10  1.205045  1.545023       NaN       NaN  0.55412       NaN
           A         B         C         D        F      diff
9   0.866943  0.542561  1.018572       NaN      NaN  0.676178
2   0.643135  0.705118  0.953981  0.264569      NaN       NaN
10  1.205045  1.545023       NaN       NaN  0.55412       NaN
9    -0.499315
2     1.365650
10    1.990501
dtype: float64

SciPy: more advanced statistics

SciPy is one of the most popular scientific computing libraries for Python. It is part of the so called "NumPy stack" of modules for scientific computing, along with NumPy, SymPy, Matplotlib and Pandas. SciPy does much more than just statistics, one could view it as a free and arguably more popular alternative to MatLab, one of the private scientific computing environments. Scipy can do pretty much anything that needs numerical computing, but in this chapter we will only use it for statistics.

T-test

One of the basic application of statistics is testing if a null hypothesis is true. In the previous application of Pandas for gene expression studies we observed that the data was normalized, with its averages matching closely. Suppose we want to know if in statistical terms these averages are matching sufficiently close. A simple way to check this is to compute all pair t-tests and see of all our obtained P - values are falling under the 0.005 bar.

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html



In [24]:

    
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('data/gex.txt', sep = '\t', index_col = 0)
print(df.head(4))
#df.iloc[:,1:].astype(float)
#print df.dtypes
#print type(df.GSM21712.values)

pmatrix = np.zeros((6,6))# the P-value matrix
i = -1
for ci in df.columns[1:]:
    i += 1
    a = df[ci].values
    a = a[~np.isnan(a)]#Removing undefined elements
    #a = df[ci].values.astype(float)
    #print a.dtype, type(a[0])
    j = -1
    for cj in df.columns[1:]:
        j += 1
        if ci == cj: continue
        b = df[cj].values
        b = b[~np.isnan(b)]
        t, p = stats.ttest_ind(a, b, equal_var = False)
        #print np.isnan(a).any(), np.isnan(b).any()
        pmatrix[i,j]=p

np.set_printoptions(linewidth=200)
np.set_printoptions(precision=2)
print(pmatrix)
df.boxplot()









    



          GeneID  GSM21712  GSM21713  GSM21714  GSM21715  GSM21716  GSM21718
ID                                                                          
1007_s_at    780  7.002916  7.018323  6.742437  6.725877  7.051872  6.883046
1053_at     5982  5.912374  5.905245  5.678275  5.765487  5.652194  5.845824
117_at      3310  4.623223  4.572847  4.841309  4.319575  4.489222  4.573047
121_at      7849  7.619362  7.573693  7.405601  7.552200  7.675400  7.223818
[[  0.00e+00   4.55e-06   7.28e-09   5.75e-16   2.26e-03   7.81e-06]
 [  4.55e-06   0.00e+00   2.54e-01   4.74e-04   1.33e-01   9.50e-01]
 [  7.28e-09   2.54e-01   0.00e+00   1.66e-02   7.92e-03   2.34e-01]
 [  5.75e-16   4.74e-04   1.66e-02   0.00e+00   6.26e-07   4.32e-04]
 [  2.26e-03   1.33e-01   7.92e-03   6.26e-07   0.00e+00   1.54e-01]
 [  7.81e-06   9.50e-01   2.34e-01   4.32e-04   1.54e-01   0.00e+00]]






    



/home/sergiun/programs/anaconda3/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:33: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.






    Out[24]:





{'boxes': [<matplotlib.lines.Line2D at 0x7fcf2ddc7390>,
  <matplotlib.lines.Line2D at 0x7fcf2dd657f0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd7b160>,
  <matplotlib.lines.Line2D at 0x7fcf2dd8da90>,
  <matplotlib.lines.Line2D at 0x7fcf2dd23400>,
  <matplotlib.lines.Line2D at 0x7fcf2dd34d30>],
 'caps': [<matplotlib.lines.Line2D at 0x7fcf2dd5ae10>,
  <matplotlib.lines.Line2D at 0x7fcf2dd5af28>,
  <matplotlib.lines.Line2D at 0x7fcf2dd70780>,
  <matplotlib.lines.Line2D at 0x7fcf2dd70f98>,
  <matplotlib.lines.Line2D at 0x7fcf2dd81fd0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd88908>,
  <matplotlib.lines.Line2D at 0x7fcf2dd18a20>,
  <matplotlib.lines.Line2D at 0x7fcf2dd18b38>,
  <matplotlib.lines.Line2D at 0x7fcf2dd29c50>,
  <matplotlib.lines.Line2D at 0x7fcf2dd2eba8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd40cc0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd40dd8>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fcf2dd60f98>,
  <matplotlib.lines.Line2D at 0x7fcf2dd76ef0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd8d978>,
  <matplotlib.lines.Line2D at 0x7fcf2dd1dba8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd34c18>,
  <matplotlib.lines.Line2D at 0x7fcf2dd45e48>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x7fcf2dd60780>,
  <matplotlib.lines.Line2D at 0x7fcf2dd760f0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd88a20>,
  <matplotlib.lines.Line2D at 0x7fcf2dd1d390>,
  <matplotlib.lines.Line2D at 0x7fcf2dd2ecc0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd45630>],
 'whiskers': [<matplotlib.lines.Line2D at 0x7fcf2ddc7da0>,
  <matplotlib.lines.Line2D at 0x7fcf2ddc7eb8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd65fd0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd69f28>,
  <matplotlib.lines.Line2D at 0x7fcf2dd7bf60>,
  <matplotlib.lines.Line2D at 0x7fcf2dd81898>,
  <matplotlib.lines.Line2D at 0x7fcf2dd129b0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd12ac8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd23be0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd29b38>,
  <matplotlib.lines.Line2D at 0x7fcf2dd38c50>,
  <matplotlib.lines.Line2D at 0x7fcf2dd38d68>]}

ANOVA

.. But, in order not to get incidentally murdered by a reviewer with background in statistics, one should use a single test to rule them all, popularly called ANOVA.



In [45]:

    
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('data/gex.txt', sep = '\t', index_col = 0)

sl = [] #sample list
for ci in df.columns[1:]:
    a = df[ci].values
    a = a[~np.isnan(a)]#Removing undefined elements
    sl.append(a)

print stats.f_oneway(*sl)









    



(14.440905160749136, 3.5930324483772904e-14)

Linear regression

Ever felt the need to fit a straight trendline to a number of points in a scatter plot? This is called linear regression. Here is how you can do LR with numpy. As a useful exercise, let us generate our own dataset.

How is the line fitted? The most basic methodology is least square minimization of the squared standard error. One can do least square fitting directly in numpy or scikit-learn to obtain the same results.



In [46]:

    
%matplotlib inline

import numpy as np
import pylab as plt
from scipy import stats

nsamp = 30
x = np.linspace(0, 3, nsamp)
y = -0.5*x + 1 #next line randomizes these aligned datapoints
yr = y + .5*np.random.normal(size=nsamp)

slope, intercept, r_value, p_value, std_err = stats.linregress(x,yr)
line = slope*x+intercept

plt.plot(x,line,'r-', x,yr,'o', x,y,'b-')









    Out[46]:





[<matplotlib.lines.Line2D at 0x7f0d3501bed0>,
 <matplotlib.lines.Line2D at 0x7f0d35029190>,
 <matplotlib.lines.Line2D at 0x7f0d35029850>]

Curve fitting and parameter estimation

When the data points are multidimensional you will use more complex multivariate regression techniques, but we will discuss that at more length in the machine learning chapter. For the moment, let us use a similar exercise as before, but fit a curve instead. While not strictly statistics related, this exercise can be useful for example if we want to decide how a probability distribution fits our data. We will use the least-square again, through the optimization module of scipy.



In [27]:

    
%matplotlib inline
import numpy as np
import pylab as plt
from scipy import optimize

nsamp = 30
x = np.linspace(0,1,nsamp)
"""
y = -0.5*x**2 + 7*sin(x)
This is what we try to fit against. Suppose we know our function is generated
by this law and want to find the (-0.5, 7) parameters. Alternatively we might
not know anything about this dataset but just want to fit this curve to it.
"""

f = lambda p, x: p[0]*x*x + p[1]*np.sin(x)
#Exercise: define a normal Python function f() instead!

testp = (1, 20)
y = f(testp,x)
yr = y + .5*np.random.normal(size=nsamp)

e = lambda p, x, y: (abs((f(p,x)-y))).sum()

p0 = (5, 20) # initial parameter value
p = optimize.fmin(e, p0, args=(x,yr))
yp = f(p,x)

print("estimated vs real", p, testp)
plt.plot(x,yp,'r-', x,yr,'o', x,y,'b-')









    



Optimization terminated successfully.
         Current function value: 13.261850
         Iterations: 68
         Function evaluations: 129
estimated vs real [ -0.6   21.56] (1, 20)






    Out[27]:





[<matplotlib.lines.Line2D at 0x7fcf2d754ef0>,
 <matplotlib.lines.Line2D at 0x7fcf2d75b0f0>,
 <matplotlib.lines.Line2D at 0x7fcf2d75bac8>]

Exercise: Statistical enrichment analysis

What is enrichment? Before we move on, we should get back to something that is always obsessing biologists with data: putting a P-value on your findings. If all you have is a series of numbers of a series of series it may be that the T-tests and ANOVA will be quite good, but what if you have cathegorical data, as many times it happens in biology? What if the question is "I have a bag of vegetables, to which of the following vegetable racks is my bag belonging more?". If the answer also takes into account how many items each of the vegetable racks is holding then you can sove it with statistical enrichment testing. Of course no one at this course will argue with your shopping habits if you don't think the number of vegies on a store racks is important! As there are many cathegorical sets of functional annotations in biology, we will only focus on Gene Ontology

Gene Ontology To use the GO annotations we typically need to know if the genes/protein of a specific organism are annotated with a certain GO label (GO id), while concurrent labels are possible for every gene/protein since each can have multiple roles in biology. The annotations are structured in a tree, with the branches having inherited all annotations from the leafs. You can read more about it here.

Extra task: Download raw annotation files and program your own Python GO module. One key aspect is annotations must be inherited through the tree. The elegant way to achieve the tree traversal is using a recursive function. This is a hard task, that needs hours to complete but it can be very instructive.

For the purpose of this course in order to parse the GO annotations we will use a Python library called Orange, that has other useful modules as well. Other alternatives are calling an R package from within Python (recomending topGO for enrichment and GO.db for interogation) or if you are a Perl senior then call a BIO::Perl module. BioPython is also preparing a GO module so be sure to check their package once every couple of years.

Enrichment test

In the figure below you can see that there are four sets forming up when we intersect our custom set of genes with the annotation database. The purpose of the enrichment test is to tell how likely is the four sets overlap. To answer this a hypergeometric test is commonly used, known as Fischer's exact test. Let us put it into Python!

http://en.wikipedia.org/wiki/Fisher%27s_exact_test

Note that multiple testing corrections like the Bonferroni correction must also be executed, but obtaining the raw P-values is sufficient for the purpose of this course.



In [1]:

    
%run ./lib/svgoutput.py
"""
GO enrichment figure displayed using IPython's native SVG frontend
"""
scene = SVGScene(500,600)
scene.text((50,50),"GO enrichment schema")


scene.circle((170,200),35,(100,100,100), 0.5)
scene.text((190,160),"G: genes annotated to GO_id",size = 11)
scene.circle((200,200),80,(200,200,200), 0.1)
scene.text((190,110),"X: all annotated genes",size = 11)
scene.circle((100,200),50,(100,100,100), 0.5)
scene.text((50,140),"T: test gene set",size = 11)

scene.text((170,200),"a", size = 11)
scene.text((140,200),"b", size = 11)
scene.text((125,200),"c", size = 11)
scene.text((250,200),"d", size = 11)

scene









    Out[1]:



In [ ]:

Perform the statistical test

                In Set  Outside Set
    GOid ann.     b       a
Not GOid ann.     c       d

Read the docs: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html



In [ ]:

    
"""
Using R to get the annotations and perform GO enrichment
TODO: not finished, only for reference
See here something similar:
http://bcb.io/2009/10/18/gene-ontology-analysis-with-python-and-bioconductor/
"""

import rpy2.robjects as robjects
def get_go_children(go_term, go_term_type):
    robjects.r('''
        library(GO.db)
    ''')
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        children.extend(new_children)
        to_check = new_children
    children = list(set(children))
    return children



In [48]:

    
"""
Using Orange to get annotations
"""
from orangecontrib.bio import go
import sys

ontology = go.Ontology()
# Print names and definitions of all terms with "apoptosis" in the name
terms = [term for term in ontology.terms.values() if "apoptosis" in term.name.lower()]
#for term in terms: print term.id, term.name, term.def_
##Catch question: Why def_?
# Load annotations for yeast.
annotations = go.Annotations("sgd", ontology=ontology)

#Review: Object introspection
#print dir(annotations)
#print go.__file__
#print dir(annotations.get_all_annotations)
#print annotations.get_all_annotations.__doc__
#print dir(ontology.terms)
#print ontology.terms.values()[0]
#print annotations.get_annotated_terms.__doc__

go = {}
for term in ontology.terms.values():
    ants = annotations.get_all_annotations(term.id)
    gs = set([a.gene_name for a in ants])
    if len(gs)>0: go[term.id] = gs

print len(go)


        # res = annotations.get_enriched_terms(["YGR270W", "YIL075C", "YDL007W"])

# gene = annotations.alias_mapper["YIL075C"]
# print(gene + " (YIL075C) directly annotated to the following terms:")
# for a in annotations.gene_annotations[gene]:
#     print(ontology[a.GO_ID].name + " with evidence code " + a.Evidence_Code)

# # Get all genes annotated to the same terms as YIL075C
# ids = set([a.GO_ID for a in annotations.gene_annotations[gene]])
# for termid in ids:
#     ants = annotations.get_all_annotations(termid)
#     genes = set([a.DB_Object_Symbol for a in ants])
#     print(", ".join(genes) +" annotated to " + termid + " " + ontology[termid].name)

Now we want to choose a set of genes to test enrichment, so we output a few terms together with their annotated genes:



In [ ]:

    
for gid in go:
    if len(go[gid])>5 and len(go[gid])<20:
        print gid, ontology[gid].name, go[gid]



In [48]:



In [49]:

    
testset = set(['KAR2', 'EGD1', 'EGD2', 'YBR137W', 'MDY2', 'SEC72', 'SGT2', 'SEC66', 'SEC61', 'SEC62', 'SEC63', 'LHS1', 'SSS1', 'BTT1', 'GET3', 'GET4'])
testGO = 'GO:0006620'# posttranslational protein targeting to membrane set

#testset = set(['ENV9', 'VPS41', 'ENV7', 'ATG15', 'ATG18', 'ENV11'])
#testGO = 'GO:0006624'#GO:0006624 vacuolar protein processing set

#annotatios.get_enriched_terms(genes, aspect=["P"])
#res = annotations.get_enriched_terms(list(testset), use_fdr = False)

# print("Enriched terms:")
# for go_id, (genes, p_value, ref) in res.items():
#     if p_value < 0.05:
#         print(go_id + " " + ontology[go_id].name + " with p-value: %.4f " % p_value + ", ".join(genes))
#print [(gid,ontology[gid].name,res[gid]) for gid in res.keys() if 'protein' in ontology[gid].name]
#print [(gid,ontology[gid].name,res[gid]) for gid in res.keys()[0:10]]
#import pandas as pd
#df = pd.DataFrame([(gid,ontology[gid].name, res[gid][1]) for gid in res.keys()])
#df.sort(columns = 2)
##TODO: Best GO enrichment in Orange has no members in the test set. Is Orange broken?
#print go['GO:0006595']

import numpy as np
from scipy import stats

T = testset
X = set()
for goid in go:
    X = X | go[goid]
print "Total number of annotated genes:", len(X)

rec = []#will hold the results tupple
for goid in go:
    G = go[goid]
    a = G - T
    b = T & G
    c = (X & T) - G
    d = X - T - G
    oddsratio, pvalue = stats.fisher_exact([[len(b), len(a)], [len(c), len(d)]])
    rec.append((goid, ontology[goid].name, pvalue))

df = pd.DataFrame(rec)
df.sort(columns = 2)









    



Total number of annotated genes: 6380






    Out[49]:






  
    
      
      0
      1
      2
    
  
  
    
      43  
       GO:0006620
         posttranslational protein targeting to membrane
       4.809643e-47
    
    
      3887
       GO:0045047
                                 protein targeting to ER
       1.778213e-37
    
    
      4639
       GO:0072599
       establishment of protein localization to endop...
       2.916269e-37
    
    
      2949
       GO:0070972
           protein localization to endoplasmic reticulum
       6.379422e-36
    
    
      3873
       GO:0006612
                           protein targeting to membrane
       3.103997e-34
    
    
      7252
       GO:0090150
       establishment of protein localization to membrane
       1.226710e-29
    
    
      257 
       GO:0072657
                        protein localization to membrane
       1.975087e-29
    
    
      7646
       GO:0006605
                                       protein targeting
       1.247136e-22
    
    
      4641
       GO:0072594
       establishment of protein localization to organ...
       1.767239e-22
    
    
      5884
       GO:0044802
                   single-organism membrane organization
       1.981671e-22
    
    
      6783
       GO:0061024
                                   membrane organization
       2.485697e-22
    
    
      5266
       GO:0031204
       posttranslational protein targeting to membran...
       1.706860e-21
    
    
      1692
       GO:0033365
                       protein localization to organelle
       4.011794e-21
    
    
      1518
       GO:0006886
                         intracellular protein transport
       1.311140e-20
    
    
      398 
       GO:1902580
                   single-organism cellular localization
       2.956570e-20
    
    
      1851
       GO:0034613
                           cellular protein localization
       1.468180e-18
    
    
      1378
       GO:0070727
                     cellular macromolecule localization
       2.966828e-18
    
    
      811 
       GO:0016482
                                   cytoplasmic transport
       7.606578e-18
    
    
      4945
       GO:0015031
                                       protein transport
       1.319491e-17
    
    
      3728
       GO:0045184
                   establishment of protein localization
       2.508306e-17
    
    
      397 
       GO:1902582
                 single-organism intracellular transport
       7.233926e-17
    
    
      2796
       GO:0046907
                                 intracellular transport
       3.367064e-16
    
    
      7499
       GO:0008104
                                    protein localization
       4.977274e-16
    
    
      7734
       GO:0051649
                   establishment of localization in cell
       1.490948e-15
    
    
      7050
       GO:0033036
                              macromolecule localization
       7.195909e-15
    
    
      6762
       GO:0071702
                             organic substance transport
       9.396642e-15
    
    
      7739
       GO:0051641
                                   cellular localization
       2.312074e-14
    
    
      4381
       GO:0030867
                    rough endoplasmic reticulum membrane
       3.932540e-14
    
    
      6806
       GO:0005791
                             rough endoplasmic reticulum
       3.932540e-14
    
    
      6576
       GO:0044765
                               single-organism transport
       4.021301e-13
    
    
      ...
      ...
      ...
      ...
    
    
      2679
       GO:0033592
                           RNA strand annealing activity
       1.000000e+00
    
    
      2706
       GO:0034982
                        mitochondrial protein processing
       1.000000e+00
    
    
      2705
       GO:0007039
                protein catabolic process in the vacuole
       1.000000e+00
    
    
      2704
       GO:0007035
                                  vacuolar acidification
       1.000000e+00
    
    
      2703
       GO:0007034
                                      vacuolar transport
       1.000000e+00
    
    
      2702
       GO:0008453
                alanine-glyoxylate transaminase activity
       1.000000e+00
    
    
      2701
       GO:0007031
                                 peroxisome organization
       1.000000e+00
    
    
      2700
       GO:0007030
                                      Golgi organization
       1.000000e+00
    
    
      2699
       GO:0007033
                                    vacuole organization
       1.000000e+00
    
    
      2698
       GO:0007032
                                   endosome organization
       1.000000e+00
    
    
      2697
       GO:0005868
                              cytoplasmic dynein complex
       1.000000e+00
    
    
      2696
       GO:0005869
                                        dynactin complex
       1.000000e+00
    
    
      2695
       GO:0016129
                       phytosteroid biosynthetic process
       1.000000e+00
    
    
      2694
       GO:0016128
                          phytosteroid metabolic process
       1.000000e+00
    
    
      2693
       GO:0016125
                                sterol metabolic process
       1.000000e+00
    
    
      2692
       GO:0015138
             fumarate transmembrane transporter activity
       1.000000e+00
    
    
      2691
       GO:0016126
                             sterol biosynthetic process
       1.000000e+00
    
    
      2690
       GO:0001786
                              phosphatidylserine binding
       1.000000e+00
    
    
      2689
       GO:0015168
             glycerol transmembrane transporter activity
       1.000000e+00
    
    
      2688
       GO:0015169
       glycerol-3-phosphate transmembrane transporter...
       1.000000e+00
    
    
      2687
       GO:0015165
       pyrimidine nucleotide-sugar transmembrane tran...
       1.000000e+00
    
    
      2686
       GO:0071577
                        zinc ion transmembrane transport
       1.000000e+00
    
    
      2685
       GO:0051575
               5'-deoxyribose-5-phosphate lyase activity
       1.000000e+00
    
    
      2684
       GO:0019388
                             galactose catabolic process
       1.000000e+00
    
    
      2683
       GO:0033048
       negative regulation of mitotic sister chromati...
       1.000000e+00
    
    
      2682
       GO:0033043
                    regulation of organelle organization
       1.000000e+00
    
    
      2681
       GO:2000200
       regulation of ribosomal subunit export from nu...
       1.000000e+00
    
    
      2680
       GO:2000203
       regulation of ribosomal large subunit export f...
       1.000000e+00
    
    
      2773
       GO:0005528
                                           FK506 binding
       1.000000e+00
    
    
      7885
       GO:0070525
            threonylcarbamoyladenosine metabolic process
       1.000000e+00
    
  

7886 rows × 3 columns



In [ ]:

	A	B	C	D	F
10	0.939892	-1.073998	-0.463096	0.049837	0.240060
2	1.152203	-0.740541	-0.565459	-0.775712	0.928403
9	-0.291012	-0.726169	-0.753784	0.499030	0.229992

	A	B
2	-0.679790	-1.774782
10	0.761237	-0.367952

	9.0	2.0	10.0
A	-0.111822	1.681985	-0.312356
B	2.404463	0.112361	1.970467
C	-0.766551	1.344568	-0.043003
D	-0.366889	0.830033	0.159953
F	-0.534442	-0.335651	0.882028

	0	1	2
43	GO:0006620	posttranslational protein targeting to membrane	4.809643e-47
3887	GO:0045047	protein targeting to ER	1.778213e-37
4639	GO:0072599	establishment of protein localization to endop...	2.916269e-37
2949	GO:0070972	protein localization to endoplasmic reticulum	6.379422e-36
3873	GO:0006612	protein targeting to membrane	3.103997e-34
7252	GO:0090150	establishment of protein localization to membrane	1.226710e-29
257	GO:0072657	protein localization to membrane	1.975087e-29
7646	GO:0006605	protein targeting	1.247136e-22
4641	GO:0072594	establishment of protein localization to organ...	1.767239e-22
5884	GO:0044802	single-organism membrane organization	1.981671e-22
6783	GO:0061024	membrane organization	2.485697e-22
5266	GO:0031204	posttranslational protein targeting to membran...	1.706860e-21
1692	GO:0033365	protein localization to organelle	4.011794e-21
1518	GO:0006886	intracellular protein transport	1.311140e-20
398	GO:1902580	single-organism cellular localization	2.956570e-20
1851	GO:0034613	cellular protein localization	1.468180e-18
1378	GO:0070727	cellular macromolecule localization	2.966828e-18
811	GO:0016482	cytoplasmic transport	7.606578e-18
4945	GO:0015031	protein transport	1.319491e-17
3728	GO:0045184	establishment of protein localization	2.508306e-17
397	GO:1902582	single-organism intracellular transport	7.233926e-17
2796	GO:0046907	intracellular transport	3.367064e-16
7499	GO:0008104	protein localization	4.977274e-16
7734	GO:0051649	establishment of localization in cell	1.490948e-15
7050	GO:0033036	macromolecule localization	7.195909e-15
6762	GO:0071702	organic substance transport	9.396642e-15
7739	GO:0051641	cellular localization	2.312074e-14
4381	GO:0030867	rough endoplasmic reticulum membrane	3.932540e-14
6806	GO:0005791	rough endoplasmic reticulum	3.932540e-14
6576	GO:0044765	single-organism transport	4.021301e-13
...	...	...	...
2679	GO:0033592	RNA strand annealing activity	1.000000e+00
2706	GO:0034982	mitochondrial protein processing	1.000000e+00
2705	GO:0007039	protein catabolic process in the vacuole	1.000000e+00
2704	GO:0007035	vacuolar acidification	1.000000e+00
2703	GO:0007034	vacuolar transport	1.000000e+00
2702	GO:0008453	alanine-glyoxylate transaminase activity	1.000000e+00
2701	GO:0007031	peroxisome organization	1.000000e+00
2700	GO:0007030	Golgi organization	1.000000e+00
2699	GO:0007033	vacuole organization	1.000000e+00
2698	GO:0007032	endosome organization	1.000000e+00
2697	GO:0005868	cytoplasmic dynein complex	1.000000e+00
2696	GO:0005869	dynactin complex	1.000000e+00
2695	GO:0016129	phytosteroid biosynthetic process	1.000000e+00
2694	GO:0016128	phytosteroid metabolic process	1.000000e+00
2693	GO:0016125	sterol metabolic process	1.000000e+00
2692	GO:0015138	fumarate transmembrane transporter activity	1.000000e+00
2691	GO:0016126	sterol biosynthetic process	1.000000e+00
2690	GO:0001786	phosphatidylserine binding	1.000000e+00
2689	GO:0015168	glycerol transmembrane transporter activity	1.000000e+00
2688	GO:0015169	glycerol-3-phosphate transmembrane transporter...	1.000000e+00
2687	GO:0015165	pyrimidine nucleotide-sugar transmembrane tran...	1.000000e+00
2686	GO:0071577	zinc ion transmembrane transport	1.000000e+00
2685	GO:0051575	5'-deoxyribose-5-phosphate lyase activity	1.000000e+00
2684	GO:0019388	galactose catabolic process	1.000000e+00
2683	GO:0033048	negative regulation of mitotic sister chromati...	1.000000e+00
2682	GO:0033043	regulation of organelle organization	1.000000e+00
2681	GO:2000200	regulation of ribosomal subunit export from nu...	1.000000e+00
2680	GO:2000203	regulation of ribosomal large subunit export f...	1.000000e+00
2773	GO:0005528	FK506 binding	1.000000e+00
7885	GO:0070525	threonylcarbamoyladenosine metabolic process	1.000000e+00