Statistics libraries

A number of python lybraries are widely used for statistical computation. Here is a small list:

  • As Python comes with battery included, it has its own statistics functions.
  • For basic statistics one need not go further than Pandas, a module designed precisely for managing data that comes in a any tabular form.
  • numpy/scipy.stats is a submodule dealing exclusively with statistics.
  • The scikits stack, useful for many research fields, many of which have their own statistics submodules. Scikit-learn is leaned upon machine learning, but some of its modules, like the generalized linear models are part of some statistics textbooks.
  • Statsmodels While still in beta, this is probably the most convenient library for the programming agnostic. We did not use it because practicing with scipy is more important for this short course.
  • PyMC is targeted towards Bayesian statistics and MCMC estimation, but one can use scikit-learn instead. This also goes for a number of other machine learning related small packages, but hey the Python world is free for your taking, and sometimes they hold hidden gems not found in the main libraries.
  • A popular way of doing statistics in Python is also running R statistics packages from withing Python, using Rpy2. We will get to that in another chapter.

Statistical data structures with Pandas: Series and Dataframes

Languages developed primarily for statistics are using rich data structures that are available also in Python. One of the most important modules to use in Pandas. Let us have a look at Pandas API index.

http://pandas.pydata.org/pandas-docs/dev/api.html

Do you recognize some of these structures and can you name equivalent functionality from other languages like R or tabular computing programs like Excel? Pandas can do everything Excel does and then some more, like SQL interogation and advanced statistics through its scipy/numpy modules integration and matplotlib visualization. Moreover, for those with a background into R, Pandas has a similar synthax and can be used interchangeably with R in Python itself through the rpy2 module. We will investigate Rpy in the Python extension chapter, now we will focus on Pandas.

http://pandas.pydata.org/pandas-docs/dev/comparison_with_r.html

http://pandas.pydata.org/pandas-docs/dev/r_interface.html

Next is a small hands-on introduction, with tips on creation, selection, filtering and grouping.

TODO: add small pivoting example http://stackoverflow.com/questions/19961490/construct-pandas-dataframe-from-list-of-tuples


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df2 = pd.DataFrame({'number' : 1.,
                    'dates' : pd.date_range('20150720',periods=4),
                    'floats' : pd.Series(0.25,index=list(range(4)),dtype='float32'),
                    'integers' : np.array([3] * 4,dtype='int32'),
                    'cathegories' : pd.Categorical(["test","train","test","train"]),
                    'string' : 'foo' })

print(df2)


  cathegories      dates  floats  integers  number string
0        test 2015-07-20    0.25         3     1.0    foo
1       train 2015-07-21    0.25         3     1.0    foo
2        test 2015-07-22    0.25         3     1.0    foo
3       train 2015-07-23    0.25         3     1.0    foo

In [15]:
df1.sort(columns='B')


Out[15]:
A B C D F
10 0.939892 -1.073998 -0.463096 0.049837 0.240060
2 1.152203 -0.740541 -0.565459 -0.775712 0.928403
9 -0.291012 -0.726169 -0.753784 0.499030 0.229992

In [7]:
# Description

df2.dtypes
#try tab completion, column labels can be easily retrieved
#df2.
df2.head(2)
df1.describe()
df1.T
#sort by axis
print(df1.sort_index(axis=0, ascending=False))
#sort by values
print(df1.sort(columns='B'))


             A         B         C         D         F
10.0 -0.312356  1.970467 -0.043003  0.159953  0.882028
9.0  -0.111822  2.404463 -0.766551 -0.366889 -0.534442
2.0   1.681985  0.112361  1.344568  0.830033 -0.335651
             A         B         C         D         F
2.0   1.681985  0.112361  1.344568  0.830033 -0.335651
10.0 -0.312356  1.970467 -0.043003  0.159953  0.882028
9.0  -0.111822  2.404463 -0.766551 -0.366889 -0.534442
/home/sergiun/programs/anaconda3/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:12: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)

In [6]:
df1.T


Out[6]:
9.0 2.0 10.0
A -0.111822 1.681985 -0.312356
B 2.404463 0.112361 1.970467
C -0.766551 1.344568 -0.043003
D -0.366889 0.830033 0.159953
F -0.534442 -0.335651 0.882028

In [13]:
# Selection and slicing. And filtering.
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
# selection by labels
#df1
#df1['A']
#df1.A
#print(df1[0:2])
df1.loc[2:10,['A','B']]
#print df1.loc[2,'A']


Out[13]:
A B
2.0 0.776222 1.519814
10.0 0.385811 0.044610

In [23]:
df1.loc[2:10,['A','B']]


Out[23]:
A B
2 -0.679790 -1.774782
10 0.761237 -0.367952

In [19]:
# selection by position
#df1
#df1.iloc[2]
#df1.iloc[0:2,0:3]
#df1.iloc[[1,2,],[0,2]]
print(df1)
df1.iloc[1,1]


             A         B         C         D         F
9.0   1.489061  0.233563 -1.425374 -1.935047  0.314065
2.0   0.776222  1.519814 -0.838998  0.398798 -1.709546
10.0  0.385811  0.044610 -1.041533  1.266857  1.133820
Out[19]:
1.5198138664523637

In [21]:
# filter by boolean test
#print df1
#print df1[df1.A > 0.1]
#print df1[df1 > 0.1]
print(df1[(df1['A'] > 0.1) & (df1['B']>0.2)])


            A         B         C         D         F
9.0  1.489061  0.233563 -1.425374 -1.935047  0.314065
2.0  0.776222  1.519814 -0.838998  0.398798 -1.709546

In [22]:
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
# filter by values
df3 = df1.copy()
df3['mask']=['one','one','two']
df3['test'] = [1, 2, 3]
print(df3)
print(df3[df3['mask'].isin(['one'])])
#df1.dropna()

df1['test2'] = df3['test']
#print df1


             A         B         C         D         F mask  test
9.0   0.860532 -0.016485  2.227145  0.446174 -0.964432  one     1
2.0   0.963801 -1.618940 -1.333987 -0.398223 -0.244665  one     2
10.0 -0.766295 -1.518846  1.198762 -0.071410 -0.568117  two     3
            A         B         C         D         F mask  test
9.0  0.860532 -0.016485  2.227145  0.446174 -0.964432  one     1
2.0  0.963801 -1.618940 -1.333987 -0.398223 -0.244665  one     2

In [31]:
# Setting values by label, position and numpy array

df1.loc[:,'A'] = 0
df1.iat[0,1] = 5
df1.loc[:,'D'] = np.array([10] * len(df1))
print df1


    A         B         C   D         F
9   0  5.000000  0.659238  10  0.268067
2   0  0.068065  0.537079  10 -0.930164
10  0  0.277067 -0.435075  10  0.519541

In [41]:
# Categories and grouping
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df3 = df1.copy()
df3['mask']=['one', 'one','two']
df3["vtype"] = df3["mask"].astype("category")
print df3
print df3["vtype"]
df3["vtype"].cat.categories = ["nice", "nasty"]
print df3
print df3.groupby("vtype").size()


           A         B         C         D         F mask vtype
9  -0.969339 -1.247162 -0.172333 -1.604282 -0.720619  one   one
2  -1.096683  0.984533  0.228007  1.707215  0.884176  one   one
10 -0.208973 -0.147277 -0.031264 -0.524877 -0.172902  two   two
9     one
2     one
10    two
Name: vtype, dtype: category
Categories (2, object): [one < two]
           A         B         C         D         F mask  vtype
9  -0.969339 -1.247162 -0.172333 -1.604282 -0.720619  one   nice
2  -1.096683  0.984533  0.228007  1.707215  0.884176  one   nice
10 -0.208973 -0.147277 -0.031264 -0.524877 -0.172902  two  nasty
vtype
nice     2
nasty    1
dtype: int64

In [28]:
# Basic operations
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df1['diff'] = df1.A - df1.B
df1['A'].map(lambda x : 10 * x)
print df1.applymap(np.sqrt)
print df1.apply(np.sqrt, axis = 1)
print df1.apply(np.sum, axis = 1)


           A         B         C         D        F      diff
9   0.866943  0.542561  1.018572       NaN      NaN  0.676178
2   0.643135  0.705118  0.953981  0.264569      NaN       NaN
10  1.205045  1.545023       NaN       NaN  0.55412       NaN
           A         B         C         D        F      diff
9   0.866943  0.542561  1.018572       NaN      NaN  0.676178
2   0.643135  0.705118  0.953981  0.264569      NaN       NaN
10  1.205045  1.545023       NaN       NaN  0.55412       NaN
9    -0.499315
2     1.365650
10    1.990501
dtype: float64

SciPy: more advanced statistics

SciPy is one of the most popular scientific computing libraries for Python. It is part of the so called "NumPy stack" of modules for scientific computing, along with NumPy, SymPy, Matplotlib and Pandas. SciPy does much more than just statistics, one could view it as a free and arguably more popular alternative to MatLab, one of the private scientific computing environments. Scipy can do pretty much anything that needs numerical computing, but in this chapter we will only use it for statistics.

T-test

One of the basic application of statistics is testing if a null hypothesis is true. In the previous application of Pandas for gene expression studies we observed that the data was normalized, with its averages matching closely. Suppose we want to know if in statistical terms these averages are matching sufficiently close. A simple way to check this is to compute all pair t-tests and see of all our obtained P - values are falling under the 0.005 bar.

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html


In [24]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('data/gex.txt', sep = '\t', index_col = 0)
print(df.head(4))
#df.iloc[:,1:].astype(float)
#print df.dtypes
#print type(df.GSM21712.values)

pmatrix = np.zeros((6,6))# the P-value matrix
i = -1
for ci in df.columns[1:]:
    i += 1
    a = df[ci].values
    a = a[~np.isnan(a)]#Removing undefined elements
    #a = df[ci].values.astype(float)
    #print a.dtype, type(a[0])
    j = -1
    for cj in df.columns[1:]:
        j += 1
        if ci == cj: continue
        b = df[cj].values
        b = b[~np.isnan(b)]
        t, p = stats.ttest_ind(a, b, equal_var = False)
        #print np.isnan(a).any(), np.isnan(b).any()
        pmatrix[i,j]=p

np.set_printoptions(linewidth=200)
np.set_printoptions(precision=2)
print(pmatrix)
df.boxplot()


          GeneID  GSM21712  GSM21713  GSM21714  GSM21715  GSM21716  GSM21718
ID                                                                          
1007_s_at    780  7.002916  7.018323  6.742437  6.725877  7.051872  6.883046
1053_at     5982  5.912374  5.905245  5.678275  5.765487  5.652194  5.845824
117_at      3310  4.623223  4.572847  4.841309  4.319575  4.489222  4.573047
121_at      7849  7.619362  7.573693  7.405601  7.552200  7.675400  7.223818
[[  0.00e+00   4.55e-06   7.28e-09   5.75e-16   2.26e-03   7.81e-06]
 [  4.55e-06   0.00e+00   2.54e-01   4.74e-04   1.33e-01   9.50e-01]
 [  7.28e-09   2.54e-01   0.00e+00   1.66e-02   7.92e-03   2.34e-01]
 [  5.75e-16   4.74e-04   1.66e-02   0.00e+00   6.26e-07   4.32e-04]
 [  2.26e-03   1.33e-01   7.92e-03   6.26e-07   0.00e+00   1.54e-01]
 [  7.81e-06   9.50e-01   2.34e-01   4.32e-04   1.54e-01   0.00e+00]]
/home/sergiun/programs/anaconda3/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:33: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
Out[24]:
{'boxes': [<matplotlib.lines.Line2D at 0x7fcf2ddc7390>,
  <matplotlib.lines.Line2D at 0x7fcf2dd657f0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd7b160>,
  <matplotlib.lines.Line2D at 0x7fcf2dd8da90>,
  <matplotlib.lines.Line2D at 0x7fcf2dd23400>,
  <matplotlib.lines.Line2D at 0x7fcf2dd34d30>],
 'caps': [<matplotlib.lines.Line2D at 0x7fcf2dd5ae10>,
  <matplotlib.lines.Line2D at 0x7fcf2dd5af28>,
  <matplotlib.lines.Line2D at 0x7fcf2dd70780>,
  <matplotlib.lines.Line2D at 0x7fcf2dd70f98>,
  <matplotlib.lines.Line2D at 0x7fcf2dd81fd0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd88908>,
  <matplotlib.lines.Line2D at 0x7fcf2dd18a20>,
  <matplotlib.lines.Line2D at 0x7fcf2dd18b38>,
  <matplotlib.lines.Line2D at 0x7fcf2dd29c50>,
  <matplotlib.lines.Line2D at 0x7fcf2dd2eba8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd40cc0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd40dd8>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fcf2dd60f98>,
  <matplotlib.lines.Line2D at 0x7fcf2dd76ef0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd8d978>,
  <matplotlib.lines.Line2D at 0x7fcf2dd1dba8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd34c18>,
  <matplotlib.lines.Line2D at 0x7fcf2dd45e48>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x7fcf2dd60780>,
  <matplotlib.lines.Line2D at 0x7fcf2dd760f0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd88a20>,
  <matplotlib.lines.Line2D at 0x7fcf2dd1d390>,
  <matplotlib.lines.Line2D at 0x7fcf2dd2ecc0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd45630>],
 'whiskers': [<matplotlib.lines.Line2D at 0x7fcf2ddc7da0>,
  <matplotlib.lines.Line2D at 0x7fcf2ddc7eb8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd65fd0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd69f28>,
  <matplotlib.lines.Line2D at 0x7fcf2dd7bf60>,
  <matplotlib.lines.Line2D at 0x7fcf2dd81898>,
  <matplotlib.lines.Line2D at 0x7fcf2dd129b0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd12ac8>,
  <matplotlib.lines.Line2D at 0x7fcf2dd23be0>,
  <matplotlib.lines.Line2D at 0x7fcf2dd29b38>,
  <matplotlib.lines.Line2D at 0x7fcf2dd38c50>,
  <matplotlib.lines.Line2D at 0x7fcf2dd38d68>]}

ANOVA

.. But, in order not to get incidentally murdered by a reviewer with background in statistics, one should use a single test to rule them all, popularly called ANOVA.


In [45]:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('data/gex.txt', sep = '\t', index_col = 0)

sl = [] #sample list
for ci in df.columns[1:]:
    a = df[ci].values
    a = a[~np.isnan(a)]#Removing undefined elements
    sl.append(a)

print stats.f_oneway(*sl)


(14.440905160749136, 3.5930324483772904e-14)

Linear regression

Ever felt the need to fit a straight trendline to a number of points in a scatter plot? This is called linear regression. Here is how you can do LR with numpy. As a useful exercise, let us generate our own dataset.

How is the line fitted? The most basic methodology is least square minimization of the squared standard error. One can do least square fitting directly in numpy or scikit-learn to obtain the same results.


In [46]:
%matplotlib inline

import numpy as np
import pylab as plt
from scipy import stats

nsamp = 30
x = np.linspace(0, 3, nsamp)
y = -0.5*x + 1 #next line randomizes these aligned datapoints
yr = y + .5*np.random.normal(size=nsamp)

slope, intercept, r_value, p_value, std_err = stats.linregress(x,yr)
line = slope*x+intercept

plt.plot(x,line,'r-', x,yr,'o', x,y,'b-')


Out[46]:
[<matplotlib.lines.Line2D at 0x7f0d3501bed0>,
 <matplotlib.lines.Line2D at 0x7f0d35029190>,
 <matplotlib.lines.Line2D at 0x7f0d35029850>]

Curve fitting and parameter estimation

When the data points are multidimensional you will use more complex multivariate regression techniques, but we will discuss that at more length in the machine learning chapter. For the moment, let us use a similar exercise as before, but fit a curve instead. While not strictly statistics related, this exercise can be useful for example if we want to decide how a probability distribution fits our data. We will use the least-square again, through the optimization module of scipy.


In [27]:
%matplotlib inline
import numpy as np
import pylab as plt
from scipy import optimize

nsamp = 30
x = np.linspace(0,1,nsamp)
"""
y = -0.5*x**2 + 7*sin(x)
This is what we try to fit against. Suppose we know our function is generated
by this law and want to find the (-0.5, 7) parameters. Alternatively we might
not know anything about this dataset but just want to fit this curve to it.
"""

f = lambda p, x: p[0]*x*x + p[1]*np.sin(x)
#Exercise: define a normal Python function f() instead!

testp = (1, 20)
y = f(testp,x)
yr = y + .5*np.random.normal(size=nsamp)

e = lambda p, x, y: (abs((f(p,x)-y))).sum()

p0 = (5, 20) # initial parameter value
p = optimize.fmin(e, p0, args=(x,yr))
yp = f(p,x)

print("estimated vs real", p, testp)
plt.plot(x,yp,'r-', x,yr,'o', x,y,'b-')


Optimization terminated successfully.
         Current function value: 13.261850
         Iterations: 68
         Function evaluations: 129
estimated vs real [ -0.6   21.56] (1, 20)
Out[27]:
[<matplotlib.lines.Line2D at 0x7fcf2d754ef0>,
 <matplotlib.lines.Line2D at 0x7fcf2d75b0f0>,
 <matplotlib.lines.Line2D at 0x7fcf2d75bac8>]

Exercise: Statistical enrichment analysis

What is enrichment? Before we move on, we should get back to something that is always obsessing biologists with data: putting a P-value on your findings. If all you have is a series of numbers of a series of series it may be that the T-tests and ANOVA will be quite good, but what if you have cathegorical data, as many times it happens in biology? What if the question is "I have a bag of vegetables, to which of the following vegetable racks is my bag belonging more?". If the answer also takes into account how many items each of the vegetable racks is holding then you can sove it with statistical enrichment testing. Of course no one at this course will argue with your shopping habits if you don't think the number of vegies on a store racks is important! As there are many cathegorical sets of functional annotations in biology, we will only focus on Gene Ontology

Gene Ontology To use the GO annotations we typically need to know if the genes/protein of a specific organism are annotated with a certain GO label (GO id), while concurrent labels are possible for every gene/protein since each can have multiple roles in biology. The annotations are structured in a tree, with the branches having inherited all annotations from the leafs. You can read more about it here.

Extra task: Download raw annotation files and program your own Python GO module. One key aspect is annotations must be inherited through the tree. The elegant way to achieve the tree traversal is using a recursive function. This is a hard task, that needs hours to complete but it can be very instructive.

For the purpose of this course in order to parse the GO annotations we will use a Python library called Orange, that has other useful modules as well. Other alternatives are calling an R package from within Python (recomending topGO for enrichment and GO.db for interogation) or if you are a Perl senior then call a BIO::Perl module. BioPython is also preparing a GO module so be sure to check their package once every couple of years.

Enrichment test

In the figure below you can see that there are four sets forming up when we intersect our custom set of genes with the annotation database. The purpose of the enrichment test is to tell how likely is the four sets overlap. To answer this a hypergeometric test is commonly used, known as Fischer's exact test. Let us put it into Python!

http://en.wikipedia.org/wiki/Fisher%27s_exact_test

Note that multiple testing corrections like the Bonferroni correction must also be executed, but obtaining the raw P-values is sufficient for the purpose of this course.


In [1]:
%run ./lib/svgoutput.py
"""
GO enrichment figure displayed using IPython's native SVG frontend
"""
scene = SVGScene(500,600)
scene.text((50,50),"GO enrichment schema")


scene.circle((170,200),35,(100,100,100), 0.5)
scene.text((190,160),"G: genes annotated to GO_id",size = 11)
scene.circle((200,200),80,(200,200,200), 0.1)
scene.text((190,110),"X: all annotated genes",size = 11)
scene.circle((100,200),50,(100,100,100), 0.5)
scene.text((50,140),"T: test gene set",size = 11)

scene.text((170,200),"a", size = 11)
scene.text((140,200),"b", size = 11)
scene.text((125,200),"c", size = 11)
scene.text((250,200),"d", size = 11)

scene


Out[1]:
GO enrichment schemaG: genes annotated to GO_idX: all annotated genesT: test gene setabcd

In [ ]:

Perform the statistical test

                In Set  Outside Set
    GOid ann.     b       a
Not GOid ann.     c       d

Read the docs: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html


In [ ]:
"""
Using R to get the annotations and perform GO enrichment
TODO: not finished, only for reference
See here something similar:
http://bcb.io/2009/10/18/gene-ontology-analysis-with-python-and-bioconductor/
"""

import rpy2.robjects as robjects
def get_go_children(go_term, go_term_type):
    robjects.r('''
        library(GO.db)
    ''')
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        children.extend(new_children)
        to_check = new_children
    children = list(set(children))
    return children

In [48]:
"""
Using Orange to get annotations
"""
from orangecontrib.bio import go
import sys

ontology = go.Ontology()
# Print names and definitions of all terms with "apoptosis" in the name
terms = [term for term in ontology.terms.values() if "apoptosis" in term.name.lower()]
#for term in terms: print term.id, term.name, term.def_
##Catch question: Why def_?
# Load annotations for yeast.
annotations = go.Annotations("sgd", ontology=ontology)

#Review: Object introspection
#print dir(annotations)
#print go.__file__
#print dir(annotations.get_all_annotations)
#print annotations.get_all_annotations.__doc__
#print dir(ontology.terms)
#print ontology.terms.values()[0]
#print annotations.get_annotated_terms.__doc__

go = {}
for term in ontology.terms.values():
    ants = annotations.get_all_annotations(term.id)
    gs = set([a.gene_name for a in ants])
    if len(gs)>0: go[term.id] = gs

print len(go)


        # res = annotations.get_enriched_terms(["YGR270W", "YIL075C", "YDL007W"])

# gene = annotations.alias_mapper["YIL075C"]
# print(gene + " (YIL075C) directly annotated to the following terms:")
# for a in annotations.gene_annotations[gene]:
#     print(ontology[a.GO_ID].name + " with evidence code " + a.Evidence_Code)

# # Get all genes annotated to the same terms as YIL075C
# ids = set([a.GO_ID for a in annotations.gene_annotations[gene]])
# for termid in ids:
#     ants = annotations.get_all_annotations(termid)
#     genes = set([a.DB_Object_Symbol for a in ants])
#     print(", ".join(genes) +" annotated to " + termid + " " + ontology[termid].name)


7886

Now we want to choose a set of genes to test enrichment, so we output a few terms together with their annotated genes:


In [ ]:
for gid in go:
    if len(go[gid])>5 and len(go[gid])<20:
        print gid, ontology[gid].name, go[gid]

In [48]:


In [49]:
testset = set(['KAR2', 'EGD1', 'EGD2', 'YBR137W', 'MDY2', 'SEC72', 'SGT2', 'SEC66', 'SEC61', 'SEC62', 'SEC63', 'LHS1', 'SSS1', 'BTT1', 'GET3', 'GET4'])
testGO = 'GO:0006620'# posttranslational protein targeting to membrane set

#testset = set(['ENV9', 'VPS41', 'ENV7', 'ATG15', 'ATG18', 'ENV11'])
#testGO = 'GO:0006624'#GO:0006624 vacuolar protein processing set

#annotatios.get_enriched_terms(genes, aspect=["P"])
#res = annotations.get_enriched_terms(list(testset), use_fdr = False)

# print("Enriched terms:")
# for go_id, (genes, p_value, ref) in res.items():
#     if p_value < 0.05:
#         print(go_id + " " + ontology[go_id].name + " with p-value: %.4f " % p_value + ", ".join(genes))
#print [(gid,ontology[gid].name,res[gid]) for gid in res.keys() if 'protein' in ontology[gid].name]
#print [(gid,ontology[gid].name,res[gid]) for gid in res.keys()[0:10]]
#import pandas as pd
#df = pd.DataFrame([(gid,ontology[gid].name, res[gid][1]) for gid in res.keys()])
#df.sort(columns = 2)
##TODO: Best GO enrichment in Orange has no members in the test set. Is Orange broken?
#print go['GO:0006595']

import numpy as np
from scipy import stats

T = testset
X = set()
for goid in go:
    X = X | go[goid]
print "Total number of annotated genes:", len(X)

rec = []#will hold the results tupple
for goid in go:
    G = go[goid]
    a = G - T
    b = T & G
    c = (X & T) - G
    d = X - T - G
    oddsratio, pvalue = stats.fisher_exact([[len(b), len(a)], [len(c), len(d)]])
    rec.append((goid, ontology[goid].name, pvalue))

df = pd.DataFrame(rec)
df.sort(columns = 2)


Total number of annotated genes: 6380
Out[49]:
0 1 2
43 GO:0006620 posttranslational protein targeting to membrane 4.809643e-47
3887 GO:0045047 protein targeting to ER 1.778213e-37
4639 GO:0072599 establishment of protein localization to endop... 2.916269e-37
2949 GO:0070972 protein localization to endoplasmic reticulum 6.379422e-36
3873 GO:0006612 protein targeting to membrane 3.103997e-34
7252 GO:0090150 establishment of protein localization to membrane 1.226710e-29
257 GO:0072657 protein localization to membrane 1.975087e-29
7646 GO:0006605 protein targeting 1.247136e-22
4641 GO:0072594 establishment of protein localization to organ... 1.767239e-22
5884 GO:0044802 single-organism membrane organization 1.981671e-22
6783 GO:0061024 membrane organization 2.485697e-22
5266 GO:0031204 posttranslational protein targeting to membran... 1.706860e-21
1692 GO:0033365 protein localization to organelle 4.011794e-21
1518 GO:0006886 intracellular protein transport 1.311140e-20
398 GO:1902580 single-organism cellular localization 2.956570e-20
1851 GO:0034613 cellular protein localization 1.468180e-18
1378 GO:0070727 cellular macromolecule localization 2.966828e-18
811 GO:0016482 cytoplasmic transport 7.606578e-18
4945 GO:0015031 protein transport 1.319491e-17
3728 GO:0045184 establishment of protein localization 2.508306e-17
397 GO:1902582 single-organism intracellular transport 7.233926e-17
2796 GO:0046907 intracellular transport 3.367064e-16
7499 GO:0008104 protein localization 4.977274e-16
7734 GO:0051649 establishment of localization in cell 1.490948e-15
7050 GO:0033036 macromolecule localization 7.195909e-15
6762 GO:0071702 organic substance transport 9.396642e-15
7739 GO:0051641 cellular localization 2.312074e-14
4381 GO:0030867 rough endoplasmic reticulum membrane 3.932540e-14
6806 GO:0005791 rough endoplasmic reticulum 3.932540e-14
6576 GO:0044765 single-organism transport 4.021301e-13
... ... ... ...
2679 GO:0033592 RNA strand annealing activity 1.000000e+00
2706 GO:0034982 mitochondrial protein processing 1.000000e+00
2705 GO:0007039 protein catabolic process in the vacuole 1.000000e+00
2704 GO:0007035 vacuolar acidification 1.000000e+00
2703 GO:0007034 vacuolar transport 1.000000e+00
2702 GO:0008453 alanine-glyoxylate transaminase activity 1.000000e+00
2701 GO:0007031 peroxisome organization 1.000000e+00
2700 GO:0007030 Golgi organization 1.000000e+00
2699 GO:0007033 vacuole organization 1.000000e+00
2698 GO:0007032 endosome organization 1.000000e+00
2697 GO:0005868 cytoplasmic dynein complex 1.000000e+00
2696 GO:0005869 dynactin complex 1.000000e+00
2695 GO:0016129 phytosteroid biosynthetic process 1.000000e+00
2694 GO:0016128 phytosteroid metabolic process 1.000000e+00
2693 GO:0016125 sterol metabolic process 1.000000e+00
2692 GO:0015138 fumarate transmembrane transporter activity 1.000000e+00
2691 GO:0016126 sterol biosynthetic process 1.000000e+00
2690 GO:0001786 phosphatidylserine binding 1.000000e+00
2689 GO:0015168 glycerol transmembrane transporter activity 1.000000e+00
2688 GO:0015169 glycerol-3-phosphate transmembrane transporter... 1.000000e+00
2687 GO:0015165 pyrimidine nucleotide-sugar transmembrane tran... 1.000000e+00
2686 GO:0071577 zinc ion transmembrane transport 1.000000e+00
2685 GO:0051575 5'-deoxyribose-5-phosphate lyase activity 1.000000e+00
2684 GO:0019388 galactose catabolic process 1.000000e+00
2683 GO:0033048 negative regulation of mitotic sister chromati... 1.000000e+00
2682 GO:0033043 regulation of organelle organization 1.000000e+00
2681 GO:2000200 regulation of ribosomal subunit export from nu... 1.000000e+00
2680 GO:2000203 regulation of ribosomal large subunit export f... 1.000000e+00
2773 GO:0005528 FK506 binding 1.000000e+00
7885 GO:0070525 threonylcarbamoyladenosine metabolic process 1.000000e+00

7886 rows × 3 columns


In [ ]: