This chapter approaches statistics from a data analysis and numerical computation perspective; the machine learning aspects come later. As Python comes with batteries included, it has its own statistics functions in the standard library statistics module. Similarly, NumPy offers basic statistical summaries as part of its array functions. The scikits (SciPy toolkits) cover many research fields, and several of them ship their own statistics submodules. [PyMC3] will be presented in another chapter. The rpy2 module can also be useful for integrating R programs. As you can see, third-party libraries present a lot of redundancy, a long-standing subject of disagreement in the community. On top of everything, scikit-learn also has an implementation of generalized linear models. So there isn't always one obvious way of doing things, contrary to Python's Zen and possibly influenced by R's byzantine approach to statistics.
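To see this redundancy in practice, here is a minimal sketch (with made-up numbers) computing the same summaries with the standard library statistics module and with NumPy; note that the two use different defaults for the standard deviation estimator.
In [ ]:
import statistics
import numpy as np
data = [2.5, 3.1, 2.2, 5.4, 3.3]  # a small illustrative sample
print(statistics.mean(data), statistics.stdev(data))  # stdlib: sample standard deviation (n-1)
print(np.mean(data), np.std(data, ddof=1))            # NumPy: ddof=1 gives the same estimator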
Languages developed primarily for statistics use rich data structures that are also available in Python. One of the most important modules here is Pandas. First, have a look at the Pandas API. Do you recognize some of these structures, and can you name equivalent functionality from other languages like R or tabular computing programs like Excel? Pandas can do everything Excel does and then some more, such as SQL-style querying and advanced statistics through its SciPy/NumPy integration, plus visualization through matplotlib. Moreover, for those with a background in R, Pandas has a similar syntax and can be used together with R from Python itself through the rpy2 module.
R related functionality:
Main documentation entry points:
Next is a small hands-on introduction, with tips on creation, selection, filtering and grouping.
In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df2 = pd.DataFrame({'number' : 1.,
                    'dates' : pd.date_range('20150720', periods=4),
                    'floats' : pd.Series(0.25, index=list(range(4)), dtype='float32'),
                    'integers' : np.array([3] * 4, dtype='int32'),
                    'categories' : pd.Categorical(["test","train","test","train"]),
                    'string' : 'foo' })
print(df2)
print(type(df2.values))
In [6]:
df1.sort_values(by='B')
Out[6]:
In [5]:
# Description
print(df2.dtypes)
#try tab completion, column labels can be easily retrieved
#df2.
df2.head(2)
df1.describe()
df1.T
#sort by axis
#print(df1.sort_index(axis=0, ascending=False))
#sort by values
#print(df1.sort_values(by='B'))
Out[5]:
In [13]:
# Selection and slicing. And filtering.
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
# selection by labels
#df1
#df1['A']
#df1.A
#print(df1[0:2])
df1.loc[2:10,['A','B']]
#print(df1.loc[2,'A'])
Out[13]:
In [19]:
# selection by position
#df1
#df1.iloc[2]
#df1.iloc[0:2,0:3]
#df1.iloc[[1,2,],[0,2]]
print(df1)
df1.iloc[1,1]
Out[19]:
In [21]:
# filter by boolean test
#print(df1)
#print(df1[df1.A > 0.1])
#print(df1[df1 > 0.1])
print(df1[(df1['A'] > 0.1) & (df1['B']>0.2)])
In [22]:
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
# filter by values
df3 = df1.copy()
df3['mask']=['one','one','two']
df3['test'] = [1, 2, 3]
print(df3)
print(df3[df3['mask'].isin(['one'])])
#df1.dropna()
df1['test2'] = df3['test']
#print(df1)
In [31]:
# Setting values by label, position and numpy array
df1.loc[:,'A'] = 0
df1.iat[0,1] = 5
df1.loc[:,'D'] = np.array([10] * len(df1))
print(df1)
In [41]:
# Categories and grouping
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df3 = df1.copy()
df3['mask']=['one', 'one','two']
df3["vtype"] = df3["mask"].astype("category")
print(df3)
print(df3["vtype"])
df3["vtype"] = df3["vtype"].cat.rename_categories(["nice", "nasty"])  # relabel the two categories
print(df3)
print(df3.groupby("vtype").size())
In [28]:
# Basic operations
s = pd.Series([1, 9, 2, 10, np.nan, 6])
df1 = pd.DataFrame(np.random.randn(3,5),index=s[1:4],columns=list('ABCDF'))
df1['diff'] = df1.A - df1.B
df1['A'].map(lambda x : 10 * x)
print(df1.applymap(np.sqrt))
print(df1.apply(np.sqrt, axis = 1))
print(df1.apply(np.sum, axis = 1))
Task:
Try to keep your solution as short as possible! Lots of code leaves more room for errors and is harder to read and reproduce.
In [2]:
import numpy as np
gene_table = {'name': ["G" + str(i) for i in range(10)],
              'expression': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
              'read_counts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
              'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
As data science became a thing and armies of analysts entered the workforce, two new concepts grew up around Python and R dataframes. Tidiness was a well-known concept to database programmers, but data scientists were often either inexperienced programmers or people from other backgrounds who had just started to program.
Tidiness rules are as follows: each variable forms a column, each observation forms a row, and each type of observational unit forms its own table (dataframe).
The general idea is that when working with input data you should not keep using the garbled mess in which the data is first loaded. Instead you should extract the pieces of information that are needed for your study, and multiple types of observational units must go into different dataframes. Why is this so important? Because computers are not statisticians: the dataframe concept is unnatural to them, and what they work with much better are memory addresses and values. Having a tidy dataframe eases their work, and it helps reproducibility as well! Getting to a tidy format early on is also useful for separating data munging from data analysis, leading to a cleaner and faster workflow.
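As a small, hypothetical illustration of the first two rules (the column names below are made up, not part of the course data), melt turns a wide table with one column per sample into a tidy one with a single observation per row:
In [ ]:
import pandas as pd
wide = pd.DataFrame({'gene': ['G1', 'G2'],
                     'sample_A': [12.5, 9.0],
                     'sample_B': [16.5, 8.0]})
tidy = wide.melt(id_vars='gene', var_name='sample', value_name='expression')
print(tidy)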
While tidiness has to do with the data, the second concept, method chaining, has to do with cleaner syntax. The best way to understand how it works is to redo the task completed above on the following template:
In [ ]:
import pandas as pd
def process(d):
    df = (pd.DataFrame()
          .pipe(create_matrix, d)  # pass the input dict on to the first step
          .pipe(rem_undefined)
          .pipe(add_seq_quality)
          .pipe(filter_qualify_seq_quality))
    return df
def create_matrix(df, d=None):
    return df

def rem_undefined(df):
    return df

def add_seq_quality(df):
    return df

def filter_qualify_seq_quality(df):
    return df
df = process(gene_table)
df
Now, if one of your functions contains a single instruction that doesn't require much code, you can replace the pipe call with the DataFrame method itself. This is recommended for readability. Try it!
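For instance, if rem_undefined only drops missing values, a sketch of the shortened chain could look like this (assuming the other helper functions from the template above are defined and keep their signatures):
In [ ]:
df = (pd.DataFrame(gene_table)
      .dropna()                          # replaces .pipe(rem_undefined)
      .pipe(add_seq_quality)
      .pipe(filter_qualify_seq_quality))
df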
SciPy is one of the most popular scientific computing libraries for Python. It is part of the so-called "NumPy stack" of modules for scientific computing, along with NumPy, SymPy, Matplotlib and Pandas. SciPy does much more than just statistics; one could view it as a free and arguably more popular alternative to MATLAB, one of the proprietary scientific computing environments. SciPy can do pretty much anything that needs numerical computing, but in this chapter we will only use it for statistics.
T-test
One of the basic applications of statistics is testing whether a null hypothesis is true. In the previous application of Pandas to gene expression studies we observed that the data was normalized, with the sample averages matching closely. Suppose we want to know whether, in statistical terms, these averages match sufficiently closely. A simple way to check this is to compute all pairwise t-tests and see if all the obtained P-values fall under the 0.005 bar.
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html
In [24]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv('data/gex.txt', sep = '\t', index_col = 0)
print(df.head(4))
#df.iloc[:,1:].astype(float)
#print(df.dtypes)
#print(type(df.GSM21712.values))
pmatrix = np.zeros((6,6))  # the P-value matrix
i = -1
for ci in df.columns[1:]:
    i += 1
    a = df[ci].values
    a = a[~np.isnan(a)]  # removing undefined elements
    #a = df[ci].values.astype(float)
    #print(a.dtype, type(a[0]))
    j = -1
    for cj in df.columns[1:]:
        j += 1
        if ci == cj: continue
        b = df[cj].values
        b = b[~np.isnan(b)]
        t, p = stats.ttest_ind(a, b, equal_var = False)
        #print(np.isnan(a).any(), np.isnan(b).any())
        pmatrix[i,j] = p
np.set_printoptions(linewidth=200)
np.set_printoptions(precision=2)
print(pmatrix)
df.boxplot()
Out[24]:
ANOVA
... But, in order not to be taken apart by a reviewer with a background in statistics, one should use a single test to rule them all, popularly called ANOVA.
In [45]:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv('data/gex.txt', sep = '\t', index_col = 0)
sl = []  # sample list
for ci in df.columns[1:]:
    a = df[ci].values
    a = a[~np.isnan(a)]  # removing undefined elements
    sl.append(a)
print(stats.f_oneway(*sl))
Linear regression
Ever felt the need to fit a straight trendline to a number of points in a scatter plot? This is called linear regression. Here is how you can do it with SciPy. As a useful exercise, we generate our own dataset.
How is the line fitted? The most basic methodology is least-squares minimization of the sum of squared residuals. One can also do least-squares fitting directly in NumPy or scikit-learn and obtain the same results; a NumPy sketch follows the cell below.
In [46]:
%matplotlib inline
import numpy as np
import pylab as plt
from scipy import stats
nsamp = 30
x = np.linspace(0, 3, nsamp)
y = -0.5*x + 1 #next line randomizes these aligned datapoints
yr = y + .5*np.random.normal(size=nsamp)
slope, intercept, r_value, p_value, std_err = stats.linregress(x,yr)
line = slope*x+intercept
plt.plot(x,line,'r-', x,yr,'o', x,y,'b-')
Out[46]:
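As mentioned above, the same straight-line fit can be obtained directly in NumPy; here is a minimal sketch with np.polyfit (a degree-1 polynomial), reusing x, yr, slope and intercept from the cell above:
In [ ]:
coeffs = np.polyfit(x, yr, deg=1)   # returns [slope, intercept]
print(coeffs, (slope, intercept))   # should agree with stats.linregress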
Curve fitting and parameter estimation
TODO: moved to the new optimization chapter!
When the data points are multidimensional you will use more complex multivariate regression techniques, but we will discuss that at more length in the machine learning chapter. For the moment, let us do a similar exercise as before, but fit a curve instead. While not strictly statistics, this exercise can be useful, for example, when we want to decide how well a probability distribution fits our data. We will use least squares again, through the optimization module of SciPy.
In [1]:
%matplotlib inline
import numpy as np
import pylab as plt
from scipy import optimize
nsamp = 30
x = np.linspace(0,1,nsamp)
"""
y = -0.5*x**2 + 7*sin(x)
This is what we try to fit against. Suppose we know our function is generated
by this law and want to find the (-0.5, 7) parameters. Alternatively we might
not know anything about this dataset but just want to fit this curve to it.
"""
f = lambda p, x: p[0]*x*x + p[1]*np.sin(x)
#Exercise: define a normal Python function f() instead!
testp = (1, 20)
y = f(testp,x)
yr = y + .5*np.random.normal(size=nsamp)
e = lambda p, x, y: ((f(p,x)-y)**2).sum()  # sum of squared residuals to minimize
p0 = (5, 20) # initial parameter value
p = optimize.fmin(e, p0, args=(x,yr))
yp = f(p,x) # predicted target
print("estimated vs real", p, testp)
plt.title("estimated(red) vs real(blue)")
plt.plot(x,yp,'r-', x,yr,'o', x,y,'b-')
Out[1]:
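As an alternative sketch, scipy.optimize.curve_fit performs the least-squares fit directly and also returns a covariance estimate for the parameters; it expects a function of the form f(x, *params). This reuses x and yr from the cell above:
In [ ]:
from scipy.optimize import curve_fit
def f2(x, a, b):
    return a * x * x + b * np.sin(x)
popt, pcov = curve_fit(f2, x, yr, p0=(5, 20))
print("curve_fit estimate:", popt)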
What is enrichment? Before we move on, we should come back to something that always obsesses biologists with data: putting a P-value on your findings. If all you have is a series of numbers, or a series of such series, t-tests and ANOVA may do quite well, but what if you have categorical data, as so often happens in biology? What if the question is "I have a bag of vegetables; to which of the following vegetable racks does my bag belong best?". If the answer also takes into account how many items each of the vegetable racks holds, then you can solve it with statistical enrichment testing. Of course, no one at this course will argue with your shopping habits if you don't think the number of veggies on a store rack is important! As there are many categorical sets of functional annotations in biology, we will only focus on Gene Ontology.
Gene Ontology
To use the GO annotations we typically need to know whether the genes/proteins of a specific organism are annotated with a certain GO label (GO id); concurrent labels are possible for every gene/protein, since each can have multiple roles in biology. The annotations are structured in a tree, with the branches inheriting all annotations from their leaves. You can read more about it here.
Extra task: Download the raw annotation files and program your own Python GO module. One key aspect is that annotations must be inherited through the tree. The elegant way to achieve the tree traversal is a recursive function. This is a hard task that takes hours to complete, but it can be very instructive.
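To make the recursion idea concrete, here is a minimal sketch on a toy, hypothetical tree (not real GO files): a term's gene set is its own direct annotations plus everything annotated to its child terms.
In [ ]:
children = {'GO:A': ['GO:B', 'GO:C'], 'GO:B': [], 'GO:C': []}    # term -> child terms
direct = {'GO:A': {'g1'}, 'GO:B': {'g2'}, 'GO:C': {'g3', 'g4'}}  # term -> directly annotated genes
def inherited(term):
    genes = set(direct.get(term, set()))
    for child in children.get(term, []):
        genes |= inherited(child)  # recurse into the subtree
    return genes
print(inherited('GO:A'))  # {'g1', 'g2', 'g3', 'g4'}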
For the purpose of this course, in order to parse the GO annotations we will use a Python library called Orange, which has other useful modules as well. Other alternatives are calling an R package from within Python (topGO is recommended for enrichment and GO.db for interrogation), or, if you are a seasoned Perl programmer, calling a BIO::Perl module. BioPython is also preparing a GO module, so be sure to check their package once every couple of years.
Enrichment test
In the figure below you can see that four sets form when we intersect our custom set of genes with the annotation database. The purpose of the enrichment test is to tell how likely the observed overlap between these sets is. To answer this, a hypergeometric test is commonly used, known as Fisher's exact test. Let us put it into Python!
http://en.wikipedia.org/wiki/Fisher%27s_exact_test
Note that multiple testing corrections such as the Bonferroni correction should also be applied, but obtaining the raw P-values is sufficient for the purpose of this course.
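As a minimal sketch (on made-up raw P-values), the Bonferroni correction simply multiplies each P-value by the number of tests and caps the result at 1; statsmodels also offers this and FDR variants through multipletests.
In [ ]:
import numpy as np
raw_p = np.array([0.001, 0.02, 0.2])
bonferroni_p = np.minimum(raw_p * len(raw_p), 1.0)  # multiply by the number of tests, cap at 1
print(bonferroni_p)
# equivalent with statsmodels (also offers 'fdr_bh' and others):
# from statsmodels.stats.multitest import multipletests
# reject, adj_p, _, _ = multipletests(raw_p, method='bonferroni')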
TODO: use GOATOOLS
In [1]:
%run ./lib/svgoutput.py
"""
GO enrichment figure displayed using IPython's native SVG frontend
"""
scene = SVGScene(500,600)
scene.text((50,50),"GO enrichment schema")
scene.circle((170,200),35,(100,100,100), 0.5)
scene.text((190,160),"G: genes annotated to GO_id",size = 11)
scene.circle((200,200),80,(200,200,200), 0.1)
scene.text((190,110),"X: all annotated genes",size = 11)
scene.circle((100,200),50,(100,100,100), 0.5)
scene.text((50,140),"T: test gene set",size = 11)
scene.text((170,200),"a", size = 11)
scene.text((140,200),"b", size = 11)
scene.text((125,200),"c", size = 11)
scene.text((250,200),"d", size = 11)
scene
Out[1]:
In [ ]:
Perform the statistical test
                 In Set    Outside Set
GO id ann.          b           a
Not GO id ann.      c           d
Read the docs: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html
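As a minimal sketch with made-up counts (not the course data), scipy takes the 2x2 table above directly:
In [ ]:
from scipy import stats
oddsratio, pvalue = stats.fisher_exact([[8, 2], [12, 478]])  # [[b, a], [c, d]]
print(oddsratio, pvalue)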
In [ ]:
"""
Using R to get the annotations and perform GO enrichment
TODO: not finished, only for reference
See here something similar:
http://bcb.io/2009/10/18/gene-ontology-analysis-with-python-and-bioconductor/
"""
import rpy2.robjects as robjects
def get_go_children(go_term, go_term_type):
    robjects.r('''
        library(GO.db)
    ''')
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        children.extend(new_children)
        to_check = new_children
    children = list(set(children))
    return children
In [48]:
"""
Using Orange to get annotations
"""
from orangecontrib.bio import go
import sys
ontology = go.Ontology()
# Print names and definitions of all terms with "apoptosis" in the name
terms = [term for term in ontology.terms.values() if "apoptosis" in term.name.lower()]
#for term in terms: print(term.id, term.name, term.def_)
##Catch question: Why def_?
# Load annotations for yeast.
annotations = go.Annotations("sgd", ontology=ontology)
#Review: Object introspection
#print(dir(annotations))
#print(go.__file__)
#print(dir(annotations.get_all_annotations))
#print(annotations.get_all_annotations.__doc__)
#print(dir(ontology.terms))
#print(list(ontology.terms.values())[0])
#print(annotations.get_annotated_terms.__doc__)
go = {}  # note: this rebinds the name 'go' (the Orange module above) to a dict of term id -> gene set
for term in ontology.terms.values():
    ants = annotations.get_all_annotations(term.id)
    gs = set([a.gene_name for a in ants])
    if len(gs) > 0: go[term.id] = gs
print(len(go))
# res = annotations.get_enriched_terms(["YGR270W", "YIL075C", "YDL007W"])
# gene = annotations.alias_mapper["YIL075C"]
# print(gene + " (YIL075C) directly annotated to the following terms:")
# for a in annotations.gene_annotations[gene]:
# print(ontology[a.GO_ID].name + " with evidence code " + a.Evidence_Code)
# # Get all genes annotated to the same terms as YIL075C
# ids = set([a.GO_ID for a in annotations.gene_annotations[gene]])
# for termid in ids:
# ants = annotations.get_all_annotations(termid)
# genes = set([a.DB_Object_Symbol for a in ants])
# print(", ".join(genes) +" annotated to " + termid + " " + ontology[termid].name)
Now we want to choose a set of genes to test enrichment, so we output a few terms together with their annotated genes:
In [ ]:
for gid in go:
    if len(go[gid]) > 5 and len(go[gid]) < 20:
        print(gid, ontology[gid].name, go[gid])
In [49]:
testset = set(['KAR2', 'EGD1', 'EGD2', 'YBR137W', 'MDY2', 'SEC72', 'SGT2', 'SEC66', 'SEC61', 'SEC62', 'SEC63', 'LHS1', 'SSS1', 'BTT1', 'GET3', 'GET4'])
testGO = 'GO:0006620'# posttranslational protein targeting to membrane set
#testset = set(['ENV9', 'VPS41', 'ENV7', 'ATG15', 'ATG18', 'ENV11'])
#testGO = 'GO:0006624'#GO:0006624 vacuolar protein processing set
#annotations.get_enriched_terms(genes, aspect=["P"])
#res = annotations.get_enriched_terms(list(testset), use_fdr = False)
# print("Enriched terms:")
# for go_id, (genes, p_value, ref) in res.items():
# if p_value < 0.05:
# print(go_id + " " + ontology[go_id].name + " with p-value: %.4f " % p_value + ", ".join(genes))
#print [(gid,ontology[gid].name,res[gid]) for gid in res.keys() if 'protein' in ontology[gid].name]
#print [(gid,ontology[gid].name,res[gid]) for gid in res.keys()[0:10]]
#import pandas as pd
#df = pd.DataFrame([(gid,ontology[gid].name, res[gid][1]) for gid in res.keys()])
#df.sort(columns = 2)
##TODO: Best GO enrichment in Orange has no members in the test set. Is Orange broken?
#print go['GO:0006595']
import numpy as np
import pandas as pd
from scipy import stats
T = testset
X = set()
for goid in go:
    X = X | go[goid]
print("Total number of annotated genes:", len(X))
rec = []  # will hold the result tuples
for goid in go:
    G = go[goid]
    a = G - T
    b = T & G
    c = (X & T) - G
    d = X - T - G
    oddsratio, pvalue = stats.fisher_exact([[len(b), len(a)], [len(c), len(d)]])
    rec.append((goid, ontology[goid].name, pvalue))
df = pd.DataFrame(rec)
df.sort_values(by=2)
Out[49]:
Data scientists normally use R for statistical heavy lifting, with a few exceptions:
So there is quite a lot of benefit in getting advanced statistical models working in Python! Statsmodels has something for everyone; here are a few examples:
If you have a background in statistics with R, then you will appreciate the formula API. For example, let us redo the ordinary least squares example, but this time using multiple variables and the formula API.
In [ ]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas
In [6]:
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
Out[6]:
In [7]:
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
Above, the Region variable was treated as categorical, and the central (C) region was absorbed into the intercept. To explicitly force categorical treatment and exclude the intercept, as well as to exemplify applying functions inside formulas, we can do:
In [10]:
def log_plus_1(x):
return np.log(x) + 1.0
res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)*np.log(Wealth) + C(Region) - 1', data=df).fit()
print(res.params)
The formulas are documented in a separate module called patsy.
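As a small sketch of what patsy does behind the scenes (reusing the df loaded above), dmatrices turns a formula string into the design matrices that statsmodels fits on:
In [ ]:
from patsy import dmatrices
y, X = dmatrices('Lottery ~ Literacy + Wealth + C(Region)', data=df, return_type='dataframe')
print(X.head())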
Task:
For biologists, differential expression (DE) is a statistical concept that hardly needs explaining. But I have a personal remark: one of the most important concepts in biological data science is also one of the least understood. This is mainly because of the tendency to use black-box programs or function calls that offer no transparent model fitting. This is not a Pythonic approach!
Here is a simple t-test application for determining the samples that have a differential expression level. Suppose that the machine cannot determine the number of reads for a given gene very accurately. We model the machine error with Poisson distributions of fixed means. We are interested in genes having a remarkably different signal in the first set of samples.
In [1]:
from scipy import stats
import numpy as np
np.random.seed(1) # for repeatability
gex = [stats.poisson(5.0).rvs(50) for i in range(0,2000)]
E1 = stats.poisson(10.0).rvs(25) # Poisson sampling of average 10. expression
E2 = stats.poisson(5.0).rvs(25)
gex.append(np.concatenate((E1, E2), axis = None))
X = np.stack(gex)
In [2]:
from scipy.stats.distributions import norm
C1 = list(range(0,25))
C2 = list(range(25,50))
MC1 = X[:,C1].mean(axis=1)
MC2 = X[:,C2].mean(axis=1)
VC1 = X[:,C1].var(axis=1)
VC2 = X[:,C2].var(axis=1)
nC1 = len(C1)
nC2 = len(C2)
zscores = (MC1 - MC2) / np.sqrt(VC1/nC1 + VC2/nC2)
print(zscores.mean(), zscores.std())
pvalues = 2*norm.cdf(-np.abs(zscores))
Task:
In [ ]: