In this class session we are going to analyze gene expression data from a human bladder cancer cohort, using Python. We will load a data matrix of expression measurements of 4,473 genes in 414 different bladder cancer samples. These genes were selected because they are differentially expressed between normal bladder and bladder cancer (and thus are more likely to have a function in bladder cancer specifically), but the columns of the data matrix are restricted to bladder cancer samples (not normal bladder) because we want to obtain a network representing variation across cancers. The measurements in the matrix have already been normalized to account for inter-sample heterogeneity and then log2 transformed. Our job is to: compute Pearson correlation coefficients between all pairs of genes; obtain Fisher-transformed z scores for all pairs of genes; test each pair of genes for significance of the z score; adjust for multiple hypothesis testing; filter to eliminate any pair for which R < 0.75 or Padj > 0.01; load the resulting graph into an igraph.Graph object; and plot the degree distribution on log-log scale. We will then answer two questions: (1) does the network appear to be scale-free? and (2) what is its best-fit scaling exponent?
We will start by importing all of the modules that we will need for this notebook. Note the difference in language-design philosophy between R (which requires loading one package for this analysis) and Python (where we have to load eight modules). Python keeps its core minimal, whereas R has a lot of statistical and plotting functions in the base language (or in packages that are loaded by default).
In [57]:
import pandas
import scipy.stats
import matplotlib.pyplot   # import the submodule explicitly; plain "import matplotlib" does not load pyplot
import pylab
import numpy
import statsmodels.sandbox.stats.multicomp
import igraph
import math
Using pandas.read_csv, load the tab-delimited text file of gene expression measurements (rows correspond to genes, columns correspond to bladder tumor samples) into a data frame gene_matrix_for_network_df.
In [58]:
gene_matrix_for_network_df = pandas.read_csv("shared/bladder_cancer_genes_tcga.txt", sep="\t")
Use the pandas.DataFrame.as_matrix method to make a matrix gene_matrix_for_network. Print out the dimensions of the matrix by accessing its shape attribute. (Note: as_matrix has since been deprecated and removed in newer versions of pandas; see the alternative noted after this cell.)
In [59]:
gene_matrix_for_network = gene_matrix_for_network_df.as_matrix()
gene_matrix_for_network.shape
Out[59]:
(4473, 414)
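If you are running a newer version of pandas in which as_matrix has been removed, the equivalent call (a note about the current pandas API, not part of the original exercise) is:
gene_matrix_for_network = gene_matrix_for_network_df.to_numpy()   # or the .values attribute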
Use del to delete the data frame, since we no longer need it (this saves memory).
In [60]:
del gene_matrix_for_network_df
Look at the online help for the numpy.corrcoef function, using help(numpy.corrcoef). When you pass a single argument x which is a 2D array (i.e., a matrix), does corrcoef by default compute coefficients for pairs of rows, or pairs of columns?
In [61]:
help(numpy.corrcoef)
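A tiny throwaway example (an aside, not part of the exercise) makes the default behavior concrete: with rowvar=True (the default), each row is treated as a variable, so a 2 x 4 input yields a 2 x 2 correlation matrix.
demo_rows = numpy.array([[1.0, 2.0, 3.0, 4.0],
                         [4.0, 3.0, 2.0, 1.0]])
print(numpy.corrcoef(demo_rows).shape)   # (2, 2): one coefficient per pair of rows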
Compute the 4,473 x 4,473 matrix of gene-gene Pearson correlation coefficients, using numpy.corrcoef (this function treats each row as a variable, so you don't have to do any transposing of the matrix, unlike the situation in R).
In [62]:
gene_matrix_for_network_cor = numpy.corrcoef(gene_matrix_for_network)
Look at the online help for numpy.fill_diagonal. Does it return the modified matrix, or modify the matrix argument in place?
In [63]:
help(numpy.fill_diagonal)
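A quick illustration (again, just an aside): fill_diagonal modifies its argument in place and returns None.
demo_ones = numpy.ones((3, 3))
print(numpy.fill_diagonal(demo_ones, 0))   # None: nothing is returned
print(demo_ones)                           # the diagonal of demo_ones is now zero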
Set the diagonal elements of the matrix to zero, using numpy.fill_diagonal:
In [64]:
numpy.fill_diagonal(gene_matrix_for_network_cor, 0)
Look at the online help for numpy.multiply. Does it do element-wise multiplication, or matrix multiplication?
In [65]:
help(numpy.multiply)
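Another tiny check (illustrative only): numpy.multiply is element-wise; contrast it with the matrix product from numpy.dot.
a = numpy.array([[1, 2], [3, 4]])
b = numpy.array([[0, 1], [1, 0]])
print(numpy.multiply(a, b))   # element-wise: [[0 2], [3 0]]
print(numpy.dot(a, b))        # matrix product: [[2 1], [4 3]]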
Look at the online help for numpy.tri. Does it modify a matrix argument in place, or return a matrix? What is in the matrix that it returns?
In [66]:
help(numpy.tri)
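For concreteness (an aside): numpy.tri returns a new matrix with ones at and below the diagonal and zeros above it, which is exactly the mask we need.
print(numpy.tri(3))
# [[1. 0. 0.]
#  [1. 1. 0.]
#  [1. 1. 1.]]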
Set the upper triangle of the matrix to zero, using numpy.multiply and numpy.tri:
In [67]:
gene_matrix_for_network_cor = numpy.multiply(gene_matrix_for_network_cor, numpy.tri(*gene_matrix_for_network_cor.shape))  # element-wise multiply by a lower-triangular mask of ones
Using numpy.where, get a tuple of two numpy arrays containing the row/column indices of the entries of the matrix for which R >= 0.75. Use array indexing to obtain the R values for these matrix entries, as a numpy array cor_coeff_values_above_thresh.
In [68]:
inds_correl_above_thresh = numpy.where(gene_matrix_for_network_cor >= 0.75)
cor_coeff_values_above_thresh = gene_matrix_for_network_cor[inds_correl_above_thresh]
Refer to Eq. (13.5) in the assigned reading for today's class (p. 9 of the PDF). Obtain a numpy array of the correlation coefficients that exceeded 0.75, and Fisher-transform the correlation coefficient values, z = 0.5 ln[(1 + R)/(1 - R)], to get a vector z_scores of z scores. Each of these z scores will correspond to an edge in the network, unless the absolute z score is so small that we cannot reject the null hypothesis that the corresponding two genes' expression values are independent (we will perform that check in the next step).
In [69]:
z_scores = 0.5*numpy.log((1 + cor_coeff_values_above_thresh)/
(1 - cor_coeff_values_above_thresh))
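As an optional sanity check (not part of the original exercise), note that the Fisher transform is exactly the inverse hyperbolic tangent, so numpy.arctanh should reproduce z_scores:
print(numpy.allclose(z_scores, numpy.arctanh(cor_coeff_values_above_thresh)))   # expect True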
Delete the correlation matrix object in order to save memory (we won't need it from here on out).
In [70]:
del gene_matrix_for_network_cor
Assume that, under the null hypothesis that two genes are independent, sqrt(M-3)*z for the pair of genes is an independent sample from the normal distribution with zero mean and unit variance, where M is the number of samples used to compute the Pearson correlation coefficient (i.e., M = 414). For each entry in z_scores, compute a P value as the area under the two tails of the standard normal density N(x), where the two tails are x < -sqrt(M-3)*|z| and x > sqrt(M-3)*|z|. (You'll know you are doing it right if z = 0 gives a P value of 1.) You will want to use the functions numpy.abs and scipy.stats.norm.cdf, as well as the math.sqrt function (in order to compute the square root).
In [71]:
M = gene_matrix_for_network.shape[1]
P_values = 2*scipy.stats.norm.cdf(-numpy.abs(z_scores)*math.sqrt(M-3))
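A quick spot check of the claim above (illustrative only): plugging z = 0 into the same formula gives P = 1, as promised.
print(2*scipy.stats.norm.cdf(-abs(0.0)*math.sqrt(M - 3)))   # 1.0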
Adjust the P values for multiple hypothesis testing, using the statsmodels.sandbox.stats.multicomp.multipletests function with method="fdr_bh" (Benjamini-Hochberg false discovery rate control).
In [72]:
P_values_adj = statsmodels.sandbox.stats.multicomp.multipletests(P_values, method="fdr_bh")[1]
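One caveat worth noting (a fact about the statsmodels API, not part of the original exercise): the sandbox module is the legacy location of this function; in current statsmodels the same function, with the same signature, lives in statsmodels.stats.multitest. If the sandbox import fails, use:
from statsmodels.stats.multitest import multipletests
P_values_adj = multipletests(P_values, method="fdr_bh")[1]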
Verify that we don't need to drop any entries due to the adjusted P value not being small enough (use numpy.where and len); this should produce zero, since we have M = 414 samples per gene.
In [73]:
len(numpy.where(P_values_adj >= 0.01)[0])
Out[73]:
0
Read the online help for the built-in function zip. What does it do?
In [74]:
help(zip)
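A minimal demonstration (an aside): zip pairs up the i-th elements of its argument sequences; in Python 3 it returns a lazy iterator rather than a list.
print(list(zip([0, 1, 2], ["a", "b", "c"])))   # [(0, 'a'), (1, 'b'), (2, 'c')]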
We want to pass our tuple of numpy arrays containing row and column indices to Graph.TupleList; however, Graph.TupleList accepts a list of tuples, not a tuple of numpy arrays. So we need to make a tuple list, using zip:
In [75]:
row_col_inds_tuple_list = zip(inds_correl_above_thresh[0], inds_correl_above_thresh[1])
## [Note: this can be done more elegantly using the unary "*" operator:
##    row_col_inds_tuple_list = zip(*inds_correl_above_thresh)
## see how we only need to type the variable name once if we use the unary "*".]
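One more Python 3 caveat (not in the original notebook): because zip returns a one-shot iterator, row_col_inds_tuple_list can be traversed only once. If you want to inspect the pairs before handing them to igraph, materialize them first:
row_col_inds_tuple_list = list(zip(*inds_correl_above_thresh))   # a real list; survives multiple traversals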
Make an undirected graph from the row/column indices of the (upper-triangle) gene pairs whose correlations were above our threshold, using igraph.Graph.TupleList (which constructs an undirected graph by default). Print a summary of the network, as a sanity check, using the igraph.Graph.summary method.
In [76]:
final_network = igraph.Graph.TupleList(row_col_inds_tuple_list)
final_network.summary()
Out[76]:
Plot the degree distribution on log-log scale; does it appear to be scale-free?
In [77]:
degree_dist = final_network.degree_distribution()
# each histogram bin is a (left edge, right edge, count) triple;
# keep the left edge as k and the count as N(k)
xs, ys = zip(*[(left, count) for left, _, count in degree_dist.bins()])
matplotlib.pyplot.scatter(xs, ys, marker="o")
ax = matplotlib.pyplot.gca()
ax.set_yscale("log")
ax.set_xscale("log")
matplotlib.pyplot.ylim((0.5, 1000))
pylab.xlabel("k")
pylab.ylabel("N(k)")
pylab.show()
Use the igraph.statistics.power_law_fit function to estimate the scaling exponent alpha of the degree distribution:
In [78]:
igraph.statistics.power_law_fit(final_network.degree()).alpha
Out[78]:
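If you want more than the point estimate (an optional extra, not part of the original exercise), the fit object returned by power_law_fit also carries the fitted lower cutoff xmin of the power-law regime:
fit = igraph.statistics.power_law_fit(final_network.degree())
print(fit.xmin, fit.alpha)   # lower cutoff of the fitted regime, and the scaling exponent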
As a final look at the network edges, let's scatter-plot the adjusted P values (on a -log10 scale) against the corresponding correlation coefficients:
In [85]:
inds_use = numpy.where(P_values_adj > 0)   # keep only entries with nonzero adjusted P, so log10 is finite
matplotlib.pyplot.scatter(cor_coeff_values_above_thresh[inds_use], -numpy.log10(P_values_adj[inds_use]))
pylab.xlabel("R")
pylab.ylabel("-log10(P)")
pylab.show()
For each of the gene pairs for which R > 0.75, see if you can compute the Student t-test P value for each correlation coefficient, using the statistic t = R*sqrt(M-2)/sqrt(1-R^2) with M-2 degrees of freedom (don't bother adjusting for false discovery rate control). Compare to the (unadjusted) P values that you got using the Fisher transformation, using a scatter plot. How do they compare? Which test has better statistical power, for this case where M = 414? (If you are wondering: general advice is to use the Fisher transformation if M >= 10; for very small numbers of samples, use the Student t test.)
In [114]:
ts = numpy.divide(cor_coeff_values_above_thresh * math.sqrt(M - 2),
                  numpy.sqrt(1 - cor_coeff_values_above_thresh**2))
P_values_studentT = 2*scipy.stats.t.cdf(-ts, M - 2)   # two-sided; valid here since every R (hence every t) is positive
inds_use = numpy.where(numpy.logical_and(P_values > 0, P_values_studentT > 0))
matplotlib.pyplot.scatter(-numpy.log10(P_values[inds_use]),
-numpy.log10(P_values_studentT[inds_use]))
pylab.xlabel("Fisher transform")
pylab.ylabel("Student t")
pylab.show()