Gene expression

  • Use R through rpy2 to download and normalize a GEO experiment of your choice. Alternatively, use the supplied gene expression dataset. You can also parse the SOFT file downloaded directly from GEO, then apply mean centering and unit-variance normalization with scikit-learn.

  • Cluster the genes using one of your favorite methods from scikit-learn or another package. I suggest k-means.

  • Now we would like to know in which samples each cluster has particularly high expression. Let's call them characteristic samples. Figure out a test method, or use this one: compute z-scores for every gene across the samples (row-based z-scores). Sum the z-scores of all the genes belonging to a cluster and order the samples by this cumulative z-score. (Normally this would require a high average and low variance, or a cumulative significance study, but a simple ordering will do.)

  • Compute the Pearson correlation between all pairs of genes. Store the result in a matrix.

  • Form a co-expression network by traversing the matrix and adding a link between every two genes whose correlation is above a chosen threshold. Treat the result as a weighted network and compute its minimum spanning tree.

  • Plot the network, coloring the genes of each cluster with a different color. Do you see any particular color clusters, or are the colors uniformly distributed in the network? If there is a clear color cluster, output its characteristic samples.
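
The clustering, characteristic-sample, correlation and spanning-tree steps above can be sketched as follows. This is only an illustration under stated assumptions: the expression matrix is synthetic, the threshold of 0.8 and the edge weight 1 - r are arbitrary choices, and the helper `characteristic_samples` is a hypothetical name, not part of the supplied `GE` module.

```python
import numpy as np
import pandas as pd
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.cluster import KMeans

# Synthetic stand-in for a normalized expression matrix (genes x samples).
rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(8)]
pattern_a = np.array([2, 2, 2, 2, -2, -2, -2, -2], dtype=float)
pattern_b = -pattern_a  # second gene group, high in S4-S7
rows = [p + rng.normal(scale=0.3, size=8)
        for p in [pattern_a] * 10 + [pattern_b] * 10]
expr = pd.DataFrame(rows, index=[f"g{i}" for i in range(20)], columns=samples)

# 1. Cluster the genes (rows) with k-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = pd.Series(km.fit_predict(expr.to_numpy()), index=expr.index)

# 2. Row-based z-scores: each gene standardized across the samples.
z = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1, ddof=0), axis=0)

def characteristic_samples(cluster_id):
    """Samples ordered by the summed z-scores of the genes in one cluster."""
    return z.loc[labels == cluster_id].sum(axis=0).sort_values(ascending=False)

# 3. Pearson correlation between all pairs of genes.
corr = expr.T.corr(method="pearson")

# 4. Co-expression network: keep gene pairs with correlation above a threshold.
threshold = 0.8  # assumed value, tune for real data
c = corr.to_numpy()
mask = (c > threshold) & ~np.eye(len(c), dtype=bool)

# 5. Weight edges by 1 - r so that the minimum spanning tree prefers strong
#    correlations. scipy treats zero entries as "no edge", so a pair with
#    r exactly 1 would be dropped; with noisy data this does not occur.
weights = np.where(mask, 1.0 - c, 0.0)
mst = minimum_spanning_tree(weights).tocoo()
edges = [(expr.index[i], expr.index[j], 1.0 - w)
         for i, j, w in zip(mst.row, mst.col, mst.data)]

print(characteristic_samples(labels["g0"]).head(4))
print(len(edges), "edges in the spanning forest")
```

If the thresholded network is disconnected, `minimum_spanning_tree` returns a spanning forest, one tree per connected component. The `edges` list can then be passed to a graph library of your choice for plotting, with `labels` supplying the node colors.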

Finer points:

  • Work with the SciPy stack and keep your expression data in a pandas DataFrame.

  • Use goatools (https://github.com/tanghaibao/goatools) to perform GO enrichment of each cluster, or another functional enrichment. Alternatively, use topGO, calling it from R.

  • Plot the gene expression using the XKCDfier.
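
As a minimal sketch of keeping the expression data in a pandas DataFrame while normalizing it with scikit-learn, the example below standardizes each gene to zero mean and unit variance. The values are random placeholders, the gene and sample labels are hypothetical, and standardizing per gene (rather than per sample) is an assumption; adapt the axis to your own convention.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Placeholder expression values (genes x samples); real data would come from GEO.
rng = np.random.default_rng(1)
raw = pd.DataFrame(
    rng.normal(loc=8.0, scale=2.0, size=(6, 5)),
    index=[f"gene{i}" for i in range(6)],     # hypothetical gene names
    columns=[f"GSM{i}" for i in range(5)],    # hypothetical sample labels
)

# axis=1 standardizes every row (gene) across the samples.
# scale() returns a plain array, so re-attach the row and column labels.
normalized = pd.DataFrame(
    scale(raw.to_numpy(), axis=1),
    index=raw.index,
    columns=raw.columns,
)

print(normalized.round(2))
```

Wrapping the result back into a DataFrame keeps the gene and sample labels attached, which makes the later clustering and enrichment steps easier to trace back to gene names.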


In [ ]:
import GE
GE.run()