By Roman Sasik (rsasik@ucsd.edu) This Notebook describes the steps used in Gene Ontology analysis, which produces both conditional and unconditional posterior probabilities that a GO term is differentially regulated in a given experiment. It is assumed that posterior probabilities for all genes have been calculated, either directly using eBayes in the limma R package, or indirectly using the lfdr function of the qvalue R package. The conditional probabilities are defined in the context of the GO graph structure:
The node in red can be pronounced conditionally significant if it is significant given the status of its descendant nodes. For instance, if the dark grey node had been found significant and the light grey nodes had been found not significant, the red node can be declared significant only if there are more significant genes in it than in the dark grey node.
The program PP2P works for both "continuous" PP's as well as for a simple cutoff, which is equivalent to the usual two-gene-list GO analysis (one a significant gene list, the other the expressed gene list).
The algorithm is described in this paper: "Posterior conditional probabilities for gene ontologies", R Sasik and A Chang, (to be published)
GNU fortran compiler gfortran is assumed to be installed (is part of gcc).
Execute this command:
In [9]:
!gfortran -ffree-line-length-none PP2p_branch_conditional_exact_pvalues.f90
In [10]:
!ls
There is one input file - a tab-delimited list of genes, which must meet these requirements: 1) The first line is the header line 2) The first column contains Entrez gene ID's of all expressed genes, in no particular order. Each gene ID must be present only once. 3) The second column contains posterior probabilities (PP) of the genes in the first column. PP is the probability that, given some prior assumptions, the gene is differentially expressed (DE). An example of such a file is C_T_PP.txt. The genes in it are ordered by their PP but they don't have to be. This is the top of that file:
There are times when we do not have the PP's, but instead have a "list of DE genes." In that case, define PP's in the second column as 1 when the corresponding gene is among the significant genes and 0 otherwise. An example of such a file is C_T_1652.txt. (The 1652 in the file name indicates the number of significant genes, but it has no other significance).
Enter this command if you want to find differentially regulated GO terms in the Biological Process ontology, in the experiment defined by the input file C_T_PP.txt, and if you want a term reported as significant with posterior error probability of 0.01:
In [7]:
!./a.out BP C_T_PP.txt 0.01
The output is a number of files:
BP_C_T_PP_0.01_conditional.dot
BP_C_T_PP_0.01_conditional_lfdr_expanded.txt
BP_C_T_PP_0.01_conditional_lfdr.txt
BP_C_T_PP_0.01_unconditional_BH_expanded.txt
BP_C_T_PP_0.01_unconditional_BH.txt
BP_C_T_PP_0.01_unconditional.dot
BP_C_T_PP_0.01_unconditional_lfdr_expanded.txt
BP_C_T_PP_0.01_unconditional_lfdr.txt
For instance, the simple list of conditionally significant GO terms is in BP_C_T_PP_0.01_conditional_lfdr.txt and looks like this:
This is the entire file. There are no more conditionally significant GO terms. The way to read this output is from top to bottom, as GO terms are reported in levels depending on the significance (or not) of their child terms. Therefore, the "level" column also corresponds to the level of the GO organization - the lower the level, the more specific (and smaller) the term is.
The expanded files contain all the genes from the reported GO terms. For instance, the top of BP_C_T_PP_0.01_conditional_lfdr_expanded.txt looks like this:
The .dot files encode the ontology structure of the significant terms. Convert them into pdf files using the following commands:
In [8]:
!dot -Tfig BP_C_T_PP_0.01_conditional.dot > BP_C_T_PP_0.01_conditional.fig
!fig2dev -L pdf BP_C_T_PP_0.01_conditional.fig BP_C_T_PP_0.01_conditional.pdf
!ls *pdf
The graph looks like this:
Cleanup after exercize:
In [27]:
!rm BP_C_T*