PP2P protocol

By Roman Sasik (rsasik@ucsd.edu) This Notebook describes the steps used in Gene Ontology analysis, which produces both conditional and unconditional posterior probabilities that a GO term is differentially regulated in a given experiment. It is assumed that posterior probabilities for all genes have been calculated, either directly using eBayes in the limma R package, or indirectly using the lfdr function of the qvalue R package. The conditional probabilities are defined in the context of the GO graph structure:

The node in red can be pronounced conditionally significant if it is significant given the status of its descendant nodes. For instance, if the dark grey node had been found significant and the light grey nodes had been found not significant, the red node can be declared significant only if there are more significant genes in it than in the dark grey node.

The program PP2P works for both "continuous" PP's as well as for a simple cutoff, which is equivalent to the usual two-gene-list GO analysis (one a significant gene list, the other the expressed gene list).

The algorithm is described in this paper: "Posterior conditional probabilities for gene ontologies", R Sasik and A Chang, (to be published)

GNU fortran compiler gfortran is assumed to be installed (is part of gcc).

Compilation of the code

Execute this command:



In [9]:

    
!gfortran -ffree-line-length-none PP2p_branch_conditional_exact_pvalues.f90



In [10]:

    
!ls









    



a.out			   dot_conditional.png
BP_child_GO_GO_table.txt   go_interface.mod
BP_GO_locus_table.txt	   GOstructure.png
BP_GO_term_table.txt	   input.png
BP_locus_GO_table.txt	   PP2p_branch_conditional_exact_pvalues.f90
BP_parent_GO_GO_table.txt  PP2p.ipynb
chirp_transform.mod	   xls_conditional_expanded.png
C_T_1652.txt		   xls_conditional.png
C_T_PP.txt

Input file

There is one input file - a tab-delimited list of genes, which must meet these requirements: 1) The first line is the header line 2) The first column contains Entrez gene ID's of all expressed genes, in no particular order. Each gene ID must be present only once. 3) The second column contains posterior probabilities (PP) of the genes in the first column. PP is the probability that, given some prior assumptions, the gene is differentially expressed (DE). An example of such a file is C_T_PP.txt. The genes in it are ordered by their PP but they don't have to be. This is the top of that file:

There are times when we do not have the PP's, but instead have a "list of DE genes." In that case, define PP's in the second column as 1 when the corresponding gene is among the significant genes and 0 otherwise. An example of such a file is C_T_1652.txt. (The 1652 in the file name indicates the number of significant genes, but it has no other significance).

Running PP2p

Enter this command if you want to find differentially regulated GO terms in the Biological Process ontology, in the experiment defined by the input file C_T_PP.txt, and if you want a term reported as significant with posterior error probability of 0.01:



In [7]:

    
!./a.out BP C_T_PP.txt 0.01









    



 Reading expressed genes...
 Defining GO terms...
 Reading loci...
   1651.9746816417971      significant genes;       11218 expressed genes
 Reading loci again...
 Defining GO children...
 Defining GO parents...
 Connecting loci to GO terms...
 Calling get_n
100.00%
 Testing        9091 terms
 Finished level:           1
 Tests:        2749
 pi0:   1.0000000000000000     
 Total frozen genes:           0
 -----------------------------------------
 Finished level:           2
 Tests:        2076
 pi0:   1.0000000000000000     
 Total frozen genes:           0
 -----------------------------------------
 Finished level:           3
 Tests:        1463
 pi0:  0.92276150407811020     
 Total frozen genes:           0
 -----------------------------------------
 Finished level:           4
 Tests:         908
 pi0:  0.72320114600559338     
 Total frozen genes:           0
 -----------------------------------------
 *
 Finished level:           5
 Tests:         604
 pi0:  0.55187635776135313     
 Total frozen genes:         274
 -----------------------------------------
 Finished level:           6
 Tests:         403
 pi0:  0.45285362500706783     
 Total frozen genes:         274
 -----------------------------------------
 *
 Finished level:           7
 Tests:         277
 pi0:  0.37132540018251448     
 Total frozen genes:         755
 -----------------------------------------
 Finished level:           8
 Tests:         181
 pi0:  0.37292819902383295     
 Total frozen genes:         755
 -----------------------------------------
 *
 *
 *
 Finished level:           9
 Tests:         135
 pi0:  0.59259259847947121     
 Total frozen genes:        2914
 -----------------------------------------
 Finished level:          10
 Tests:          86
 pi0:  0.58914728682170536     
 Total frozen genes:        2914
 -----------------------------------------
 *
 Finished level:          11
 Tests:          69
 pi0:  0.57971015479994126     
 Total frozen genes:        3473
 -----------------------------------------
 *
 *
 Finished level:          12
 Tests:          46
 pi0:  0.61381074599221142     
 Total frozen genes:        4957
 -----------------------------------------
 Finished level:          13
 Tests:          34
 pi0:  0.86687306569534384     
 Total frozen genes:        4957
 -----------------------------------------
 Finished level:          14
 Tests:          21
 pi0:  0.85213032648283804     
 Total frozen genes:        4957
 -----------------------------------------
 Finished level:          15
 Tests:          15
 pi0:  0.77192982516680619     
 Total frozen genes:        4957
 -----------------------------------------
 Finished level:          16
 Tests:          12
 pi0:  0.78947368482968816     
 Total frozen genes:        4957
 -----------------------------------------
 Finished level:          17
 Tests:          10
 pi0:  0.73684210584104226     
 Total frozen genes:        4957
 -----------------------------------------
 Finished level:          18
 Tests:           2
 pi0:   1.0000000000000000     
 Total frozen genes:        4957
 -----------------------------------------
 Making dot files...

The output is a number of files:

Conditional reporting is done in these files:

BP_C_T_PP_0.01_conditional.dot

BP_C_T_PP_0.01_conditional_lfdr_expanded.txt

BP_C_T_PP_0.01_conditional_lfdr.txt

Unconditional reporting is done in these files (BH indicates Benjamini-Hochberg adjustment of raw p-values; lfdr indicates local false discovery rate (Storey) corresponding to the raw p-values):

BP_C_T_PP_0.01_unconditional_BH_expanded.txt

BP_C_T_PP_0.01_unconditional_BH.txt

BP_C_T_PP_0.01_unconditional.dot

BP_C_T_PP_0.01_unconditional_lfdr_expanded.txt

BP_C_T_PP_0.01_unconditional_lfdr.txt

For instance, the simple list of conditionally significant GO terms is in BP_C_T_PP_0.01_conditional_lfdr.txt and looks like this:

This is the entire file. There are no more conditionally significant GO terms. The way to read this output is from top to bottom, as GO terms are reported in levels depending on the significance (or not) of their child terms. Therefore, the "level" column also corresponds to the level of the GO organization - the lower the level, the more specific (and smaller) the term is.

The expanded files contain all the genes from the reported GO terms. For instance, the top of BP_C_T_PP_0.01_conditional_lfdr_expanded.txt looks like this:

The .dot files encode the ontology structure of the significant terms. Convert them into pdf files using the following commands:



In [8]:

    
!dot -Tfig BP_C_T_PP_0.01_conditional.dot > BP_C_T_PP_0.01_conditional.fig
!fig2dev -L pdf BP_C_T_PP_0.01_conditional.fig BP_C_T_PP_0.01_conditional.pdf
!ls *pdf









    



BP_C_T_PP_0.01_conditional.pdf

Demultiplexing phenotyping reads

The graph looks like this:

Cleanup after exercize:



In [27]:

    
!rm BP_C_T*