An IPython Introduction to Using TEA for C. elegans researchers

All of the code below was written by David Angeles-Albores. Should you find any errors, typos, or just have general comments, please contact:

dangeles at caltech dot edu

The work here was submitted and accepted for publication on .... Please cite "Tissue Enrichment Analysis for C. elegans Genomics" if this notebook was useful to you in your research.

Please note: I have tried to make this tutorial as complete as possible, including a brief introduction to Pandas dataframes and a walkthrough of how I typically prepare my dataframes for analysis. Experienced users will want to skip this and go straight to Calling TEA. However, this tutorial is by no means a complete introduction to Python or Pandas; in fact, it's more of a very fast crash course. I apologize for this, and in the future I will consider improving the tutorial.

Best of Luck!

-- David Angeles-Albores

Introduction

What is TEA meant for?

TEA is meant to provide straightforward analysis of large gene lists for C. elegans researchers. We hope that TEA will function as a hypothesis generator, or alternatively, as a way of understanding the biology behind a dataset.

How is TEA different from GO?

Great question. GO is primarily a molecular/cellular ontology, whereas TEA works from TO, the C. elegans tissue ontology. I believe tissues are, in some sense, fundamental units in biology. Tissues, not cells, have often been studied for a considerably longer time, and as a result we have a better intuition for what a tissue does than for the molecular function of a list of genes. In other words, GO analysis and TEA are similar, but my guess is that the results from TEA will be easier to interpret, and as a result easier to use for hypothesis generation.

What TEA is not:

TEA is NOT meant to be used as a quantitative tool!

At best, TEA is a very good guess about which tissues are being affected in your dataset. At worst, it is only a guess. TEA works directly from the WormBase-curated expression dataset, so we use the most complete, up-to-date annotations available. On the other hand, please remember that these annotations suffer from bias. For example, the ASE, ASK and ASI neurons have been very well studied and are quite well annotated, but the individual intestinal cells have generally not been studied in as much detail. Thus, our annotations are significantly biased by the research community's interests.

Please use TEA carefully, and always use it as a guiding tool for your research, and never as the final say on anything.

What do you need to do to run this tool?

The gist of the algorithm is:

Get your gene list into WBIDs

Call our analysis function

Call the plotting function

Done.
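In code, the whole workflow boils down to three calls. Here is a rough sketch; it assumes the library is installed as described below, and that my_wbids is already a list (or pandas Series) of WormBase IDs. Each step is demonstrated in detail in the rest of this tutorial.

import tissue_enrichment_analysis as tea

tissue_df = tea.fetch_dictionary()  # download the tissue dictionary
# my_wbids is a placeholder for your own list of WBIDs,
# e.g. ['WBGene00000001', 'WBGene00000002', ...]
df_res, unused = tea.enrichment_analysis(my_wbids, tissue_df, show=False, save=False)
tea.plot_enrichment_results(df_res, title='my_analysis', save=False)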

Batch users:

This script runs on Python > 3.5.

Dependencies: scipy (all), pandas, numpy, matplotlib and seaborn
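Before installing, you can run a quick, optional sanity check that your Python version and the dependencies are in place (this is just a convenience snippet, not part of TEA):

import sys
assert sys.version_info >= (3, 5), 'this tutorial assumes Python 3.5 or greater'

# All of these should import without raising an error.
import scipy, pandas, numpy, matplotlib, seaborn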

If you have pip, do

pip install tissue_enrichment_analysis

to install the library on your computer.

Import the module. You will probably also find the numpy and pandas modules useful.

For the purposes of this notebook, the file structure I'm working with is the following:

src - the folder this file lives in

input - a folder that contains all my input files; it contains, among other things:

    Engelmann - the folder containing the files I will be using

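If you want to mirror this layout, here is a quick way to confirm the folders are where this notebook expects them. The paths are just the ones used in this tutorial; adjust them to your own setup.

import os

for path in ['../input', '../input/Engelmann']:
    print(path, 'exists:', os.path.isdir(path))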

In [1]:
import tissue_enrichment_analysis as tea #the main library for this tutorial
import pandas as pd
import os
import importlib as imp
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#to make plots appear inline; not required if you're not working in an IPython notebook
%matplotlib inline

Now let's import our dataset. Here, I will use a dataset I obtained from Engelmann et al, 2011 (PLOS One).

Specifically, this is data from an RNA-seq experiment they performed. Briefly, young adult worms were exposed to the fungus D. coniospora for 24 h, cleaned, and then RNA-seq'ed.


In [2]:
dfDcon= pd.read_csv('../input/Engelmann/coniospora_Engelmann_2011.csv') #Don't forget to change the path in your script!
dfLum = pd.read_csv('../input/Engelmann/luminescens_Engelmann_2011.csv')
dfMarc = pd.read_csv('../input/Engelmann/marcescens_Engelmann_2011.csv')
dfFaec = pd.read_csv('../input/Engelmann/faecalis_Engelmann_2011.csv')

In [7]:
# dfDcon[dfDcon.GenePublicName =='C41A3.1']
# dfLum[dfLum.GenePublicName =='C41A3.1']
dfMarc[dfMarc.GenePublicName =='C41A3.1']
dfFaec[dfFaec.GenePublicName =='C41A3.1']


Out[7]:
GenePublicName SequenceNameGene SequenceNameTranscript EF OP50 Ratio_EFvsOP50 log2_Ratio_EFvsOP50 Infection_upregulated Infection_downregulated
4657 C41A3.1 C41A3.1 C41A3.1 0,007401 0,00369 2.005805 1.004181 NaN NaN

Let's look at the first five rows of the dataframe to see what's in it.


In [20]:
print('This dataframe has {0} columns and {1} rows'.format(dfDcon.shape[1], dfDcon.shape[0]))
dfDcon.head()


This dataframe has 9 columns and 27322 rows
Out[20]:
GenePublicName SequenceNameGene SequenceNameTranscript DC OP50 Ratio_DCvsOP50 log2_Ratio_DCvsOP50 Infection_upregulated Infection_downregulated
0 2L52.1 2L52.1 2L52.1 0,080068 0,07456 1.073868970660 0.102818 NaN NaN
1 2RSSE.1 2RSSE.1 2RSSE.1 0,639852 0,586729 1.090540948206 0.125044 NaN NaN
2 2RSSE.2 2RSSE.2 2RSSE.2 0,080266 0,081083 0.989921488245 -0.014614 NaN NaN
3 3R5.1 3R5.1 3R5.1 0,270145 0,209694 1.288281972779 0.365448 NaN NaN
4 nas-6 4R79.1 4R79.1a 0,000617 0,000749 0.824236379271 -0.278870 NaN NaN

OK. Clearly the dataframe has a few different columns. Of particular interest to us are the columns 'Infection_upregulated' and 'Infection_downregulated', since these flag the genes the authors identified as significantly altered by the treatment relative to an OP50 control. Let's analyze the upregulated genes first and see what they can tell us.
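If you want a quick sense of how many genes fall in each category, you can count the flags directly. The Infection_* columns hold 1.0 for flagged genes and NaN otherwise (as the head() output above suggests), so boolean indexing works:

# Number of genes flagged as up- or downregulated relative to the OP50 control.
print('upregulated:  ', dfDcon[dfDcon.Infection_upregulated == 1].shape[0])
print('downregulated:', dfDcon[dfDcon.Infection_downregulated == 1].shape[0])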

Before we can analyze anything, notice that they don't list WBIDs anywhere. We need to turn the names into WBIDs before we can continue.

To do this, I will load another file, containing all the WBID to human-readable name relationships, into a new dataframe called names.


In [21]:
names= pd.read_csv('../input/Engelmann/c_elegans.PRJNA13758.WS241.livegeneIDs.unmaprm.txt', 
                   sep= '\t',comment= '#')

Let's take a look at it:


In [22]:
print('The length of this dataframe is:{0}'.format(len(names)))
names.head()


The length of this dataframe is:46788
Out[22]:
WBID HumanReadable GeneName
0 WBGene00000001 aap-1 Y110A7A.10
1 WBGene00000002 aat-1 F27C8.1
2 WBGene00000003 aat-2 F07C3.7
3 WBGene00000004 aat-3 F52H2.2
4 WBGene00000005 aat-4 T13A10.10

The Engelmann names look like they are GeneNames.

Next, I'm going to define a lambda function. This function takes a single argument 'x', which should be the column containing the names we want to convert into WBIDs. Given 'x', the function goes through the GeneName column of the names dataframe and checks whether each entry can be found in 'x'.

For every entry it can find, g returns True. Otherwise, it returns False.


In [23]:
g= lambda x: (names.GeneName.isin(x))

Let's try our new function out!


In [24]:
#Remember, dfDcon is the dataframe. dfDcon['SequenceNameGene'] is the column we want.
#We store the result in a variable called 'translate'
translate= g(dfDcon['SequenceNameGene'])
#I only want to show the first 5 rows, so I'm going to add [0:5] after translate, since 'g' returns a Series object
print(translate[0:5])


0    True
1    True
2    True
3    True
4    True
Name: GeneName, dtype: bool

Great! Now we can get the WBIDs by simple indexing:


In [25]:
wbids= names[translate].WBID # names[translate] gets the rows for every gene name that was found by 'translate'
#The .WBID after names[translate] tells the computer to get the WBID column

In [26]:
print('wbids has {} gene IDs. The original dataframe has {} genes'.format(len(wbids), dfDcon.shape[0]))
wbids.head() #let's see what we found


wbids has 19702 gene IDs. The original dataframe has 27322 genes
Out[26]:
0    WBGene00000001
1    WBGene00000002
2    WBGene00000003
3    WBGene00000004
4    WBGene00000005
Name: WBID, dtype: object
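As an aside, the same name-to-WBID mapping can also be done with a join, which keeps the expression columns attached to each WBID. This is just an equivalent sketch using the column names from the two dataframes above, not the approach used in the rest of the tutorial; the count can differ slightly if a sequence name appears more than once (e.g. for genes with several transcripts).

# Inner-join the expression table onto the name table via the sequence name.
merged = dfDcon.merge(names, left_on='SequenceNameGene', right_on='GeneName', how='inner')
wbids_alt = merged.WBID  # one WBID per matched row
print(len(wbids_alt))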

Hmmm. We lost quite a few genes. Let's quickly check to make sure those aren't important


In [27]:
not_found= dfDcon[~dfDcon.SequenceNameGene.isin(names[translate].GeneName)]
not_found.head()


Out[27]:
GenePublicName SequenceNameGene SequenceNameTranscript DC OP50 Ratio_DCvsOP50 log2_Ratio_DCvsOP50 Infection_upregulated Infection_downregulated
2 2RSSE.2 2RSSE.2 2RSSE.2 0,080266 0,081083 0.989921488245 -0.014614 NaN NaN
7 6R55.1 6R55.1 6R55.1a 1,6612 1,38576 1.198764576839 0.261548 NaN NaN
8 6R55.1 6R55.1 6R55.1b 1,66481 1,38919 1.198403386146 0.261114 NaN NaN
197 B0212.6 B0212.6 B0212.6 0,015785 0,007612 2.073554738456 1.052106 1.0 NaN
416 B0310.4 B0310.4 B0310.4 0 0 NaN NaN NaN NaN

A quick search in WormBase shows that these genes have been merged into other genes. Hmmmm.. This could be a problem.

To figure out if it really is a problem, let's look at how many of those genes are upregulated during infection.


In [28]:
print('There are {0} upregulated genes, of which {1} can\'t be found in the names dictionary'.format(
      dfDcon[dfDcon.Infection_upregulated == 1].shape[0], not_found[not_found.Infection_upregulated == 1].shape[0]))


print('{0:.2}% could not be found'.format(
        100 * not_found[not_found.Infection_upregulated == 1].shape[0] / dfDcon[dfDcon.Infection_upregulated == 1].shape[0]))


There are 1692 upregulated genes, of which 31 can't be found in the names dictionary
1.8% could not be found

Great! So there's almost no loss in our gene name conversion. Now we can go ahead and extract the IDs we want to use for the enrichment analysis. For the rest of this example, I'll switch to the genes downregulated during infection in the dfLum dataset.


In [37]:
translate= g(dfLum[dfLum.Infection_downregulated == 1]['SequenceNameGene'])
wbids= names[translate].WBID

In [38]:
print(wbids.head())


2     WBGene00000003
4     WBGene00000005
5     WBGene00000006
19    WBGene00000021
36    WBGene00000038
Name: WBID, dtype: object

See how the list changed from before? Great! Now we can feed this list into TEA.

Calling TEA

TEA works by comparing your gene list to a reference tissue expression "dictionary". In order to run TEA, we first need to fetch that dictionary. That's done easily enough:


In [33]:
tissue_df= tea.fetch_dictionary() #this downloads the tissue dictionary we want

In [35]:
tissue_df.head()


Out[35]:
wbid ray 2 WBbt:0006945 Cpapa WBbt:0005962 ABarappa WBbt:0006005 ABprpppap WBbt:0006237 Cpaaa WBbt:0006212 pm6 WBbt:0003724 ventral nerve cord WBbt:0005829 ABalpaap WBbt:0005934 retrovesicular ganglion WBbt:0005656 ... ABprpappa WBbt:0006088 MSaapp WBbt:0006425 thermosensory neuron WBbt:0005838 postdeirid sensillum WBbt:0005471 ABprappp WBbt:0006702 AWA WBbt:0005670 ray 9 WBbt:0006954 ABprappaa WBbt:0006350 nociceptor neuron WBbt:0008434 ABpraapa WBbt:0006302
0 WBGene00003681 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 WBGene00000023 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 WBGene00022837 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 WBGene00003905 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 WBGene00003983 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 261 columns

Quick technical note: We could have fetched the dictionary inside the other functions and called it from there, but we want you to be able to access the dictionary yourself. Why? Well, you might want to get all the genes that are specifically expressed in a tissue, or you may want to take a look at what tissues are included, and so on.

In other words, we want you to be able to get your hands on this data! It's up to date, it's easy and it works beautifully.
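For instance, based on the structure shown above (a wbid column plus one column per tissue, where annotated genes appear to be marked with 1), you could list the tissues or pull out the genes annotated to a particular tissue along these lines; the tissue column used here is just one of the columns visible in the head() above.

# Every column except the gene ID column is a tissue.
tissues = tissue_df.columns[1:]
print(len(tissues), 'tissues in the dictionary')

# Genes annotated to one tissue, e.g. the ventral nerve cord column shown above.
col = 'ventral nerve cord WBbt:0005829'
vnc_genes = tissue_df[tissue_df[col] == 1].wbid
print(len(vnc_genes), 'genes annotated to', col)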

Now that we have the dictionary, we can run the enrichment analysis. Just so you know what's going on when you call it, the function has the following args.:

enrichment_analysis(gene_list, tissue_df, alpha= 0.05, aname= '', save= False, show= True)

Most of these you can ignore. Mainly, you'll want to assign:

gene_list = your gene list

tissue_df = the result from fetch_dictionary()

alpha= your desired q-value threshold

aname= if you want to save the results, give them a name and complete path

save= if you want to save your file, you must set this to True

This function returns 2 things:

df_res -- a dataframe with all the results

unused -- a list of all the genes that were discarded from the analysis
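As an example of the saving options, you could write the results out by giving them a name and setting save to True (the name below is purely illustrative):

# Example only: run quietly and save the results under the given name.
df_res, unused = tea.enrichment_analysis(wbids, tissue_df, alpha=0.05,
                                         aname='my_enrichment_results',
                                         save=True, show=False)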

For now, let's just run the analysis and show the results here:


In [39]:
df_res, unused= tea.enrichment_analysis(wbids, tissue_df, show= True, save= False)


Executing script

                                     Tissue   Expected  Observed  \
0   posterior lateral ganglion WBbt:0005465  20.460474      37.0   
5                          PDE WBbt:0006747   5.416008      14.0   
6                          AIY WBbt:0005413   9.267391      21.0   
7             lateral ganglion WBbt:0005105  61.140711      88.0   
9                          NSM WBbt:0003666   5.897431      15.0   
8         postdeirid sensillum WBbt:0005471   5.897431      14.0   
11                        SDQL WBbt:0004993   4.332806      11.0   
2                         SDQR WBbt:0004991   4.453162      11.0   
4           cephalic sensillum WBbt:0006920   5.656719      13.0   
10                         AQR WBbt:0003927   3.971739      10.0   
13                         AWB WBbt:0005671   8.906324      18.0   
12                         PQR WBbt:0004096   6.739921      14.0   
1                          AWA WBbt:0005670   5.536364      12.0   
3        anal sphincter muscle WBbt:0005798  15.525889      26.0   

    Enrichment Fold Change   Q value  
0                 1.808365  0.016980  
5                 2.584930  0.016980  
6                 2.266010  0.016980  
7                 1.439303  0.016980  
9                 2.543480  0.016980  
8                 2.373915  0.018099  
11                2.538770  0.024781  
2                 2.470155  0.028501  
4                 2.298152  0.028501  
10                2.517789  0.028501  
13                2.021036  0.028501  
12                2.077176  0.041767  
1                 2.167488  0.045186  
3                 1.674622  0.046300  

Voila! We got our results. Great! But what if we didn't want to show them?


In [40]:
df_res, unused= tea.enrichment_analysis(wbids, tissue_df, show= False, save= False)


Executing script

We could still look at the results by typing df_res.head():


In [41]:
df_res.head()


Out[41]:
Tissue Expected Observed Enrichment Fold Change Q value
0 posterior lateral ganglion WBbt:0005465 20.460474 37.0 1.808365 0.01698
5 PDE WBbt:0006747 5.416008 14.0 2.584930 0.01698
6 AIY WBbt:0005413 9.267391 21.0 2.266010 0.01698
7 lateral ganglion WBbt:0005105 61.140711 88.0 1.439303 0.01698
9 NSM WBbt:0003666 5.897431 15.0 2.543480 0.01698
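Since df_res is an ordinary pandas dataframe, you can slice and sort it however you like. For instance, using the column names shown above:

# Keep the most significant hits and order them by fold change.
strong_hits = df_res[df_res['Q value'] < 0.02]
print(strong_hits.sort_values('Enrichment Fold Change', ascending=False))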

What about the unused genes? Let's see how many of those there are:


In [42]:
print('{0} were discarded from the analysis'.format(len(unused)))


4451 were discarded from the analysis

Ouch! That's a lot! Don't like it? Make GFP reporters and let WormBase know where they are expressed. Seriously. Do it! You'd be helping the whole community a lot!
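If you are curious which genes were dropped, you can map the discarded IDs back to human-readable names using the names dataframe from earlier (this assumes, as described above, that unused contains WBIDs):

# Look up the discarded WBIDs in the name table we loaded at the start.
dropped = names[names.WBID.isin(unused)]
print(dropped[['WBID', 'HumanReadable']].head())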

Now let's plot the results


In [43]:
tea.plot_enrichment_results(df_res, title= 'Exercise', save= False)


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab54160>

Voila! We've analyzed our data! Yay! :D

If we wanted to save our plot, we would type:


In [ ]:
tea.plot_enrichment_results(df_res, title= 'Exercise', save= True, dirGraphs= 'example_graph_dir/')
#This will save the graph in the corresponding directory. If no directory is specified, the graphs will be saved
#to the current working directory.