All of the code below was written by David Angeles-Albores. Should you find any errors, typos, or just have general comments, please contact:
dangeles at caltech dot edu
The work here was submitted and accepted for publication on .... If this notebook was useful for your research, please cite Tissue Enrichment Analysis for C. elegans Genomics.
Please note: I have tried to make this tutorial as complete as possible, with a brief introduction to Pandas dataframes and a demonstration of how I typically prepare my dataframes for analysis. Experienced users will want to skip this and go straight to Calling TEA. However, this tutorial is by no means a complete introduction to Python or Pandas - in fact, it's more like a super-fast crash course. I apologize for this, and in the future I will consider improving the tutorial.
TEA is meant to provide straightforward analysis of large gene lists for C. elegans researchers. We hope that TEA will function as a hypothesis generator, or alternatively, as a way of understanding the biology behind a dataset.
Great question. GO is primarily a molecular/cellular ontology, whereas TEA works from TO, the C. elegans tissue ontology. I believe tissues are, in some sense, fundamental units in biology. Tissues, not cells, have often been studied for a considerably longer time, and as a result we have a better intuition for what a tissue does than for the molecular function of a list of genes. In other words, GO analysis and TEA are similar, but my guess is that the results from TEA will be easier to interpret, and as a result easier to use for hypothesis generation.
TEA is NOT meant to be used as a quantitative tool!
At best, TEA is a very good guess about which tissues are affected in your dataset. At worst, it is just a guess. TEA works directly from the WormBase-curated expression dataset, so we have the very best, most up-to-date annotations in the world. On the other hand, please remember that these annotations suffer from bias. For example, the ASE, ASK and ASI neurons have been very well studied and are quite well annotated, but the individual intestinal cells generally have not! Thus, our annotations are significantly biased by the research community's interests.
Please use TEA carefully: always treat it as a guiding tool for your research, never as the final say on anything.
The gist of the algorithm is:
Get your gene list into WBIDs
Call our analysis function
Call the plotting function
Done.
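Put together, a typical session looks roughly like the sketch below. This is only an outline: my_wbids is a hypothetical placeholder for your own list of WormBase gene IDs, and the individual steps (installation, loading data, converting names to WBIDs) are walked through in the rest of the notebook.
import tissue_enrichment_analysis as tea

# placeholder: your genes, as a list of WormBase gene IDs (WBIDs)
my_wbids = ['WBGene00000001', 'WBGene00000002']

tissue_df = tea.fetch_dictionary()  # fetch the WormBase tissue dictionary
df_res, unused = tea.enrichment_analysis(my_wbids, tissue_df, show=False, save=False)  # run the analysis
tea.plot_enrichment_results(df_res, title='my_analysis', save=False)  # plot the enriched tissues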
This script runs on Python > 3.5.
Dependencies: scipy (all), pandas, numpy, matplotlib and seaborn
If you have pip, do
pip install tissue_enrichment_analysis
to install the library on your computer.
Import the module. You may find that the numpy and pandas modules are also often very useful.
For the purposes of this notebook, the file structure I'm working with is the following:
src - the folder this file lives in
input - a folder that contains all my input files; it also contains:
    Engelmann - the folder containing the files I will be using
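If your folders are arranged differently, adjust the relative paths in the read_csv calls below accordingly. As a purely optional sanity check (assuming the layout above, with the notebook running from src), you can verify that a file sits where the notebook expects it:
import os
# the Engelmann et al. files are expected one level up from src/, under input/Engelmann/
print(os.path.exists('../input/Engelmann/coniospora_Engelmann_2011.csv'))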
In [1]:
import tissue_enrichment_analysis as tea  # the main library for this tutorial
import pandas as pd
import os
import importlib as imp
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# to make IPython plot inline; not required if you're not working in an IPython notebook
%matplotlib inline
Now let's import our dataset. Here, I will use a dataset obtained from Engelmann et al., 2011 (PLoS ONE).
Specifically, these are data from an RNA-seq experiment they performed. Briefly, young adult worms were exposed to the fungus D. coniospora for 24 hours, cleaned, and then RNA-seq'ed.
In [2]:
dfDcon = pd.read_csv('../input/Engelmann/coniospora_Engelmann_2011.csv')  # Don't forget to change the path in your script!
dfLum = pd.read_csv('../input/Engelmann/luminescens_Engelmann_2011.csv')
dfMarc = pd.read_csv('../input/Engelmann/marcescens_Engelmann_2011.csv')
dfFaec = pd.read_csv('../input/Engelmann/faecalis_Engelmann_2011.csv')
In [7]:
# Quick sanity check: is the gene C41A3.1 present in each dataset?
# dfDcon[dfDcon.GenePublicName == 'C41A3.1']
# dfLum[dfLum.GenePublicName == 'C41A3.1']
dfMarc[dfMarc.GenePublicName == 'C41A3.1']
dfFaec[dfFaec.GenePublicName == 'C41A3.1']  # only this last expression is displayed by the notebook
Out[7]:
Let's look at the first five rows of the dataframe to see what's in it.
In [20]:
print('This dataframe has {0} columns and {1} rows'.format(dfDcon.shape[1], dfDcon.shape[0]))
dfDcon.head()
Out[20]:
OK. Clearly the dataframe has a few different columns. Of particular interest to us are the columns 'Infection_upregulated' and 'Infection_downregulated', since these mark the genes they identified as significantly altered by the treatment relative to an OP50 control. Let's analyze the upregulated genes first and see what they can do.
Before we can analyze anything, notice that they don't list WBIDs anywhere. We need to turn the names into WBIDs before we can continue.
To do this, I will load another file containing all the WBID-to-human-readable-name mappings into a new dataframe called names.
In [21]:
names = pd.read_csv('../input/Engelmann/c_elegans.PRJNA13758.WS241.livegeneIDs.unmaprm.txt',
                    sep='\t', comment='#')
Let's take a look at it:
In [22]:
print('The length of this dataframe is: {0}'.format(len(names)))
names.head()
Out[22]:
The names in the Engelmann dataset look like entries in the GeneName column of this dataframe.
Next, I'm going to write a lambda function. This function takes a single argument 'x', which should be the column containing the names we want to convert into WBIDs. Given 'x', the function checks, for each entry in the GeneName column of the names dataframe, whether that entry can be found in 'x'.
For every entry it can find, g returns True; otherwise, it returns False.
In [23]:
g = lambda x: names.GeneName.isin(x)
Let's try our new function out!
In [24]:
# Remember, dfDcon is the dataframe; dfDcon['SequenceNameGene'] is the column we want.
# We store the result in a variable called 'translate'
translate = g(dfDcon['SequenceNameGene'])
# 'g' returns a boolean Series, so I'll add [0:5] to show only the first 5 entries
print(translate[0:5])
Great! Now we can get the WBIDs by simple indexing:
In [25]:
wbids = names[translate].WBID  # names[translate] keeps the rows of 'names' flagged True by 'translate'
# The .WBID at the end selects the WBID column
In [26]:
print('wbids has {} gene IDs. The original dataframe has {} genes'.format(len(wbids), dfDcon.shape[0]))
wbids.head()  # let's see what we found
Out[26]:
Hmmm. We lost quite a few genes. Let's quickly check to make sure those aren't important.
In [27]:
not_found = dfDcon[~dfDcon.SequenceNameGene.isin(names[translate].GeneName)]
not_found.head()
Out[27]:
A quick search in WormBase shows that these genes have been merged into other genes. Hmmm... this could be a problem.
To figure out if it really is a problem, let's look at how many of those genes are upregulated during infection.
In [28]:
print('There are {0} upregulated genes, of which {1} can\'t be found in the names dictionary'.format(
    dfDcon[dfDcon.Infection_upregulated == 1].shape[0],
    not_found[not_found.Infection_upregulated == 1].shape[0]))
print('{0:.2%} could not be found'.format(
    not_found[not_found.Infection_upregulated == 1].shape[0] /
    dfDcon[dfDcon.Infection_upregulated == 1].shape[0]))
Great! So there's almost no loss in our gene name conversion. Now we can go ahead and extract the IDs of the upregulated genes to use for our enrichment analysis:
In [37]:
translate = g(dfDcon[dfDcon.Infection_upregulated == 1]['SequenceNameGene'])
wbids = names[translate].WBID
In [38]:
print(wbids.head())
See how the list changed from before? Great! Now we can put this into TEA.
In [33]:
tissue_df = tea.fetch_dictionary()  # this downloads the tissue dictionary we want
In [35]:
tissue_df.head()
Out[35]:
Quick technical note: We could have fetched the dictionary inside the other functions and kept it hidden, but we want you to be able to access it. Why? Well, you might want to get all the genes that are specifically expressed in a tissue, or you may want to take a look at which tissues are included, etc.
In other words, we want you to be able to get your hands on this data! It's up to date, it's easy and it works beautifully.
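For example, here is a rough sketch of how you might poke around the dictionary once you've fetched it. It assumes the dictionary is a gene-by-tissue table with a 'wbid' column and one 0/1 column per tissue term; check tissue_df.head() and tissue_df.columns to confirm the actual layout before relying on this.
# list the tissue terms (every column except the gene ID column, assumed here to be called 'wbid')
tissues = [col for col in tissue_df.columns if col != 'wbid']
print('{0} tissue terms in the dictionary'.format(len(tissues)))

# hypothetical example: all genes annotated to the first tissue term
genes_in_tissue = tissue_df[tissue_df[tissues[0]] == 1].wbid
print(genes_in_tissue.head())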
Now that we have the dictionary, we can run the enrichment analysis. Just so you know what's going on when you call it, the function has the following arguments:
enrichment_analysis(gene_list, tissue_df, alpha=0.05, aname='', save=False, show=True)
Most of these you can ignore. Mainly, you'll want to assign:
gene_list = your gene list
tissue_df = the result from fetch_dictionary()
alpha = your desired q-value threshold
aname = if you want to save the results, give them a name and a complete path
save = if you want to save your file, you must set this to True
This function returns 2 things:
df_res -- a dataframe with all the results
unused -- a list of all the genes that were discarded from the analysis
For now, let's just run the analysis and show the results here:
In [39]:
df_res, unused = tea.enrichment_analysis(wbids, tissue_df, show=True, save=False)
Voila! We got our results. Great! But what if we didn't want to show them?
In [40]:
df_res, unused = tea.enrichment_analysis(wbids, tissue_df, show=False, save=False)
We could still look at the results by typing df_res.head():
In [41]:
df_res.head()
Out[41]:
What about the unused genes? Let's see how many of those there are:
In [42]:
print('{0} genes were discarded from the analysis'.format(len(unused)))
Ouch! That's a lot! Don't like it? Make GFP reporters and let WormBase know where they are expressed. Seriously. Do it! You'd be helping the whole community a lot!
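If you're curious about which genes were dropped, you can map them back to human-readable names. This is just a sketch, and it assumes that unused holds WBIDs that also appear in the names dataframe:
# look up the discarded gene IDs in the names dataframe to recover their public names
discarded = names[names.WBID.isin(unused)]
print(discarded.head())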
Now let's plot the results
In [43]:
tea.plot_enrichment_results(df_res, title='Exercise', save=False)
Out[43]:
Voila! We've analyzed our data! Yay! :D
If we wanted to save our plot, we would type:
In [ ]:
tea.plot_enrichment_results(df_res, title='Exercise', save=True, dirGraphs='example_graph_dir/')
# This will save the graph in the corresponding directory. If no directory is specified, the graphs will be saved
# to the current working directory.