For many organisms, we can use Ensembl BioMart to get our gene symbols and GO terms.
1. Select "BioMart" on the top menu.
2. Choose "Ensembl Genes 91" as the database (the version number may change with Ensembl updates).
3. Choose "Mus musculus genes (GRCm38.p5)" as the dataset.
4. Select "Attributes" from the left-hand menu.
5. Click the "+" symbol next to "GENE:" and select "Ensembl Gene ID" and "Associated Gene Name".
6. Click the "+" symbol next to "EXTERNAL:" and select "GO Term Accession".
7. Click the "Results" button.
8. Check that "Export all results to" is set to "File" and "TSV", then click the "Go" button to download the annotations.

The downloaded file will be called mart_export.txt, which is the same as ensembl_mm10.tsv from the DEAGO tutorial.
Let's take a look:
Ensembl Gene ID     Associated Gene Name  GO Term Accession
ENSMUSG00000064372  mt-Tp
ENSMUSG00000064371  mt-Tt
ENSMUSG00000064370  mt-Cytb               GO:0016020
ENSMUSG00000064370  mt-Cytb               GO:0016021
ENSMUSG00000064370  mt-Cytb               GO:0046872
ENSMUSG00000064370  mt-Cytb               GO:0009055
...
Notice that ENSMUSG00000064370 has multiple GO terms. Ensembl BioMart writes one entry per line, so a gene with several GO terms is spread over several lines. DEAGO expects one line per gene. We can reformat this BioMart annotation for use with DEAGO using mart_to_deago.
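Before converting, it can help to check how many lines each gene occupies in the raw export. The following is a minimal sketch, assuming a tab-separated file with one header row and the Ensembl gene ID in the first column. The function name count_lines_per_gene is hypothetical and is not part of DEAGO.

```shell
# Count how many lines each gene ID occupies in a BioMart TSV export.
# Assumptions: one header row, gene ID in column 1, tab-separated columns.
# count_lines_per_gene is an illustrative helper, not a DEAGO command.
count_lines_per_gene() {
    tail -n +2 "$1" |      # drop the header row
        cut -f1 |          # keep only the gene ID column
        sort | uniq -c |   # count the lines for each gene ID
        sort -k1,1nr       # genes with the most lines first
}

# Example call (path as used elsewhere in this tutorial):
# count_lines_per_gene data/ensembl_mm10.tsv | head
```

Genes with a count above 1 are the ones that need to be collapsed onto a single line.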
To look at the usage information for mart_to_deago:
mart_to_deago -h
To convert the annotation for use with DEAGO:
mart_to_deago -a data/ensembl_mm10.tsv
This will generate a DEAGO-formatted annotation file called deago_annotation.tsv, which is the same as ensembl_mm10_deago_formatted.tsv from the DEAGO tutorial.
Let's take a look:
ENSMUSG00000000001 Gnai3 GO:0000139;GO:0000166;GO:0001664;GO:0003924;GO:0004871;GO:0005515;GO:0005525;...
ENSMUSG00000000003 Pbsn GO:0005576;
ENSMUSG00000000028 Cdc45 GO:0000727;GO:0003682;GO:0003688;GO:0003697;GO:0005634;GO:0005656;GO:0005813;...
...
ENSMUSG00000115848 AC114008.2
ENSMUSG00000115849 AC156016.5
ENSMUSG00000115850 AC118639.2
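To make the reshaping concrete, the collapse from one line per GO term to one line per gene can be sketched with awk. This is an illustration only and is not how mart_to_deago is implemented. It assumes a three-column, tab-separated export with a header row, and collapse_biomart is a hypothetical function name. Line order and trailing separators may differ from the real tool's output.

```shell
# Collapse a BioMart TSV export (gene ID, gene name, GO term) into one line
# per gene, with the GO terms joined by ";".
# Illustrative sketch only: NOT the mart_to_deago implementation.
collapse_biomart() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { next }                       # skip the header row
        !($1 in name) {                        # first time this gene is seen
            order[++n] = $1                    # remember input order
            name[$1] = $2
        }
        $3 != "" {                             # append the GO term, if any
            go[$1] = (($1 in go) ? go[$1] ";" : "") $3
        }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                print id, name[id], go[id]     # GO column is empty if no terms
            }
        }
    ' "$1"
}

# Example call (writes to a hypothetical file name):
# collapse_biomart data/ensembl_mm10.tsv > my_collapsed_annotation.tsv
```

For real analyses, use mart_to_deago itself, since DEAGO may rely on details of its output that this sketch does not reproduce.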
Let's search the converted annotation for the gene identifier from earlier (ENSMUSG00000064370), which was split across multiple lines in the BioMart export:
grep ENSMUSG00000064370 data/ensembl_mm10_deago_formatted.tsv
We can see that there is one gene name (mt-Cytb) and 36 GO terms associated with this gene:
ENSMUSG00000064370 mt-Cytb GO:0001666;GO:0005739;GO:0005743;GO:0006122;GO:0007584;...
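Rather than counting the GO terms by eye, they can be counted on the command line. The following is a minimal sketch, assuming the formatted file is tab-separated with the ";"-separated GO terms in the third column. The function name count_go_terms is hypothetical and is not part of DEAGO.

```shell
# Count the GO terms listed for one gene in a DEAGO-formatted annotation.
# Assumptions: tab-separated columns, gene ID in column 1, and
# ";"-separated GO terms in column 3 (a trailing ";" is tolerated).
# count_go_terms is an illustrative helper, not a DEAGO command.
count_go_terms() {
    awk -F'\t' -v id="$2" '
        $1 == id {
            n = split($3, terms, ";")
            c = 0
            for (i = 1; i <= n; i++)
                if (terms[i] != "") c++        # ignore empty pieces
            print c
        }
    ' "$1"
}

# Example call:
# count_go_terms data/ensembl_mm10_deago_formatted.tsv ENSMUSG00000064370
```

If the file matches the one used in this tutorial, the number printed should agree with the count given above.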