Preparing an annotation file

For many organisms, we can use Ensembl BioMart to get our gene symbols and GO terms.

  1. Go to the Ensembl mouse genome page
  2. Click BioMart on the top menu
  3. Select Ensembl Genes 91 as the database (the version number may change with Ensembl updates)
  4. Select Mus musculus genes (GRCm38.p5) as the dataset
  5. Select Attributes from the left-hand menu
  6. Click on the + symbol next to GENE: and select Ensembl Gene ID and Associated Gene Name
  7. Click on the + symbol next to EXTERNAL: and select GO Term Accession
  8. Click the Results button
  9. Check that Export all results to is set to File and TSV, then click the Go button to download the annotations

The downloaded file will be called mart_export.txt. It is the same as ensembl_mm10.tsv from the DEAGO tutorial.

Let's take a look:

Ensembl Gene ID Associated Gene Name    GO Term Accession
ENSMUSG00000064372  mt-Tp   
ENSMUSG00000064371  mt-Tt   
ENSMUSG00000064370  mt-Cytb GO:0016020
ENSMUSG00000064370  mt-Cytb GO:0016021
ENSMUSG00000064370  mt-Cytb GO:0046872
ENSMUSG00000064370  mt-Cytb GO:0009055
...

Notice that ENSMUSG00000064370 has multiple GO terms. Ensembl BioMart writes one line per gene and GO term combination, so a gene with several GO terms spans several lines. DEAGO expects one line per gene. We can re-format this BioMart annotation for use with DEAGO using mart_to_deago.
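To make the reshaping concrete, here is a minimal sketch that collapses a tiny BioMart-style sample to one line per gene using awk. It only illustrates the idea and is not how mart_to_deago is implemented. The sample rows, the temporary file and the variable names are made up for this example, and the exact formatting of the real tool's output (for instance, trailing separators) may differ.

```shell
# Hypothetical sample in the BioMart TSV layout:
# gene ID <TAB> gene name <TAB> GO term (one GO term per line).
sample=$(mktemp)
printf 'Ensembl Gene ID\tAssociated Gene Name\tGO Term Accession\n' > "$sample"
printf 'ENSMUSG00000064372\tmt-Tp\t\n' >> "$sample"
printf 'ENSMUSG00000064370\tmt-Cytb\tGO:0016020\n' >> "$sample"
printf 'ENSMUSG00000064370\tmt-Cytb\tGO:0016021\n' >> "$sample"
printf 'ENSMUSG00000064370\tmt-Cytb\tGO:0046872\n' >> "$sample"

# Collapse to one line per gene, joining that gene's GO terms with ';'.
collapsed=$(awk -F'\t' -v OFS='\t' '
  NR == 1 { next }                                    # skip the header line
  !($1 in name) { order[++n] = $1; name[$1] = $2 }    # remember genes in first-seen order
  $3 != "" { go[$1] = (go[$1] == "" ? $3 : go[$1] ";" $3) }   # append non-empty GO terms
  END { for (i = 1; i <= n; i++) { id = order[i]; print id, name[id], go[id] } }
' "$sample")

printf '%s\n' "$collapsed"
rm -f "$sample"
```

Genes without any GO term keep their ID and name and get an empty third column, while mt-Cytb ends up on a single line with its three sample terms joined by semicolons.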

To look at the usage information for mart_to_deago:


In [ ]:
mart_to_deago -h

To convert the annotation for use with DEAGO:


In [ ]:
mart_to_deago -a data/ensembl_mm10.tsv

This will generate a DEAGO-formatted annotation file called deago_annotation.tsv, which is the same as ensembl_mm10_deago_formatted.tsv from the DEAGO tutorial.

Let's take a look:

ENSMUSG00000000001  Gnai3   GO:0000139;GO:0000166;GO:0001664;GO:0003924;GO:0004871;GO:0005515;GO:0005525;...
ENSMUSG00000000003  Pbsn    GO:0005576;
ENSMUSG00000000028  Cdc45   GO:0000727;GO:0003682;GO:0003688;GO:0003697;GO:0005634;GO:0005656;GO:0005813;...
...
ENSMUSG00000115848  AC114008.2
ENSMUSG00000115849  AC156016.5
ENSMUSG00000115850  AC118639.2

Let's search the converted annotation for the gene identifier from earlier (ENSMUSG00000064370) that was split across multiple lines:


In [ ]:
grep ENSMUSG00000064370 data/ensembl_mm10_deago_formatted.tsv

We can see that there is one gene name (mt-Cytb) and 36 GO terms associated with this gene:

ENSMUSG00000064370  mt-Cytb GO:0001666;GO:0005739;GO:0005743;GO:0006122;GO:0007584;...
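Because the line above is truncated, counting the terms by eye is not practical. One way to count them, sketched here on a shortened, made-up line with only three terms, is to split the third column on semicolons and count the GO accessions. The variable names are illustrative only.

```shell
# Hypothetical one-line sample in the converted layout:
# gene ID <TAB> gene name <TAB> ';'-joined GO terms.
line=$(printf 'ENSMUSG00000064370\tmt-Cytb\tGO:0001666;GO:0005739;GO:0005743')

# Take the third column, put each term on its own line, and count the GO accessions.
go_count=$(printf '%s\n' "$line" | cut -f3 | tr ';' '\n' | grep -c '^GO:')
echo "$go_count"
```

On the tutorial data, the same pipeline can be fed from the grep command shown earlier instead of the sample line, and the result should match the count quoted above. Counting only lines that start with GO: means a trailing semicolon would not inflate the number.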


