04 - Ensembl

Introduction

`Ensembl` is a collection of browsers covering the major domains of life that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. `Ensembl` contains genome annotations, computes multiple alignments, predicts regulatory function and collects disease data. `Ensembl` tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species. The service is provided by the European Bioinformatics Institute (EBI).

Ensembl has a number of entrypoints or portals, divided broadly by type of organism:

Each of these resources is organised in a very similar way, and provides consistent browser, programmatic and pipeline-based routes to access its curated, integrated data. Not all features are available in every resource, however.

Resources

The EnsemblBacteria portal

You'll be using the `EnsemblBacteria` portal in this lesson. `EnsemblBacteria` provides similar access to the other `Ensembl*` portals, and what you learn here will be transferable to the other sites without much modification.

EnsemblBacteria provides access to over 40,000 bacterial genomes through a common genome browser interface. This data can also be accessed programmatically through a REST interface (see later lessons), and downloaded directly. Additionally, over 100 bacterial genomes are covered in the pan-taxonomic Compara tool. Bacterial genes from all the genomes are classified into families with HAMAP and PANTHER tools.
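As a small preview of the programmatic route, the sketch below only builds the URL for a REST request and does not contact any server. It assumes the public Ensembl REST server at `https://rest.ensembl.org` and its `lookup/id` endpoint, neither of which is named in this lesson. `lookup_url` is a hypothetical helper, and whether a particular bacterial identifier resolves depends on the release being served.

```python
from urllib.parse import quote

# Assumption: the public Ensembl REST server (not named in this lesson).
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id):
    """Build the URL of a `lookup/id` request for one stable identifier.

    Hypothetical helper for illustration: it only assembles a string.
    """
    # quote() guards against characters that are not safe in a URL path
    return f"{REST_SERVER}/lookup/id/{quote(stable_id)}?content-type=application/json"


print(lookup_url("ECA1453"))
```

Passing the resulting URL to `urllib.request.urlopen` would be one way to fetch the JSON record.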

Given all this power behind the scenes, the landing page is deceptively plain:

One thing to note is in the lower left corner: the *release number*. In the image above, it shows we are using release 34. `Ensembl`'s release policy is to release complete, updated versions of the entire database, rather than to provide incrementally-updated versions of each individual database entry (which is what `UniProt` does). It is important to cite the `Ensembl` release number in any publication that derives from or uses its data.

Using Ensembl Genomes

The way we interact with `EnsemblBacteria` is similar to that for all the other `Ensembl` portals. In this section, you'll use the portal to find and browse a *Pectobacterium* genome.

As EnsemblBacteria provides over 40,000 genomes, it can be hard to navigate the full list, even though you can link to it from the landing page. It's much simpler to start typing the name of the species you want in the Search for a genome box, as for the term Pectobacterium below:

  • Start typing Pectobacterium in the search field

This produces a drop-down list, which we can scroll through until we find our organism of interest. In this case, we're going for Pectobacterium atrosepticum SCRI 1043.

  • Click on the entry for Pectobacterium atrosepticum SCRI 1043

This brings up the genome's homepage, which offers several useful links for information and statistics, comparative genomics, downloading data and, at the top, another search bar so you can search for features of interest:

Downloading data

To download files for this genome, click on the `Download DNA sequence (FASTA)` link. This will take you to an FTP page, and clicking on any of those links will download the corresponding file.
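Once a FASTA file has been downloaded, it can be read with a few lines of Python. The sketch below is a minimal reader using only the standard library. `read_fasta` is a hypothetical helper, and the two records are made-up stand-ins for a real download. Files on the FTP site are typically gzip-compressed, in which case `gzip.open(path, "rt")` can supply the file handle.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


# Tiny invented example standing in for a downloaded file
example = io.StringIO(">seq1 example record\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(read_fasta(example))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```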

On the right-hand side of the page, there are links to download gene and protein sequences in FASTA or GFF3 (usable with Artemis or Tablet) format. These also take you to an FTP site from where you can download the corresponding data.
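GFF3 is a tab-delimited text format with nine columns, the third of which names the feature type. The following is a minimal sketch for tallying feature types, assuming a well-formed file. `count_feature_types` is a hypothetical helper and the example lines are invented for illustration; they are not taken from the *Pectobacterium* annotation.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count how often each feature type (column 3) occurs in a GFF3 handle."""
    counts = Counter()
    for line in handle:
        # Skip directives/comments (starting with '#') and blank lines
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line (e.g. embedded sequence data)
        counts[columns[2]] += 1
    return counts


# Invented example lines, for illustration only
example = io.StringIO(
    "##gff-version 3\n"
    "Chromosome\texample\tgene\t100\t900\t.\t+\t.\tID=gene:g1\n"
    "Chromosome\texample\tmRNA\t100\t900\t.\t+\t.\tID=transcript:t1;Parent=gene:g1\n"
    "Chromosome\texample\tCDS\t100\t900\t.\t+\t0\tID=CDS:p1;Parent=transcript:t1\n"
    "Chromosome\texample\tgene\t1200\t1500\t.\t-\t.\tID=gene:g2\n"
)
counts = count_feature_types(example)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n)
```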

The Genome Browser

On each genome home page, there's a link to a region of interest, in this case: [`Chromosome:2198014-2201221`](http://bacteria.ensembl.org/Pectobacterium_atrosepticum_scri1043/psychic?q=Chromosome:2198014-2201221;site=ensemblthis), just below the search bar. We'll use this to get a first look at the genome browser.

  • Click on the region of interest link to Chromosome:2198014-2201221
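Region strings such as `Chromosome:2198014-2201221` follow a `name:start-end` pattern, with 1-based coordinates that include both ends. A minimal sketch for splitting such a string and computing the region's length follows; `parse_region` and `region_length` are hypothetical helper names, not part of any `Ensembl` library.

```python
def parse_region(region):
    """Split 'name:start-end' into (name, start, end).

    Thousands separators (commas) in the coordinates are tolerated.
    """
    name, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not name or start > end:
        raise ValueError(f"not a valid region: {region!r}")
    return name, start, end


def region_length(region):
    """Length in bases, counting both the start and the end position."""
    _, start, end = parse_region(region)
    return end - start + 1


print(parse_region("Chromosome:2198014-2201221"))
print(region_length("Chromosome:2198014-2201221"))  # 3208
```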

NOTE: the `Ensembl` browser interface puts up tabs at the top of the page to indicate which view you are currently in - this is a good way to take shortcuts between search outputs.

At the top of this page, there's a circular overview of the entire chromosome, with features and GC content/skew indicated, and the region of interest marked with a red wedge. Below this there's a linear view of the region in detail.
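GC content and GC skew are both simple functions of base counts. The sketch below uses common definitions, GC content = (G + C) / length and GC skew = (G - C) / (G + C). The window size and exact formula behind the browser's track are not stated in this lesson, so treat these definitions as assumptions. `gc_content` and `gc_skew` are hypothetical helper names.

```python
def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_skew(sequence):
    """(G - C) / (G + C); 0.0 when the sequence has no G or C."""
    sequence = sequence.upper()
    g, c = sequence.count("G"), sequence.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


example = "GGGCAT"  # invented sequence: 3 G, 1 C, 6 bases in total
print(round(gc_content(example), 3), gc_skew(example))
```

In practice these values are usually computed in windows along the chromosome rather than over the whole sequence at once.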

On the circular view, there are yellow and blue handles. You can click and drag these to modify the region of interest, and if you click on this new region, the linear view beneath the circle will update (if the region is not too large to view).

  • Use the handles to zoom in on a region approximately 1,644,434-1,688,332

The lower region shows a view of the features annotated on the genome, coloured by type. In the example above, all the features are protein-coding genes. If you click on one of these, a small popup window appears, giving a little more information.

  • Click on one of the features in the lower view

You can change which features show in the browser view by clicking on the `Configure this page` option in the left-hand menu. For now, we will persist with the default view.

Gene view

In the popup you saw above, you will have noticed several distinct identifiers - for the transcript, the gene, and the protein product. These are live links, and will take you to a new page centring the browser on that feature, with an overview window that shows the gene-based display. For instance, clicking on the ECA1453 gene link above brings up the gene view:

  • Click on the ECA1453 link

You will see that the content of the left-hand menu has changed, giving links out to sequence annotations such as GO terms, and external references.

NOTE: in general, the annotations in Ensembl may not be identical to, or contain the same detailed information as, those found in other resources. This genome, Pba SCRI 1043, was carefully manually annotated by a team of six people over a period of six months; that complete annotation can be found in the GenBank record for this genome.

To see the links to other databases for this feature, use the External references link in the left-hand menu. This brings up a list of entries for this sequence in other resources.

  • Click on the External references menu item

QUESTIONS
  1. Can you find the corresponding entry for this feature in NCBI/GenBank, using your browser?

You can search directly for a gene of interest in the gene search box, and we will do this for the virulence-related protein VgrG, below.

  • Click on the tab for the whole genome home page
  • Enter vgrG into the search field and click Go or hit Return

QUESTIONS
  1. How many genes does the search return?
  2. Are all of these results likely to be active coding genes?
  3. How many genes in the GenBank record for this genome are annotated as `VgrG`? HINT: the genome record can be found here.

  • Go to the gene page for the first hit, and click on Sequence in the left-hand menu

This will present a marked-up view of the genomic sequence for that vgrG search result, together with its flanking regions. There are buttons that allow you to download the sequence in a range of formats, or to submit it as a BLAST search. The exons are highlighted, with the current gene of interest shown in red.

The FASTA header shows the location of the region covered by the sequence.

QUESTIONS
  1. What does the FASTA sequence represent?
  2. Is this the same sequence you would expect from `UniProt`, `KEGG` or `NCBI`? HINT: `UniProt`, `KEGG` and `NCBI` records for ECA1453.

Transcript view

When you clicked through to reach the gene page for your search result, `Ensembl` kindly also opened up a page for the corresponding transcript (on the right-hand side of the tab list), which you should be able to see in the top tab.

  • Click on the transcript tab

This will bring you to the transcript summary view:

  • Click on the cDNA menu item on the left-hand side

This will show the transcript coding sequence.

  • Click on the General identifiers menu item

This shows the external database links, to find more information about this sequence. In particular, the link out to UniProtKB indicates the percentage sequence identity to one of UniProt's protein sequence entries. Here it is, as would be expected, a 100% identity match.

Comparative Genomics in Ensembl

For some genomes, `EnsemblBacteria` provides gene trees and lists of homologues in a range of other species. To explore this, you will look at the comparative genomics of the *E. coli* K-12 gene `pfkA`.

  • Search for the gene pfkA.
  • Click on the link for b3916 to get the gene view

Note that the menu on the left now has the Pan-taxonomic Compara, Gene Tree and Orthologues links available. These are the entrypoints to Ensembl's comparative genomics tools.

  • Click on the Gene Tree menu option

The browser now shows a tree locating this E. coli pfkA in the context of orthologues from other organisms. The E. coli gene of interest is highlighted in red, and a set of figurative protein alignments is shown to the right of the tree - these alignments indicate regions of the sequence and their extent of sequence identity. On the tree itself, several branches are 'collapsed' into shaded triangles, and more information about these collapsed sequence sets can be obtained by clicking on the grey triangle, to bring up a pop-up window.

The pop-up has a link (expand this sub-tree) that allows you to see the members of the collapsed branches in full.

  • Click on the Orthologues link in the left-hand menu

This will present the set of orthologues to pfkA in the Pan-taxonomic Compara as a table:

NOTE: the `Ensembl` definition of orthologue is different to how it might be used in the literature. In `Ensembl` an "orthologue" may refer to a `1-to-1` orthologue (as in the original, strict definition), but it may also be `1-to-many` or `many-to-many`, which are in phylogenetic terms not orthologues by the standard definition.
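To make the three labels concrete, the sketch below classifies gene pairs between two genomes by counting how many partners each gene has. This is a deliberately simplified illustration, not how `Ensembl` Compara assigns its types (those are derived from gene trees). `classify_pairs` is a hypothetical helper and the gene names are invented.

```python
from collections import Counter


def classify_pairs(pairs):
    """Label each (gene_a, gene_b) pair as 1-to-1, 1-to-many or many-to-many.

    Simplification: the label depends only on how many partners each gene
    of the pair has in the other genome.
    """
    partners_a = Counter(a for a, _ in pairs)
    partners_b = Counter(b for _, b in pairs)
    labels = {}
    for a, b in pairs:
        a_has_many = partners_a[a] > 1
        b_has_many = partners_b[b] > 1
        if a_has_many and b_has_many:
            labels[(a, b)] = "many-to-many"
        elif a_has_many or b_has_many:
            labels[(a, b)] = "1-to-many"
        else:
            labels[(a, b)] = "1-to-1"
    return labels


# Invented gene names, for illustration only
pairs = [
    ("a1", "b1"),                 # a single partner on each side
    ("a2", "b2"), ("a2", "b3"),   # a2 has two partners
    ("a3", "b4"), ("a3", "b5"),   # a3/a4 and b4/b5 are all interlinked
    ("a4", "b4"), ("a4", "b5"),
]
labels = classify_pairs(pairs)
for pair, label in labels.items():
    print(pair, label)
```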

The table is interactive and will allow you to live-filter on organism type by selecting Show details for any of the offered groups. Also, each individual row in the table presents links to view the corresponding gene page, and the sequence alignment.

  • Click on Show details to restrict the table to bacterial orthologues
  • Enter Myco into the table filter to restrict the table to two rows
  • Click on the link to view sequence alignments in the first row

Clicking on the sequence alignment link will bring up a pop-up menu asking if you want the protein or cDNA alignment.

  • Click on View Protein Alignment

This gives the pairwise alignment used in constructing the orthologue set for your query sequence, in ClustalW format. There is information given on sequence percentage identity and coverage for this alignment.
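Percentage identity and coverage can be recomputed from any pairwise alignment. The sketch below uses one common definition, in which both values are expressed relative to the ungapped length of the query sequence; the exact calculation `Ensembl` uses is not given in this lesson and may differ. `alignment_stats` is a hypothetical helper and the aligned sequences are invented.

```python
def alignment_stats(aligned_query, aligned_target):
    """Return (percent_identity, percent_coverage) relative to the query.

    Both inputs are rows of the same alignment, with '-' marking gaps.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    aligned_columns = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_target):
        if q == "-" or t == "-":
            continue  # a gap in either row: this column is not aligned
        aligned_columns += 1
        if q == t:
            identical += 1
    query_length = len(aligned_query.replace("-", ""))
    if query_length == 0:
        raise ValueError("query sequence is empty")
    identity = 100 * identical / query_length
    coverage = 100 * aligned_columns / query_length
    return identity, coverage


# Invented alignment rows, for illustration only
identity, coverage = alignment_stats("MKT-LLA", "MRTALL-")
print(f"identity {identity:.2f}%, coverage {coverage:.2f}%")
```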

Exercise 01 (10min)

Starting from the `EnsemblBacteria` landing page and using its tools, can you:

  • Go to the home page for the *Kitasatospora setae* KM-6054 genome?
  • Find how many coding genes are annotated in this genome?
  • Download the FASTA file describing all predicted protein gene products?
  • Find how many putative penicillin-binding proteins are annotated in the genome?
  • Examine the feature `KSE_59840` in the genome browser, and find the GO term for its molecular function? What is the evidence for this functional annotation?
  • What is the `UniProt` accession for this protein?