Gene Expression Analysis: Annotations and Ontology Analysis

I. Overview and Objectives

In this prelab, we will continue to think about analysis of gene expression on microarrays, covering two additional topics.

  • Annotations: libraries, how to load and attached, outputting content to files
  • Ontology Analysis: Question, statistics, key ideas, resources, tools.

Annotating your results: Back to your pipeline

After you've performed your analysis, you might have been thinking: These probe IDs are interesting, but I don't speak probe. Turns out not many speak probe. You'd really like to know what gene names, positions, protein domains, and more! are attached to your top hits.

There are a lot of different kinds of annotations that one can associate with probe ids. It also turns out there is a package in R: biomaRt that lets you obtain a great deal of information from Ensembl, which is the go to place to obtain these sorts of data.

Once again, this package is available from Bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

We have loaded this package from bioconductor for you already in CoCalc, so you only need to invoke the library() function to include this functionality into your analysis pipeline.

This package can do a lot of things. Check out this User Guide which gives you a range of examples. We'll walk through one of those, and give you some helper functions to 'troll' through this information, then some additional functions to help connect those data to your association results.

Copy, paste, and execute the following code in the cell below:

library(biomaRt)

In [0]:

Task One: Obtaining the biomart of interest

The first thing we need to do is to identify the biomart of information that we want to access. We can list mart using

listMarts()

Copy, paste, and execute the above code in the cell below:


In [0]:

In most cases, we'll be accessing Ensembl. But, we still need to find the organism we want to study. To get a list of those, we can use the function useMart(), followed by listDatasets():

mymart <- useMart("ensembl")
listDatasets(mymart)[,1]

This will give us a list of biomarts for all the organisms that have been curated by Ensembl. e.g. hsapiens_gene_ensembl for humans, or mmusculus_gene_ensembl for the mouse genome. Note the VERSION of the database you are accessing: genome builds can change the interpretations and specific positions of things -- be mindful when making comparisons with other databases if the anchoring to a genome build could influence your results or interpretation.

that said, we can invoke useMart() to obtain that biomart specifically.

Execute the follow code in the cell below:

mymart <- useMart("ensembl", "hsapiens_gene_ensembl")

In [0]:

Task Two: obtaining Probe IDs from your array

In order to make use of your biomart, you will need to obtain a list of probe IDs to 'lookup' in this data base. These can be found in various object that you create from your actual expression analysis. For example, you might recall using the top Table command:

tt <- topTable(efit, sort="P", n=100)

In the resulting table variable (i.e. tt), the names of each row correspond to each probe. In R, there is a function to extract the list of row names, easily enough called row.names:

affyid <- row.names(tt)

This will create a list of ids that you can search for.

Task Three: Obtaining annotations of interest

Next, we might want to obtain a complete listing of all the annotations that are available from the give mart. To do this, use the listAttributes() function.

Copy, paste, and execute the code in the cell below:

attrs <- listAttributes(mymart)

In [0]:

Then access attrs in various ways

> attrs[1:5,]

                   name            description
1       ensembl_gene_id        Ensembl Gene ID
2 ensembl_transcript_id  Ensembl Transcript ID
3    ensembl_peptide_id     Ensembl Protein ID
4       ensembl_exon_id        Ensembl Exon ID
5           description            Description

There are hundreds (even a thousand!) types of information: the name gives you the key you want to access, the description is just the human-readable description for it.

You might want to try to find specific information for annotation, even if you don't know what the annotation key is. Here's a quick little script to grep for a key word or two:

attrs[grep("(HGNC|Affy)",attrs[[2]]),]

which will looks for the keys "HGNC" and "Affy" in the list of annotations stored by the attrs array. The keys that we will want to look-up are listed as "name", so we'll want to save or store those names down somewhere (exactly!)

Once we have our list of keys, we can use another function, getBM(), to query ensembl to obtain the annotations we are interested in. This function take 4 arguments:

  • attributes: a list of keys that we want to look up (use the c() function to make that list)
  • filters: a filter to include entries which have the given key. in our cases, we want this to be our list of probes (e.g. only return annotations for probes that are found
  • values: the list of query terms that we are looking for. In our case, this is usually a list of our probeids
  • mart: the biomart to query <br >

So for example, imagine we had array data from the Affy GeneChip Human Genome U133 Plus 2.0 Array, a list of affy probes we were interested stored in a variable called 'affyid", wanted the keys for the the probeid and the entrez gene id for the probe, we could use:

lookup <- getBM(attributes=c('affy_hg_u133_plus_2','entrezgene'), filters='affy_hg_u133_plus_2', values=affyid,mart=mymart)

which would store our information into a variable called lookup.

Task Four: Merging annotations with our results.

Now we have a set of annotations stored in a new table. But those values are not connected to our topTable of results! We need a way to merge those objects together. To do this, we can use the function merge() to accomplish this.

Imagine we had two tables - called "mytopresults" and "lookup", where one of the columns in each table was an affyprobeid. We could use the following to merge the tables by that id by the following:

merge(mytopresults, lookup, by="affyprobeid", all=TRUE)

In practice, we will need to make sure:

  1. That the affyprobeid is actually an entry in our topTable: typically, these data are listed as the name of the row of the table, but are not actually an entry in that table.
  2. Ensure that the name of the column in both tables are named the same, and in this case, "affyprobeid".

To reassign the name of a specific column, you can use the colnames() R function:

colnames(mytopresults)[7] <- c("affyprobeid")

would change the name of the 7th column of the table "mytopresults" to "affyprobeid"

you can always check to see the names of the columns easily, to make sure they look right, e.g.:

mytopresults[1:5,]

Ontology Analysis

For an in-depth background on the scientific rationale, approaches, and issues, head to Canvas and watch the pre-recorded lecture that is assigned to the module.

For class, we'll be using the tool WebGestalt, which has a number of versitile analyses, background comparison, statistical controls. There are several good tools out there.