In this prelab, we will continue to think about analysis of gene expression on microarrays, covering two additional topics.
After you've performed your analysis, you might have been thinking: These probe IDs are interesting, but I don't speak probe. Turns out not many speak probe. You'd really like to know what gene names, positions, protein domains, and more! are attached to your top hits.
There are a lot of different kinds of annotations that one can associate with probe ids. It also turns out there is a package in R: biomaRt that lets you obtain a great deal of information from Ensembl, which is the go to place to obtain these sorts of data.
Once again, this package is available from Bioconductor:
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
We have loaded this package from bioconductor for you already in CoCalc, so you only need to invoke the library() function to include this functionality into your analysis pipeline.
This package can do a lot of things. Check out this User Guide which gives you a range of examples. We'll walk through one of those, and give you some helper functions to 'troll' through this information, then some additional functions to help connect those data to your association results.
Copy, paste, and execute the following code in the cell below:
library(biomaRt)
In [0]:
In [0]:
In most cases, we'll be accessing Ensembl. But, we still need to find the organism we want to study. To get a list of those, we can use the function useMart(), followed by listDatasets():
mymart <- useMart("ensembl")
listDatasets(mymart)[,1]
This will give us a list of biomarts for all the organisms that have been curated by Ensembl. e.g. hsapiens_gene_ensembl for humans, or mmusculus_gene_ensembl for the mouse genome. Note the VERSION of the database you are accessing: genome builds can change the interpretations and specific positions of things -- be mindful when making comparisons with other databases if the anchoring to a genome build could influence your results or interpretation.
that said, we can invoke useMart() to obtain that biomart specifically.
Execute the follow code in the cell below:
mymart <- useMart("ensembl", "hsapiens_gene_ensembl")
In [0]:
In order to make use of your biomart, you will need to obtain a list of probe IDs to 'lookup' in this data base. These can be found in various object that you create from your actual expression analysis. For example, you might recall using the top Table command:
tt <- topTable(efit, sort="P", n=100)
In the resulting table variable (i.e. tt), the names of each row correspond to each probe. In R, there is a function to extract the list of row names, easily enough called row.names:
affyid <- row.names(tt)
This will create a list of ids that you can search for.
In [0]:
Then access attrs in various ways
> attrs[1:5,]
name description
1 ensembl_gene_id Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3 ensembl_peptide_id Ensembl Protein ID
4 ensembl_exon_id Ensembl Exon ID
5 description Description
There are hundreds (even a thousand!) types of information: the name gives you the key you want to access, the description is just the human-readable description for it.
You might want to try to find specific information for annotation, even if you don't know what the annotation key is. Here's a quick little script to grep for a key word or two:
attrs[grep("(HGNC|Affy)",attrs[[2]]),]
which will looks for the keys "HGNC" and "Affy" in the list of annotations stored by the attrs array. The keys that we will want to look-up are listed as "name", so we'll want to save or store those names down somewhere (exactly!)
Once we have our list of keys, we can use another function, getBM(), to query ensembl to obtain the annotations we are interested in. This function take 4 arguments:
So for example, imagine we had array data from the Affy GeneChip Human Genome U133 Plus 2.0 Array, a list of affy probes we were interested stored in a variable called 'affyid", wanted the keys for the the probeid and the entrez gene id for the probe, we could use:
lookup <- getBM(attributes=c('affy_hg_u133_plus_2','entrezgene'), filters='affy_hg_u133_plus_2', values=affyid,mart=mymart)
which would store our information into a variable called lookup.
Now we have a set of annotations stored in a new table. But those values are not connected to our topTable of results! We need a way to merge those objects together. To do this, we can use the function merge() to accomplish this.
Imagine we had two tables - called "mytopresults" and "lookup", where one of the columns in each table was an affyprobeid. We could use the following to merge the tables by that id by the following:
merge(mytopresults, lookup, by="affyprobeid", all=TRUE)
In practice, we will need to make sure:
To reassign the name of a specific column, you can use the colnames() R function:
colnames(mytopresults)[7] <- c("affyprobeid")
would change the name of the 7th column of the table "mytopresults" to "affyprobeid"
you can always check to see the names of the columns easily, to make sure they look right, e.g.:
mytopresults[1:5,]
For an in-depth background on the scientific rationale, approaches, and issues, head to Canvas and watch the pre-recorded lecture that is assigned to the module.
For class, we'll be using the tool WebGestalt, which has a number of versitile analyses, background comparison, statistical controls. There are several good tools out there.