For this prelab, read the material below and execute only the commands followed by an empty code cell.
In this lab, we will introduce you to the analysis of gene expression data, as generated by microarrays. After this prelab and some in-class activities, you will:
For those also interested, we include a couple of optional features which may be of interest:
In the following prelab, we're going to walk through a range of commands (and motivatory details).
It turns out that there are a wide range of technologies used to assay gene expression in the genome. A short review of the types of microarray technologies are given in the first part of a paper by Miller and Tang (2009). (see paper in Canvas)
But not only do the types of technology used to assay gene expression differ, the types of transcripts that are assayed by a given microarray will also differ:
__It is important you know what type of array experiment design you are performing, or analyzing__. Specifically, this has importantly implications for pathway enrichment analysis. But more practically, you need to be aware of what annotation files to use, for later analyses.
If this is a new scientific area to you, we've provided a couple of review papers to you to read:
(1) a short PloS Comp Biology paper from Slonim and Yanai (2009). <br > (2) a longer review paper from Alison et al. (2006).
These papers can be found on Canvas. Broadly, I have summarized the following take home messages:
There is a huge trove of expression data sets to be found at GEO. These data are nicely annotated, organized, and easily searchable. Take a moment to search around this index, search for your favorite terms, see what you can find!
To look at one example data set, search for: GSE47516
If we search for this term, we get a lot of specific information about the experiment, the samples that were assayed, and the platform that was used. In this case, 3 groups of n=7 samples, related to cerebellum were collected on the Affymetrix Mouse 430 2.0 Array.
You will also notice a link to a file archive (.tar) which contains all of the CEL data that is referred to here. If you want, this is a place where you can download data. However, if we know the GSE number, we can use R to download the file directly from GEO (see below).
After a great deal of work from a large community of bioinformaticians, there are a number of tools available to you which can be used for the analysis of microarray gene expression data. Many of these tools are housed in a vault called Bioconductor. Bioconductor makes all of these tools available to you, via R:
source("https://bioconductor.org/biocLite.R")
biocLite()
Bioconductor contains libraries for tools to load and process data, to normalize, make plots, visualize, perform analysis, annotations, and more. If there is something you want to do with your array data, chances are there is a package (and a function) somewhere in this repository. The website has lots of searchable packages, so that's a great place to start. <br >
With this introduction in hand, we are now ready to 'wade' into the various packages, functions, and procedures you will use for a gene expression analysis.
In this section, we'll review the commands to load packages in R, and specifically, list some of the key packages you will need to load when performing a differential gene-expression microarray study.
For sanity, I like to make sure my packages are loaded from bioconductor, then loaded into R. There are two packages which you will almost always want to load:
If we weren't using CoCalc, we'd need to initialize libraries we need from bioconductor using the function biocLite():
## this is the section for drawing packages from Bioconductor
source("https://bioconductor.org/biocLite.R")
biocLite("GEOquery")
biocLite("oligo")
biocLite("limma")
However, on CoCalc, we have installed this for you already, so you only need to call the library() function.
Next, we'll load the libraries into R using library(). Copy these libraries and execute in the cell below:
library(GEOquery)
library(oligo)
library(limma)
In [0]:
To keep code organized, I like to place all of the packages I load at the beginning of my R pipeline. Once my packages are loaded, then proceed subsequently to perform analysis steps. In that way, my workspace is "initialized" and all I need do it copy and paste the library/package initialization steps when running my R pipeline.
Next, given the array we are analyzing, we will want to attached these data to the specific annotation of the array we are analyzing: what type of array, and what organism.
Fortunately, when you load in your data below, the Bioconductor package attempts to identify the annotations automatically for you.
In general, you will (and probably should not) try to change this.
It may be the case that you end up working with an array that does not automatically load annotations, but you may be able to find them online. For example, if we used the Affymetrix "Mouse Gene Array 2.1 ST" array, we could use the following:
## this is the section where I draw annotation packages for my array from bioconductor
biocLite(pd.mogene.2.1.st)
## this is the section where I load annotation packages into R
library(pd.mogene.2.1.st)
The annotation files you will require depend on the particulars of your experiment: what organism, what version of array, what type of experiment. <br > Don't guess here!<br > Your annotation files must be the same as what is expected for your array. In the pipeline below, one of the functions will try to be 'smart' and detect the library that it thinks it should use for your data, and will install the correct annotation libraries for you. However, if you see that this is not correct, you will need to adjust accordingly.
A list of annotation packages can be found here.
Now that we have the GEOquery package loaded, let's imagine we wanted to download a set of data from the GEO repository, rather than copy and upload directly. For our GSE47516 data set, we can use:
getGEOSuppFiles("GSE47516")
This will create a directory (named the data set we downloaded); inside, there will be the .tar archive file we requested. We'll have to use UNIX (in a terminal) to untar the file, and then to gunzip the file contents:
cd GSE47516
tar -xvf GSE47516_RAW.tar
gunzip *.gz
NOTE: You will not have to create a new directory to store your data: if you download the data directly from GEO, it will make a directory already for you. All you will have to do is untarball and decompress the data!
In the lab tomorrow, you will be performing analysis on a given set of affymetrix cel file data. For this prelab, we have created for you an example file phenotype file which provides a description (annotation) for those data. Let's take a look at what is in there.
Let's use R to see what's in this file. The code below should look familiar to you. Copy the code below and execute it:
mydata <- read.table(file="example-pheno.csv", sep=",",header=T)
mydata
In [0]:
You will need to prepare a file which connects each of the CEL files that you will analyze to a treatment condition that you want to compare.
NOTE: This is a file that you will have to prepare in the lab!
An easy way to make this file is to create a comma separated file with the sample and the condition. In the example described in this prelab, we have data from 4 Liver and 4 Spleens from mouse, that we want to compare gene expression of.
A couple of notes/hints for preparing this file:
Make sure you do not have any extra lines (i.e., newlines, returns) at the end of the file. Each line of your file should only contain entries referring to a .CEL file (not blank lines)
Make sure that your "treatment" column is not unique - this column represents how you want to "group" the data for analysis. In this example experiment, we want to compare the four samples of Liver to Spleen, so even though each CEL file corresponds to a unique specimen, we're going to group them together with two labels: "Liver" for the four specimens from Liver, and "Spleen" for the four specimens from Spleen. If we did not do this, and we labelled each sample with a unique label, R would think we had 8 different treatment groups, not 2, which is not what we want.
Once you have prepared this phenotype file to describe your data, you will need load the file into R using the read.AnnotatedDataFrame() function.
For example, given the example phenotype file we provided you in the prelab, you could use:
phenoData <- read.AnnotatedDataFrame("example-pheno.csv", header=TRUE, sep=",")
This will create a variable that effectively stores the information from our phenotype file. Note the argument include a header (=True), and that the content of the file are separated by a comma.
we can get a little summary of that data to check that it's loaded as intended in R via the following:
> pData(phenoData) <br > Tissue <br > MoGene-2_1_GA_Liver_1_C01.ga.cel Liver <br > MoGene-2_1_GA_Liver_A01.ga.cel Liver <br > MoGene-2_1_GA_Liver_C01.ga.cel Liver <br > MoGene-2_1_GA_Liver_G01.ga.cel Liver <br > MoGene-2_1_GA_Spleen_A01.ga.cel Spleen <br > MoGene-2_1_GA_Spleen_C01.ga.cel Spleen <br > MoGene-2_1_GA_Spleen_E01.ga.cel Spleen <br > MoGene-2_1_GA_Spleen_G01.ga.cel Spleen
Here, we can see that the phenotype file has 8 cel files, one set where the Tissue is liver, the other Spleen.
Now, we'll want to load the expression array (CEL) files into R for analysis. I like to get my data organized so that I don't have piles of array data strewn about, so typically I make a directory where all of these data are stored.
cel-data/
that contains all of our cel files, which references the separate assignment directory 16-18_Data_GeneExpression (UNIX users will recognize this as a symbolic link to that directory).
We can get a list of the files that are housed in that directory:
celFiles <- list.celfiles("../16-18_Data_GeneExpression/cel-data",full.name=T)
To read those data into R using the read.celfiles() command we use:
affyRaw <- read.celfiles(celFiles)
At this point, R will try to identify the proper annotation library for you. In this case, the function correctly identified these data as Affymetrix 2.1 Gene ST array data from mouse, and is using this set of annotations
pd.mogene.2.1.st
After loading the CEL data into R, it is often useful to look at high level summaries and plots of the data, before we get down to performing analysis. In this prelab, we are going to walk through a couple of basic functions, but not an exhaustive listing of them. For your own data and analysis, you will want to explore the data in much more detail, check outliers, etc.
A user's guide for the oligo() package contains a large number of useful details which could be explored for more in depth purposes.
These functions allow you to visualize the distribution of intensities of the probes assayed in each CEL file:
boxplot(affyRaw, which=c("all"))
hist(affyRaw, which=c("all"))
will generate the following two plots:
QQ-plot: This plot is designed to help determine if the distribution of association statistics is reasonably well calibrated. In general, for most experiments, we expect only a handful of positive findings; most results will not show differences in probe intensity across treatment (i.e. are "null" distributed). If all results were, in fact, null, and there were no excess number of significant associations, the scores of our statistics should line up on the x=y line.
qqt(efit$t[,2],df=efit$df.residual+efit$df.prior)
abline(0,1)
Outputting Expression values: In many cases, you may want to store or extract the expression values for your normalized data. You can use exprs() to do this (here, I'm reporting the first 100 probes)
exp_data <- exprs(genenorm[1:100,])
and then write data to file
write.table(exp_data,file="example_expr_output.txt",quote=F,sep=" ",row.names=T,col.names=T);
Heatmap: These plots allow you to visualize see larger patterns of co-expression across treatment groups. For example, to generate a heatmap for using the first 100 probes from your normalized gene expression data, you could use:
#add me to libary steps
source("https://bioconductor.org/biocLite.R")
biocLite("Heatplus")
library(Heatplus)
reg1 <- regHeatmap(exprs(genenorm[1:100,]))
plot(reg1)
There are a lot of different kinds of options for heatmaps. See the Heatplus user guide for more details with examples.