Introduction to Gene Expression Analysis by Arrays - Prelab

For this prelab, read the material below and execute only the commands followed by an empty code cell.

I.1. Overview and Objectives

In this lab, we will introduce you to the analysis of gene expression data, as generated by microarrays. After this prelab and some in-class activities, you will:

Use R and the Bioconductor package to load libraries and tools required for the analysis of these data
Read data and files into R that you will need for analysis: CEL files, and phenotypes
Learn about several useful functions for viewing data: hist(), image(), boxplot()

For those also interested, we include a couple of optional features which may be of interest:

Visualize results of analysis via a Quantile-Quantile (QQ) plot.
Another 'nifty' functions: heatmap, volcano plots.

In the following prelab, we're going to walk through a range of commands (and motivatory details).

II. Brief review: Technology for Gene Expression Analysis

It turns out that there are a wide range of technologies used to assay gene expression in the genome. A short review of the types of microarray technologies are given in the first part of a paper by Miller and Tang (2009). (see paper in Canvas)

But not only do the types of technology used to assay gene expression differ, the types of transcripts that are assayed by a given microarray will also differ:

One might want to use a small collection of probes to query all transcripts.
Or, probes for every exon in all transcripts.
Or alternatively still, only a subset of all transcripts (ie. 'onco-chips' which select only cancer-related oncogenes or tumor supressor genes).
And finally, the probes one designs for a mouse will be different for that for humans.

__It is important you know what type of array experiment design you are performing, or analyzing__. Specifically, this has importantly implications for pathway enrichment analysis. But more practically, you need to be aware of what annotation files to use, for later analyses.

III. Brief review: Motivation for Gene Expression Studies

If this is a new scientific area to you, we've provided a couple of review papers to you to read:

(1) a short PloS Comp Biology paper from Slonim and Yanai (2009). (2) a longer review paper from Alison et al. (2006).

These papers can be found on Canvas. Broadly, I have summarized the following take home messages:

Gene expression microarrays allow the survey of the expression of virtual all genes in the genome, at a specfic point in time. Often, in different treated or untreated states.
One objective of these experiments is to determine if the expression of a given gene is different between treatments, conditions, or tissues: i.e. H0 = The condition has no effect on the expression of a gene (β = 0) H1 = The condition has an effect on the expression of a gene (β != 0) and our goal is to determine if β is significantly different from zero.
Here, note that we are describing the hypothesis test for a single gene. In a single microarray experiment, we assay thousands of genes, simultaneously. As a result, mutliple-hypothesis testing issues apply. In other words, we should not consider all genes that have a nominal P < 0.05 to "significantly differ" in levels of gene expression (e.g. if 1000 genes were tested, we'd expect 50 by chance to score P < 0.05. Not what we want!). The levels of significance require burden of proof given the number of genes (and hypotheses) tested. There are ways apply correction for mutliple testing, and we'll go over some of that later.
A corollary to this question is to identify specific sets of genes (or pathways) that differ between these groups; typically, we think about the use of gene sets from Gene Ontology or curated sets of Pathways to test these hypotheses.
The design of an experiment follows this general workflow:

We are going to focus our modules here on the use of Affymetrix microarrays. The major file format by which data is provided by these arrays are files in the .CEL format. The CEL file is a file that summarizes of the results of the intensity calculations on the pixel values read from the microarray. These files contain a bunch of information how how the intensity value for the given probe was calculated, and summary statistics for that intensity (mean, standard deviation), in addition to perhaps other pieces of information about the experiment performed. Unfortunately, these files are quite large and, consequently, are stored in a binary (not-human readable) format: if you try to 'peek' at them in UNIX (using less, or more, or head), you'll probably get a bunch of jibberish. Note that there are ways to convert a CEL file into a human readable format, but we won't go into those details here.

IV. Finding Data: the Gene Expression Omnibus (GEO) catalog

There is a huge trove of expression data sets to be found at GEO. These data are nicely annotated, organized, and easily searchable. Take a moment to search around this index, search for your favorite terms, see what you can find!

To look at one example data set, search for: GSE47516

If we search for this term, we get a lot of specific information about the experiment, the samples that were assayed, and the platform that was used. In this case, 3 groups of n=7 samples, related to cerebellum were collected on the Affymetrix Mouse 430 2.0 Array.

You will also notice a link to a file archive (.tar) which contains all of the CEL data that is referred to here. If you want, this is a place where you can download data. However, if we know the GSE number, we can use R to download the file directly from GEO (see below).

IV. Bioconductor: The Veritable "Swiss-Army" Knife for High-throughput Data Analysis

After a great deal of work from a large community of bioinformaticians, there are a number of tools available to you which can be used for the analysis of microarray gene expression data. Many of these tools are housed in a vault called Bioconductor. Bioconductor makes all of these tools available to you, via R:

source("https://bioconductor.org/biocLite.R")
biocLite()

Bioconductor contains libraries for tools to load and process data, to normalize, make plots, visualize, perform analysis, annotations, and more. If there is something you want to do with your array data, chances are there is a package (and a function) somewhere in this repository. The website has lots of searchable packages, so that's a great place to start.

With this introduction in hand, we are now ready to 'wade' into the various packages, functions, and procedures you will use for a gene expression analysis.

V.1. Analysis Pipeline: Packages in R: How, where, what?

In this section, we'll review the commands to load packages in R, and specifically, list some of the key packages you will need to load when performing a differential gene-expression microarray study.

For sanity, I like to make sure my packages are loaded from bioconductor, then loaded into R. There are two packages which you will almost always want to load:

GEOquery: a package to automatically obtain data from the Gene Expression Omnibus
oligo: a package to analyze oligonucleotide arrays (expression/SNP/tiling/exon) at probe-level.
limma: Data analysis, linear models and differential expression for microarray data.

If we weren't using CoCalc, we'd need to initialize libraries we need from bioconductor using the function biocLite():

## this is the section for drawing packages from Bioconductor
source("https://bioconductor.org/biocLite.R")
biocLite("GEOquery")
biocLite("oligo") 
biocLite("limma")

However, on CoCalc, we have installed this for you already, so you only need to call the library() function.

Next, we'll load the libraries into R using library(). Copy these libraries and execute in the cell below:

library(GEOquery)
library(oligo)
library(limma)



In [0]:

To keep code organized, I like to place all of the packages I load at the beginning of my R pipeline. Once my packages are loaded, then proceed subsequently to perform analysis steps. In that way, my workspace is "initialized" and all I need do it copy and paste the library/package initialization steps when running my R pipeline.

Next, given the array we are analyzing, we will want to attached these data to the specific annotation of the array we are analyzing: what type of array, and what organism.

Fortunately, when you load in your data below, the Bioconductor package attempts to identify the annotations automatically for you.

In general, you will (and probably should not) try to change this.

It may be the case that you end up working with an array that does not automatically load annotations, but you may be able to find them online. For example, if we used the Affymetrix "Mouse Gene Array 2.1 ST" array, we could use the following:

## this is the section where I draw annotation packages for my array from bioconductor
biocLite(pd.mogene.2.1.st)

## this is the section where I load annotation packages into R
library(pd.mogene.2.1.st)

The annotation files you will require depend on the particulars of your experiment: what organism, what version of array, what type of experiment. Don't guess here! Your annotation files must be the same as what is expected for your array. In the pipeline below, one of the functions will try to be 'smart' and detect the library that it thinks it should use for your data, and will install the correct annotation libraries for you. However, if you see that this is not correct, you will need to adjust accordingly.

A list of annotation packages can be found here.

Now that we have the GEOquery package loaded, let's imagine we wanted to download a set of data from the GEO repository, rather than copy and upload directly. For our GSE47516 data set, we can use:

getGEOSuppFiles("GSE47516")

This will create a directory (named the data set we downloaded); inside, there will be the .tar archive file we requested. We'll have to use UNIX (in a terminal) to untar the file, and then to gunzip the file contents:

cd GSE47516
tar -xvf GSE47516_RAW.tar
gunzip *.gz

NOTE: You will not have to create a new directory to store your data: if you download the data directly from GEO, it will make a directory already for you. All you will have to do is untarball and decompress the data!

In the lab tomorrow, you will be performing analysis on a given set of affymetrix cel file data. For this prelab, we have created for you an example file phenotype file which provides a description (annotation) for those data. Let's take a look at what is in there.

Let's use R to see what's in this file. The code below should look familiar to you. Copy the code below and execute it:

mydata <- read.table(file="example-pheno.csv", sep=",",header=T)
mydata



In [0]:

V.2. Analysis Pipeline: The phenotype data file, loading into R

You will need to prepare a file which connects each of the CEL files that you will analyze to a treatment condition that you want to compare.

NOTE: This is a file that you will have to prepare in the lab!

An easy way to make this file is to create a comma separated file with the sample and the condition. In the example described in this prelab, we have data from 4 Liver and 4 Spleens from mouse, that we want to compare gene expression of.

A couple of notes/hints for preparing this file:

Make sure you do not have any extra lines (i.e., newlines, returns) at the end of the file. Each line of your file should only contain entries referring to a .CEL file (not blank lines)
Make sure that your "treatment" column is not unique - this column represents how you want to "group" the data for analysis. In this example experiment, we want to compare the four samples of Liver to Spleen, so even though each CEL file corresponds to a unique specimen, we're going to group them together with two labels: "Liver" for the four specimens from Liver, and "Spleen" for the four specimens from Spleen. If we did not do this, and we labelled each sample with a unique label, R would think we had 8 different treatment groups, not 2, which is not what we want.

Once you have prepared this phenotype file to describe your data, you will need load the file into R using the read.AnnotatedDataFrame() function.

For example, given the example phenotype file we provided you in the prelab, you could use:

phenoData <- read.AnnotatedDataFrame("example-pheno.csv", header=TRUE, sep=",")

This will create a variable that effectively stores the information from our phenotype file. Note the argument include a header (=True), and that the content of the file are separated by a comma.

we can get a little summary of that data to check that it's loaded as intended in R via the following:

> pData(phenoData) Tissue MoGene-2_1_GA_Liver_1_C01.ga.cel Liver MoGene-2_1_GA_Liver_A01.ga.cel Liver MoGene-2_1_GA_Liver_C01.ga.cel Liver MoGene-2_1_GA_Liver_G01.ga.cel Liver MoGene-2_1_GA_Spleen_A01.ga.cel Spleen MoGene-2_1_GA_Spleen_C01.ga.cel Spleen MoGene-2_1_GA_Spleen_E01.ga.cel Spleen MoGene-2_1_GA_Spleen_G01.ga.cel Spleen

Here, we can see that the phenotype file has 8 cel files, one set where the Tissue is liver, the other Spleen.

V.3. Analysis Pipeline: Loading cel file data into R

Now, we'll want to load the expression array (CEL) files into R for analysis. I like to get my data organized so that I don't have piles of array data strewn about, so typically I make a directory where all of these data are stored.

cel-data/

that contains all of our cel files, which references the separate assignment directory 16-18_Data_GeneExpression (UNIX users will recognize this as a symbolic link to that directory).

We can get a list of the files that are housed in that directory:

celFiles <- list.celfiles("../16-18_Data_GeneExpression/cel-data",full.name=T)

To read those data into R using the read.celfiles() command we use:

affyRaw <- read.celfiles(celFiles)

At this point, R will try to identify the proper annotation library for you. In this case, the function correctly identified these data as Affymetrix 2.1 Gene ST array data from mouse, and is using this set of annotations

pd.mogene.2.1.st

V.4. Analysis Pipeline: Some visualization functions

After loading the CEL data into R, it is often useful to look at high level summaries and plots of the data, before we get down to performing analysis. In this prelab, we are going to walk through a couple of basic functions, but not an exhaustive listing of them. For your own data and analysis, you will want to explore the data in much more detail, check outliers, etc.

A user's guide for the oligo() package contains a large number of useful details which could be explored for more in depth purposes.

hist() and boxplot()

These functions allow you to visualize the distribution of intensities of the probes assayed in each CEL file:

boxplot(affyRaw, which=c("all"))
hist(affyRaw, which=c("all"))

will generate the following two plots:

image()

Sometimes there are spatial feature on the arrays, and you might want to visually inspect your arrays.

image(affyRaw[,1])

Bonus material

There are a cople of other nice function we might care about plotting.

Volcano plot: these help us visualize the distribution, at a high level, of the significance of association by the fold change in expression.

volcanoplot(efit,coef=2)

QQ-plot: This plot is designed to help determine if the distribution of association statistics is reasonably well calibrated. In general, for most experiments, we expect only a handful of positive findings; most results will not show differences in probe intensity across treatment (i.e. are "null" distributed). If all results were, in fact, null, and there were no excess number of significant associations, the scores of our statistics should line up on the x=y line.

qqt(efit$t[,2],df=efit$df.residual+efit$df.prior)
abline(0,1)

Outputting Expression values: In many cases, you may want to store or extract the expression values for your normalized data. You can use exprs() to do this (here, I'm reporting the first 100 probes)

exp_data <- exprs(genenorm[1:100,])

and then write data to file

write.table(exp_data,file="example_expr_output.txt",quote=F,sep=" ",row.names=T,col.names=T);

Heatmap: These plots allow you to visualize see larger patterns of co-expression across treatment groups. For example, to generate a heatmap for using the first 100 probes from your normalized gene expression data, you could use:

 #add me to libary steps
 source("https://bioconductor.org/biocLite.R")
 biocLite("Heatplus")
 library(Heatplus)

 reg1 <- regHeatmap(exprs(genenorm[1:100,]))
 plot(reg1)

There are a lot of different kinds of options for heatmaps. See the Heatplus user guide for more details with examples.