Introduction to Gene Expression Analysis by Arrays 2 - Prelab

For this prelab, read the example code and descriptions below, and answer the questions throughout.

I.1. Overview and Objectives

In this lab, we will introduce you to the analysis of gene expression data, as generated by microarrays. After this prelab and some in-class activities, you will:

  • Learn about Normalization: what and why and an R function to normalize data: rma()
  • Build a design module for the analysis of your data: model.matrix()
  • Perform differential expression analysis: eBayes()
  • Report p-values from the top results: topTable()
  • Output results from your expression analysis: write.table()

In the following prelab, we're going to walk through a range of commands, picking up where we left off in the previous class.

II.1. Analysis Pipeline: Normalization

As we can see from the histogram or box plots from the previous class, the distribution of intensity values differs across each CEL file. This is problematic, because we want intensity data to be comparable across different arrays (and across treatment conditions). To deal with this issue, we need to normalize the data in some way. As it turns out, there are many ways to normalize data, and many packages exist that implement a wide range of normalization procedures.

One robust way to normalize data is using the Robust Multichip Average (RMA) approach. To perform normalization using RMA, simply use:

genenorm <- rma(affyRaw)

These look much better, don't they?

II.2. Analysis Pipeline: Design Matrix for Analysis

Next up, we need to create a matrix that tells R what groups to compare against. In this example, we will design a simple, two group comparison (Liver vs. Spleen).

group <- factor(t(pData(phenoData)[, 1]))
group


design <- model.matrix(~group, pData(phenoData))

> design <br >                   (Intercept) GroupSpleen <br > MoGene-2_1_GA_Liver_1_C01.ga.cel     1     0 <br > MoGene-2_1_GA_Liver_A01.ga.cel      1     0 <br > MoGene-2_1_GA_Liver_C01.ga.cel      1     0 <br > MoGene-2_1_GA_Liver_G01.ga.cel      1     0 <br > MoGene-2_1_GA_Spleen_A01.ga.cel     1     1 <br > MoGene-2_1_GA_Spleen_C01.ga.cel     1     1 <br > MoGene-2_1_GA_Spleen_E01.ga.cel     1     1 <br > MoGene-2_1_GA_Spleen_G01.ga.cel     1     1 <br > attr(,"assign") <br > [1] 0 1 <br > attr(,"contrasts") <br > attr(,"contrasts")$Group <br > [1] "contr.treatment" <br >

As you can see, all of the CEL files that are Liver are set to "0", and Spleen "1", and the appropriate object has been created. Here, we will measure expression in Spleen (=1), relative to Liver (=0).

Q: How many samples are being compared in each group?

II.3. Analysis Pipeline: Performing differential expression analysis and summarizing results

Now that we have our design matrix in hand, we can now prepare to test our probes for differences in gene expression. To perform this analysis, we will use linear regression to determine if the intensity of the probe is correlated with the effect the treatment (in this case, tissue), and if so, estimate the effect of the probe intensity (β). Because the number of arrays we typically analyze is small, we need to use moderated t-statistics and an empirical Bayes procedure, equivalent to shrinkage of the estimated sample variances of the probe intensities towards a pooled estimate, resulting in better stability for inference (see Smith 2004 for more details).

fit <- lmFit(genenorm, design)
efit <- eBayes(fit)

The result is a large table of scores! We can use another function, topTable(), to obtain results

tt <- topTable(efit, sort = "P", n = 500)

Here, I have opted to sort the Table by P-value, and am only recording the top 500 results in the scan. If I only wanted a table with all the results, I could set n=Inf instead, but keep in mind that will generate a VERY large file (>200 Mb).

> tt[1:5, ] <br >        logFC AveExpr     t   P.Value   adj.P.Val    B <br > 17405908 -4.136117 6.911107 -67.54411 5.597037e-13 1.415179e-08 20.03929 <br > 17284936 -4.116741 6.375286 -63.26197 9.795902e-13 1.415179e-08 19.60969 <br > 17419622 -3.586706 7.065164 -61.67103 1.217751e-12 1.415179e-08 19.43799 <br > 17472364 -4.903165 5.905147 -59.14892 1.739743e-12 1.415179e-08 19.15110 <br > 17219356 -3.681500 7.225212 -58.70824 1.854509e-12 1.415179e-08 19.09902 <br >

Not surprisingly, there are several probes which are differently expressed between liver and spleen, in this case, expressed lower in Spleen relative to Liver (note the negative values here). Positive scores here would indicate higher expression in Spleen, relative to Liver.

IMPORTANT NOTE: Capital letters matter for the function names here! eBayes() is not the same as ebayes(); topTable() also is not the same as toptable().

BE SURE YOU ARE USING topTable() and eBayes() !!!!

Q: Given that ~20,000 genes tested, do any of these probes appear differentially expressed between Liver and Spleen? Why or why not?

II.4. Analysis Pipeline: Output results to file

Often, we will want to output our results to a file, so that we can summarize, save results, and extract information more readily using other tools. We can use write.table for this:

write.table(tt, file = "example_allresults.txt", quote = F, sep = ",", row.names = T, col.names = NA);

Note that adding col.names = NA prevents the header in the file from being "off-by-one" column. e.g. the first column is not logFC, but the probeID.

Also note here that in this case, we're separating the results by comma (sep = ","), which generally should be fine. In some cases, you may have a table that contains not just numbers, but also words or descriptions (which may themselves include whitespaces, or even commas!).

If that is the case, you can set quote = T, which will keep those units together. (parsing those data files in UNIX can be a trick issue, however).