For this prelab, read the example code and descriptions below, and answer the questions throughout.
In this lab, we will introduce you to the analysis of gene expression data, as generated by microarrays. After this prelab and some in-class activities, you will:
rma()
model.matrix()
eBayes()
topTable()
write.table()
In the following prelab, we're going to walk through a range of commands, picking up where we left off in the previous class.
As we can see from the histogram or box plots from the previous class, the distribution of intensity values differs across each CEL file. This is problematic, because we want intensity data to be comparable across different arrays (and across treatment conditions). To deal with this issue, we need to normalize the data in some way. As it turns out, there are many ways to normalize data, and many packages exist that implement a wide range of normalization procedures.
One robust way to normalize data is using the Robust Multichip Average (RMA) approach. To perform normalization using RMA, simply use:
genenorm <- rma(affyRaw)
These look much better, don't they?
Next up, we need to create a matrix that tells R what groups to compare against. In this example, we will design a simple, two group comparison (Liver vs. Spleen).
group <- factor(t(pData(phenoData)[, 1]))
group
design <- model.matrix(~group, pData(phenoData))
> design <br > (Intercept) GroupSpleen <br > MoGene-2_1_GA_Liver_1_C01.ga.cel 1 0 <br > MoGene-2_1_GA_Liver_A01.ga.cel 1 0 <br > MoGene-2_1_GA_Liver_C01.ga.cel 1 0 <br > MoGene-2_1_GA_Liver_G01.ga.cel 1 0 <br > MoGene-2_1_GA_Spleen_A01.ga.cel 1 1 <br > MoGene-2_1_GA_Spleen_C01.ga.cel 1 1 <br > MoGene-2_1_GA_Spleen_E01.ga.cel 1 1 <br > MoGene-2_1_GA_Spleen_G01.ga.cel 1 1 <br > attr(,"assign") <br > [1] 0 1 <br > attr(,"contrasts") <br > attr(,"contrasts")$Group <br > [1] "contr.treatment" <br >
As you can see, all of the CEL files that are Liver are set to "0", and Spleen "1", and the appropriate object has been created. Here, we will measure expression in Spleen (=1), relative to Liver (=0).
Now that we have our design matrix in hand, we can now prepare to test our probes for differences in gene expression. To perform this analysis, we will use linear regression to determine if the intensity of the probe is correlated with the effect the treatment (in this case, tissue), and if so, estimate the effect of the probe intensity (β). Because the number of arrays we typically analyze is small, we need to use moderated t-statistics and an empirical Bayes procedure, equivalent to shrinkage of the estimated sample variances of the probe intensities towards a pooled estimate, resulting in better stability for inference (see Smith 2004 for more details).
fit <- lmFit(genenorm, design)
efit <- eBayes(fit)
The result is a large table of scores! We can use another function, topTable(), to obtain results
tt <- topTable(efit, sort = "P", n = 500)
Here, I have opted to sort the Table by P-value, and am only recording the top 500 results in the scan. If I only wanted a table with all the results, I could set n=Inf
instead, but keep in mind that will generate a VERY large file (>200 Mb).
> tt[1:5, ] <br > logFC AveExpr t P.Value adj.P.Val B <br > 17405908 -4.136117 6.911107 -67.54411 5.597037e-13 1.415179e-08 20.03929 <br > 17284936 -4.116741 6.375286 -63.26197 9.795902e-13 1.415179e-08 19.60969 <br > 17419622 -3.586706 7.065164 -61.67103 1.217751e-12 1.415179e-08 19.43799 <br > 17472364 -4.903165 5.905147 -59.14892 1.739743e-12 1.415179e-08 19.15110 <br > 17219356 -3.681500 7.225212 -58.70824 1.854509e-12 1.415179e-08 19.09902 <br >
Not surprisingly, there are several probes which are differently expressed between liver and spleen, in this case, expressed lower in Spleen relative to Liver (note the negative values here). Positive scores here would indicate higher expression in Spleen, relative to Liver.
IMPORTANT NOTE: Capital letters matter for the function names here! eBayes() is not the same as ebayes(); topTable() also is not the same as toptable().
BE SURE YOU ARE USING topTable() and eBayes() !!!!
Often, we will want to output our results to a file, so that we can summarize, save results, and extract information more readily using other tools. We can use write.table
for this:
write.table(tt, file = "example_allresults.txt", quote = F, sep = ",", row.names = T, col.names = NA);
Note that adding col.names = NA
prevents the header in the file from being "off-by-one" column. e.g. the first column is not logFC, but the probeID.
Also note here that in this case, we're separating the results by comma (sep = ","
), which generally should be fine. In some cases, you may have a table that contains not just numbers, but also words or descriptions (which may themselves include whitespaces, or even commas!).
If that is the case, you can set quote = T
, which will keep those units together. (parsing those data files in UNIX can be a trick issue, however).