Book Chapter 2:
Book chapter 7:
“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey
Example: individual heights compared to a normal distribution and to t-distributions with df=12
and df=3
.
In [ ]:
suppressPackageStartupMessages(library(UsingR))
suppressPackageStartupMessages(library(rafalib))
# height qq plot
x <- father.son$fheight
ps <- (seq(0,99) + 0.5 ) / 100
qs <- quantile(x, ps)
normalqs <- qnorm(ps, mean(x), popsd(x))
par(mfrow=c(1,3))
plot(normalqs, qs, xlab = "Normal Percentiles", ylab = "Height Percentiles", main = "Height Q-Q Plot")
abline(0,1)
# t-distributed for 12 df
x <- rt(1000, 12)
qqnorm(x, xlab="t quantiles", main = "T Quantiles (df=12) Q-Q Plot", ylim=c(-6,6))
qqline(x)
# t-distributed for 3 df
x <- rt(1000, 3)
qqnorm(x, xlab="t quantiles", main = "T Quantiles (df=3) Q-Q Plot", ylim=c(-6,6))
qqline(x)
Example: Three different views of a continuous variable (executive pay compensation)
In [ ]:
par(mfrow=c(1, 3))
hist(exec.pay, main = "CEO Compensation")
qqnorm(exec.pay, main = "CEO Compensation")
boxplot(exec.pay, ylab="10,000s of dollars", ylim=c(0,400), main = "CEO Compensation")
In [ ]:
par(mfrow=c(1,3))
plot(father.son$fheight, father.son$sheight,xlab="Father's height in inches",ylab="Son's height in inches",main=paste("correlation =",signif(cor(father.son$fheight, father.son$sheight),2)))
plot(cars$speed, cars$dist,xlab="Speed",ylab="Stopping Distance",main=paste("correlation =",signif(cor(cars$speed, cars$dist),2)))
plot(faithful$eruptions, faithful$waiting,xlab="Eruption Duration",ylab="Waiting Time",main=paste("correlation =",signif(cor(faithful$eruptions, faithful$waiting),2)))
In [ ]:
library(BiocInstaller)
biocLite("genomicsclass/GSE5859Subset") #install from Github
In [ ]:
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
rbind(dim(geneExpression), dim(sampleInfo))
head(sampleInfo)
In [ ]:
library(genefilter)
g <- factor(sampleInfo$group)
summary(g)
In [ ]:
system.time(results <- rowttests(geneExpression, g))
pvals <- results$p.value
In [ ]:
head(results)
In [ ]:
par(mar=c(4, 4, 0, 0))
plot(results$dm, -log10(results$p.value),
xlab="Effect size (difference in group means)",
ylab="- log (base 10) p-values")
abline(h=-log10(0.05 / nrow(geneExpression)), col="red")
legend("bottomright", lty=1, col="red", legend="Bonferroni = 0.05")
If all null hypotheses are true, expect a flat histogram of p-values. The following shows the histogram of p-values next to p-values generated by tests on independent, random normal data.
In [ ]:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1); randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData, g)$p.value
par(mfrow=c(1, 2))
hist(pvals,ylim=c(0,1400))
hist(nullpvals,ylim=c(0,1400))
In [ ]:
permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)
ComBat
: corrects for known batch effects by linear model), and sva
: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-valuesAll are available from the sva Bioconductor package
Permute the labels g, perform the row t-tests, and create a p-value histogram. Do this several times without re-setting the random number generating seed, so you can see the variability in this histogram. Do this using both the randomData
matrix and the geneExpression
matrix and put the histograms side by side, as above.
In [ ]:
rafalib::mypar(1,2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))
In [ ]:
pseudo0 <- rowMeans(geneExpression[, g==0])
pseudo1 <- rowMeans(geneExpression[, g==1])
plot((pseudo0 + pseudo1) / 2, (pseudo0 - pseudo1))
In [ ]:
ge = geneExpression[sample(1:nrow(geneExpression), 200), ]
pheatmap::pheatmap(ge, scale="row", show_colnames = F, show_rownames = F)
In [ ]:
rowvar <- apply(geneExpression, 1, var)
ge2 <- geneExpression[order(rowvar, decreasing=TRUE) <= 200, ]
pheatmap::pheatmap(ge2, scale="row", show_colnames = F, show_rownames = F)
Combination of RColorBrewer
package and colorRampPalette()
can create anything you want
"Pie charts are a very bad way of displaying information." - R Help
Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf
A metric satisfies the following five properties:
In [ ]:
##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
table(tissue) ##tissue[i] corresponds to e[,i]
Interested in identifying similar samples and similar genes
Euclidean distance as for two dimensions. E.g., the distance between two samples $i$ and $j$ is:
$$ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } $$and the distance between two features $h$ and $g$ is:
$$ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } $$The distance between samples $i$ and $j$ can be written as:
$$ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }$$with $\mathbf{Y}_i$ and $\mathbf{Y}_j$ columns $i$ and $j$.
In [ ]:
t(matrix(1:3, ncol=1))
matrix(1:3, ncol=1)
t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)
In [ ]:
kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))
sqrt(sum((kidney1 - colon1)^2))
In [ ]:
dim(e)
(d <- dist(t(e[, c(1, 2, 87)])))
class(d)
In [ ]:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
dist
class output from dist()
is used for many clustering algorithms and heatmap functionsCaution: dist(e)
creates a nrow(e)
x nrow(e)
matrix that will probably crash your R session.
Simulate the heights of twin pairs:
In [ ]:
library(MASS)
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
cor(t(y))
In [ ]:
plot(t(y))
SVD generalizes the example rotation we looked at:
$$\mathbf{Y} = \mathbf{UDV}^\top$$note: the above formulation is for $m > n$
$\mathbf{Y}$: the m rows x n cols matrix of measurements
Scaling:
In [ ]:
system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))
system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
In [ ]:
SVD:
In [ ]:
s <- svd(e.standardize.fast)
names(s)
In [ ]:
dim(s$u) # loadings
length(s$d) # eigenvalues
dim(s$v) # d %*% vT = scores
In [ ]:
rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])
In [ ]:
plot(p$rotation[, 1], xlab="Index of genes", ylab="Loadings of PC1",
main="Scores of PC1") #or, predict(p)
abline(h=c(-0.03, 0.04), col="red")
Genes with high PC1 loadings:
In [ ]:
e.pc1genes <- e.standardize.fast[p$rotation[, 1] < -0.03 | p$rotation[, 1] > 0.03, ]
pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=TRUE,
show_colnames = FALSE)
Alternatively as cumulative % variance explained (using cumsum()
function):
prcomp()
), scores $\mathbf{D V^T}$
In [ ]:
d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
In [ ]:
rafalib::mypar()
plot(mds, col=factor(tissue), pch=as.integer(factor(tissue)))
legend("topright", legend=levels(factor(tissue)), col=1:length(unique(tissue)),
bty='n', pch=1:length(unique(tissue)))
In [ ]:
library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)
In [ ]:
head(table(sampleInfo$date))
In [ ]:
year <- as.integer(format(sampleInfo$date,"%y") )
year <- year - min(year)
month = as.integer( format(sampleInfo$date,"%m") ) + 12*year
table(year, sampleInfo$ethnicity)
In [ ]:
pc <- prcomp(t(geneExpression), scale.=TRUE)
boxplot(pc$x[, 1] ~ month, varwidth=TRUE, notch=TRUE,
main="PC1 scores vs. month", xlab="Month", ylab="PC1")
In [ ]:
RColorBrewer::display.brewer.all(n=3, colorblindFriendly = TRUE)
Interpolate one color per month on a quantitative palette:
In [ ]:
col3 <- c(RColorBrewer::brewer.pal(n=3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space="Lab")(length(unique(month)))
In [ ]:
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col=MYcols[as.integer(factor(month))],
main="MDS shaded by batch")
In [ ]:
hcclass <- cutree(hclust(d), k=5)
table(hcclass, year)
Exploratory Data Analysis: http://genomicsclass.github.io/book/pages/exploratory_data_analysis_exercises.html
Given the above histogram, how many people are between the ages of 35 and 45?
The InsectSprays
data set is included in R. The dataset reports the counts of insects in agricultural experimental units treated with different insecticides. Make a boxplot and determine which insecticide appears to be most effective.
Download and load this dataset into R. Use exploratory data analysis tools to determine which two columns are different from the rest. Which is the column that is positively skewed?
Which is the column that is negatively skewed?
Let’s consider a random sample of finishers from the New York City Marathon in 2002. This dataset can be found in the UsingR package. Load the library and then load the nym.2002
dataset.
In [ ]:
suppressPackageStartupMessages(library(dplyr))
data(nym.2002, package="UsingR")
Use boxplots and histograms to compare the finishing times of males and females. Which of the following best describes the difference?
In [ ]:
A) Males and females have the same distribution.
B) Most males are faster than most women.
C) Male and females have similar right skewed distributions with the former, 20 minutes shifted to the left.
D) Both distribution are normally distributed with a difference in mean of about 30 minutes.
Use dplyr
to create two new data frames: males and females, with the data for each gender. For males, what is the Pearson correlation between age and time to finish?
For females, what is the Pearson correlation between age and time to finish?
If we interpret these correlations without visualizing the data, we would conclude that the older we get, the slower we run marathons, regardless of gender. Look at scatterplots and boxplots of times stratified by age groups (20-25, 25-30, etc..). After examining the data, what is a more reasonable conclusion?
In [ ]:
A) Finish times are constant up until about our 40s, then we get slower.
B) On average, finish times go up by about 7 minutes every five years.
C) The optimal age to run a marathon is 20-25.
D) Coding errors never happen: a five year old boy completed the 2012 NY city marathon.
When is it appropriate to use pie charts or donut charts? A) When you are hungry. B) To compare percentages. C) To compare values that add up to 100%. D) Never.
The use of pseudo-3D plots in the literature mostly adds: A) Pizzazz. B) The ability to see three dimensional data. C) Ability to discover. D) Confusion.