---
title: "UNIVR Applied Statistics for High-throughput Biology 2018: Day 2"
author: "Levi Waldron"
date: "Mar 8, 2018"
output:
  html_document:
    keep_md: yes
    number_sections: yes
    toc: yes
---
Textbook sources:
```r
table(spider$leg, spider$type)
summary(spider)
```
```r
boxplot(spider$friction ~ spider$type * spider$leg,
        col = c("grey90", "grey40"), las = 2,
        main = "Friction coefficients of different leg pairs")
```
Examples of linear models include simple linear regression, multiple linear regression, ANOVA, and ANCOVA.
Random part of model:
$y_i = E[y_i|x_i] + \epsilon_i$
Assumptions of linear models: $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)$
Comparisons are made with respect to a reference category, e.g. L1:

- L2 = {1 if $2^{nd}$ leg pair, 0 otherwise}
- L3 = {1 if $3^{rd}$ leg pair, 0 otherwise}
- L4 = {1 if $4^{th}$ leg pair, 0 otherwise}

R re-codes factors to dummy variables automatically.
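The automatic dummy coding can be inspected with `model.matrix()`. A minimal sketch using a small stand-in factor (the `spider` data frame itself is loaded elsewhere in the course):

```r
## Hypothetical four-level factor mimicking the leg-pair variable
leg <- factor(c("L1", "L2", "L3", "L4"))

## model.matrix() shows the dummy variables R creates behind the scenes
mm <- model.matrix(~ leg)
mm

## The first level (L1) is the reference: it is absorbed into the
## intercept, and legL2, legL3, legL4 are 0/1 indicator columns.
colnames(mm)
```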
Using the `spider` object:

- Create a new variable `leg2` in which L1/L2 are merged and L3/L4 are merged.
- Create a new variable `leg3` that is an "ordered" factor (see `?factor`).
- Relevel `spider$type` (see `?relevel`).

`aov()`, `lm()`, `glm()`, and `coxph()` use a "model formula" interface:

> response variable ~ explanatory variables
Model formula for simple linear regression:
> y ~ x
Friction coefficient for leg type of first leg pair:
```r
spider.sub <- spider[spider$leg == "L1", ]
fit <- lm(friction ~ type, data = spider.sub)
summary(fit)
```
```r
fit.table <- xtable::xtable(fit, label = NULL)
print(fit.table, type = "html")
```
Remember there are positions 1-4
```r
fit <- lm(friction ~ leg, data = spider)
```
```r
fit.table <- xtable::xtable(fit, label = NULL)
print(fit.table, type = "html")
```
Additional explanatory variables can be added as follows:
> y ~ x + z
Note that "+" does not have its usual meaning, which would be achieved by:
> y ~ I(x + z)
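A small simulated sketch of the difference (the variables `x`, `z`, and `y` here are made up for illustration):

```r
## Simulated data: y depends on x and z with different slopes
set.seed(42)
x <- rnorm(20)
z <- rnorm(20)
y <- 1 + 2 * x + 3 * z + rnorm(20)

## "+" in a formula adds separate terms: one slope per variable
coef(lm(y ~ x + z))      # intercept plus two slopes

## I() protects arithmetic: the model uses the literal sum x + z
coef(lm(y ~ I(x + z)))   # intercept plus a single slope
```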
Remember there are positions 1-4
```r
fit <- lm(friction ~ type + leg, data = spider)
```
```r
fit.table <- xtable::xtable(fit, label = NULL)
print(fit.table, type = "html")
```
Interaction is modeled as the product of two covariates:

$$ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 $$
Image credit: http://personal.stevens.edu/~ysakamot/
| symbol | example | meaning |
|---|---|---|
| + | + x | include this variable |
| - | - x | delete this variable |
| : | x : z | include the interaction between these variables |
| * | x * z | include these variables and their interactions |
| ^ | (u + v + w)^3 | include these variables and all interactions up to three-way |
| 1 | - 1 | intercept (delete it with `- 1`) |
Note: order generally doesn't matter (u+v OR v+u)
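These formula operators can be checked directly with simulated factors (the data here are made up; `u` and `v` are placeholder names):

```r
## Simulated two-level factors and a numeric response
set.seed(1)
u <- factor(sample(c("a", "b"), 40, replace = TRUE))
v <- factor(sample(c("c", "d"), 40, replace = TRUE))
y <- rnorm(40)

## u * v is shorthand for u + v + u:v
same_fit <- all.equal(fitted(lm(y ~ u * v)),
                      fitted(lm(y ~ u + v + u:v)))

## term order does not change the fitted model
same_order <- all.equal(fitted(lm(y ~ u + v)),
                        fitted(lm(y ~ v + u)))
same_fit
same_order
```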
```r
lm(y ~ u + v)
```

- `u` and `v` factors: ANOVA
- `u` and `v` numeric: multiple regression
- one factor, one numeric: ANCOVA
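The same `lm()` call covers all three cases depending on covariate types. A sketch on simulated data (variable names `grp`, `dose`, `y` are hypothetical):

```r
## Simulated covariates: one factor, one numeric
set.seed(2)
grp  <- factor(rep(c("a", "b", "c"), each = 10))  # factor covariate
dose <- rnorm(30)                                 # numeric covariate
y <- rnorm(30)

anova(lm(y ~ grp))        # factor only: one-way ANOVA
coef(lm(y ~ dose))        # numeric only: simple regression
coef(lm(y ~ grp + dose))  # factor + numeric: ANCOVA
```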
Perform regression on the spider dataset to model friction as a function of type, leg, and the interaction of type and leg. Interpret the output of this regression.
Linear regression is useful, e.g., for log-transformed microarray data.
The model: $y_i = E[y|x] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i$
Random component of $y_i$ is normally distributed: $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)$
Systematic component (linear predictor): $\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}$
Link function here is the identity link: $g(E(y | x)) = E(y | x)$. We are modeling the mean directly, no transformation.
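Because the identity link with Gaussian errors is just ordinary linear regression, `lm()` and `glm(family = gaussian())` give the same estimates. A quick check on simulated data:

```r
## Simulated data for a simple linear relationship
set.seed(7)
x <- rnorm(30)
y <- 2 + 0.5 * x + rnorm(30)

coef_lm  <- coef(lm(y ~ x))
coef_glm <- coef(glm(y ~ x, family = gaussian()))
coef_lm
coef_glm  # identical to the lm() estimates
```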
Useful for binary outcomes, e.g. Single Nucleotide Polymorphisms or somatic variants
The model: $$ \text{logit}(P(x)) = \log \left( \frac{P(x)}{1-P(x)} \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} $$
Random component: $y_i$ follows a Binomial distribution (outcome is a binary variable)
Systematic component: linear predictor $$ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} $$
Link function: logit (log of the odds that the event occurs)
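A sketch of fitting a logistic GLM in R on simulated binary data (the true coefficients -1 and 2 are arbitrary choices for illustration):

```r
## Simulate binary outcomes from a logistic model
set.seed(3)
x <- rnorm(100)
p <- plogis(-1 + 2 * x)             # inverse-logit of the linear predictor
y <- rbinom(100, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)            # estimates on the log-odds scale
exp(coef(fit)["x"])  # odds ratio per unit increase in x
```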
The systematic part of the GLM is:

$$ \log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \log(t_i) $$
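In R the $\log(t_i)$ term is supplied with `offset()`, which fixes its coefficient at 1. A sketch on simulated counts (the exposure variable `expos` and the true coefficients are made up):

```r
## Simulated count data whose rate scales with an exposure variable
set.seed(5)
expos <- runif(50, 1, 10)                # exposure t_i (e.g. time, depth)
x <- rnorm(50)
mu <- exp(0.2 + 0.7 * x + log(expos))    # mean scales linearly with exposure
y <- rpois(50, mu)

## offset(log(expos)) enters the linear predictor with coefficient fixed at 1
fit <- glm(y ~ x + offset(log(expos)), family = poisson(link = "log"))
coef(fit)  # only beta0 and beta1 are estimated
```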
The Negative Binomial distribution has two parameters: a target number of successes $r$ (`size`) and the probability of success $p$; it describes the number of failures that occur before the $r^{th}$ success.
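R's `dnbinom()`/`rnbinom()` implement this distribution, including an alternative mean (`mu`) parameterization that makes the Poisson-with-overdispersion interpretation explicit (the parameter values below are arbitrary):

```r
## P(0..3 failures before the 5th success) with p = 0.5
dnbinom(0:3, size = 5, prob = 0.5)

## Mean/variance relationship under the mu parameterization:
## Var(y) = mu + mu^2 / size  (overdispersed relative to Poisson)
mu <- 4; size <- 2
draws <- rnbinom(1e5, size = size, mu = mu)
c(mean = mean(draws), var = var(draws), theoretical_var = mu + mu^2 / size)
```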
Simulate the following vectors of p-values:
```r
set.seed(1)
p1 <- runif(1000)
p2 <- c(runif(900), runif(100, min = 0, max = 0.05))
```
Use the `p.adjust` function to calculate Bonferroni-corrected ("bonferroni") p-values and Benjamini-Hochberg False Discovery Rate ("BH") adjusted p-values. For each of these simulations, how many p-values are < 0.05 for the raw, Bonferroni-adjusted, and BH-adjusted values?

Generalized Linear Models extend linear regression to binary $y$ (logistic regression) and count $y$ (log-linear regression with e.g. Poisson or Negative Binomial distributions).
Create a binary friction variable "binfriction" by splitting friction at its median into "high" and "low". Make sure "low" is the reference category (one way to do this is with `?cut`). Repeat the above linear regressions using logistic regression with "binfriction" as the response variable. Compare the results to what you saw in the linear regressions.