dplyr
Hypothesis testing for categorical variables (Fisher's Test, Chi-square test)
Book chapters 0 and 1
Pseudo code | Example code |
---|---|
library(packagename) | library(dplyr) |
?functionname | ?select |
?package::functionname | ?dplyr::select |
? 'Reserved keyword or symbol' \color{blue}{(or backticks)} | ? '%>%' |
??searchforpossiblyexistingfunctionandortopic | ??simulate |
help(package = "loadedpackage") | help("dplyr") |
browseVignettes("packagename") | browseVignettes("dplyr") |
\tiny Slide credit: Marcel Ramos
Pseudo code:
In [1]:
## source("https://bioconductor.org/biocLite.R")
## packages <- c("packagename", "githubuser/repository", "biopackage")
## BiocInstaller::biocLite(packages)
devtools
numeric
(set seed to sync random number generator):
In [2]:
set.seed(1)
rnorm(5)
integer
:
In [3]:
sample( 1:5 )
logical
:
In [4]:
1:3 %in% 3
character
:
In [5]:
c("yes", "no")
In [6]:
factor(c("yes", "no"))
In [7]:
c(NA, NaN, -Inf, Inf)
In [8]:
matrix(1:9, nrow = 3)
The list
is a non-atomic vector:
In [9]:
measurements <- c( 1.3, 1.6, 3.2, 9.8, 10.2 )
parents <- c( "Parent1.name", "Parent2.name" )
my.list <- list( measurements, parents)
my.list
The data.frame
has list-like and matrix-like properties:
In [10]:
x <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
my.df <- data.frame(x,y,z,a, stringsAsFactors = FALSE)
In [11]:
suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
var2 = 1:3)
metadata(df) <- list(father="Levi is my father")
df
In [12]:
List(my.list)
In [13]:
str(List(my.list))
In [14]:
suppressPackageStartupMessages(library(IRanges))
IntegerList(var1=1:26, var2=1:100)
In [15]:
CharacterList(var1=letters[1:100], var2=LETTERS[1:26])
In [16]:
LogicalList(var1=1:100 %in% 5, var2=1:100 %% 2)
In [17]:
suppressPackageStartupMessages(library(Biostrings))
bstring = BString("I am a BString object")
bstring
In [18]:
dnastring = DNAString("TTGAAA-CTC-N")
dnastring
In [19]:
str(dnastring)
In [20]:
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)
dplyr
dplyr
convention aims to ease cognitive burdendplyr
example
In [32]:
suppressPackageStartupMessages({
library(nycflights13)
library(dplyr)
})
delays <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(year, month, day, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
In [22]:
hist(delays$delay, main="Mean hourly delay", xlab="Delay (hours)")
Examples:
- number of new diabetes cases in NYC in a given year
- The weight of a randomly selected individual in NYC
Types:
- Categorical random variable (e.g. disease / healthy)
- Discrete random variable (e.g. sequence read counts)
- Continuous random variable (e.g. normalized qPCR intensity)
Normally distributed random variable with mean $\mu = 0$ / standard deviation $\sigma = 1$, and a sample of $n=100$
In [23]:
x=rnorm(100)
res=hist(x, main="Standard Normal Distribution\n mean 0, std. dev. 1", prob=TRUE)
xdens = seq(min(res$breaks), max(res$breaks), by=0.01)
lines(xdens, dnorm(xdens))
In [24]:
x=rpois(100, lambda=2)
res=hist(x, main="Poisson Distribution", prob=FALSE, col="lightgrey",
breaks=seq(-0.5, round(max(x))+0.5, by=0.5))
xdens = seq(min(x), max(x), by=1)
lines(xdens, length(x) * dpois(xdens, lambda=2), lw=2)
In [25]:
x=rnbinom(100, size=30, mu=2)
res=hist(x, main="Negative Binomial Distribution", prob=FALSE, col="lightgrey",
breaks=seq(-0.5, round(max(x))+0.5, by=0.5))
xdens = seq(min(x), max(x), by=1)
lines(xdens, length(x) * dnbinom(xdens, size=30, mu=2), lw=2)
In [26]:
x=rbinom(100, size=20, prob=0.25)
res=hist(x, main="Binomial Distribution", prob=FALSE, col="lightgrey",
breaks=seq(-0.5, round(max(x))+0.5, by=0.5))
xdens = seq(min(x), max(x), by=1)
lines(xdens, length(x) * dbinom(xdens, size=20, prob=0.25), lw=2)
Population distributions
Sampling distributions
Hypothesis testing is about sampling distributions
One Sample - observations come from one of two population distributions:
Two Sample - two samples are drawn, either:
$\frac{{}\bar{x} - \mu}{s}$ and $\frac{\bar{x_1} - \bar{x_2}}{s}$ are t-distributed if:
Wilcoxon tests are an alternative for non-normal populations
Source: https://en.wikipedia.org/wiki/Lady_tasting_tea
p-value is the probability of the observed number of successes, or more, under $H_0$
Success count | Permutations of selection | Number of permutations |
---|---|---|
0 | oooo | 1 × 1 = 1 |
1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
4 | xxxx | 1 × 1 = 1 |
Total | 70 |
What do you notice about all these combinations?
fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)
In [27]:
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up
summary(dat)
In [28]:
table(disease, genotype)
In [29]:
chisq.test(disease, genotype)
In [30]:
chisq.test(disease, genotype, simulate.p.value = TRUE)
In [31]:
library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
(the default is whichever level is first alphabetically!)
simulate.p.value=TRUE
to get a more accurate p-value from the chi-square testTrue state of nature | Result of test | |
---|---|---|
Reject $H_0$ | Fail to reject $H_0$ | |
$H_0$ TRUE | Type I error, probability = $\alpha$ | No error, probability = $1-\alpha$ |
$H_0$ FALSE | No error, probability is called power = $1-\beta$ | Type II error, probability = $\beta$ (false negative) |