Welcome and outline

  • Some essential R classes and related Bioconductor classes
  • Introduction to dplyr
  • Random variables and distributions
  • Hypothesis testing for one or two samples (t-test, Wilcoxon test, etc)
  • Hypothesis testing for categorical variables (Fisher's Test, Chi-square test)

  • Book chapters 0 and 1

Learning objectives

  • Perform basic data manipulation/exploration in R and dplyr
  • Identify key R and Bioconductor data classes
  • Define random variables and distinguish them from non-random ones
  • Distinguish population from sampling distributions
  • Interpret t-tests and confidence intervals
  • Identify when to use Fisher's Exact Test and Chi-square Test

A bit about me - research interests

  • High-dimensional statistics (more variables than observations)
  • Predictive modeling and methodology for validation
  • Metagenomic profiling of the human microbiome
  • Cancer genomics
  • HIV treatment effectiveness
  • http://www.waldronlab.org

R - basic usage

Tips for learning R

Pseudo code                                       Example code
library(packagename)                              library(dplyr)
?functionname                                     ?select
?package::functionname                            ?dplyr::select
?'Reserved keyword or symbol' (or backticks)      ?'%>%'
??searchforpossiblyexistingfunctionandortopic     ??simulate
help(package = "loadedpackage")                   help(package = "dplyr")
browseVignettes("packagename")                    browseVignettes("dplyr")

Slide credit: Marcel Ramos

Installing Packages the Bioconductor Way

Pseudo code:


In [1]:
## source("https://bioconductor.org/biocLite.R")
## packages <- c("packagename", "githubuser/repository", "biopackage")
## BiocInstaller::biocLite(packages)
  • Works for CRAN, GitHub, and Bioconductor packages!
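
A concrete instance of that pattern, as a sketch (assumes internet access; "dplyr" is on CRAN, "hadley/devtools" is a GitHub user/repository spec, and "Biostrings" is on Bioconductor):

In [ ]:
## source("https://bioconductor.org/biocLite.R")
## BiocInstaller::biocLite(c("dplyr", "hadley/devtools", "Biostrings"))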

Note about installing devtools

  • Useful for building packages
  • Download and install from GitHub, directly or via BiocInstaller::biocLite()
  • Installation dependent on OS (Rtools for Windows)

Introduction to the R language

Base R Data Types: atomic vectors

numeric (set the seed to make the random numbers reproducible):


In [2]:
set.seed(1)
rnorm(5)


  1. -0.626453810742332
  2. 0.183643324222082
  3. -0.835628612410047
  4. 1.59528080213779
  5. 0.329507771815361

integer:


In [3]:
sample( 1:5 )


  1. 2
  2. 1
  3. 3
  4. 4
  5. 5

logical:


In [4]:
1:3 %in% 3


  1. FALSE
  2. FALSE
  3. TRUE

character:


In [5]:
c("yes", "no")


  1. 'yes'
  2. 'no'

In [6]:
factor(c("yes", "no"))


  1. yes
  2. no

Demo: integer-like properties, relevel()
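
A minimal sketch of that demo:

In [ ]:
## factors are integer codes underneath, with character levels
f <- factor(c("yes", "no"))
as.integer(f)            # 2 1: codes follow level order ("no" < "yes")
levels(f)                # "no" "yes" (alphabetical by default)
relevel(f, ref = "yes")  # make "yes" the first (reference) level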

Base R Data Types: missingness

  • Missing Values and others - IMPORTANT

In [7]:
c(NA, NaN, -Inf, Inf)


  1. <NA>
  2. NaN
  3. -Inf
  4. Inf

Use class() to find the class of a variable.
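
For example (a small sketch using base R only):

In [ ]:
class(c(NA, NaN, -Inf, Inf))     # "numeric"
is.na(c(NA, NaN, 1))             # TRUE TRUE FALSE: NaN also counts as missing
mean(c(1, 2, NA))                # NA, because missing values propagate
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5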

Base R Data Types: matrix, list, data.frame

matrix:


In [8]:
matrix(1:9, nrow = 3)


     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

The list is a non-atomic vector:


In [9]:
measurements <- c( 1.3, 1.6, 3.2, 9.8, 10.2 )
parents <- c( "Parent1.name", "Parent2.name" )
my.list <- list( measurements, parents)
my.list


[[1]]
[1]  1.3  1.6  3.2  9.8 10.2

[[2]]
[1] "Parent1.name" "Parent2.name"

The data.frame has list-like and matrix-like properties:


In [10]:
x <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
my.df <- data.frame(x,y,z,a, stringsAsFactors = FALSE)
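
A short sketch of both kinds of access on my.df:

In [ ]:
## list-like access: one column at a time
my.df$x               # 11 12 13 14 15 16
my.df[["z"]]          # still character, thanks to stringsAsFactors = FALSE
## matrix-like access: [row, column]
my.df[1, ]            # first row, all columns
my.df[, c("x", "y")]  # all rows, two columns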

Bioconductor S4 vectors: DataFrame

  • Bioconductor (www.bioconductor.org) defines its own set of vectors using the S4 formal class system
  • DataFrame: like a data.frame, but more flexible; columns can be any vector-like object, e.g.:
    • GenomicRanges objects
    • Rle (run-length encoding)

In [11]:
suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
          var2 = 1:3)
metadata(df) <- list(father="Levi is my father")
df


DataFrame with 3 rows and 2 columns
   var1      var2
  <Rle> <integer>
1     a         1
2     a         2
3     b         3

Bioconductor S4 vectors: List and derived classes


In [12]:
List(my.list)


List of length 2

In [13]:
str(List(my.list))


Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
  ..@ listData       :List of 2
  .. ..$ : num [1:5] 1.3 1.6 3.2 9.8 10.2
  .. ..$ : chr [1:2] "Parent1.name" "Parent2.name"
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

In [14]:
suppressPackageStartupMessages(library(IRanges))
IntegerList(var1=1:26, var2=1:100)


IntegerList of length 2
[["var1"]] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[["var2"]] 1 2 3 4 5 6 7 8 9 10 11 12 ... 89 90 91 92 93 94 95 96 97 98 99 100

In [15]:
CharacterList(var1=letters[1:100], var2=LETTERS[1:26])


CharacterList of length 2
[["var1"]] a b c d e f g h i j ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
[["var2"]] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

In [16]:
LogicalList(var1=1:100 %in% 5, var2=1:100 %% 2)


LogicalList of length 2
[["var1"]] FALSE FALSE FALSE FALSE TRUE FALSE ... FALSE FALSE FALSE FALSE FALSE
[["var2"]] TRUE FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE

Bioconductor S4 vectors: Biostrings


In [17]:
suppressPackageStartupMessages(library(Biostrings))
bstring = BString("I am a BString object")
bstring


  21-letter "BString" instance
seq: I am a BString object

In [18]:
dnastring = DNAString("TTGAAA-CTC-N")
dnastring


  12-letter "DNAString" instance
seq: TTGAAA-CTC-N

In [19]:
str(dnastring)


Formal class 'DNAString' [package "Biostrings"] with 5 slots
  ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
  .. .. ..@ xp                    :<externalptr> 
  .. .. ..@ .link_to_cached_object:<environment: 0x7fbaee4bf518> 
  ..@ offset         : int 0
  ..@ length         : int 12
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

In [20]:
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)


        A         C         G         T     other 
0.2500000 0.1666667 0.0833333 0.2500000 0.2500000 

dplyr

Data Manipulation using dplyr

  • dplyr's conventions aim to ease cognitive burden
  • Function names are easy to remember (sketched below):
  1. select (choose columns)
  2. mutate/transmute (add columns / create new columns)
  3. filter (choose rows matching a condition)
  4. slice (choose rows by position)
  5. summarise (reduce each group to a single observation)
  6. arrange (re-order observations)
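
A minimal sketch of the verbs not used in the example on the next slide, applied to the built-in mtcars data:

In [ ]:
library(dplyr)
mtcars %>%
    select(mpg, cyl, wt) %>%       # choose three columns
    mutate(kpl = mpg * 0.425) %>%  # add a column (approx. km per liter)
    arrange(desc(mpg)) %>%         # re-order rows by descending mpg
    slice(1:5)                     # keep the first five rows by position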

dplyr example


In [32]:
suppressPackageStartupMessages({
    library(nycflights13)
    library(dplyr)
})
delays <- flights %>% 
  filter(!is.na(dep_delay)) %>%
  group_by(year, month, day, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)

dplyr example (cont'd)


In [22]:
hist(delays$delay, main="Mean hourly delay", xlab="Delay (minutes)")


Random Variables and Distributions

Random Variables

  • A random variable is any characteristic that can be measured or categorized and whose value is determined, at least in part, by chance.
  • Examples:

    • number of new diabetes cases in NYC in a given year
    • The weight of a randomly selected individual in NYC
  • Types:

    • Categorical random variable (e.g. disease / healthy)
    • Discrete random variable (e.g. sequence read counts)
    • Continuous random variable (e.g. normalized qPCR intensity)

Probability Distributions

  • We use probability distributions to describe the probability of all possible realizations of a random variable
  • In public health we use probability distributions to describe hypotheses or inference about the population
    • because we normally cannot observe the entire population
      • In practice we study a sample selected from the population

Random Variables - examples

Normally distributed random variable with mean $\mu = 0$ and standard deviation $\sigma = 1$, and a sample of $n=100$.


In [23]:
x=rnorm(100)
res=hist(x, main="Standard Normal Distribution\n mean 0, std. dev. 1", prob=TRUE)
xdens = seq(min(res$breaks), max(res$breaks), by=0.01)
lines(xdens, dnorm(xdens))


Random Variables - examples

Poisson distributed random variable ($\lambda = 2$), and a sample of $n=100$.


In [24]:
x=rpois(100, lambda=2)
res=hist(x, main="Poisson Distribution", prob=FALSE, col="lightgrey",
     breaks=seq(-0.5, round(max(x))+0.5, by=0.5))
xdens = seq(min(x), max(x), by=1)
lines(xdens, length(x) * dpois(xdens, lambda=2), lwd=2)


Random Variables - examples

Negative Binomially distributed random variable ($size=30, \mu=2$), and a sample of $n=100$.


In [25]:
x=rnbinom(100, size=30, mu=2)
res=hist(x, main="Negative Binomial Distribution", prob=FALSE, col="lightgrey",
     breaks=seq(-0.5, round(max(x))+0.5, by=0.5))
xdens = seq(min(x), max(x), by=1)
lines(xdens, length(x) * dnbinom(xdens, size=30, mu=2), lwd=2)


Random Variables - examples

  • Binomially distributed random variable ($size=20, prob=0.25$), and a sample of $n=100$.
    • use for counts of binary outcomes (number of successes out of $size$ trials)

In [26]:
x=rbinom(100, size=20, prob=0.25)
res=hist(x, main="Binomial Distribution", prob=FALSE, col="lightgrey",
     breaks=seq(-0.5, round(max(x))+0.5, by=0.5))
xdens = seq(min(x), max(x), by=1)
lines(xdens, length(x) * dbinom(xdens, size=20, prob=0.25), lwd=2)


Hypothesis Testing

Population vs sampling distributions

  • Population distributions

    • Each realization / point is an individual
  • Sampling distributions

    • Each realization / point is a sample
    • distribution depends on sample size
    • large sample distributions are given by the Central Limit Theorem
  • Hypothesis testing is about sampling distributions

    • Did my sample likely come from that distribution?
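
A small simulation sketch of the distinction, assuming standard normal individuals:

In [ ]:
set.seed(1)
population <- rnorm(10000)  # population distribution: each point is an individual
sample.means <- replicate(10000, mean(rnorm(30)))  # sampling distribution of the mean, n=30
sd(population)    # close to 1, the population standard deviation
sd(sample.means)  # close to 1/sqrt(30), as the Central Limit Theorem predicts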

Logic of hypothesis testing

One Sample - observations come from one of two population distributions:

  1. usual distribution that has been true in the past
  2. a potentially new distribution induced by an intervention or changing condition

Two Sample - two samples are drawn, either:

  1. from a single population
  2. from two different populations

The t-tests

  • In a one-sample test, only a single sample is drawn:
    • $H_0: \mu = \mu_0$
  • In a two-sample test, two samples are drawn independently:
    • $H_0: \mu_1 = \mu_2$
  • A paired test is one sample of paired measurements, e.g.:
    • individuals before and after treatment
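
A sketch of the three calls in R, on simulated data:

In [ ]:
set.seed(1)
x1 <- rnorm(20, mean = 5)
x2 <- rnorm(20, mean = 6)
t.test(x1, mu = 5)  # one-sample: H0 mu = 5
t.test(x1, x2)      # two-sample (Welch's t-test by default)
before <- rnorm(10)
after <- before + rnorm(10, mean = 0.5)
t.test(after, before, paired = TRUE)  # paired: one sample of differences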

When to use t-tests

  • $\frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ and $\frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}$ are t-distributed if:

    • the standard deviation $s$ is estimated from the sample, and
    • the population values are normally distributed, or
    • the population values are slightly skewed but $n > 15$, or
    • the population values are quite skewed but $n > 30$
  • Wilcoxon tests are an alternative for non-normal populations

    • e.g. rank data
    • data where ranks are more informative than means
    • not when many ranks are arbitrary
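
The corresponding R calls, sketched with skewed simulated data:

In [ ]:
set.seed(1)
x <- rexp(20)
y <- rexp(20, rate = 0.5)
wilcox.test(x, mu = 1)  # one-sample (signed-rank) test
wilcox.test(x, y)       # two-sample (rank-sum / Mann-Whitney) test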

Hypothesis testing for categorical variables

Lady Tasting Tea

  • The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
  • Fisher proposed to give her eight cups, four of each variety, in random order
    • the Lady is fully informed of the experimental method
    • $H_0$: the Lady has no ability to tell which was added first

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

Fisher's Exact Test

The p-value is the probability of the observed number of successes, or more, under $H_0$

Tea-Tasting Distribution

Success count   Permutations of selection              Number of permutations
0               oooo                                   1 × 1 = 1
1               ooox, ooxo, oxoo, xooo                 4 × 4 = 16
2               ooxx, oxox, oxxo, xoxo, xxoo, xoox     6 × 6 = 36
3               oxxx, xoxx, xxox, xxxo                 4 × 4 = 16
4               xxxx                                   1 × 1 = 1
Total                                                  70

What do you notice about all these combinations?
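
As a sketch of how these counts arise (they follow the hypergeometric distribution), and of fisher.test() using the TeaTasting example from ?fisher.test, supposing the Lady got 3 of 4 correct:

In [ ]:
choose(4, 0:4)^2       # 1 16 36 16 1: choose(4, k)^2 ways to place k correct guesses
sum(choose(4, 0:4)^2)  # 70, i.e. choose(8, 4)
## one-sided p-value for 3 or more correct: (16 + 1) / 70 = 0.243
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))
fisher.test(TeaTasting, alternative = "greater")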

Notes on Fisher's Exact Test

  • Can also be applied to r × c tables
  • Remember that the margins of the table are fixed by design
  • Also referred to as the Hypergeometric Test
  • Exact p-values are difficult (and unnecessary) for large samples
    • fisher.test(x, y = NULL, ..., simulate.p.value = FALSE)

Notes on Fisher's Exact Test (cont'd)

  • Has been applied (with peril!) to gene set analysis, e.g.:
    • 10 of my top 100 genes are annotated with the cytokinesis GO term
    • 465 of 21,000 human genes are annotated with the cytokinesis GO term
    • Are my top 100 genes enriched for cytokinesis process?
  • Problems with this analysis:
    • Main problem: top-n genes tend to be correlated, so their selections are not independent trials
    • Secondary: does not match design for $H_0$
  • Alternative: permutation test repeating all steps
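
For illustration, a sketch of the (perilous) Fisher's test implied by those numbers:

In [ ]:
## rows: in top 100 or not; columns: annotated with cytokinesis or not
genetab <- matrix(c(10, 90,        # 10 of 100 top genes annotated
                    455, 20445),   # 465 - 10 = 455 of the remaining 20,900
                  nrow = 2, byrow = TRUE,
                  dimnames = list(top100 = c("yes", "no"),
                                  cytokinesis = c("yes", "no")))
fisher.test(genetab)  # "enrichment" p-value, subject to the caveats above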

Chi-squared test

  • Test of independence for an r × c table (two categorical variables)
  • Does not assume the margins are fixed by design
    • i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn't know how many
    • more common in practice
    • classic genomics example is GWAS
  • $H_0$: the two variables are independent
  • $H_A$: there is an association between the variables

Application to GWAS

  • Interested in association between disease and some potential causative factor
  • In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
  • In a prospective or longitudinal cohort study, neither the number of cases nor the other variable is fixed

In [27]:
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
                levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up 
summary(dat)


    disease     genotype  
 control:220   AA/Aa:204  
 cases  : 30   aa   : 46  

Application to GWAS (cont'd)


In [28]:
table(disease, genotype)


         genotype
disease   AA/Aa  aa
  control   184  36
  cases      20  10

In [29]:
chisq.test(disease, genotype)


	Pearson's Chi-squared test with Yates' continuity correction

data:  disease and genotype
X-squared = 3.9963, df = 1, p-value = 0.0456

In [30]:
chisq.test(disease, genotype, simulate.p.value = TRUE)


	Pearson's Chi-squared test with simulated p-value (based on 2000
	replicates)

data:  disease and genotype
X-squared = 5.0634, df = NA, p-value = 0.04048

Application to GWAS (cont'd)

Note that the result says nothing about how the departure from independence occurs


In [31]:
library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure


         odds ratio with 95% C.I.
genotype  estimate    lower    upper
   AA/Aa  1.000000       NA       NA
   aa     2.555556 1.104441 5.913274

(the reference level is the first level of the factor, which by default is whichever comes first alphabetically!)

Summary - two categorical variables

  • Choice between Fisher's Exact Test and Chi-square test is determined by experimental design
  • If any expected counts in the table are less than 5, can use simulate.p.value=TRUE to get a more accurate p-value from the chi-square test
  • Both assume independent observations (important!!)
  • In a GWAS, correction for multiple testing is required
  • Can also use logistic regression for two categorical variables
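
A sketch of the logistic-regression alternative, reusing dat from the GWAS example:

In [ ]:
## the genotype coefficient is the log odds ratio for disease
fit <- glm(disease ~ genotype, data = dat, family = binomial)
summary(fit)$coefficients
exp(coef(fit))  # the "genotypeaa" entry matches the epitools odds ratio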

Power and type I and II error

True state of nature   Result of test: Reject $H_0$                             Result of test: Fail to reject $H_0$
$H_0$ TRUE             Type I error (false positive), probability = $\alpha$    No error, probability = $1-\alpha$
$H_0$ FALSE            No error, probability is called power = $1-\beta$        Type II error (false negative), probability = $\beta$
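
Base R's power.t.test() relates these quantities; for example, a sketch of the per-group sample size needed to detect a difference of 0.5 standard deviations:

In [ ]:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)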

Use and mis-use of the p-value

  • The p-value is the probability of observing a sample statistic as extreme or more extreme than the one you observed, _assuming that $H_0$ is true_
  • The p-value is a random variable:
    • don't treat it as precise.
    • don't do silly things like try to interpret differences or ratios between p-values
    • don't lose track of test assumptions such as independence of observations
    • do use a moderate p-value cutoff, then use some effect size measure for ranking
  • Small p-values are particularly:
    • variable under repeated sampling, and
    • sensitive to test assumptions
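
A quick simulation sketch of that variability, with a true mean difference of 0.5:

In [ ]:
set.seed(1)
pvals <- replicate(1000, t.test(rnorm(30), rnorm(30, mean = 0.5))$p.value)
quantile(pvals, c(0.05, 0.5, 0.95))  # p-values vary widely across repeated samples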

Use and mis-use of the p-value (cont'd)

  • If we fail to reject $H_0$, is the probability that $H_0$ is true equal to ($1-\alpha$)? (Hint: NO NO NO!)
    • Failing to reject $H_0$ does not mean $H_0$ is true
    • "No evidence of difference $\neq$ evidence of no difference"
  • Statistical significance vs. practical significance
    • As sample size increases, point estimates such as the mean are more precise
    • With large sample size, small differences in means may be statistically significant but not practically significant
  • Although $\alpha = 0.05$ is a common cut-off for the p-value, there is no set border between "significant" and "insignificant," only increasingly strong evidence against $H_0$ (in favor of $H_A$) as the p-value gets smaller.

Lab Exercises

  1. dplyr exercises
  2. random variables exercises