Julien Wist / 2017 / Universidad del Valle
Andrés Bernal / 2017 / ???
An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/
In [1]:
options(repr.plot.width=4, repr.plot.height=4)
In [2]:
# we load a few packages
library(ggplot2)
library(corrplot)
library(reshape2)
library(caret)
library(MASS) # for LDA
library(klaR) # for pls
library(pls)
library(e1071)# for pls
library(pROC) # for pls
library(vegan)
require(scales)
require(gridExtra);
In [3]:
#rm(list=ls(all=TRUE))
load(url('https://github.com/jwist/chemometrics/raw/master/datasets/coffeeMulti.rda'))
d <- coffeeMulti$irms
dim(d)
We have 34 samples and if we look at the country they are from we find out the following:
In [4]:
names(d)
In [5]:
summary(d$country)
We can look to the data in more details. Let's display the first 3 rows (it is always a good idea to look at the data, sometimes there are difficult to diplay because too lengthy. In that case a plot will do the job)
In [6]:
d[1:3,]
# uncomment the following line to run only two classes
#levels(d$country) <- c("otros","Colombia","otros")
We discover that we have two replicates (two columns) and an additional column with means.
Let's make a simple boxplot graph to visualize the data.
In [17]:
d[d$country == "Colombia" | d$country == "Peru",]
In [9]:
options(repr.plot.width=5, repr.plot.height=3)
mean=100; sd=15
lb=80; ub=120
x <- seq(-10,10,length=100)*sd + mean
hx <- dnorm(x,mean,sd)
hy <- dnorm(x,mean+10,sd)
s10 <- t.test(hx, hy)$statistic
var.test(hx, hy)
hy <- dnorm(x,mean+20,sd)
s20 <- t.test(hx, hy)$statistic
hy <- dnorm(x,mean+50,sd)
s50 <- t.test(hx, hy)$statistic
hy <- dnorm(x,mean+100,sd)
s100 <- t.test(hx, hy)$statistic
var.test(hx, hy)
par(mfrow=c(1,3))
plot(x, hx, type="n", xlab="", ylab="",
main=s10, axes=FALSE,xlim=c(0,250))
lines(x, hx)
lines(x, dnorm(x,mean+10,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)
plot(x, hx, type="n", xlab="", ylab="",
main=s20, axes=FALSE,xlim=c(0,250))
lines(x, hx)
lines(x, dnorm(x,mean+20,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)
plot(x, hx, type="n", xlab="", ylab="",
main=s50, axes=FALSE,xlim=c(0,250))
lines(x, hx)
lines(x, dnorm(x,mean+50,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)
plot(x, hx, type="n", xlab="", ylab="",
main=s100, axes=FALSE,xlim=c(0,250))
lines(x, hx)
lines(x, dnorm(x,mean+100,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)
Out[9]:
Out[9]:
In [10]:
options(repr.plot.width=4, repr.plot.height=4)
boxplot(tapply(d$mean,d$country,function(x) x))
Or if we prefer we can diplay the density distribution of the data, which gives a similar information as boxplots.
In [11]:
# here we plot the distribution for each class
x <- tapply(d$mean,d$country,function(x) x)
xlims <- range(unlist(tapply(d$mean,d$country,function(x) unlist(range(density(x)$x)))))
ylims <- range(unlist(tapply(d$mean,d$country,function(x) unlist(range(density(x)$y)))))
plot(density(x[1][[1]]),type='l', xlim=xlims, ylim=ylims, col=2, main=" ")
lines(density(x[2][[1]]), col=3)
lines(density(x[3][[1]]), col=4)
legend(min(xlims),max(ylims),levels(d$country), col=2:4, pch="*")
In [12]:
# then we start the calculations
means <- tapply(d$mean,d$country,mean) # means
vars <- tapply(d$mean,d$country,var) # variances
sds <- sqrt(vars) # standard deviation
ns <- tapply(d$mean,d$country,length)
# we perform a multiple t-test, that is we calculate t-test between
# all the pairs of countries, and compute the p-values
# from http://www.chem.utoronto.ca/coursenotes/analsci/stats/ttest.html
t <- dist(means) / sqrt( abs(combn(sds^2/ns,2,sum)) )
dof <- abs(combn(sds^2/ns,2,sum))^2 / combn(sds^4/(ns^2*(ns - 1)),2,sum)
#dof <- sum(ns) - length(levels(d$country)) # sometimes the d.o.f are also computed in that simpler manner
p <- 2*pt(-abs(t),df=dof);
In [13]:
t # value of t
dof # degree of freedom
p # p-value associated to t
# (probability of finding t if null hypothesis is true)
# should be small!
Out[13]:
Out[13]:
Out[13]:
Let's make a simple t-test between Brazil and Colombia.
In [14]:
t.test(d$mean[d$country=="Brasil"], d$mean[d$country=="Colombia"])
Out[14]:
In [15]:
t.test(d$mean[d$country=="Peru"], d$mean[d$country=="Colombia"])
Out[15]:
In [16]:
var.test(d$mean[d$country=="Peru"], d$mean[d$country=="Colombia"])
Out[16]:
In [17]:
p.adjust(c(0.0614565181, 0.1206906508, 0.0005354635), "bonferroni")
Out[17]:
In [18]:
# this complex caluclation can be replaced by this single line
m <- t.test(d$mean[d$country=="Colombia"],d$mean[d$country=="Brasil"])
m2 <- t.test(d$mean[d$country=="Colombia"],d$mean[d$country=="Peru"])
m3 <- t.test(d$mean[d$country=="Brasil"],d$mean[d$country=="Peru"])
In [19]:
plot(function(x)(dt(x,df=m$parameter)),xlim=c(-2,10)); points(m$statistic,dt(m$statistic,df=m$parameter), col=2, xlab=("d"))
points(m2$statistic,dt(m2$statistic,df=m2$parameter), col=3)
points(m3$statistic,dt(m3$statistic,df=m3$parameter), col=4)
legend(6,0.3,c("Colombia-Brazil", "Colombia-Peru", "Brazil-Peru"), col=2:4, pch="*")
But
In [20]:
#ANOVA fixes the first two issues. For non-normal distribution use Kruskal-W
#allis test. In this case the test reject the null hypothesis meaning that t
#he populations are NOT identical.
# annova is best suited for that purpose
fit <- aov(mean ~ country, data=d)
summary(fit)
Out[20]:
In [21]:
# F-value is Mean Sq Residuals / Mean Sq country
pf(8.879,2,31,lower.tail = FALSE)
# the p-value is small indicating that the distribution's difference is
# significant
Out[21]:
In [22]:
# Kruskal-Wallis test is in principle suited for comparison of population that
# are different in size. Small p-values indicates that the null hypothesis is
# safely rejected: populations are too far apart and cannot be explained by randomness.
kruskal.test(mean ~ country, data=d)
Out[22]:
In [23]:
# since data exhibits significance... fortunately,
# post hoc test can be performed to understand how well pairwise populations
# can be separated
#TukeyHSD(fit, "country", ordered = TRUE);
plot(TukeyHSD(fit, "country"))