Chemometrics


Julien Wist / 2017 / Universidad del Valle
Andrés Bernal / 2017 / ???

An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/


In [1]:
options(repr.plot.width=4, repr.plot.height=4)

In [2]:
# we load a few packages
library(ggplot2)
library(corrplot)
library(reshape2)
library(caret)
library(MASS)   # for LDA
library(klaR)   # for classification (e.g. partimat)
library(pls)    # for PLS
library(e1071)  # for SVM / naive Bayes
library(pROC)   # for ROC curves
library(vegan)
require(scales)
require(gridExtra)



IRMS

Let's load real data. Assume we want to discriminate coffee beans according to their origin. To do so we perform Isotope Ratio Mass Spectrometry (IRMS), which allows us to determine the isotopic ratio for several nuclei. In this case we will measure only the ratio for carbon in caffeine.


In [3]:
#rm(list=ls(all=TRUE))

load(url('https://github.com/jwist/chemometrics/raw/master/datasets/coffeeMulti.rda'))
d <- coffeeMulti$irms
dim(d)



We have 34 samples. If we look at the countries they come from, we find the following:


In [4]:
names(d)


  1. 'code'
  2. 'country'
  3. 'department'
  4. 'caffeine1'
  5. 'caffeine2'
  6. 'mean'

In [5]:
summary(d$country)


  Brasil Colombia     Peru
      11       15        8

We can look at the data in more detail. Let's display the first 3 rows (it is always a good idea to look at the data; sometimes it is difficult to display because it is too lengthy, in which case a plot will do the job).


In [6]:
d[1:3,]
# uncomment the following line to run only two classes
#levels(d$country) <- c("otros","Colombia","otros")


  code country  department caffeine1 caffeine2    mean
5 1172 Colombia Tolima       -28.285   -27.936 -28.111
6 1173 Colombia Tolima       -28.931   -28.800 -28.866
7 1198 Brasil   Otro         -28.056   -28.274 -28.165

We discover that we have two replicates (two columns) and an additional column with means.
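A quick sanity check (a sketch, using the column names listed above) confirms that the mean column is just the average of the two replicates, up to rounding:

# the stored means appear to be rounded, so compare with a small tolerance
all.equal(d$mean, rowMeans(d[, c("caffeine1", "caffeine2")]), tolerance = 1e-3)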

Let's make a simple boxplot graph to visualize the data.


In [17]:
d[d$country == "Colombia" | d$country == "Peru",]


   code country  department caffeine1 caffeine2    mean
5  1172 Colombia Tolima       -28.285   -27.936 -28.111
6  1173 Colombia Tolima       -28.931   -28.800 -28.866
8  1199 Colombia Huila        -28.154   -28.022 -28.088
10 1201 Colombia Huila        -28.198   -28.372 -28.285
14 1208 Colombia Huila        -28.406   -28.505 -28.456
16 1216 Colombia Huila        -27.722   -27.735 -27.729
18 1240 Colombia Narino       -28.605   -28.716 -28.661
19 1241 Colombia Narino       -28.503   -28.163 -28.333
20 1242 Colombia Narino       -28.835   -29.207 -29.021
21 1243 Colombia Narino       -28.387   -28.379 -28.383
22 1244 Colombia Narino       -28.226   -28.057 -28.142
23 1245 Colombia Narino       -28.731   -28.893 -28.812
24 1246 Colombia Narino       -29.029   -28.771 -28.900
25 1247 Colombia Narino       -28.577   -28.658 -28.618
27 1249 Colombia Narino       -29.180   -29.105 -29.143
29 1303 Peru     Otro         -29.300   -29.148 -29.224
31 1325 Peru     Otro         -30.027   -30.038 -30.033
38 1377 Peru     Otro         -29.351   -29.273 -29.312
46 1407 Peru     Otro         -30.261   -30.089 -30.175
47 1411 Peru     Otro         -29.643   -30.076 -29.860
48 1412 Peru     Otro         -30.517   -30.247 -30.382
51 1427 Peru     Otro         -30.199   -30.259 -30.229
52 1428 Peru     Otro         -28.533   -28.601 -28.567

In [9]:
options(repr.plot.width=5, repr.plot.height=3)

# simulate pairs of normal densities and watch the t statistic (shown as the
# panel title) grow as the separation between the means increases
mean=100; sd=15
lb=80; ub=120

x <- seq(-10,10,length=100)*sd + mean
hx <- dnorm(x,mean,sd)

hy <- dnorm(x,mean+10,sd)
s10 <- t.test(hx, hy)$statistic
var.test(hx, hy)

hy <- dnorm(x,mean+20,sd)
s20 <- t.test(hx, hy)$statistic

hy <- dnorm(x,mean+50,sd)
s50 <- t.test(hx, hy)$statistic

hy <- dnorm(x,mean+100,sd)
s100 <- t.test(hx, hy)$statistic

var.test(hx, hy)

par(mfrow=c(2,2)) # 2x2 layout to hold the four panels below

plot(x, hx, type="n", xlab="", ylab="",
  main=s10, axes=FALSE,xlim=c(0,250))

lines(x, hx)
lines(x, dnorm(x,mean+10,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)


plot(x, hx, type="n", xlab="", ylab="",
  main=s20, axes=FALSE,xlim=c(0,250))

lines(x, hx)
lines(x, dnorm(x,mean+20,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)

plot(x, hx, type="n", xlab="", ylab="",
  main=s50, axes=FALSE,xlim=c(0,250))

lines(x, hx)
lines(x, dnorm(x,mean+50,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)

plot(x, hx, type="n", xlab="", ylab="",
  main=s100, axes=FALSE,xlim=c(0,250))

lines(x, hx)
lines(x, dnorm(x,mean+100,sd), col="red")
axis(1, at=seq(40, 260, 20), pos=0)


Out[9]:
	F test to compare two variances

data:  hx and hy
F = 1, num df = 99, denom df = 99, p-value = 1
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6728417 1.4862338
sample estimates:
ratio of variances 
                 1 
Out[9]:
	F test to compare two variances

data:  hx and hy
F = 0.99988, num df = 99, denom df = 99, p-value = 0.9995
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6727589 1.4860509
sample estimates:
ratio of variances 
          0.999877 

In [10]:
options(repr.plot.width=4, repr.plot.height=4)
boxplot(tapply(d$mean,d$country,function(x) x))


Or, if we prefer, we can display the density distribution of the data, which gives similar information to the boxplot.


In [11]:
# here we plot the distribution for each class
x <- tapply(d$mean,d$country,function(x) x)
xlims <- range(unlist(tapply(d$mean,d$country,function(x) unlist(range(density(x)$x)))))
ylims <- range(unlist(tapply(d$mean,d$country,function(x) unlist(range(density(x)$y)))))
plot(density(x[1][[1]]),type='l', xlim=xlims, ylim=ylims, col=2, main=" ")
lines(density(x[2][[1]]), col=3)
lines(density(x[3][[1]]), col=4)
legend(min(xlims),max(ylims),levels(d$country), col=2:4, pch="*")
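Since ggplot2 is already loaded, an equivalent density plot can be drawn with ggplot (a sketch of the same figure):

# one density curve of the mean isotopic ratio per country
ggplot(d, aes(x = mean, colour = country)) + geom_density()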



In [12]:
# then we start the calculations
means <- tapply(d$mean,d$country,mean)      # means
vars <- tapply(d$mean,d$country,var)        # variances
sds <- sqrt(vars)                           # standard deviation
ns <- tapply(d$mean,d$country,length)
# we perform multiple t-tests, that is, we calculate a t-test between
# each pair of countries, and compute the corresponding p-values
# from http://www.chem.utoronto.ca/coursenotes/analsci/stats/ttest.html
t <- dist(means) / sqrt( abs(combn(sds^2/ns,2,sum)) )
dof <- abs(combn(sds^2/ns,2,sum))^2 / combn(sds^4/(ns^2*(ns - 1)),2,sum)
#dof <- sum(ns) - length(levels(d$country)) # sometimes the d.o.f are also computed in that simpler manner
p <- 2*pt(-abs(t),df=dof);
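For reference, the quantities computed above are the Welch t statistic and the Welch–Satterthwaite degrees of freedom for each pair of groups:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad
\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} $$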

In [13]:
t   # value of t
dof # degrees of freedom
p   # p-value associated with t
    # (probability of observing such a t if the null hypothesis is true)
    # should be small!


Out[13]:
           Brasil Colombia
Colombia 2.052607         
Peru     1.634253 4.976158
Out[13]:
  1. 12.6163252933171
  2. 16.9034836002539
  3. 10.1295444309156
Out[13]:
               Brasil     Colombia
Colombia 0.0614565181             
Peru     0.1206906508 0.0005354635

Let's run a simple t-test between Brazil and Colombia.


In [14]:
t.test(d$mean[d$country=="Brasil"], d$mean[d$country=="Colombia"])


Out[14]:
	Welch Two Sample t-test

data:  d$mean[d$country == "Brasil"] and d$mean[d$country == "Colombia"]
t = -2.0526, df = 12.616, p-value = 0.06146
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.28761786  0.03492695
sample estimates:
mean of x mean of y 
-29.12955 -28.50320 

In [15]:
t.test(d$mean[d$country=="Peru"], d$mean[d$country=="Colombia"])


Out[15]:
	Welch Two Sample t-test

data:  d$mean[d$country == "Peru"] and d$mean[d$country == "Colombia"]
t = -4.9762, df = 10.13, p-value = 0.0005355
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.7646741 -0.6744259
sample estimates:
mean of x mean of y 
-29.72275 -28.50320 

In [16]:
var.test(d$mean[d$country=="Peru"], d$mean[d$country=="Colombia"])


Out[16]:
	F test to compare two variances

data:  d$mean[d$country == "Peru"] and d$mean[d$country == "Colombia"]
F = 2.4574, num df = 7, denom df = 14, p-value = 0.1443
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
  0.7270675 11.2946238
sample estimates:
ratio of variances 
          2.457439 

In [17]:
p.adjust(c(0.0614565181, 0.1206906508, 0.0005354635), "bonferroni")


Out[17]:
  1. 0.1843695543
  2. 0.3620719524
  3. 0.0016063905

In [18]:
# the complex calculation above can be replaced by direct calls to t.test()
m <- t.test(d$mean[d$country=="Colombia"],d$mean[d$country=="Brasil"])
m2 <- t.test(d$mean[d$country=="Colombia"],d$mean[d$country=="Peru"])
m3 <- t.test(d$mean[d$country=="Brasil"],d$mean[d$country=="Peru"])
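R also provides a one-liner that performs all pairwise comparisons and applies a multiplicity correction in one go; a sketch using pairwise.t.test (pool.sd = FALSE so that each pair gets a Welch-type test, as above):

# Welch-type pairwise t-tests with Bonferroni-adjusted p-values
pairwise.t.test(d$mean, d$country, p.adjust.method = "bonferroni", pool.sd = FALSE)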

In [19]:
plot(function(x) dt(x, df=m$parameter), xlim=c(-2,10))
points(m$statistic, dt(m$statistic, df=m$parameter), col=2)
points(m2$statistic,dt(m2$statistic,df=m2$parameter), col=3)
points(m3$statistic,dt(m3$statistic,df=m3$parameter), col=4)
legend(6,0.3,c("Colombia-Brazil", "Colombia-Peru", "Brazil-Peru"), col=2:4, pch="*")


Conclusion

  • we cannot discriminate coffees from those 3 countries based on IRMS data

But

  • multiple t-tests are not well suited because the 5% error rate accumulates across comparisons: for 3 tests with a 5% error each, the family-wise error is approximately 1 - 0.95^3 ≈ 14% (see the sketch below)
  • Moreover, the t-test assumes 1) homogeneity of variance, 2) normal distributions within all groups and 3) independent observations.
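A minimal sketch of this compounding, assuming the three tests are independent:

alpha <- 0.05
n.tests <- 3
1 - (1 - alpha)^n.tests   # family-wise error, ~0.14 (about 14%)
alpha / n.tests           # Bonferroni per-comparison threshold, ~0.017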

In [20]:
# ANOVA fixes the first two issues. For non-normal distributions use the
# Kruskal-Wallis test. In this case the test rejects the null hypothesis,
# meaning that the populations are NOT identical.
# ANOVA is best suited for that purpose
fit <- aov(mean ~ country, data=d)
summary(fit)


Out[20]:
            Df Sum Sq Mean Sq F value   Pr(>F)    
country      2  8.064   4.032   8.879 0.000894 ***
Residuals   31 14.077   0.454                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In [21]:
# the F-value is Mean Sq country / Mean Sq Residuals
pf(8.879,2,31,lower.tail = FALSE)
# the p-value is small, indicating that the difference between countries is
# significant


Out[21]:
0.000894118472566286
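As a quick check (a sketch using the rounded mean squares from the ANOVA table above), the F value is indeed the ratio of the two mean squares:

# between-country mean square over residual mean square
4.032 / 0.454   # ~8.88, the F value reported by summary(fit)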

In [22]:
# The Kruskal-Wallis test is a non-parametric alternative that also handles
# groups of different sizes. A small p-value indicates that the null hypothesis
# can be safely rejected: the groups are too far apart for the differences to be
# explained by randomness.
kruskal.test(mean ~ country, data=d)


Out[22]:
	Kruskal-Wallis rank sum test

data:  mean by country
Kruskal-Wallis chi-squared = 11.506, df = 2, p-value = 0.003174

In [23]:
# since the ANOVA indicates a significant effect, a post hoc test can be
# performed to understand how well the pairwise populations can be separated
#TukeyHSD(fit, "country", ordered = TRUE);
plot(TukeyHSD(fit, "country"))


Conclusion

  • Peru is different from Colombia
  • Colombia cannot be distinguished from Brazil
  • Peru and Brazil cannot be distinguished
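The same pairwise conclusions can be read off numerically from the Tukey table (a sketch):

# adjusted p-values and confidence intervals for each pair of countries
TukeyHSD(fit, "country")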