Chemometrics


Julien Wist / 2017 / Universidad del Valle
Andrés Bernal / 2017 / ???

An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/

To best understand the following notebook, please read chapter 2 of Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242.

Use the following commands to create the dataset of Table 2.6 of Brereton.


In [17]:
rm(list=ls(all=TRUE)) # to clean the variable space

In [169]:
options(repr.plot.width=4, repr.plot.height=4) # change these setting to plot larger figures

In [184]:
# we create the matrix manually
d <- c(6,60,4,34.841,6,60,2,16.567,6,20,4,45.396,6,20,2,27.939,4,60,4,19.825,4,60,2,1.444,4,20,4,37.673,4,20,2,23.131,6,40,3,23.088,4,40,3,12.325,5,60,3,16.461,5,20,3,33.489,5,40,4,26.189,5,40,2,8.337,5,40,3,19.192,5,40,3,16.579,5,40,3,17.794,5,40,3,16.65,5,40,3,16.799,5,40,3,16.635)
dim(d) <- c(4,20) # we force the data into a matrix of dimension 4 x 20
# we create the data frame
datas <- data.frame("ph"=d[1,],"temp"=d[2,],"conc"=d[3,], "ph2"=d[1,]^2, "temp2"=d[2,]^2, "conc2"=d[3,]^2, "phTemp"=d[1,]*d[2,], "phConc"=d[1,]*d[3,], "tempConc"=d[2,]*d[3,], "yield"=d[4,])
datas


Out[184]:
   ph temp conc ph2 temp2 conc2 phTemp phConc tempConc  yield
1   6   60    4  36  3600    16    360     24      240 34.841
2   6   60    2  36  3600     4    360     12      120 16.567
3   6   20    4  36   400    16    120     24       80 45.396
4   6   20    2  36   400     4    120     12       40 27.939
5   4   60    4  16  3600    16    240     16      240 19.825
6   4   60    2  16  3600     4    240      8      120  1.444
7   4   20    4  16   400    16     80     16       80 37.673
8   4   20    2  16   400     4     80      8       40 23.131
9   6   40    3  36  1600     9    240     18      120 23.088
10  4   40    3  16  1600     9    160     12      120 12.325
11  5   60    3  25  3600     9    300     15      180 16.461
12  5   20    3  25   400     9    100     15       60 33.489
13  5   40    4  25  1600    16    200     20      160 26.189
14  5   40    2  25  1600     4    200     10       80  8.337
15  5   40    3  25  1600     9    200     15      120 19.192
16  5   40    3  25  1600     9    200     15      120 16.579
17  5   40    3  25  1600     9    200     15      120 17.794
18  5   40    3  25  1600     9    200     15      120 16.650
19  5   40    3  25  1600     9    200     15      120 16.799
20  5   40    3  25  1600     9    200     15      120 16.635
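
As a side note, the squared and interaction terms can also be computed from the three factors rather than typed out column by column. The sketch below is only an illustration (base and datas2 are names introduced here and not used in the rest of the notebook).


In [ ]:
# a minimal sketch: derive the quadratic and interaction terms from the factors
base <- data.frame("ph"=d[1,], "temp"=d[2,], "conc"=d[3,], "yield"=d[4,])
datas2 <- transform(base,
                    ph2 = ph^2, temp2 = temp^2, conc2 = conc^2,
                    phTemp = ph * temp, phConc = ph * conc, tempConc = temp * conc)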

We can perform a manual regression using the following sequence.


In [119]:
# we define the matrix of data (X)
M <- cbind("intercept"=rep(1,20),datas[,1:9]) # adds intercept
M <- as.matrix(M)

# we define y, the response
y <- datas[,10]

# we calculate the regression coefficient manually
coeffs = data.frame("coeffs"=solve(t(M) %*% M) %*% t(M) %*% y)
round(coeffs, 3)


Out[119]:
           coeffs
intercept  58.808
ph         -6.091
temp       -2.603
conc        4.805
ph2         0.598
temp2       0.020
conc2       0.154
phTemp      0.110
phConc      0.351
tempConc    0.029

The first coefficient is the intercept, as discussed in the previous example.

We can check the data and the prediction.


In [120]:
# we can check that predictions are just projection of our data on the coefficients
predictions <- M %*% as.matrix(coeffs)

plot(predictions,y)
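
As a cross-check (not part of Brereton's treatment), R's built-in lm() solves the same least-squares problem and should reproduce the manual coefficients; summary(fit) would also report the standard errors and t-values that are computed by hand below.


In [ ]:
# a minimal sketch: the built-in linear model gives the same coefficients
fit <- lm(yield ~ ph + temp + conc + ph2 + temp2 + conc2 +
                  phTemp + phConc + tempConc, data = datas)
round(coef(fit), 3)  # should match the manual 'coeffs' above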


Looking at the coefficients, pH appears to be the most important factor, while concentration and temperature also seem significant. The cross terms, on the contrary, are very small and do not seem to influence the yield.

The problem with this analysis is that the coefficients are obtained by solving the linear model $y = Dc$, where $D$ is the experimental design matrix and $c$ is the vector of coefficients. Therefore, if the values in the $D$ matrix are large, as for the temperature, the corresponding coefficient will be small to keep the product constant. The coefficients cannot be compared directly if the parameters are not scaled.
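
A toy illustration (with made-up numbers) makes this explicit: rescaling a predictor rescales its coefficient by the inverse factor, while the fit itself is unchanged.


In [ ]:
# toy example: coding a predictor changes the size of its coefficient, not the model
x <- c(20, 40, 60)            # e.g. a temperature in its original units
yToy <- c(10, 11, 14)         # invented response
coef(lm(yToy ~ x))            # slope per original unit
xCoded <- (x - 40) / 20       # coded to -1, 0, +1
coef(lm(yToy ~ xCoded))       # slope 20 times larger, identical fitted values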

Before scaling the matrix $D$ (called M in the code), we can use statistics to find out which parameters are relevant and how good our model is.


In [121]:
# we start the evaluation of our model
Rep <- duplicated(datas[,1:3])  # logical index of the replicated experiments
R <- sum( Rep )                 # number of replicates
N <- dim(datas)[1]              # number of experiments
P <- length(coeffs[[1]])        # number of parameters in the model
D <- N - P - R                  # degrees of freedom for the lack of fit
D
N
P
R


Out[121]:
5
Out[121]:
20
Out[121]:
10
Out[121]:
5

There are 5 degrees of freedom left for the lack of fit, so the experimental design should allow us to assess whether the model is adequate or not.


In [153]:
S_replicate <- sum( (mean(datas$yield[Rep]) - datas$yield[Rep])^2 ) # R degrees of freedom
S_residual <- sum ( (predictions - y)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
S_total <- sum( datas$yield^2 ) # N degrees of freedom
S_predicted <- sum( predictions^2 ) 
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
Rsquare <- S_predicted / S_total

meanS_replicate <- S_replicate / R
meanS_residual <- S_residual / (N-P)
meanS_total <- S_total / N
meanS_predicted <- S_predicted / P
meanS_lackOfFit <- S_lackOfFit / D

result <- data.frame("mean replicate"=meanS_replicate,"mean residuals"=meanS_residual,"mean total"=meanS_total, "mean predicted"=meanS_predicted, "mean lack of fit"=meanS_lackOfFit)
round(result,2)


Out[153]:
  mean.replicate mean.residuals mean.total mean.predicted mean.lack.of.fit
1           0.21           0.80     565.40        1130.00             1.39
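
As a sanity check on the bookkeeping (a sketch, not part of the book's tables), the residuals are orthogonal to the fitted values, so the uncentred total sum of squares splits into the predicted and residual parts; Rsquare, computed above, is then close to 1 for these data.


In [ ]:
# verify the decomposition of the sums of squares
print(all.equal(S_total, S_predicted + S_residual))
round(Rsquare, 4)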

In [154]:
# we calculate the t-value of each coefficient and its associated two-sided p-value
# (in the table below the first column holds the t-values, 't' the p-values and '1-t' = 1 - p)
tvalue <- as.matrix(coeffs) / sqrt( diag( solve(t(M) %*% M) ) * meanS_residual )
result <- data.frame("T"=tvalue)
result["t"] <- 2*pt(abs(tvalue), df = N-P, lower.tail = FALSE)
result["1-t"] <- 1-(2 * pt(abs(tvalue), df = N-P, lower.tail = FALSE))
round(result, 3)


Out[154]:
            coeffs      t    1-t
intercept    4.544  0.001  0.999
ph          -1.104  0.295  0.705
temp       -18.270  0.000  1.000
conc         1.311  0.219  0.781
ph2          1.109  0.293  0.707
temp2       14.597  0.000  1.000
conc2        0.287  0.780  0.220
phTemp       6.966  0.000  1.000
phConc       1.111  0.293  0.707
tempConc     1.842  0.095  0.905

In [164]:
result <- data.frame( "F"=meanS_lackOfFit/meanS_replicate )
result["qf"] <- qf(0.95,D,R) # F-factor for 95% confidence
result["pf"] <- pf(result$F,D,R,lower.tail = TRUE)
round(result,2)


Out[164]:
     F   qf   pf
1 6.64 5.05 0.97

Note that pf() with lower.tail = TRUE returns the cumulative probability (0.97), so the p-value of the lack-of-fit test is about 1 - 0.97 = 0.03. Since the F-ratio (6.64) exceeds the 95% critical value (5.05), we cannot rule out that the model is inaccurate at the 95% confidence level.
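
The upper-tail probability can also be requested directly, which avoids the 1 - pf(...) step; a quick sketch:


In [ ]:
# p-value of the lack-of-fit F-test as an upper-tail probability
pf(meanS_lackOfFit / meanS_replicate, D, R, lower.tail = FALSE)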

Relation between the importance of variables and their magnitude

We can scale the data. Sometimes this is referred to as coding (or encoding) the data.


In [191]:
# code each factor to the range [-1, 1]: subtract the centre of its range and divide by half the range
maxs <- apply(d[1:3,],1,max); mins <- apply(d[1:3,],1,min)
d[1:3,] <- t( scale(t(d)[,1:3],center = (mins + maxs)/2, scale = (maxs - mins)/2) )
datas <- data.frame("ph"=d[1,],"temp"=d[2,],"conc"=d[3,], "ph2"=d[1,]^2, "temp2"=d[2,]^2, "conc2"=d[3,]^2, "phTemp"=d[1,]*d[2,], "phConc"=d[1,]*d[3,], "tempConc"=d[2,]*d[3,], "yield"=d[4,])
datas


Out[191]:
   ph temp conc ph2 temp2 conc2 phTemp phConc tempConc  yield
1   1    1    1   1     1     1      1      1        1 34.841
2   1    1   -1   1     1     1      1     -1       -1 16.567
3   1   -1    1   1     1     1     -1      1       -1 45.396
4   1   -1   -1   1     1     1     -1     -1        1 27.939
5  -1    1    1   1     1     1     -1     -1        1 19.825
6  -1    1   -1   1     1     1     -1      1       -1  1.444
7  -1   -1    1   1     1     1      1     -1       -1 37.673
8  -1   -1   -1   1     1     1      1      1        1 23.131
9   1    0    0   1     0     0      0      0        0 23.088
10 -1    0    0   1     0     0      0      0        0 12.325
11  0    1    0   0     1     0      0      0        0 16.461
12  0   -1    0   0     1     0      0      0        0 33.489
13  0    0    1   0     0     1      0      0        0 26.189
14  0    0   -1   0     0     1      0      0        0  8.337
15  0    0    0   0     0     0      0      0        0 19.192
16  0    0    0   0     0     0      0      0        0 16.579
17  0    0    0   0     0     0      0      0        0 17.794
18  0    0    0   0     0     0      0      0        0 16.650
19  0    0    0   0     0     0      0      0        0 16.799
20  0    0    0   0     0     0      0      0        0 16.635

If we perform the regression again, the magnitude of the coefficients no longer depends on the absolute values of the variables. This makes it much easier to identify the most important parameters.


In [27]:
# we define the matrix of data (X)
M <- as.matrix(datas[,1:9])
M <- cbind(rep(1,20),M) # adds intercept
# we define y, the response
y <- datas[,10]
# we calculate the regression coefficient manually
coeffs <- solve(t(M) %*% M) %*% t(M) %*% y
# we can check that predictions are just projection of our data on the coefficients
predictions <- M %*% coeffs

# we start the evaluation of our model
Rep <- duplicated(datas[,1:3])
R <- sum( Rep ) 
N <- dim(datas)[1]
P <- length(coeffs)
D <- N - P - R

S_replicate <- sum( (mean(datas$yield[Rep]) - datas$yield[Rep])^2 ) # R degrees of freedom
S_residual <- sum ( (predictions - y)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
S_total <- sum( datas$yield^2 ) # N degrees of freedom
S_predicted <- sum( predictions^2 ) 
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
Rsquare <- S_predicted / S_total

meanS_replicate <- S_replicate / R
meanS_residual <- S_residual / (N-P)
meanS_total <- S_total / N
meanS_predicted <- S_predicted / P
meanS_lackOfFit <- S_lackOfFit / D

F <- meanS_lackOfFit / meanS_replicate

result = data.frame("F"=F,"mean replicate"=meanS_replicate,"mean residuals"=meanS_residual,"mean total"=meanS_total, "mean predicted"=meanS_predicted, "mean lack of fit"=meanS_lackOfFit)
round(result,2)

# we calculate t-values and their probability associated 
ttest <- coeffs / sqrt( diag( solve(t(M) %*% M) ) * meanS_residual )
tprob <- 1-(2 * pt(abs(ttest), df = N-P, lower.tail = FALSE))
table2.11 <- data.frame("coeffs"=coeffs,"t"=ttest,"t-prob"=tprob*100)
table2.11


Out[27]:
     F mean.replicate mean.residuals mean.total mean.predicted mean.lack.of.fit
1 6.64           0.21           0.80     565.40        1130.00             1.39
Out[27]:
               coeffs           t       t.prob
intercept  17.20834    56.01061    100.00000
ph          5.34330    18.90673    100.00000
temp       -7.84900   -27.77289    100.00000
conc        8.65060    30.60927    100.00000
ph2         0.5979091   1.1094514   70.6797496
temp2       7.866409   14.596530    99.999995
conc2       0.1544091   0.2865141   21.9667327
phTemp      2.201000    6.965813    99.996128
phConc      0.351000    1.110859    70.737550
tempConc    0.582000    1.841937    90.470412

Scaling the data allows us to compare the coefficients directly, since the influence of a variable is the product of its magnitude and its coefficient. Now all variables span the same range (from -1 to 1), so their coefficients directly reflect their importance. Comparing the results obtained with and without scaling, we immediately notice important changes. Without scaling we might have wrongly concluded that only the first three parameters were relevant, whereas proper scaling shows that $T^2$ and $pH \times T$ might be relevant as well.
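
With coded variables, ranking the coefficients by absolute value is enough to order the factors by importance. A short sketch (not in the book):


In [ ]:
# rank the coded coefficients (intercept excluded) by absolute value
importance <- sort(abs(coeffs[-1, 1]), decreasing = TRUE)
round(importance, 3)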

Leverage

Finally, we can test different experimental designs (see Table 2.15) and see how the confidence of the prediction depends on the design. For example, the accuracy of the prediction may change if the sampling points are distributed differently.

Not only the number of points and the number of replicates influence the model accuracy; the way the sample points are chosen also affects the result.

Here we calculate the leverage for the designs of Table 2.15.


In [168]:
# leverage: diagonal of the hat matrix H = A (A'A)^-1 A', one value per experiment
A <- cbind(rep(1,11), c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5) ) # design 1
round(diag( A %*% solve(t(A) %*% A) %*% t(A) ),2)

A <- cbind(rep(1,11), c(1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5) ) # design 2
round(diag( A %*% solve(t(A) %*% A) %*% t(A) ),2)

A <- cbind(rep(1,11), c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5) ) # design 3
round(diag( A %*% solve(t(A) %*% A) %*% t(A) ),2)


Out[168]:
 [1] 0.23 0.23 0.23 0.13 0.13 0.09 0.13 0.13 0.23 0.23 0.23
Out[168]:
 [1] 0.29 0.29 0.14 0.14 0.09 0.09 0.09 0.14 0.14 0.29 0.29
Out[168]:
 [1] 0.18 0.18 0.18 0.18 0.10 0.10 0.10 0.12 0.12 0.26 0.50
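
These leverages are what R calls hat values. As a cross-check (a sketch, not in the book), the built-in hatvalues() applied to a one-factor linear model on the third design returns the same numbers; the response values are irrelevant because the leverage depends only on the design.


In [ ]:
# cross-check of the third design with the built-in hatvalues()
x <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5)
yDummy <- rnorm(length(x))           # arbitrary response, leverages ignore it
round(hatvalues(lm(yDummy ~ x)), 2)  # should match the last vector above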

For more on experimental design, please consult Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242