Chemometrics


Julien Wist / 2017 / Universidad del Valle

An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/


In [1]:
options(repr.plot.width=4, repr.plot.height=4) # change these setting to plot larger figures

A first example - simple linear regression

We make repeated measurements of an observable Y as a function of X. You might think of a calibration curve, where you measure the transmitted intensity as a function of the concentration of a solution.

create data and visualize it


In [2]:
#rm(list=ls(all=TRUE)) # we clear the variable space

N <- 20 # we define the number of observations

# we create a fake dataset using a quadratic
#equation and adding some noise
# first create a vector of x repeated rep times
rep <- 2 # number of replicates
X <- rep(seq(from=43, to=96, length.out=N),rep)
# then create the Y vector according to the equation:
Y <- (0.075 * X + -1.874 + 0.01* X^2)
# create some noise
noise <-runif(length(Y), -1, 1) 
# add some noise to Y
Y <- Y + 1*noise

We take a look at the data we just created.


In [3]:
plot(X,Y)


In order to prepare a nice dataset for R we need to sort the data.


In [4]:
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]

and now we can create a data.frame with columns $x$, $y$ and $x^2$.

Consider a data.frame as the equivalent of an Excel spreadsheet: it has column headers and all columns have the same length, but each column can hold a different kind of data, such as strings (character), numbers (numeric), factors, etc. A short example is sketched below.
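
As a minimal illustration (not part of the original notebook), the following sketch builds a small data.frame that mixes column types; str() reports how R stores each column:

mixed <- data.frame(sample = c("a", "b", "c"),            # character strings
                    conc   = c(0.1, 0.5, 1.0),            # numbers (numeric)
                    batch  = factor(c("B1", "B1", "B2")), # factor (categorical)
                    stringsAsFactors = FALSE)             # keep strings as character
str(mixed) # reports the type of each column (chr / num / Factor)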


In [5]:
data <- data.frame(x=x, y=y, x2=x^2)

To access the data within the data.frame, use for instance:


In [6]:
x_sorted$ix


  1. 1
  2. 21
  3. 2
  4. 22
  5. 3
  6. 23
  7. 4
  8. 24
  9. 5
  10. 25
  11. 6
  12. 26
  13. 7
  14. 27
  15. 8
  16. 28
  17. 9
  18. 29
  19. 10
  20. 30
  21. 11
  22. 31
  23. 12
  24. 32
  25. 13
  26. 33
  27. 14
  28. 34
  29. 15
  30. 35
  31. 16
  32. 36
  33. 17
  34. 37
  35. 18
  36. 38
  37. 19
  38. 39
  39. 20
  40. 40

In [7]:
data$y


  1. 20.6808896446452
  2. 19.1778454063311
  3. 23.4472179714153
  4. 22.5384362566779
  5. 25.3238724275593
  6. 24.9845763441276
  7. 28.9892663422575
  8. 27.9238017571693
  9. 30.5824202879262
  10. 32.2534830246982
  11. 35.2932276019979
  12. 35.2544419390302
  13. 37.6869978821175
  14. 39.2094421764174
  15. 41.5749492880239
  16. 42.8672879615149
  17. 46.101612092294
  18. 46.6758109566944
  19. 49.1283425523889
  20. 49.7349329449226
  21. 52.8180665962084
  22. 54.477099127291
  23. 58.9155986792616
  24. 58.3801859486125
  25. 61.793895982734
  26. 62.5607569302126
  27. 66.2240875969646
  28. 66.0722021042375
  29. 71.3166885888416
  30. 70.6773536568719
  31. 76.8147496463112
  32. 75.6678853219688
  33. 80.9065056452836
  34. 82.416573231605
  35. 87.2605234804762
  36. 86.1353635709932
  37. 92.9663240766138
  38. 92.0138504459508
  39. 97.9046644102447
  40. 96.7458502384499

In [8]:
data['x'][data['y']>20.10071]


  1. 43
  2. 45.7894736842105
  3. 45.7894736842105
  4. 48.5789473684211
  5. 48.5789473684211
  6. 51.3684210526316
  7. 51.3684210526316
  8. 54.1578947368421
  9. 54.1578947368421
  10. 56.9473684210526
  11. 56.9473684210526
  12. 59.7368421052632
  13. 59.7368421052632
  14. 62.5263157894737
  15. 62.5263157894737
  16. 65.3157894736842
  17. 65.3157894736842
  18. 68.1052631578947
  19. 68.1052631578947
  20. 70.8947368421053
  21. 70.8947368421053
  22. 73.6842105263158
  23. 73.6842105263158
  24. 76.4736842105263
  25. 76.4736842105263
  26. 79.2631578947368
  27. 79.2631578947368
  28. 82.0526315789474
  29. 82.0526315789474
  30. 84.8421052631579
  31. 84.8421052631579
  32. 87.6315789473684
  33. 87.6315789473684
  34. 90.4210526315789
  35. 90.4210526315789
  36. 93.2105263157895
  37. 93.2105263157895
  38. 96
  39. 96
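
The same selection can also be written with a logical index on a single column, or with subset(); a minimal sketch (not in the original notebook):

data$x[data$y > 20.10071]               # x values for which y exceeds the threshold
subset(data, y > 20.10071, select = x)  # same selection, returned as a data.frame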

linear model with lm()

Now that we have defined a data.frame we can use it as input to most of R's pre-built functions. Here we would like to test a simple linear regression, so we use the lm() function and store the result in the fit.lm object. The linear model function accepts a data.frame as input (here data) and lets you define a model through a formula, in this case $y = ax + b$ (the formula y ~ x includes an intercept by default). It is therefore important to define your data.frame properly, so that lm() can find the $x$ and $y$ columns in data.


In [10]:
fit.lm = lm(y ~ x, data=data)

In [11]:
names(fit.lm)


  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'

At this stage we can have a look at the coefficients found, in this case those of the linear model.


In [12]:
# have a look to the coefficients
fit.lm$coefficients


(Intercept)
-47.1720053904281
x
1.45984794811312

You can find a lot of information about your fit by using R's summary() function, which gives an overview of almost any object.


In [13]:
summary(fit.lm)


Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5055 -1.9551 -0.7667  1.5985  5.0794 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -47.17201    1.73218  -27.23   <2e-16 ***
x             1.45985    0.02428   60.12   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.47 on 38 degrees of freedom
Multiple R-squared:  0.9896,	Adjusted R-squared:  0.9893 
F-statistic:  3615 on 1 and 38 DF,  p-value: < 2.2e-16

If you need to access one of the values shown here for further use in your script, you can list the names of the components returned by summary():


In [14]:
names(summary(fit.lm))


  1. 'call'
  2. 'terms'
  3. 'residuals'
  4. 'coefficients'
  5. 'aliased'
  6. 'sigma'
  7. 'df'
  8. 'r.squared'
  9. 'adj.r.squared'
  10. 'fstatistic'
  11. 'cov.unscaled'

So if you want to use the $r^2$ coefficient you can use the following command.


In [15]:
round(summary(fit.lm)$r.squared,3)


0.99

Maybe you want a nice picture of your data and the regression line.


In [16]:
plot(data$x,data$y); lines(data$x, fitted(fit.lm), col=2)


If you want more control over the plot, you can draw the regression yourself by plotting a line that crosses $x = 0$ at the intercept and has the slope given by the "x" coefficient.


In [17]:
plot(data$x,data$y,main="linear regression",ylab="intensity",xlab="concentration",ylim=c(-50,100),xlim=c(0,100));
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


You may have noticed that the data seem to have a quadratic component (which we included intentionally at the beginning). You might want to see how much better the fit becomes if you use a quadratic function to model the data.


In [18]:
fit.lm = lm(y ~ x + x2, data=data)
plot(data$x,data$y, main="linear regression",ylab="intensity",xlab="concentration")
lines(data$x, fitted(fit.lm), col=2)
paste("the coefficient $r^2$ = ",round(summary(fit.lm)$r.squared,3))
text(60,80,paste("r2 = ",round(summary(fit.lm)$r.squared,3)))


'the coefficient $r^2$ = 0.999'

Experimental design

linear regression

Use the data of Table 2.2 in Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242

Or import the table from the dataset section provided in this course.

import dataset


In [19]:
data <- read.csv("./datasets/table2.2.csv")


Warning message in read.table(file = file, header = header, sep = sep, quote = quote, :
“incomplete final line found by readTableHeader on './datasets/table2.2.csv'”

In [20]:
data


      q   q.1   q.2   q.3    q.4    q.5    q.6    q.7    q.8    q.9
  1.000 1.000 2.000 3.000  3.000  4.000  4.000  5.000  6.000  6.000
  3.803 3.276 5.181 6.948  8.762 10.672  8.266 13.032 15.021 16.426
  4.797 3.878 6.342 9.186 10.136 12.257 13.252 14.656 17.681 15.071

As you may notice, the data are not arranged the way we want: we would like the variables as columns (one row per measurement) and the row indices to be correct. We can fix this readily.


In [21]:
data <- t(data) # we transpose the data

In [22]:
rownames(data) <- seq(1,10)
colnames(data) <- c("conc","A","B")

In [23]:
data


 conc      A      B
    1  3.803  4.797
    1  3.276  3.878
    2  5.181  6.342
    3  6.948  9.186
    3  8.762 10.136
    4 10.672 12.257
    4  8.266 13.252
    5 13.032 14.656
    6 15.021 17.681
    6 16.426 15.071

Although the table looks good, there may be a problem with it: R may see every entry of the table as a character rather than as a number, in which case we cannot use the data for calculations. Fixing this is very simple, but it is sometimes annoying because it is not something we are used to thinking about. To "see" what R "sees", use the following command.


In [24]:
is(data[1])
data[,1]


  1. 'numeric'
  2. 'vector'

  1  2  3  4  5  6  7  8  9 10 
  1  1  2  3  3  4  4  5  6  6
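
Here the column turns out to be numeric already, but if a column had been read as character or as a factor, the conversion could be sketched as follows (hypothetical values, not from the dataset):

x_chr <- c("3.803", "3.276", "5.181")  # numbers stored as character strings
as.numeric(x_chr)                      # -> 3.803 3.276 5.181

x_fac <- factor(c("1", "1", "2"))      # numbers stored as a factor
as.numeric(as.character(x_fac))        # convert via character; as.numeric(x_fac) would return the level codes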

Putting this all together in a cleaner form gives:


In [62]:
remove(data)
data <- t(read.csv2("./datasets/table2.2.csv",header=TRUE,sep=",",dec="."))
rownames(data) <- seq(1:10)
data <- data.frame("conc"=as.factor(data[,1]), "A"=data[,2],"B"=data[,3])
data


Warning message in read.table(file = file, header = header, sep = sep, quote = quote, :
“incomplete final line found by readTableHeader on './datasets/table2.2.csv'”
 conc      A      B
    1  3.803  4.797
    1  3.276  3.878
    2  5.181  6.342
    3  6.948  9.186
    3  8.762 10.136
    4 10.672 12.257
    4  8.266 13.252
    5 13.032 14.656
    6 15.021 17.681
    6 16.426 15.071

Analysis of variance (ANOVA)

In order to estimate the quality of our regression we perform an analysis of variance. Here we will perform this analysis step by step as described in Chapter 2 of Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242

One of the most important features of ANOVA is the comparison of the lack of fit with the replicate error. The replicate error gives an estimate of the experimental error in the data, while the lack of fit gives an estimate of how well our model describes the data.
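
In terms of sums of squares, the decomposition used in the cells below can be summarised as (Brereton's notation, with $N$ data points, $P$ model parameters and $R$ replicates):

$$S_{\mathrm{resid}} = S_{\mathrm{lof}} + S_{\mathrm{rep}}, \qquad F = \frac{S_{\mathrm{lof}}/(N-P-R)}{S_{\mathrm{rep}}/R}$$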

with intercept

We first find the duplicated concentration values (replicates) in our table.


In [26]:
duplicated(data[,1])


  1. FALSE
  2. TRUE
  3. FALSE
  4. FALSE
  5. TRUE
  6. FALSE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE

In [27]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2 # if our model is y = ax + b
# find out the degree of freedom for our experiment
D <- N - P - R
D


4

In [28]:
R
N


4
10

In [29]:
library(plyr) # data-manipulation helpers (note: aggregate() used below is actually base R)

meanRep <- aggregate(A ~ conc, data = data, mean) # this calculates the mean for all replicates

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] 
    sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result <- data.frame("Sum Replicate"=S_replicate)
result


Sum.Replicate
5.665593

We can verify our result by computing the straight (unsquared) sum of the replicate differences, which gives 0.


In [30]:
round(sum(sumRep),3)


0

Using the models proposed in Chapter 2 of Brereton, we can compute the sum of squared residuals, that is, the sum of the squared differences between the experimental data and the predicted values.


In [31]:
predicted <- 0.6113 + 2.4364 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 )
result["Sum Residual"] <- S_residual # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"]


Sum Residual
8.371165

In [32]:
S_total <- sum( data$A^2 ) # N degrees of freedom 
result["Sum Total"] <- S_total; result["Sum Total"]

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted; result["Sum Predicted"]


Sum Total
1024.58
Sum Predicted
1016.208

In [33]:
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit; result["Sum Lack Of Fit"]
result["F"] <- (S_lackOfFit / D) / (S_replicate / R); result["F"]


Sum Lack Of Fit
2.705572
F
0.4775444

In [34]:
plot(data$conc, data$A)
abline(0.6113, 2.4364, col="red")


lack of fit

We can now organize this and run the four cases, for both datasets A and B, with and without intercept. The comparison table is shown below.


In [64]:
plot(as.numeric(data$conc), as.numeric(data$A))



In [35]:
#var.test((data$A - predicted),sumRep2[c(1,4,6,9)])
#sumRep2[c(1,4,6,9)]
#to be tested. Should be possible to use this command in this context.

In [36]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2
D <- N - P - R
meanRep <- aggregate(A ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] 
    sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 0.6113 + 2.4364 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
A2 <- t(result)

In [37]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 1
D <- N - P - R
meanRep <- aggregate(A ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] 
    sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 2.576 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
A1 <- t(result)

In [38]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2
D <- N - P - R
meanRep <- aggregate(B ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] 
    sumRep2[i] <- ( meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 2.032 + 2.484 * as.numeric(data$conc)
S_residual <- sum( (data$B - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$B^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
B2 <- t(result)

In [39]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 1
D <- N - P - R
meanRep <- aggregate(B ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] 
    sumRep2[i] <- ( meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 2.948 * as.numeric(data$conc)
S_residual <- sum( (data$B - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$B^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
B1 <- t(result)

In [40]:
tableOfResult1 <- data.frame(A1,A2,B1,B2)
tableOfResult1


                              A1          A2          B1           B2
Number.Replicate        4.000000    4.000000    4.000000    4.0000000
Number of Data         10.000000   10.000000   10.000000   10.0000000
Number of Parameters    1.000000    2.000000    1.000000    2.0000000
Sum Replicate           5.665593    5.665593    4.774593    4.7745930
Sum Residual            9.116023    8.371165   15.472464    7.2399200
Sum Total            1024.580375 1024.580375 1345.792840 1345.7928400
Sum Predicted        1015.273728 1016.207748 1329.677712 1338.6635680
Sum Lack Of Fit         3.450430    2.705572   10.697871    2.4653270
Mean Residuals          1.012891    1.046396    1.719163    0.9049900
Mean Total            102.458038  102.458038  134.579284  134.5792840
Mean Predicted       1015.273728  508.103874 1329.677712  669.3317840
Mean Replicate          1.416398    1.416398    1.193648    1.1936483
Lack Of Fit             0.690086    0.676393    2.139574    0.6163318

Both datasets behave similarly. However, for the second dataset, excluding the intercept increases the lack of fit. It is important to note that the lack of fit can only be judged by comparing it with the mean sum of replicates (MSR). In all cases the LOF is smaller than the MSR, except for B1. The conclusion is that we need two parameters to best fit these data.


In [41]:
# Brereton p.42
qf(0.95,5,4)  # critical F value at 95% confidence.
# above this number we can conclude with 95% confidence that intercept is useful.
2.1395742/1.193648
0.6900860/1.416398
pf(1.792,5,4,lower.tail = TRUE)
pf(0.48721,5,4,lower.tail = TRUE)


6.25605650216089
1.79246662332614
0.487211927720881
0.704083882026608
0.225084202869423

So we can conclude with 70.4% confidence that the intercept is useful for dataset B. No conclusion can be drawn for dataset A, since its F-ratio is very low.
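
As a cross-check (not part of the original notebook), base R can compare the two nested models directly with anova(); a minimal sketch, assuming data still holds the conc and B columns built above:

d  <- data.frame(conc = as.numeric(as.character(data$conc)), B = data$B) # conc is a factor, convert it back to numbers
m0 <- lm(B ~ conc - 1, data = d)  # straight line through the origin
m1 <- lm(B ~ conc,     data = d)  # straight line with intercept
anova(m0, m1)                     # F-test for the extra (intercept) parameter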

programming a function

The way we performed the calculation above is not very efficient: we copied and pasted the same code four times, and any modification would have to be propagated to all four copies. A better practice is to wrap the calculation in a function, which also gives us more flexibility. This is done as follows:


In [42]:
lof <- function(x, y, fit.lm) {
    
    data=data.frame("x"=as.numeric(x), "y"=y)
    #fit.lm <- lm( y ~ x, data=data )
    
    Rep <- duplicated(data$x) # find replicates
    R <- sum(Rep) # enumerate replicates
    N <- dim(data)[1] # find out dimension of data
    P <- length(fit.lm$coefficients)
    D <- N - P - R
    result <- data.frame("Number Replicate"=R)
    result["Number of Data"] <- N
    result["Number of Parameters"] <- P
    result["Degrees of Freedom"] <- D
    
    meanRep <- aggregate(y ~ x, data = data, mean)
    
    sumRep=0
    sumRep2=0
    for (i in seq(1,nrow(data))) {
        sumRep[i] <- meanRep$y[ meanRep$x == data$x[i] ] - data$y[i] 
        sumRep2[i] <- ( meanRep$y[ meanRep$x == data$x[i] ] - data$y[i] )^2
    }

    S_replicate <- sum(sumRep2) # R degrees of freedom 
    result["Sum Replicate"] <- round(S_replicate,3)
 
    S_residual <- sum ( resid(fit.lm)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
    result["Sum Residual"] <- round(S_residual,3)

    S_total <- sum( data$y^2 ) # N degrees of freedom
    result["Sum Total"] <- round(S_total,3)

    S_predicted <- sum( fitted(fit.lm)^2 ) # P degrees of freedom
    result["Sum Predicted"] <- round(S_predicted,3)

    S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
    result["Sum Lack Of Fit"] <- round(S_lackOfFit,3)

    result["Mean Residuals"] <- round(S_residual / (N-P),3)
    
    result["Mean Total"] <- round(S_total / N,3)
    
    result["Mean Predicted"] <- round(S_predicted / P,3)
    
    result["Mean Replicate"] <- round(S_replicate / R,3)
    
    result["Lack Of Fit"] <- round(S_lackOfFit / D,3)
    
    result["F-value"] <- round(result["Lack Of Fit"] / result["Mean Replicate"],3)
    
    result["p-value"] <- df(as.numeric(result["F-value"]),D,R)
    
    result["r2"] <- round(summary(fit.lm)$r.squared,3)
    
    result["a(slope)"] <- round(fit.lm$coefficients[2],3)
    
    result["b(intercept)"] <- round(fit.lm$coefficients[1],3)
    
    return( t(result) )
}

We can use this function on our dataset A


In [65]:
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(data$conc), "A"=data$A) )
r1 <- lof(data$conc, data$A, fit.lm)

plot(data$conc,data$A, main="linear regression",ylab="intensity",xlab="concentration")
#lines(data$conc, fitted(fit.lm), col=2)
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


We repeated the calculation for dataset A, but this time we used lm() to fit the data, which means we are no longer using the model from the book but the one optimized here. In addition we computed the p-value. A p-value higher than 0.05 means that we cannot conclude that our model describes the data inaccurately.

We repeat the same calculation, but this time we concatenate the A and B datasets to see the effect of a larger dataset.


In [66]:
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(c(data$conc,data$conc)), "A"=c(data$A, data$B)) )
r2 <- lof(c(data$conc,data$conc), c(data$A, data$B), fit.lm)

plot(as.factor(c(data$conc,data$conc)), c(data$A, data$B), main="linear regression",ylab="intensity",xlab="concentration")
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


The next table shows the comparison. The first thing to note is that the number of degrees of freedom is unchanged, because we added more points but also more replicates; maybe there are too many replicates in this case. The next thing to note is that the lack of fit is smaller, which is good, but the MSR is larger. The p-values of both fits are larger than 0.05, so we cannot conclude that the model is inaccurate, which is fine.


In [67]:
tableOfResult2 <- cbind(r1,r2)
tableOfResult2


Number.Replicate        4.0000000   14.0000000
Number of Data         10.0000000   20.0000000
Number of Parameters    2.0000000    2.0000000
Degrees of Freedom      4.0000000    4.0000000
Sum Replicate           5.6660000   27.5520000
Sum Residual            8.3710000   28.2370000
Sum Total            1024.5800000 2370.3730000
Sum Predicted        1016.2090000 2342.1370000
Sum Lack Of Fit         2.7060000    0.6850000
Mean Residuals          1.0460000    1.5690000
Mean Total            102.4580000  118.5190000
Mean Predicted        508.1050000 1171.0680000
Mean Replicate          1.4160000    1.9680000
Lack Of Fit             0.6760000    0.1710000
F-value                 0.4770000    0.0870000
p-value                 0.6013782    0.3188609
r2                      0.9560000    0.9290000
a(slope)                2.4360000    2.4600000
b(intercept)            0.6110000    1.3220000

Replicates

If we are curious about the effect of the replicates, we can remove them and apply the test again. Clearly we then have no estimate of the experimental error to compare with: we can compute the LOF, but there is no MSR with which to compute a p-value.

Finally, it is important to look at the coefficients. After all, we are interested in a calibration curve: how do the different experimental designs affect the values of the coefficients?


In [46]:
F <- duplicated(data$conc)
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(data$conc[F]), "A"=data$A[F]) )
lof(data$conc[F], data$A[F], fit.lm)

plot(data$conc[F], data$A[F], main="linear regression",ylab="intensity",xlab="concentration")
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


Number.Replicate       0.000
Number of Data         4.000
Number of Parameters   2.000
Degrees of Freedom     2.000
Sum Replicate          0.000
Sum Residual           6.486
Sum Total            425.645
Sum Predicted        419.159
Sum Lack Of Fit        6.486
Mean Residuals         3.243
Mean Total           106.411
Mean Predicted       209.580
Mean Replicate           NaN
Lack Of Fit            3.243
F-value                  NaN
p-value                  NaN
r2                     0.927
a(slope)               2.510
b(intercept)           0.398

p-values, model and noise

As already mentioned, the p-value can be used to decide whether we can conclude that our model is inappropriate. We can go back to the data of the first example to demonstrate how the p-value helps us decide whether our model is accurate, and to illustrate the effect of noise in the data.


In [47]:
N <- 20 # we define the number of observations

# we create a fake dataset using a quadratic
#equation and adding some noise
# first create a vector of x repeated rep times
rep <- 2 # number of replicates
X <- rep(seq(from=43, to=96, length.out=N),rep)
# then create the Y vector according to the equation:
Y <- (0.075 * X + -1.874 + 0.01* X^2)
# create some noise
noise <-runif(length(Y), -1, 1) 
# add some noise to Y
Y <- Y + 1*noise
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
data <- data.frame("x"=x, "y"=y, x2=x^2)

# we fit the data and evaluate the goodness of it
fit.lm <- lm( y ~ x, data=data )
noQuadLowNoise <- lof(data$x, data$y, fit.lm)

In [48]:
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="low noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
noQuadLowNoise[17]


0.992

Although the correlation coefficient is high, we can see a trend in the data that is not correctly described by the model: the model is linear, while we know the data contain a quadratic term. Looking only at the fitted line and the correlation coefficient, we might conclude that the model is not bad.

However, the LOF is larger than the MSR, which indicates a problem with our model. If we compute the p-value:


In [49]:
noQuadLowNoise[16] # p-value


2.83496955374502e-14

In [50]:
# Brereton p.42
qf(0.95,noQuadLowNoise[4],noQuadLowNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
noQuadLowNoise[15]
pf(noQuadLowNoise[15],noQuadLowNoise[4],noQuadLowNoise[1],lower.tail = TRUE)


2.15112442712183
59.163
0.999999999999827
We find a very small number, which means we can conclude that our model is not accurate. Let's repeat this, but this time include a quadratic term in the model.

In [51]:
fit.lm <- lm( y ~ x + x2, data=data )
QuadLowNoise <- lof(data$x, data$y, fit.lm)
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="low noise data / quadratic model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
QuadLowNoise[16]


0.00758060780767054

In [52]:
# Brereton p.42
qf(0.95,QuadLowNoise[4],QuadLowNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
QuadLowNoise[15]
pf(QuadLowNoise[15],QuadLowNoise[4],QuadLowNoise[1],lower.tail = TRUE)


2.16670099681198
3.477
0.995485550032857

Clearly the model is more accurate and we obtain a p-value that is larger than 0.05.

Now let's see what happens if the noise is large.


In [53]:
N <- 20 
rep <- 2 
X <- rep(seq(from=43, to=96, length.out=N),rep)
Y <- (0.075 * X + -1.874 + 0.01* X^2)
noise <-runif(length(Y), -1, 1) 
Y <- Y + 4*noise # we increase the noise
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
data <- data.frame("x"=x, "y"=y, x2=x^2)

fit.lm <- lm( y ~ x + x2, data=data )
QuadHighNoise <- lof(data$x, data$y, fit.lm)

plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / quadratic model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)

QuadHighNoise[16]


0.553610438205782

In [54]:
# Brereton p.42
qf(0.95,QuadHighNoise[4],QuadHighNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
QuadHighNoise[15]
pf(QuadHighNoise[15],QuadHighNoise[4],QuadHighNoise[1],lower.tail = TRUE)


2.16670099681198
1.3
0.715217234887876

We now repeat the same operation, but without the quadratic term.


In [55]:
fit.lm <- lm( y ~ x, data=data )
noQuadHighNoise <- lof(data$x, data$y, fit.lm)

plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)

noQuadHighNoise[16]


0.0114897700714444

In [56]:
# Brereton p.42
qf(0.95,noQuadHighNoise[4],noQuadHighNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
noQuadHighNoise[15]
pf(noQuadHighNoise[15],noQuadHighNoise[4],noQuadHighNoise[1],lower.tail = TRUE)


2.15112442712183
3.227
0.99347154530691

It can be observed that both p-values are larger than 0.05; thus we cannot rule out either model and must statistically accept both as acceptable for our data. This is the effect of noise: it increases the MSR, resulting in larger p-values.

You may have to re-run the last cells several times, because the data are generated randomly and the p-values may vary considerably.
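
If reproducible numbers are preferred, the random stream can be fixed before the noise is generated; a minimal sketch (the seed value is arbitrary):

set.seed(42)                      # fix the random number generator
noise <- runif(length(Y), -1, 1)  # this draw is now identical at every run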

It may be interesting to plot the residuals. Here we clearly distinguish the quadratic behaviour that is not taken into account by our linear model. Plotting the residuals is always a good idea to help validate a model.


In [57]:
plot( resid(fit.lm) )
abline(0,0)


Finally, we may want to add confidence intervals (here 95% and 80%) to our calibration curve. This can be achieved with a single command.


In [58]:
interval <- confint(fit.lm)               # 95% confidence interval (the default level)
interval80 <- confint(fit.lm, level=0.8)  # 80% confidence interval

In [59]:
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col="red")
abline(interval[1,1], interval[2,1], col="gray")
abline(interval[1,2], interval[2,2], col="gray")
abline(interval80[1,1], interval80[2,1], col="gray", lty=2)
abline(interval80[1,2], interval80[2,2], col="gray", lty=2)


Here is the calibration curve with confidence intervals at 80% (gray dashed lines) and 95% (gray solid lines). More practical information about calibration curves can be found here: https://raw.githubusercontent.com/jwist/chemometrics/master/pdf/Calibration-curve-guide.pdf
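
As an alternative sketch (not from the original notebook), pointwise confidence bands around the fitted line can also be obtained with predict() on the linear model fitted above:

new  <- data.frame(x = seq(min(data$x), max(data$x), length.out = 100))
ci95 <- predict(fit.lm, newdata = new, interval = "confidence", level = 0.95)
plot(data$x, data$y, xlab = "concentration", ylab = "intensity")
lines(new$x, ci95[, "fit"], col = "red")   # fitted calibration line
lines(new$x, ci95[, "lwr"], col = "gray")  # lower 95% confidence bound
lines(new$x, ci95[, "upr"], col = "gray")  # upper 95% confidence bound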

According to the decision tree, in this simple case we have one IV (X, the cause) and one DV (Y, the effect). Since both are continuous variables we are in the domain of regression, and because there is a single continuous IV and a single continuous DV we speak of simple linear regression. In the next example we will study the effect of pH, temperature and concentration on the yield of a reaction; we will then have several IVs but still a single DV, and we speak of multiple regression or multivariable regression.
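
A hypothetical sketch of the formula syntax for such a multiple regression (variable names and the simulated dataset are illustrative only, not the dataset of the next example):

set.seed(1)
reaction <- data.frame(pH   = runif(20, 2, 8),
                       temp = runif(20, 20, 80),
                       conc = runif(20, 0.1, 1.0))
reaction$yield <- 5 + 2*reaction$pH + 0.3*reaction$temp + 10*reaction$conc + rnorm(20)
fit.mlr <- lm(yield ~ pH + temp + conc, data = reaction)  # several IVs, one DV
summary(fit.mlr)  # one coefficient per IV, plus the intercept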

for more about this read: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518362/

