Chemometrics


Julien Wist / 2017 / Universidad del Valle

An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/


In [1]:
options(repr.plot.width=4, repr.plot.height=4) # change these setting to plot larger figures

A first example - simple linear regression

We make repeated measurements of an observable Y as a function of X. You might think of a calibration curve, where you measure the transmitted intensity as a function of the concentration of a solution.

create data and visualize it


In [2]:
#rm(list=ls(all=TRUE)) # we clear the variable space

N <- 20 # we define the number of observations

# we create a fake dataset using a quadratic
#equation and adding some noise
# first create a vector of x repeated rep times
rep <- 2 # number of replicates
X <- rep(seq(from=43, to=96, length.out=N),rep)
# then create the Y vector according to the equation:
Y <- (0.075 * X + -1.874 + 0.01* X^2)
# create some noise
noise <-runif(length(Y), -1, 1) 
# add some noise to Y
Y <- Y + 1*noise

We take a look at the data we just created.


In [3]:
plot(X,Y)


In order to prepare a nice dataset for R we need to sort the data.


In [4]:
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]

and now we can create a data.frame with columns $x$, $y$ and $x^2$.

Consider a data.frame as the equivalent of an Excel spreadsheet: it has column headers and all columns have the same length, but each column can hold a different kind of data, such as strings (character), numbers (numeric), factors, etc. A short example is sketched below.
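
As a minimal illustration (not part of the original notebook), the following sketch builds a small data.frame that mixes column types; str() reports how R stores each column:

mixed <- data.frame(sample = c("a", "b", "c"),            # character strings
                    conc   = c(0.1, 0.5, 1.0),            # numbers (numeric)
                    batch  = factor(c("B1", "B1", "B2")), # factor (categorical)
                    stringsAsFactors = FALSE)             # keep strings as character
str(mixed) # reports the type of each column (chr / num / Factor)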


In [5]:
data <- data.frame(x=x, y=y, x2=x^2)

To access the data within the data.frame, use for instance:


In [6]:
x_sorted$ix


  1. 1
  2. 21
  3. 2
  4. 22
  5. 3
  6. 23
  7. 4
  8. 24
  9. 5
  10. 25
  11. 6
  12. 26
  13. 7
  14. 27
  15. 8
  16. 28
  17. 9
  18. 29
  19. 10
  20. 30
  21. 11
  22. 31
  23. 12
  24. 32
  25. 13
  26. 33
  27. 14
  28. 34
  29. 15
  30. 35
  31. 16
  32. 36
  33. 17
  34. 37
  35. 18
  36. 38
  37. 19
  38. 39
  39. 20
  40. 40

In [7]:
data$y


  1. 20.6808896446452
  2. 19.1778454063311
  3. 23.4472179714153
  4. 22.5384362566779
  5. 25.3238724275593
  6. 24.9845763441276
  7. 28.9892663422575
  8. 27.9238017571693
  9. 30.5824202879262
  10. 32.2534830246982
  11. 35.2932276019979
  12. 35.2544419390302
  13. 37.6869978821175
  14. 39.2094421764174
  15. 41.5749492880239
  16. 42.8672879615149
  17. 46.101612092294
  18. 46.6758109566944
  19. 49.1283425523889
  20. 49.7349329449226
  21. 52.8180665962084
  22. 54.477099127291
  23. 58.9155986792616
  24. 58.3801859486125
  25. 61.793895982734
  26. 62.5607569302126
  27. 66.2240875969646
  28. 66.0722021042375
  29. 71.3166885888416
  30. 70.6773536568719
  31. 76.8147496463112
  32. 75.6678853219688
  33. 80.9065056452836
  34. 82.416573231605
  35. 87.2605234804762
  36. 86.1353635709932
  37. 92.9663240766138
  38. 92.0138504459508
  39. 97.9046644102447
  40. 96.7458502384499

In [8]:
data['x'][data['y']>20.10071]


  1. 43
  2. 45.7894736842105
  3. 45.7894736842105
  4. 48.5789473684211
  5. 48.5789473684211
  6. 51.3684210526316
  7. 51.3684210526316
  8. 54.1578947368421
  9. 54.1578947368421
  10. 56.9473684210526
  11. 56.9473684210526
  12. 59.7368421052632
  13. 59.7368421052632
  14. 62.5263157894737
  15. 62.5263157894737
  16. 65.3157894736842
  17. 65.3157894736842
  18. 68.1052631578947
  19. 68.1052631578947
  20. 70.8947368421053
  21. 70.8947368421053
  22. 73.6842105263158
  23. 73.6842105263158
  24. 76.4736842105263
  25. 76.4736842105263
  26. 79.2631578947368
  27. 79.2631578947368
  28. 82.0526315789474
  29. 82.0526315789474
  30. 84.8421052631579
  31. 84.8421052631579
  32. 87.6315789473684
  33. 87.6315789473684
  34. 90.4210526315789
  35. 90.4210526315789
  36. 93.2105263157895
  37. 93.2105263157895
  38. 96
  39. 96
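
The same selection can also be written with a logical index on a single column, or with subset(); a minimal sketch (not in the original notebook):

data$x[data$y > 20.10071]               # x values for which y exceeds the threshold
subset(data, y > 20.10071, select = x)  # same selection, returned as a data.frame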

linear model with lm()

Now that we have defined a data.frame we can use it as input to most of R's pre-built functions. Here we would like to test a simple linear regression, so we use the lm() function and store the result in the fit.lm object. The linear model function accepts a data.frame as input (here data) and lets you define a model through a formula, in this case $y = ax + b$ (the formula y ~ x includes an intercept by default). It is therefore important to define your data.frame properly, so that lm() can find the $x$ and $y$ columns in data.


In [10]:
fit.lm = lm(y ~ x, data=data)

In [11]:
names(fit.lm)


  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'

At this stage we can have a look at the coefficients found, in this case those of the linear model.


In [12]:
# have a look to the coefficients
fit.lm$coefficients


(Intercept)
-47.1720053904281
x
1.45984794811312

You can find a lot of information about your fit by using R's summary() function, which gives an overview of almost any object.


In [13]:
summary(fit.lm)


Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5055 -1.9551 -0.7667  1.5985  5.0794 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -47.17201    1.73218  -27.23   <2e-16 ***
x             1.45985    0.02428   60.12   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.47 on 38 degrees of freedom
Multiple R-squared:  0.9896,	Adjusted R-squared:  0.9893 
F-statistic:  3615 on 1 and 38 DF,  p-value: < 2.2e-16

If you need to access one of the values shown here for further use in your script, you can list the names of the components returned by summary():


In [14]:
names(summary(fit.lm))


  1. 'call'
  2. 'terms'
  3. 'residuals'
  4. 'coefficients'
  5. 'aliased'
  6. 'sigma'
  7. 'df'
  8. 'r.squared'
  9. 'adj.r.squared'
  10. 'fstatistic'
  11. 'cov.unscaled'

So if you want to use the $r^2$ coefficient you can use the following command.


In [15]:
round(summary(fit.lm)$r.squared,3)


0.99

Maybe you want a nice picture of your data and the regression line.


In [16]:
plot(data$x,data$y); lines(data$x, fitted(fit.lm), col=2)


If you want more control over the plot, you can draw the regression yourself by plotting a line that crosses $x = 0$ at the intercept and has the slope given by the "x" coefficient.


In [17]:
plot(data$x,data$y,main="linear regression",ylab="intensity",xlab="concentration",ylim=c(-50,100),xlim=c(0,100));
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


You may have noticed that the data seem to have a quadratic component (which we included intentionally at the beginning). You might want to see how much better the fit becomes if you use a quadratic function to model the data.


In [18]:
fit.lm = lm(y ~ x + x2, data=data)
plot(data$x,data$y, main="linear regression",ylab="intensity",xlab="concentration")
lines(data$x, fitted(fit.lm), col=2)
paste("the coefficient $r^2$ = ",round(summary(fit.lm)$r.squared,3))
text(60,80,paste("r2 = ",round(summary(fit.lm)$r.squared,3)))


'the coefficient $r^2$ = 0.999'

Experimental design

linear regression

Use the data of Table 2.2 in Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242

Or import the table from the dataset section provided in this course.

import dataset


In [19]:
data <- read.csv("./datasets/table2.2.csv")


Warning message in read.table(file = file, header = header, sep = sep, quote = quote, :
“incomplete final line found by readTableHeader on './datasets/table2.2.csv'”

In [20]:
data


      q   q.1   q.2   q.3    q.4    q.5    q.6    q.7    q.8    q.9
  1.000 1.000 2.000 3.000  3.000  4.000  4.000  5.000  6.000  6.000
  3.803 3.276 5.181 6.948  8.762 10.672  8.266 13.032 15.021 16.426
  4.797 3.878 6.342 9.186 10.136 12.257 13.252 14.656 17.681 15.071

As you may notice, the data are not arranged the way we want: we would like the variables as columns (one row per measurement) and the row indices to be correct. We can fix this readily.


In [21]:
data <- t(data) # we transpose the data

In [22]:
rownames(data) <- seq(1,10)
colnames(data) <- c("conc","A","B")

In [23]:
data


 conc      A      B
    1  3.803  4.797
    1  3.276  3.878
    2  5.181  6.342
    3  6.948  9.186
    3  8.762 10.136
    4 10.672 12.257
    4  8.266 13.252
    5 13.032 14.656
    6 15.021 17.681
    6 16.426 15.071

Although the table looks good, there may be a problem with it: R may see every entry of the table as a character rather than as a number, in which case we cannot use the data for calculations. Fixing this is very simple, but it is sometimes annoying because it is not something we are used to thinking about. To "see" what R "sees", use the following command.


In [24]:
is(data[1])
data[,1]


  1. 'numeric'
  2. 'vector'

  1  2  3  4  5  6  7  8  9 10 
  1  1  2  3  3  4  4  5  6  6
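
Here the column turns out to be numeric already, but if a column had been read as character or as a factor, the conversion could be sketched as follows (hypothetical values, not from the dataset):

x_chr <- c("3.803", "3.276", "5.181")  # numbers stored as character strings
as.numeric(x_chr)                      # -> 3.803 3.276 5.181

x_fac <- factor(c("1", "1", "2"))      # numbers stored as a factor
as.numeric(as.character(x_fac))        # convert via character; as.numeric(x_fac) would return the level codes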

Putting this all together in a cleaner form gives:


In [62]:
remove(data)
data <- t(read.csv2("./datasets/table2.2.csv",header=TRUE,sep=",",dec="."))
rownames(data) <- seq(1:10)
data <- data.frame("conc"=as.factor(data[,1]), "A"=data[,2],"B"=data[,3])
data


Warning message in read.table(file = file, header = header, sep = sep, quote = quote, :
“incomplete final line found by readTableHeader on './datasets/table2.2.csv'”
 conc      A      B
    1  3.803  4.797
    1  3.276  3.878
    2  5.181  6.342
    3  6.948  9.186
    3  8.762 10.136
    4 10.672 12.257
    4  8.266 13.252
    5 13.032 14.656
    6 15.021 17.681
    6 16.426 15.071

Analysis of variance (ANOVA)

In order to estimate the quality of our regression we perform an analysis of variance. Here we will perform this analysis step by step as described in Chapter 2 of Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242

One of the most important features of ANOVA is the comparison of the lack of fit with the replicate error. The replicate error gives an estimate of the experimental error in the data, while the lack of fit gives an estimate of how well our model describes the data.
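
In terms of sums of squares, the decomposition used in the cells below can be summarised as (Brereton's notation, with $N$ data points, $P$ model parameters and $R$ replicates):

$$S_{\mathrm{resid}} = S_{\mathrm{lof}} + S_{\mathrm{rep}}, \qquad F = \frac{S_{\mathrm{lof}}/(N-P-R)}{S_{\mathrm{rep}}/R}$$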

with intercept

We first find the duplicated concentration values (replicates) in our table.


In [26]:
duplicated(data[,1])


  1. FALSE
  2. TRUE
  3. FALSE
  4. FALSE
  5. TRUE
  6. FALSE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE

In [27]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2 # if our model is y = ax + b
# find out the degree of freedom for our experiment
D <- N - P - R
D


4

In [28]:
R
N


4
10

In [29]:
library(plyr) # data-manipulation helpers (note: aggregate() used below is actually base R)

meanRep <- aggregate(A ~ conc, data = data, mean) # this calculates the mean for all replicates

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] 
    sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result <- data.frame("Sum Replicate"=S_replicate)
result


Sum.Replicate
5.665593

We can verify our result by computing the straight (unsquared) sum of the replicate differences, which gives 0.


In [30]:
round(sum(sumRep),3)


0

Using the models proposed in Chapter 2 of Brereton, we can compute the sum of squared residuals, that is, the sum of the squared differences between the experimental data and the predicted values.


In [31]:
predicted <- 0.6113 + 2.4364 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 )
result["Sum Residual"] <- S_residual # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"]


Sum Residual
8.371165

In [32]:
S_total <- sum( data$A^2 ) # N degrees of freedom 
result["Sum Total"] <- S_total; result["Sum Total"]

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted; result["Sum Predicted"]


Sum Total
1024.58
Sum Predicted
1016.208

In [33]:
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit; result["Sum Lack Of Fit"]
result["F"] <- (S_lackOfFit / D) / (S_replicate / R); result["F"]


Sum Lack Of Fit
2.705572
F
0.4775444

In [34]:
plot(data$conc, data$A)
abline(0.6113, 2.4364, col="red")


lack of fit

We can now organize this and run the four cases, for both datasets A and B, with and without intercept. The comparison table is shown below.


In [64]:
plot(as.numeric(data$conc), as.numeric(data$A))



In [35]:
#var.test((data$A - predicted),sumRep2[c(1,4,6,9)])
#sumRep2[c(1,4,6,9)]
#to be tested. Should be possible to use this command in this context.

In [36]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2
D <- N - P - R
meanRep <- aggregate(A ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] 
    sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 0.6113 + 2.4364 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
A2 <- t(result)

In [37]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 1
D <- N - P - R
meanRep <- aggregate(A ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] 
    sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 2.576 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
A1 <- t(result)

In [38]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2
D <- N - P - R
meanRep <- aggregate(B ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] 
    sumRep2[i] <- ( meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 2.032 + 2.484 * as.numeric(data$conc)
S_residual <- sum( (data$B - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$B^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
B2 <- t(result)

In [39]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 1
D <- N - P - R
meanRep <- aggregate(B ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P

sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
    sumRep[i] <- meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] 
    sumRep2[i] <- ( meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom 
result["Sum Replicate"] <- S_replicate

predicted <- 2.948 * as.numeric(data$conc)
S_residual <- sum( (data$B - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual

S_total <- sum( data$B^2 ) # N degrees of freedom
result["Sum Total"] <- S_total

S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted

S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit

result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
B1 <- t(result)

In [40]:
tableOfResult1 <- data.frame(A1,A2,B1,B2)
tableOfResult1


                              A1          A2          B1           B2
Number.Replicate        4.000000    4.000000    4.000000    4.0000000
Number of Data         10.000000   10.000000   10.000000   10.0000000
Number of Parameters    1.000000    2.000000    1.000000    2.0000000
Sum Replicate           5.665593    5.665593    4.774593    4.7745930
Sum Residual            9.116023    8.371165   15.472464    7.2399200
Sum Total            1024.580375 1024.580375 1345.792840 1345.7928400
Sum Predicted        1015.273728 1016.207748 1329.677712 1338.6635680
Sum Lack Of Fit         3.450430    2.705572   10.697871    2.4653270
Mean Residuals          1.012891    1.046396    1.719163    0.9049900
Mean Total            102.458038  102.458038  134.579284  134.5792840
Mean Predicted       1015.273728  508.103874 1329.677712  669.3317840
Mean Replicate          1.416398    1.416398    1.193648    1.1936483
Lack Of Fit             0.690086    0.676393    2.139574    0.6163318

Both datasets behave similarly. However, for the second dataset, excluding the intercept increases the lack of fit. It is important to note that the lack of fit can only be judged by comparing it with the mean sum of replicates (MSR). In all cases the LOF is smaller than the MSR, except for B1. The conclusion is that we need two parameters to best fit these data.


In [41]:
# Brereton p.42
qf(0.95,5,4)  # critical F value at 95% confidence.
# above this number we can conclude with 95% confidence that intercept is useful.
2.1395742/1.193648
0.6900860/1.416398
pf(1.792,5,4,lower.tail = TRUE)
pf(0.48721,5,4,lower.tail = TRUE)


6.25605650216089
1.79246662332614
0.487211927720881
0.704083882026608
0.225084202869423

So we can conclude with 70.4% confidence that the intercept is useful for dataset B. No conclusion can be drawn for dataset A, since its F-ratio is very low.
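
As a cross-check (not part of the original notebook), base R can compare the two nested models directly with anova(); a minimal sketch, assuming data still holds the conc and B columns built above:

d  <- data.frame(conc = as.numeric(as.character(data$conc)), B = data$B) # conc is a factor, convert it back to numbers
m0 <- lm(B ~ conc - 1, data = d)  # straight line through the origin
m1 <- lm(B ~ conc,     data = d)  # straight line with intercept
anova(m0, m1)                     # F-test for the extra (intercept) parameter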

programming a function

The way we performed the calculation above is not very efficient: we copied and pasted the same code four times, and any modification would have to be propagated to all four copies. A better practice is to wrap the calculation in a function, which also gives us more flexibility. This is done as follows:


In [42]:
lof <- function(x, y, fit.lm) {
    
    data=data.frame("x"=as.numeric(x), "y"=y)
    #fit.lm <- lm( y ~ x, data=data )
    
    Rep <- duplicated(data$x) # find replicates
    R <- sum(Rep) # enumerate replicates
    N <- dim(data)[1] # find out dimension of data
    P <- length(fit.lm$coefficients)
    D <- N - P - R
    result <- data.frame("Number Replicate"=R)
    result["Number of Data"] <- N
    result["Number of Parameters"] <- P
    result["Degrees of Freedom"] <- D
    
    meanRep <- aggregate(y ~ x, data = data, mean)
    
    sumRep=0
    sumRep2=0
    for (i in seq(1,nrow(data))) {
        sumRep[i] <- meanRep$y[ meanRep$x == data$x[i] ] - data$y[i] 
        sumRep2[i] <- ( meanRep$y[ meanRep$x == data$x[i] ] - data$y[i] )^2
    }

    S_replicate <- sum(sumRep2) # R degrees of freedom 
    result["Sum Replicate"] <- round(S_replicate,3)
 
    S_residual <- sum ( resid(fit.lm)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
    result["Sum Residual"] <- round(S_residual,3)

    S_total <- sum( data$y^2 ) # N degrees of freedom
    result["Sum Total"] <- round(S_total,3)

    S_predicted <- sum( fitted(fit.lm)^2 ) # P degrees of freedom
    result["Sum Predicted"] <- round(S_predicted,3)

    S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
    result["Sum Lack Of Fit"] <- round(S_lackOfFit,3)

    result["Mean Residuals"] <- round(S_residual / (N-P),3)
    
    result["Mean Total"] <- round(S_total / N,3)
    
    result["Mean Predicted"] <- round(S_predicted / P,3)
    
    result["Mean Replicate"] <- round(S_replicate / R,3)
    
    result["Lack Of Fit"] <- round(S_lackOfFit / D,3)
    
    result["F-value"] <- round(result["Lack Of Fit"] / result["Mean Replicate"],3)
    
    result["p-value"] <- df(as.numeric(result["F-value"]),D,R)
    
    result["r2"] <- round(summary(fit.lm)$r.squared,3)
    
    result["a(slope)"] <- round(fit.lm$coefficients[2],3)
    
    result["b(intercept)"] <- round(fit.lm$coefficients[1],3)
    
    return( t(result) )
}

We can use this function on our dataset A


In [65]:
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(data$conc), "A"=data$A) )
r1 <- lof(data$conc, data$A, fit.lm)

plot(data$conc,data$A, main="linear regression",ylab="intensity",xlab="concentration")
#lines(data$conc, fitted(fit.lm), col=2)
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


We repeated the calculation for dataset A, but this time we used lm() to fit the data, which means we are no longer using the model from the book but the one optimized here. In addition we computed the p-value. A p-value higher than 0.05 means that we cannot conclude that our model describes the data inaccurately.

We repeat the same calculation, but this time we concatenate the A and B datasets to see the effect of a larger dataset.


In [66]:
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(c(data$conc,data$conc)), "A"=c(data$A, data$B)) )
r2 <- lof(c(data$conc,data$conc), c(data$A, data$B), fit.lm)

plot(as.factor(c(data$conc,data$conc)), c(data$A, data$B), main="linear regression",ylab="intensity",xlab="concentration")
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


The next table shows the comparison. The first thing to note is that the number of degrees of freedom is unchanged, because we added more points but also more replicates; maybe there are too many replicates in this case. The next thing to note is that the lack of fit is smaller, which is good, but the MSR is larger. The p-values of both fits are larger than 0.05, so we cannot conclude that the model is inaccurate, which is fine.


In [67]:
tableOfResult2 <- cbind(r1,r2)
tableOfResult2


Number.Replicate        4.0000000   14.0000000
Number of Data         10.0000000   20.0000000
Number of Parameters    2.0000000    2.0000000
Degrees of Freedom      4.0000000    4.0000000
Sum Replicate           5.6660000   27.5520000
Sum Residual            8.3710000   28.2370000
Sum Total            1024.5800000 2370.3730000
Sum Predicted        1016.2090000 2342.1370000
Sum Lack Of Fit         2.7060000    0.6850000
Mean Residuals          1.0460000    1.5690000
Mean Total            102.4580000  118.5190000
Mean Predicted        508.1050000 1171.0680000
Mean Replicate          1.4160000    1.9680000
Lack Of Fit             0.6760000    0.1710000
F-value                 0.4770000    0.0870000
p-value                 0.6013782    0.3188609
r2                      0.9560000    0.9290000
a(slope)                2.4360000    2.4600000
b(intercept)            0.6110000    1.3220000

Replicates

If we are curious about the effect of the replicates, we can remove them and apply the test again. Clearly we then have no estimate of the experimental error to compare with: we can compute the LOF, but there is no MSR with which to compute a p-value.

Finally, it is important to look at the coefficients. After all, we are interested in a calibration curve: how do the different experimental designs affect the values of the coefficients?


In [46]:
F <- duplicated(data$conc)
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(data$conc[F]), "A"=data$A[F]) )
lof(data$conc[F], data$A[F], fit.lm)

plot(data$conc[F], data$A[F], main="linear regression",ylab="intensity",xlab="concentration")
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")


Number.Replicate       0.000
Number of Data         4.000
Number of Parameters   2.000
Degrees of Freedom     2.000
Sum Replicate          0.000
Sum Residual           6.486
Sum Total            425.645
Sum Predicted        419.159
Sum Lack Of Fit        6.486
Mean Residuals         3.243
Mean Total           106.411
Mean Predicted       209.580
Mean Replicate           NaN
Lack Of Fit            3.243
F-value                  NaN
p-value                  NaN
r2                     0.927
a(slope)               2.510
b(intercept)           0.398

p-values, model and noise

As already mentioned, the p-value can be used to decide whether we can conclude that our model is inappropriate. We can go back to the data of the first example to demonstrate how the p-value helps us decide whether our model is accurate, and to illustrate the effect of noise in the data.


In [47]:
N <- 20 # we define the number of observations

# we create a fake dataset using a quadratic
#equation and adding some noise
# first create a vector of x repeated rep times
rep <- 2 # number of replicates
X <- rep(seq(from=43, to=96, length.out=N),rep)
# then create the Y vector according to the equation:
Y <- (0.075 * X + -1.874 + 0.01* X^2)
# create some noise
noise <-runif(length(Y), -1, 1) 
# add some noise to Y
Y <- Y + 1*noise
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
data <- data.frame("x"=x, "y"=y, x2=x^2)

# we fit the data and evaluate the goodness of it
fit.lm <- lm( y ~ x, data=data )
noQuadLowNoise <- lof(data$x, data$y, fit.lm)

In [48]:
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="low noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
noQuadLowNoise[17]


0.992

Although the correlation coefficient is high, we can see a trend in the data that is not correctly described by the model: the model is linear, while we know the data contain a quadratic term. Looking only at the fitted line and the correlation coefficient, we might conclude that the model is not bad.

However, the LOF is larger than the MSR, which indicates a problem with our model. If we compute the p-value:


In [49]:
noQuadLowNoise[16] # p-value


2.83496955374502e-14

In [50]:
# Brereton p.42
qf(0.95,noQuadLowNoise[4],noQuadLowNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
noQuadLowNoise[15]
pf(noQuadLowNoise[15],noQuadLowNoise[4],noQuadLowNoise[1],lower.tail = TRUE)


2.15112442712183
59.163
0.999999999999827
We find a very small number, which means we can conclude that our model is not accurate. Let's repeat this, but this time include a quadratic term in the model.

In [51]:
fit.lm <- lm( y ~ x + x2, data=data )
QuadLowNoise <- lof(data$x, data$y, fit.lm)
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="low noise data / quadratic model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
QuadLowNoise[16]


0.00758060780767054

In [52]:
# Brereton p.42
qf(0.95,QuadLowNoise[4],QuadLowNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
QuadLowNoise[15]
pf(QuadLowNoise[15],QuadLowNoise[4],QuadLowNoise[1],lower.tail = TRUE)


2.16670099681198
3.477
0.995485550032857

Clearly the model is more accurate and we obtain a p-value that is larger than 0.05.

Now let's see what happens if the noise is large.


In [53]:
N <- 20 
rep <- 2 
X <- rep(seq(from=43, to=96, length.out=N),rep)
Y <- (0.075 * X + -1.874 + 0.01* X^2)
noise <-runif(length(Y), -1, 1) 
Y <- Y + 4*noise # we increase the noise
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
data <- data.frame("x"=x, "y"=y, x2=x^2)

fit.lm <- lm( y ~ x + x2, data=data )
QuadHighNoise <- lof(data$x, data$y, fit.lm)

plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / quadratic model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)

QuadHighNoise[16]


0.553610438205782

In [54]:
# Brereton p.42
qf(0.95,QuadHighNoise[4],QuadHighNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
QuadHighNoise[15]
pf(QuadHighNoise[15],QuadHighNoise[4],QuadHighNoise[1],lower.tail = TRUE)


2.16670099681198
1.3
0.715217234887876

We now repeat the same operation, but without the quadratic term.


In [55]:
fit.lm <- lm( y ~ x, data=data )
noQuadHighNoise <- lof(data$x, data$y, fit.lm)

plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)

noQuadHighNoise[16]


0.0114897700714444

In [56]:
# Brereton p.42
qf(0.95,noQuadHighNoise[4],noQuadHighNoise[1])  # critical F value at 95% confidence
# above this value the lack of fit is significant, i.e. the model is inadequate
noQuadHighNoise[15]
pf(noQuadHighNoise[15],noQuadHighNoise[4],noQuadHighNoise[1],lower.tail = TRUE)


2.15112442712183
3.227
0.99347154530691

It can be observed that both p-values are larger than 0.05; thus we cannot rule out either model and must statistically accept both as acceptable for our data. This is the effect of noise: it increases the MSR, resulting in larger p-values.

You may have to re-run the last cells several times, because the data are generated randomly and the p-values may vary considerably.
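
If reproducible numbers are preferred, the random stream can be fixed before the noise is generated; a minimal sketch (the seed value is arbitrary):

set.seed(42)                      # fix the random number generator
noise <- runif(length(Y), -1, 1)  # this draw is now identical at every run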

It may be interesting to plot the residuals. Here we clearly distinguish the quadratic behaviour that is not taken into account by our linear model. Plotting the residuals is always a good idea to help validate a model.


In [57]:
plot( resid(fit.lm) )
abline(0,0)


Finally, we may want to add confidence intervals (here 95% and 80%) to our calibration curve. This can be achieved with a single command.


In [58]:
interval <- confint(fit.lm)               # 95% confidence interval (the default level)
interval80 <- confint(fit.lm, level=0.8)  # 80% confidence interval

In [59]:
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col="red")
abline(interval[1,1], interval[2,1], col="gray")
abline(interval[1,2], interval[2,2], col="gray")
abline(interval80[1,1], interval80[2,1], col="gray", lty=2)
abline(interval80[1,2], interval80[2,2], col="gray", lty=2)


Here is the calibration curve with confidence intervals at 80% (gray dashed lines) and 95% (gray solid lines). More practical information about calibration curves can be found here: https://raw.githubusercontent.com/jwist/chemometrics/master/pdf/Calibration-curve-guide.pdf
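
As an alternative sketch (not from the original notebook), pointwise confidence bands around the fitted line can also be obtained with predict() on the linear model fitted above:

new  <- data.frame(x = seq(min(data$x), max(data$x), length.out = 100))
ci95 <- predict(fit.lm, newdata = new, interval = "confidence", level = 0.95)
plot(data$x, data$y, xlab = "concentration", ylab = "intensity")
lines(new$x, ci95[, "fit"], col = "red")   # fitted calibration line
lines(new$x, ci95[, "lwr"], col = "gray")  # lower 95% confidence bound
lines(new$x, ci95[, "upr"], col = "gray")  # upper 95% confidence bound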

According to the decision tree, in this simple case we have one IV (X, the cause) and one DV (Y, the effect). Since both are continuous variables we are in the domain of regression, and because there is a single continuous IV and a single continuous DV we speak of simple linear regression. In the next example we will study the effect of pH, temperature and concentration on the yield of a reaction; we will then have several IVs but still a single DV, and we speak of multiple regression or multivariable regression.
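
A hypothetical sketch of the formula syntax for such a multiple regression (variable names and the simulated dataset are illustrative only, not the dataset of the next example):

set.seed(1)
reaction <- data.frame(pH   = runif(20, 2, 8),
                       temp = runif(20, 20, 80),
                       conc = runif(20, 0.1, 1.0))
reaction$yield <- 5 + 2*reaction$pH + 0.3*reaction$temp + 10*reaction$conc + rnorm(20)
fit.mlr <- lm(yield ~ pH + temp + conc, data = reaction)  # several IVs, one DV
summary(fit.mlr)  # one coefficient per IV, plus the intercept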

for more about this read: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518362/

