Julien Wist / 2017 / Universidad del Valle
An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/
In [1]:
options(repr.plot.width=4, repr.plot.height=4) # change these settings to plot larger figures
In [2]:
#rm(list=ls(all=TRUE)) # clear the variable space
N <- 20 # number of observations
# we create a fake dataset using a quadratic equation and some added noise;
# first create a vector of x values, repeated rep times
rep <- 2 # number of replicates
X <- rep(seq(from=43, to=96, length.out=N), rep)
# then create the Y vector according to the equation:
Y <- (0.075 * X - 1.874 + 0.01 * X^2)
# create some noise and add it to Y
noise <- runif(length(Y), -1, 1)
Y <- Y + 1*noise
We take a look at the data we just created.
In [3]:
plot(X,Y)
In order to prepare a nice dataset for R we need to sort the data.
In [4]:
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
Now we can create a data.frame with columns x, y and $x^2$.
A data.frame is comparable to an Excel spreadsheet: it has column headers and columns of equal length, but the columns may hold different kinds of data, such as strings (character), numbers (numeric), or factors.
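As a small illustration (the values below are made up for the example), a data.frame can hold columns of different types side by side:
samples <- data.frame(name=c("blank","std1","std2"),   # strings (character)
                      conc=c(0, 1, 2),                 # numbers (numeric)
                      batch=factor(c("A","A","B")),    # factor
                      stringsAsFactors=FALSE)
str(samples) # str() shows the type of each column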
In [5]:
data <- data.frame(x=x, y=y, x2=x^2)
To access the data within the data.frame, use:
In [6]:
x_sorted$ix
In [7]:
data$y
In [8]:
data$x[data$y > 20.10071] # select the x values for which y exceeds a threshold
lm()
Now that we have defined a data.frame, we can use it as input to most of R's built-in functions. Here we would like to fit a simple linear regression, so we use the lm()
function and store the result in the fit.lm
object. The linear model function accepts a data.frame as input (here data) and lets you define a model, in this case $y = ax + b$ (the formula y ~ x includes an intercept by default). It is therefore important to define your data.frame properly so that lm() can find the $x$ and $y$ columns in data.
In [10]:
fit.lm = lm(y ~ x, data=data)
In [11]:
names(fit.lm)
At this stage we can have a look at the coefficients found, in this case those of a linear model.
In [12]:
# have a look at the coefficients
fit.lm$coefficients
You can find a lot of information about your fit by using the summary()
function of R, which reports useful information about almost any object.
In [13]:
summary(fit.lm)
If you need to access one of the values shown here for further use in your script, you can list the names of the objects returned by summary():
In [14]:
names(summary(fit.lm))
So if you want to use the $r^2$ coefficient you can use the following command.
In [15]:
round(summary(fit.lm)$r.squared,3)
Maybe you want a nice picture of your data and the regression line.
In [16]:
plot(data$x,data$y); lines(data$x, fitted(fit.lm), col=2)
If you want more control over the plot, you can draw the regression yourself by plotting a line whose intercept is the "(Intercept)" coefficient and whose slope is the "x" coefficient.
In [17]:
plot(data$x,data$y,main="linear regression",ylab="intensity",xlab="concentration",ylim=c(-50,100),xlim=c(0,100));
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")
You may have noticed that the data seem to have a quadratic component (which we included intentionally at the beginning). You might want to see how well a quadratic function fits your data.
In [18]:
fit.lm = lm(y ~ x + x2, data=data)
plot(data$x,data$y, main="linear regression",ylab="intensity",xlab="concentration")
lines(data$x, fitted(fit.lm), col=2)
paste("the coefficient $r^2$ = ",round(summary(fit.lm)$r.squared,3))
text(60,80,paste("r2 = ",round(summary(fit.lm)$r.squared,3)))
Use the data of Table 2.2 in Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242
Or import the table from the dataset section provided in this course.
In [19]:
data <- read.csv("./datasets/table2.2.csv")
In [20]:
data
As you may notice, the data are not arranged the way we want. We would like one column per variable, and we would also like the row indices to be correct. We can solve this readily.
In [21]:
data <- t(data) # we transpose the data
In [22]:
rownames(data) <- seq(1,10)
colnames(data) <- c("conc","A","B")
In [23]:
data
Although the table looks good, there may be a problem with it: R sees each entry in the table as a character string and not as a number. This means that we cannot compute with these data as they stand, so we should change this. The fix is very simple, but it is sometimes annoying because it is not something we are used to thinking about. In order to "see" what R "sees", use the following commands.
In [24]:
is(data[1])
data[,1]
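If we want to compute with these values, we can convert them explicitly; a minimal sketch (assuming the first column of the transposed table holds the concentrations):
conc_num <- as.numeric(data[,1]) # character strings converted to numbers
is(conc_num)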
Putting this all together in a cleaner form gives:
In [62]:
remove(data)
data <- t(read.csv2("./datasets/table2.2.csv",header=TRUE,sep=",",dec="."))
rownames(data) <- seq(1,10)
data <- data.frame("conc"=as.factor(data[,1]), "A"=data[,2],"B"=data[,3])
data
In order to estimate the quality of our regression we perform an analysis of variance. Here we will perform this analysis step by step as described in Chapter 2 of Brereton, R. G. (2003). Chemometrics. Technometrics. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470863242
One of the most important features of this ANOVA is the comparison of the lack of fit with the replicate error. The replicate error gives an estimate of the experimental error in the data, while the lack of fit gives an estimate of how well our model describes the data.
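Following the code comments below and Chapter 2 of Brereton, the sums of squares are related (for a least-squares fit) by
$$S_{total} = S_{predicted} + S_{residual} \qquad \text{and} \qquad S_{residual} = S_{lack\,of\,fit} + S_{replicate},$$
with $N$, $P$, $N-P$, $N-P-R$ and $R$ degrees of freedom, respectively. The lack of fit is then compared to the replicate error through the ratio
$$F = \frac{S_{lack\,of\,fit}/(N-P-R)}{S_{replicate}/R}.$$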
We first look for duplicated entries (replicates) in our table.
In [26]:
duplicated(data[,1])
In [27]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2 # if our model is y = ax + b
# find out the degree of freedom for our experiment
D <- N - P - R
D
In [28]:
R
N
In [29]:
library(plyr) # loaded for convenience; the aggregation below actually uses base R's aggregate()
meanRep <- aggregate(A ~ conc, data = data, mean) # this calculates the mean for all replicates
sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i]
sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result <- data.frame("Sum Replicate"=S_replicate)
result
We can verify our result by computing the plain (unsquared) sum of the replicate differences, which should be 0.
In [30]:
round(sum(sumRep),3)
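As a side note, the replicate sum of squares computed with the loop above can also be obtained in one line with base R's ave(), which returns the group mean of A for every row (a sketch equivalent to the loop, not a new result):
S_replicate_alt <- sum( (data$A - ave(data$A, data$conc))^2 ) # same value as S_replicate
round(S_replicate_alt, 3)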
Using the models proposed in Chapter 2 of Brereton, we can compute the sum of squared residuals, i.e., the sum of the squared differences between the experimental data and the predicted values.
In [31]:
predicted <- 0.6113 + 2.4364 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 )
result["Sum Residual"] <- S_residual # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"]
In [32]:
S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total; result["Sum Total"]
S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted; result["Sum Predicted"]
In [33]:
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit; result["Sum Lack Of Fit"]
result["F"] <- (S_lackOfFit / D) / (S_replicate / R); result["F"]
In [34]:
plot(data$conc, data$A)
abline(0.6113, 2.4364, col="red")
In [64]:
plot(as.numeric(data$conc), as.numeric(data$A))
In [35]:
#var.test((data$A - predicted),sumRep2[c(1,4,6,9)])
#sumRep2[c(1,4,6,9)]
#to be tested. Should be possible to use this command in this context.
In [36]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2
D <- N - P - R
meanRep <- aggregate(A ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P
sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i]
sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result["Sum Replicate"] <- S_replicate
predicted <- 0.6113 + 2.4364 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual
S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total
S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit
result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
A2 <- t(result)
In [37]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 1
D <- N - P - R
meanRep <- aggregate(A ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P
sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
sumRep[i] <- meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i]
sumRep2[i] <- ( meanRep$A[ meanRep$conc == data$conc[i] ] - data$A[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result["Sum Replicate"] <- S_replicate
predicted <- 2.576 * as.numeric(data$conc)
S_residual <- sum( (data$A - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual
S_total <- sum( data$A^2 ) # N degrees of freedom
result["Sum Total"] <- S_total
S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit
result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
A1 <- t(result)
In [38]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 2
D <- N - P - R
meanRep <- aggregate(B ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P
sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
sumRep[i] <- meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i]
sumRep2[i] <- ( meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result["Sum Replicate"] <- S_replicate
predicted <- 2.032 + 2.484 * as.numeric(data$conc)
S_residual <- sum( (data$B - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual
S_total <- sum( data$B^2 ) # N degrees of freedom
result["Sum Total"] <- S_total
S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit
result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
B2 <- t(result)
In [39]:
Rep <- duplicated(data$conc) # find replicates
R <- sum( Rep ) # enumerate replicates
N <- dim(data)[1] # find out dimension of data
P <- 1
D <- N - P - R
meanRep <- aggregate(B ~ conc, data = data, mean)
result <- data.frame("Number Replicate"=R)
result["Number of Data"] <- N
result["Number of Parameters"] <- P
sumRep=0
sumRep2=0
for (i in seq(1,nrow(data))) {
sumRep[i] <- meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i]
sumRep2[i] <- ( meanRep$B[ meanRep$conc == data$conc[i] ] - data$B[i] )^2
}
S_replicate <- sum(sumRep2) # R degrees of freedom
result["Sum Replicate"] <- S_replicate
predicted <- 2.948 * as.numeric(data$conc)
S_residual <- sum( (data$B - predicted)^2 ) # N-P degrees of freedom (# or S_total - S_predicted)
result["Sum Residual"] <- S_residual
S_total <- sum( data$B^2 ) # N degrees of freedom
result["Sum Total"] <- S_total
S_predicted <- sum( predicted^2 ) # P degrees of freedom
result["Sum Predicted"] <- S_predicted
S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
result["Sum Lack Of Fit"] <- S_lackOfFit
result["Mean Residuals"] <- S_residual / (N-P)
result["Mean Total"] <- S_total / N
result["Mean Predicted"] <- S_predicted / P
result["Mean Replicate"] <- S_replicate / R
result["Lack Of Fit"] <- S_lackOfFit / D
B1 <- t(result)
In [40]:
tableOfResult1 <- data.frame(A1,A2,B1,B2)
tableOfResult1
Both datasets are similar. However, for the second dataset (B), excluding the intercept increases the lack of fit. It is important to note that the lack of fit can only be judged by comparing it to the mean sum of replicates (MSR). In all cases the LOF is smaller than the MSR, except for B1. The conclusion is that we need two parameters to best fit our data.
In [41]:
# Brereton p.42
qf(0.95,5,4) # this would be the F-ratio for 95% confidence.
# above this number we can conclude with 95% confidence that intercept is useful.
2.1395742/1.193648
0.6900860/1.416398
pf(1.792,5,4,lower.tail = TRUE)
pf(0.48721,5,4,lower.tail = TRUE)
So we can conclude with 70.4% confidence that the intercept is useful in the case of the B dataset. It is not possible to draw a conclusion for dataset A, since its F-ratio is very low.
The way we performed the calculation above is not very efficient: we copied and pasted the same code four times, so any modification would have to be propagated to all four pieces. In such cases, a better practice is to create a function. If we are interested in more flexibility, we can program a function that performs the whole calculation. This is done as follows:
In [42]:
lof <- function(x, y, fit.lm) {
    data <- data.frame("x"=as.numeric(x), "y"=y)
    #fit.lm <- lm( y ~ x, data=data )
    Rep <- duplicated(data$x) # find replicates
    R <- sum(Rep) # count replicates
    N <- dim(data)[1] # number of observations
    P <- length(fit.lm$coefficients) # number of parameters of the model
    D <- N - P - R # degrees of freedom for the lack of fit
    result <- data.frame("Number Replicate"=R)
    result["Number of Data"] <- N
    result["Number of Parameters"] <- P
    result["Degrees of Freedom"] <- D
    meanRep <- aggregate(y ~ x, data = data, mean) # mean of the replicates
    sumRep <- 0
    sumRep2 <- 0
    for (i in seq(1,nrow(data))) {
        sumRep[i] <- meanRep$y[ meanRep$x == data$x[i] ] - data$y[i]
        sumRep2[i] <- ( meanRep$y[ meanRep$x == data$x[i] ] - data$y[i] )^2
    }
    S_replicate <- sum(sumRep2) # R degrees of freedom
    result["Sum Replicate"] <- round(S_replicate,3)
    S_residual <- sum( resid(fit.lm)^2 ) # N-P degrees of freedom (or S_total - S_predicted)
    result["Sum Residual"] <- round(S_residual,3)
    S_total <- sum( data$y^2 ) # N degrees of freedom
    result["Sum Total"] <- round(S_total,3)
    S_predicted <- sum( fitted(fit.lm)^2 ) # P degrees of freedom
    result["Sum Predicted"] <- round(S_predicted,3)
    S_lackOfFit <- S_residual - S_replicate # N-P-R degrees of freedom
    result["Sum Lack Of Fit"] <- round(S_lackOfFit,3)
    result["Mean Residuals"] <- round(S_residual / (N-P),3)
    result["Mean Total"] <- round(S_total / N,3)
    result["Mean Predicted"] <- round(S_predicted / P,3)
    result["Mean Replicate"] <- round(S_replicate / R,3)
    result["Lack Of Fit"] <- round(S_lackOfFit / D,3)
    result["F-value"] <- round(result["Lack Of Fit"] / result["Mean Replicate"],3)
    # note: df() returns the F density at the F-value; a conventional tail
    # probability would instead be obtained with pf()
    result["p-value"] <- df(as.numeric(result["F-value"]),D,R)
    result["r2"] <- round(summary(fit.lm)$r.squared,3)
    result["a(slope)"] <- round(fit.lm$coefficients[2],3)
    result["b(intercept)"] <- round(fit.lm$coefficients[1],3)
    return( t(result) )
}
We can use this function on our dataset A
In [65]:
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(data$conc), "A"=data$A) )
r1 <- lof(data$conc, data$A, fit.lm)
plot(data$conc,data$A, main="linear regression",ylab="intensity",xlab="concentration")
#lines(data$conc, fitted(fit.lm), col=2)
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")
We repeated the calculation for the A dataset, but this time we used lm()
to fit the data. This means that we are no longer using the coefficients from the book, but the ones we optimize here. In addition, we computed a p-value. A p-value higher than 0.05 means that we cannot conclude that our model fits the data inaccurately.
We repeat the same calculation but this time we concatenate A and B datasets to see the effect of a larger dataset.
In [66]:
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(c(data$conc,data$conc)), "A"=c(data$A, data$B)) )
r2 <- lof(c(data$conc,data$conc), c(data$A, data$B), fit.lm)
plot(as.factor(c(data$conc,data$conc)), c(data$A, data$B), main="linear regression",ylab="intensity",xlab="concentration")
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")
The next table shows the comparison. The first thing to note is that the number of degrees of freedom does not change, because we added more points but also more replicates; maybe there are too many replicates in this case. The next thing to note is that the lack of fit is smaller, which is good, but the MSR is larger. The p-values for both are larger than 0.05, so we have no evidence that our model is inaccurate, which is fine.
In [67]:
tableOfResult2 <- cbind(r1,r2)
tableOfResult2
If we are curious about the effect of the replicates, we can remove them and apply the test again. Clearly we will then have no estimate of the experimental error to compare with: we can compute the LOF, but we have no MSR with which to compute a p-value.
Finally, it is important to look at the coefficients. After all, we are interested in a calibration curve: how do the different experimental designs affect the values of the coefficients?
In [46]:
F <- duplicated(data$conc)
fit.lm <- lm( A ~ conc, data=data.frame("conc"=as.numeric(data$conc[F]), "A"=data$A[F]) )
lof(data$conc[F], data$A[F], fit.lm)
plot(data$conc[F], data$A[F], main="linear regression",ylab="intensity",xlab="concentration")
abline(fit.lm$coefficients[1],fit.lm$coefficients[2],col="blue")
As already mentioned, the p-value can be used to decide whether we can conclude that our model is not appropriate. We can go back to the data of the first example to demonstrate how the p-value helps us decide whether our model is accurate, and to illustrate the effect of noise in the data.
In [47]:
N <- 20 # we define the number of observations
# we create a fake dataset using a quadratic
#equation and adding some noise
# first create a vector of x repeated rep times
rep <- 2 # number of replicates
X <- rep(seq(from=43, to=96, length.out=N),rep)
# then create the Y vector according to the equation:
Y <- (0.075 * X + -1.874 + 0.01* X^2)
# create some noise
noise <-runif(length(Y), -1, 1)
# add some noise to Y
Y <- Y + 1*noise
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
data <- data.frame("x"=x, "y"=y, x2=x^2)
# we fit the data and evaluate the goodness of it
fit.lm <- lm( y ~ x, data=data )
noQuadLowNoise <- lof(data$x, data$y, fit.lm)
In [48]:
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="low noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
noQuadLowNoise[17]
Although the correlation coefficient is high, we can see a trend in the data that is not correctly described by the model: the model is linear, while we know that the data contain a quadratic term. If we only looked at the lack of fit and at the correlation coefficient, we might conclude that the model is not bad.
However, the LOF is bigger than the MSR, which indicates a problem with our model. If we compute the p-value:
In [49]:
noQuadLowNoise[16] # p-value
In [50]:
# Brereton p.42
qf(0.95,noQuadLowNoise[4],noQuadLowNoise[1]) # this would be the F-ratio for 95% confidence.
# above this number the lack of fit is significant at the 95% level, i.e. the model would be judged inadequate.
noQuadLowNoise[15]
pf(noQuadLowNoise[15],noQuadLowNoise[4],noQuadLowNoise[1],lower.tail = TRUE)
In [51]:
fit.lm <- lm( y ~ x + x2, data=data )
QuadLowNoise <- lof(data$x, data$y, fit.lm)
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="low noise data / quadratic model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
QuadLowNoise[16]
In [52]:
# Brereton p.42
qf(0.95,QuadLowNoise[4],QuadLowNoise[1]) # this would be the F-ratio for 95% confidence.
# above this number the lack of fit is significant at the 95% level, i.e. the model would be judged inadequate.
QuadLowNoise[15]
pf(QuadLowNoise[15],QuadLowNoise[4],QuadLowNoise[1],lower.tail = TRUE)
Clearly the model is more accurate and we obtain a p-value that is larger than 0.05.
Now let's see what happens if the noise is large.
In [53]:
N <- 20
rep <- 2
X <- rep(seq(from=43, to=96, length.out=N),rep)
Y <- (0.075 * X + -1.874 + 0.01* X^2)
noise <-runif(length(Y), -1, 1)
Y <- Y + 4*noise # we increment the noise
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
data <- data.frame("x"=x, "y"=y, x2=x^2)
fit.lm <- lm( y ~ x + x2, data=data )
QuadHighNoise <- lof(data$x, data$y, fit.lm)
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / quadratic model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
QuadHighNoise[16]
In [54]:
# Brereton p.42
qf(0.95,QuadHighNoise[4],QuadHighNoise[1]) # this would be the F-ratio for 95% confidence.
# above this number the lack of fit is significant at the 95% level, i.e. the model would be judged inadequate.
QuadHighNoise[15]
pf(QuadHighNoise[15],QuadHighNoise[4],QuadHighNoise[1],lower.tail = TRUE)
We repeat now the same operation but without the quadratic term.
In [55]:
fit.lm <- lm( y ~ x, data=data )
noQuadHighNoise <- lof(data$x, data$y, fit.lm)
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col=2)
noQuadHighNoise[16]
In [56]:
# Brereton p.42
qf(0.95,noQuadHighNoise[4],noQuadHighNoise[1]) # this would be the F-ratio for 95% confidence.
# above this number the lack of fit is significant at the 95% level, i.e. the model would be judged inadequate.
noQuadHighNoise[15]
pf(noQuadHighNoise[15],noQuadHighNoise[4],noQuadHighNoise[1],lower.tail = TRUE)
It can be observed that both p-values are larger than 0.05, and thus we cannot rule out either model: statistically, both must be accepted as acceptable for our data. This is the effect of noise: it increases the MSR, resulting in larger p-values.
You may have to re-run the last cells several times, because the data are generated randomly and the p-values may vary considerably.
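If you want the randomly generated data (and hence the p-values) to be reproducible between runs, you can fix the random seed before generating the noise; for example (the seed value 42 is an arbitrary choice):
set.seed(42) # any fixed integer makes the following runif() call reproducible
noise <- runif(length(Y), -1, 1)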
It may be interesting to plot the residuals. Here we clearly distinguish the quadratic behaviour that is not taken into account by our linear model. Plotting the residuals is always a good idea to help validate a model.
In [57]:
plot( resid(fit.lm) )
abline(0,0)
Finally, we may want to add the 95% confidence interval for our calibration curve. This can be achieved with a single command.
In [58]:
interval <- confint(fit.lm)
interval80 <- confint(fit.lm, level=0.8) # 80% confidence interval on the coefficients
In [59]:
plot(data$x,data$y, cex=0.7, cex.axis=0.8, cex.main=0.8, cex.lab=0.8, main="high noise data / linear model", xlab="concentration", ylab="intensity")
lines(data$x, fitted(fit.lm), col="red")
abline(interval[1,1], interval[2,1], col="gray")
abline(interval[1,2], interval[2,2], col="gray")
abline(interval80[1,1], interval80[2,1], col="gray", lty=2)
abline(interval80[1,2], interval80[2,2], col="gray", lty=2)
Here is a calibration curve with confidence intervals at 80% (gray dashed lines) and 95% (gray solid lines). More practical information about calibration curves can be found here: (https://raw.githubusercontent.com/jwist/chemometrics/master/pdf/Calibration-curve-guide.pdf)
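Note that confint() gives intervals on the fitted coefficients, so the gray lines above are drawn from the extreme coefficient values. If instead a pointwise confidence band around the fitted line is wanted, one common approach (a sketch, not what was done above) uses predict() with interval="confidence":
band <- predict(fit.lm, interval="confidence", level=0.95) # columns: fit, lwr, upr
plot(data$x, data$y, xlab="concentration", ylab="intensity")
lines(data$x, band[,"fit"], col="red")
lines(data$x, band[,"lwr"], col="gray", lty=2)
lines(data$x, band[,"upr"], col="gray", lty=2)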
According to the decision tree, in this simple case we have one IV (X, the cause) and one DV (Y, the effect). Since both are continuous variables we are in the domain of regression. Because we only have a single DV and it is continuous, we speak of simple linear regression. In the next example we will study the effect of pH, temperature and concentration on the yield of a reaction; we will then have more than one IV but still a single DV, and we speak of multiple (or multivariable) regression.
For more about this, read: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518362/
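As a preview, the formula interface of lm() extends directly to several IVs; the sketch below assumes a hypothetical data.frame called reaction with columns pH, temperature, conc and yield:
# 'reaction' is hypothetical; it is not part of this notebook's datasets
fit.mlm <- lm(yield ~ pH + temperature + conc, data=reaction)
summary(fit.mlm)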