I have written several papers on the subject, namely on tomato, guava, kiwi, corn and mango, as well as conceptual papers. I have been asked several times how the calculations were done. I now tend to publish all my code and data on my GitHub repository, but in the past this was not so systematic. This notebook is for people who want to know how I work with plant ionomes with R.
One of the reasons why science is beautiful is that knowledge moves forward through better accuracy. When I worked with mango ionomes, I created an in-house code to isolate a high-yield, healthy group, then assessed the health index of an ionome by computing the Mahalanobis distance between the observation to diagnose and the center of the healthy group, using the covariance of that group (Mahalanobis distances need a covariance matrix).
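As a reminder of that older approach, here is a minimal sketch; healthy_group (a matrix of healthy, high-yield ionomes) and obs (the ionome to diagnose) are hypothetical placeholders, not objects used later in this notebook. Note that the mahalanobis function returns a squared distance, hence the square root.
center <- colMeans(healthy_group) # center of the healthy group
covmat <- cov(healthy_group) # covariance of the healthy group
health_index <- sqrt(mahalanobis(obs, center = center, cov = covmat))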
Nowadays, I prefer to use established machine learning methods. In particular, a distance-based method named k-nearest neighbors offers insightful results. I'm not saying it's a catch-all approach. There are probably better methods waiting to be discovered. And this notebook will be improved over time, just like a wiki.
I'm loading tidyverse to load both ggplot2 and dplyr. compositions is a package to transform components (like concentrations) to compositionally friendly variables, like isometric log-ratios. This package has not been updated for a while, so if you are on Windows, R will throw warnings, which as far as I know can be ignored. The mvoutlier package is useful to detect multivariate outliers. If you are an R user and don't know caret, you are missing something huge: it's an all-in-one package for machine learning. The MASS package contains some useful functions, namely ginv for matrix inversion (used to compute the Mahalanobis distance).
In [1]:
library("tidyverse")
library("compositions")
library("mvoutlier")
library("caret")
library("MASS")
options(repr.plot.width=5, repr.plot.height=5)
set.seed(534636)
Some tools are unavailable in packages, so I created functions and put them in my AgFun repository on GitHub (fun is for function, of course - who would think these things are fun?!?).
In [2]:
source("https://raw.githubusercontent.com/essicolo/AgFun/master/DRIS.R") # DRIS
source("https://raw.githubusercontent.com/essicolo/AgFun/master/sign1_ref.R") # Mahalanobis
source("https://raw.githubusercontent.com/essicolo/AgFun/master/codadend2.R") # CoDaDendrogram2
source("https://raw.githubusercontent.com/essicolo/AgFun/master/ilrDefinition.R") # ilrDefinition
source("https://raw.githubusercontent.com/essicolo/AgFun/master/CateNelson.R") # CateNelson
source("https://raw.githubusercontent.com/essicolo/AgFun/master/biclass.R") # biclass
DRIS stands for Diagnosis and Recommendation Integrated System. It's a mathematical formulation created by Beaufils to assess plant ionomes. The sign1_ref function is inspired by the sign1 function of the mvoutlier package: it isolates a healthy group and computes Mahalanobis distances. The codadend2.R file contains the CoDaDendrogram2 function. It's a fork of the CoDaDendrogram function from the compositions package which mostly adds the ability to plot statistics on the vertical branches. The ilrDefinition.R file contains a function that generates [E,D | C,B,A] strings from a sequential binary partition (an SBP, used to give a structure to isometric log-ratios). The CateNelson.R and biclass.R files respectively contain a function to plot Cate-Nelson data and a function to classify these data in binary classes.
In [3]:
atoca <- read_delim('https://raw.githubusercontent.com/essicolo/AgFun/master/examples/plant-ionome-diagnosis-with-R/data/data.csv',
delim = ';')
sbp <- read_delim('https://raw.githubusercontent.com/essicolo/AgFun/master/examples/plant-ionome-diagnosis-with-R/data/sbp.csv',
delim = ';')
In [4]:
atoca %>% sample_n(10)
I use a filling value (Fv), representing all the components which were not measured in the leaves (here mostly oxygen and hydrogen). This is useful not only for the diagnosis; it's also necessary to back-transform isometric log-ratios to concentrations on the original scale.
In [5]:
parts <- atoca[ ,7:18]
if (!'Fv' %in% colnames(parts)) parts$Fv <- 1e6 - rowSums(parts)
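A quick sanity check: with Fv added, each row should now close exactly to 1e6 (assuming, as this closure implies, that concentrations are expressed in mg/kg).
summary(rowSums(parts)) # every total should equal 1e6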
It's always useful to check if the column names in parts and sbp are the same.
In [6]:
t(data.frame(colnames(sbp), colnames(parts)))
In this section, I transform the parts to balances. This is done to free concentrations from their closed space, from their redundancy and from their subcompositional incoherence (see this paper for more details).
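For reference, each balance is an isometric log-ratio contrasting the group of $r$ parts coded + in the SBP against the group of $s$ parts coded -, following the usual definition $ilr = \sqrt{\frac{rs}{r+s}} \ln \frac{g(x_+)}{g(x_-)}$, where $g(\cdot)$ is the geometric mean of the parts in each group.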
In [7]:
psi <- gsi.buildilrBase(t(sbp))
ilrDef <- ilrDefinition(sbp, side = '-+') # assign [E,D | C,B,A] names to balances
comp <- acomp(parts)
bal <- ilr(comp, V = psi)
colnames(bal) <- ilrDef
The CoDaDendrogram2 function can be used to plot trends in yield at each fulcrum.
In [8]:
grouping <- cut(atoca$Rendement, breaks = 10)
grouping <- factor(grouping, levels = rev(levels(grouping)))
CoDaDendrogram2(comp, V = psi,
range = 'auto', equal.height = TRUE, type = 'conf', conf.level = 0,
show.range = FALSE, group = grouping, pch.col = FALSE)
Before going further, I flag multivariate outliers in the balances with the sign2 function from the mvoutlier package.
In [9]:
out_test <- sign2(unclass(bal), qcrit = 0.999)
pairs(bal, col = factor(out_test$wfinal01), pch=16)
I keep the inliers in the bal_in object.
In [10]:
bal_in = bal[out_test$wfinal01 == 1, ]
Just for the sake of data mining, we can run a PCA and check if there is a trend with cranberry yield.
In [11]:
bal_pca = princomp(bal, cor=FALSE)
gg_pca = data.frame(bal_pca$scores[, 1:2], yield = atoca$Rendement)
ggplot(gg_pca, aes(x = Comp.1, y = Comp.2)) +
geom_point(aes(size = yield), alpha=0.2)
Some regions of the PCA space are definitely yielding low, showing potential trends in the data.
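By the way, the summary method of a princomp object reports the proportion of variance carried by each component, useful to judge how much of the signal this score plot actually shows.
summary(bal_pca) # standard deviation and proportion of variance per component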
In [12]:
bal_sc <- scale(bal)
colnames(bal_sc) <- ilrDef
In [13]:
bal_df <- data.frame(bal_sc, yield = atoca$Rendement)
mod1 <- lm(yield ~ ., bal_df) # log10 of yield could also be used
summary(mod1)
In R, there is a very useful package named MCMCpack, which can easily handle Bayesian statistics. I'm not using it here to keep focus. Just saying.
We verify the distribution of the residuals. The null hypothesis of the Shapiro-Wilk test is that residuals are normally distributed.
In [14]:
shapiro.test(residuals(mod1))
The null hypothesis is not rejected at a 0.05 level. Great, but let's see how the linear model behaves.
In [15]:
plot(atoca$Rendement, predict(mod1),
xlab = "Measured yield",
ylab = "Predicted yield",
col='gray40')
abline(a=0,b=1, col="red", lty = 2, lwd = 2)
I have to say that this is unusually good for ecological data. But a linear statistical model describes the effects: it doesn't predict anything.
This is where machine learning comes into play.
In ML, we either regress to predict continuous values or classify to predict categories. Of course, yield is continuous. But for diagnosis and decision-making, continuous values are difficult to interpret.
Anyways, let's try a regression model. In ML, we need to split data into a training set, on which the model will be fitted, and a testing set used to test how the model behaves with data which were not used for fitting.
In [16]:
inTrain <- createDataPartition(y = bal_df$yield, p = .70, list = FALSE) # indexes of the training set
training <- bal_df[inTrain, ] # training set
testing <- bal_df[-inTrain, ] # testing set
Many algorithms can be tested. In my experience, KNNs usually work great with ilrs. Beware that fitting ML models can be a long process...
In [17]:
grid <- expand.grid(kmax = seq(2, 10, by = 2),
distance = seq(0.2, 2, by = 0.2),
kernel = 'optimal') # grid of parameters to try
ml_kknn_reg = train(yield ~ ., data = training, method = "kknn", tuneGrid = grid, tuneLength = 10) # long process
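To inspect the tuning itself, the generic plot method on a caret train object draws the error metric across the grid of parameters. Nothing specific to this data set, just a quick check.
plot(ml_kknn_reg) # RMSE across kmax and distance values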
In [18]:
options(repr.plot.width=5, repr.plot.height=4)
plot(bal_df[inTrain, 'yield'], predict(ml_kknn_reg),
xlab = "Measured yield",
ylab = "Predicted yield",
col='lightblue')
points(bal_df[-inTrain, 'yield'], predict(ml_kknn_reg, testing),
col='red', pch=4)
abline(a=0,b=1, col="red", lty = 2, lwd = 2)
The caret package allows plotting an index of variable importance. Each algorithm has its own meaning of variable "importance", so we should be careful about what "importance" means, case by case.
In [19]:
vI = varImp(ml_kknn_reg)
rownames(vI$importance) = ilrDef
plot(vI)
As said above, continuous predictions are hard to interpret for decision-making, so let's switch to classification. This requires a yield delimiter separating low (LY) from high (HY) yielders; the distribution of yields helps choosing it.
In [20]:
print(quantile(bal_df$yield))
boxplot(bal_df$yield)
A delimiter of 50 t/ha seems reasonable.
In [21]:
yield_delimiter = 50
bal_df$yieldclass = cut(bal_df$yield, c(0, yield_delimiter, 75))
levels(bal_df$yieldclass) = c('LY', 'HY')
training = bal_df[inTrain, -13] # training set
testing = bal_df[-inTrain, -13] # testing set
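Before fitting a classifier, it's worth checking that the two classes are not wildly imbalanced.
table(bal_df$yieldclass)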
In [22]:
grid = expand.grid(kmax = seq(10, 30, by = 5),
distance = 1.2,#seq(0.6, 3, by = 0.2),
kernel = 'optimal') # grid of parameters to try
ml_kknn_clf = train(yieldclass ~ .,
data = training,
method = "kknn",
tuneGrid = grid,
tuneLength = 20) # long process
The contingency tables in the training and testing sets.
In [23]:
ct_train = table(Measured = bal_df[inTrain, 'yieldclass'],
Predicted = predict(ml_kknn_clf))
ct_train
In [24]:
ct_test = table(Measured = bal_df[-inTrain, 'yieldclass'],
Predicted = predict(ml_kknn_clf,testing))
ct_test
The accuracy in the testing set.
In [25]:
(ct_test[1, 1] + ct_test[2, 2]) / sum(ct_test)
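The confusionMatrix function of the caret package reports the same accuracy along with kappa, sensitivity and specificity in a single call, if you prefer a richer summary.
confusionMatrix(data = predict(ml_kknn_clf, testing), reference = testing$yieldclass)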
The optimal parameters of the model.
In [26]:
ml_kknn_clf
With KNNs, we can generate an imbalance index by computing the distance between a point and the mean of its K nearest balanced points in the ilr space. First, we must isolate the reference population, the observations with high yields that are correctly predicted as such, i.e. the true negatives (truly not sick).
In [27]:
vI = varImp(ml_kknn_clf)
rownames(vI$importance) = ilrDef
plot(vI)
The Euclidean distance as an imbalance index is not always good. Let's see how it behaves in this case. First, I create a vector containing the predictions of the model. This vector is a series of LY and HY values. The TN object is a data frame containing the ionomes of the true negatives, i.e. the observations whose actual and predicted values are both high yields.
In [28]:
predicted = predict(ml_kknn_clf, bal_df %>% dplyr::select(-matches("yield")))
TN = bal_df[bal_df$yieldclass == 'HY' & predicted == 'HY', ]
I also create a quick function returning the Euclidean distance between two vectors.
In [29]:
euclidean_distance = function(x1, x2) sqrt(sum((x1 - x2)^2))
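As a quick sanity check, the distance between (0, 0) and (3, 4) should be 5.
euclidean_distance(c(0, 0), c(3, 4)) # returns 5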
The following function will be applied to each observation. It checks if the observation is predicted as LY or HY. Then it computes the Euclidean distance between the observation and the mean of the k_nearest (let's say 10) nearest balanced points. It also checks if the mean of the nearest balanced points is indeed an HY, just in case the mean falls into an LY hole.
In [30]:
balance_knn = function(observed_bal, bal_dataset, model, k_nearest) {
    # The model must return either LY or HY
    is_obs_balanced = predict(model, observed_bal) == 'HY'
    # Distances between the observation and every point of the reference data set
    distances = apply(bal_dataset, 1, euclidean_distance, x2 = observed_bal)
    # Keep the k_nearest smallest distances (sort() is increasing by default)
    nearest = names(sort(distances)[1:k_nearest])
    reference_bals = bal_dataset[rownames(bal_dataset) %in% nearest, ]
    # The reference ionome is the mean of the nearest balanced points
    reference_bal = apply(reference_bals, 2, mean)
    reference_bal = t(data.frame(reference_bal))
    # Check that this mean itself is predicted as balanced (HY)
    is_ref_balanced = predict(model, reference_bal) == 'HY'
    imbalance_index = euclidean_distance(reference_bal, observed_bal)
    return(list(is_obs_balanced, is_ref_balanced, imbalance_index, reference_bals))
}
Here I'm arranging the balances matrix as a data frame, and I create a data frame where the diagnostic will be stored.
In [31]:
observed_bal = data.frame(bal)
colnames(observed_bal) = colnames(training)[-13]
diagnostic = data.frame(matrix(ncol = 3, nrow = nrow(observed_bal)))
colnames(diagnostic) = c("obs_is_balanced", "ref_is_balanced", "imbalance_index")
The following loop computes the diagnosis for every observed balance. It's a long process, and probably a highly sub-optimal approach in terms of computational efficiency.
In [32]:
reference_bals_l = list()
for (i in 1:nrow(observed_bal)) {
output = balance_knn(observed_bal = observed_bal[i, ],
bal_dataset = TN[, 1:12],
model = ml_kknn_clf,
k_nearest=10)
diagnostic[i, 1] = output[[1]]
diagnostic[i, 2] = output[[2]]
diagnostic[i, 3] = output[[3]]
reference_bals_l[[i]] = output[[4]]
}
I put the yield and yield class in the diagnosis table to make things pretty for plotting.
In [33]:
diagnostic$yield = atoca$Rendement
diagnostic$yieldclass = cut(atoca$Rendement, c(0, yield_delimiter, 75))
levels(diagnostic$yieldclass) = c('LY', 'HY')
In [34]:
options(repr.plot.width=6, repr.plot.height=4)
par(mfrow=c(1, 2))
plot(diagnostic$yield, diagnostic$imbalance_index,
xlab="Yield, t/ha", ylab="Imbalance index")
plot(density(diagnostic$imbalance_index[diagnostic$yieldclass == 'LY']),
main="", xlab="Imbalance index", col="red", ylim=c(0, 0.5))
lines(density(diagnostic$imbalance_index[diagnostic$yieldclass == 'HY']),
col="blue")
legend(4, 0.5, legend=c("LY", "HY"),
col=c("red", "blue"), lty=1, cex=0.8)
From the first plot it's obvious that there is a trend between yield and imbalance index, but it's very noisy. The density plots show that high yielders generally have a lower imbalance index, but low yielders have a broad range of values, covering that of high yielders. The imbalance index is not suitable here.
For a single observation, it is possible to plot its position in a CoDa dendrogram, with the statistics of its reference (the confidence range of the nearest values is to be interpreted with care when it is based on few values, here just 10).
It's not obvious which color is the reference and which is the observed. If you check the numbers, red is the observed and green is the reference.
In [35]:
i=10
observed_bal = bal[i, ]
codadend_df = rbind(reference_bals_l[[i]], observed_bal, observed_bal)
codadend_df$group = c(rep('reference', 10), 'observed', 'observed')
In [36]:
options(repr.plot.width=8, repr.plot.height=5)
CoDaDendrogram2(ilrInv(codadend_df[, -13], V=psi), V = psi,
range = 'auto', equal.height = TRUE, type = 'conf', conf.level=0.95,
group = factor(codadend_df$group),
show.range = TRUE)
In [37]:
imbalance_index = euclidean_distance(apply(reference_bals_l[[i]], 2, mean), observed_bal)
imbalance_index
What we just did is the latest approach I retained. But in most papers I published, I used an in-house approach based on binary classification, similar to the sufficiency levels of available nutrients concept, known in agriculture as the Cate-Nelson approach (Nelson 1975; example in a FAO document, page 33).
Here, I loop across several qcrit thresholds to recover a sensitivity versus specificity curve for the training set.
In [38]:
in_train = createDataPartition(y = bal_df$yield, p = .75, list = FALSE) # indexes of the training set
training = bal_df[in_train, ] # training set
testing = bal_df[-in_train, ] # testing set
In [39]:
qcrit_vector = seq(0.950, 0.999, by = 0.001) # adjust the minimum value if an error message is returned from the sign1_ref function
senspe = matrix(ncol = 2, nrow = length(qcrit_vector))
colnames(senspe) = c("Sensitivity", "Specificity")
sign1_results = list()
for (j in 1:length(qcrit_vector)) {
sign1_results[[j]] = sign1_ref(mv = training[, -c(13, 14)], Y = training[, 13], Ycut = yield_delimiter,
niter = 20, qcrit = qcrit_vector[j])
classification = biclass(x=sign1_results[[j]]$mv.dist, y = training[, 13],
xl = sign1_results[[j]]$const, yl = yield_delimiter)$classification
senspe[j, 1] = sum(classification == "TP") / (sum(classification == "TP") + sum(classification == "FN"))
senspe[j, 2] = sum(classification == "TN") / (sum(classification == "TN") + sum(classification == "FP"))
}
The senspe matrix is used to plot a ROC curve. This needs several steps. First, I add theoretical values where sensitivity is 1 and specificity is 0, and where sensitivity is 0 and specificity is 1.
In [40]:
senspe = rbind(c(1, 0), senspe, c(0, 1))
Then I compute the AUC (area under the curve) index...
In [41]:
sAUC = vector()
sAUC[1] = senspe[1,2]*senspe[1,1]
for (i in 2:nrow(senspe)) sAUC[i] = (senspe[i,2]-senspe[i-1,2])*senspe[i,1]
AUC = sum(sAUC, na.rm = TRUE)
... as well as the Youden index ($Sensitivity + Specificity - 1$).
In [42]:
youden <- apply(senspe, 1, function(X) sum(X) - 1)
optXi = which.max(youden)
The following code produces a ROC plot.
In [43]:
options(repr.plot.width=5, repr.plot.height=5)
ggplot(data = data.frame(senspe), aes(x = Specificity, y = Sensitivity)) +
theme_bw() +
geom_step(direction = "vh") +
geom_abline(intercept = 1, slope= -1, lty=2) +
geom_point(data = data.frame(senspe)[optXi, ], size=3) +
geom_text(data = data.frame(senspe)[optXi, ],
label = paste("[", round(senspe[optXi, 1], 2), ", ",
round(senspe[optXi, 2], 2), "]", sep=""),
hjust=0.3, vjust=-1) +
geom_text(data = data.frame(Specificity = 0.5, Sensitivity = 0.5),
label = paste("AUC = ", round(AUC, 2), sep="")) +
coord_equal()
The optimal value for qcrit is the following.
In [44]:
qcrit_vector[optXi]
I will not print it here, but the information for the partition done with the optimal qcrit can be recovered from the sign1_results[[optXi]] object. I'm using this object for the binary classification (Cate-Nelson).
In [45]:
options(repr.plot.width=5, repr.plot.height=8)
bin_class_train = CateNelson(x = sign1_results[[optXi]]$mv.dist, y = training$yield,
xcn = sign1_results[[optXi]]$const, ycn = yield_delimiter,
TNpos = "ul", xlab = expression("Mahalanobis distance"), ylab = "Yield",
draw.ellipse = "none",
anovm = TRUE, anovm.span = 0.1, anovm.degree = 1,
dens.plot = TRUE, colors = "greys")
We get an accuracy of 0.89 with the training set. This is pretty good! Now let's try with the testing set.
To do this I isolate the balances of the reference population (true high yielders), thy.
In [46]:
thy = training[sign1_results[[optXi]]$mv.dist <= sign1_results[[optXi]]$const &
training$yield > yield_delimiter, -c(13, 14)]
Then I compute the Mahalanobis distance between the testing set and the reference population.
In [47]:
invCov = ginv(cov(thy)) # I invert the matrix separately with the robust ginv function (MASS package)
test_mv.dist = sqrt(mahalanobis(testing[ , -c(13, 14)],
center = apply(thy, 2, median),
cov = invCov,
inverted = TRUE))
In [48]:
options(repr.plot.width=6, repr.plot.height=4)
bin_class_test = CateNelson(x = test_mv.dist, y = testing$yield,
xcn = sign1_results[[optXi]]$const, ycn = yield_delimiter,
TNpos = "ul", xlab = expression("Mahalanobis distance"), ylab = "Yield",
draw.ellipse = "none",
anovm = FALSE,
dens.plot = FALSE,
colors = "greys")
The testing accuracy is 0.79. Not so bad, but below what we get with KNNs.
The DRIS is usually not conducted with training and testing sets. I could do it, but I guess at this point you can figure out how it can be done. Here I'm using the whole data set. The DRIS function takes a collection of high yielders, a collection of low yielders and a data set on which we want to compute the imbalance indices.
But when I first ran the DRIS function it failed because there is a zero value in the data set, in the iron column, on line 225. Let's impute this value as 65% of the lowest measurement.
In [49]:
zero_imp = function(x, rate=0.65) {
x[x==0] = rate * min(x[x>0])
return(x)
}
parts_imp = apply(parts, 2, zero_imp)
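A quick check that the imputation left no zeros behind:
sum(parts_imp == 0) # should return 0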
Then I isolate the low and high yielders, with concentrations converted to percentages (hence the division by 10000).
In [50]:
high_yielders = parts_imp[atoca$Rendement >= yield_delimiter, -13] / 10000
low_yielders = parts_imp[atoca$Rendement < yield_delimiter, -13] / 10000
The DRIS function includes several methods. The Beaufils method is the classical one.
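For the record, here is how the classical Beaufils functions are usually presented in the DRIS literature (the DRIS.R source is the authority on the exact implementation used here). Each nutrient ratio $A/B$ of the diagnosed sample is compared to the norm $a/b$, the mean of that ratio in the high-yield subpopulation, weighted by the coefficient of variation $CV$ of the ratio in that subpopulation and a sensitivity constant $k$ (often 1000): $f(A/B) = \left( \frac{A/B}{a/b} - 1 \right) \frac{k}{CV}$ if $A/B \geq a/b$, else $f(A/B) = \left( 1 - \frac{a/b}{A/B} \right) \frac{k}{CV}$. The index of nutrient A is the mean of the $f$ functions over all ratios involving A, with the sign reversed when A sits in the denominator, and the IBN is the sum of the absolute values of the indices.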
In [51]:
i=1:10 # argument obs must contain at least two rows
dris_i = DRIS(obs = parts_imp[i, -13] / 10000,
ref.hy=high_yielders,
ref.ly=low_yielders,
method="Beaufils")
In [52]:
dris_i$Index
In [53]:
dris_i$IBN