The diagnosis of the plant ionome with R

I have written several papers on the subject, namely on tomato, guava, kiwi, corn and mango. As conceptual papers as well. I have been asked several times how the calculations were done. I now tend to publish all my codes and data on my GitHub repository, but in the past this was not so systematic. This notebook is for people who want to know how I work with plant ionomes with R.

One of the reasons why science is beautiful is that knowledge moves through better accuracies. When I worked with mango ionomes, I created an in-house code to isolate a high-yield, healthy group, then assessed the health index of an ionome by computing the Mahalanobis distance between the observation to diagnose and the center and covariance of the healthy group (Mahalanobis distances need covariance).

Nowadays, I prefer to use established machine learning methods. In particular, a distance-based method named k-nearest neighbors offers insightful results. I'm not saying it's a catch-all approach. There are probably better methods waiting to be discovered. And this notebook will be improved over time, just like a wiki.

Load libraries

I'm loading tidyverse to load both ggplot2 and dplyr. compositions is a package to transform components (like concentrations) to compositionnally friendly variables, like isometric log-ratios. This package has not been updated for a while, so if you are on Windows, R will throw warnings, which as far as I know can be ignored. The mvoutlier package is usefull to detect multivariate outliers. If you are a R user and don't know caret, you are missing something huge: it's a all-in-one package for machine learning. The MASS package contains some useful functions, namely ginv for matrix inversion (used to compute the Mahalanobis distance.)


In [1]:
library("tidyverse")
library("compositions")
library("mvoutlier")
library("caret")
library("MASS")

options(repr.plot.width=5, repr.plot.height=5)
set.seed(534636)


-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.0.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.7
v tidyr   0.8.1     v stringr 1.3.1
v readr   1.1.1     v forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Loading required package: tensorA

Attaching package: 'tensorA'

The following object is masked from 'package:base':

    norm

Loading required package: robustbase
Loading required package: energy
Loading required package: bayesm
Welcome to compositions, a package for compositional data analysis.
Find an intro with "? compositions"


Attaching package: 'compositions'

The following objects are masked from 'package:stats':

    cor, cov, dist, var

The following objects are masked from 'package:base':

    %*%, scale, scale.default

Loading required package: sgeostat
sROC 0.1-2 loaded
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:compositions':

    R2

The following object is masked from 'package:purrr':

    lift


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

There are tools that are unavailable in packages. So I created some code and put them in my AgFun repository on GitHub (fun is for function, of course - who would think these things are fun?!?).


In [2]:
source("https://raw.githubusercontent.com/essicolo/AgFun/master/DRIS.R") # DRIS
source("https://raw.githubusercontent.com/essicolo/AgFun/master/sign1_ref.R") # Mahalanobis
source("https://raw.githubusercontent.com/essicolo/AgFun/master/codadend2.R") # CoDaDendrogram2
source("https://raw.githubusercontent.com/essicolo/AgFun/master/ilrDefinition.R") # ilrDefinition
source("https://raw.githubusercontent.com/essicolo/AgFun/master/CateNelson.R") # CateNelson
source("https://raw.githubusercontent.com/essicolo/AgFun/master/biclass.R") # biclass

DRIS stands for Diagnosis and Recommandation Integrated System. It's a mathematical formulation created by Beaufils () to assess plant ionomes. The sign1_ref function is inspired by the sign1 function of the mvoutlier package: it does the isolation of a healthy group and computes Mahalanobis distances. The codadend2.R fil contains the CoDaDendrogram2 function. It's a fork of the CoDaDendrogram function from the compositions package which mostly adds the ability to plot statistics on the vertical branches. The ilrDefinition.R file contains a function that generates [E,D | C,B,A] strings from a sequential binary partition (a SBP, used to give a structure to isometric log-ratios). The CateNelson.R and the biclass.R files contain respectively a function to plot Cate-Nelson data and to classify these data in binary classes.

Load data

The data set comes from cranberries. When there are numerous components to analyse, I prefer to design SBPs in a spreadsheet (LibreOffice Calc) rather than entering it in R.


In [3]:
atoca <- read_delim('https://raw.githubusercontent.com/essicolo/AgFun/master/examples/plant-ionome-diagnosis-with-R/data/data.csv',
                    delim = ';')
sbp <- read_delim('https://raw.githubusercontent.com/essicolo/AgFun/master/examples/plant-ionome-diagnosis-with-R/data/sbp.csv',
                  delim = ';')


Parsed with column specification:
cols(
  Annee = col_integer(),
  Site = col_character(),
  Type = col_character(),
  Dose = col_integer(),
  Répétition = col_integer(),
  Rendement = col_double(),
  B = col_double(),
  Ca = col_double(),
  Cu = col_double(),
  Fe = col_double(),
  K = col_double(),
  Mg = col_double(),
  Mn = col_double(),
  P = col_double(),
  Zn = col_double(),
  N = col_double(),
  C = col_double(),
  S = col_double()
)
Parsed with column specification:
cols(
  B = col_integer(),
  Ca = col_integer(),
  Cu = col_integer(),
  Fe = col_integer(),
  K = col_integer(),
  Mg = col_integer(),
  Mn = col_integer(),
  P = col_integer(),
  Zn = col_integer(),
  N = col_integer(),
  C = col_integer(),
  S = col_integer(),
  Fv = col_integer()
)

In [4]:
atoca %>% sample_n(10)


AnneeSiteTypeDoseRépétitionRendementBCaCuFeKMgMnPZnNCS
2015 9 N 60 1 25.47 37.7700 5087.2311.065 44.547 4914.3641325.100197.556 759.77241.015 11870.0 509600 644.8
2014 45 N 60 1 63.97 67.1959 8039.2372.609 74.605 4923.2181768.443197.914 898.33816.662 11914.0 525100 903.5
2014 A9 N 45 2 39.98 42.7036 8177.9103.009 169.773 4725.6042265.449170.687 972.92014.251 10718.0 510330 766.2
2015 10 N 60 1 49.71 54.9900 4749.0720.862 12.733 4074.5031389.089153.455 702.29742.493 10830.0 499280 895.2
2015 A9 B 1 1 31.66 57.8300 5166.5395.076 54.197 4584.4781575.689146.329 778.45955.073 12358.0 509790 944.8
2015 45 K 40 2 32.07 55.5000 5624.4322.202 45.395 3887.2141476.326113.487 752.29213.909 10409.0 500160 731.4
2015 A9 N 30 1 20.27 80.4400 5767.2591.351 64.645 3647.4181924.120201.747 804.38875.353 9006.7 501600 696.3
2014 10 N 45 1 41.93 69.9526 7445.5542.552 129.249 5025.8251600.527158.504 775.430 9.813 9121.6 518340 562.1
2015 9 Mg 12 1 45.72 51.3300 5623.2291.037 52.513 4571.8261402.507183.592 715.01639.967 11616.0 506160 652.0
2014 45 B 0 1 53.14 37.4904 8223.0432.281 91.325 4734.8632254.698166.760 1023.43011.593 9542.2 528250 788.3

I use a filling value (Fv), representing all the components which were not measured (here mostly oxigen and hydrogen) in the leaves. This is useful not only in the diagnostic, but also it's necessary to back-transform isometric log-ratios to concentrations in the original scale.


In [5]:
parts <- atoca[ ,7:18]
if (!'Fv' %in% colnames(parts)) parts$Fv <- 1e6 - rowSums(parts)

It's always useful to check if the column names in parts and sbp are the same.


In [6]:
t(data.frame(colnames(sbp), colnames(parts)))


colnames.sbp.B CaCuFeK MgMnP ZnN C S Fv
colnames.parts.B CaCuFeK MgMnP ZnN C S Fv

Compositional data analysis

In this section, I transform the parts to balances. This is done to free concentrations from their closed space, from their redundancy and from their subcompositional incoherence (this paper for more details).


In [7]:
psi <- gsi.buildilrBase(t(sbp))
ilrDef <- ilrDefinition(sbp, side = '-+') # assign [E,D | C,B,A] names to balances
comp <- acomp(parts)
bal <- ilr(comp, V = psi)
colnames(bal) <- ilrDef

The CoDaDendrogram2 function can be use to plot trends in yield at each fulcrum.


In [8]:
grouping <- cut(atoca$Rendement, breaks = 10)
grouping <- factor(grouping, levels = rev(levels(grouping)))

CoDaDendrogram2(comp, V = psi,
                range = 'auto', equal.height = TRUE, type = 'conf', conf.level = 0,
                show.range = FALSE, group = grouping, pch.col = FALSE)


Check for outliers

I'm using sign2 with a loose criteria to detect potential outliers. The pairs plot shows outliers in black.


In [9]:
out_test <- sign2(unclass(bal), qcrit = 0.999)
pairs(bal, col = factor(out_test$wfinal01), pch=16)


I keep the inliers in the bal_in object.


In [10]:
bal_in = bal[out_test$wfinal01 == 1, ]

Just for sake of data mining, we can do a pca and ceck if there is a trend with cranberry yield.


In [11]:
bal_pca = princomp(bal, cor=FALSE)
gg_pca = data.frame(bal_pca$scores[, 1:2], yield = atoca$Rendement)
ggplot(gg_pca, aes(x = Comp.1, y = Comp.2)) +
  geom_point(aes(size = yield), alpha=0.2)


Some regions are definitly yielding low, showing potential trends in the data.

Linear regression

For sake of statistical testing, I test if yield can be described by a linear combinaision of balances. If one wants to compare coefficients, variables should be scaled (zero mean and unit standard deviation).


In [12]:
bal_sc <- scale(bal)
colnames(bal_sc) <- ilrDef

In [13]:
bal_df <- data.frame(bal_sc, yield = atoca$Rendement)
mod1 <- lm(yield ~ ., bal_df) # log10 of yield could also be used
summary(mod1)


Call:
lm(formula = yield ~ ., data = bal_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.8051  -6.5717  -0.2561   6.6827  24.4092 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                            40.0493     0.5641  70.996  < 2e-16 ***
X.S.C.N.Zn.P.Mn.Mg.K.Fe.Cu.Ca.B...Fv.  -1.9681     1.6937  -1.162 0.246239    
X.N.Zn.P.Mn.Mg.K.Fe.Cu.Ca.B...S.C.     -6.0682     2.1117  -2.874 0.004377 ** 
X.S...C.                               -6.8642     2.0775  -3.304 0.001080 ** 
X.N.P.Mg.K.Ca.B...Zn.Mn.Fe.Cu.         -6.7107     1.3601  -4.934  1.4e-06 ***
X.Zn.Cu...Mn.Fe.                        0.6384     1.1624   0.549 0.583343    
X.Mn...Fe.                              0.5219     1.0434   0.500 0.617325    
X.Zn...Cu.                              2.8604     0.9055   3.159 0.001760 ** 
X.N.P.Mg.K.Ca...B.                     -1.0229     0.6961  -1.469 0.142862    
X.Mg.K.Ca...N.P.                        1.2087     1.1030   1.096 0.274099    
X.N...P.                               -2.8013     0.9981  -2.807 0.005366 ** 
X.Mg.Ca...K.                           -2.7096     0.7623  -3.555 0.000446 ***
X.Mg...Ca.                              4.1586     0.9435   4.408  1.5e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.557 on 274 degrees of freedom
Multiple R-squared:  0.5608,	Adjusted R-squared:  0.5415 
F-statistic: 29.15 on 12 and 274 DF,  p-value: < 2.2e-16

In R, there is a very useful package named MCMCpack, which can easily handle bayesian statistics. I'm not using it here to keep focus. Just saying.

library(MCMCpack) model1_bayes = MCMCregress(yield ~ ., bal_df)

We verify the distribution of residues. The null hypothesis is that residues are normally distributed.


In [14]:
shapiro.test(residuals(mod1))


	Shapiro-Wilk normality test

data:  residuals(mod1)
W = 0.99687, p-value = 0.8511

The null hypothesis is not rejected at a 0.05 level. Great, but let's see how the linear model behaves.


In [15]:
plot(atoca$Rendement, predict(mod1),
     xlab = "Measured yield",
     ylab = "Predicted yield",
    col='gray40')
abline(a=0,b=1, col="red", lty = 2, lwd = 2)


I have to say that this is unusually good for ecological data. But a linear statistical model describes the effects: it doesn't predict anything.

This is where machine learning come to play.

Machine learning

In ML, we either regress to predict continuous values or classify to predict categories. Of course, yield is continuous. But for diagnosis and decision-making, continuous values are difficult to interpret.

Yield as continuous target

Anyways, let's try a regression model. In ML, we need to split data into a training set, on which the model will be fitted, and a testing set used to test how the model behaves with data which were not used for fitting.


In [16]:
inTrain <- createDataPartition(y = bal_df$yield, p = .70, list = FALSE) # indexes of the training set
training <- bal_df[inTrain, ] # training set
testing <- bal_df[-inTrain, ] # testing set

Many algotithms cans be tested. In my experience, KNNs usually work great with ilrs. Beware that fitting ML can be a long process...


In [17]:
grid <- expand.grid(kmax = seq(2, 10, by = 2),
                    distance = seq(0.2, 2, by = 0.2),
                    kernel = 'optimal') # grid of parameters to try
ml_kknn_reg = train(yield ~ ., data = training, method = "kknn", tuneGrid = grid, tuneLength = 10) # long process

In [18]:
options(repr.plot.width=5, repr.plot.height=4)
plot(bal_df[inTrain, 'yield'], predict(ml_kknn_reg),
     xlab = "Measured yield",
     ylab = "Predicted yield",
    col='lightblue')
points(bal_df[-inTrain, 'yield'], predict(ml_kknn_reg, testing),
    col='red', pch=4)
abline(a=0,b=1, col="red", lty = 2, lwd = 2)


The caret package allows to plot an index of importance of variables. Each algorithm ha its own meaning of "importance" of variables. We should be careful on what "importance" means, case by case.


In [19]:
vI = varImp(ml_kknn_reg)
rownames(vI$importance) = ilrDef
plot(vI)


Yield as ordered category

For dagnosis purposes, yield can be segmented in categories, usually only two.


In [20]:
print(quantile(bal_df$yield))
boxplot(bal_df$yield)


   0%   25%   50%   75%  100% 
 6.45 29.89 39.05 50.05 74.14 

A delimiter of 50 t/ha seems reasonnable.


In [21]:
yield_delimiter = 50
bal_df$yieldclass = cut(bal_df$yield, c(0, yield_delimiter, 75))
levels(bal_df$yieldclass) = c('LY', 'HY')
training = bal_df[inTrain, -13] # training set
testing = bal_df[-inTrain, -13] # testing set

In [22]:
grid = expand.grid(kmax = seq(10, 30, by = 5),
                   distance = 1.2,#seq(0.6, 3, by = 0.2), 
                   kernel = 'optimal') # grid of parameters to try
ml_kknn_clf = train(yieldclass ~ ., 
                    data = training, 
                    method = "kknn", 
                    tuneGrid = grid, 
                    tuneLength = 20) # long process

The contingency tables in the training and testing sets.


In [23]:
ct_train = table(Measured = bal_df[inTrain, 'yieldclass'], 
                 Predicted = predict(ml_kknn_clf))
ct_train


        Predicted
Measured  LY  HY
      LY 146   6
      HY   6  45

In [24]:
ct_test = table(Measured = bal_df[-inTrain, 'yieldclass'], 
                Predicted = predict(ml_kknn_clf,testing))
ct_test


        Predicted
Measured LY HY
      LY 57  6
      HY  6 15

The accuracy in the testing set.


In [25]:
(ct_test[1, 1] + ct_test[2, 2]) / sum(ct_test)


0.857142857142857

The optimal parameters of the model.


In [26]:
ml_kknn_clf


k-Nearest Neighbors 

203 samples
 12 predictor
  2 classes: 'LY', 'HY' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 203, 203, 203, 203, 203, 203, ... 
Resampling results across tuning parameters:

  kmax  Accuracy   Kappa   
  10    0.8493381  0.593071
  15    0.8493381  0.593071
  20    0.8493381  0.593071
  25    0.8493381  0.593071
  30    0.8493381  0.593071

Tuning parameter 'distance' was held constant at a value of 1.2

Tuning parameter 'kernel' was held constant at a value of optimal
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were kmax = 30, distance = 1.2 and kernel
 = optimal.

With KNNs, we can generate an imbalance index by computing the distance between a point and the mean nearest K balanced points in the ilr space. First, we must isolate the reference population, the one have high yield which can be correctly predicted, i.e. the true negatives (truely not sick).


In [27]:
vI = varImp(ml_kknn_clf)
rownames(vI$importance) = ilrDef
plot(vI)


Assessing imbalance

The euclidean distance as imbalance index is not always good. Let's see how it behaves in this case. First, I create a vector containing the predictions of the model. This vector is a series of LY and HY values. The TN object is a data frame containing the ionomes of true negatives, i.e. whose actual and predicted values are high yields.


In [28]:
predicted = predict(ml_kknn_clf, bal_df %>% dplyr::select(-matches("yield")))
TN = bal_df[bal_df$yieldclass == 'HY' & predicted == 'HY', ]

I also create a quick function returning the euclidean distance between the values of two vectors.


In [29]:
euclidean_distance = function(x1, x2) (sqrt(sum(x1-x2)^2))

The following function will be applied to each observation. I checks if the observation is predicted as LY or HY. Then it computes the euclidean distance between the observation and the mean of the k_nearest (let's say 10) nearest balanced points. It also checks if the mean of the nearest balanced points is indeed a HY, just in case the mean falls into a LY hole.


In [30]:
balance_knn = function(observed_bal, bal_dataset, model, k_nearest) {
    # The model must return either LY or HY
    is_obs_balanced = predict(model, observed_bal) == 'HY'
    
    distances = apply(bal_dataset, 1, euclidean_distance, x2=observed_bal)
    nearest = names(sort(distances, decreasing=TRUE)[1:k_nearest])
    reference_bals = bal_dataset[rownames(bal_dataset) %in% nearest, ]
    reference_bal = apply(reference_bals, 2, mean)
    reference_bal = t(data.frame(reference_bal))
    
    is_ref_balanced = predict(ml_kknn_clf, reference_bal) == 'HY'
    
    imbalance_index = euclidean_distance(reference_bal, observed_bal)
    return(list(is_obs_balanced, is_ref_balanced, imbalance_index, reference_bals))
}

Here I'm aranging the balances matrix as a data frame, and I create a data frame where the diagnostic will be stored.


In [31]:
observed_bal = data.frame(bal)
colnames(observed_bal) = colnames(training)[-13]
diagnostic = data.frame(matrix(ncol = 3, nrow = nrow(observed_bal)))
colnames(diagnostic) = c("obs_is_balanced", "ref_is_balanced", "imbalance_index")

The following loop computes the diagnosis for every observed balance. It's a long process, and probably a highly sub-optimal approach in terms of computation efficiency.


In [32]:
reference_bals_l = list()
for (i in 1:nrow(observed_bal)) {
    output = balance_knn(observed_bal = observed_bal[i, ], 
                                  bal_dataset = TN[, 1:12], 
                                  model = ml_kknn_clf, 
                                  k_nearest=10)
    diagnostic[i, 1] = output[[1]]
    diagnostic[i, 2] = output[[2]]
    diagnostic[i, 3] = output[[3]]
    reference_bals_l[[i]] = output[[4]]
}

I put the yield and yield class in the diagnosis table to make thing pretty for plotting.


In [33]:
diagnostic$yield = atoca$Rendement
diagnostic$yieldclass = cut(atoca$Rendement, c(0, yield_delimiter, 75))
levels(diagnostic$yieldclass) = c('LY', 'HY')

In [34]:
options(repr.plot.width=6, repr.plot.height=4)
par(mfrow=c(1, 2))
plot(diagnostic$yield, diagnostic$imbalance_index, 
     xlab="Yield, t/ha", ylab="Imbalance index")
plot(density(diagnostic$imbalance_index[diagnostic$yieldclass == 'LY']),
    main="", xlab="Imbalance index", col="red", ylim=c(0, 0.5))
lines(density(diagnostic$imbalance_index[diagnostic$yieldclass == 'HY']),
     col="blue")
legend(4, 0.5, legend=c("LY", "HY"),
       col=c("red", "blue"), cex=0.8)


From the first plot it's obvious that there is a trend between yield and imbalance index, but it's very noisy. The density plots show that high yielders have generally a lower imbalance index, but low yielders have a broad range of values, corevering high yielders. The imbalance index is not suitable here.

For a signle observation, it is possible to plot it's position in a CoDa dendrogram, with the statistics of its reference (the confidence range of the nearest values is to interpret with care when it has few values, here just 10).

It's not obvious which color is the reference and which is the observed. If you check the numbers, red is the observed and green is the reference.


In [35]:
i=10
observed_bal = bal[i, ]
codadend_df = rbind(reference_bals_l[[i]], observed_bal, observed_bal)
codadend_df$group = c(rep('reference', 10), 'observed', 'observed')

In [36]:
options(repr.plot.width=8, repr.plot.height=5)
CoDaDendrogram2(ilrInv(codadend_df[, -13], V=psi), V = psi,
                range = 'auto', equal.height = TRUE, type = 'conf', conf.level=0.95,
                 group = factor(codadend_df$group),
                show.range = TRUE)



In [37]:
imbalance_index = euclidean_distance(apply(reference_bals_l[[i]], 2, mean), observed_bal)
imbalance_index


8.45395759091796

The Cate-Nelson approach

What we did was the latest approach I retained. But in most papers I published, I used an in-house approach based on the binary classification approach, similar to the Sufficiency levels of available nutrients concept known as the Cate-Nelson approach in agriculture (Nelson 1975; example in a FAO document, page 33).

Here, I loop across several qcrit thresholds to recover a sensitivity versus sensitivity curve for the training set.


In [38]:
in_train = createDataPartition(y = bal_df$yield, p = .75, list = FALSE) # indexes of the training set
training = bal_df[in_train, ] # training set
testing = bal_df[-in_train, ] # testing set

In [39]:
qcrit_vector = seq(0.950, 0.999, by = 0.001) # ajust minimum value if an error message is returned from the sign1_ref function
senspe = matrix(ncol = 2, nrow = length(qcrit_vector))
colnames(senspe) = c("Sensitivity", "Specificity")

sign1_results = list()
for (j in 1:length(qcrit_vector)) {
  sign1_results[[j]] = sign1_ref(mv = training[, -c(13, 14)], Y = training[, 13], Ycut = yield_delimiter, 
                                 niter = 20, qcrit = qcrit_vector[j])
  classification = biclass(x=sign1_results[[j]]$mv.dist, y = training[, 13],
                           xl = sign1_results[[j]]$const, yl = yield_delimiter)$classification
  senspe[j, 1] = sum(classification == "TP") / (sum(classification == "TP") + sum(classification == "FN"))
  senspe[j, 2] = sum(classification == "TN") / (sum(classification == "TN") + sum(classification == "FP"))
}

The senspe matrix is used to plot a ROC curve. This needs several steps. First, I add theoretical values where sensitivity is 1 and specificity is 0, and where sensitivity is 0 and specificity is 1.


In [40]:
senspe = rbind(c(1, 0), senspe, c(0, 1))

Then I compute the AUC (area under the curve) index...


In [41]:
sAUC = vector()
sAUC[1] = senspe[1,2]*senspe[1,1]
for (i in 2:nrow(senspe)) sAUC[i] = (senspe[i,2]-senspe[i-1,2])*senspe[i,1]
AUC = sum(sAUC, na.rm = TRUE)

... as well as the Youden index ($Sensitivity + Specificity - 1$).


In [42]:
youden <- apply(senspe, 1, function(X) sum(X) - 1)
optXi = which.max(youden)

The following code produces a ROC plot.


In [43]:
options(repr.plot.width=5, repr.plot.height=5)
ggplot(data = data.frame(senspe), aes(x = Specificity, y = Sensitivity)) +
  theme_bw() +
  geom_step(direction = "vh") +
  geom_abline(intercept = 1, slope= -1, lty=2) +
  geom_point(data = data.frame(senspe)[optXi, ], size=3) +
  geom_text(data = data.frame(senspe)[optXi, ],
            label = paste("[", round(senspe[optXi, 1], 2), ", ", 
                          round(senspe[optXi, 2], 2), "]", sep=""),
            hjust=0.3, vjust=-1) +
  geom_text(data = data.frame(Specificity = 0.5, Sensitivity = 0.5),
            label = paste("AUC = ", round(AUC, 2), sep="")) +
  coord_equal()


The optimal value for qcrit is the following.


In [44]:
qcrit_vector[optXi]


0.994

I will not print it here, but the information for the partition done with the optimal qcrit can be recovered in the sign1_results[[optXi]] object. I'm using this object for the binary classification (Cate-Nelson)


In [45]:
options(repr.plot.width=5, repr.plot.height=8)
bin_class_train = CateNelson(x = sign1_results[[optXi]]$mv.dist, y = training$yield,
                             xcn = sign1_results[[optXi]]$const, ycn = yield_delimiter,
                             TNpos = "ul", xlab = expression("Mahalanobis distance"), ylab = "Yield",
                             draw.ellipse = "none",
                             anovm = TRUE, anovm.span = 0.1, anovm.degree = 1,
                             dens.plot = TRUE, colors = "greys")


We get an accuracy of 0.89 with the training set. This is pretty good! Now let's try with the testing set.

To do this I isolate the balances of the reference population (true high yielders), thy.


In [46]:
thy = training[sign1_results[[optXi]]$mv.dist <= sign1_results[[optXi]]$const & 
               training$yield > yield_delimiter, -c(13, 14)]

Then I compute the Mahalanobis distance between the testing set and the reference population.


In [47]:
invCov = ginv(cov(thy)) # I invert the matrix separetly with the robust ginv function (MASS package)
test_mv.dist = sqrt(mahalanobis(testing[ , -c(13, 14)], 
                                center = apply(thy, 2, median), 
                                cov = invCov, 
                                inverted = TRUE))

In [48]:
options(repr.plot.width=6, repr.plot.height=4)
bin_class_test = CateNelson(x = test_mv.dist, y = testing$yield,
                            xcn = sign1_results[[optXi]]$const, ycn = yield_delimiter,
                            TNpos = "ul", xlab = expression("Mahalanobis distance"), ylab = "Yield",
                            draw.ellipse = "none",
                            anovm = FALSE,
                            dens.plot = FALSE,
                            colors = "greys")


The testing accuracy is 0.79. Not so bad, but below what we get with KNNs.

DRIS

The DRIS is usually not conducted with training and testing sets. I could do it, but I guess at this point you can figure out how it can be done. Here I'm using the whole data set. The DRIS function takes a collection of high yielders, a collection of low yielders and a data set on which we want to compute the imbalance indices.

But when I first ran the DRIS function it failed because there is a zero value in the data set, in the iron column, on line 225. Let's impute this value as 65% of the lowest measurement.


In [49]:
zero_imp = function(x, rate=0.65) {
    x[x==0] = rate * min(x[x>0])
    return(x)
}
parts_imp = apply(parts, 2, zero_imp)

For isolation of low and high yielders, in percentage.


In [50]:
high_yielders = parts_imp[atoca$Rendement >= yield_delimiter, -13] / 10000
low_yielders = parts_imp[atoca$Rendement < yield_delimiter, -13] / 10000

The DRIS function include several methods. The Beaufils method is the classical one.


In [51]:
i=1:10 # argument obs must contain at least two rows
dris_i = DRIS(obs = parts_imp[i, -13] / 10000, 
              ref.hy=high_yielders, 
              ref.ly=low_yielders,
              method="Beaufils")

In [52]:
dris_i$Index


BCaCuFeKMgMnPZnNCS
0.06606111 -1.0012492 0.88100111 -0.1470924 3.25744015 0.1746577 -0.74192784 0.2761492 0.4171985 -2.3372454 0.6186254490-1.46361830
-0.95883559 0.3705762 0.23089634 0.2334644 1.03444349 1.2040973 -1.51798791 0.7040106 0.8218244 -1.7638275 0.3558257024-0.71448736
0.49948454 -0.3352700 -0.42236072 1.8508442 0.20638529 0.7926367 -1.05089169 -0.4328939 0.5528011 -0.7518617 -0.0512436627-0.85763012
-0.59329363 -0.2847561 -0.36510665 1.0989168 0.63918754 1.0906048 -1.27937039 0.1998793 0.7247023 -1.1290017 0.1445670060-0.24632929
-0.38257121 -0.3621281 0.64939613 0.4967276 -0.18758101 0.7500797 0.05469088 -0.4528397 0.3417799 -0.4566103 -0.1048941867-0.34604976
-0.79149209 0.2422710 -0.10007938 -0.2774750 -0.19658697 1.1672187 -0.37344033 0.3124707 0.4831523 -0.3503757 -0.1784764647 0.06281320
-0.67934459 -0.3659084 0.39397953 2.2808733 -0.55999543 0.3796897 0.36226436 -0.4463060 0.2818398 -0.2020744 -0.5873487721-0.85766919
-1.24310824 -0.1996018 0.07345773 1.1637386 -0.05079655 0.8884791 -0.97524017 0.3197256 0.5257259 0.1099351 -0.0008333157-0.61148201
0.34263819 -0.1976793 -0.47331686 0.3376224 -0.03721161 0.4047769 -0.30383521 -0.1578112 0.2944318 0.3669648 -0.4855948221-0.09098494
-0.76225885 -0.2100305 -0.38708717 -0.4742901 -0.43361474 0.8603274 -0.41421770 0.5372788 0.4381843 0.8077815 -0.2456443418 0.28357141

In [53]:
dris_i$IBN


  1. 0.948522196187152
  2. 0.825856393351567
  3. 0.650358625375224
  4. 0.64964296776262
  5. 0.382112375674215
  6. 0.377987653212307
  7. 0.616441126930497
  8. 0.513510347231134
  9. 0.291072329621244
  10. 0.487857224143511