One-class classification in R with the oneClass package

Introduction

The purpose of a one-class classifier is identical to the purpose of a supervised binary classifier. New data is assigned to one of two classes based on a classification model trained from labeled samples for which the class membership is known. In contrast to the supervised classifier, the training data of the one-class classifier only contains labeled samples from the class of interest, i.e. the positive class. In the case of the binary classifier, the other class, i.e. the negative class, also has to be represented in the training set. Collecting a representative training set for the negative class can be very costly and time-consuming because the negative class is the aggregation of all classes other than the positive class. Thus, a one-class classifier is particularly useful when only one or a few classes have to be mapped and when the acquisition of representative labeled data for the negative class is expensive or not possible at all.

However, the convenience of not requiring negative training data comes at a price. One-class classification is challenging due to the limited information contained in the training set. For some classification problems, unlabeled training data is needed in order to learn more accurate predictive models. Still, the process remains uncertain and the classification outcome has to be treated with caution.

The package oneClass shall serve the requirements of two potential users, the analyst and the developer. These are extreme archetypes and in reality a user will usually be located somewhere in between. The analyst is faced with a particular one-class classification problem, i.e. a set of positive training samples and the unlabeled data to be classified. It is assumed that no complete and representative test set is available for validation and testing. In such a situation a careful evaluation of the classification outcome based on the available (positive and unlabeled) data is required in order to select the most promising final model and threshold [1]. The function trainOcc() is a wrapper for the train() function of caret [2] which is called with one of the one-class classification methods implemented in the package oneClass. trainOcc() returns an object of class trainOcc which inherits from class train. Thus, the extensive infrastructure of caret is available, such as parallel processing and different methods for pre-processing, resampling, and model comparison.

Furthermore, the oneClass infrastructure comprises one-class classification specific methods, such as performance metrics based on positive and unlabeled data and diagnostic plots, which further support the handling of the one-class classification methods and, in particular, the understanding of their outcome in the absence of representative test data. In this tutorial a one-class classification task is solved step by step in order to show which outcomes should be screened by the analyst in order to detect deficient settings, input data, or model outcomes and to improve the model if necessary. Hopefully the package is helpful for solving one-class classification problems more effectively and conveniently. The developer is interested in developing new methods or optimizing existing ones. The package oneClass builds upon the powerful package caret and adopts its philosophy. The package caret allows the user to embed their own custom functions and performance metrics in its rich infrastructure. Furthermore, convenient functions are available for testing the classifier outcome with positive/negative (PN) test sets.

One-class classifiers

The oneClass package is a user-oriented environment for analyzing one-class classification problems. It implements three commonly used classifiers: the one-class SVM (OCSVM) [3] and the biased SVM (BSVM) [4, 5] via the package kernlab [6], and a one-class classifier based on calculating a density ratio with a maximum entropy approach (MAXENT) [7, 8] via the package dismo [9]. As mentioned before, these classifiers are implemented as custom train() methods using the package caret [2].

The one-class SVM is a P-classifier, i.e. the classification model is trained with positive samples only. Nevertheless, unlabeled samples can be used to calculate PU-performance and support model selection. The biased SVM and MAXENT are PU-classifiers, i.e. they are trained on positive and unlabeled data. P-classifiers are usually computationally less complex than PU-classifiers. However, PU-classifiers often perform better in terms of classification accuracy because the information contained in the unlabeled training data allows models to be built which better fit the particular classification problem to be solved.

PU-performance metrics

As with other pattern recognition and machine learning algorithms, it is crucial to parameterize the one-class classification methods carefully. The parameterization, or model selection, is usually performed via a grid-search. The grid points are combinations of discrete parameter values. The performance of the model is evaluated at all grid points and the parameter combination which optimizes the performance metric is chosen. In the case of supervised classification, performance metrics such as the overall accuracy or the kappa coefficient can be used. Such metrics have to be derived from complete validation data comprising the positive and the negative class. They are therefore unidentifiable in a one-class classification situation.

Some performance metrics have been defined which can be derived from positive and unlabeled data (PU-performance metrics). From PU-data we can estimate two interesting probabilities: from the positive training samples we can estimate the probability of classifying a positive sample correctly, also known as the true positive rate (TPR); from the unlabeled samples we can estimate the probability of classifying a sample as positive, which we call the probability of positive prediction (PPP). Given two models with the same TPR but different PPP, it is valid to say that the model with the lower PPP is more accurate, because the TPR is the same while the false positive rate is necessarily lower. This conclusion is only valid, however, if the TPR can be estimated accurately. Furthermore, it does not answer the question which of a set of models is best when both the TPR and the PPP differ.
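The following minimal sketch illustrates how these two quantities can be estimated for a given threshold. It is not the implementation used by the package (see puSummary() below); the function name and the arguments pred.pos and pred.un (hold-out predictions of the positive and the unlabeled samples) are hypothetical placeholders.

estimateTprPpp <- function(pred.pos, pred.un, th = 0) {
  tpr <- mean(pred.pos > th)  # fraction of positive samples predicted as positive
  ppp <- mean(pred.un  > th)  # fraction of unlabeled samples predicted as positive
  c(tpr = tpr, ppp = ppp)
}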

The PU-performance metrics puF, which is related to the F-score [10], and puAuc, which is related to the area under the receiver operating characteristic curve [8], try to give an answer. Both have been shown to be suitable for ranking models based on PU-data. It is impossible to say which metric is better in a particular situation. Note that puF is based on the TPR and PPP, which are derived for a particular threshold, here zero. It is possible that the threshold with which TPR and PPP are estimated is not optimal, and thus puF can be low even though the model has high discriminative power. puAuc, in contrast, is calculated independently of a particular threshold. In other words, it evaluates the performance over the whole range of possible thresholds. Thus it also considers thresholds which are definitively unsuitable, which might likewise lead to misleading results [11]. Based on these considerations and on experience, it is not recommended to trust these rankings blindly, particularly in challenging classification problems, e.g. with a small number of positive training samples or a potentially unsuitable set of unlabeled samples.

We can reasonably assume that the PU-performance metrics are positively correlated with PN-performance metrics, such as the overall accuracy or the kappa coefficient. But the relationship can be noisy, and in the worst case this could mean that the model with the highest PU metric has very poor discriminative power [1]. The PU metrics should rather be used as helpers for selecting a couple of candidate models, which are then examined more thoroughly. Furthermore, because the PU-performance metrics do not provide information on the absolute accuracy, such as the overall accuracy, they do not reveal whether the model is poor even though it might be the best one of all evaluated models, e.g. because none of the specified parameter settings are suitable. Therefore, it can be useful to also investigate the true positive rate (TPR) and the probability of positive prediction (PPP). These quantities are implemented in the function puSummary() and calculated by default for all models evaluated during model selection.

Installation and parallel processing

The package can be found on GitHub (https://github.com/benmack/oneClass). It can be installed from within R when the package devtools is loaded:

require(devtools)
install_github('benmack/oneClass')

If a parallel backend is registered for the package foreach (http://topepo.github.io/caret/parallel.html), model selection and the prediction of raster data can be performed in parallel. For parallel prediction of raster data the package spatial.tools must also be available. The following code registers a parallel backend for foreach via the package doParallel.


In [1]:
suppressWarnings(suppressMessages(library(oneClass)))

Furthermore, we load the following packages and register a parallel backend such that the grid-search and the predictions are performed in parallel.


In [2]:
suppressWarnings(suppressMessages(library(doParallel)))
suppressWarnings(suppressMessages(library(raster)))
suppressWarnings(suppressMessages(library(RColorBrewer)))

cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)

An illustrative synthetic data set

In the following the package is demonstrated by means of a synthetic data set consisting of two banana-shaped distributions. The bananas data is stored as raster data, where bananas$y is a one-band raster with the class patches, i.e. what we want to find out when performing one-class classification with remotely sensed data. bananas$x contains the features or predictors based on which the classification has to be built.


In [3]:
data(bananas)
options(repr.plot.width=7, repr.plot.height=1.65)
plot(stack(bananas$y, bananas$x), nc=3)


In one-class classification a set of labeled positive samples is available for training the classifier. Furthermore, unlabeled samples can be used, which are usually a random sample of the whole data. Such a PU-training set is also stored in bananas$tr.

In order to make the results comparable we create a list with resampling partitions (tr.index). Note that if we do not explicitly pass such a list, the partitions differ because they are generated randomly every time trainOcc() is run. Additionally, we generate a test data set consisting of 1,000 random samples of the image.


In [4]:
seed <- 123456
tr.x <- bananas$tr[, -1]
tr.y <- puFactor(bananas$tr[, 1], positive=1)
set.seed(seed)
tr.index <- createFolds(tr.y, k=10, returnTrain=TRUE)
set.seed(seed)
te.i <- sample(ncell(bananas$y), 1000)
te.x <- extract(bananas$x, te.i)
te.y <- extract(bananas$y, te.i)

The two-dimensional synthetic data set can be visualized in the feature space. A one-class classifier is supposed to learn from P-data (LEFT) or PU-data (MIDDLE) in order to find an optimal model for separating the PN-data (RIGHT). Recall that even if we use a P-classifier (e.g. the one-class SVM), which is trained with P-data only, it is advisable to make use of PU-data for model selection, as we will do in the next section.


In [5]:
options(repr.plot.width=7, repr.plot.height=2.5)
par(mfrow=c(1,3), mar=c(4.1, 4.1, 1.1, 1.1))
plot(tr.x[tr.y=="pos", ], pch=16 )
plot(tr.x, pch=ifelse(tr.y=="pos", 16, 4) )
plot(te.x, pch=ifelse(te.y==1, 16, 1) )


Finding a suitable one-class classification model

One-class SVM with default settings

Let us first try to solve the classification task using the OCSVM and the default settings. We simply pass the training data to trainOcc(). A ten-fold cross-validation is performed over a pre-defined grid, the model with the highest puF value is selected, and this model is trained on the whole training data. Then we predict the whole image with the fitted model. By default the prediction returns the continuous decision values. In the case of the OCSVM, zero can be considered the natural threshold for deriving a binary classification result. Note that this does not mean that zero is always the optimal threshold!


In [6]:
ocsvm.fit <- trainOcc(x=tr.x, y=tr.y, method="ocsvm", index=tr.index)
ocsvm.fit.def <- ocsvm.fit # store for comparison
ocsvm.pred <- predict(ocsvm.fit, bananas$x)
ocsvm.bin <- ocsvm.pred>0


Aggregating results
Selecting tuning parameters
Fitting sigma = 1, nu = 0.01 on full training set

If we need to evaluate the classification result without PN-reference data, the following histogram plot can help us to assess the outcome.


In [7]:
options(repr.plot.width=14, repr.plot.height=3.5)
par(mfrow=c(1, 3))
hist(ocsvm.fit, ocsvm.pred, th=0, noWarnRasHist=TRUE)
plot(ocsvm.pred, col=brewer.pal(9, "RdBu"))
plot(ocsvm.pred>0, col=brewer.pal(9, "RdBu")[c(2,9)])


For two-dimensional data it is also possible to plot the model in the feature space. The black line shows the threshold applied to derive the binary classification.


In [8]:
options(repr.plot.width=6, repr.plot.height=5)
featurespace(ocsvm.fit, th=0)


The diagnostic histogram plot is a simple yet informative plot. As long as no complete and representative test data is available, it is recommended to always analyze the plot carefully before accepting any one-class classification result. When interpreting this plot keep Bayes’ Theorem in mind.

The diagnostic plot shows evidence that the class of interest can be separated moderately well from the negative class. The positive hold-out predictions (dark blue boxplot, from now on positive predictions) are located at the upper tail of the predictive values. Their median corresponds relatively well with the local maximum of the classifier output histogram. It can therefore be concluded that (most of) the data building the right data cluster belongs to the positive class. However, it is clear that the positive data overlaps significantly with the negative class. Thus, a significant amount of mis-classification has to be accepted, wherever we place the final binarization threshold. If there was no or only a very small class overlap, a very low density region would exist between two (very well separated) data clusters, one built up by the negative class at lower predictive values and the other built up by the positive class.

However, the nature of the training data and the model complexity have to be taken into account when reasoning with the cross-validated predictions. The light blue boxplot shows the calibration predictions, i.e. the predictions for the positive training data from the model trained on the full training data. Knowing how support vector machines work, we can imagine that most of the training samples are important for the support of the decision boundary. During cross-validation the predictive values for the training samples are derived from models trained without the samples to be predicted, which, given the number of samples and the complexity of the model, obviously leads to biased cross-validation predictions. Most likely the real distribution of the predictions of (all) the positive class samples lies somewhere in between the two boxplots.

The feature space plot also reveals the weakness of the OCSVM (and any other classifier which is trained on the positive data only). The decision boundary should be tight in those regions where the negative class overlaps with the positive class, in order to avoid a high amount of false positives. Note particularly the single training sample located in the inner curvature of the negative class/banana. This point is obviously located in a region where the positive class has low density but the negative class has very high density. A P-classifier cannot learn anything about the density of the negative class, while a PU-classifier can derive such information from the unlabeled training samples. Ideally the decision boundary would be less tight in regions where no negative data lives, e.g. at the right side of the feature space. Then the positive data from low density areas could also be classified correctly without any additional false positive classifications. But due to the fact that the OCSVM model has only seen positive data during training, it cannot be aware of such differences. We will see later that a PU-classifier, such as the BSVM, can learn this from the unlabeled data.

Revising the parameter space

For now, let us try to find a better OCSVM model. Note that the analytic procedure is the same for the other methods. First, it is a good idea to check whether the grid of parameter values has been reasonable. We can use methods from the package caret to visualize the dependency of the PU-performance on the parameters. The grid shows that puF drops sharply at sigma values smaller and greater than one and reaches its highest values at the upper end of the nu range. It is thus possible that a finer grid around sigma==1 contains more powerful models. nu-values below 0.01 do not make sense, but otherwise it would also be a good idea to extend the grid in this direction of the parameter space. Let us re-run trainOcc() with a customized grid and visualize it again. Note that there are now several models reaching higher puF values (max. puF in the customized/default grid: 5.81/3.62).


In [9]:
tuneGrid <- expand.grid( sigma = seq(.1, 2, .1), nu = seq(.05, .5, .05) )
ocsvm.fit <- trainOcc(x=tr.x, y=tr.y, method="ocsvm", tuneGrid=tuneGrid, index=tr.index)
ocsvm.pred <- predict(ocsvm.fit, bananas$x)


Aggregating results
Selecting tuning parameters
Fitting sigma = 0.9, nu = 0.45 on full training set

In [10]:
options(repr.plot.width=4, repr.plot.height=3)
trellis.par.set(caretTheme()) # nice colors from caret
plot(ocsvm.fit.def, plotType="level") # see ?plot.train for other plot types
plot(ocsvm.fit, plotType="level")


The diagnostic histogram plot of the new model...


In [11]:
options(repr.plot.width=14, repr.plot.height=3.5)
par(mfrow=c(1, 3))
hist(ocsvm.fit, ocsvm.pred, th=0, noWarnRasHist=TRUE)
plot(ocsvm.pred, col=brewer.pal(9, "RdBu"))
plot(ocsvm.pred>0, col=brewer.pal(9, "RdBu")[c(2,9)])


... looks more convincing because the positive data cluster appears more distinct from the rest of the data. But it also reveals that the threshold is unlikely to be located at an optimal position. With Bayes’ Theorem in mind and assuming that the positive data mainly builds the right cluster, we would intuitively set the threshold in the middle, or at the minimum, of the low density area. It is reasonable to believe that the amount of false positives increases less than the amount of true positives when moving the threshold to this point.
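A possible way to formalize this intuition is sketched below: the minimum of a kernel density estimate of the predictions within the valley is used as an alternative binarization threshold. This is not part of the original tutorial, and the valley search range used here is an assumption that should be adapted after inspecting the actual histogram.

pred.vals <- na.omit(values(ocsvm.pred))            # raster predictions as a vector
d <- density(pred.vals)                             # kernel density estimate of the predictions
in.valley <- d$x > -0.5 & d$x < 0.5                 # assumed location of the low-density valley
th.valley <- d$x[in.valley][which.min(d$y[in.valley])]
ocsvm.bin.valley <- ocsvm.pred > th.valley          # re-binarize with the alternative threshold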

Note that it is more difficult to evaluate and compare the two models based on the predicted images alone. The diagnostic plot provides additional information which is also easy to understand and interpret.

Manual model selection (OCSVM)

So far, we trusted the PU-performance metric and selected the model with the highest PU-performance for further evaluation. As mentioned before, the PU-performance metrics are usually in a positive but sometimes noisy relationship with the unidentifiable PN-performance metrics. Possibly, better models can be identified by comparing the diagnostic plots of other models. Of course, it is too laborious to compare all models in this way, because it requires predicting all (or a large fraction) of the unlabeled data to construct the histogram, and a huge number of models might be considered during model selection, e.g. in the refined grid defined before. It is plausible to select the models with the highest PU-performance metrics first, because we expect a positive relationship between the PU-performance metrics and the PN-performance metrics. The model selection table stores the model parameters and performance metrics and can be used for such a ranking. It is stored in ocsvm.fit$results and is printed to the console when printing the trainOcc object.


In [12]:
head(ocsvm.fit$results)


  sigma   nu  tpr        puP       ppp puAuc      puF      puF1 pn     tprSD      puPSD      pppSD    puAucSD     puFSD     puF1SD pnSD
1   0.1 0.05 0.75  0.0655955 0.4365385  0.68 1.388352 0.1204735  1 0.2635231 0.01901537 0.06540347 0.08969083 0.8056832 0.03542164    0
2   0.1 0.10 0.75  0.0655955 0.4365385 0.681 1.388352 0.1204735  1 0.2635231 0.01901537 0.06540347 0.09134185 0.8056832 0.03542164    0
3   0.1 0.15 0.75 0.06615495 0.4307692 0.678 1.395625 0.1214172  1 0.2635231 0.01830457 0.06293814 0.09330952 0.7974872 0.03423113    0
4   0.1 0.20 0.75 0.06755872 0.4211538 0.686 1.426155 0.1237974  1 0.2635231 0.01858636 0.05693953 0.09547542 0.8141189 0.03476515    0
5   0.1 0.25 0.75 0.07318647 0.3980769 0.683 1.647761 0.1328726  1 0.3535534 0.02986927 0.08110938   0.094757  0.932582 0.05456969    0
6   0.1 0.30 0.60 0.06649186 0.3365385 0.707  1.38626 0.1194121  1 0.3944053 0.03931212 0.07967822 0.09357706  1.110593 0.07119967    0

A sorted version of this table can also be derived and used to compare the characteristics of the models, e.g. those with the highest puAuc values. Below we see the table entries for the ten models with the highest puAuc values. The numbers at the very left are the row names and correspond to the rows in 'ocsvm.fit$results'. The first two columns show the model parameters (sigma and nu), followed by the estimated true positive rate (tpr), the PU-precision (puP), the probability of positive prediction (ppp), and the performance metrics. Note that the PU-performance metrics do not return the same ranking.


In [13]:
sort(ocsvm.fit, digits=2, by = 'puAuc', rows = 1:10, cols =1:8)


    sigma   nu  tpr  puP   ppp puAuc puF puF1
70    0.7 0.50 0.45 0.23 0.063  0.87 4.5 0.29
50    0.5 0.50 0.40 0.19 0.069  0.87 3.1 0.24
100   1.0 0.50 0.40 0.21 0.062  0.87 3.7 0.27
60    0.6 0.50 0.45 0.22 0.065  0.87 4.2 0.28
79    0.8 0.45 0.50 0.24 0.073  0.87 5.0 0.31
90    0.9 0.50 0.45 0.23 0.063  0.87 4.5 0.29
110   1.1 0.50 0.40 0.21 0.063  0.87 3.7 0.27
120   1.2 0.50 0.35 0.19 0.063  0.87 3.4 0.24
130   1.3 0.50 0.35 0.19 0.063  0.87 3.4 0.24
48    0.5 0.40 0.50 0.19 0.085  0.87 4.2 0.27

Note that modelPosition() can also be used to get the row/id of a model for a specific rank according to a specific metric:


In [14]:
mp <- modelPosition(ocsvm.fit, modRank=3, by="puAuc")
mp$param
mp$row
mp$rank
mp$metric


    sigma  nu
100     1 0.5
100
3
"puAuc"

We can now investigate the diagnostic plots of manually selected models, which might possibly lead to a better final model. The diagnostic plot of another model can easily be created by

  • updating the trainOcc object such that the final model is the manually selected one,
  • predicting the unlabeled samples with the updated model,
  • creating a new diagnostic plot.

For example, for the model with the highest puAuc value, i.e. model 70:


In [15]:
ocsvm.fit <- update(ocsvm.fit, modRow=70)
ocsvm.pred <- predict(ocsvm.fit, bananas$x)
options(repr.plot.width=14, repr.plot.height=3.5)
par(mfrow=c(1, 3))
hist(ocsvm.fit, ocsvm.pred, th=0, noWarnRasHist=TRUE)
plot(ocsvm.pred, col=brewer.pal(9, "RdBu"))
plot(ocsvm.pred>0, col=brewer.pal(9, "RdBu")[c(2,9)])


So far we have shown the most important functions and plots required for model selection and model evaluation. The same procedure can be followed when working with one of the other classifiers. In the next section we will solve the classification problem with BSVM and MAXENT, i.e. PU-classifiers, and compare the outcome with the OCSVM.

Comparison to a PU-classifier: BSVM

It has been argued before that PU-classifiers often outperform P-classifiers. In the following we investigate the diagnostic plots of the (PU-)best BSVM models.

Note that by default puF is maximized for selecting the BSVM model while puAuc is used to select the MAXENT model. Furthermore, the MAXENT model does not have a "natural" threshold such as the SVM methods. In order to derive the ppp, tpr and puF, the threshold maximizing the sum of sensitivity and specificity (maxSSS) is used as suggested in [12]. This and other commonly used thresholds can be derived from a fitted MAXENT model.
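For completeness, a MAXENT model could be fitted analogously to the SVM-based methods; the call below is a sketch and is not evaluated in this tutorial. The method identifier "maxent" is an assumption (check getModelInfoOneClass() for the exact name), and fitting it additionally requires the MaxEnt software used by the package dismo to be available.

maxent.fit <- trainOcc(x=tr.x, y=tr.y, method="maxent", index=tr.index)  # method name assumed
maxent.pred <- predict(maxent.fit, bananas$x)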


In [16]:
bsvm.fit <- trainOcc(x=tr.x, y=tr.y, index=tr.index)  # the default method fits the biased SVM (see the cNeg/cMultiplier parameters below)


Aggregating results
Selecting tuning parameters
Fitting sigma = 0.1, cNeg = 4, cMultiplier = 16 on full training set

In [17]:
options(repr.plot.width=14, repr.plot.height=3.5)
for (rank in 1:4) {
    bsvm.fit <- update(bsvm.fit, modRank=rank, metric="puF")
    bsvm.pred <- predict(bsvm.fit, bananas$x)
    par(mfrow=c(1, 3))
    hist(bsvm.fit, bsvm.pred, th=0, ylim=c(0,.15), noWarnRasHist=TRUE)
    plot(bsvm.pred, col=brewer.pal(9, "RdBu"))
    plot(bsvm.pred>0, col=brewer.pal(9, "RdBu")[c(2,9)])
}


Based on these plots we would probably choose the 4th-ranked BSVM model. The low-density area between the positive and the negative data cluster is wide and the density within it is low. Furthermore, the positive hold-out predictions are for the most part to the right of the default threshold.

The feature space plot confirms the suitability of the model and reveals the advantage of the PU-classifier BSVM compared to the P-classifier OCSVM. Towards the negative class the decision boundary is fitted more tightly, thus avoiding large amounts of false positives, while it is more relaxed in the other directions, thus avoiding unnecessary false negatives.


In [18]:
options(repr.plot.width=6, repr.plot.height=5)
featurespace(bsvm.fit, th=0)


A direct comparison between the BSVM and the OCSVM histogram plots shows the deeper low-density valley of the BSVM.


In [19]:
options(repr.plot.width=14, repr.plot.height=3.5)
par(mfrow=c(1, 2))
hist(bsvm.fit, bsvm.pred, th=0, ylim=c(0,.2), noWarnRasHist=TRUE)
hist(ocsvm.fit, ocsvm.pred, th=0, ylim=c(0,.2), noWarnRasHist=TRUE)


Accuracy assessment with PN-data

In many situations PN-data is available for an objective accuracy assessment, particularly when optimizing existing methods or developing new ones, such as one-class classifiers, model selection, or threshold selection techniques. Depending on the objective we might want to

  • know the threshold dependent accuracy for a particular model,
  • know the accuracy for a particular model and threshold,
  • compare the discriminative performance of all models considered during model selection with the PU-metrics.

With evaluateOcc() an object of class trainOcc can be evaluated thoroughly. The core function of evaluateOcc() is evaluate() from the package dismo.


In [20]:
bsvm.ev <- evaluateOcc(bsvm.fit, te.u=te.x, te.y=te.y, positive=1)

We can plot the threshold dependent accuracy for a particular model over the histogram plot:


In [21]:
options(repr.plot.width=7, repr.plot.height=3.5)
th.sel <- 0
th.opt <- slot(bsvm.ev, "t")[which.max(slot(bsvm.ev, "kappa"))]
hist(bsvm.fit, bsvm.pred, ylim=c(0, 0.5), col="grey", border=NA, noWarnRasHist=TRUE)
plot(bsvm.ev, add=TRUE, yLimits=c(0, 0.5))
abline(v=c(th.opt, th.sel), lwd=2)


Print the confusion matrix of a specific threshold:


In [22]:
evaluateAtTh(bsvm.ev, th=0)


Positive/negative (+/-) test samples:
 105 / 895 

Confusion Matrix at threshold -0.0113 : 

          + (Test) - (Test)  SUM UA[%]
+ (Pred)        94       15  109    86
- (Pred)        11      880  891    99
SUM            105      895 1000      
PA[%]           90       98           
---                                   
OA[%]           97                    
AUC[*100]       99                    
K[*100]         86                    

Compare the discriminative performance (e.g. in terms of the maximum achievable kappa) of all models considered during model selection with the PU-metrics.

This might be useful to evaluate existing or new PU-performance metrics. If we compare the best achievable accuracy of OCSVM and BSVM based on the final model alone, we must be aware that we also evaluate a particular model selection approach, which is not necessarily desired when comparing the discriminative power of the classifiers.

Therefore, we need to derive the threshold-dependent accuracy for all models. When setting the parameter allModels=TRUE, an object is returned with the PU-metrics and the threshold-dependent PN-metrics.


In [23]:
ocsvm.ev <- evaluateOcc(ocsvm.fit, te.u=te.x, te.y=te.y, allModels=TRUE, positive=1)
bsvm.ev <- evaluateOcc(bsvm.fit, te.u=te.x, te.y=te.y, allModels=TRUE, positive=1)
bsvm.ev.list <- print(bsvm.ev, invisible=TRUE)
ocsvm.ev.list <- print(ocsvm.ev, invisible=TRUE)

As we can see, the BSVM has slightly higher discriminative power than the OCSVM. Usually, the lower the class separability, the better the BSVM will perform relative to the OCSVM. Furthermore, we see that there is a correlation between the PU- and the PN-performance metrics. But the manually selected BSVM model is a better choice than the model maximizing puF.


In [24]:
sel.model.bsvm <- 79
par(mfrow=c(1, 2), mar=c(4.1, 4.1, 1.1, 1.1))
plot(ocsvm.ev.list$puF, ocsvm.ev.list$mxK.K, xlab="puF", ylab="max. Kappa", ylim=c(0,1), xlim=c(0,6))
abline(h=max(ocsvm.ev.list$mxK.K))
abline(h=max(bsvm.ev.list$mxK.K), col="grey")
plot(bsvm.ev.list$puF, bsvm.ev.list$mxK.K, xlab="puF", ylab="max. Kappa", ylim=c(0,1), xlim=c(0,6))
abline(h=max(bsvm.ev.list$mxK.K))
abline(h=max(ocsvm.ev.list$mxK.K), col="grey")
points(bsvm.ev.list$puF[sel.model.bsvm], bsvm.ev.list$mxK.K[sel.model.bsvm], col="red", pch=3, lwd=4)


Summary

The package oneClass serves a double purpose. It should provide an environment for solving real-world one-class classification problems in the absence of complete and representative PN-data. Particularly the diagnostic plot can be helpful to examine, approve, or improve particular model or threshold selection outcomes.

The package builds upon the powerful infrastructure of the caret package. The defined one-class classifiers (see getModelInfoOneClass()) and PU-performance metrics are custom functions defined to be passed to train(). Thus, the package can be extended easily, e.g. by defining new custom one-class classifiers or PU-performance metrics. In order to define a custom one-class method, a good starting point is to learn about defining custom functions for the function train() (see http://caret.r-forge.r-project.org/custom_models.html) and to use one of the already defined method lists as a template (see getModelInfoOneClass()).
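As an illustration of this extension mechanism, the following skeleton sketches a custom one-class method list in the format expected by caret's train(). It is a hypothetical, simplified example wrapping the one-class SVM of kernlab and returning hard class labels only; the method lists shipped with oneClass are more elaborate, e.g. they expose the continuous decision values needed by the PU-performance metrics. The class level names "pos" and "un" follow the puFactor() convention used in this tutorial and are otherwise illustrative.

library(kernlab)

myOcsvm <- list(
  label      = "Custom one-class SVM (illustrative)",
  library    = "kernlab",
  type       = "Classification",
  # tuning parameters and their types
  parameters = data.frame(parameter = c("sigma", "nu"),
                          class     = c("numeric", "numeric"),
                          label     = c("sigma", "nu")),
  # default tuning grid
  grid       = function(x, y, len = 3, search = "grid")
                 expand.grid(sigma = 10^seq(-2, 1, length.out = len),
                             nu    = seq(0.05, 0.5, length.out = len)),
  # fit the model on the positive training samples only
  fit        = function(x, y, wts, param, lev, last, weights, classProbs, ...)
                 ksvm(as.matrix(x[y == "pos", , drop = FALSE]),
                      type = "one-svc", kernel = "rbfdot",
                      kpar = list(sigma = param$sigma), nu = param$nu),
  # return hard class labels by thresholding the decision values at zero
  predict    = function(modelFit, newdata, submodels = NULL) {
                 dec <- predict(modelFit, as.matrix(newdata), type = "decision")
                 factor(ifelse(dec > 0, "pos", "un"), levels = c("pos", "un"))
               },
  prob       = NULL,
  sort       = function(x) x
)

With caret, such a list can be passed directly as the method argument of train(); the method lists returned by getModelInfoOneClass() should follow the same structure and are the recommended templates for a method that integrates with the PU-performance metrics.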

References

[1] Benjamin Mack, Ribana Roscher, and Björn Waske. “Can I Trust My One-Class Classification?” In: Remote Sensing 6.9 (2014), pp. 8779–8802. ISSN : 2072-4292. DOI : 10.3390/rs6098779. URL : http://www.mdpi.com/2072-4292/6/9/8779.

[2] Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, and the R Core Team. caret: Classification and Regression Training. R package version 6.0-24. 2014. URL : http://CRAN.R-project.org/package=caret.

[3] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. “Estimating the Support of a High-Dimensional Distribution”. In: Neural Computation 13.7 (2001), pp. 1443–1471. ISSN : 0899-7667. DOI : 10.1162/089976601750264965.

[4] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. “Partially Supervised Classification of Text Documents”. In: Proceedings of the Nineteenth International Conference on Machine Learning (ICML). 2002, pp. 387–394.

[5] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. “Building text classifiers using positive and unlabeled examples”. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM). 2003, pp. 179–188.

[6] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. “kernlab – An S4 Package for Kernel Methods in R”. In: Journal of Statistical Software 11.9 (2004), pp. 1–20. URL : http://www.jstatsoft.org/v11/i09/.

[7] Jane Elith, Steven J. Phillips, Trevor Hastie, Miroslav Dudík, Yung En Chee, and Colin J. Yates. “A statistical explanation of MaxEnt for ecologists”. In: Diversity and Distributions 17.1 (2011), pp. 43–57.

[8] Steven J. Phillips and Miroslav Dudík. “Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation”. In: Ecography 31.2 (2008), pp. 161–175.

[9] Robert J. Hijmans, Steven Phillips, John Leathwick, and Jane Elith. dismo: Species distribution modeling. 2013. URL : http://CRAN.R-project.org/package=dismo.

[10] Wee Sun Lee and Bing Liu. “Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression”. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML). 2003.

[11] Jorge M. Lobo, Alberto Jiménez-Valverde, and Raimundo Real. “AUC: a misleading measure of the performance of predictive distribution models”. In: Global Ecology and Biogeography 17.2 (2008), pp. 145–151. ISSN : 1466-822X. DOI : 10.1111/j.1466-8238.2007.00358.x.

[12] Canran Liu, Matt White, Graeme Newell, and Richard Pearson. “Selecting thresholds for the prediction of species occurrence with presence-only data”. In: Journal of Biogeography 40.4 (2013), 778–789. ISSN : 03050270. DOI : 10.1111/jbi.12058.


In [25]:
# Session Info
sessionInfo()


R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 10586)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] RColorBrewer_1.1-2 raster_2.5-2       sp_1.2-3           doParallel_1.0.10 
 [5] iterators_1.0.8    foreach_1.4.3      oneClass_1.0       kernlab_0.9-24    
 [9] pROC_1.8           caret_6.0-68       ggplot2_2.1.0      lattice_0.20-33   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5         compiler_3.2.4      nloptr_1.0.4       
 [4] plyr_1.8.3          base64enc_0.1-3     tools_3.2.4        
 [7] digest_0.6.9        lme4_1.1-12         uuid_0.1-2         
[10] jsonlite_0.9.19     evaluate_0.8        nlme_3.1-124       
[13] gtable_0.2.0        mgcv_1.8-12         Matrix_1.2-4       
[16] IRdisplay_0.3       rgdal_1.1-10        IRkernel_0.6       
[19] SparseM_1.7         dismo_1.0-15        mmap_0.6-12        
[22] repr_0.4            stringr_1.0.0       MatrixModels_0.4-1 
[25] stats4_3.2.4        grid_3.2.4          nnet_7.3-12        
[28] R6_2.1.2            pbdZMQ_0.2-1        minqa_1.2.4        
[31] reshape2_1.4.1      car_2.1-2           magrittr_1.5       
[34] spatial.tools_1.4.8 scales_0.4.0        codetools_0.2-14   
[37] MASS_7.3-45         splines_3.2.4       abind_1.4-3        
[40] pbkrtest_0.4-6      colorspace_1.2-6    quantreg_5.24      
[43] stringi_1.0-1       munsell_0.4.3
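
Finally, the parallel cluster registered at the beginning of the tutorial can be released again. This step is not shown in the original notebook; stopCluster() is provided by the package parallel, which is attached together with doParallel.

stopCluster(cl)  # release the workers registered with registerDoParallel()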