Exploring the Pareto front for OCC model and threshold selection

The idea:

In one-class classification we only have positive and unlabeled data for model selection.

It is reasonable to assume that, for a given true positive rate (which we can estimate from the training data), the fewer positive predictions a model returns for the unlabeled data, the better it is.

It may therefore help in an OCC application to explore the Pareto front over all models and thresholds for the two conflicting goals of a high true positive rate and a low probability of positive predictions.
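
To make "better" precise: a model/threshold candidate dominates another one if its true positive rate is at least as high and its probability of positive predictions is at most as high, and it is strictly better in at least one of the two. The Pareto front consists of all candidates that are not dominated by any other candidate. A minimal sketch of the dominance relation (plain R, for illustration only):

# a candidate (tpr1, ppp1) dominates (tpr2, ppp2) if it is at least as good in
# both objectives (higher tpr, lower ppp) and strictly better in at least one
dominates <- function(tpr1, ppp1, tpr2, ppp2) {
  (tpr1 >= tpr2 && ppp1 <= ppp2) && (tpr1 > tpr2 || ppp1 < ppp2)
}

dominates(0.9, 0.3, 0.8, 0.4)  # TRUE:  better in both goals
dominates(0.9, 0.5, 0.8, 0.4)  # FALSE: higher tpr but also higher ppp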

Let's get some toy data and try...


In [31]:
require(oneClass)
require(rPref)
require(dplyr)
require(ggplot2)
require(doParallel)
registerDoParallel(cores=4)

data(bananas)
trs <- bananas_trainset(nP=50, nU=200)
plot(trs[, -1], pch=c(1, 3)[as.numeric(trs$y)],
     col=c("darkred", "darkblue")[as.numeric(trs$y)])


Let's get some cross-validated decision values for a couple of biased SVM models.


In [32]:
set.seed(1)
index <- createFolds(trs$y, returnTrain=T)
sapply(index, function(i) table(trs$y[i]))


    Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
un     180    180    180    180    180    180    180    180    180    180
pos     45     45     45     45     45     45     45     45     45     45

In [33]:
fit <- trainOcc(x=trs[, -1], y=trs$y, index=index)


Aggregating results
Selecting tuning parameters
Fitting sigma = 1, cNeg = 32, cMultiplier = 16384 on full training set

The cross-validated decision values (distances in the case of the BSVM) for all models are stored here:


In [34]:
head(fit$pred)


  pred obs          un         pos rowIndex sigma         cNeg cMultiplier Resample
1  pos pos  0.03074236  0.03074236        6   0.1 0.0009765625           4   Fold01
2  pos pos  0.01761033  0.01761033       27   0.1 0.0009765625           4   Fold01
3   un pos -0.01319288 -0.01319288       31   0.1 0.0009765625           4   Fold01
4   un pos -0.01298458 -0.01298458       36   0.1 0.0009765625           4   Fold01
5  pos pos  0.03785077  0.03785077       45   0.1 0.0009765625           4   Fold01
6  pos  un  0.01300316  0.01300316       58   0.1 0.0009765625           4   Fold01

The results on which we usually base model selection are stored here:


In [35]:
head(fit$results)


  sigma         cNeg cMultiplier  tpr       puP   ppp puAuc      puF      puF1 pn    tprSD    puPSD     pppSD    puAucSD     puFSD    puF1SD pnSD
1   0.1 0.0009765625           4 0.82 0.4597114 0.384 0.867 1.941154  0.577553  1 0.175119 0.160626 0.1018932  0.1162421 0.8428273 0.1493695    0
2   0.1 0.0009765625          16    1       0.2     1 0.907        1 0.3333333  0        0        0         0 0.08769265         0         0    0
3   0.1 0.0009765625          64    1       0.2     1 0.892        1 0.3333333  0        0        0         0 0.09647107         0         0    0
4   0.1 0.0009765625         256    1       0.2     1 0.917        1 0.3333333  0        0        0         0   0.125614         0         0    0
5   0.1 0.0009765625        1024    1       0.2     1 0.917        1 0.3333333  0        0        0         0   0.125614         0         0    0
6   0.1 0.0009765625        4096    1       0.2     1 0.917        1 0.3333333  0        0        0         0   0.125614         0         0    0

But the performance metrics, e.g. puF, are only evaluated at the separating hyperplane, i.e. at a threshold of 0. We might want to consider other thresholds and calculate the true positive rate (tpr) and the probability of positive predictions (ppp) over all models and many thresholds (by default 50, spread over the range of the returned decision values).
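
To make explicit what is computed at a single threshold th, here is a minimal sketch of the two metrics (not necessarily identical to the oneClass implementation, which may e.g. define ppp over all held-out samples instead of only the unlabeled ones):

# illustration: tpr and ppp at one threshold, computed from the held-out
# decision values of a single model (column names as in fit$pred)
pu_metrics_at_th <- function(pred_one_model, th) {
  dv <- pred_one_model$pos                               # decision values
  c(tpr = mean(dv[pred_one_model$obs == "pos"] >= th),   # true positive rate
    ppp = mean(dv[pred_one_model$obs == "un"] >= th))    # prob. of positive prediction
}
# e.g. for the first model of the parameter grid at the hyperplane:
# pu_metrics_at_th(subset(fit$pred, sigma==0.1 & cNeg==0.0009765625 & cMultiplier==4), th=0)

The next two cells do this, via puSummaryThLoop, for every model and threshold: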


In [36]:
params <- fit$modelInfo$parameters$parameter
# group the held-out predictions by model, i.e. by tuning parameter combination
idx <- group_indices_(
  fit$pred, 
  .dots=as.character(params))
u_idx <- sort(unique(idx))

We want to keep track of each model's position (row) in fit$results, which we can do like this:


In [37]:
res <- foreach(i = u_idx, .packages="oneClass") %dopar% {
    df <- fit$pred[idx==i, ]  # held-out predictions of one model
    # remember the row of this model in fit$results
    modrow <- modelPosition(fit, modParam=df[1, as.character(params)])$row
    cbind(modrow=modrow, puSummaryThLoop(df, returnAll=TRUE))
}
res <- do.call(rbind, res)
head(res)


  modrow          th  tpr       puP   ppp      puF      puF1 pn
1      1 -0.03546184    1 0.2380952  0.84 1.190476 0.3846154  1
2      1 -0.03396287    1      0.25   0.8     1.25       0.4  1
3      1  -0.0324639    1 0.2631579  0.76 1.315789 0.4166667  1
4      1 -0.03096493    1 0.2762431 0.724 1.381215 0.4329004  1
5      1 -0.02946597    1 0.2890173 0.692 1.445087 0.4484305  1
6      1   -0.027967 0.98 0.2969697  0.66 1.455152  0.455814  1

Now we have the metrics not only at one threshold but at 50 thresholds per model, which makes 12600 possible model/threshold candidates.
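
A quick sanity check of that number (assuming the default of 50 thresholds per model):

50 * nrow(fit$results)  # 50 thresholds for each of the 252 models = 12600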


In [38]:
nrow(res)


12600

How should we select a solution, knowing that we should use puF and similar PU-based metrics with caution? First, we can reduce the number of solutions by keeping only the non-dominated ones, i.e. the solutions on the Pareto front. These turn out to be far fewer:


In [39]:
get_nondominated_solutions <- function(x, rm_duplicated=TRUE, verbose=TRUE) {
    # non-dominated solutions for the two conflicting goals: high tpr, low ppp
    nondominated <- psel(x, high(tpr) * low(ppp))
    n_all <- nrow(nondominated)
    if (rm_duplicated) {
        # there are lots of duplicates with equal tpr and ppp values which we remove
        nondominated <- nondominated[!duplicated(nondominated[, c("tpr", "ppp")]), ]
    }
    n_no_duplicates <- nrow(nondominated)
    # order by descending tpr (ties broken by ascending ppp)
    nondominated <- nondominated[order(-nondominated$tpr, nondominated$ppp), ]
    if (verbose)
        cat(sprintf("Solutions with / without duplicates: %d / %d", n_all, n_no_duplicates))
    return(nondominated)
}

In [40]:
nondominated <- get_nondominated_solutions(res)


Solutions with / without duplicates: 472 / 33

And the non-dominated solutions live in only a small fraction of the models.


In [41]:
nondominated_models <- nondominated$modrow
nrow(fit$results)
length(unique(nondominated$modrow))


252
20

The following plot shows all solutions (no color), the solutions on the Pareto front (blue), the solutions at th=0 (red), and the model selected among the th=0 solutions (green):


In [42]:
ggplot(res, aes(x = ppp, y = tpr)) + 
  geom_point(shape = 21, size=.75) + 
  geom_point(data=nondominated, size = 2, col="blue") +
  geom_point(data=fit$results, size = 1, col="red") + 
  geom_point(data=fit$results[modelPosition(fit)$row, , drop=F], size = 1, col="green")


And how is that useful?

1) Automatic model selection with whatever PU-based optimization criterion might benefit from considering only the non-dominated solutions, since this reduces the risk of ending up with a dominated, i.e. needlessly poor, solution (see the sketch below).

2) We can investigate whether a subset of the models already contains a similar Pareto front.

3) If we manually analyze different models, e.g. by plotting the histograms of their decision values, it reduces the number of models to be investigated to a manageable size.
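
For example 1):

A minimal sketch of how an automatic selection could be restricted to the Pareto front. For illustration we rank the non-dominated solutions by puF (any other PU-based criterion could be plugged in); the point is that the search space shrinks from 12600 model/threshold candidates to the 33 non-dominated ones.

# pick, among the non-dominated solutions only, the one with the highest puF
best <- nondominated[which.max(nondominated$puF), ]
best[, c("modrow", "th", "tpr", "ppp", "puF")]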

For example 2):

Let us compare the non-dominated solutions over all models and thresholds given a fixed cMultiplier. The result is interesting! If we search for our best solution among all thresholds (and not only at th=0, as is usually done), we can skip fitting models for different cMultiplier values and thus need to fit far fewer models. In our case:


In [43]:
nValues <- sapply(c("sigma", "cNeg", "cMultiplier"), function(p) length(unique(fit$results[, p])))
cat(sprintf("# models given a fixed cMultiplier / # all models: %d / %d", prod(nValues[1:2]), prod(nValues)))


# models given a fixed cMultiplier / # all models: 36 / 252

In [49]:
cMultipliers <- sort(unique(fit$results$cMultiplier))
# the Pareto front computed separately for the models with each cMultiplier value
nondominated_cM <- lapply(cMultipliers, function(cM) 
    get_nondominated_solutions(res[res$modrow %in% which(fit$results$cMultiplier==cM), ], verbose=FALSE))
# overall front (blue points) vs. the per-cMultiplier fronts (grey lines)
plot(nondominated$ppp, nondominated$tpr, xlab="ppp", ylab="tpr", col="blue", lwd=2)
for (ele in nondominated_cM)
    lines(ele$ppp, ele$tpr, col=rgb(0, 0, 0, alpha=.3), lwd=3)


For example 3):

Is the model selected at th=0 among the models that contribute non-dominated solutions?


In [45]:
hist(fit, predict(fit, bananas$x[]), th=0)



In [46]:
unique(nondominated$modrow)
modelPosition(fit)$row  %in% unique(nondominated$modrow)


99 57 73 58 135 148 155 141 127 51 78 66 80 79 30 37 65 38 176 23

FALSE

In [47]:
# plot, for every model contributing non-dominated solutions, the histogram of
# its decision values together with its non-dominated thresholds
for (mr in unique(nondominated$modrow)) {
    fit <- update(fit, modRow = mr)
    hist(fit, predict(fit, bananas$x[]), th=nondominated$th[nondominated$modrow==mr])
}


What to choose?