The idea:
In one-class classification (OCC) we only have positive and unlabeled data for model selection.
It is reasonable to assume that, for a given true positive rate (which we can estimate from the training data), the fewer positive predictions a model returns for the unlabeled data, the better it is.
It may therefore help in an OCC application to explore the Pareto front over all models and thresholds for the two conflicting goals of a high true positive rate and a low probability of positive predictions.
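To make the two goals concrete: given decision values for the positive and the unlabeled training samples, both quantities can be estimated for any threshold. A minimal sketch, where the vectors dv_pos and dv_un are hypothetical stand-ins for the decision values of the positive and unlabeled samples:
# Hypothetical decision values of the positive (dv_pos) and unlabeled (dv_un) samples.
dv_pos <- rnorm(50, mean = 1)
dv_un  <- rnorm(200, mean = -1)
# For a threshold th: tpr = fraction of positives predicted positive,
#                     ppp = fraction of unlabeled samples predicted positive.
th  <- 0
tpr <- mean(dv_pos >= th)
ppp <- mean(dv_un >= th)
c(tpr = tpr, ppp = ppp)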
Let's get some toy data and try...
In [31]:
require(oneClass)
require(rPref)
require(dplyr)
require(doParallel)
registerDoParallel(cores=4)
data(bananas)
trs <- bananas_trainset(nP=50, nU=200)
plot(trs[, -1], pch=c(1, 3)[as.numeric(trs$y)],
col=c("darkred", "darkblue")[as.numeric(trs$y)])
Let's get some cross-validated decision values for a couple of biased SVM models.
In [32]:
set.seed(1)
index <- createFolds(trs$y, returnTrain=T)
sapply(index, function(i) table(trs$y[i]))
In [33]:
fit <- trainOcc(x=trs[, -1], y=trs$y, index=index)
The cross-validated decision values (distances in the case of the biased SVM) for all models are stored here:
In [34]:
head(fit$pred)
The results on which we usually base model selection are stored here:
In [35]:
head(fit$results)
But the performance metrics, e.g. puF, are only evaluated at the separating hyperplane, i.e. at the threshold 0. We might want to consider other thresholds and calculate the true positive rate (tpr) and the probability of positive predictions (ppp) over all models and many thresholds (by default 50, spread over the range of the returned decision values):
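Conceptually, this amounts to sweeping a grid of thresholds over the cross-validated decision values of each model and recording tpr and ppp at every threshold. Below is a rough sketch of that idea (not the oneClass implementation; the actual work is done by puSummaryThLoop further down); dv_pos and dv_un again stand for the decision values of the positive and unlabeled samples of one model:
# Conceptual sketch only: tpr/ppp over a grid of n thresholds for one model.
sweep_thresholds <- function(dv_pos, dv_un, n = 50) {
  ths <- seq(min(c(dv_pos, dv_un)), max(c(dv_pos, dv_un)), length.out = n)
  data.frame(th  = ths,
             tpr = sapply(ths, function(th) mean(dv_pos >= th)),
             ppp = sapply(ths, function(th) mean(dv_un  >= th)))
}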
In [36]:
params <- fit$modelInfo$parameters$parameter
idx <- group_indices_(
fit$pred,
.dots=as.character(params))
u_idx <- sort(unique(idx))
We want to keep track of each model's position (row) in fit$results, which we can do like this:
In [37]:
res <- foreach(i = u_idx, .packages="oneClass") %dopar% {
  df <- fit$pred[idx==i, ]
  modrow <- modelPosition(fit, modParam=df[1, as.character(params)])$row
  cbind(modrow=modrow, puSummaryThLoop(df, returnAll=TRUE))
}
res <- do.call(rbind, res)
head(res)
Now we have the metrics not at just one threshold but at 50 thresholds per model, which makes 12600 possible model/threshold candidates.
In [38]:
nrow(res)
How should we select a solution, knowing that we should use puF and similar metrics with caution? First, we can check whether we can reduce the number of candidates by keeping only the non-dominated solutions, i.e. the solutions on the Pareto front. These turn out to be far fewer:
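For intuition, a solution is non-dominated if no other solution is at least as good in both goals and strictly better in one; this is what psel() from rPref computes with the preference high(tpr) * low(ppp). A small base-R sketch of the same check, just for illustration (quadratic and slow, so the rPref version is used below):
# Illustrative only: TRUE for solutions not dominated by any other solution
# (another solution dominates if its tpr is >= and its ppp is <=, with at
# least one of the two strictly better).
is_nondominated <- function(tpr, ppp) {
  sapply(seq_along(tpr), function(i)
    !any(tpr >= tpr[i] & ppp <= ppp[i] & (tpr > tpr[i] | ppp < ppp[i])))
}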
In [39]:
get_nondominated_solutions <- function(x, rm_duplicated=TRUE, verbose=TRUE) {
  nondominated <- psel(x, high(tpr) * low(ppp))
  n_all <- nrow(nondominated)
  # There are lots of duplicates with equal tpr and ppp values, which we remove.
  if (rm_duplicated)
    nondominated <- nondominated[!duplicated(nondominated[, c("tpr", "ppp")]), ]
  n_no_duplicates <- nrow(nondominated)
  nondominated <- nondominated[order(-nondominated$tpr, nondominated$ppp), ]
  if (verbose)
    cat(sprintf("Solutions with / without duplicates: %d / %d\n", n_all, n_no_duplicates))
  return(nondominated)  # now ordered by descending tpr
}
In [40]:
nondominated <- get_nondominated_solutions(res)
And only a small fraction of the models contain non-dominated solutions:
In [41]:
nondominated_models <- nondominated$modrow
nrow(fit$results)
length(unique(nondominated$modrow))
The following plot shows all solutions (small, unfilled points), the solutions on the Pareto front (blue), the solutions at th=0 (red), and the model selected among the th=0 solutions (green):
In [42]:
ggplot(res, aes(x = ppp, y = tpr)) +
geom_point(shape = 21, size=.75) +
geom_point(data=nondominated, size = 2, col="blue") +
geom_point(data=fit$results, size = 1, col="red") +
geom_point(data=fit$results[modelPosition(fit)$row, , drop=F], size = 1, col="green")
And how is that useful?
1) Automatic model selection with whatever PU-based optimization criterion might benefit from considering only the non-dominated solutions, since this might reduce the risk of selecting a poor solution (see the sketch after this list).
2) We can investigate whether a subset of the models contains a similar Pareto front.
3) If we manually analyze different models, e.g. by plotting their histograms, it reduces the number of models to be investigated to a reasonable size.
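Regarding 1), a possible sketch: rank only the non-dominated solutions with a PU-based criterion and take the best one. Here tpr^2/ppp is used as a simple PU-based score; any other PU metric contained in the result table could be plugged in instead (the column pu_crit is introduced here only for illustration):
# Sketch: automatic selection restricted to the Pareto front.
cand <- nondominated[nondominated$ppp > 0, ]
cand$pu_crit <- cand$tpr^2 / cand$ppp   # a simple PU-based criterion
best <- cand[which.max(cand$pu_crit), ]
best[, c("modrow", "th", "tpr", "ppp", "pu_crit")]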
For example 2):
Let us compare the non-dominated solutions over all models and thresholds with the non-dominated solutions we get when cMultiplier is held fixed (one Pareto front per cMultiplier value).
The result is interesting! If we search for our best solution among all thresholds (and not only at th=0, as is usually done), we can skip fitting models for different cMultiplier values and thus need to fit far fewer models. In our case:
In [43]:
nValues <- sapply(c("sigma", "cNeg", "cMultiplier"), function(p) length(unique(fit$results[, p])))
cat(sprintf("# all models / # models given fixed cMultiplier: %d / %d", prod(nValues[1:2]), prod(nValues)))
In [49]:
cMultipliers <- sort(unique(fit$results$cMultiplier))
nondominated_cM <- lapply(cMultipliers, function(cM)
get_nondominated_solutions(res[res$modrow %in% which(fit$results$cMultiplier==cM), ], verbose=F))
plot(nondominated$ppp, nondominated$tpr, xlab="ppp", ylab="tpr", col="blue", lwd=2)
for (ele in nondominated_cM)
  lines(ele$ppp, ele$tpr, col=rgb(0, 0, 0, alpha=.3), lwd=3)
For example 3):
Is the selected th=0 solution one of the models with nondominated solutions?
In [45]:
hist(fit, predict(fit, bananas$x[]), th=0)
In [46]:
unique(nondominated$modrow)
modelPosition(fit)$row %in% unique(nondominated$modrow)
In [47]:
for (mr in unique(nondominated$modrow)) {
  fit <- update(fit, modRow = mr)
  hist(fit, predict(fit, bananas$x[]), th=nondominated$th[nondominated$modrow==mr])
}
What to choose?