5. Active Learning

Make big data algorithms ask when they're unsure.
Avoid silly questions and ask questions which promise to yield the most information.
- Want to minimize number of requested labels (expensive).

Overview

Two main approaches:

Stream-based (online) active learning: need to decide on the spot whether to request a label for a data point
Pool-based active learning: sequentially request point labels from an unlabeled dataset

Simple example

Learning a 1D threshold function or a liner threshold function.

\begin{equation} H = \left\{ h : [0, 1] \mapsto \{+1, -1\}, h(x) = \operatorname{sign}(x - \tau), \tau \in [0, 1] \right\} \end{equation}

Pool-based active learning

request labels intelligently until we can infer all remaining labels
efficiency varies (can reduce nr. of requested labels exponentially, or have almost no impact)

Uncertainty sampling

Assign uncertainty score to all points and pick most uncertain example. Widely used heuristic.
SVM: pick point closest to separating hyperplane: $x^{*} = \operatorname{arg min}_{x_i \in \mathcal{U}} | w^T x_i |$
Classifier retrained after every new label (Can we avoid this?).
Complexity to pick $m$ labels: $O\left( n \cdot m \cdot d + \frac{m^2}{\epsilon} \right)$, with $n$ datapoints (need to compute uncertainty score for each), $m$ max requested labels, $d$ dimensions and $\epsilon$ our error threshold (TODO please confirm).

Note: AUROC = Area Under ROC (receiver operating characteristic) Curve; measures the performance of a binary classifier system as its discrimination threshold is varied (i.e. it's big when a classifier has a lot of true positives vs. false positives).

Sub-linear time active learning

Store mappings from hyperplanes (hyperplane query) to close points.
Hyperplane query = give me the closest point(s) to this hyperplane. TODO: why more?
This helps us avoid recomputing uncertainty scores every time we update our model (i.e. the separating hyperplane).
The closer a point's vector is to be perpendicular to the normal $w$, the more likely it is to be close to it. There are exceptions, of course, but most of the time this rule works very well.

So how can we hash these queries?

Use two random vectors, $u$ and $v$.

\begin{equation} h_{u, v}(a, b) = [h_u(a), h_v(b)] = [\operatorname{sign}(u^{T} a), \operatorname{sign}(v^{T} b)], \quad u, v \sim \mathcal{N}(0, I) \end{equation}

Define a hash family (why?):

\begin{equation} h_{\mathcal{H}}(z) = \left\{ \begin{array} h_{u, v}(z, z), \> & \text{if } z \text{ is a database point vector} \\ h_{u, v}(z, -z), \> & \text{if } z \text{ is a query hyperplane vector} \\ \end{array} \right. \end{equation}

With this we get an LSH collision probability:

\begin{equation} \begin{aligned} Pr[h_H(w) & = h_H(x)] = Pr[h_u(w) = h_u(x)] Pr[h_v(-w) = h_v(x)] \\ & = \frac{1}{4} - \frac{1}{\pi^2}\left( \theta_{x, w} - \frac{\pi}{2}\right)^2 \end{aligned} \end{equation}

(A collision being a query hyperplane vector and another point get put into the same bucket, representing the fact that they are close, and that its label should be requested)

For small enough $\theta_{x, w}$ this is OK but still not great. However, we can use multiple hash functions together in an OR-construction to increase the number of positives. (TODO: not in slides but logical; find reference)

Improving active learning

Key insight/problem: uncertain $\nRightarrow$ informative

Better to quantify how much information we get from each label, rather than just how uncertain it is.

Version space

A version space is a set of all classifiers consistent with the current data.

\begin{equation} \mathcal{V}(D) = \{ w : \forall \> (x, y) \in D, \operatorname{sign}(w^Tx) = y \} \end{equation}

Note: this does not account for noisy data.

In an SVM case, the original version space could be the whole $l_2$-norm ball of radius $\frac{1}{\sqrt{\lambda}}$.

We want to always request a label for the point which will make our version space shrink the most.

Relevant version space

A relevant version space, given a pool of unlabeled points, is the set of possible labeling consistent with the data.

\begin{equation} \hat{\mathcal{V}}(D; U) = \left\{ h: U \rightarrow \{+1, -1\} : \exists w \in \mathcal{V}(D) : \forall \> x \in U, \operatorname{sign}(w^Tx) = h(x) \right\} \end{equation}

We use this to conduct a generalized binary search to find the most informative point.

$v^{+}(x)$ - how many elemenents we would have in the relevant version space if x were labelled +. $v^{-}(x)$ - same, but if x were labelled -. Pick $x$ where the minimum possible change $\min{(v^{-}(x), v^{+}(x))}$ is the largest. This is the x which, regardless of what label it will get, will shrink the relevant version space the most ("balanced" split). (TODO: clarify)

In other words for each possible label (+ or -) of an uncertain point, calculate resulting SVMs with margings $m^{+}$ and $m^{-}$. Pick point which produces most "balanced" difference between margins:

Max-min margin: $x_t = \operatorname{arg max}_x \left\{ \min \{ m_x^-, m_x^+ \} \right\}$ (used in generalized binary search (GBS) example)
Ratio margin: $x_t = \operatorname{arg max}_x \left\{ \min \left\{ \frac{m_x^-}{m_x^+}, \frac{m_x^+}{m_x^-} \right\} \right\}$

Wait, "calculate resulting SVMs"? Isn't that expensive?

It is. Optimal max-min margins and ratio margins are expensive to compute. But there exist some practical tricks:

Only score and pick from small random subsample of data
Only use informativeness at first (e.g. first 10 examples), then switch to (faster) uncertainty sampling
Occasionally "cheat" and pick points uniformly at random

Dealing with noise

Thus far, we assumed there was no noise. How to deal with noise?

In practice, it doesn't matter (SVM + slack variables)
In theory, the analysis is much harder.

Open questions

Uncertainty sampling for SVMs: What about kernelized SVMs? Can you still tell what point(s) are closest to the decision boundary (easily)? (Note: just picking the smallest kernel function value between all available training points and the new one doesn't work.)



In [ ]: