Two main approaches:
Learning a 1D threshold function or a linear threshold function.
\begin{equation} H = \left\{ h : [0, 1] \to \{+1, -1\}, \ h(x) = \operatorname{sign}(x - \tau), \ \tau \in [0, 1] \right\} \end{equation}

Note: AUROC = Area Under the ROC (receiver operating characteristic) curve; it measures the performance of a binary classifier as its discrimination threshold is varied (it is large when the classifier achieves a high true-positive rate at a low false-positive rate).
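This threshold class is the classic motivating example for active learning: with noiseless labels, $\tau$ can be located by binary search. A minimal sketch, assuming a noiseless oracle `query_label` (a stand-in for a human labeler; names are mine):

```python
def learn_threshold(query_label, tol=1e-3):
    """Actively learn tau for h(x) = sign(x - tau) on [0, 1] by binary
    search. Each query halves the interval that can contain tau, so
    O(log(1/tol)) labels suffice, versus the O(1/tol) random labels
    passive learning would need."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if query_label(mid) == +1:  # label +1 means mid >= tau, so tau <= mid
            hi = mid
        else:                       # label -1 means tau > mid
            lo = mid
    return (lo + hi) / 2

# toy oracle with a hidden threshold at 0.37 (illustrative)
print(learn_threshold(lambda x: +1 if x >= 0.37 else -1))  # ~0.37
```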
So how can we hash these queries?
Use two random vectors, $u$ and $v$.
\begin{equation} h_{u, v}(a, b) = [h_u(a), h_v(b)] = [\operatorname{sign}(u^{T} a), \operatorname{sign}(v^{T} b)], \quad u, v \sim \mathcal{N}(0, I) \end{equation}

Using these, define a hash family that treats database points and query hyperplanes asymmetrically (why? so that a point $x$ lying close to the hyperplane, i.e. nearly orthogonal to its normal $w$, is likely to land in the same bucket as the query):
\begin{equation} h_{\mathcal{H}}(z) = \begin{cases} h_{u, v}(z, z), & \text{if } z \text{ is a database point vector} \\ h_{u, v}(z, -z), & \text{if } z \text{ is a query hyperplane vector} \end{cases} \end{equation}

With this we get an LSH collision probability (using $\Pr[\operatorname{sign}(u^T a) = \operatorname{sign}(u^T b)] = 1 - \theta_{a, b}/\pi$ for Gaussian $u$, and $\theta_{x, -w} = \pi - \theta_{x, w}$):
\begin{equation} \begin{aligned} \Pr[h_{\mathcal{H}}(w) = h_{\mathcal{H}}(x)] &= \Pr[h_u(w) = h_u(x)] \cdot \Pr[h_v(-w) = h_v(x)] \\ &= \left(1 - \frac{\theta_{x, w}}{\pi}\right) \frac{\theta_{x, w}}{\pi} = \frac{1}{4} - \frac{1}{\pi^2}\left( \theta_{x, w} - \frac{\pi}{2}\right)^2 \end{aligned} \end{equation}

(A collision means that a query hyperplane vector and a database point land in the same bucket, indicating that the point is close to the hyperplane and therefore that its label should be requested.)
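A minimal numpy sketch of this two-bit hash (dimension, seed, and function names are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
u = rng.standard_normal(d)  # u ~ N(0, I)
v = rng.standard_normal(d)  # v ~ N(0, I)

def h_uv(a, b):
    """Two-bit hash [sign(u^T a), sign(v^T b)]."""
    return (np.sign(u @ a), np.sign(v @ b))

def h_hash(z, is_query):
    """Database points hash as (z, z); query hyperplane normals as (z, -z)."""
    return h_uv(z, -z) if is_query else h_uv(z, z)

w = rng.standard_normal(d)  # normal of a query hyperplane
x = rng.standard_normal(d)  # a database point
print(h_hash(w, is_query=True) == h_hash(x, is_query=False))  # collision?
```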
The collision probability peaks at $\frac{1}{4}$ when $\theta_{x, w} = \frac{\pi}{2}$, i.e. when $x$ lies exactly on the hyperplane; for points near the hyperplane this is OK but still not great. However, we can use multiple hash functions together in an OR-construction to increase the number of retrieved near points. (TODO: not in slides but logical; find reference)
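The OR-construction is standard LSH amplification; here is a sketch continuing the code above, assuming $L$ independent $(u, v)$ pairs:

```python
# OR-construction: with per-function collision probability p, keeping any
# point that collides with the query in at least one of L independent hash
# functions raises the retrieval probability to 1 - (1 - p)^L.
L = 20
pairs = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(L)]

def hashes(z, is_query):
    """All L two-bit hashes of z, with the database/query asymmetry."""
    a, b = (z, -z) if is_query else (z, z)
    return [(np.sign(u @ a), np.sign(v @ b)) for u, v in pairs]

def candidate(w, x):
    """True if x collides with the query w in at least one hash table."""
    return any(hw == hx for hw, hx in zip(hashes(w, True), hashes(x, False)))
```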
Key insight/problem: uncertain $\nRightarrow$ informative
Better to quantify how much information we get from each label, rather than just how uncertain it is.
The version space is the set of all classifiers consistent with the current data.
\begin{equation} \mathcal{V}(D) = \{ w : \forall \> (x, y) \in D, \ \operatorname{sign}(w^T x) = y \} \end{equation}

Note: this does not account for noisy data.
In the SVM case, the initial version space is the whole $\ell_2$-norm ball of radius $\frac{1}{\sqrt{\lambda}}$.
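A hedged sketch of approximating $\mathcal{V}(D)$ by rejection sampling, assuming the noiseless linear setting above (function and variable names are mine):

```python
import numpy as np

def sample_version_space(D, d, n_samples=10_000, seed=0):
    """Approximate V(D): sample weight vectors uniformly on the unit
    sphere and keep those that classify every labelled example in D
    correctly. D is a list of (x, y) pairs with y in {+1, -1}."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # project onto the sphere
    for x, y in D:
        W = W[np.sign(W @ x) == y]                 # discard inconsistent w
    return W
```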
We want to always request a label for the point which will make our version space shrink the most.
The relevant version space, given a pool of unlabeled points $U$, is the set of possible labelings of the pool consistent with the data:
\begin{equation} \hat{\mathcal{V}}(D; U) = \left\{ h: U \rightarrow \{+1, -1\} \ : \ \exists\, w \in \mathcal{V}(D) : \forall \> x \in U, \ \operatorname{sign}(w^T x) = h(x) \right\} \end{equation}

We use this to conduct a generalized binary search to find the most informative point.
$v^{+}(x)$ - how many elements the relevant version space would have if $x$ were labelled $+$. $v^{-}(x)$ - the same, but if $x$ were labelled $-$. Pick the $x$ for which the worst case $\min{(v^{-}(x), v^{+}(x))}$ is largest. This is the $x$ which, regardless of which label it receives, shrinks the relevant version space the most (a "balanced" split): just as in binary search, a query that splits the remaining hypotheses roughly in half guarantees progress no matter what answer comes back.
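With the sampled $W$ from the sketch above standing in for $\mathcal{V}(D)$, one generalized binary search step could look like this (my names; $U$ is the unlabeled pool as a matrix of row vectors):

```python
def n_labelings(W, U):
    """Size of the relevant version space: the number of distinct
    labelings of the pool U realizable by the sampled hypotheses W."""
    if len(W) == 0:
        return 0
    return len(np.unique(np.sign(W @ U.T), axis=0))  # one row = one labeling

def gbs_query(W, U):
    """Pick the pool index whose worst-case surviving version space,
    min(v_minus, v_plus), is largest: the most "balanced" split."""
    scores = []
    for x in U:
        v_plus = n_labelings(W[np.sign(W @ x) == +1], U)
        v_minus = n_labelings(W[np.sign(W @ x) == -1], U)
        scores.append(min(v_plus, v_minus))
    return int(np.argmax(scores))
```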
In other words, for each possible label ($+$ or $-$) of an uncertain point, compute the resulting SVMs with margins $m^{+}$ and $m^{-}$. Pick the point which produces the most "balanced" split: maximize $\min(m^{+}, m^{-})$ (the MaxMin margin), or alternatively the ratio $\min\left(\frac{m^{+}}{m^{-}}, \frac{m^{-}}{m^{+}}\right)$ (the ratio margin).
Wait, "calculate resulting SVMs"? Isn't that expensive?
It is. Optimal MaxMin margins and ratio margins are expensive to compute: two SVM retrainings per candidate point, for every query. But there exist practical tricks, e.g. the "simple margin" heuristic: fit a single SVM and query the pool point closest to the current decision boundary (see the sketch below).
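A sketch of both the MaxMin-margin selection and the simple-margin shortcut using scikit-learn, assuming a linear SVM and a labelled set that already contains both classes (helper names are mine):

```python
import numpy as np
from sklearn.svm import SVC

def margin(X, y, C=1.0):
    """Geometric margin 1/||w|| of a linear SVM fit on (X, y)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return 1.0 / np.linalg.norm(clf.coef_)

def maxmin_margin_query(X_lab, y_lab, X_pool):
    """Retrain twice per candidate (x labelled +1, then -1) and query the
    candidate maximizing min(m_plus, m_minus). This is the expensive part:
    O(|pool|) SVM retrainings per query."""
    scores = []
    for x in X_pool:
        X_aug = np.vstack([X_lab, x])
        m_plus = margin(X_aug, np.append(y_lab, +1))
        m_minus = margin(X_aug, np.append(y_lab, -1))
        scores.append(min(m_plus, m_minus))
    return int(np.argmax(scores))

def simple_margin_query(X_lab, y_lab, X_pool):
    """Cheap trick: fit one SVM and query the pool point closest to the
    current decision boundary."""
    clf = SVC(kernel="linear", C=1.0).fit(X_lab, y_lab)
    return int(np.argmin(np.abs(clf.decision_function(X_pool))))
```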