3. Classification

Assign labels to data points (documents, images, users, etc.)
Training examples generalized to a decision rule (or more)



In [2]:

    
import pandas as pd
import numpy as np

Linear classifiers

Model expressed as a normal $w$ to a hyperplane separating two classes (and, optionally, a bias (offset) term $b$).
Linear classifiers are easy to understand, quick to train, work very well (especially in high dimensions) and make extremely efficient predictions.
In all future examples, use homgeneous representation and only define a model as $w$, with no separate bias term, for simplicity's sake.
- Actually, the bias will simply be included in the model, as we replace $w = [w_1,\dots,w_d]$ with $\tilde w=[w_1,\dots,w_d,b]$.

Perceptrons [book]

Linear binary classifier
Model: weights $w$ and a threshold $\theta$
Predicts:
- +1 if $wx > \theta$
- -1 if $wx < \theta$
- "wrong" if $wx = \theta$ (!!!)
The weight vector symbolizes a hyperplane (s.t. $wx = \theta$)
Only works for data which is linearly separable
- vanilla perceptron will not converge otherwise
- perceptron will converge to some hyperplane
- SVMs (see next section) try to find an optimal hyperplane, and also work well with data which are not perfectly linearly separable
The Winnow Algorithm is an improvement of the basic Perceptron which is guaranteed to converge (all features are 0 or 1).
Can be parallelized in a decent manner using multiple iterations of MapReduce
- partition training data to mappers
- map: emit all points which disagree with current $w$
- reduce: perform all updates to $w$ in centralized way
- repeat until convergence

Multiclass Perceptrons

For $k$ classes, train $k$ Perceptrons, one for every class; (positive examples are examples of that class; negative examples are all others)
Classify by computing all $w_ix$ values, and picking the largest one, as long as it's larger than the threshold $\theta$. If it's not, then the new object does not belong to any known class.
In some cases (e.g. topic modelling), we can also have objects which may belong to more than one class (e.g. a news article can be about both sports and celebrities)

Nonlinear Perceptrons

Transform data into another coordinate system first (e.g. to polar coordinates) and run Perceptron in that space. But more on this, later, when we talk about non-linear SVMs.

Limitations

Perceptrons don't converge on non-linearily separable data
Perceptrons don't automatically choose a "best" separating hyperplane
Perceptrons can get (very) biased towards a class since they terminate as soon as they correctly classified all training data points (see textbook, Fig. 12.13)

Support Vector Machines

Max-margin classifiers
Find $w$ so that the margin to closest representatives of each class is maximized. Can be shown (but not here) that this is the right thing to do in order to improve our generalization.
Margin size is $\frac{1}{||w||_2} = \mathit{margin}$
Want to maximize margin size, so we want to minimize $w^Tw$, while also making sure that the margin is actually correct (duh!): $\forall \> i, w^Tx_i\ge1$
There's always noise $\implies$ model as slack variables.
- slack variables represent points we can't avoid mis-classifying
- one slack variable for every point
  - $0$ if that point is "good" and can be classified OK by the SVM
  - $>0$ if that point is not correctly classified by our computed $w$, but we sort of have to live with it
  - in practice, most (if not all, if we're lucky) points $i$ have $\xi_i = 0$

Convex optimization

This is the way we solve the SVM constrained minimization problem
Minimization because we want to "shrink" our model's magnitude (i.e. the margin size) and, as much as possible, the impact of the slack variables (data points which we cannot avoid misclassifying).
Constrained because we also want to ensure our model does the right thing and correctly classifies most points (for most points, we expect their $\xi_i$ to be 0, and $y_iw^Tx_i\ge1$, meaning that the point is snugly placed in the correct class; positive means the right side of the separation hyperplane; also greater than one means it's beyond the separation margin).
Problem: current approach has a run-time complexity of $\Omega(n^2)$, and simply doesn't work if the data don't fit in memory.

Convex optimization, reformulated

Because linear algebra is awesome, we can reformulate the (primal) SVM definition into something a little more useful.
$\xi_i = \operatorname{max}(0, 1 - y_iw^Tx_i)$, from the constraint part; it's called hinge loss ($l_{\operatorname{hinge}}$)
Shove this into the minimization formula.
We get: $\min_w{\left (w^Tw + C\sum\limits_{i}\max(0, 1 - y_iw^Tx_i) \right )}$

Convex optimization, reformulated, again

We can formulate a multicriteria optimization problem in two ways:
- Minimize everything at once
- Minimize one objective, while keeping the other under a certain bound
Our above (re)formulation belongs to the first category (one big-ass $\min$)
Let's write it in the second way.
Minimize the second part, as long as the first part ($w^Tw$) is below some threshold B.
We write that as:
- $\min_w\sum\limits_i\max(0, 1 - y_i w^T x_i)$ s. t. $||w||_2 \lt \frac{1}{\sqrt{\lambda}}$
Interestingly enough, all 3 of our formulations produce the same solutions. TODO: ask algebra guru for more details about why.
However, the complexity varies. A lot. The last version is the best in that regard.

Convex optimization, in general

Many supervized learning problems consists of two components: a loss function and a regularizer.
From a (handwavily explained) Bayesian standpoint, the former represents the evidence, while the latter, our prior knowledge (i.e. we know our model shouldn't be too complex) [citation needed]
As we saw before, there are two ways to state these problems mathematically.
- Using a single "large" minimization (l + r)
- Using a little minimization (l) plus a constraint (r)
We will focus on the second techniques, as it makes it possible to perform online learning.

SVM -> OCP -> online-to-batch conversion by averaging if we want to train a single SVM on fixed data set If we also pick data at random -> SGD.

TODO: strong convexity and its benefits
- Using geometric examples can really help!

Online Convex Programming (OCP)

It can be shown that going through the data in streaming fashion can lead to sensible solution, while allowing truly massive data sets to be processed.

Note that it is normal that some members of the training set are never used in a stochastic gradient descent algorithm.



In [3]:

    
def shuffle_in_unison(a, b):
    """Helper used by the OCP implementation. Shuffles both numpy arrays in the same way."""
    assert len(a) == len(b)
    shuffled_a = np.empty(a.shape, dtype=a.dtype)
    shuffled_b = np.empty(b.shape, dtype=b.dtype)
    permutation = np.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b



In [1]:

    
def svm_ocp(train_x, train_y):
    assert train_x.shape[0] == train_y.shape[0]
    assert train_y.shape[1] == 1

    # The regularization parameter.
    lbd = 0.005
    rows, features = train_x.shape

    # Initialize our model with zeros.
    model = np.zeros(features)
    model_sum = np.zeros(features)

    # If we don't shuffle our input, the learning gets *really* messed up.
    train_x_vals, train_y_vals = shuffle_in_unison(train_x.values,
                                                   train_y.values)

    for t, (x, y) in enumerate(zip(train_x_vals, train_y_vals)):
        eta = 1 / math.sqrt(t + 1)
        # eta = 1 / (lbd * (t + 2))
        prediction = np.dot(x, model)
        score = (y * prediction)[0]
        if score < 1:
            # We misclassified 'y'. Or weren't condifent enough.
            # Let's update the model.

            # Gradient descent step.
            model_prime = model + eta * y * x

            # Regularization step (stay inside the hypersphere with radius
            # lambda).
            model_prime_norm = np.linalg.norm(model_prime)
            regularizer = 1 / (math.sqrt(lbd) * model_prime_norm)
            model = model_prime * min(1, regularizer)

        model_sum += model

    # Online-to-batch conversion: we simply want to average over all iterations
    # of our model. This is a very simple and provably accurate method.
    # TODO(andrei) Consider skipping first 'k' elements.
    avg_model = model_sum / rows

    return avg_model

PEGASOS vs. standard OCP/SGD

	PEGASOS	OCP/SGD
Updating	Batch updates	Single-point updates
Loss function	Strongly convex	Convex
Run-time complexity	Depends on number of dimensions and desired accuracy	Depends on number of data points

Perceptrons vs. SVMs

(TODO)

	Perceptron	SVM
Scalability	Both can support millions of features
Supported data types	Numeric only	Numeric only, unless kernels are used

Adapting to geometry (AdaGrad)

In some cases, gradient descent can take a long time to converge because of the distorted curvature of the loss function
We can "stretch" the dimensions to make our gradient "aim" better towards the minimum and thus converge faster
Transformation matrix $A$: $z = Aw$, $w=A^{-1}z$
Gradient descent in z-space:
- $z_{t+1} = \operatorname{arg min}\limits_{z: A^{-1}z \in S} || z - \left(z_t - \eta\nabla_z g(z_t) \right) ||_2^2$
- explained:
  - we still want to minimize, hence the $\operatorname{arg min}$
  - we still want to constrain based on the original space, so we impose the constraint on $w = A^{-1}z \in S$
  - the updated model is computed via gradient descent (duh) in the $\nabla_zg(z_t)$ part; $\eta$ just tweaks the learning speed as much as we want
- We reach: $\operatorname{arg min}\limits_{w \in S} || w - \left( w_t - \eta A^{-1}\nabla_w f(w_t) \right) ||_A^2$ (which uses the Mahalanobis norm instead of the $\ell_2$ one)
All that's left is to pick an appropriate projection matrix $A$
- positive definite
- symmetric
- all eigenvalues $\ge 0$
We could compute a nice $G_T$ which minimizes the regret upper bound...
But we obviously can't compute $G_T$ at time $t < T$ :(
See 06:24 for more details.
There's an alternate way of computing an optimal $G_t$ at every step
$G_t = \left( \sum\limits_{\tau = 1}^{t}\nabla f_{\tau}(w_{\tau})\nabla f_{\tau}(w_{\tau})^T \right)^\frac{1}{2}$
However, G is expensive to compute, and it also needs to be inverted.
Solution: $G_t := diag(G_t)$; this is much easier to work with and trivial to invert
Bonus: can prove that AdaGrad provides advantages if data is sparse (better upper bound for $\mathbb{E}[L(\hat{w})] - L(w^*)$)
Generalization of OCP and friends: Proximal gradient methods.

Note: AdaGrad is not covered in the textbook (as of January 2016)

Mahalanobis distance

TODO: more info, since we need it to really grok ADAGRAD.

The Mahalanobis distance is essentially the distance between a point and the centroid of a cluster, normalized by the standard deviation of the cluster in each dimension.

-- Rajaraman, Anand, and Jeffrey D. Ullman. Mining of massive datasets. Vol. 77. Cambridge: Cambridge University Press, 2012.

\begin{equation} d_{\text{Mahalanobis}}(x, y) = \sqrt{\sum_{i=1}^{d}\left( \frac{p_i - c_i}{\sigma_i} \right)^{2}} \end{equation}

AdaGrad in exams

Not seen in any data mining exam from 2011 to 2014.

Exploiting parallelism

Parallelize gradient computation (cluster)
- needs to constantly sync value of $w$ after each iteration
Parallel stochastic gradient descent (cluster)
- partition data
- run SGD independently on each partition
- average produced models
- TODO: analysis of error rate
Hogwild! (multi-proc)
- on local, multi-processor machine
- no sychrnoization
- works well, as long as the data is sparse!
Parameter server
- keep track of $w$ on central machine, which receives $\Delta w$s every once in a while, after which it pushes the update to all the workers



In [ ]: