3. Classification

  • Assign labels to data points (documents, images, users, etc.)
  • Training examples generalized to a decision rule (or more)

In [2]:
import pandas as pd
import numpy as np

Linear classifiers

  • Model expressed as a normal $w$ to a hyperplane separating two classes (and, optionally, a bias (offset) term $b$).
  • Linear classifiers are easy to understand, quick to train, work very well (especially in high dimensions) and make extremely efficient predictions.
  • In all future examples, use homgeneous representation and only define a model as $w$, with no separate bias term, for simplicity's sake.
    • Actually, the bias will simply be included in the model, as we replace $w = [w_1,\dots,w_d]$ with $\tilde w=[w_1,\dots,w_d,b]$.

Perceptrons [book]

  • Linear binary classifier
  • Model: weights $w$ and a threshold $\theta$
  • Predicts:
    • +1 if $wx > \theta$
    • -1 if $wx < \theta$
    • "wrong" if $wx = \theta$ (!!!)
  • The weight vector symbolizes a hyperplane (s.t. $wx = \theta$)
  • Only works for data which is linearly separable
    • vanilla perceptron will not converge otherwise
    • perceptron will converge to some hyperplane
    • SVMs (see next section) try to find an optimal hyperplane, and also work well with data which are not perfectly linearly separable
  • The Winnow Algorithm is an improvement of the basic Perceptron which is guaranteed to converge (all features are 0 or 1).
  • Can be parallelized in a decent manner using multiple iterations of MapReduce
    • partition training data to mappers
    • map: emit all points which disagree with current $w$
    • reduce: perform all updates to $w$ in centralized way
    • repeat until convergence

Multiclass Perceptrons

  • For $k$ classes, train $k$ Perceptrons, one for every class; (positive examples are examples of that class; negative examples are all others)
  • Classify by computing all $w_ix$ values, and picking the largest one, as long as it's larger than the threshold $\theta$. If it's not, then the new object does not belong to any known class.
  • In some cases (e.g. topic modelling), we can also have objects which may belong to more than one class (e.g. a news article can be about both sports and celebrities)

Nonlinear Perceptrons

  • Transform data into another coordinate system first (e.g. to polar coordinates) and run Perceptron in that space. But more on this, later, when we talk about non-linear SVMs.


  • Perceptrons don't converge on non-linearily separable data
  • Perceptrons don't automatically choose a "best" separating hyperplane
  • Perceptrons can get (very) biased towards a class since they terminate as soon as they correctly classified all training data points (see textbook, Fig. 12.13)

Support Vector Machines

  • Max-margin classifiers
  • Find $w$ so that the margin to closest representatives of each class is maximized. Can be shown (but not here) that this is the right thing to do in order to improve our generalization.
  • Margin size is $\frac{1}{||w||_2} = \mathit{margin}$
  • Want to maximize margin size, so we want to minimize $w^Tw$, while also making sure that the margin is actually correct (duh!): $\forall \> i, w^Tx_i\ge1$
  • There's always noise $\implies$ model as slack variables.
    • slack variables represent points we can't avoid mis-classifying
    • one slack variable for every point
      • $0$ if that point is "good" and can be classified OK by the SVM
      • $>0$ if that point is not correctly classified by our computed $w$, but we sort of have to live with it
      • in practice, most (if not all, if we're lucky) points $i$ have $\xi_i = 0$

Convex optimization

  • This is the way we solve the SVM constrained minimization problem
  • Minimization because we want to "shrink" our model's magnitude (i.e. the margin size) and, as much as possible, the impact of the slack variables (data points which we cannot avoid misclassifying).
  • Constrained because we also want to ensure our model does the right thing and correctly classifies most points (for most points, we expect their $\xi_i$ to be 0, and $y_iw^Tx_i\ge1$, meaning that the point is snugly placed in the correct class; positive means the right side of the separation hyperplane; also greater than one means it's beyond the separation margin).
  • Problem: current approach has a run-time complexity of $\Omega(n^2)$, and simply doesn't work if the data don't fit in memory.

Convex optimization, reformulated

  • Because linear algebra is awesome, we can reformulate the (primal) SVM definition into something a little more useful.
  • $\xi_i = \operatorname{max}(0, 1 - y_iw^Tx_i)$, from the constraint part; it's called hinge loss ($l_{\operatorname{hinge}}$)
  • Shove this into the minimization formula.
  • We get: $\min_w{\left (w^Tw + C\sum\limits_{i}\max(0, 1 - y_iw^Tx_i) \right )}$

Convex optimization, reformulated, again

  • We can formulate a multicriteria optimization problem in two ways:
    • Minimize everything at once
    • Minimize one objective, while keeping the other under a certain bound
  • Our above (re)formulation belongs to the first category (one big-ass $\min$)
  • Let's write it in the second way.
  • Minimize the second part, as long as the first part ($w^Tw$) is below some threshold B.
  • We write that as:
    • $\min_w\sum\limits_i\max(0, 1 - y_i w^T x_i)$ s. t. $||w||_2 \lt \frac{1}{\sqrt{\lambda}}$
  • Interestingly enough, all 3 of our formulations produce the same solutions. TODO: ask algebra guru for more details about why.
  • However, the complexity varies. A lot. The last version is the best in that regard.

Convex optimization, in general

  • Many supervized learning problems consists of two components: a loss function and a regularizer.
  • From a (handwavily explained) Bayesian standpoint, the former represents the evidence, while the latter, our prior knowledge (i.e. we know our model shouldn't be too complex) [citation needed]
  • As we saw before, there are two ways to state these problems mathematically.
    • Using a single "large" minimization (l + r)
    • Using a little minimization (l) plus a constraint (r)
  • We will focus on the second techniques, as it makes it possible to perform online learning.

SVM -> OCP -> online-to-batch conversion by averaging if we want to train a single SVM on fixed data set If we also pick data at random -> SGD.

  • TODO: strong convexity and its benefits
    • Using geometric examples can really help!

Online Convex Programming (OCP)

It can be shown that going through the data in streaming fashion can lead to sensible solution, while allowing truly massive data sets to be processed.

Note that it is normal that some members of the training set are never used in a stochastic gradient descent algorithm.

In [3]:
def shuffle_in_unison(a, b):
    """Helper used by the OCP implementation. Shuffles both numpy arrays in the same way."""
    assert len(a) == len(b)
    shuffled_a = np.empty(a.shape, dtype=a.dtype)
    shuffled_b = np.empty(b.shape, dtype=b.dtype)
    permutation = np.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

In [1]:
def svm_ocp(train_x, train_y):
    assert train_x.shape[0] == train_y.shape[0]
    assert train_y.shape[1] == 1

    # The regularization parameter.
    lbd = 0.005
    rows, features = train_x.shape

    # Initialize our model with zeros.
    model = np.zeros(features)
    model_sum = np.zeros(features)

    # If we don't shuffle our input, the learning gets *really* messed up.
    train_x_vals, train_y_vals = shuffle_in_unison(train_x.values,

    for t, (x, y) in enumerate(zip(train_x_vals, train_y_vals)):
        eta = 1 / math.sqrt(t + 1)
        # eta = 1 / (lbd * (t + 2))
        prediction = np.dot(x, model)
        score = (y * prediction)[0]
        if score < 1:
            # We misclassified 'y'. Or weren't condifent enough.
            # Let's update the model.

            # Gradient descent step.
            model_prime = model + eta * y * x

            # Regularization step (stay inside the hypersphere with radius
            # lambda).
            model_prime_norm = np.linalg.norm(model_prime)
            regularizer = 1 / (math.sqrt(lbd) * model_prime_norm)
            model = model_prime * min(1, regularizer)

        model_sum += model

    # Online-to-batch conversion: we simply want to average over all iterations
    # of our model. This is a very simple and provably accurate method.
    # TODO(andrei) Consider skipping first 'k' elements.
    avg_model = model_sum / rows

    return avg_model

PEGASOS vs. standard OCP/SGD

UpdatingBatch updatesSingle-point updates
Loss functionStrongly convexConvex
Run-time complexityDepends on number of dimensions and desired accuracyDepends on number of data points

Perceptrons vs. SVMs


ScalabilityBoth can support millions of features
Supported data typesNumeric onlyNumeric only, unless kernels are used

Adapting to geometry (AdaGrad)

  • In some cases, gradient descent can take a long time to converge because of the distorted curvature of the loss function
  • We can "stretch" the dimensions to make our gradient "aim" better towards the minimum and thus converge faster
  • Transformation matrix $A$: $z = Aw$, $w=A^{-1}z$
  • Gradient descent in z-space:
    • $z_{t+1} = \operatorname{arg min}\limits_{z: A^{-1}z \in S} || z - \left(z_t - \eta\nabla_z g(z_t) \right) ||_2^2$
    • explained:
      • we still want to minimize, hence the $\operatorname{arg min}$
      • we still want to constrain based on the original space, so we impose the constraint on $w = A^{-1}z \in S$
      • the updated model is computed via gradient descent (duh) in the $\nabla_zg(z_t)$ part; $\eta$ just tweaks the learning speed as much as we want
    • We reach: $\operatorname{arg min}\limits_{w \in S} || w - \left( w_t - \eta A^{-1}\nabla_w f(w_t) \right) ||_A^2$ (which uses the Mahalanobis norm instead of the $\ell_2$ one)
  • All that's left is to pick an appropriate projection matrix $A$
    • positive definite
    • symmetric
    • all eigenvalues $\ge 0$
  • We could compute a nice $G_T$ which minimizes the regret upper bound...
  • But we obviously can't compute $G_T$ at time $t < T$ :(
  • See 06:24 for more details.
  • There's an alternate way of computing an optimal $G_t$ at every step
  • $G_t = \left( \sum\limits_{\tau = 1}^{t}\nabla f_{\tau}(w_{\tau})\nabla f_{\tau}(w_{\tau})^T \right)^\frac{1}{2}$
  • However, G is expensive to compute, and it also needs to be inverted.
  • Solution: $G_t := diag(G_t)$; this is much easier to work with and trivial to invert
  • Bonus: can prove that AdaGrad provides advantages if data is sparse (better upper bound for $\mathbb{E}[L(\hat{w})] - L(w^*)$)
  • Generalization of OCP and friends: Proximal gradient methods.

Note: AdaGrad is not covered in the textbook (as of January 2016)

Mahalanobis distance

TODO: more info, since we need it to really grok ADAGRAD.

The Mahalanobis distance is essentially the distance between a point and the centroid of a cluster, normalized by the standard deviation of the cluster in each dimension.

-- Rajaraman, Anand, and Jeffrey D. Ullman. Mining of massive datasets. Vol. 77. Cambridge: Cambridge University Press, 2012.

\begin{equation} d_{\text{Mahalanobis}}(x, y) = \sqrt{\sum_{i=1}^{d}\left( \frac{p_i - c_i}{\sigma_i} \right)^{2}} \end{equation}

AdaGrad in exams

Not seen in any data mining exam from 2011 to 2014.

Exploiting parallelism

  • Parallelize gradient computation (cluster)
    • needs to constantly sync value of $w$ after each iteration
  • Parallel stochastic gradient descent (cluster)
    • partition data
    • run SGD independently on each partition
    • average produced models
    • TODO: analysis of error rate
  • Hogwild! (multi-proc)
    • on local, multi-processor machine
    • no sychrnoization
    • works well, as long as the data is sparse!
  • Parameter server
    • keep track of $w$ on central machine, which receives $\Delta w$s every once in a while, after which it pushes the update to all the workers

In [ ]: