- Assign
**labels**to**data points**(documents, images, users, etc.) - Training examples generalized to a
**decision rule**(or more)

```
In [2]:
```import pandas as pd
import numpy as np

- Model expressed as a normal $w$ to a hyperplane separating two classes (and, optionally, a bias (offset) term $b$).
- Linear classifiers are easy to understand, quick to train, work very well (especially in high dimensions) and make extremely efficient predictions.
- In all future examples, use
*homgeneous representation*and only define a model as $w$, with no separate bias term, for simplicity's sake.- Actually, the bias will simply be included in the model, as we replace $w = [w_1,\dots,w_d]$ with $\tilde w=[w_1,\dots,w_d,b]$.

- Linear binary classifier
- Model:
*weights*$w$ and a*threshold*$\theta$ - Predicts:
- +1 if $wx > \theta$
- -1 if $wx < \theta$
- "wrong" if $wx = \theta$ (!!!)

- The weight vector symbolizes a hyperplane (s.t. $wx = \theta$)
**Only works for data which is linearly separable**- vanilla perceptron will not converge otherwise
- perceptron will converge to
*some*hyperplane - SVMs (see next section) try to find an optimal hyperplane, and also work well with data which are not perfectly linearly separable

- The
**Winnow Algorithm**is an improvement of the basic Perceptron which is guaranteed to converge (all features are 0 or 1). - Can be parallelized in a decent manner using multiple iterations of MapReduce
- partition training data to mappers
- map: emit all points which disagree with current $w$
- reduce: perform all updates to $w$ in centralized way
- repeat until convergence

- For $k$ classes, train $k$ Perceptrons, one for every class; (positive examples are examples of that class; negative examples are all others)
- Classify by computing all $w_ix$ values, and picking the largest one, as long as it's larger than the threshold $\theta$. If it's not, then the new object does not belong to any known class.
- In some cases (e.g. topic modelling), we can also have objects which may belong to more than one class (e.g. a news article can be about both sports and celebrities)

- Transform data into another coordinate system first (e.g. to polar coordinates) and run Perceptron in
*that*space. But more on this, later, when we talk about non-linear SVMs.

- Perceptrons don't converge on non-linearily separable data
- Perceptrons don't automatically choose a "best" separating hyperplane
- Perceptrons can get (very) biased towards a class since they terminate as soon as they correctly classified all training data points (see textbook, Fig. 12.13)

- Max-margin classifiers
- Find $w$ so that the margin to closest representatives of each class is maximized. Can be shown (but not here) that this is the right thing to do in order to improve our generalization.
- Margin size is $\frac{1}{||w||_2} = \mathit{margin}$
- Want to maximize margin size, so we want to minimize $w^Tw$, while also making sure that the margin is actually correct (duh!): $\forall \> i, w^Tx_i\ge1$
- There's always noise $\implies$ model as slack variables.
- slack variables represent points we can't avoid mis-classifying
- one slack variable for every point
- $0$ if that point is "good" and can be classified OK by the SVM
- $>0$ if that point is not correctly classified by our computed $w$, but we sort of have to live with it
- in practice, most (if not all, if we're lucky) points $i$ have $\xi_i = 0$

- This is the way we solve the SVM constrained minimization problem
- Minimization because we want to "shrink" our model's magnitude (i.e. the margin size) and, as much as possible, the impact of the slack variables (data points which we cannot avoid misclassifying).
- Constrained because we also want to ensure our model does the right thing and correctly classifies most points (for most points, we expect their $\xi_i$ to be 0, and $y_iw^Tx_i\ge1$, meaning that the point is snugly placed in the correct class; positive means the right side of the separation hyperplane; also greater than one means it's beyond the separation margin).
**Problem:**current approach has a run-time complexity of $\Omega(n^2)$, and simply doesn't work if the data don't fit in memory.

- Because linear algebra is awesome, we can reformulate the (primal) SVM definition into something a little more useful.
- $\xi_i = \operatorname{max}(0, 1 - y_iw^Tx_i)$, from the constraint part; it's called hinge loss ($l_{\operatorname{hinge}}$)
- Shove this into the minimization formula.
- We get: $\min_w{\left (w^Tw + C\sum\limits_{i}\max(0, 1 - y_iw^Tx_i) \right )}$

- We can formulate a multicriteria optimization problem in two ways:
- Minimize everything at once
- Minimize one objective, while keeping the other under a certain bound

- Our above (re)formulation belongs to the first category (one big-ass $\min$)
- Let's write it in the second way.
- Minimize the second part, as long as the first part ($w^Tw$) is below some threshold B.
- We write that as:
- $\min_w\sum\limits_i\max(0, 1 - y_i w^T x_i)$ s. t. $||w||_2 \lt \frac{1}{\sqrt{\lambda}}$

- Interestingly enough, all 3 of our formulations produce the
**same solutions**. TODO: ask algebra guru for more details about why. - However, the complexity varies. A lot. The last version is the best in that regard.

- Many supervized learning problems consists of two components: a loss function and a regularizer.
- From a (handwavily explained) Bayesian standpoint, the former represents the evidence, while the latter, our prior knowledge (i.e. we
*know*our model shouldn't be*too*complex) [citation needed] - As we saw before, there are two ways to state these problems mathematically.
- Using a single "large" minimization (l + r)
- Using a little minimization (l) plus a constraint (r)

- We will focus on the second techniques, as it makes it possible to perform online learning.

SVM -> OCP -> online-to-batch conversion by averaging if we want to train a single SVM on fixed data set If we also pick data at random -> SGD.

- TODO: strong convexity and its benefits
- Using geometric examples can really help!

It can be shown that going through the data in streaming fashion can lead to sensible solution, while allowing truly massive data sets to be processed.

Note that it is normal that some members of the training set are never used in a stochastic gradient descent algorithm.

```
In [3]:
```def shuffle_in_unison(a, b):
"""Helper used by the OCP implementation. Shuffles both numpy arrays in the same way."""
assert len(a) == len(b)
shuffled_a = np.empty(a.shape, dtype=a.dtype)
shuffled_b = np.empty(b.shape, dtype=b.dtype)
permutation = np.random.permutation(len(a))
for old_index, new_index in enumerate(permutation):
shuffled_a[new_index] = a[old_index]
shuffled_b[new_index] = b[old_index]
return shuffled_a, shuffled_b

```
In [1]:
```def svm_ocp(train_x, train_y):
assert train_x.shape[0] == train_y.shape[0]
assert train_y.shape[1] == 1
# The regularization parameter.
lbd = 0.005
rows, features = train_x.shape
# Initialize our model with zeros.
model = np.zeros(features)
model_sum = np.zeros(features)
# If we don't shuffle our input, the learning gets *really* messed up.
train_x_vals, train_y_vals = shuffle_in_unison(train_x.values,
train_y.values)
for t, (x, y) in enumerate(zip(train_x_vals, train_y_vals)):
eta = 1 / math.sqrt(t + 1)
# eta = 1 / (lbd * (t + 2))
prediction = np.dot(x, model)
score = (y * prediction)[0]
if score < 1:
# We misclassified 'y'. Or weren't condifent enough.
# Let's update the model.
# Gradient descent step.
model_prime = model + eta * y * x
# Regularization step (stay inside the hypersphere with radius
# lambda).
model_prime_norm = np.linalg.norm(model_prime)
regularizer = 1 / (math.sqrt(lbd) * model_prime_norm)
model = model_prime * min(1, regularizer)
model_sum += model
# Online-to-batch conversion: we simply want to average over all iterations
# of our model. This is a very simple and provably accurate method.
# TODO(andrei) Consider skipping first 'k' elements.
avg_model = model_sum / rows
return avg_model

- In some cases, gradient descent can take a long time to converge because of the distorted curvature of the loss function
- We can "stretch" the dimensions to make our gradient "aim" better towards the minimum and thus converge faster
- Transformation matrix $A$: $z = Aw$, $w=A^{-1}z$
- Gradient descent in z-space:
- $z_{t+1} = \operatorname{arg min}\limits_{z: A^{-1}z \in S} || z - \left(z_t - \eta\nabla_z g(z_t) \right) ||_2^2$
- explained:
- we still want to minimize, hence the $\operatorname{arg min}$
- we still want to constrain based on the original space, so we impose the constraint on $w = A^{-1}z \in S$
- the updated model is computed via gradient descent (duh) in the $\nabla_zg(z_t)$ part; $\eta$ just tweaks the learning speed as much as we want

- We reach: $\operatorname{arg min}\limits_{w \in S} || w - \left( w_t - \eta A^{-1}\nabla_w f(w_t) \right) ||_A^2$ (which uses the Mahalanobis norm instead of the $\ell_2$ one)

- All that's left is to pick an appropriate projection matrix $A$
- positive definite
- symmetric
- all eigenvalues $\ge 0$

- We could compute a nice $G_T$ which minimizes the regret upper bound...
- But we obviously can't compute $G_T$ at time $t < T$ :(
- See 06:24 for more details.
- There's an alternate way of computing an optimal $G_t$ at every step
- $G_t = \left( \sum\limits_{\tau = 1}^{t}\nabla f_{\tau}(w_{\tau})\nabla f_{\tau}(w_{\tau})^T \right)^\frac{1}{2}$
- However, G is expensive to compute, and it also needs to be inverted.
- Solution: $G_t := diag(G_t)$; this is much easier to work with and trivial to invert
**Bonus:**can prove that AdaGrad provides advantages if data is sparse (better upper bound for $\mathbb{E}[L(\hat{w})] - L(w^*)$)- Generalization of OCP and friends: Proximal gradient methods.

Note: AdaGrad is not covered in the textbook (as of January 2016)

TODO: more info, since we need it to really grok ADAGRAD.

\begin{equation} d_{\text{Mahalanobis}}(x, y) = \sqrt{\sum_{i=1}^{d}\left( \frac{p_i - c_i}{\sigma_i} \right)^{2}} \end{equation}The Mahalanobis distance is essentially the distance between a point and the centroid of a cluster, normalized by the standard deviation of the cluster in each dimension.

-- Rajaraman, Anand, and Jeffrey D. Ullman. Mining of massive datasets. Vol. 77. Cambridge: Cambridge University Press, 2012.

- Parallelize gradient computation (cluster)
- needs to constantly sync value of $w$ after each iteration

- Parallel stochastic gradient descent (cluster)
- partition data
- run SGD independently on each partition
- average produced models
- TODO: analysis of error rate

- Hogwild! (multi-proc)
- on local, multi-processor machine
- no sychrnoization
- works well, as long as the data is sparse!

- Parameter server
- keep track of $w$ on central machine, which receives $\Delta w$s every once in a while, after which it pushes the update to all the workers

```
In [ ]:
```