In [2]:
import pandas as pd
import numpy as np
SVM -> OCP -> online-to-batch conversion by averaging, if we want to train a single SVM on a fixed data set. If we additionally pick the data points at random, this becomes SGD.
It can be shown that going through the data in a streaming fashion still leads to a sensible solution, while allowing truly massive data sets to be processed.
Note that it is normal for some members of the training set to never be used by a stochastic gradient descent algorithm.
In [3]:
def shuffle_in_unison(a, b):
    """Helper used by the OCP implementation. Shuffles both numpy arrays in the same way."""
    assert len(a) == len(b)
    shuffled_a = np.empty(a.shape, dtype=a.dtype)
    shuffled_b = np.empty(b.shape, dtype=b.dtype)
    # Apply the same random permutation to both arrays so labels stay aligned
    # with their feature rows.
    permutation = np.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
In [1]:
import math

def svm_ocp(train_x, train_y):
    assert train_x.shape[0] == train_y.shape[0]
    assert train_y.shape[1] == 1
    # The regularization parameter.
    lbd = 0.005
    rows, features = train_x.shape
    # Initialize our model with zeros.
    model = np.zeros(features)
    model_sum = np.zeros(features)
    # If we don't shuffle our input, the learning gets *really* messed up.
    train_x_vals, train_y_vals = shuffle_in_unison(train_x.values,
                                                   train_y.values)
    for t, (x, y) in enumerate(zip(train_x_vals, train_y_vals)):
        eta = 1 / math.sqrt(t + 1)
        # eta = 1 / (lbd * (t + 2))
        prediction = np.dot(x, model)
        score = (y * prediction)[0]
        if score < 1:
            # We misclassified 'y', or weren't confident enough.
            # Let's update the model.
            # Gradient descent step.
            model_prime = model + eta * y * x
            # Regularization step (project back onto the hypersphere of
            # radius 1 / sqrt(lambda)).
            model_prime_norm = np.linalg.norm(model_prime)
            regularizer = 1 / (math.sqrt(lbd) * model_prime_norm)
            model = model_prime * min(1, regularizer)
        model_sum += model
    # Online-to-batch conversion: we simply average the model over all
    # iterations. This is a very simple and provably accurate method.
    # TODO(andrei) Consider skipping first 'k' elements.
    avg_model = model_sum / rows
    return avg_model
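A quick usage sketch of the averaged model. Here `train_x`/`train_y` are assumed to be a pandas DataFrame of features and a single-column DataFrame of ±1 labels, and `test_x` is assumed held-out data; these names are placeholders, not defined above.
In [ ]:
model = svm_ocp(train_x, train_y)
# Classify with the sign of the linear score; 'model' is the plain average of
# all per-iteration models produced by the online-to-batch conversion.
predictions = np.sign(test_x.values.dot(model))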
Note: AdaGrad is not covered in the textbook (as of January 2016).
TODO: add more info, since we need it to really grok AdaGrad.
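In the meantime, here is a minimal sketch (not from the textbook) of the diagonal AdaGrad update applied to the same regularized hinge-loss setting as above. The function name `svm_adagrad` and the defaults for `lbd`, `eta`, and `eps` are assumptions made for this illustration.
In [ ]:
import math

def svm_adagrad(train_x, train_y, lbd=0.005, eta=0.1, eps=1e-8):
    """Illustrative diagonal-AdaGrad sketch for the regularized hinge loss."""
    rows, features = train_x.shape
    model = np.zeros(features)
    grad_sq_sum = np.zeros(features)  # running sum of squared gradients, per feature
    for x, y in zip(train_x.values, train_y.values):
        score = (y * np.dot(x, model))[0]
        # Subgradient of the regularized hinge loss at the current model.
        grad = lbd * model - (y * x if score < 1 else 0)
        grad_sq_sum += grad ** 2
        # Per-coordinate learning rate: features with large accumulated
        # gradients take smaller steps.
        model -= eta * grad / (np.sqrt(grad_sq_sum) + eps)
    return model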
\begin{equation} d_{\text{Mahalanobis}}(p, c) = \sqrt{\sum_{i=1}^{d}\left( \frac{p_i - c_i}{\sigma_i} \right)^{2}} \end{equation}
The Mahalanobis distance is essentially the distance between a point $p$ and the centroid $c$ of a cluster, normalized by the standard deviation $\sigma_i$ of the cluster in each dimension.
-- Rajaraman, Anand, and Jeffrey D. Ullman. Mining of Massive Datasets. Vol. 77. Cambridge: Cambridge University Press, 2012.
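A small numeric sketch of the formula above; the arrays `point`, `centroid`, and `stds` are made-up example values, not taken from the text.
In [ ]:
point = np.array([1.0, 3.0, 5.0])
centroid = np.array([0.0, 2.0, 8.0])
stds = np.array([0.5, 1.0, 2.0])  # per-dimension standard deviations of the cluster
# Normalize each coordinate difference by its standard deviation, then take
# the Euclidean norm, exactly as in the equation above.
mahalanobis = np.sqrt(np.sum(((point - centroid) / stds) ** 2))
print(mahalanobis)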
In [ ]: