Homework 2 (Linear models, Optimization)

In this homework you will implement a simple linear classifier using numpy and your brain.

Two-dimensional classification



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import random
from IPython import display
from sklearn import datasets, preprocessing

(X, y) = datasets.make_circles(n_samples=1024, shuffle=True, noise=0.2, factor=0.4)
ind = np.logical_or(y==1, X[:,1] > X[:,0] - 0.5)
X = X[ind,:]
X = preprocessing.scale(X)
y = y[ind]
y = 2*y - 1
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.show()



In [2]:

    
h = 0.01
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
def visualize(X, y, w, loss, n_iter):
    plt.clf()
    Z = classify(np.c_[xx.ravel(), yy.ravel()], w)
    Z = Z.reshape(xx.shape)
    plt.subplot(1,2,1)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.subplot(1,2,2)
    plt.plot(loss)
    plt.grid()
    ymin, ymax = plt.ylim()
    plt.ylim(0, ymax)
    display.clear_output(wait=True)
    display.display(plt.gcf())

Your task starts here

First, let's write a function that predicts class for given X.

Since the problem above isn't linearly separable, we add quadratic features to the classifier. This transformation is implemented in the expand function.

Don't forget to expand X inside classify and other functions

Sample classification should not be much harder than computation of sign of dot product.



In [3]:

    
def expand(X):
    X_ = np.zeros((X.shape[0], 6))
    X_[:,0:2] = X
    X_[:,2:4] = X**2
    X_[:,4] = X[:,0] * X[:,1]
    X_[:,5] = 1
    return X_

def classify(X, w):
    """
    Given feature matrix X [n_samples,2] and weight vector w [6],
    return an array of +1 or -1 predictions
    """
    X = expand(X)
    y = X.dot(w)
    return np.sign(y)

The loss you should try to minimize is the Hinge Loss:

$$ L = {1 \over N} \sum_{i=1}^N max(0,1-y_i \cdot w^T x_i) $$



In [4]:

    
def compute_loss(X, y, w):
    """
    Given feature matrix X [n_samples,2], target vector [n_samples] of +1/-1,
    and weight vector w [6], compute scalar loss function using formula above.
    """
    X = expand(X)
    y_pred = X.dot(w)
    res = np.maximum(0, 1 - y_pred * y)
    return res.mean()

def compute_grad(X, y, w):
    """
    Given feature matrix X [n_samples,2], target vector [n_samples] of +1/-1,
    and weight vector w [6], compute vector [6] of derivatives of L over each weights.
    """
    X = expand(X)
    y_pred = X.dot(w) * y
    y_pred = np.int0(y_pred < 1)
    grad = -(X.T * y).T
    grad = (grad.T * y_pred).T
    return grad.sum(axis=0)

Training

Find an optimal learning rate for gradient descent for given batch size.

You can see the example of correct output below this cell before you run it.

Don't change the batch size!



In [5]:

    
w = np.array([1,0,0,0,0,0])

alpha = 0.1 # learning rate

n_iter = 50
batch_size = 4
loss = np.zeros(n_iter)
plt.figure(figsize=(12,5))
for i in range(n_iter):
    ind = random.sample(range(X.shape[0]), batch_size)
    loss[i] = compute_loss(X, y, w)
    visualize(X[ind,:], y[ind], w, loss, n_iter)
    
    w = w - alpha * compute_grad(X[ind,:], y[ind], w)
    
visualize(X, y, w, loss, n_iter)
plt.clf()









    












    





<matplotlib.figure.Figure at 0x7f1920b84ba8>

Implement gradient descent with momentum and test it's performance for different learning rate and momentum values.



In [6]:

    
w = np.array([1,0,0,0,0,0])
v = np.zeros(6)

alpha = 0.02 # learning rate
mu    = 0.8 # momentum

n_iter = 50
batch_size = 4
loss = np.zeros(n_iter)

plt.figure(figsize=(12,5))
for i in range(n_iter):
    ind = random.sample(range(X.shape[0]), batch_size)
    loss[i] = compute_loss(X, y, w)
    visualize(X[ind,:], y[ind], w, loss, n_iter)
    
    v = mu * v - alpha * compute_grad(X, y, w)
    w = w + v

visualize(X, y, w, loss, n_iter)
plt.clf()









    












    





<matplotlib.figure.Figure at 0x7f191af31a58>

Same task but for Nesterov's accelerated gradient:



In [8]:

    
w = np.array([1,0,0,0,0,0])
v = np.zeros(6)

alpha = 0.01 # learning rate
mu    = 0.8 # momentum

n_iter = 50
batch_size = 4
loss = np.zeros(n_iter)

plt.figure(figsize=(12,5))
for i in range(n_iter):
    ind = random.sample(range(X.shape[0]), batch_size)
    loss[i] = compute_loss(X, y, w)
    visualize(X[ind,:], y[ind], w, loss, n_iter)
    
    v = mu * v - alpha * compute_grad(X, y, w + mu * v)
    w = w + v

visualize(X, y, w, loss, n_iter)
plt.clf()









    












    





<matplotlib.figure.Figure at 0x7f192011f240>

Finally, try Adam algorithm. You can start with beta = 0.9 and mu = 0.999



In [25]:

    
w = np.array([1,0,0,0,0,0])
v = np.zeros(6)
g = np.zeros(6)

alpha = 2. # learning rate
beta = 0.9  # (beta1 coefficient in original paper) exponential decay rate for the 1st moment estimates
mu   = 0.5  # (beta2 coefficient in original paper) exponential decay rate for the 2nd moment estimates
eps = 1e-4  # A small constant for numerical stability

n_iter = 50
batch_size = 4
loss = np.zeros(n_iter)
plt.figure(figsize=(12,5))
for i in range(n_iter):
    ind = random.sample(range(X.shape[0]), batch_size)
    loss[i] = compute_loss(X, y, w)
    visualize(X[ind,:], y[ind], w, loss, n_iter)
    
    gr = compute_grad(X, y, w)
    v = beta * v + (1 - beta) * gr
    g = mu * g + gr * gr * (1 - mu)
    w = w - v * alpha * (1 - mu) / (g + eps) / (1 - beta)

visualize(X, y, w, loss, n_iter)
plt.clf()









    












    





<matplotlib.figure.Figure at 0x7f191ae2ec50>

Which optimization method do you consider the best? Type your answer in the cell below

Судя по графикам лучше всего градиент с моментом. Однако, при правильном подборе параметров ADAM может быть лучше. Если посмотреть на последний график, то можно заметить что лосс делает меньшие скачки и убывает более монотонно. Однако существенным недостатком данного метода является необходимость тщательного подбора параметров. Можно еще все это сравнить с моментом Нестерова, преимущество которого в том, что он очень быстро подбирается к минимуму (правда дальше начинает ходить где-то в его окрестности и может даже выбиваться на несколько итераций). Но все же для ответа на данный вопрос я бы выбрал второй метод