A quick recap of the Gradient Descent method: this is an iterative algorithm to minimize a loss function $L(x)$. We start with a guess of what the answer should be, and then repeatedly take steps proportional to the negative of the gradient at the current point.
$x = x_0$ (initial guess)
Until Convergence is achieved:
$x_{i+1} = x_{i} - \eta \nabla L(x_i)$
For example, let's say $L(x) = x^2 - 2x + 1$ and we start at $x_0 = 2$. Coding the Gradient Descent method in Python:
In [4]:
%matplotlib inline
In [3]:
import numpy as np
def L(x):
    return x**2 - 2*x + 1

def L_prime(x):
    return 2*x - 2

def converged(x_prev, x, epsilon):
    "Return True if the abs value of all elements in x-x_prev are <= epsilon."
    absdiff = np.abs(x - x_prev)
    return np.all(absdiff <= epsilon)

def gradient_descent(f_prime, x_0, learning_rate=0.2, n_iters=100, epsilon=1E-8):
    x = x_0
    for _ in range(n_iters):
        x_prev = x
        x -= learning_rate*f_prime(x)
        if converged(x_prev, x, epsilon):
            break
    return x
x_min = gradient_descent(L_prime, 2)
print('Minimum value of L(x) = x**2 - 2*x + 1.0 is [%.2f] at x = [%.2f]' % (L(x_min), x_min))
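Since $L'(x) = 2(x - 1)$, each update contracts the distance to the minimum by a constant factor:
$x_{i+1} - 1 = x_i - 0.2 \cdot 2(x_i - 1) - 1 = 0.6\,(x_i - 1)$
so starting from $x_0 = 2$ the iterates converge geometrically to $x = 1$, and the cell should report a minimum of $L(1) = 0$.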
In most supervised ML applications, we try to learn a pattern from a number of labeled examples. In Batch Gradient Descent, each iteration loops over the entire set of examples, as sketched below.
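In generic Python, one run of Batch Gradient Descent over a dataset `X`, `Y` might look like the following sketch; `grad_i` (the per-sample gradient of the loss) and `w_0` (the initial weights) are placeholders here, not functions defined elsewhere in this notebook.
In [ ]:
def batch_gradient_descent(grad_i, X, Y, w_0, learning_rate=0.1, n_iters=100):
    """Sketch of Batch Gradient Descent: every step uses *all* examples."""
    w = w_0
    for _ in range(n_iters):
        # One step uses the mean gradient over the whole training set.
        grad = np.mean([grad_i(w, x, y) for x, y in zip(X, Y)], axis=0)
        w = w - learning_rate * grad
    return w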
So, let's build a 1-layer network of Linear Perceptrons to classify Fisher's Iris dataset (again!). Remember that a Linear Perceptron can only distinguish between two classes.
Since there are 3 classes, our mini-network will have 3 Perceptrons. We'll channel the output of each Perceptron $w_i^Tx + b$ into a softmax function to pick the final label. We'll train this network using Gradient Descent over minibatches of examples.
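Concretely, if we stack the three weight vectors $w_1, w_2, w_3$ into a $3 \times 4$ matrix $W$ (three classes, four features per Iris sample) and add a bias $b$, the network computes $\hat{y} = \mathrm{softmax}(Wx + b)$ for each sample $x$, and the predicted class is the index of the largest entry of $\hat{y}$.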
In [2]:
import seaborn as sns
import pandas as pd
iris_df = sns.load_dataset('iris')
print('Columns: %s' % (iris_df.columns.values, ))
print('Labels: %s' % (pd.unique(iris_df['species']), ))
iris_df.head(5)
Out[2]:
The softmax function lets us build a probabilistic classifier: it turns an arbitrary set of values $(v_1, v_2, ..., v_n)$, which may or may not satisfy the properties of a probability distribution, into one:
The resulting probability distribution is the Gibbs Distribution: $v'_i = \frac {\exp(v_i)} {\sum_{j=1}^n \exp(v_j)}$ for $i = 1, 2, ..., n$.
In [5]:
def softmax(x):
    # Uncomment to find out why we shouldn't do it this way...
    # return np.exp(x) / np.sum(np.exp(x))

    # Shift by the max so the largest exponent is exactly 0; this avoids
    # overflow and doesn't change the result.
    scaled_x = x - np.max(x)
    result = np.exp(scaled_x) / np.sum(np.exp(scaled_x))
    return result
a = np.array([-500.9, 2000, 7, 11, 12, -15, 100])
sm_a = softmax(a)
print('Softmax(%s) = %s' % (a, sm_a))
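Running the commented-out version on `a` shows the problem: `np.exp(2000)` overflows float64 to `inf`, and `inf/inf` is `nan`. Subtracting `np.max(x)` first makes the largest exponent 0 so nothing overflows, and it leaves the answer unchanged because the shift cancels in the ratio:
$\frac{\exp(v_i - m)}{\sum_{j=1}^n \exp(v_j - m)} = \frac{\exp(-m)\exp(v_i)}{\exp(-m)\sum_{j=1}^n \exp(v_j)} = \frac{\exp(v_i)}{\sum_{j=1}^n \exp(v_j)}$ for any constant $m$.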
With softmax, we typically use the cross-entropy error as the function to minimize.
The Cross Entropy Error for a given input $X = (x_1, x_2, ..., x_n)$, where each $x_i$ is a vector, is given by:
$L(X) = - \frac {1}{n} \sum_{i=1}^n y_i^T \log(\hat{y_i})$
where $y_i$ is the one-hot encoded true label of $x_i$ and $\hat{y_i}$ is the vector of predicted class probabilities (the softmax output) for $x_i$.
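For example, for a single sample with one-hot label $y_i = (0, 1, 0)$ and predicted probabilities $\hat{y_i} = (0.2, 0.7, 0.1)$, the contribution to the loss is $-y_i^T \log(\hat{y_i}) = -\log(0.7) \approx 0.357$; the closer the predicted probability of the true class gets to 1, the smaller this term becomes.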
In [6]:
def encode_1_of_n(ordered_labels, y):
    label2idx = dict((label, idx)
                     for idx, label in enumerate(ordered_labels))

    def encode_one(y_i):
        enc = np.zeros(len(ordered_labels))
        enc[label2idx[y_i]] = 1.0
        return enc

    return np.array([encode_one(y_i) for y_i in y])
encode_1_of_n(['apple', 'banana', 'orange'],
['apple', 'banana', 'orange', 'apple', 'apple'])
Out[6]:
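The call above encodes each label as a row with a single 1 in that label's column, so the output should look like:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])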
In [7]:
def cross_entropy_loss(Y, Y_hat):
    entropy_sum = 0.0
    log_Y_hat = np.log(Y_hat)
    for y, y_hat in zip(Y, log_Y_hat):
        entropy_sum += np.dot(y, y_hat)
    return -entropy_sum/Y.shape[0]
Y_tst = np.array([[1, 0, 0],
[0, 1, 0]])
# log(Y_hat_tst1) is the same as Y_tst, so we expect the cross-entropy error to be at its minimum (-1) for these tests.
print(Y_tst)
Y_hat_tst1 = np.array([[np.e, 1, 1,],
[1, np.e, 1]])
print(Y_hat_tst1)
print(cross_entropy_loss(Y_tst, Y_hat_tst1))
print()
# expect it to be > -1
Y_hat_tst2 = np.array([[1, 1, 1,],
[1, np.e, 1]])
print(Y_hat_tst2)
print(cross_entropy_loss(Y_tst, Y_hat_tst2))
print()
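Working through these by hand: in the first test $\log(\hat{Y})$ equals $Y$, so each sample contributes a dot product of 1, the sum is 2, and the loss is $-2/2 = -1.0$. In the second test the first sample contributes $\log(1) = 0$ for its true class, so the sum is 1 and the loss is $-1/2 = -0.5$, which is indeed greater than $-1$.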
In [11]:
import pandas as pd


class OneLayerNetworkWithSoftMax:

    def __init__(self):
        self.w, self.bias = None, 0.0
        self.labels = None
        self.output = None

    def init_weights(self, X, Y):
        """
        Initialize a 2D weight matrix as a DataFrame with
        dim(n_labels*n_features).
        """
        self.labels = np.unique(Y)
        w_init = np.random.randn(len(self.labels), X.shape[1])
        self.w = pd.DataFrame(data=w_init)
        self.w.index.name = 'node_id'

    def predict(self, x):
        """
        Return the predicted label of x using the current weights.
        """
        output = self.forward(x, update=False)
        max_label_idx = np.argmax(output)
        return self.labels[max_label_idx]

    def forward(self, x, update=True):
        """
        Calculate softmax(w^Tx + b) for x using the current $w_i$s.
        """
        output = np.dot(self.w, x)
        output += self.bias
        output = softmax(output)
        if update:
            self.output = output
        return output

    def backward(self, x, y, learning_rate):
        """
        Compute the weight update for one sample:
            dw[i] = learning_rate * (y_hat[i] - y[i]) * x

        :param x: one sample vector.
        :param y: one-hot encoded label for x.
        """
        # [y_hat1 - y1, y_hat2 - y2, ...]
        y_hat_min_y = self.output - y
        # Transpose the above to a column vector and then multiply x with
        # each element to produce a 2D array (n_labels*n_features),
        # the same shape as w.
        error_grad = np.apply_along_axis(lambda z: z*x,
                                         1, np.atleast_2d(y_hat_min_y).T)
        dw = learning_rate * error_grad
        return dw

    def print_weight_diff(self, i, w_old, diff_only=True):
        if not diff_only:
            print('Before Iteration [%s]: weights are: \n%s' %
                  (i+1, w_old))
            print('After Iteration [%s]: weights are: \n%s' %
                  (i+1, self.w))
        w_diff = np.abs(w_old - self.w)
        print('After Iteration [%s]: weights diff: \n%s' %
              (i+1, w_diff))

    def _gen_minibatch(self, X, Y, mb_size):
        """Generate `mb_size`-sized chunks from X and Y."""
        n_samples = X.shape[0]
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        # Apply the shuffle so each epoch visits the samples in a new order.
        X, Y = X[indices], Y[indices]
        for start in range(0, n_samples, mb_size):
            yield X[start:start+mb_size, :], Y[start:start+mb_size, :]

    def _update_batch(self, i, X_batch, Y_batch, learning_rate, print_every=100):
        w_old = self.w.copy()
        dw = []
        for x, y in zip(X_batch, Y_batch):
            self.forward(x)
            dw.append(self.backward(x, y, learning_rate))
        # Average the per-sample updates over the minibatch.
        dw_batch = np.mean(dw, axis=0)
        self.w -= dw_batch
        if (i == 0) or ((i+1) % print_every == 0):
            self.print_weight_diff(i, w_old)

    def train(self, X, Y,
              n_iters=1000,
              learning_rate=0.2,
              minibatch_size=30):
        """
        Entry point for the Minibatch SGD training method.
        Calls forward+backward for each (x_i, y_i) pair and adjusts the
        weights w accordingly.
        """
        self.init_weights(X, Y)
        Y = encode_1_of_n(self.labels, Y)
        # Minibatch SGD
        for i in range(n_iters):
            for X_batch, Y_batch in self._gen_minibatch(X, Y, minibatch_size):
                self._update_batch(i, X_batch, Y_batch, learning_rate)


# Set aside test data: the first 10 samples of each species for testing,
# the remaining 40 per species for training.
label_grouper = iris_df.groupby('species')
test = label_grouper.head(10).set_index('species')
train = label_grouper.tail(40).set_index('species')

# Train the Network
X_train, Y_train = train.values, train.index.values
nn = OneLayerNetworkWithSoftMax()
nn.train(X_train, Y_train)

# Test
results = test.apply(lambda row: nn.predict(row.values), axis=1)
results.name = 'predicted_label'
results.index.name = 'expected_label'
results.reset_index()
Out[11]:
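The cell's output lists the predicted label next to the expected one for each of the 30 held-out samples. One way to boil this down to a single number is to compute the fraction of matches:
In [ ]:
# Accuracy over the held-out samples: fraction of rows where the
# predicted label equals the expected (index) label.
accuracy = (results.index.values == results.values).mean()
print('Held-out accuracy: %.2f' % accuracy)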
This is a somewhat involved derivation, and we need to approach it step by step. First, let's work out what the $i$-th sample contributes to the gradient of $L$, i.e. the derivative of $-Y_i^T\ln(\hat{Y_i})$.
Let's draw the structure of the network using networkx for a 2-class problem, so we have 2 Perceptron nodes ($w_1$ and $w_2$) feeding the softmax.
In [ ]:
import networkx as nx
from matplotlib import pylab
G = nx.DiGraph()
G.add_edges_from(
[('i', 'n1'),
('i', 'n2'),
('n1', 's1'),
('n2', 's1'),
('n1', 's2'),
('n2', 's2'),
('s1', 'y1'),
('s2', 'y2'),
])
pos = {'i': (1, 1),
'n1': (2, 0), 'n2': (2, 2),
's1': (3, 0), 's2': (3, 2),
'y1': (4, 0), 'y2': (4, 2),
}
labels = {'i': r'$x_i$',
'n1': r'$w_1$', 'n2': r'$w_2$',
's1': r'$s_1$', # r'$\frac {\exp(z_{i1})} {S_i}$',
's2': r'$s_2$', # r'$\frac {\exp(z_{i2})} {S_i}$'
}
edge_labels = {('i', 'n1'): r'$x_i$',
               ('i', 'n2'): r'$x_i$',
               ('n1', 's1'): r'$w_1^Tx_i$',
               ('n1', 's2'): r'$w_1^Tx_i$',
               ('n2', 's1'): r'$w_2^Tx_i$',
               ('n2', 's2'): r'$w_2^Tx_i$',
               ('s1', 'y1'): r'$\frac {\exp(z_{i1})} {S_i}$',
               ('s2', 'y2'): r'$\frac {\exp(z_{i2})} {S_i}$',
               }
nx.draw(G, pos=pos, node_size=1000)
nx.draw_networkx_labels(G, pos, labels, font_size=15, font_color='white')
nx.draw_networkx_edge_labels(G, pos=pos,
                             edge_labels=edge_labels, font_size=15)
To calculate the derivative of $-Y_i^T\ln(\hat {Y_i})$, we will break the vector product down into a sum:
$L_i = - [y_1 \ln (\hat {y_1}) + y_2 \ln (\hat {y_2}) + ...]$, where $y_1, y_2, ...$ are the components of $Y_i$ and $\hat {y_1}, \hat {y_2}, ...$ are the corresponding components of $\hat {Y_i}$ (the softmax outputs).
We know that
$\begin{equation} y_1 \ln (\hat {y_1}) = y_1 \ln \frac {\exp(z_{i1})} {\exp(z_{i1}) + \exp(z_{i2}) + ...} \\ y_2 \ln (\hat {y_2}) = y_2 \ln \frac {\exp(z_{i2})} {\exp(z_{i1}) + \exp(z_{i2}) + ...} \\ \vdots \end{equation}$
Where $z_{i1} = w_1^Tx_i, z_{i2} = w_2^Tx_i$, and so on.
Our end goal is to calculate $(\frac {\partial L_i}{\partial w_1}, \frac {\partial L_i}{\partial w_2}, ...)$. We can use the Chain rule to produce:
$\begin{equation} \frac {\partial L_i}{\partial w_1} = \frac {\partial L_i} {\partial z_{i1}} \frac {\partial z_{i1}}{\partial w_1} \\ \frac {\partial L_i}{\partial w_2} = \frac {\partial L_i} {\partial z_{i2}} \frac {\partial z_{i2}}{\partial w_2} \\ \vdots \end{equation}$
The denominator is the same for all of $(\hat {y_1}, \hat {y_2}, ...)$, so let's call that $S_i$.
$S_i = \exp(z_{i1}) + \exp(z_{i2}) + ...$
So the equations above become:
$\begin{equation} y_1 \ln(\hat {y_1}) = y_1(z_{i1} - \ln(S_i)) \\ y_2 \ln(\hat {y_2}) = y_2(z_{i2} - \ln(S_i)) \\ \vdots \end{equation}$
Taking the partial derivative of all these equations w.r.t $z_{i1}$, we get:
$\begin{equation} \frac {y_1} {\hat {y_1}} \frac {\partial \hat {y_1}} {\partial z_{i1}} = y_1(1 - \frac {\exp(z_{i1})} {S_i}) = y_1(1 - \hat {y_1}) \\ \frac {y_2} {\hat {y_2}} \frac {\partial \hat {y_2}} {\partial z_{i1}} = - y_2 \frac {\exp(z_{i1})} {S_i} = - y_2 \hat {y_1} \\ \vdots \end{equation}$
Thus, we can express $\frac {\partial L_i} {\partial z_{i1}}$ as:
$\frac {\partial L_i} {\partial z_{i1}} = [y_1(\hat {y_1} - 1) + y_2 \hat {y_1} + y_3 \hat{y_1}+ ...] = [\hat {y_1}(y_1 + y_2 + ...) - y_1] = (\hat {y_1} - y_1)$
since exactly one of $y_1, y_2, ...$ is 1 and all the others are 0, so their sum is 1.
Similarly, we can show:
$\begin{equation} \frac {\partial L_i} {\partial z_{i2}} = (\hat {y_2} - y_2) \\ \frac {\partial L_i} {\partial z_{i3}} = (\hat {y_3} - y_3) \\ \vdots \end{equation}$
Noting that
$\begin{equation} \frac {\partial z_{i1}} {\partial w_1} = x_i \\ \frac {\partial z_{i2}} {\partial w_2} = x_i \\ \vdots \end{equation}$
We finally arrive at the result:
$\begin{equation} \frac {\partial L_i} {\partial w_1} = (\hat {y_1} - y_1)x_i \\ \frac {\partial L_i} {\partial w_2} = (\hat {y_2} - y_2)x_i \\ \vdots \end{equation}$
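This $(\hat {y_k} - y_k)x_i$ term is exactly the per-sample update computed in the backward() method above. As a quick numerical sanity check, we can compare the analytic gradient against a finite-difference approximation of the single-sample cross-entropy loss; the sample x, label y and weights W below are made up just for this check:
In [ ]:
# Numerical check of dL_i/dw_k = (y_hat_k - y_k) * x_i for one random sample.
rng = np.random.RandomState(0)
x = rng.randn(4)                  # one sample with 4 features
y = np.array([0.0, 1.0, 0.0])     # one-hot label: class 2 is the true class
W = rng.randn(3, 4)               # one weight vector w_k per class


def sample_loss(W):
    y_hat = softmax(np.dot(W, x))
    return -np.dot(y, np.log(y_hat))


# Analytic gradient: one row (y_hat_k - y_k) * x per class.
y_hat = softmax(np.dot(W, x))
analytic = np.outer(y_hat - y, x)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        numeric[r, c] = (sample_loss(W_plus) - sample_loss(W_minus)) / (2*eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # should print True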
In [ ]: