June 2016 - This notebook was created by [Oriol Pujol Vila](http://www.maia.ub.es/~oriol). Source and [license](./LICENSE.txt) info are in the folder.
In [ ]:
#pip install tqdm
Consider the problem up to this point. Let us recall the three basic components of the algorithm.
Let us consider the following graph representation of the problem:
The model function is $f(x,\omega)$, where $\omega$ are the parameters to optimise. For example, in the case of a linear model we may have $f(x,\omega) = \sum_i \omega_i x_i = \omega^T x$. The loss function is represented with $\mathcal{L}(y,t)$, where $y=f(x,\omega)$ and $t$ is the target value. Remember that the loss function models the dissimilarity between the output of the model and the true value to predict. The last element of the learning algorithm is the optimisation algorithm. In our case is an algorithm with the goal of minimizing the dissimilarity between target value and the model output, i.e.
$$\underset{\omega}{\text{minimize}} \quad \mathcal{L}(f(x,\omega),t)$$Let us refactor the code using OOP in order to take into account these three elements:
In [ ]:
class model:
def __init__(self):
pass
def forward(self,x):
#Takes a data point and evaluates f(x,w)
pass
class loss:
def __init__(self):
pass
def evaluate(self,x,t):
#Evaluates the loss function L(y,t)
pass
class optimize:
def __init__(self):
pass
def run(self, data, target, model, loss):
#Takes a loss function and a model and find the optimal parameters for an specific data set
pass
We know one way of optimising algorithms in front of large scale data sets, i.e. stochastic subgradient methods. In short this is an iterative algorithm that updates the unknown variables proportional to minus the gradient magnitude at each iteration.
Let us fill in the gaps for a classical linear regression where the model is $f(x,\omega) = \omega^Tx$,
In [ ]:
import numpy as np
class model:
def __init__(self,w):
self.w = w
def forward(self,x):
#Takes a data point and evaluates f(x,w)
return np.dot(self.w,x.T)
using a least squares loss function $\mathcal{L}(y,t) = (t-y)^2$,
In [ ]:
class loss:
def __init__(self):
pass
def evaluate(self,model,x,t):
#Evaluates the loss function L(y,t)
y = model.forward(x)
return (t-y)*(t-y)
using stochastic gradient descend,
In [ ]:
class optimize:
def __init__(self,t):
self.num_iter = t
def run(self, data, target, model, loss):
#Takes a loss function and a model and find the optimal parameters for an specific data set
N_samples = data.shape[0]
for t in xrange(self.num_iter):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
yi = target[idx]
#Step 2.- update the parameters to optimise
model.w = model.w - eta * "gradient_L_omega(xi,yi)"
The gradient with respect to the parameter to optimise is $$\nabla_\omega\mathcal{L(f(x,\omega),t)} = (\frac{\partial \mathcal{L}}{\partial \omega_1},\frac{\partial \mathcal{L}}{\partial \omega_2}, \dots, \frac{\partial \mathcal{L}}{\partial \omega_d}) = \frac{\partial \mathcal{L}}{\partial \bar{\omega}}$$.
Recall the graph representation
Observe that the complete differentiation is easier this way.
Given that $\mathcal{L}(y,t) = (t-y)^2$, then $$\frac{\partial \mathcal{L}}{\partial y} = -2(t-y)$$
Given that $y = \omega^Tx$, then $$ \frac{\partial y}{\partial \bar{\omega}} = x$$
Let us try to understand visually what the chain rule is about. This will constitute the basis of the backpropagation algorithm, that will be next used to train deep learning techniques and the basis of automatic differentiation, a programatically optimum way for computing differentials.
In [ ]:
class optimize:
def __init__(self):
pass
def run(self, data, target, model, loss, num_iter, eta):
#Takes a loss function and a model and find the optimal parameters for an specific data set
N_samples = data.shape[0]
for t in xrange(num_iter):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
yi = target[idx]
#Step 2.- update the parameters to optimise
model.w = model.w - eta * loss.gradient(model,xi,yi)*model.gradient(xi)
Thus we need to add the gradient terms to the model and the loss.
In [ ]:
import numpy as np
class model:
def __init__(self,d):
self.w = np.zeros((1,d+1))
def forward(self,x):
#Takes a data point and evaluates f(x,w)
return np.dot(self.w[0,:-1],x.T)+self.w[0,-1]
def gradient(self,x):
return np.concatenate((x,np.array([1])))
def gradient_x(self,x):
return self.w[0,:-1]
In [ ]:
class loss:
def __init__(self):
self.y = 0
def evaluate(self,model,x,t):
#Evaluates the loss function L(y,t)
self.y = model.forward(x)
return (t-self.y)*(t-self.y)
def gradient(self,model,x,t):
self.y = model.forward(x)
return -2.*(t-self.y)
Let us put all the pieces together and solve a toy problem
In [ ]:
import numpy as np
#Example dataset
N_samples_per_class = 1000
d_dimensions = 2
x = np.vstack((np.random.randn(N_samples_per_class, d_dimensions),np.random.randn(N_samples_per_class, d_dimensions)+np.array([3,3])))
y = np.vstack((-1.*np.ones((N_samples_per_class,1)),1.*np.ones((N_samples_per_class,1))))
%matplotlib inline
import matplotlib.pyplot as plt
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1,color='pink')
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1,color='pink')
f=model(d_dimensions)
Z = f.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
Let us add some control to check how it works. We will compute the loss every 1000 iterations, just for checking purposes and add a plot method for showing convergence.
In [ ]:
import matplotlib.pyplot as plt
class optimize:
def __init__(self):
pass
def run(self, data, target, model, loss, num_iter, eta):
#Takes a loss function and a model and find the optimal parameters for an specific data set
self.l = np.zeros((np.ceil(num_iter/1000),1))
N_samples = data.shape[0]
print N_samples
i=0
for t in xrange(num_iter):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
yi = target[idx]
# <-- Start new code
if t%1000==0:
self.l[i] = np.sum(loss.evaluate(model,data[:1000,:],target[:1000,0]))
i=i+1
## End new code -->
#Step 2.- update the parameters to optimise
model.w = model.w - eta * loss.gradient(model,xi,yi)*model.gradient(xi)
def plot(self):
plt.plot(self.l)
Now let us run the method.
In [ ]:
num_iter = 100000
eta = 0.001 #optimization step/learning rate
f = model(d_dimensions)
L = loss()
opt = optimize()
opt.run(x,y,f,L,num_iter,eta)
In [ ]:
opt.plot()
Let us visually check the results
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1,color='pink')
Z = f.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
Let us create a new loss function more adequate for classification purposes. The hinge loss
$\mathcal{L}(y,t) = \max(0, 1-t\cdot y)$
In [ ]:
#display
class loss_hinge:
def __init__(self):
self.y = 0
def evaluate(self,model,x,t):
#Evaluates the loss function L(y,t)
self.y = model.forward(x)
return np.maximum(0,1-t*self.y)
def gradient(self,model,x,t):
self.y = model.forward(x)
if (1-t*self.y)>0:
return -t
else:
return 0
In [ ]:
num_iter = 10000
eta = 0.01 #optimization step/learning rate
f = model(d_dimensions)
L = loss_hinge()
opt = optimize()
opt.run(x,y,f,L,num_iter,eta)
In [ ]:
opt.plot()
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1,color='pink')
Z = f.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
We still do not have a perfect result. But we could modify the data set so that our classifier solves the problem perfectly. Remember that the gradient can be seen as what change we must made to a certain variable in order to maximize some cost function. What if the variable is data? This would mean how should we change data for a perfect classification?.
Let us create a new optimizer in order to update data.
In [ ]:
# your code
from IPython.display import clear_output, Image, display
%matplotlib inline
class optimize_data:
def __init__(self):
pass
def run(self, data, target, model, loss, num_iter, eta):
#Takes a loss function and a model and find the optimal parameters for an specific data set
N_samples = data.shape[0]
for t in xrange(num_iter):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
yi = target[idx,:]
#Step 2.- update the parameters to optimise
data[idx,:] = data[idx,:] - eta * ##### YOUR CODE HERE ##############
return data
Let us train a classifier first.
In [ ]:
num_iter = 100000
eta = 0.001 #optimization step/learning rate
f = model(d_dimensions)
L = loss_hinge()
opt = optimize()
opt.run(x,y,f,L,num_iter,eta)
Now we check the results and see that the result is good but does not classify all data perfectly.
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1,color='pink')
Z = f.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
And now, modify the data so that the trained classifier classifies all data points.
In [ ]:
num_iter = 500000
eta = 0.05 #optimization step/learning rate
x_orig = x.copy()
Lh = loss_hinge()
opt = optimize_data()
x_mod = opt.run(x_orig,y,f,Lh,num_iter,eta)
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x_mod[idx.ravel(),0],x_mod[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x_mod[idx.ravel(),0],x_mod[idx.ravel(),1],alpha=0.4,color='pink')
Z = f.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
where
$$y_2 = f_2(f_1(x)).$$Observe that we are just changing the model while the loss and the optimization function remains the same.
Different to the other case the parameters are arranged in layers. Thus the computation of the gradient with respect to the parameters is a little more involved. However, we will use standard chain rule. In this case we want to find
$$\frac{\partial \mathcal{L}}{\partial \bar{\omega_2}} = \frac{\partial \mathcal{L}}{\partial y_2}\frac{\partial y_2}{\partial \bar{\omega_2}}$$and $$\frac{\partial \mathcal{L}}{\partial \bar{\omega_1}} = \frac{\partial \mathcal{L}}{\partial y_2}\frac{\partial y_2}{\partial \bar{y_1}}\frac{\partial \bar{y_1}}{\partial \bar{\omega_1}}$$
Observe that this decomposition allows to decouple each layer in the following terms $$\frac{\partial \mathcal{L}}{\partial \bar{\omega_1}} = \frac{\partial \mathcal{L}}{\partial \bar{y_1}}\frac{\partial \bar{y_1}}{\partial \bar{\omega_1}}$$
In general for $N$ layers the update of the parameters of the $m$-th layer is written as $$\frac{\partial \mathcal{L}}{\partial \bar{\omega_m}} = \frac{\partial \mathcal{L}}{\partial \bar{y_m}}\frac{\partial \bar{y_m}}{\partial \bar{\omega_m}} = \frac{\partial \mathcal{L}}{\partial \bar{y_N}}\prod\limits_{i=m}^{N-1}\frac{\partial \bar{y_{i+1}}}{\partial \bar{y_i}} \frac{\partial \bar{y_m}}{\partial \bar{\omega_m}}$$
Thus we have to define for each layer two gradients:
This is what it is called backpropagation.
In [ ]:
import numpy as np
class layer:
def __init__(self):
pass
def forward(self): # evaluate the layer
pass
def backward(self): # gradient with respect to the inputs
pass
def gradient(self): # gradient with respect to the parameters
pass
In [ ]:
class reluLayer(layer): #this is a linear layer with relu activation
def __init__(self,input_dim,n_neurons):
self.x = np.zeros((1,input_dim))
self.z = np.zeros((1,n_neurons))
self.w = np.random.randn(input_dim,n_neurons)
self.y = np.zeros((1,n_neurons))
def forward(self,x): # evaluate the layer
self.z = np.dot(x, self.w)
self.y = np.maximum(0,self.z)
return self.y
def backward(self): # gradient with respect to the inputs
return np.select([self.z>0],[self.w],default=0).T
def gradient(self): # gradient with respect to the parameters
dydz = np.where(self.z>0,1.,0.)
return np.dot(dydz[:,np.newaxis],self.x).T
class model:
def __init__(self):
self.architecture = []
self.y_ = []
def addLayer(self,layer):
self.architecture.append(layer)
def forward(self,x):
#Takes a data point and evaluates f(x,w)
self.y_= x
for layer in self.architecture:
self.y_=layer.forward(self.y_)
return self.y_
In [ ]:
class optimize:
def __init__(self):
pass
def run(self, data, target, model, loss, num_iter, eta):
#Takes a loss function and a model and find the optimal parameters for an specific data set
N_samples = data.shape[0]
for t in xrange(num_iter):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
yi = target[idx,:]
#Step 2.- update the parameters to optimise
dLdx= loss.gradient(model,xi,yi)[:,np.newaxis]
for layer in reversed(model.architecture):
tmp = layer.w - eta * np.dot(layer.gradient(),dLdx.T)
dLdx = np.dot(dLdx,layer.backward())
layer.w = tmp
In [ ]:
num_iter = 10000
eta = 0.01 #optimization step/learning rate
nn = model()
nn.addLayer(reluLayer(2,10))
nn.addLayer(reluLayer(10,100))
nn.addLayer(reluLayer(100,30))
nn.addLayer(reluLayer(30,10))
nn.addLayer(reluLayer(10,1))
L = loss()
opt = optimize()
x_mod = opt.run(x,y,nn,L,num_iter,eta)
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.3)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.3,color='green')
Z = nn.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
In [ ]:
class loss:
def __init__(self):
self.y = 0
def evaluate(self,model,x,t):
#Evaluates the loss function L(y,t)
self.y = model.forward(x)
return (t-self.y)*(t-self.y)
def gradient(self,model,x,t):
self.y = model.forward(x)
return -2.*(t-self.y)
class reluLayer(layer): #this is a linear layer with relu activation
def __init__(self,input_dim,n_neurons):
self.x = np.zeros((1,input_dim))
self.z = np.zeros((1,n_neurons))
self.w = np.random.randn(input_dim,n_neurons) #Normal Random Initialization
self.y = np.zeros((1,n_neurons))
def forward(self,x): # evaluate the layer
self.z = np.dot(x, self.w)
self.y = np.maximum(0,self.z)
return self.y
def backward(self): # gradient with respect to the inputs
return np.select([self.z>0],[self.w],default=0).T
def gradient(self): # gradient with respect to the parameters
dydz = np.where(self.z>0,1.,0.)
return np.dot(dydz[:,np.newaxis],self.x).T
class model:
def __init__(self):
self.architecture = []
self.y_ = []
def addLayer(self,layer):
self.architecture.append(layer)
def forward(self,x):
#Takes a data point and evaluates f(x,w)
self.y_= x
for layer in self.architecture:
self.y_=layer.forward(self.y_)
return self.y_
class optimize:
def __init__(self):
pass
def run(self, data, target, model, loss, num_iter, eta):
#Takes a loss function and a model and find the optimal parameters for an specific data set
N_samples = data.shape[0]
for t in xrange(num_iter):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
yi = target[idx,:]
#Step 2.- update the parameters to optimise
dLdx= loss.gradient(model,xi,yi)[:,np.newaxis]
for layer in reversed(model.architecture):
layer.w = layer.w - eta * np.dot(layer.gradient(),dLdx.T)
dLdx = np.dot(dLdx,layer.backward())
In [ ]:
import numpy as np
#Example dataset
N_samples_per_class = 1000
d_dimensions = 2
x = np.vstack((np.random.randn(N_samples_per_class, d_dimensions),np.random.randn(N_samples_per_class, d_dimensions)+np.array([3,3])))
y = np.vstack((-1.*np.ones((N_samples_per_class,1)),1.*np.ones((N_samples_per_class,1))))
%matplotlib inline
import matplotlib.pyplot as plt
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.1,color='pink')
In [ ]:
import numpy as np
import tqdm
class loss:
def __init__(self):
self.y = 0
def evaluate(self,model,x,t):
#Evaluates the loss function L(y,t)
self.y = model.forward(x)
return (t-self.y)*(t-self.y)
def gradient(self,model,x,t):
self.y = model.forward(x)
return -2.*(t-self.y)
class reluLayer(layer): #this is a linear layer with relu activation
def __init__(self,input_dim,n_neurons):
self.z = np.zeros((1,n_neurons))
self.w = np.random.randn(input_dim+1,n_neurons)
#self.w = self.w/np.dot(np.ones((input_dim+1,1)),np.sum(self.w*self.w,axis=0)[np.newaxis,:])
self.y = np.zeros((1,n_neurons))
def forward(self,x): # evaluate the layer
self.xext = np.concatenate((x,np.ones((x.shape[0],1))),axis=1)
self.z = np.dot(self.xext, self.w)
self.y = np.maximum(0,self.z)
return self.y
def backward(self): # gradient with respect to the inputs
dydz = np.tile(np.where(self.z>0.,1.,0.),(self.w.shape[0],1))
tw = dydz*self.w
return tw[:-1,:]
def gradient(self): # gradient with respect to the parameters
dydz = np.where(self.z>0.,1.,0.)
return np.dot(dydz.T,self.xext)
class linearLayer(layer): #this is a linear layer with relu activation
def __init__(self,input_dim,n_neurons):
self.w = np.random.randn(input_dim+1,n_neurons)
self.y = np.zeros((1,n_neurons))
def forward(self,x): # evaluate the layer
self.xext = np.concatenate((x,np.ones((x.shape[0],1))),axis=1)
self.y = np.dot(self.xext, self.w)
return self.y
def backward(self): # gradient with respect to the inputs
return self.w[:-1,:]
def gradient(self): # gradient with respect to the parameters
return self.xext
class model:
def __init__(self):
self.architecture = []
self.y_ = []
def addLayer(self,layer):
self.architecture.append(layer)
def forward(self,x):
#Takes a data point and evaluates f(x,w)
self.y_= x
for layer in self.architecture:
self.y_=layer.forward(self.y_)
return self.y_
class optimize:
def __init__(self, debug = False, plot_convergence = 10.):
self.debug_ = debug
self.pc = plot_convergence
def run(self, data, target, model, loss, num_iter, eta):
#Takes a loss function and a model and find the optimal parameters for an specific data set
self.l = np.zeros((np.ceil((num_iter*1.0)/self.pc),1))
print self.l.shape
i=0
N_samples = data.shape[0]
for t in tqdm.tqdm(xrange(num_iter)):
#Step 1.-take a sample x at random from the training set
idx = np.random.randint(N_samples)
xi = data[idx,:]
xi = xi[np.newaxis,:]
yi = target[idx,:]
if (t*1.)%self.pc==0:
self.l[i] = np.sum(loss.evaluate(model,data,target))
i=i+1
#Step 2.- update the parameters to optimise
dLdx= loss.gradient(model,xi,yi)
for layer in reversed(model.architecture):
if self.debug_:
print "xi"+str(xi.shape)
print "g"+str(layer.gradient().shape)
print "dLdx"+str(dLdx.shape)
print "w"+str(layer.w.shape)
print "b"+str(layer.backward().shape)
tmp = layer.w - eta * np.dot(dLdx,layer.gradient()).T
dLdx = np.dot(dLdx,layer.backward().T)
layer.w = tmp
def plot(self):
plt.plot(self.l)
In [ ]:
num_iter = 100000
eta = 0.001 #optimization step/learning rate
nn = model()
nn.addLayer(reluLayer(2,5))
nn.addLayer(reluLayer(5,5))
nn.addLayer(linearLayer(5,1))
L = loss()
opt = optimize(plot_convergence=100)
x_mod = opt.run(x,y,nn,L,num_iter,eta)
opt.plot()
In [ ]:
np.sum(L.evaluate(nn,x,y))
In [ ]:
plt.hist(nn.forward(x))
In [ ]:
delta = 0.05
xx = np.arange(-5.0, 8.0, delta)
yy = np.arange(-5.0, 8.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1);
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.3)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.3,color='green')
Z = nn.forward(test_data)
Z.shape=sz
plt.imshow(Z, interpolation='bilinear', origin='lower', extent=(-6,8,-6,8),alpha=0.3, vmin=-15, vmax=15)
plt.contour(XX,YY,Z,[0])
fig = plt.gcf()
fig.set_size_inches(9,9)
In [ ]:
def test_forward():
nn = model()
nn.addLayer(reluLayer(2,1))
w = np.ndarray((3,1), buffer=np.array([1.,2.,0.]))
nn.architecture[0].w = w
x = np.ndarray((1,2), buffer=np.array([-0.1,0.1]))
assert nn.forward(x)==0.1, nn.forward(x)
nn = model()
nn.addLayer(reluLayer(2,1))
w = np.ndarray((3,1), buffer=np.array([1.,2.,0.]))
nn.architecture[0].w = w
x = np.ndarray((1,2), buffer=np.array([0.1,-0.1]))
assert nn.forward(x)==0., nn.forward(x)
def test_gradient():
nn = model()
nn.addLayer(reluLayer(2,1))
w = np.ndarray((3,1), buffer=np.array([1.,2.,0.]))
nn.architecture[0].w = w
x = np.ndarray((1,2), buffer=np.array([-0.1,0.1]))
l = loss()
eg = np.ndarray((1,3), buffer=np.array([-0.1,0.1,1.]))
eb = np.ndarray((2,1), buffer=np.array([1.,2.]))
#Step 2.- update the parameters to optimise
dLdx= l.gradient(nn,x,y)
for layer in reversed(nn.architecture):
assert np.all(np.equal(layer.gradient(),eg)), layer.gradient()
assert np.all(np.equal(layer.backward(),eb)), layer.backward()
def test_loss():
nn = model()
nn.addLayer(reluLayer(2,1))
w = np.ndarray((3,1), buffer=np.array([1.,2.,0.]))
nn.architecture[0].w = w
l = loss()
x = np.ndarray((1,2), buffer=np.array([-0.1,0.1]))
t=1.
assert l.evaluate(nn,x,t)==0.81, "Expected 0.81, returned: "+str(l.evaluate(nn,x,t))
assert l.gradient(nn,x,t)==-1.8, "Expected -1.8, returned: "+str(l.gradient(nn,x,t))
test_forward()
test_loss()
test_gradient()
In general, we may have a workflow that involves an acyclic directed graph. If that is the case we find two new elements where gradient is involved, splitting nodes and joint nodes.
Consider an splitting node. How will backpropagation work in that case?
Looking at the picture we realise that $$(y_1,y_2)=f(x,\omega) = (f_1(x,\omega),f_2(x,\omega)).$$
Remember that in backpropagation we need to define two derivatives, the update derivative $\frac{\partial \mathcal{L}}{\partial \omega}$ and the backpropagation derivative $\frac{\partial \mathcal{L}}{\partial x}$.
Remember that at any layer we are able to define these two quantities as soon as we have $\frac{\partial \mathcal{L}}{\partial \bar{y}} = (\frac{\partial \mathcal{L}}{\partial y_1}, \frac{\partial \mathcal{L}}{\partial y_2})$.
Again we use the differentiation chain rule,
$$\frac{\partial \mathcal{L}}{\partial \omega} = \frac{\partial \mathcal{L}}{\partial \bar{y}}^T\frac{\partial \bar{y}}{\partial \omega}$$As stated in the former lines $$\frac{\partial \mathcal{L}}{\partial \bar{y}} = (\frac{\partial \mathcal{L}}{\partial y_1},\frac{\partial \mathcal{L}}{\partial y_2})^T$$ and
$$\frac{\partial \bar{y}}{\partial \omega} = (\frac{\partial y_1}{\partial \omega},\frac{\partial y_2}{\partial \omega})^T$$Thus, $$\frac{\partial \mathcal{L}}{\partial \omega} = (\frac{\partial \mathcal{L}}{\partial y_1},\frac{\partial \mathcal{L}}{\partial y_2})^T \left(\begin{matrix}\frac{\partial y_1}{\partial \omega}\\ \frac{\partial y_2}{\partial \omega}\end{matrix}\right) = $$
$$= \frac{\partial \mathcal{L}}{\partial y_1} \frac{\partial y_1}{\partial \omega} + \frac{\partial \mathcal{L}}{\partial y_2}\frac{\partial y_2}{\partial \omega}$$Observe that the influence of both gradients backpropagating through the graph is added.
In [ ]: