Neural networks (NNs) became popular for many reasons. One of them is extensibility. A NN is composed of modules (blocks), where each module implements some functionality. By combining these modules, one can build state-of-the-art NNs with existing NN packages. Recent wonderful NN ideas often require just defining a new module or slightly modifying an existing one. This notebook should help you understand what modules are and what other abstractions are used in NNs.
At first, let's think of a NN as a black-box model (we don't care or know how it works inside, but when we ask it to do something, it politely does). What functionality should this black box implement to be practical? Well, the same as other discriminative models: (1) given an input, produce an output (a prediction), and (2) given feedback on how wrong that output was, adjust its parameters to do better next time.
The first point implies that the black box should implement a function (we call it forward):
$$\text{output = NN.forward(input)}$$The second point means the model should be able to compute the gradient with respect to (w.r.t.) its parameters and return it to us. We then use this gradient to perform a parameter update. The gradient is computed during the backward call:
$$\text{NN.backward(input, criterion(output, target))}$$and the gradients are retrieved with, let's say:
$$\text{gradParameters = NN.getGradParameters()}$$Here, the criterion quantifies how wrong your model is when it predicts output while target was expected.
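To make the interface concrete, here is a minimal Python sketch of what such a black box and criterion could look like. The class and method names mirror the pseudocode above and are illustrative, not any particular package's API.

```python
class BlackBoxNN:
    """Hypothetical NN interface matching the pseudocode above."""
    def forward(self, input):
        # compute and return the output for the given input
        raise NotImplementedError

    def backward(self, input, gradOutput):
        # accumulate gradients w.r.t. parameters and return the gradient w.r.t. input
        raise NotImplementedError

    def getGradParameters(self):
        # return the parameter gradients accumulated during backward
        raise NotImplementedError


class Criterion:
    """Hypothetical criterion: a scalar measure of how wrong the output is."""
    def forward(self, output, target):
        # return the loss value
        raise NotImplementedError

    def backward(self, output, target):
        # return the gradient of the loss w.r.t. output
        raise NotImplementedError
```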
After Seminar 2 it should be clear how we use the gradient: we feed it to one of the optimizers (SGD, AdaGrad, Adam, NAG) to perform the parameter update.
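For example, plain SGD just takes a step against the gradient. A minimal sketch, assuming parameters and gradients come as lists of numpy arrays (the function name is made up for illustration):

```python
import numpy as np

def sgd_update(params, grad_params, learning_rate=0.01):
    # vanilla SGD: move each parameter against its gradient
    for p, g in zip(params, grad_params):
        p -= learning_rate * g  # in-place update of the numpy array

# toy usage
W = np.zeros((2, 3))
dW = np.ones((2, 3))
sgd_update([W], [dW], learning_rate=0.1)  # W is now filled with -0.1
```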
Forward pass:
$$ \text{output = NN.forward(input)} \\ \text{loss = criterion.forward(output, target)} $$Backward pass:
$$ \text{NNGrad = criterion.backward(output, target)} \\ \text{NN.backward(input, NNGrad)} \\ $$Parameters update:
$$ \text{gradParameters = NN.getGradParameters()} \\ \text{optimizer.update(currentParams, gradParameters)} \\ $$There can be slight technical variations, but the high-level idea is always the same. The forward pass and the parameter update should be clear; the hardest part to understand is backprop.
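Putting it together, one training step could look like the sketch below. It assumes the interfaces drawn above; `NN.getParameters()` is an assumed helper for fetching the current parameters (the pseudocode only calls them currentParams).

```python
def train_step(NN, criterion, optimizer, input, target):
    # forward pass
    output = NN.forward(input)
    loss = criterion.forward(output, target)

    # backward pass
    NNGrad = criterion.backward(output, target)
    NN.backward(input, NNGrad)

    # parameters update
    gradParameters = NN.getGradParameters()
    currentParams = NN.getParameters()  # assumed helper, not defined above
    optimizer.update(currentParams, gradParameters)
    return loss
```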
The last thing before discussing backprop is to open up our black box: we are old enough to know the truth.
As said in the introduction, a NN is composed of modules and, surprisingly, these modules are NNs too by definition! Remember, the left or right child of a binary tree is also a tree, and the leaves are trees themselves. Roughly the same logic applies here, but to directed acyclic graphs (you can think of a chain at first). You can find "starter" and "final" nodes in these graphs (the start and end of a chain); the data goes through the graph following the edge directions, and each node applies its forward function until the last node is reached. On the backward pass the graph is traversed from the "final" nodes to the "starter" nodes, and each node applies its backward function to whatever the previous node passed.
Here is one real-world NN; the data flows from left to right.
So the cool thing is: each node is a NN, and every connected subgraph is a NN. We have already defined everything we need; you just need a set of "simple" NNs which are used as building blocks for complex models! That is exactly what NN packages implement for you and what you are to do in the homework.
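A rough sketch of the simplest such container, a chain (often called Sequential), assuming each module follows the forward/backward interface sketched earlier:

```python
class Sequential:
    """Chains modules: forward left-to-right, backward right-to-left."""
    def __init__(self, modules):
        self.modules = modules

    def forward(self, input):
        # remember the input of every module; it is needed on the backward pass
        self.inputs = [input]
        for module in self.modules:
            self.inputs.append(module.forward(self.inputs[-1]))
        return self.inputs[-1]

    def backward(self, input, gradOutput):
        # traverse the chain in reverse; each module turns the gradient
        # w.r.t. its output into the gradient w.r.t. its input
        for i, module in reversed(list(enumerate(self.modules))):
            gradOutput = module.backward(self.inputs[i], gradOutput)
        return gradOutput
```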
Be careful! In this section the variable $x$ denotes the parameters of the NN, not the input data. Think of the data as fixed for now, so the loss is a function of the parameters, and we try to find the parameters that lower the loss.
Let $ f(x) $ denote the function the NN applies to the input data and $ g(o) $ the criterion. Then $$ L(x) = g(f(x); \text{target}) $$
We aim to find $\nabla_x L$. Obviously, if $f,g: \mathbb{R} \rightarrow \mathbb{R}$, then by the chain rule:
$$ \frac{dL}{dx} = \frac{dg}{df}\frac{df}{dx}$$and the practical formula is:
$$ \left.\frac{dL}{dx}\right|_{x=x_0} = \left.\frac{dg}{df}\right|_{u = f(x_0)} \cdot \left.\frac{df}{dx}\right|_{x=x_0} $$What about the multidimensional case? Basically the same: it is a sum of one-dimensional chains. $$ \frac{\partial L}{\partial x_i} = \sum_{j = 1}^m \frac{\partial L}{\partial f_j} \frac{\partial f_j}{\partial x_i}. $$
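When you implement backward for your modules, it is handy to check the analytic gradient against finite differences. A toy sketch (the functions here are made-up stand-ins: $f(x) = x^2$ elementwise and $g(f) = \sum_j \sin f_j$), just to show the chain rule at work:

```python
import numpy as np

def numeric_grad(L, x, eps=1e-6):
    # central finite differences of a scalar function L at point x
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus.flat[i] += eps
        x_minus.flat[i] -= eps
        grad.flat[i] = (L(x_plus) - L(x_minus)) / (2 * eps)
    return grad

x0 = np.random.randn(5)
L = lambda x: np.sum(np.sin(x ** 2))   # L(x) = g(f(x)) with f(x) = x^2, g = sum(sin)
analytic = np.cos(x0 ** 2) * 2 * x0    # dL/dx_i = dg/df_i * df_i/dx_i
print(np.allclose(numeric_grad(L, x0), analytic, atol=1e-5))  # expected: True
```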
Actually, that is all you need to write backprop functions! Go to the differentiation notebook for some practice before the homework.