Welcome to the Tutorial!

First I'll introduce the theory behind neural nets. Then we will implement one from scratch in numpy (which is installed on the uni computers); just type the code into your text editor of choice. I'll also show you how to define a neural net in Google's DL library TensorFlow (which is not installed on the uni computers) and train it to classify handwritten digits.

You will understand things better if you're familiar with calculus and linear algebra, but the only thing you really need to know is basic programming. Don't worry if you don't understand the equations.

Numpy/linear algebra crash course

(You should be able to run all of this in Python 2.7.8 on the uni computers.)

Vectors and matrices are the language of neural networks. For our purposes, a vector is a list of numbers and a matrix is a 2D grid of numbers. Both can be defined as instances of numpy's ndarray class:


In [9]:
import numpy as np
my_vector = np.asarray([1,2,3])
my_matrix = np.asarray([[1,2,3],[10,10,10]])
print(my_matrix*my_vector)


[[ 1  4  9]
 [10 20 30]]

Putting an ndarray through a function will apply it elementwise:


In [10]:
print(my_matrix**2)


[[  1   4   9]
 [100 100 100]]
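
Note that `*` above is an elementwise product, not the matrix product that the network equations use. A small sketch of the difference, reusing the same two arrays (nothing here is new apart from the variable names `elementwise` and `product`):

```python
import numpy as np

my_vector = np.asarray([1, 2, 3])
my_matrix = np.asarray([[1, 2, 3], [10, 10, 10]])

# Elementwise product: my_vector is broadcast across each row of my_matrix.
elementwise = my_matrix * my_vector

# Matrix-vector product: each row of my_matrix is dotted with my_vector.
product = my_matrix.dot(my_vector)

print(elementwise)
print(product)
```

The elementwise result keeps the 2x3 shape of the matrix, while the dot product collapses each row into a single number.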

What is a neural network?

For our data-sciencey purposes, it's best to think of a neural network as a function approximator or a statistical model. Surprisingly enough, it is made up of a network of neurons. So what is a neuron?

WARNING: huge oversimplification that will make neuroscientists cringe.

This is what a neuron in your brain looks like. On the right are the axons; on the left are the dendrites, which receive signals from the axons of other neurons. The dendrites are connected to the axons by synapses. If the voltage across the neuron is high enough, it will "spike" and send a signal through its axon to neighbouring neurons. Some synapses are excitatory: a signal passing through them increases the voltage across the next neuron, making it more likely to spike. Others are inhibitory and do the opposite. We learn by changing the strengths of synapses (well, kinda), and that is also usually how artificial neural networks learn.

This is what the simplest possible artificial neuron looks like. This neuron is connected to two input neurons named $x_1$ and $x_2$ by "synapses" $w_1$ and $w_2$. All of these symbols are just numbers (real/float). To get the neuron's output signal $h$, sum the inputs, each weighted by its "synapse", then put the result through a nonlinear function $f$: $$ h = f(x_1 w_1 + x_2 w_2)$$

$f$ can be anything that maps a real number to a real number, but for ML you want something nonlinear and smooth. For this neuron, $f$ is the sigmoid function: $$\sigma(x) = \frac{1}{1+e^{-x}} $$ Sigmoid squashes its input into the range (0, 1), so the neuron is closer to "fully firing" the more positive its input, and closer to "not firing" the more negative its input.
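
As a quick sketch of the two-input neuron described above, the formula for $h$ can be written directly in numpy. The helper name `neuron` and the input and weight values are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x1, x2, w1, w2):
    """Output h of a two-input sigmoid neuron."""
    return sigmoid(x1 * w1 + x2 * w2)

# Input and weight values below are made up for illustration.
print(neuron(1.0, 2.0, 0.5, -0.25))   # weighted sum is 0, so h is 0.5
print(neuron(1.0, 2.0, 3.0, 3.0))     # large positive sum, h close to 1
print(neuron(1.0, 2.0, -3.0, -3.0))   # large negative sum, h close to 0
```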

If you like to think in terms of graph theory, neurons are nodes and the weights are edges. If you have a stats background you might have noticed that this looks similar to a logistic regression on two variables. That's because it is!

As you can see, these artificial neurons are only loosely inspired by biological neurons. That's OK: our goal is to have a good model, not to simulate a brain.

There are many exciting ways to arrange these neurons into a network, but we will focus on one of the easier, more useful topologies called a "two-layer perceptron", which looks like this:

Neurons are arranged in layers, with the first hidden layer of neurons connected to a vector (think list of numbers) of input data, $x$, sometimes referred to as an "input layer". Every neuron in a given layer is connected to every neuron in the previous layer.

The weighted sum going into a single neuron is

$$net = \sum_{i=1}^{N}x_i w_i = \vec{x} \cdot \vec{w}$$

where $\vec{x}$ is a vector of the previous layer's $N$ neuron activations and $\vec{w}$ is a vector holding one weight (synapse) $w_i$ for every $x_i$ in $\vec{x}$.
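
To make the equivalence between the sum and the dot product concrete, here is a minimal sketch that computes $net$ both ways. The values in `x` and `w` are made up for illustration:

```python
import numpy as np

x = np.asarray([0.5, -1.0, 2.0])   # previous layer's activations (made-up values)
w = np.asarray([1.0, 3.0, 0.25])   # one weight per input (made-up values)

# The sum written out as a loop...
net_loop = 0.0
for x_i, w_i in zip(x, w):
    net_loop += x_i * w_i

# ...and the same sum as a dot product.
net_dot = x.dot(w)

print(net_loop, net_dot)   # both are -2.0
```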

Look back at the diagram again. Each of these 4 hidden units has a vector of 3 weights, one for each of the inputs. We can arrange them as the columns of a 3x4 matrix, which we call $W_1$. Then we can multiply $\vec{x}$ by this matrix and apply our nonlinearity $f$ elementwise to get a vector of neuron activations: $$\vec{h}_1 = f( \vec{x} \cdot W_1 )$$ Actually, in practice we add a unique learnable "bias" $b$ to every neuron's weighted sum, which has the effect of shifting the nonlinearity left or right: $$\vec{h}_1 = f( \vec{x} \cdot W_1 + \vec{b}_1 )$$ We do pretty much the same thing to get the output of the second layer, but with a different weight matrix $W_2$ and bias $\vec{b}_2$: $$\vec{h}_2 = f( \vec{h}_1 \cdot W_2 + \vec{b}_2 )$$
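
The shapes are worth checking once by hand. The sketch below assumes the 3-input, 4-hidden-unit layout from the text and fills the weights with random placeholder values; it shows that one matrix product gives the same activation as computing a single unit from its own weight vector, and that a bias moves the point where the sigmoid crosses 0.5:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sizes follow the text: 3 inputs, 4 hidden units. Values are random placeholders.
rng = np.random.RandomState(0)
x = rng.normal(size=(3,))
W_1 = rng.normal(size=(3, 4))   # column j holds the 3 weights of hidden unit j
b_1 = np.zeros(4)

h_1 = sigmoid(x.dot(W_1) + b_1)
print(h_1.shape)                # (4,): one activation per hidden unit

# Hidden unit 0 computed on its own, from column 0 of W_1.
unit_0 = sigmoid(x.dot(W_1[:, 0]) + b_1[0])
print(np.isclose(unit_0, h_1[0]))

# A bias shifts the sigmoid: with bias b, the output crosses 0.5 at input -b.
b = 2.0
print(sigmoid(-b + b))          # 0.5
```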

So if we want to get an output for a given data vector x, we can just plug it into these equations. Here it is in numpy:


In [3]:
def sigmoid(x):
    return 1.0/(1.0+np.exp(-x))

# Example sizes (assumed, matching the diagram): 3 inputs, 4 hidden units, 1 output
x = np.random.normal(size=(3,))
W1, b_1 = np.random.normal(size=(3,4)), np.zeros(4)
W2, b_2 = np.random.normal(size=(4,1)), np.zeros(1)

hidden_1 = sigmoid(x.dot(W1) + b_1)
output = hidden_1.dot(W2) + b_2  # left linear here; wrap in sigmoid() to match the equation for h_2

Learning

Well, that's all very nice, but we need it to be able to learn.


In [ ]:
N,D = 300,2 #  number of examples, dimension of examples
X = np.random.uniform(size=(N,D),low=0,high=20)
y = (X[:,0] * X[:,1]).reshape(N,1) # target: product of the two inputs, shape (N,1)

def relu(x):
    return np.maximum(0, x)

class TwoLayerPerceptron:
    """Simple implementation of the most basic neural net"""
    def __init__(self,X,H,Y):
        N,D = X.shape
        N,O = Y.shape
        # initialize the weights, or "connections between neurons", to random values.
        self.W1 = np.random.normal(size=(D,H))
        self.b1 = np.zeros((H,))
        self.W2 = np.random.normal(size=(H,O))
        self.b2 = np.random.normal(size=(O,))

    def forward_pass(self,X):
        """Get the outputs for batch X, and a cache of hidden states for backprop"""
        hidden_inputs = X.dot(self.W1) + self.b1 # matrix multiply
        hidden_activations = relu(hidden_inputs)
        output = hidden_activations.dot(self.W2) + self.b2
        cache = [X, hidden_inputs, hidden_activations, output]
        return output, cache

    def backwards_pass(self,cache,y):
        """Gradients of the mean of 0.5*(output - y)**2 (one possible way to fill in this step)"""
        [X, hidden_inputs, hidden_activations, output] = cache
        d_output = (output - y) / X.shape[0] # the targets y are needed here, hence the extra argument
        d_W2 = hidden_activations.T.dot(d_output)
        d_b2 = d_output.sum(axis=0)
        # relu only passes gradient back where its input was positive
        d_hidden = d_output.dot(self.W2.T) * (hidden_inputs > 0)
        d_W1 = X.T.dot(d_hidden)
        d_b1 = d_hidden.sum(axis=0)
        return d_W1, d_W2, d_b1, d_b2

    def subtract_gradients(self,gradients,lr=0.001):
        [d_W1, d_W2, d_b1, d_b2] = gradients
        self.W1 -= lr * d_W1
        self.W2 -= lr * d_W2
        self.b1 -= lr * d_b1
        self.b2 -= lr * d_b2
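
The update in `subtract_gradients` is plain gradient descent: nudge each parameter a small step against its gradient. A minimal one-parameter sketch of the same rule, on a toy problem made up for illustration (recovering `w_true` from `y = w_true * x`), might look like this:

```python
import numpy as np

# Toy problem (made up for illustration): recover w_true from y = w_true * x.
rng = np.random.RandomState(0)
x = rng.uniform(low=-1, high=1, size=100)
w_true = 3.0
y = w_true * x

w = 0.0     # initial guess
lr = 0.1    # learning rate
for step in range(500):
    error = w * x - y
    d_w = np.mean(error * x)   # gradient of mean(0.5 * error**2) with respect to w
    w -= lr * d_w              # same update rule as subtract_gradients

print(w)   # ends up very close to w_true
```

The learning rate and step count are arbitrary choices that happen to work for this toy problem; a real network usually needs some tuning of both.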

In [ ]:
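# One gradient descent step written out by hand (H and lr are example values)
H, lr = 10, 0.001
W1, b1 = np.random.normal(size=(D,H)), np.zeros(H)
W2, b2 = np.random.normal(size=(H,1)), np.zeros(1)
y = np.reshape(y, (N,1)) # make sure the targets are a column
hidden_inputs = np.dot(X,W1) + b1
hidden_activations = np.maximum(0, hidden_inputs) # relu
output = np.dot(hidden_activations,W2) + b2
errors = 0.5 * (output - y)**2
d_output = (output - y) / N # gradient of the mean error w.r.t. the output
d_W2 = np.dot(hidden_activations.T, d_output)
d_b2 = np.sum(d_output,axis=0)
d_h1 = np.dot(d_output,W2.T) * (hidden_inputs > 0) # backprop through relu
d_W1 = np.dot(X.T, d_h1)
d_b1 = np.sum(d_h1,axis=0)
W2 -= lr * d_W2 # step against the gradient
b2 -= lr * d_b2
W1 -= lr * d_W1
b1 -= lr * d_b1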
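
Hand-derived gradients are easy to get wrong, so it can help to compare them against a finite-difference estimate. The sketch below is one way to do that for the second weight matrix of a small two-layer relu network with a mean squared error loss; the function names and all sizes are hypothetical, chosen only for this check:

```python
import numpy as np

def loss(W1, b1, W2, b2, X, y):
    """Mean squared error of a two-layer relu network."""
    hidden = np.maximum(0, X.dot(W1) + b1)
    output = hidden.dot(W2) + b2
    return np.mean(0.5 * (output - y) ** 2)

def analytic_d_W2(W1, b1, W2, b2, X, y):
    """Hand-derived gradient of the loss with respect to W2."""
    hidden = np.maximum(0, X.dot(W1) + b1)
    output = hidden.dot(W2) + b2
    return hidden.T.dot(output - y) / X.shape[0]

def numeric_d_W2(W1, b1, W2, b2, X, y, eps=1e-5):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(W2)
    for idx in np.ndindex(*W2.shape):
        W_plus, W_minus = W2.copy(), W2.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss(W1, b1, W_plus, b2, X, y)
                     - loss(W1, b1, W_minus, b2, X, y)) / (2 * eps)
    return grad

# Small random problem; all sizes and values are placeholders.
rng = np.random.RandomState(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=(5, 1))
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

difference = np.abs(analytic_d_W2(W1, b1, W2, b2, X, y)
                    - numeric_d_W2(W1, b1, W2, b2, X, y))
print(difference.max())   # should be tiny if the derivation is right
```

The same comparison can be repeated for the other parameters, although near the kink of relu the finite-difference estimate is less trustworthy.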

In [ ]:
from IPython.display import display, Math
display(Math(r'h_1 = \sigma(X \cdot W_1 + b_1)'))
