Perceptrons, invented in 1957 by Frank Rosenblatt, are the simplest form of feedforward networks. They are linear classifiers: they find a linear function to predict whether a piece of data belongs to a given class or not. A perceptron is formed by a layer of input units and a layer of output units. In the simplest case the output layer consists of just one unit:
The first layer of units is $\mathbf{x} = (x_0,\dots,x_{n+1})$. At each time step the units of this layer are filled with the values of an input pattern $\mathbf{p}_k = (p_0,\dots,p_n)$, while the first unit is permanently set to 1, so that $\mathbf{x} = (1, p_0,\dots,p_n)$. We call this first unit, which receives a constant signal, the bias unit. Each connection from a unit $x_i$ of the first layer to the output unit $y$ has a weight $w_i$ that is initially set to 0.
The activation of the output unit is given by:
$$ \begin{eqnarray} &net = \sum_{i} w_i x_i\\ &y = \begin{cases} 1 & \quad \text{if } net > 0\\ 0 & \quad \text{otherwise}\\ \end{cases}\\ \end{eqnarray} $$where $net$ is the weighted sum of the inputs.
Using linear algebra we can rewrite $net$ in a shorter form as: $$ net = \mathbf{w}^T\mathbf{x} $$
where $\mathbf{w}^T$ is the vector $\mathbf{w} = (w_0, \dots, w_{n+1})$ transposed into a row vector and $\mathbf{w}^T\mathbf{x}$ is the dot product, a linear algebra operation that computes the weighted sum in a single step.
Given two vectors of the same size $\mathbf{a} \in \mathbb{R}^n$ and $\mathbf{b} \in \mathbb{R}^n$, the dot product $\mathbf{a}^T\mathbf{b}$ produces the scalar value $a_0 b_0 + \dots +a_n b_n$.
In Python you write:
# The first unit is the bias unit
x[0] = 1
# Fill the rest of the x layer
# with the pattern values
for i in range(1, len(x)):
    x[i] = p[i - 1]
# Reset the weighted sum
net = 0
# Accumulate the weighted
# sum of the inputs
for i in range(len(x)):
    net += w[i] * x[i]
# Evaluate the output
y = 1 * (net > 0)
Using pylab, which provides linear algebra functions, this becomes:
# Build the x layer: the bias unit
# followed by the pattern values
x = hstack([1, p])
# Evaluate the weighted sum - dot product
net = dot(w, x)
# Evaluate the output
y = step(net)
where step(x) is the step function, defined as:
def step(x):
    return 1 * (x > 0)
Using linear algebra in a neural network implementation is far simpler than writing explicit loops: it is less error prone and it also produces much faster code.
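To get a feel for the speed difference, here is a minimal sketch comparing the two versions with the standard timeit module; the vector size and the number of repetitions are arbitrary assumptions:
import timeit
import numpy as np

n = 1000                           # hypothetical number of units
w = np.random.randn(n)
x = np.random.randn(n)

def loop_net():
    # explicit loop over the weighted sum
    net = 0.0
    for i in range(n):
        net += w[i] * x[i]
    return net

def dot_net():
    # the same computation with the dot product
    return np.dot(w, x)

print(timeit.timeit(loop_net, number=1000))
print(timeit.timeit(dot_net, number=1000))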
If the network is composed of more than one output unit, the weighted sums can be expressed as: $$ net_j = \sum_{i} w_{ji} x_i $$ and they can be written in linear algebra notation as: $$ \mathbf{net} = \mathbf{W}\mathbf{x} $$ where $ \mathbf{net} = (net_0, \dots, net_m)$ is the vector of weighted sums, $\mathbf{W} \in \mathbb{R}^{m\times n}$ is the matrix of weights from $ \mathbf{x}$ to $ \mathbf{y}$ (with $w_{ji}$ the weight from input unit $x_i$ to output unit $y_j$), and $\mathbf{W}\mathbf{x}$ is the product of a matrix and a vector, a matrix product of which the dot product is a special case.
Given two matrices $\mathbf{A} \in \mathbb{R}^{m\times k}$ and $\mathbf{B} \in \mathbb{R}^{k\times n}$, the matrix product $\mathbf{A}\mathbf{B}$ produces a new matrix $\mathbf{P} \in \mathbb{R}^{m\times n}$: $$ \begin{bmatrix} \mathbf{a}^T_{row_0}\mathbf{b}_{col_0} &\dots& \mathbf{a}^T_{row_0}\mathbf{b}_{col_n} \\ \vdots &\ddots &\vdots \\ \mathbf{a}^T_{row_m}\mathbf{b}_{col_0} &\dots& \mathbf{a}^T_{row_m}\mathbf{b}_{col_n} \end{bmatrix}\ $$
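As a quick check of this definition, here is a small numpy sketch (the matrices are arbitrary illustrative values) showing that each element of the product is the dot product of a row of A and a column of B:
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])              # shape (2, 3)
B = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])                  # shape (3, 2)
P = np.dot(A, B)                          # shape (2, 2)
# element (0, 1) equals the dot product of row 0 of A and column 1 of B
print(P[0, 1] == np.dot(A[0, :], B[:, 1]))    # True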
Magically, compared with the single-output case, in pylab there is almost no change:
# Evaluate the weighted sum - matrix product
net = dot(W, x)
# Evaluate the output
y = step(net)
where W is now a matrix and y is a vector.
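Here is a self-contained sketch of the forward pass with several output units; the sizes and the random pattern are arbitrary assumptions, and the bias column is included in W:
import numpy as np

def step(x):
    return 1 * (x > 0)

n = 4                              # hypothetical number of pattern values
m = 3                              # hypothetical number of output units

p = np.random.rand(n)              # an input pattern
x = np.hstack([1, p])              # bias unit plus the pattern, shape (n + 1,)
W = np.zeros([m, n + 1])           # one row of weights per output unit

net = np.dot(W, x)                 # vector of weighted sums, shape (m,)
y = step(net)                      # vector of outputs, shape (m,)
print(y)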
Learning consists of updating the weights so that the output $y$ becomes more and more similar to a desired output $o_k$ when the input $\mathbf{p}_k$ is presented to the network. In perceptrons learning can be done online, meaning that the weights can be updated each time a single input pattern is presented. The weight update at each time step is given by: $$ \Delta w_i = \eta (o_k - y)x_i \\ $$ or, in linear algebra notation: $$ \Delta \mathbf{w} = \eta(o_k - y)\mathbf{x} $$ where $\eta$ is a value determining the rate of weight change per time step (it is typically small) and $o_k - y$ is the error in reproducing the desired value. In Python we write:
w += eta * (o - y) * x
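To see the rule at work, here is a minimal sketch that trains a single-output perceptron online on the logical OR function; the dataset, the learning rate and the number of epochs are illustrative assumptions:
import numpy as np

def step(x):
    return 1 * (x > 0)

# OR function: input patterns and desired outputs
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 1])

eta = 0.1                          # learning rate
w = np.zeros(3)                    # bias weight plus two input weights

for epoch in range(20):
    for p, o in zip(patterns, targets):
        x = np.hstack([1, p])          # add the bias unit
        y = step(np.dot(w, x))         # forward pass
        w += eta * (o - y) * x         # online weight update

# after training the outputs should reproduce the OR function
print([step(np.dot(w, np.hstack([1, p]))) for p in patterns])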
In the case of many output units the learning rule becomes: $$ \Delta w_{ji} = \eta (o_{kj} - y_j)x_i $$ which in linear algebra notation is:
$$ \Delta \mathbf{W} = \eta (\mathbf{o}_k - \mathbf{y})\mathbf{x}^T $$where $\mathbf{o}_k - \mathbf{y}$ is now a vector, $\mathbf{x}^T$ is the vector $\mathbf{x}$ transposed into a row vector, and $(\mathbf{o}_k - \mathbf{y})\mathbf{x}^T$ is a new form of vector product: the outer product.
Given two vectors $\mathbf{a} \in \mathbb{R}^m$ and $\mathbf{b} \in \mathbb{R}^n$, the outer product $\mathbf{a}\mathbf{b}^T$ produces a matrix $\mathbf{M} \in \mathbb{R}^{m\times n}$: $$ \begin{bmatrix} a_{0} b_{0} & \dots & a_{0} b_{n} \\ \vdots & \ddots & \vdots \\ a_{m} b_{0} & \dots & a_{m} b_{n} \end{bmatrix} $$
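A tiny sketch (with arbitrary illustrative vectors) checking this definition against numpy's outer:
import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5.])
M = np.outer(a, b)                  # shape (3, 2)
# each element (i, j) is the product a[i] * b[j]
print(M[2, 1] == a[2] * b[1])       # True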
In Python the learning rule becomes:
W += eta * outer(o - y, x)
- A general rule: the dot product is for spreading the activity throughout a network, while the outer product is for the weight update, as sketched below.
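The next sketch puts the two operations side by side in a single learning step of a network with several output units; the sizes, the pattern and the desired output are arbitrary assumptions:
import numpy as np

def step(x):
    return 1 * (x > 0)

n, m = 4, 3                         # hypothetical numbers of inputs and outputs
eta = 0.1

p = np.random.rand(n)               # an input pattern
o = np.array([1, 0, 1])             # a desired output vector
W = np.zeros([m, n + 1])

x = np.hstack([1, p])
y = step(np.dot(W, x))              # dot product: spread the activity forward
W += eta * np.outer(o - y, x)       # outer product: update the weights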
The weights of the network can be viewed as the parameters of a linear equation. This equation defines a boundary in the space of all possible inputs. All points lying on one side of the boundary belong to the class, all the others do not. If the network has two inputs plus the bias, the input space is a plane and the boundary is a line. In this case it is:
$$x_2 = -\frac{x_1 w_1 + w_0}{w_2}$$ We can assess how much the network has learned by measuring the error. We compute the error at each time step as the sum of the squared errors (SSE) over the output units: $$ E_t = \sum_i \frac{1}{2}(o_{it} - y_{it})^2 $$ Furthermore, we sum all the SSE collected during the $k$-th presentation of the whole series of input patterns $p$: $$ E_k = \sum_{p}\sum_{i} \frac{1}{2}(o_{pi} - y_{pi})^2 $$ If the network is learning, the error decreases and converges to a minimum. After a while the error cannot decrease any further and the network reaches a steady state.
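Here is a minimal sketch of how $E_k$ and the boundary line could be computed for a network with two inputs plus the bias; the weight values and the OR patterns are illustrative assumptions:
import numpy as np

def step(x):
    return 1 * (x > 0)

w = np.array([-0.05, 0.1, 0.1])     # hypothetical weights: bias, w1, w2
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 1])

# sum of squared errors over one presentation of all the patterns
E_k = 0.0
for p, o in zip(patterns, targets):
    x = np.hstack([1, p])
    y = step(np.dot(w, x))
    E_k += 0.5 * (o - y) ** 2
print(E_k)

# the boundary line in the (x1, x2) plane
x1 = np.linspace(-0.5, 1.5, 5)
x2 = -(x1 * w[1] + w[0]) / w[2]
print(list(zip(x1, x2)))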