Write Your Own Perceptron

In our examples we have seen different algorithms, and we could use scikit-learn functions to obtain the parameters. But do you know how they are implemented? To understand this, we can start with the perceptron, which is also an important concept for future studies in deep learning.

The perceptron can be used for supervised learning. It can solve binary linear classification problems. A comprehensive description of the functionality of a perceptron is out of scope here. To follow this tutorial you should already know what a perceptron is and understand the basics of its functionality. Additionally, a fundamental understanding of stochastic gradient descent is needed. For the theoretical background, I recommend the Wikipedia article:

Wikipedia - Perceptron

Furthermore, I highly recommend the book by Schölkopf & Smola. Do not let the math scare you; they explain the basics of machine learning in a very comprehensive way:

Schölkopf & Smola (2002). Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond.

To better understand the internal workings of a perceptron in practice, we will now develop one from scratch, step by step.


In [1]:
# import our packages
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Stochastic Gradient Descent

We will implement the perceptron algorithm in Python 3 with NumPy. The perceptron will learn using the stochastic gradient descent (SGD) algorithm. Gradient descent minimizes a function by following the gradient of the cost function. For further details see:

Wikipedia - stochastic gradient descent

Calculating the Error

To calculate the error of a prediction we first need to define the objective function of the perceptron.

Hinge Loss Function

To do this, we first define a loss function that measures the prediction error. We will use the hinge loss for our perceptron:

$$c(x, y, f(x)) = (1 - y * f(x))_+$$

Here $c$ is the loss function, $x$ the sample, $y$ the true label, and $f(x)$ the predicted label. This means the following: $$ c(x, y, f(x))= \begin{cases} 0,& \text{if } y * f(x)\geq 1\\ 1-y*f(x), & \text{else} \end{cases} $$ So, if $y$ and $f(x)$ are signed values $(+1,-1)$:

  • the loss is $0$ if $y * f(x)$ is positive, i.e. both values have the same sign.
  • the loss is $1 - y * f(x)$ if $y * f(x)$ is negative.
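
To make the two cases concrete, here is a small sketch of the hinge loss in Python (the helper name hinge_loss is ours for illustration, not part of the later perceptron code):

def hinge_loss(y, f_x):
    # Hinge loss: max(0, 1 - y * f(x))
    return max(0, 1 - y * f_x)

print(hinge_loss(1, 1))    # same sign, y * f(x) >= 1 -> loss 0
print(hinge_loss(-1, 1))   # opposite signs           -> loss 2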

Objective Function

Having defined the loss function, we can now define the objective function for the perceptron:

$$l_i(w) = \big(-y_i \langle x_i,w \rangle\big)_+$$

We can write this without the dot product by expanding it as a sum over the feature dimensions: $$l_i(w) = \Big(-y_i \sum_{j=1}^{d} x_{ij} w_j\Big)_+$$ So the sample $x_i$ is misclassified if $y_i \langle x_i,w \rangle \leq 0$. The general goal is to find the global minimum of this function, i.e. a parameter vector $w$ for which the error is zero.
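
As a quick sketch, the per-sample objective and the misclassification condition can be written as follows (the helper names sample_loss and is_misclassified are ours for illustration, assuming numpy is imported as np as in cell [1]):

def sample_loss(w, x_i, y_i):
    # Perceptron objective for one sample: max(0, -y_i * <x_i, w>)
    return max(0.0, -y_i * np.dot(x_i, w))

def is_misclassified(w, x_i, y_i):
    # A sample is misclassified if y_i * <x_i, w> <= 0
    return y_i * np.dot(x_i, w) <= 0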

Derive the Objective Function

To do this, we need the gradient of the objective function. The gradient of a function is the vector of its partial derivatives; taking the partial derivative of the objective with respect to $w$ for a misclassified sample gives:

$$ \nabla l_i(w) = -y_i x_i $$

This means that whenever we have a misclassified sample $x_i$, i.e. $y_i \langle x_i,w \rangle \leq 0$, we update the weight vector $w$ by moving it in the direction of the misclassified sample:

$$w = w + y_i x_i$$
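
A single SGD step under this rule could look like the following sketch (the function name sgd_update and the learning rate argument eta are our own illustration, not part of the tutorial code yet):

def sgd_update(w, x_i, y_i, eta=1.0):
    # Only update if the sample is misclassified: y_i * <x_i, w> <= 0
    if y_i * np.dot(x_i, w) <= 0:
        w = w + eta * y_i * x_i
    return w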

With this update rule in mind, we can start writing our perceptron algorithm in Python.

Our Data Set

First we need to define a labeled data set.


In [2]:
X = np.array([
    [-2, 4],
    [4, 1],
    [1, 6],
    [2, 4],
    [6, 2]
])

Next we fold a bias term of $-1$ into the data set as an additional constant feature, so the bias is learned as part of the weight vector. For details, see The Perceptron algorithm.


In [3]:
X = np.array([
    [-2, 4, -1],
    [4, 1, -1],
    [1, 6, -1],
    [2, 4, -1],
    [6, 2, -1]
])

In [4]:
y = np.array([-1,-1,1,1,1])

This small toy data set contains two samples labeled with $-1$ and three samples labeled with $+1$. This means we have a binary classification problem, as the data set contains two classes. Let's plot the data set to see that it is linearly separable:


In [5]:
for d, sample in enumerate(X):
    # Plot the negative samples
    if d < 2:
        plt.scatter(sample[0], sample[1], s=120, marker='_', linewidths=2)
    # Plot the positive samples
    else:
        plt.scatter(sample[0], sample[1], s=120, marker='+', linewidths=2)

# Plot a possible hyperplane that separates the two classes.
plt.plot([-2,6],[6,0.5])


Out[5]:
[<matplotlib.lines.Line2D at 0x20bd32c1a58>]

Your Task: Let's Start Implementing Stochastic Gradient Descent

Finally we can code our SGD algorithm using our update rule. To keep it simple, we will loop linearly over the sample set. For larger data sets it makes sense to randomly pick a sample in each iteration of the for-loop.

Code Description Line by Line

line 1: Initialize the weight vector for the perceptron with zeros
line 2: Set the learning rate to 1
line 3: Set the number of epochs
line 4: Iterate n times (the number of epochs) over the whole data set
line 5: Iterate over each sample in the data set
line 6: Misclassification condition $y_i \langle x_i,w \rangle \leq 0$
line 7: Update rule for the weights $w = w + \eta \, y_i x_i$, including the learning rate $\eta$


In [6]:
def perceptron_sgd(X, Y):
    #line 1:
    w = np.zeros(len(X[0]))
    
    #line 2:
    eta = 1
    
    #line 3
    epochs = 10
    
    #### Please finish the algorithm here, following the line-by-line description above
    #line 4
    #line 5   
    #line 6        
    #line 7           
    #############
    return w
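
If you want to check your solution afterwards, here is one possible way to complete the function, following the line-by-line description above (this is just one sketch of a solution under the names used so far, not the only correct one; the function name perceptron_sgd_solution is ours so it does not overwrite your own attempt):

def perceptron_sgd_solution(X, Y):
    # line 1: initialize the weight vector with zeros
    w = np.zeros(len(X[0]))
    # line 2: set the learning rate
    eta = 1
    # line 3: set the number of epochs
    epochs = 10
    # line 4: iterate over the whole data set `epochs` times
    for epoch in range(epochs):
        # line 5: iterate over each sample in the data set
        for i, x in enumerate(X):
            # line 6: misclassification condition y_i * <x_i, w> <= 0
            if Y[i] * np.dot(X[i], w) <= 0:
                # line 7: update rule w = w + eta * y_i * x_i
                w = w + eta * X[i] * Y[i]
    return w

# Example usage with the toy data set defined above:
w = perceptron_sgd_solution(X, y)
print(w)  # learned weight vector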

In [ ]: