In our previous examples we have seen different algorithms, and we could use scikit-learn functions to obtain their parameters. But do you know how they are implemented under the hood? To find out, we can start with the perceptron, which is also an important concept for future studies in deep learning.
The perceptron is a supervised learning algorithm for binary linear classification problems. A comprehensive description of how a perceptron works is out of scope here: to follow this tutorial you should already know what a perceptron is and understand the basics of its functionality. Additionally, a fundamental understanding of stochastic gradient descent is needed. For the theoretical background, I recommend the Wikipedia article:
Furthermore, I highly recommend the book by Schölkopf & Smola. Do not let the math scare you; they explain the basics of machine learning in a very comprehensive way:
Schölkopf, B. & Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
To better understand the internal processes of a perceptron in practice, we will now develop a perceptron from scratch, step by step.
In [1]:
# import our packages
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
We will implement the perceptron algorithm in Python 3 using NumPy. The perceptron will learn using stochastic gradient descent (SGD). Gradient descent minimizes a function by following the gradient of the cost function. For further details see:
Wikipedia - stochastic gradient descent
To calculate the error of a prediction we first need to define the objective function of the perceptron. To do this, we start with the loss function that measures the prediction error. We will use the hinge loss for our perceptron:
$$c(x, y, f(x)) = (1 - y * f(x))_+$$

Here $c$ is the loss function, $x$ the sample, $y$ the true label and $f(x)$ the predicted label. This means the following:

$$ c(x, y, f(x))= \begin{cases} 0, & \text{if } y * f(x)\geq 1\\ 1-y*f(x), & \text{else} \end{cases} $$

So if $y$ and $f(x)$ are signed values $(+1,-1)$, the loss is $0$ whenever they agree ($y * f(x) = 1$) and $2$ whenever they disagree, i.e. whenever the sample is misclassified.
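To make the hinge loss concrete, here is a minimal NumPy sketch; the function name hinge_loss is only illustrative and not part of the perceptron code we build below:

In [ ]:
def hinge_loss(y, fx):
    # hinge loss: 0 if y * f(x) >= 1, otherwise 1 - y * f(x)
    return np.maximum(0, 1 - y * fx)

# signed labels and predictions (+1, -1)
print(hinge_loss(1, 1))    # correct prediction -> 0
print(hinge_loss(1, -1))   # misclassification  -> 2
print(hinge_loss(-1, -1))  # correct prediction -> 0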
As we have defined the loss function, we can now define the objective function for the perceptron. For a single sample $x_i$ with label $y_i$ it is

$$l_i(w) = \big(-y_i \langle x_i,w \rangle\big)_+$$

We can also write the dot product as a sum over the components of the sample: $$l_i(w) = \Big(-y_i \sum_{j} x_{ij} w_j\Big)_+$$ So the sample $x_i$ is misclassified if $y_i \langle x_i,w \rangle \leq 0$. The general goal is to find the global minimum of this function, that is, a weight vector $w$ for which the error is zero.
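As a quick illustration of this condition, consider a small example; the sample, label and weight vector below are arbitrary values chosen only for demonstration:

In [ ]:
# an example sample, label and weight vector (illustrative values only)
x_i = np.array([1, 6])
y_i = 1
w = np.array([1.0, -1.0])

# per-sample objective l_i(w) = max(0, -y_i * <x_i, w>)
print(max(0, -y_i * np.dot(x_i, w)))  # 5.0 -> positive loss
print(y_i * np.dot(x_i, w) <= 0)      # True -> x_i is misclassified under this w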
To do this we need the gradient of the objective function. The gradient of a function is the vector of its partial derivatives, so we obtain it by partially deriving the objective function with respect to $w$. For a misclassified sample this gives:
$$ \nabla l_i(w) = -y_i x_i $$

This means that if we have a misclassified sample $x_i$, that is $ y_i \langle x_i,w \rangle \leq 0 $, we update the weight vector $w$ by moving it in the direction of the misclassified sample:

$$w = w + y_i x_i$$

With this update rule in mind, we can start writing our perceptron algorithm in Python.
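As a sketch of a single SGD step under this rule (the helper name sgd_step and the learning rate argument eta are only illustrative; the actual training function follows later):

In [ ]:
def sgd_step(w, x_i, y_i, eta=1.0):
    # only a misclassified sample (y_i * <x_i, w> <= 0) triggers an update
    if y_i * np.dot(x_i, w) <= 0:
        # move w in the direction of the misclassified sample
        w = w + eta * y_i * x_i
    return w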
In [2]:
X = np.array([
[-2, 4],
[4, 1],
[1, 6],
[2, 4],
[6, 2]
])
Next we fold a bias term of -1 into the data set as an additional constant feature. This is needed so the SGD can learn the offset of the separating hyperplane. For details see: The Perceptron algorithm
In [3]:
X = np.array([
[-2,4,-1],
[4,1,-1],
[1, 6, -1],
[2, 4, -1],
[6, 2, -1],
])
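Writing the bias column by hand works fine for five samples; as an equivalent sketch, the extra -1 feature could also be appended programmatically (the names X_raw and X_alt are only used in this snippet):

In [ ]:
# append a constant -1 column to the original 5x2 data set
X_raw = np.array([[-2, 4], [4, 1], [1, 6], [2, 4], [6, 2]])
bias = -np.ones((X_raw.shape[0], 1))
X_alt = np.hstack([X_raw, bias])
print(X_alt)  # same rows as X above, with -1 as the last column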
In [4]:
y = np.array([-1,-1,1,1,1])
This small toy data set contains two samples labeled with $-1$ and three samples labeled with $+1$. This means we have a binary classification problem, as the data set contains two classes. Let's plot the data set to see that it is linearly separable:
In [5]:
for d, sample in enumerate(X):
    # Plot the negative samples (the first two)
    if d < 2:
        plt.scatter(sample[0], sample[1], s=120, marker='_', linewidths=2)
    # Plot the positive samples (the last three)
    else:
        plt.scatter(sample[0], sample[1], s=120, marker='+', linewidths=2)

# Plot a possible hyperplane that separates the two classes.
plt.plot([-2,6],[6,0.5])
Out[5]:
Finally we can code our perceptron algorithm in just a few lines. Step by step, it does the following:

line 1: Initialize the weight vector for the perceptron with zeros
line 2: Set the learning rate to 1
line 3: Set the number of epochs
line 4: Iterate over the whole data set once per epoch.
line 5: Iterate over each sample in the data set
line 6: Misclassification condition $y_i \langle x_i,w \rangle \leq 0$
line 7: Update rule for the weights $w = w + y_i * x_i$ including the learning rate
In [6]:
def perceptron_sgd(X, Y):
    #line 1: initialize the weight vector with zeros
    w = np.zeros(len(X[0]))
    #line 2: set the learning rate to 1
    eta = 1
    #line 3: set the number of epochs
    epochs = 10

    #line 4: iterate epochs times over the whole data set
    for t in range(epochs):
        #line 5: iterate over each sample in the data set
        for i, x_i in enumerate(X):
            #line 6: misclassification condition y_i * <x_i, w> <= 0
            if (np.dot(x_i, w) * Y[i]) <= 0:
                #line 7: update rule w = w + eta * y_i * x_i
                w = w + eta * x_i * Y[i]

    return w
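With the loop filled in, we can do a quick sanity check on our toy data. The exact weight vector depends on the number of epochs; if some samples are still on the wrong side of the hyperplane, the number of epochs can simply be increased:

In [ ]:
w = perceptron_sgd(X, y)
print(w)

# y_i * <x_i, w> should be positive for a correctly classified sample
for x_i, y_i in zip(X, y):
    print(y_i * np.dot(x_i, w))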