The Perceptron is a lightweight algorithm which can classify data quite fast. However, it only works in the limited case of a linearly separable, binary dataset. If you have a dataset consisting of only two classes, the Perceptron classifier can be trained to find a linear hyperplane which separates the two. If the dataset is not linearly separable, the Perceptron will fail to find a separating hyperplane.
If the dataset consists of more than two classes, we can use the standard approaches of multiclass classification (one-vs-all and one-vs-one) to transform the multiclass dataset into a set of binary datasets. For example, suppose we have a dataset which consists of three different classes:
Although one-vs-one can be a bit slower (there is one more iteration layer), it is not difficult to imagine that it is more advantageous in situations where a (linear) separating hyperplane does not exist between one class and the rest of the data, while it does exist between that class and each of the other classes when they are considered individually. In the image below there is no separating line between the pear-class and the other two classes.
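To make these two reductions concrete, here is a minimal sketch of the relabeling step for both strategies; each resulting binary dataset could then be fed to the binary Perceptron defined further below (the function names here are purely illustrative):

import numpy as np

def one_vs_all_labels(Y, positive_class):
    # +1 for the chosen class, -1 for all other classes
    return np.where(Y == positive_class, 1, -1)

def one_vs_one_subset(X, Y, class_a, class_b):
    # keep only the examples of the two classes involved,
    # labeled +1 (class_a) and -1 (class_b)
    mask = (Y == class_a) | (Y == class_b)
    return X[mask], np.where(Y[mask] == class_a, 1, -1)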
Another difference is that if the dataset is not linearly separable [2], the Perceptron will fail to find a separating hyperplane; the algorithm simply does not converge during its iteration cycle. The SVM, on the other hand, can still find a maximum-margin, minimum-cost decision boundary (a separating hyperplane which does not separate 100% of the data, but does so with some small error).
The weighted sum between the input-values and the weight-values can mathematically be expressed with the scalar product $ < w, x > $. To produce the behaviour of 'firing' a signal (+1) we can use the signum function $ sgn() $; it maps the output to +1 if the input is positive, and to -1 if the input is negative.
Thus, this Perceptron can mathematically be modeled by the function $ y = sgn(b+ < w, x >) $. Here $ b $ is the bias, i.e. the default value when all feature values are zero.
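In code this model is a single line; below is a tiny sketch with made-up values for $ w $, $ b $ and $ x $, just to illustrate the formula:

import numpy as np

w = np.array([0.4, -0.2])   # weight-vector (illustrative values)
b = 0.1                     # bias (illustrative value)
x = np.array([2.0, 1.5])    # one input example

# y = sgn(b + <w, x>): +1 if the weighted sum plus bias is positive, -1 if it is negative
y = np.sign(b + np.dot(w, x))
print(y)  # 1.0, since 0.1 + (0.4*2.0 - 0.2*1.5) = 0.6 > 0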
The perceptron algorithm looks as follows:
In [1]:
import numpy as np

class Perceptron():
    """
    Class for performing Perceptron.
    X is the input array with n rows (no_examples) and m columns (no_features)
    Y is a vector containing elements which indicate the class
    (1 for positive class, -1 for negative class)
    w is the weight-vector (m number of elements)
    b is the bias-value
    """
    def __init__(self, b = 0, max_iter = 1000):
        self.max_iter = max_iter
        self.w = []
        self.b = b
        self.no_examples = 0
        self.no_features = 0

    def train(self, X, Y):
        self.no_examples, self.no_features = np.shape(X)
        self.w = np.zeros(self.no_features)
        for ii in range(0, self.max_iter):
            w_updated = False
            for jj in range(0, self.no_examples):
                a = self.b + np.dot(self.w, X[jj])
                if np.sign(Y[jj] * a) != 1:
                    # misclassified example: apply the update rule
                    # w <- w + y*x and b <- b + y
                    w_updated = True
                    self.w += Y[jj] * X[jj]
                    self.b += Y[jj]
            if not w_updated:
                print("Convergence reached in %i iterations." % ii)
                break
        if w_updated:
            print(
                """
                WARNING: convergence not reached in %i iterations.
                Either dataset is not linearly separable,
                or max_iter should be increased
                """ % self.max_iter
            )

    def classify_element(self, x_elem):
        return int(np.sign(self.b + np.dot(self.w, x_elem)))

    def classify(self, X):
        predicted_Y = []
        for ii in range(np.shape(X)[0]):
            y_elem = self.classify_element(X[ii])
            predicted_Y.append(y_elem)
        return predicted_Y
As you can see, we initialize the bias-value (zero by default) and all the elements in the weight-vector to zero. Then we iterate at most 'max_iter' times over all the examples in the training set.
Here, $ Y[jj] $ is the actual output value of each training example; this is either +1 (if it belongs to the positive class) or -1 (if it does not belong to the positive class). The activation value $ a = b + < w, x > $ is the predicted output value. The product $ Y[jj] \cdot a $ will be $ > 0 $ if the prediction is correct and $ \leq 0 $ if the prediction is incorrect. Therefore, if the prediction made (with the weight vector from the previous training example) is incorrect, $ sgn(y \cdot a) $ will not be +1, and the weight vector $ w $ and bias $ b $ are updated.
If the weight vector is not updated during a full pass over the training set, it means we have reached convergence and we can break out of the loop.
If the weight vector was still being updated in the last iteration, it means we did not reach convergence: either the dataset is not linearly separable, or we need to increase 'max_iter'.
We can see that the Perceptron is an online algorithm; it iterates through the examples in the training set, and for each example it calculates the value of the activation function $ a $ and, if necessary, updates the weight-vector.
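As a small worked example of a single update step (with made-up numbers), suppose the current weights misclassify a positive example:

import numpy as np

w = np.array([1.0, -1.0])   # current weight-vector
b = 0.0                     # current bias
x = np.array([2.0, 3.0])    # a training example
y = 1                       # its true class

a = b + np.dot(w, x)        # activation: 0.0 + (2.0 - 3.0) = -1.0
if np.sign(y * a) != 1:     # y * a = -1.0, so the prediction is wrong
    w = w + y * x           # updated weights: [3.0, 2.0]
    b = b + y               # updated bias: 1.0

print(w, b)                 # [3. 2.] 1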
In [2]:
import numpy as np
import random

def generate_data(no_points):
    X = np.zeros(shape=(no_points, 2))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = random.randint(1, 9) + 0.5
        X[ii][1] = random.randint(1, 9) + 0.5
        # the class is +1 above the line x1 + x2 = 13 and -1 below it
        Y[ii] = 1 if X[ii][0] + X[ii][1] >= 13 else -1
    return X, Y
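The scatter plot of such a generated dataset (the image referred to below) can be reproduced with a short plotting sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

X, Y = generate_data(100)

# colour each point by its class: green for +1, red for -1
colors = ['g' if y == 1 else 'r' for y in Y]
plt.scatter(X[:, 0], X[:, 1], c=colors)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()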
As we can see in the image above, the dataset contains two linearly separable classes, which are indicated with red and green dots.
Let us now use the Perceptron algorithm on this dataset. In this 2D case, applying the Perceptron looks like this:
In [4]:
# train the Perceptron on 100 generated examples
X, Y = generate_data(100)
p = Perceptron()
p.train(X, Y)

# generate a separate test set and classify it
X_test, Y_test = generate_data(50)
predicted_Y_test = p.classify(X_test)
print(predicted_Y_test)
print(predicted_Y_test - Y_test)
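Entries of zero in the last printed array indicate correct predictions. A quick sanity check of the test accuracy could look like this (a small additional sketch, not part of the original cells):

import numpy as np

# fraction of test examples for which the predicted label equals the true label
accuracy = np.mean(np.array(predicted_Y_test) == Y_test)
print("Test accuracy: %.2f" % accuracy)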
As we can see, the weight vector and the bias (which together determine the separating hyperplane) are updated when $ y_i \cdot a $ is not positive.
The result is nicely illustrated in this gif (reload if nothing is happening):
We can extend this to a dataset in any number of dimensions, and as long as it is linearly separable, the Perceptron algorithm will converge.
One of the benefits of this Perceptron is that it is a very 'lightweight' algorithm; it is computationally very fast and easy to implement for datasets which are linearly separable. But if the dataset is not linearly separable, it will not converge.
For such datasets, the Perceptron can still be used if the correct kernel is applied. In practice this is rarely done, and Support Vector Machines are used whenever a kernel needs to be applied. Some of these kernels are:
| Kernel | Function |
| --- | --- |
| Linear | $ < w, x > $ |
| Polynomial | $ ( < w, x > + c)^{d} $ |
| Laplacian RBF | $ exp(-\lambda \Vert w - x \Vert ) $ |
| Gaussian RBF | $ exp(-\lambda \Vert w - x \Vert^2 ) $ |
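As an illustration, these kernel functions could be written in Python as follows; this is a minimal sketch, and the values of $ \lambda $, $ c $ and $ d $ are hypothetical parameters you would normally tune:

import numpy as np

def linear_kernel(w, x):
    return np.dot(w, x)

def polynomial_kernel(w, x, c=1.0, d=2):
    # (<w, x> + c)^d
    return (np.dot(w, x) + c) ** d

def laplacian_rbf_kernel(w, x, lam=0.5):
    # exp(-lambda * ||w - x||)
    return np.exp(-lam * np.linalg.norm(w - x))

def gaussian_rbf_kernel(w, x, lam=0.5):
    # exp(-lambda * ||w - x||^2)
    return np.exp(-lam * np.linalg.norm(w - x) ** 2)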
For more information about Kernel functions, a comprehensive list of kernels, and their source code, please click here.