There are three popular ways to train a model: batch learning, mini-batch learning, and stochastic learning.
Broadly, these fall into two main divisions: offline (batch) learning, which uses the whole dataset at once, and online learning, which processes one example at a time. In addition to not having to load all the data into memory to train the model, algorithms that train on one example at a time can, in some cases, be faster than those that use all the data at the same time.
There are multiple ways to classify data with machine learning. You could run a logistic regression, use decision trees, or build a neural network to accomplish the task. In 1963, Vladimir Vapnik and Alexey Chervonenkis developed another classification tool, the support vector machine. Vapnik refined this classification method in the 1990s and extended its uses. Support vector machines have become a great tool for the data scientist and are my choice for this exercise.
SVMs try to find a hyperplane that maximizes the margin between two classes. The margin is the distance between the hyperplane and the closest points of each class. These closest points are called support vectors, because they are the ones that control the hyperplane.
There is also a short introduction to SVMs in the following article:
Understanding Support Vector Machine algorithm from examples
To start explaining from scratch, a for-dummies example helps.
Let's say you are about to start a business that sells t-shirts, but you are unsure what the best measurements are for a medium-sized men's shirt. Luckily, you have gathered a group of men who have all stated that they tend to buy medium-sized t-shirts. Now you figure you are going to use a gradient-descent-type method to get the size just right.
A gradient is the rate at which a variable quantity increases or decreases; the adjustments you make in response to feedback follow this variation, and the feedback here is the error function. More formally, the gradient of a scalar function is the vector that points in the direction of the function's highest rate of change, and its modulus equals the rate of change along that direction.
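As a toy illustration of this idea (not part of the t-shirt example), take the one-dimensional error function $E(w) = (w - 5)^2$: its gradient is $2(w - 5)$, and repeatedly stepping against the gradient moves $w$ toward the minimum at $w = 5$.

w, alpha = 0.0, 0.1            # start far from the optimum, small learning rate
for _ in range(50):
    grad = 2 * (w - 5)         # gradient of E(w) = (w - 5)^2
    w -= alpha * grad          # step against the gradient (the "feedback")
print(round(w, 3))             # approaches 5.0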
In both batch gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize the error function.
In GD, you have to run through all the samples in your training set to perform a single parameter update in a given iteration; in SGD, on the other hand, you use only one training sample from your training set to perform that update.
Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the values of the parameters, you are running through the complete training set. Using SGD, on the other hand, will be faster, because you use only one training sample and the model starts improving right away, from the very first sample.
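To make this difference concrete, here is a minimal sketch (on a made-up one-parameter least-squares problem, not the SVM we build later) of one batch gradient step next to one stochastic step:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000)                 # synthetic inputs
y = 2 * X + 0.1 * rng.randn(1000)  # targets: roughly y = 2x plus noise

w, alpha = 0.0, 0.1                # single parameter and learning rate

# batch GD: the gradient of the squared error uses ALL samples
grad_batch = np.mean(2 * (w * X - y) * X)
w_batch = w - alpha * grad_batch

# SGD: the gradient uses only ONE randomly chosen sample
i = rng.randint(len(X))
grad_sgd = 2 * (w * X[i] - y[i]) * X[i]
w_sgd = w - alpha * grad_sgd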
This difference is what makes the process stochastic: instead of describing the system through deterministic equations, where a given initial condition determines the entire evolution of the system, we use a stochastic process, for which a given initial condition still allows several possible trajectories for the system's evolution.
SGD minimizes the cost function by following its gradients.
Here $c$ is the loss function, $x$ the sample, $y$ the true label, and $f(x)$ the predicted label. The hinge loss is defined as:

$$ c(x, y, f(x)) = \begin{cases} 0, & \text{if } y \cdot f(x) \geq 1\\ 1 - y \cdot f(x), & \text{else} \end{cases} $$
So consider the case where $y$ and $f(x)$ are signed values in $\{+1, -1\}$: if they agree (the sample is classified correctly), then $y \cdot f(x) = 1$ and the loss is $0$; if they disagree (a misclassification), then $y \cdot f(x) = -1$ and the loss is $1 - y \cdot f(x) = 2$.
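As a quick sanity check of the formula, here is a minimal hinge-loss function (the same one the model below will use) evaluated on a few hypothetical predictions:

def hinge_loss(y, fx):
    """Hinge loss for a true label y in {-1, +1} and a raw prediction f(x)."""
    return max(0, 1 - y * fx)

print(hinge_loss(1.0, 2.5))    # 0.0 -> correct side, outside the margin
print(hinge_loss(1.0, 0.3))    # 0.7 -> correct side, but inside the margin
print(hinge_loss(-1.0, 0.3))   # 1.3 -> wrong side of the hyperplane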
The objective of the (soft-margin) SVM consists of two terms:

$$ \min_{w} \; \lambda \lVert w \rVert^2 + \sum_{i=1}^{n} \big(1 - y_i \langle x_i, w \rangle\big)_+ $$

The first term is a regularizer, the second term is the loss. The regularization parameter $\lambda$ determines the trade-off between increasing the margin size and ensuring that each $x_i$ lies on the correct side of the margin; in other words, it balances margin maximization against the loss.
To minimize this function, we need its gradient. As the objective has two terms, we will differentiate them separately, using the sum rule.
$$ \frac{\partial}{\partial w_k} \lambda \lVert w \rVert^2 = 2 \lambda w_k $$

$$ \frac{\partial}{\partial w_k} \big(1 - y_i \langle x_i, w \rangle\big)_+ = \begin{cases} 0, & \text{if } y_i \langle x_i, w \rangle \geq 1\\ -y_i x_{ik}, & \text{else} \end{cases} $$

This means that if we have a misclassified sample $x_i$, i.e. $y_i \langle x_i, w \rangle < 1$, we update the weight vector $w$ using the gradients of both terms; if $y_i \langle x_i, w \rangle \geq 1$, we update $w$ using only the gradient of the regularizer. To sum it up, one stochastic gradient descent step for the SVM looks like this:
if $y_i⟨x_i,w⟩ < 1$: $$ w = w + \alpha (y_ix_i - 2\lambda w) $$ else: $$ w = w + \alpha (-2\lambda w) $$
In Python, we can write it like this:
In [37]:
class SVM():
    def __init__(self, lmbd, D):
        self.lmbd = lmbd            # regularization parameter lambda
        self.D = D + 1              # number of features plus a bias term
        self.w = [0.] * self.D      # weight vector, initialized to zero

    def sign(self, x):
        return -1. if x <= 0 else 1.

    def hinge_loss(self, target, y):
        return max(0, 1 - target * y)

    def predict(self, x):
        # raw score <w, x>; x is expected to carry a trailing bias input of 1
        return sum(self.w[i] * x[i] for i in range(len(x)))

    def train(self, x, y, alpha):
        if y * self.predict(x) < 1:
            # misclassified or inside the margin: gradients of loss and regularizer
            for i in range(len(x)):
                self.w[i] = self.w[i] + alpha * ((y * x[i]) + (-2 * self.lmbd * self.w[i]))
        else:
            # correctly classified outside the margin: only the regularizer contributes
            for i in range(len(x)):
                self.w[i] = self.w[i] + alpha * (-2 * self.lmbd * self.w[i])
This article will use the Skin Segmentation Data Set, which can be downloaded from the UCI repository. The data comes from a classification task in which the RGB values of random points are extracted from pictures of people's faces, and the goal is to determine, from these values, whether a point corresponds to skin or to another part of the image.
According to the researchers, images of people of various ages, races and genders were collected. A practical application of this task would be to identify images with content inappropriate for minors on the internet.
The dataset has dimensions 245,057 × 4, where the first three columns are the B, G and R values (features x1, x2 and x3) and the fourth column is the binary class label.
In a real online environment, one way to evaluate the performance of the model would be to compute the loss of the prediction for each new example before training on it, and to average these losses over a time window. This method is called progressive validation.
In our case we are only simulating an online environment, so, to evaluate the model, I will use a separate test set with 20% of the data.
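For reference, a progressive-validation loop would look roughly like the sketch below; the `model` object with its `predict` and `train` methods is a placeholder, assumed to behave like the SVM class defined later in this article:

def progressive_validation(model, stream, alpha=0.01, window=1000):
    """Score each example before training on it and report the hinge loss
    averaged over a sliding window of the most recent examples."""
    recent = []
    for t, (x, y) in enumerate(stream):
        pred = model.predict(x)                 # evaluate first...
        recent.append(max(0, 1 - y * pred))     # hinge loss of that prediction
        if len(recent) > window:
            recent.pop(0)
        model.train(x, y, alpha)                # ...then learn from the example
        if (t + 1) % window == 0:
            print(t + 1, sum(recent) / len(recent))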
The files train.csv and test.csv each contain approximately 20% positive examples, so in addition to the loss, I want to evaluate the number of correct answers and errors in each class (true positives and true negatives).
The training set has 196,129 examples, and the test set, 48,928. We will use each example of the training set only once, and then evaluate the performance of the model in the test set.
Since we are not going to use this test set to tune hyperparameters, it is a legitimate estimate of the model's performance on new examples. If you do use it to validate parameters, you will need another test set.
In [67]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from random import shuffle, seed
In [3]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00229/Skin_NonSkin.txt"
data = pd.read_csv(url, header = None, sep='\t')
np.random.seed(1)
msk = np.random.rand(len(data)) < 0.8
data[msk].to_csv('svm-gradient-descent/train.csv', header=False, index=False)
data[~msk].to_csv('svm-gradient-descent/test.csv', header=False, index=False)
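As a quick sanity check (assuming the two CSV files were written as above), we can verify that both splits keep roughly the same proportion of positive examples, which the next steps rely on:

train = pd.read_csv('svm-gradient-descent/train.csv', header=None)
test = pd.read_csv('svm-gradient-descent/test.csv', header=None)
# column 3 holds the class label: 1 for the positive (skin) class
print(len(train), (train[3] == 1).mean())
print(len(test), (test[3] == 1).mean())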
With the data split, we can start building our model!
In [163]:
class SVM():
    """Implementation of a linear SVM trained with SGD."""

    def __init__(self, lmbd, D):
        self.lmbd = lmbd            # regularization parameter lambda
        self.D = D + 1              # number of features plus a bias term
        self.w = [0.] * self.D      # weight vector, initialized to zero

    def sign(self, x):
        return -1. if x <= 0 else 1.

    def hinge_loss(self, target, y):
        return max(0, 1 - target * y)

    def data(self, test=False):
        """Yield (index, features + bias, target) tuples from the train or test file."""
        if test:
            with open('svm-gradient-descent/test.csv', 'r') as f:
                samples = f.readlines()
            for t, row in enumerate(samples):
                row = row.replace('\n', '')
                row = row.split(',')
                target = -1.
                if row[3] == '1':
                    target = 1.
                del row[3]
                x = [float(c) for c in row] + [1.]  # inputs + bias
                yield t, x, target
        else:
            with open('svm-gradient-descent/train.csv', 'r') as f:
                samples = f.readlines()
            shuffle(samples)
            for t, row in enumerate(samples):
                row = row.replace('\n', '')
                row = row.split(',')
                target = -1.
                if row[3] == '1':
                    target = 1.
                del row[3]
                x = [float(c) for c in row] + [1.]  # inputs + bias
                yield t, x, target

    def train(self, x, y, alpha):
        if y * self.predict(x) < 1:
            # misclassified or inside the margin: gradients of loss and regularizer
            for i in range(len(x)):
                self.w[i] = self.w[i] + alpha * ((y * x[i]) + (-2 * self.lmbd * self.w[i]))
        else:
            # correctly classified outside the margin: only the regularizer contributes
            for i in range(len(x)):
                self.w[i] = self.w[i] + alpha * (-2 * self.lmbd * self.w[i])
        return self.w

    def predict(self, x):
        # raw score <w, x>; x already carries the bias input
        wTx = 0.
        for i in range(len(x)):
            wTx += self.w[i] * x[i]
        return wTx

    def fit(self):
        tn = 0.
        tp = 0.
        total_positive = 0.
        total_negative = 0.
        accuracy = 0.
        loss = 0.
        last = 0
        # single pass over the shuffled training set, skipping samples whose
        # label repeats the previous one so that the classes alternate
        for t, x, target in self.data(test=False):
            if target == last:
                continue
            alpha = 1. / (self.lmbd * (t + 1.))  # decaying learning rate
            w = self.train(x, target, alpha)
            last = target
        # evaluate on the held-out test set
        for t, x, target in self.data(test=True):
            pred = self.predict(x)
            loss += self.hinge_loss(target, pred)
            pred = self.sign(pred)
            if target == 1:
                total_positive += 1.
            else:
                total_negative += 1.
            if pred == target:
                accuracy += 1.
                if pred == 1:
                    tp += 1.
                else:
                    tn += 1.
        loss = loss / (total_positive + total_negative)
        acc = accuracy / (total_positive + total_negative)
        # print('Loss', loss, '\nTrue Negatives', tn/total_negative * 100, '%',
        #       '\nTrue Positives', tp/total_positive * 100, '%',
        #       '\nPrecision', accuracy/(total_positive+total_negative) * 100, '%', '\n')
        return loss, acc, tp / total_positive, tn / total_negative, w
In [167]:
loss_list = []
acc_list = []
tp_list = []
tn_list = []
w_list = []
for i in range(100):
    # print('\nSeed', i)
    seed(i)              # seed the random module so the shuffle in data() is reproducible
    svm = SVM(1, 3)      # lambda = 1, three features (B, G, R)
    l, acc, tp, tn, w = svm.fit()
    loss_list.append(l)
    acc_list.append(acc)
    tp_list.append(tp)
    tn_list.append(tn)
    w_list.append(w)
print('Loss', sum(loss_list) / len(loss_list))
print('Accuracy', sum(acc_list) / len(acc_list) * 100, '%')
print('True Positives', sum(tp_list) / len(tp_list) * 100, '%')
print('True Negatives', sum(tn_list) / len(tn_list) * 100, '%')
In [168]:
print('Loss', loss_list[-1])
print('Accuracy', acc_list[-1] * 100, '%')
print('True Positives', tp_list[-1] * 100, '%')
print('True Negatives', tn_list[-1] * 100, '%')
Weight vector
In [169]:
w_list[-1]
Out[169]:
The weight vector of the SVM, including the bias term, from the last run (a single pass over the training set) is $(-0.012, -0.0144, 0.024, -0.029)$. We can now extract the following prediction function:
$$ f(x) = \langle x,(-0.012,-0.0144,0.024)\rangle - 0.029 $$
In [176]:
with open('svm-gradient-descent/train.csv', 'r') as f:
    samples = f.readlines()
shuffle(samples)
samples[0:10]
Out[176]:
First sample $(44, 45, 11)$, supposed to be $-1$:

$$ \operatorname{sign}\big((-0.012 \cdot 44) + (-0.0144 \cdot 45) + (0.0242 \cdot 11) - 0.029\big) = \operatorname{sign}(-0.9388) = -1 $$

Second sample $(157, 159, 129)$, supposed to be $-1$:

$$ \operatorname{sign}\big((-0.012 \cdot 157) + (-0.0144 \cdot 159) + (0.0242 \cdot 129) - 0.029\big) = \operatorname{sign}(-1.0808) = -1 $$

Third sample $(47, 89, 142)$, supposed to be $1$:

$$ \operatorname{sign}\big((-0.012 \cdot 47) + (-0.0144 \cdot 89) + (0.0242 \cdot 142) - 0.029\big) = \operatorname{sign}(1.5618) = 1 $$

All three example samples are classified correctly.
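The same check can be done in a couple of lines of NumPy, assuming the rounded weight vector quoted above:

w = np.array([-0.012, -0.0144, 0.0242, -0.029])    # weights plus bias from the run above
samples = np.array([[44, 45, 11, 1],               # the trailing 1 is the bias input
                    [157, 159, 129, 1],
                    [47, 89, 142, 1]])
print(np.sign(samples @ w))                        # expected: [-1. -1.  1.]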