Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression.

1 Introduction

The advantages of Stochastic Gradient Descent are:

  • Efficiency
  • Ease of implementation

The disadvantages of SGD are:

  • SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations (a minimal example follows this list)
  • SGD is sensitive to feature scaling
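
A minimal sketch of how those hyperparameters are passed to SGDClassifier; the values below are illustrative rather than tuned, and the iteration parameter is named n_iter in the scikit-learn version used in this notebook (newer releases call it max_iter):

In [ ]:
from sklearn.linear_model import SGDClassifier

# alpha controls the strength of the regularization term,
# n_iter the number of passes over the training data
# (illustrative values only, not tuned)
clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, n_iter=10)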

2 Classification


In [10]:
from sklearn.linear_model import SGDClassifier
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = SGDClassifier(loss='hinge', penalty='l2')
clf.fit(X, y)


Out[10]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

In [2]:
clf.predict([[2., 2.]])


Out[2]:
array([1])

In [3]:
clf.coef_


Out[3]:
array([[ 9.91080278,  9.91080278]])

In [4]:
#To get the signed distance to the hyperplane
clf.decision_function([[2., 2.]])


Out[4]:
array([ 29.67316119])

The concrete loss function can be set via the loss parameter:

  • loss = 'hinge': soft-margin linear Support Vector Machine
  • loss = 'modified_huber': smooth hinge loss
  • loss = 'log' : logistic regression

Using loss='log' or loss='modified_huber' enables the predict_proba method, which gives a vector of probability estimates $P(y \mid x)$ per sample $x$:


In [8]:
clf = SGDClassifier(loss='log').fit(X, y)
clf.predict_proba([[1., 1.]])


Out[8]:
array([[  4.97248476e-07,   9.99999503e-01]])

3 Tips

Scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. The same scaling must be applied to the test vector to obtain meaningful results.


In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # fit the scaler on the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)  # apply the same scaling to the test data
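
As a convenience (not shown in the original cell), the scaler and the classifier can be chained so that the same scaling learned from the training set is reused automatically at prediction time; a minimal sketch using scikit-learn's make_pipeline, assuming X_train, y_train, and X_test are available:

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# standardize the features, then train the SGD classifier;
# predict() on new data applies the scaling fitted on the training set
pipe = make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', penalty='l2'))
pipe.fit(X_train, y_train)
pipe.predict(X_test)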

4 Mathematical formulations

Given a set of training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in \mathbf{R}^m$ and $y_i \in \{-1, 1\}$, our goal is to learn a linear scoring function $f(x) = w^T x + b$. The regularized training error is given by $$E(w, b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)$$ where $L$ is a loss function measuring model fit, $R$ is a regularization term penalizing model complexity, and $\alpha > 0$ is a non-negative hyperparameter controlling the regularization strength.
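
For reference, the standard definitions behind the options used above (stated here as a sketch, not taken from the notebook): loss='hinge' and loss='log' correspond to

$$L(y_i, f(x_i)) = \max(0,\, 1 - y_i f(x_i)) \quad \text{(hinge loss)}, \qquad L(y_i, f(x_i)) = \log\!\left(1 + e^{-y_i f(x_i)}\right) \quad \text{(log loss)},$$

while penalty='l2' and penalty='l1' correspond to

$$R(w) = \frac{1}{2}\sum_{j=1}^{m} w_j^2 \quad \text{(L2 penalty)}, \qquad R(w) = \sum_{j=1}^{m} \lvert w_j \rvert \quad \text{(L1 penalty)}.$$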