Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression.

1 Introduction

The advantages of Stochastic Gradient Descent are:

  • Efficiency
  • Ease of implementation

The disadvantages of SGD are:

  • SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations (a minimal example follows this list)
  • SGD is sensitive to feature scaling
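
A minimal sketch of how those hyperparameters are passed to SGDClassifier; the values below are illustrative rather than tuned, and the iteration parameter is named n_iter in the scikit-learn version used in this notebook (newer releases call it max_iter):

In [ ]:
from sklearn.linear_model import SGDClassifier

# alpha controls the strength of the regularization term,
# n_iter the number of passes over the training data
# (illustrative values only, not tuned)
clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, n_iter=10)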

2 Classification


In [10]:
from sklearn.linear_model import SGDClassifier
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = SGDClassifier(loss='hinge', penalty='l2')
clf.fit(X, y)


Out[10]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

In [2]:
clf.predict([[2., 2.]])


Out[2]:
array([1])

In [3]:
clf.coef_


Out[3]:
array([[ 9.91080278,  9.91080278]])

In [4]:
#To get the signed distance to the hyperplane
clf.decision_function([[2., 2.]])


Out[4]:
array([ 29.67316119])

The concrete loss function can be set via the loss parameter:

  • loss = 'hinge': soft-margin linear Support Vector Machine
  • loss = 'modified_huber': smooth hinge loss
  • loss = 'log' : logistic regression

Using loss='log' or loss='modified_huber' enables the predict_proba method, which gives a vector of probability estimates $P(y \mid x)$ per sample $x$:


In [8]:
clf = SGDClassifier(loss='log').fit(X, y)
clf.predict_proba([[1., 1.]])


Out[8]:
array([[  4.97248476e-07,   9.99999503e-01]])

3 Tips

Scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. The same scaling must be applied to the test vector to obtain meaningful results.


In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # fit the scaler on the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)  # apply the same scaling to the test data
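
As a convenience (not shown in the original cell), the scaler and the classifier can be chained so that the same scaling learned from the training set is reused automatically at prediction time; a minimal sketch using scikit-learn's make_pipeline, assuming X_train, y_train, and X_test are available:

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# standardize the features, then train the SGD classifier;
# predict() on new data applies the scaling fitted on the training set
pipe = make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', penalty='l2'))
pipe.fit(X_train, y_train)
pipe.predict(X_test)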

4 Mathematical formulations

Given a set of training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in \mathbf{R}^m$ and $y_i \in \{-1, 1\}$, our goal is to learn a linear scoring function $f(x) = w^T x + b$. The regularized training error is given by $$E(w, b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)$$ where $L$ is a loss function measuring model fit, $R$ is a regularization term penalizing model complexity, and $\alpha > 0$ is a non-negative hyperparameter controlling the regularization strength.
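
For reference, the standard definitions behind the options used above (stated here as a sketch, not taken from the notebook): loss='hinge' and loss='log' correspond to

$$L(y_i, f(x_i)) = \max(0,\, 1 - y_i f(x_i)) \quad \text{(hinge loss)}, \qquad L(y_i, f(x_i)) = \log\!\left(1 + e^{-y_i f(x_i)}\right) \quad \text{(log loss)},$$

while penalty='l2' and penalty='l1' correspond to

$$R(w) = \frac{1}{2}\sum_{j=1}^{m} w_j^2 \quad \text{(L2 penalty)}, \qquad R(w) = \sum_{j=1}^{m} \lvert w_j \rvert \quad \text{(L1 penalty)}.$$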