Logistic Regression is a basic learning algorithm for classification problems. It is simple yet elegant in its form. This notebook covers the algorithm behind Logistic Regression and builds it in Python from scratch.

We start with binary classification.

1) Let's first define the notation:

$m$ - Number of training examples

$n$ - Number of features

$X$ - Features

$y$ - Target (discrete)

$(X^{(i)}, y^{(i)})$ - the $i^{th}$ training example

$\theta = (\theta_0,\theta_1,\theta_2,...,\theta_n)^T$ - model parameters, with $\theta_0$ as the intercept

2) The hypothesis of Logistic Regression is:

$$h_\theta(X) = g(\theta^TX)$$
$$\text{where}\;\; g(z) = \frac{1}{1+e^{-z}}$$

Here $g(z)$ is called the Sigmoid Function. We will also use it later as the activation function of neurons in Neural Networks.
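
As a quick sanity check (a minimal sketch, separate from the model class below), we can evaluate the sigmoid at a few points and confirm that it maps large negative inputs toward 0, zero to 0.5, and large positive inputs toward 1:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.round(sigmoid(z), 4))   # approximately [0. 0.2689 0.5 0.7311 1.]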

3) The log-loss is then used to define the cost function of Logistic Regression, with an L2 regularization term:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log\big(h_\theta(X^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(X^{(i)})\big)\Big)+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

4) To minimize the cost function with respect to the parameters, we calculate its gradient. For $j \geq 1$ (the intercept $\theta_0$ is not regularized):

$$\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta(X^{(i)})-y^{(i)}\Big)X_j^{(i)}+\frac{\lambda}{m}\theta_j$$
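
For completeness, here is a short sketch of where this gradient comes from. Using the chain rule together with the identity $g'(z)=g(z)\big(1-g(z)\big)$, for a single example we have

$$\frac{\partial}{\partial\theta_j}\log h_\theta(X^{(i)})=\big(1-h_\theta(X^{(i)})\big)X_j^{(i)},\qquad \frac{\partial}{\partial\theta_j}\log\big(1-h_\theta(X^{(i)})\big)=-h_\theta(X^{(i)})X_j^{(i)}$$

so the derivative of the $i^{th}$ loss term is

$$-\frac{\partial}{\partial\theta_j}\Big(y^{(i)}\log h_\theta(X^{(i)})+(1-y^{(i)})\log\big(1-h_\theta(X^{(i)})\big)\Big)=\big(h_\theta(X^{(i)})-y^{(i)}\big)X_j^{(i)}$$

Averaging over the $m$ examples and adding the derivative of the regularization term, $\frac{\lambda}{m}\theta_j$, gives the expression above. For the intercept ($j=0$) the regularization term is simply dropped.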

The following class BinaryLogisticRegression() builds the binary classification model:


In [1]:
# Binary Logistic Regression

import numpy as np
import scipy.optimize

class BinaryLogisticRegression():

    def __init__(self, lamda):

        self._lamda = lamda

    # Define class variables
    _mu = None
    _sigma = None
    _coef = None

    def _feature_norm(self, X):

        # Normalize all features to expedite the gradient descent process
        mu = np.mean(X, axis=0)
        sigma = np.std(X, axis=0)
        X_norm = (X - mu) / sigma

        return X_norm, mu, sigma

    def _sigmoid(self, z):

        # Formulate sigmoid function
        return 1.0 / (1.0 + np.exp(-1.0 * z))

    def _cost_calc(self, theta, X, y):

        # Formulate cost function
        m, n = X.shape
        y = y.reshape((m, 1))
        theta = theta.reshape((n, 1))
        z = X.dot(theta)
        h_x = self._sigmoid(z)
        J = -1.0 / m * (y.T.dot(np.log(h_x)) + (1 - y).T.dot(np.log(1 - h_x))) \
            + self._lamda / (2.0 * m) * sum(theta[1:]**2)

        return J.ravel()

    def _gradient_calc(self, theta, X, y):

        # Formulate the gradient of the cost function
        m, n = X.shape
        y = y.reshape((m, 1))
        theta = theta.reshape((n, 1))
        z = X.dot(theta)
        h_x = self._sigmoid(z)
        grad = np.zeros((n, 1))
        grad[0] = 1.0 / m * sum(h_x - y)
        grad[1:] = 1.0 / m * X[:, 1:].T.dot(h_x - y) + float(self._lamda) / m * theta[1:]

        return grad.ravel()

    def fit(self, X, y):

        # Fit the model
        m, n = X.shape
        X, self._mu, self._sigma = self._feature_norm(X)
        X = np.c_[np.ones((m, 1)), X]
        theta = np.zeros(X.shape[1])
        result = scipy.optimize.minimize(fun=self._cost_calc, x0=theta, args=(X, y),
                                         method='BFGS', jac=self._gradient_calc,
                                         options={"maxiter": 100, "disp": False})

        self._coef = result.x

        return self

    def predict_prob(self, X):

        # predict probabilities with the fitted model
        m, n = X.shape
        X = np.c_[np.ones((m, 1)), (X - self._mu) / self._sigma]
        y_prob = self._sigmoid(X.dot(self._coef.reshape((n+1, 1))))

        return y_prob.ravel()

    def predict(self, X):

        # predict with the fitted model
        p = self.predict_prob(X)
        y_predict = np.copy(p)
        y_predict[p > 0.5] = 1
        y_predict[p <= 0.5] = 0

        return y_predict.ravel()
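
Before moving on, here is a quick smoke test of the class above on hypothetical synthetic data (the two Gaussian blobs, the seed, and the variable names are illustrative assumptions, not part of the original notebook):

# Hypothetical smoke test: two Gaussian blobs that are roughly linearly separable
np.random.seed(0)
X_pos = np.random.randn(50, 2) + 1.5   # positive class centered near (1.5, 1.5)
X_neg = np.random.randn(50, 2) - 1.5   # negative class centered near (-1.5, -1.5)
X_demo = np.vstack([X_pos, X_neg])
y_demo = np.array([1] * 50 + [0] * 50)

blr = BinaryLogisticRegression(lamda=1)
blr.fit(X_demo, y_demo)
print(np.mean(blr.predict(X_demo) == y_demo))   # training accuracy, should be close to 1.0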

Now let's focus on multi-class classification.

We can use the "one-versus-all" scheme to achieve this. Specifically, for each class, we code it as the positive class (1) while the rest of the classes are coded as the negative class (0). We then apply Binary Logistic Regression and predict the probability of being positive for each class against all the others. The class with the highest probability of being positive is chosen as the final prediction of the classification task.

The following class MultiClassLogisticRegression() implements this idea.


In [2]:
class MultiClassLogisticRegression():

    def __init__(self, lamda):

        self._lamda = lamda

    # Define class variables
    _values = None
    _n_samples = None
    _n_classes = None
    _lr_list = []

    def fit(self, X, y):

        # fit the model with the one-versus-all scheme
        self._values = np.unique(y)
        self._n_samples = X.shape[0]
        self._n_classes = len(self._values)
        self._lr_list = []    # reset so repeated fits or multiple instances don't share classifiers
        y_copy = np.array(y)

        for value in self._values:
            idx = (y == value)
            y_copy[idx] = 1
            y_copy[~idx] = 0
            lr = BinaryLogisticRegression(lamda=self._lamda)
            self._lr_list.append(lr.fit(X, y_copy))

        return self

    def predict_prob(self, X):

        # predict the probabilities for each class
        y_prob_ini = np.empty((X.shape[0], self._n_classes))
        for k, lr in enumerate(self._lr_list):
            y_prob_ini[:, k] = lr.predict_prob(X)

        row_sums = y_prob_ini.sum(axis=1)
        y_prob = y_prob_ini / row_sums[:, np.newaxis]
        return y_prob

    def predict(self, X):

        # predict with one_versus_all scheme
        p = self.predict_prob(X)
        y_predict_idx = np.argmax(p, axis=1)
        y_predict = np.array([self._values[i] for i in y_predict_idx])
        return y_predict

It's time to test the learning tool we just built on a real data set.

We load the Iris data set from the scikit-learn module. It has 4 features and 3 classes.


In [3]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris['data']
y = iris['target']

print X.shape
print y.shape
print "Number of Classes: {}".format(len(np.unique(y)))


(150L, 4L)
(150L,)
Number of Classes: 3

We can split the original data into training and test sets.


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=26)

This is a multi-class problem, so we use the class MultiClassLogisticRegression() defined above to train the model and predict the results for the test examples.


In [5]:
lr = MultiClassLogisticRegression(lamda=1)
lr = lr.fit(X_train, y_train)
y_predict = lr.predict(X_test)

print "True Values:      {}".format(y_test)
print "Predicted Values: {}".format(y_predict)
print "Prediction Accuracy: {:.2%}".format(np.mean((y_predict == y_test).astype(float)))


True Values:      [1 1 0 0 2 1 0 2 2 2 0 1 1 0 1 0 2 1 2 0 0 2 2 2 2 2 1 0 0 2]
Predicted Values: [1 1 0 0 2 1 0 2 2 2 0 1 1 0 1 0 2 1 2 0 0 2 2 2 2 1 1 0 0 2]
Prediction Accuracy: 96.67%

Other things to consider:

  • Elastic Net (L1 & L2 Regularization)
  • If the class distribution is skewed, re-sampling or cost-sensitive learning should be applied
  • Evaluation metrics go beyond prediction accuracy: we should also look at precision, recall, F-score, and the ROC curve to get a comprehensive picture of how the classifier performs (see the sketch below)
  • Logistic Regression is considered a linear model. It can be quite useful as a second-level learner in model stacking
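
As a minimal sketch of the evaluation-metrics point above (assuming y_test and y_predict from the previous cell are still in scope), scikit-learn's metrics utilities give per-class precision, recall, and F-score directly:

# Per-class precision, recall, F-score and the confusion matrix for the predictions above
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))

An ROC curve would additionally need the per-class probabilities from predict_prob() rather than the hard class labels.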