Logistic Regression

Logistic regression is a tool that we can use to solve classification problems with an arbitrary number of classes. To keep the equations simple, we focus on binary classification problems.

Solving classification problems with linear regression models has some issues:

  • for binary classification, the resulting estimates may lie outside the 0 to 1 interval because the model's output is continuously distributed. This is a problem because we want to interpret the output as the probability that a given data instance belongs to a particular category.
  • not useful if the number of classes exceeds 2. If we were to simply code the response as $1,2,\cdots,k$ for $k$ classes, the ordering would be arbitrary. We could instead use a binary coding, perform $k$ separate regressions, and use the strongest prediction for a given input $X$. This is flawed too, however, because we would likely run into a problem called masking: some class $j$ never ends up being predicted in favor of the others, regardless of the input.

Figure 4.2 [#] illustrates the difference between the linear function (left) and the logistic function (right). Since our data points can only take one of two values on the $y$-axis, a linear regression model is not a good fit: it dips below zero. The logistic function is a better fit because it never estimates outside the $[0, 1]$ interval.
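
To make the contrast concrete, the sketch below fits both models to synthetic binary data and plots them side by side. The data, random seed, and "true" coefficients are made up for illustration, and scikit-learn is assumed to be available; this is not part of the original example.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic binary data: class 1 becomes more likely as x grows
rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, 200)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # assumed "true" coefficients
y = rng.binomial(1, p_true)

grid = np.linspace(-4, 4, 300).reshape(-1, 1)

# Left: straight-line fit, which can fall below 0 or above 1
lin = LinearRegression().fit(x.reshape(-1, 1), y)
# Right: logistic fit, whose predicted probabilities stay in [0, 1]
# (note that scikit-learn's LogisticRegression applies a little L2 regularisation by default)
logreg = LogisticRegression().fit(x.reshape(-1, 1), y)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(x, y, s=10, alpha=0.4)
axes[0].plot(grid.ravel(), lin.predict(grid), color='C1')
axes[0].set_title('Linear regression')
axes[1].scatter(x, y, s=10, alpha=0.4)
axes[1].plot(grid.ravel(), logreg.predict_proba(grid)[:, 1], color='C1')
axes[1].set_title('Logistic regression')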

Logistic regression models the probability that $y$ is class 1 given $x$, instead of modelling $y$ as a function of $x$ directly as in linear regression. In other words, just like a linear regression model, a logistic regression model first computes a weighted sum of the input features (plus the intercept), but instead of outputting this sum directly, it outputs the logistic of the result. The logistic model is:

\begin{equation} p(x) = P( y = 1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} = \frac{1}{1 + \exp(-\beta_0 - \beta_1 x)} \end{equation}

We can rewrite the equation above as:

\begin{equation} \frac{P( y = 1 \mid x)}{1 - P( y = 1 \mid x)} = \exp(\beta_0 + \beta_1 x) \end{equation}

The left-hand side of this equality is called the odds. Taking the logarithm of both sides gives the log-odds or logit:

\begin{equation} \log \left( \frac{P( y = 1 \mid x)}{1 - P( y = 1 \mid x)} \right) = \beta_0 + \beta_1 x \end{equation}

In general, $\operatorname{logit}(a) = \log(a/(1 - a))$.
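
As a quick numerical check of these identities, the snippet below uses arbitrary illustrative values $\beta_0 = -1$ and $\beta_1 = 2$ and confirms that the logit of $p(x)$ recovers the linear predictor $\beta_0 + \beta_1 x$.

In [ ]:
import numpy as np

beta0, beta1 = -1.0, 2.0                     # arbitrary illustrative coefficients
x = np.array([-2.0, 0.0, 0.5, 3.0])

linear_predictor = beta0 + beta1 * x
p = 1 / (1 + np.exp(-linear_predictor))      # P(y = 1 | x)
odds = p / (1 - p)
logit = np.log(odds)

# the logit equals the linear predictor (up to floating-point rounding)
print(np.allclose(logit, linear_predictor))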

If we have $p$ predictors $x_1, x_2, \cdots, x_p$, we can collect them in a vector $X = (1, x_1, x_2, \cdots, x_p)$. Similarly, the corresponding coefficients can be put in a vector $\beta = (\beta_0, \beta_1, \beta_2, \cdots, \beta_p)$. Now we can construct a linear transformation:

\begin{equation} X^T \beta = 1 \cdot \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \end{equation}

We can write the logistic function as:

\begin{equation} P( y = 1 \mid X) = \frac{\exp(X^T \beta)}{1 + \exp(X^T \beta)} \end{equation}
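
A minimal sketch of this vectorised form, using made-up coefficients and a small design matrix whose leading column of ones corresponds to the intercept:

In [ ]:
import numpy as np

# Made-up coefficients (beta_0, beta_1, beta_2, beta_3)
beta = np.array([0.5, -1.0, 2.0, 0.25])

# Design matrix: one row per observation, first column of ones for the intercept
X = np.array([[1.0,  0.2, -1.3,  4.0],
              [1.0,  1.5,  0.7, -2.0],
              [1.0, -0.4,  0.0,  1.1]])

linear_predictor = X @ beta                  # X^T beta for each observation
prob = 1 / (1 + np.exp(-linear_predictor))   # P(y = 1 | X), always in (0, 1)
prob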

The coefficients $\beta_0, \beta_1, \cdots, \beta_p$ can be learned using the maximum likelihood method.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The logistic function is characterised by an S-shaped (sigmoid) curve:


In [2]:
X = np.linspace(-10, 10, 100)
y = 1 / (1 + np.exp(-X))   # logistic (sigmoid) function
plt.plot(X, y)


Out[2]:
[<matplotlib.lines.Line2D at 0x17a41f6deb8>]
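
As noted above, the coefficients are estimated by maximum likelihood. One way to sketch this is to minimise the negative log-likelihood by gradient descent; the synthetic data, "true" coefficients, learning rate, and iteration count below are illustrative assumptions rather than part of the original notebook.

In [ ]:
import numpy as np

# Synthetic data generated from assumed "true" coefficients (illustration only)
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # add intercept column
beta_true = np.array([-0.5, 1.5])
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

# Maximum likelihood via gradient descent on the negative log-likelihood;
# the gradient of the (averaged) NLL with respect to beta is X^T (p - y) / n
beta_hat = np.zeros(2)
learning_rate = 0.5
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta_hat))
    gradient = X.T @ (p - y) / n
    beta_hat -= learning_rate * gradient

beta_hat   # should be close to beta_true, up to sampling noise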

In [ ]: