In the previous chapter we discussed linear regression and its applications and saw what a valuable tool it is. But it has its drawbacks too. For example, it makes quite strong assumptions about the functional form of $f(X)$ and the error terms. Another disadvantage is that linear regression is not well suited for classification problems, where we deal with qualitative responses.
The figure below, where we regress the probability of default onto credit card balance, visualizes the issue.
Logistic regression is able to overcome these obstacles. Rather than modeling the default class $y$ directly (e.g. 0 for non-default, 1 for default), it models the (conditional) probability that $y$ belongs to a certain category, $\Pr(Y=y|X)$.
In the above figure we display our attempt to estimate the probability of default given one feature, i.e. a client's credit card balance. In the same way we could have used a multiple linear regression (with multiple features as input) to estimate the probability:
$$\begin{equation} \Pr(Y=y|X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \end{equation}$$But as discussed above for the simple case, this too would yield unbounded results, with predicted probabilities above 1 and below 0 for some $X$ (for convenience let us assume that we are using the generic 0/1 coding for the dependent variable). The question thus is: how can we improve on that model such that the results are bounded and can be interpreted as probabilities of default? A well-known function which satisfies the inequality $0 \leq f(x) \leq 1$ (and is continuous and hence differentiable) is the so-called "Sigmoid" or "Logistic" function, which is defined as
$$\begin{equation} S(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x} \end{equation}$$The above figure displays the function's shape: its output is bounded by $[0, 1]$ and it has the notable "S" shape that gives it its name. For logistic regression, the input $x$ takes on the familiar functional form of a linear regression:
$$\begin{equation} \Pr(Y=y|X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}} \end{equation}$$This function is related to the linear model: rearranging the equation, we arrive at
$$\begin{equation*} \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p} \end{equation*}$$Notice that for ease of use we define $p(X)$ as shorthand for $\Pr(Y=y|X)$. Taking the natural logarithm of the above expression yields
$$\begin{equation} \underbrace{\log\left(\frac{p(X)}{1 - p(X)}\right)}_{\textit{Logit}} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p, \end{equation}$$a function whose output on the left-hand side, the so-called logit or log-odds, is linear in $X$.
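Before moving on, a quick numerical sanity check may help. The short sketch below (with made-up coefficient values, purely for illustration) confirms that applying the logistic function to a linear predictor and then taking the log-odds recovers that same linear predictor, i.e. the logit is indeed linear in $X$, while the probabilities themselves stay bounded.
In [ ]:
import numpy as np

# Hypothetical coefficients and a grid of balance-like values (made up for illustration)
beta0, beta1 = -10.0, 0.005
x = np.linspace(0, 3000, 7)

eta = beta0 + beta1 * x                 # linear predictor
p = np.exp(eta) / (1 + np.exp(eta))     # logistic function: bounded between 0 and 1
logit = np.log(p / (1 - p))             # log-odds computed from the probabilities

print(np.allclose(logit, eta))          # True: the logit is linear in x
print(p.min() >= 0, p.max() <= 1)       # True True: probabilities stay bounded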
To run a logistic regression we need to estimate the $\beta$ coefficients in the following equation:
$$\begin{equation} \Pr(Y=y|X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}} \end{equation}$$In the linear regression setting we used a least squares approach. Here, however, we switch to a method called maximum likelihood. This approach is used in many areas of statistics and machine learning to fit models and as such is often studied in graduate courses. Here we restrict ourselves to a brief discussion of the intuition. For the interested reader, Elkan (2014) provides an excellent, concise introduction, whereas Bishop (2006) offers a more advanced and thorough discussion of the topic.
Back to the intuition: maximum likelihood optimizes the above equation to get estimates of $\beta$ "such that the predicted probability $\hat{p}(x_i)$ of default for each individual [$\ldots$] corresponds as closely as possible to the individual's observed default status. In other words, we try to find $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ such that plugging these estimates into the model for $p(X)$ [$\ldots$] yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not." (James et al. (2013, p. 133)).
The likelihood function that is used to estimate the $\beta$ coefficients in the logistic regression model is derived from the probability distribution of the dependent variable $y$. Since $y$ takes on one of two values (in our example $y \in \{0, 1\}$) and assuming the response values are independent of each other, $Y$ follows a Bernoulli distribution, $Y \sim Bern(p)$, which is a special case of the Binomial distribution $Bin(n, p)$ with $n=1$. Its probability mass function is
\begin{equation*} f(y;p) = p(Y=y) = p^y (1-p)^{1-y} = \begin{cases} p & \text{ if } y=1 \\ (1-p) & \text{ if } y=0 \end{cases} \end{equation*}The joint probability mass function of $Y$ is
\begin{equation*} f(y | \beta) = \prod_{i=1}^N p(x_i)^{y_i} \, (1-p(x_i))^{1-y_i} \end{equation*}and describes the values of $y$ as a function of known, fixed values for $\beta$, where $\beta$ is related to $y$ through the logistic function. Since we do not know the coefficients $\beta$ but have measured outcomes for $y$, the likelihood function reverses the above joint probability mass function such that it expresses the values of $\beta$ in terms of the known, fixed values of $y$. This is the likelihood function (Czepiel (2002)).
\begin{equation} L(\beta | y) = \prod_{i=1}^N p(x_i)^{y_i} \, (1-p(x_i))^{1-y_i} \end{equation}The maximum likelihood estimates are those values of $\beta$ which maximize the likelihood function $L(\beta | y)$. Typically, to find the maximum likelihood estimates we would differentiate the (log-)likelihood with respect to the coefficients, set the derivatives equal to zero, and solve. However, there is no closed-form solution for this, and thus numerical methods (such as Newton-Raphson, Newton-conjugate gradient, etc.) are required to find the maximum (Shalizi (2017)).
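To make this a bit more concrete, the following sketch fits a logistic regression on a small synthetic data set by numerically minimizing the negative log-likelihood with scipy.optimize.minimize. This is only an illustration of the principle, using made-up data and a generic BFGS optimizer standing in for the Newton-type routines mentioned above; statsmodels' built-in methods are what we will actually rely on below.
In [ ]:
import numpy as np
from scipy.optimize import minimize

# Synthetic data (made up): one feature, "true" coefficients beta0 = -1, beta1 = 2
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])            # design matrix incl. intercept column
p_true = 1 / (1 + np.exp(-(X @ np.array([-1.0, 2.0]))))
y = rng.binomial(1, p_true)

def neg_log_likelihood(beta):
    """Negative Bernoulli log-likelihood of the logistic model."""
    p = 1 / (1 + np.exp(-(X @ beta)))
    p = np.clip(p, 1e-10, 1 - 1e-10)            # guard against log(0) during the search
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Maximizing the likelihood = minimizing its negative
res = minimize(neg_log_likelihood, x0=np.zeros(2), method='BFGS')
print(res.x)                                    # estimates should lie close to [-1, 2]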
The clip below is an excellent introduction to the details of how the log-likelihood is derived and maximized.
In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('TM1lijyQnaI')
Out[1]:
Thankfully, statistical packages such as Python's statsmodels have the necessary numerical routines built in, so that we do not need to concern ourselves with the details.
To show how logistic regression is run in Python we will again rely on functions from the statsmodels package. The Scikit-learn package sklearn contains similar functions. However, when it comes to accessing results, the statsmodels functions for logistic regression follow those for linear regression, which we got to know in the previous chapter. Therefore we will work with this package.
We will use the generic 'Default' data set from James et al. (2013), which we discussed in this chapter's introduction. It is taken from the book's corresponding R package and made available in this course's data folder on GitHub. We start our journey as usual with the initial load of the necessary packages and setting a few options.
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
plt.style.use('seaborn-whitegrid')
In [3]:
# Default data set is not available online. Data was extracted from R package "ISLR"
df = pd.read_csv('Data/Default.csv', sep=',')
# Factorize 'No' and 'Yes' in columns 'default' and 'student'
df['defaultFac'] = df.default.factorize()[0]
df['studentFac'] = df.student.factorize()[0]
df.head(5)
Out[3]:
Similar to linear regression with statsmodels, there exist different ways to compute a logistic regression. For reference, we will show three approaches here:

1. The statsmodels.formula.api library has a glm function mimicking R's way of running glm regressions
2. The GLM function from the statsmodels.api library
3. statsmodels.api's Logit function, which follows the approach we used for the OLS case

You might ask what GLM/glm means. It stands for Generalized Linear Models and as such is a generalization of linear regression approaches (linear, logistic, Poisson regression). There are many good resources available on the web for the interested reader. A good textbook introducing GLM is Alan Agresti's Foundations of Linear and Generalized Linear Models (Agresti (2015)).
A few remarks on the first, R-style glm approach:

- The formula follows the pattern y ~ X1 + X2 + ... + Xn; in words, we regress $y$ on $X_1, X_2, \ldots, X_n$, where $X_i$ is a vector of a particular feature.
- family is a necessary argument. Remember that above, where we discussed the maximum likelihood approach of estimating the coefficients, we mentioned that the binary response $y$ follows a Bernoulli distribution, which is a special case of a Binomial distribution. Thus we choose family=sm.families.Binomial().
- The .fit() method runs the maximum likelihood estimation of the coefficients. The default method is Newton-Raphson (method='newton') and this fits our needs. Nonetheless, others are available. Use the help function sm.Logit.fit? to see what options you have (it's the same for all three approaches).

More details can be found here.
In [4]:
# 1. R-Style
# R-Style formula
formula = 'defaultFac ~ balance'
# Regress model
logReg = smf.glm(formula=formula, data=df, family=sm.families.Binomial()).fit()
print(logReg.summary())
For the GLM function the relevant arguments are:

- endog = endogenous variable = response = $y$
- exog = exogenous variable(s) = features = $X$; note that an intercept is not included by default and has to be added manually with sm.add_constant(X)

Function details are described here.
In [16]:
# 2. GLM
# Regress model
logReg = sm.GLM(endog=df.defaultFac, exog=sm.add_constant(df.balance),
family=sm.families.Binomial()).fit()
print(logReg.summary())
The Logit function's arguments follow the same pattern:

- endog = endogenous variable = response = $y$
- exog = exogenous variable(s) = features = $X$; again, the intercept has to be added manually with sm.add_constant(X)
- There is no need to specify a family (Logit is per default binomial)

The full function details are to be found here.
In [6]:
# 3. Logit
# Regress model
logReg = sm.Logit(endog=df.defaultFac, exog=sm.add_constant(df.balance)).fit()
print(logReg.summary())
Though the scope of information in the summaries differs slightly between the approaches, the functions yield the same results. As in the case of linear regression, the model's output can be accessed through its different methods and attributes. For example, the model's parameters and p-values are easily accessed as shown below.
In [7]:
logReg.params
Out[7]:
In [8]:
logReg.pvalues
Out[8]:
The $z$-statistic in the above summary and the corresponding p-values play the same role as the $t$-statistic in the linear regression output. For example, the $z$-statistic associated with $\hat{\beta}_1$ is equal to $\hat{\beta}_1 / SE(\hat{\beta}_1)$. You can check this by running logReg.params[1] / logReg.bse[1]
. A large (absolute) $z$-value, and correspondingly a small p-value, provides evidence against the null hypothesis $H_0 : \beta_1 = 0$. Notice that this null hypothesis implies that $p(X) = \dfrac{e^{\beta_0}}{(1 + e^{\beta_0})}$.
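As a quick check, the sketch below computes this ratio manually for all coefficients and compares it with the $z$-values statsmodels stores in the (somewhat confusingly named) tvalues attribute.
In [ ]:
# Manual z-statistic: each coefficient divided by its standard error
z_manual = logReg.params / logReg.bse

# statsmodels reports the same values in the tvalues attribute
print(z_manual)
print(logReg.tvalues)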
The main purpose of the intercept is to adjust the average fitted probabilities to the proportion of ones in the data. Beyond that it is not of interest.
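A quick way to see this is to fit an intercept-only model and back-transform the estimated $\hat{\beta}_0$ through the logistic function; the result matches the proportion of ones in the data. A minimal sketch (logReg0 is a hypothetical name, reusing the objects defined above):
In [ ]:
# Intercept-only model: back-transforming beta0 through the logistic function
# recovers the proportion of ones in the data
logReg0 = sm.Logit(endog=df.defaultFac, exog=np.ones(len(df))).fit(disp=0)
b0 = np.asarray(logReg0.params)[0]
print(1 / (1 + np.exp(-b0)))        # fitted probability implied by the intercept
print(df.defaultFac.mean())         # proportion of ones (defaults) in the sample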
If we are interested in the coefficients' 99% confidence interval (instead of the 95% shown in the summary) we can use the same code as in the linear regression case.
In [9]:
logReg.conf_int(alpha=0.01)
Out[9]:
If we wish to make predictions of the default probability we can use the .predict() method. If we leave the parentheses empty, this method calculates $p(X)$ on the basis of the feature sample used for training. If we wish to get the probability of default for a balance of, let us say, USD 2'000, we can run the following line of code:
In [10]:
# X must include 1 in first column for intercept
logReg.predict([1, 2000])
Out[10]:
The vector $[1, 2000]$ passed to the .predict() method above corresponds to a row vector containing the $x$-values for which we wish to predict the probability of default. If we had $p$ features, this vector would have length $p+1$ ($p$ features + 1 for the intercept).
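The same logic extends to several new observations at once: .predict() also accepts a two-dimensional array with one row per observation. The balance values in the sketch below are arbitrary examples.
In [ ]:
# Each row: [intercept, balance]; the balance values are arbitrary examples
X_new = np.array([[1, 1000],
                  [1, 2000],
                  [1, 3000]])
logReg.predict(X_new)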
For future reference, the code below generates a confusion matrix. We will discuss confusion matrices in more detail in the chapter on $k$-Nearest Neighbors. The threshold parameter defines the probability above which an observation is labeled/predicted as 1 (default); if no value is given, the function reverts to its default setting of threshold=0.5.
In [11]:
# Confusion Matrix for LogRegression
logReg.pred_table(threshold=0.5)
Out[11]:
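To see what pred_table does under the hood, the sketch below reproduces the matrix manually: the fitted probabilities are turned into 0/1 labels using the 0.5 threshold and then cross-tabulated against the observed default status.
In [ ]:
# Manual reconstruction of the confusion matrix: label an observation 1 if its
# fitted probability exceeds the 0.5 threshold, then cross-tabulate
pred_label = (logReg.predict() > 0.5).astype(int)
pd.crosstab(df.defaultFac, pred_label, rownames=['actual'], colnames=['predicted'])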
In [12]:
# Create sorted results-df
res = pd.DataFrame()
res['balance'] = df.balance
res['prob'] = logReg.predict()
res = res.sort_values('balance')
# Plot scatter & logReg
plt.figure(figsize=(12, 8))
plt.scatter(df.balance, df.defaultFac, marker='.')
plt.plot(res.balance, res.prob, c='k')
plt.axhline(y=0, color='gray', linestyle='dashed')
plt.axhline(y=1, color='gray', linestyle='dashed')
plt.ylabel('Probability of default', fontsize=12)
plt.xlabel('Balance', fontsize=12);
Here is an alternative way of plotting it. Notice that for the logistic curve I do not use the data the model was trained on but a grid covering 'any possible' balance value.
In [33]:
# Create sorted results-df
res = pd.DataFrame()
res['balance'] = np.linspace(start=0, stop=5000, num=10000)
res['prob'] = logReg.predict(sm.add_constant(res['balance']))
# Plot scatter & logReg
plt.figure(figsize=(12, 8))
plt.scatter(df.balance, df.defaultFac, marker='.')
plt.plot(res.balance, res.prob, c='k')
plt.axhline(y=0, color='gray', linestyle='dashed')
plt.axhline(y=1, color='gray', linestyle='dashed')
plt.ylabel('Probability of default', fontsize=12)
plt.xlabel('Balance', fontsize=12);
In [13]:
# Assign features to matrix X and response to y
X = sm.add_constant(df[['balance', 'income', 'studentFac']])
X.income = X.income / 1000 # Income in 1'000 dollars
y = df.defaultFac
# Run Logistic Regression
logReg = sm.Logit(endog=y, exog=X).fit()
print(logReg.summary().tables[1])
The p-value for income is surprisingly large, indicating that there is no clear evidence of a real association between the probability of default and income. In a case study this would certainly need to be examined further.
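With the multiple-feature model, .predict() can again be used to compare hypothetical clients. The sketch below contrasts a student and a non-student with the same made-up balance and income; it assumes the factorize() call above coded 'No' as 0 and 'Yes' as 1, which should be verified on the data before interpreting the output.
In [ ]:
# Hypothetical clients (values made up): [intercept, balance, income in 1'000 USD, studentFac]
# Assumes factorize() coded 'No' as 0 and 'Yes' as 1;
# verify with df[['student', 'studentFac']].drop_duplicates()
non_student = [1, 1500, 40, 0]
student = [1, 1500, 40, 1]
logReg.predict([non_student, student])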
For reference we show below how logistic regression is run with Scikit-learn. Unfortunately, there is no summary output available in Scikit-learn like there is within the Statsmodels package. The main reason is that sklearn is not used for statistical inference but for predictive modelling/ML and the evaluation criteria are based on performance on previously unseen data.
Notice that in order to obtain similar coefficients/results, the hyperparameter C has to be set to C=1e9. This is a regularization parameter that refers to the L1/L2 penalty scheme implemented in sklearn. For details refer to Scikit-learn's user guide, this discussion on Stackoverflow, or chapter 6.2 in James et al. (2013).
In [14]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(C=1e9, fit_intercept=False).fit(X, y)
In [15]:
logReg.coef_
Out[15]:
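To convince ourselves that the two packages agree up to solver tolerance, the sketch below refits the statsmodels Logit (the variable logReg was re-used for the sklearn estimator above, so a new name, sm_fit, is introduced here) and tabulates both sets of estimates side by side.
In [ ]:
# Re-fit the statsmodels Logit (the variable logReg now holds the sklearn estimator)
sm_fit = sm.Logit(endog=y, exog=X).fit(disp=0)

# Coefficients side by side (order: const, balance, income, studentFac)
pd.DataFrame({'statsmodels': sm_fit.params, 'sklearn': logReg.coef_.ravel()}, index=X.columns)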
We've seen above how one can plot a logistic regression in the simple case with just one feature. If you regress a response on two features, it is still possible to create very meaningful plots. A good example is this answer on Stack Overflow, where the decision boundary is drawn and the color scheme follows the probability of the response.
In closing this chapter it should be mentioned that the preceding code sections are by no means a thorough analysis but only scratch the surface of the topics. The tools and measures necessary to properly assess the output have yet to be introduced. For this we refer to the chapter on linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) (and subsequent chapters), where many tools are introduced that are of significant help in analyzing an algorithm's output.
In writing this notebook, many resources were consulted. For internet resources, links are provided within the text above. Beyond these links, the following resources were consulted and are recommended as further reading on the discussed topics: