Classification I: Logistic Regression

Predicting student admission

Suppose we have a csv file containing a historical data set of 100 student admissions at a given university.

Let's load this data into a Pandas DataFrame, using the read_csv function, which returns a DataFrame built from a csv file.

So we first import Pandas:


In [1]:
import pandas as pd

Then we load the data, and show the first entries using the head method:


In [2]:
student_data = pd.read_csv('data/log_reg1.csv', names=['exam1', 'exam2', 'admitted'])
student_data.head()


Out[2]:
exam1 exam2 admitted
0 34.623660 78.024693 0
1 30.286711 43.894998 0
2 35.847409 72.902198 0
3 60.182599 86.308552 1
4 79.032736 75.344376 1

The first two columns hold each student's scores on two exams, while the last column indicates whether that student was admitted to the university (admitted=1) or not (admitted=0).

Problem: As a university administrator, you may be interested in predicting the probability that a future prospective student will be admitted to that university, given their scores on those two exams.

To see the kind of relationship between the scores and university admission, we can draw a scatter plot of the student scores, using a color scheme to indicate whether a student, represented by the coordinates of her two scores, was admitted (green) or not (red).

To do that we use the Python package matplotlib. To allow matplotlib plots to be embedded into the notebook, we should first invoke the following magic command:


In [3]:
%matplotlib inline

In [4]:
import matplotlib.pyplot as plt

Then, using the subplots function, one creates a Figure object representing the canvas on which to draw, along with an AxesSubplot object, which represents a set of xy-axes. The AxesSubplot object has plot and scatter methods for drawing scatter plots, along with various other methods to add elements to the figure:


In [5]:
admitted = student_data[student_data.admitted == 1][['exam1', 'exam2']]
rejected = student_data[student_data.admitted == 0][['exam1', 'exam2']]

In [6]:
figure1, figure1_axes = plt.subplots()
figure1_axes.plot(admitted['exam1'], admitted['exam2'], 'go',
                  rejected['exam1'], rejected['exam2'], 'ro')
figure1_axes.set_xlim([20,110])
figure1_axes.set_ylim([20,110])
figure1_axes.set_xlabel('Score at Exam 1')
figure1_axes.set_ylabel('Score at Exam 2')
figure1_axes.legend(['admitted', 'rejected'])
figure1_axes.set_title('Admission Scatter Plot');


Conditional probability and sigmoid function

From the scatter plot above, we see that the admitted students seem to be separated from the rejected students by a line, the separation boundary, whose equation should thus be of the form:

$$\theta_0 + \theta_1x_1 + \theta_2 x_2= 0$$

where $x_1$ is the score at the first exam, $x_2$ is the score at the second exam, and the $\theta$'s are unknown parameters.

Logistic regression provides a method (which we will outline later on) to estimate these parameters. That is, one of the outputs of logistic regression is a vector of estimated parameters:

$$\hat\theta = (\hat\theta_0, \hat\theta_1, \hat\theta_2)$$

A second output of logistic regression is a prediction of the admission probability for a prospective student (i.e. not necessarily in our initial data set), given that student's scores $x_1$ and $x_2$ on the two exams. If we denote by $Y$ the binary variable that takes the value $0$ if a student is rejected and $1$ in case of admission, logistic regression predicts the following conditional probability of admission, given the two scores:

$$P(Y=1 \,|\, X_1=x_1,\, X_2=x_2\,;\;\hat\theta) = \frac{1}{1+\exp(-(\hat\theta_0 + \hat\theta_1x_1 + \hat\theta_2 x_2))}$$

where $X_1$ and $X_2$ are the two random variables modelling the student scores at each of the two exams.

The conditional probability above is modelled using the sigmoid function (also called logistic function):

$$s(z) = \frac1{1 + e^{-z}}$$

This function returns a number between $0$ and $1$ for any real input. Hence, its output can be interpreted as a probability.

To code the sigmoid function, we want to take advantage of numpy's vectorized arrays, which will allow us to apply numerical functions not only to scalars but to the entries of any multidimensional array with the exact same code:


In [7]:
import numpy as np

Of course, for the vectorization to work, one needs to use exclusively the math functions defined in the numpy package: np.cos, np.sin, np.exp, etc.


In [8]:
sigmoid = lambda z: 1/(1 + np.exp(-z))

We see that the sigmoid coded above can handle any multidimensional numpy array (including objects of the matrix class), by applying the function above to each entry of the array:


In [9]:
domain = np.linspace(-10, 10, 100)
image  = sigmoid(domain)
image[47:53]


Out[9]:
array([ 0.37635452,  0.42481687,  0.47476892,  0.52523108,  0.57518313,
        0.62364548])

Let's plot now the sigmoid function:


In [10]:
figure2, figure2_axes = plt.subplots()
figure2_axes.plot(domain, image);


By eyeballing the admission scatter plot above, we can guess that the separation line between positive ($Y=1$) and negative examples ($Y=0$) has a slope of roughly $\alpha = -1$ and an intercept of roughly $\beta = 125$, giving us the line equation:

$$x_2 = -x_1 + 125$$

We show this guessed separating boundary on the scatter plot below:


In [11]:
figure3, figure3_axes = plt.subplots()
domain = np.linspace(20, 110, 100)
figure3_axes.plot(admitted['exam1'], admitted['exam2'], 'go',
                  rejected['exam1'], rejected['exam2'], 'ro',
                  domain, -domain + 125, '-')
figure3_axes.set_xlabel('Score at Exam 1')
figure3_axes.set_ylabel('Score at Exam 2')
figure3_axes.set_title('Admission Scatter Plot')
figure3_axes.set_xlim([20,110])
figure3_axes.set_ylim([20,110])
figure3_axes.legend(['admitted students','rejected students','separation line'])


Out[11]:
<matplotlib.legend.Legend at 0x10d118c10>

Observe that the line itself, which is fully determined by its slope $\alpha$ and its intercept $\beta$, is not enough information to determine the model parameters uniquely. Namely, rescaling the $\theta$'s by any nonzero number $\lambda$ gives us the equation

$$\lambda(\theta_0 + \theta_1x_1 + \theta_2x_2) = 0$$

which describes the same line in the plane. In terms of the line slope and intercept $\alpha$ and $\beta$, we have that

$$\theta_1 = -\alpha \theta_2\quad \textrm{and} \quad \theta_0 = -\beta \theta_2$$

leaving us free to choose the value of $\theta_2$. Any choice of $\theta_2$ will give different parameter values, all of which describe the same geometric line.

The bottom line is thus that finding the geometric line that best separates the positive examples (admitted students) from the negative examples (rejected students) is not enough information to evaluate the admission probability of a new prospective student who has taken the two exams. Namely, the conditional probability above needs a fixed and unique value for each of the $\theta$ parameters.
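To make this concrete, here is a small sketch of the ambiguity (the $\theta$ values are illustrative guesses based on the eyeballed line above, not fitted parameters): two rescaled parameter vectors describe the same separating line but assign very different admission probabilities to the same student.


In [ ]:
import math

# Two parametrizations of the same line x2 = -x1 + 125:
# theta and 0.1 * theta describe the same boundary,
# but give different probabilities away from it.
def admission_prob(theta, x1, x2):
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1.0 / (1 + math.exp(-z))

theta = (-125.0, 1.0, 1.0)
theta_rescaled = tuple(0.1 * t for t in theta)

# same line, different probabilities: roughly 0.99 versus 0.62
admission_prob(theta, 60, 70), admission_prob(theta_rescaled, 60, 70)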

The Maximum Likelihood method described in the next section will give us a reasonable way to uniquely determine these parameters.

Once these parameters are uniquely determined, logistic regression will predict an admission probability

  • equal to $1/2$ for students lying on the separating line: $z = 0$
  • below $1/2$ for students below the separating line: $z < 0$
  • above $1/2$ for students above the separating line: $z > 0$

where $z = \hat\theta_0 + \hat\theta_1 x_1 + \hat\theta_2 x_2$. The more positive $z$ is, the higher the admission probability will be, and the more negative $z$ is, the lower the admission probability will be.

The decision to classify a student as positive (admitted) or negative (rejected) is then made on the basis of whether the admission probability is at or above the threshold $1/2$ or below it.

Hence, logistic regression (like the SVM or the perceptron) can be cast into the form of a linear classifier, i.e. the learned decision function

$$\delta: \mathbb R^2 \rightarrow \{0,1\}$$

that classifies a student as admitted or rejected on the basis of the two exam scores, is of the form

$$\delta(x_1, x_2) = g(\hat\theta_0 + \hat\theta_1x_1 + \hat\theta_2x_2)$$

where $$ g(z) = \begin{cases} 1 & \textrm{if }\; z \geq 0 \\ 0 & \textrm{if }\; z < 0 \end{cases} $$

Many classifiers are of this form. Logistic regression has the advantage of also outputting a classification probability, giving a kind of confidence along with the classification decision.
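As a small sketch of such a decision function (using the hand-guessed boundary $x_2 = -x_1 + 125$ from above, i.e. $\theta = (-125, 1, 1)$ up to scale; these are illustrative values, not learned parameters), $\delta$ can be coded directly:


In [ ]:
# Linear classifier delta(x1, x2) = g(theta_0 + theta_1*x1 + theta_2*x2)
# with the hand-guessed theta = (-125, 1, 1).
def delta(x1, x2, theta=(-125.0, 1.0, 1.0)):
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0   # g(z): classify as admitted when z >= 0

delta(80, 90), delta(30, 40)    # -> (1, 0)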

Probabilistic Model

We now establish the probabilistic model underlying logistic regression. This model will give us a way to estimate the parameters $\theta$ uniquely.

Suppose we have a data set recording $m$ observations $\omega_1,\dots,\omega_m$ of $n$ feature variables $X_1, \dots, X_n$, along with the value of a binary classification target variable $Y$ for each observation. Our data set can be described as a list

$$\mathcal D = \{(\omega_i, y_i):\: i=1,\dots, m\}$$

where the $i^{th}$ observation $\omega_i = (x_{i1}, \dots, x_{in})$ is an $n$-dimensional row vector, $x_{ij}$ is the value of feature $j$ for observation $i$, and $y_i$ is the class or label of that example ($y_i$ is either $0$ or $1$). In our student admission example, $m=100$, $n=2$, and $x_{i1}$, $x_{i2}$ are the two exam scores of student $i$.

The first assumption of the model is that the values $x_{1j},\dots, x_{mj}$ are $m$ Independent and Identically Distributed (IID) realizations of a random variable $X_j$ modelling the $j^{th}$ feature. Similarly, $y_1, \dots, y_m$ are regarded as IID realizations of a binary random variable $Y$.

Hence each row $(\omega_i, y_i)$ of our data table is governed by a joint probability distribution

$$P(X=\omega_i,\, Y=y_i) = P(X_1 = x_{i1},\,\dots,\,X_n=x_{in},\, Y=y_i)$$

The logistic regression model does not make any assumption on the form of this joint probability. The model only cares about modelling the conditional probability of $Y$ given $X$, which is a much less restrictive assumption, since many different joint distributions can yield the very same conditional probability. Since $Y$ is a binary discrete random variable, we have essentially one choice for a probabilistic model: the Bernoulli model!

$$P(Y=y\,| X=x) = (p(x))^{y}(1-p(x))^{(1-y)}$$

where $p(x)$ is the probability of class $Y=1$ given $X=x$ and $1 - p(x)$ is the probability of class $Y=0$ given $X=x$.

Because of our IID assumption, the overall conditional probability of seeing in our data set $\mathcal D$ the values $\mathcal D_Y$ we see for $Y$, given the values $\mathcal D_X$ we see for $X = (X_1, \dots, X_n)$, can be expressed as a product of the conditional probabilities for each of the examples (i.e. each row of our data table):

$$ \begin{eqnarray} P(\mathcal D_Y | \mathcal D_X) & = & P(Y=y_1 | X=\omega_1)\dots P(Y=y_m | X=\omega_m)\\ & = & \prod_{i=1}^m (p(\omega_i))^{y_i}(1-p(\omega_i))^{(1-y_i)} \end{eqnarray} $$

The idea is now to look for the "best" approximation of the target function $p:\mathbb R^n\rightarrow [0,1]$ in a space $\mathcal H$ of hypotheses, which is a special subspace of the set of all functions from the feature space $\mathbb R^n$ to $[0,1]$. The algorithm that searches this space for the "best" approximation to $p$, according to a certain criterion, is the learning algorithm.

The hypothesis space for logistic regression is the space of linearly parametrized sigmoid functions:

$$\mathcal H = \left\{h_\theta(x)=\frac1{1 + \exp(-\theta^T x)}:\: \theta=(\theta_0,\dots,\theta_n)\in\mathbb R^{n+1}\right\}$$

In the notation above, we added a component $x_0$ to our feature vectors. We will always set this first component to $1$, i.e. $x = (1, x_1, \dots, x_n)$, in order to use the compact vector notation:

$$\theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

From now on, we will prepend a $1$ to the feature vector of each of our examples and denote the resulting vector by $$x^{(i)} = (1, x_{i1},\dots, x_{in})$$

The question is now how to find $\theta$ in our hypothesis space $\mathcal H$ so that $h_\theta$ is the "best" approximation we can get to the target function $p$ that we want to learn from our data set.

Maximum Likelihood

If we knew the true target function $p$, we could compute the conditional probability $P(\mathcal D_Y | \mathcal D_X)$ as seen above. Now if we choose a hypothesis $h_\theta$ in the logistic regression hypothesis space $\mathcal H$, we obtain the conditional probability of seeing the values $\mathcal D_Y$ we see in our data set for $Y$, given the values $\mathcal D_X$ we see for $X$, according to that hypothesis:

$$ \begin{eqnarray} P(\mathcal D_Y | \mathcal D_X;\; \theta) & = & \prod_{i=1}^m (h_\theta(\omega_i))^{y_i}(1-h_\theta(\omega_i))^{(1-y_i)} \end{eqnarray} $$

The $\theta$ appearing in the notation emphasizes that this conditional probability is computed "according to the hypothesis" $h_\theta$ that we are making on the form of the true target function $p$.

The method of Maximum Likelihood prescribes searching our hypothesis space $\mathcal H$ for the hypothesis $h_{\hat\theta}$ that maximizes the probability of seeing the values we are seeing for $Y$, given the ones we are seeing for $X$. In other words, the maximum likelihood estimate for $\theta$ is given by

$$\hat\theta_{MLE} = \operatorname{argmax}_{\theta \in \mathbb R^{n+1}} P(\mathcal D_Y | \mathcal D_X;\; \theta)$$

Given the data $\mathcal D$, the conditional probability above can be seen as a function of the parameter vector $\theta$ only. It is called the likelihood function and is denoted by

$$L_{\mathcal D}(\theta):= P(\mathcal D_Y | \mathcal D_X;\; \theta)$$

It is usually hard to find a maximum of the likelihood function directly. Taking advantage of the independence assumption, which turns the likelihood function into a product of similar terms, we apply the logarithm, which turns products (harder to differentiate) into sums (easier to differentiate), to obtain the log-likelihood function:

$$ \begin{eqnarray} l_{\mathcal D}(\theta) & = & \log L_{\mathcal D}(\theta)\\ & = & \sum_{i=1}^m y^{(i)}\log h_{\theta}(x^{(i)}) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \end{eqnarray} $$

Since the logarithm is a monotonically increasing function, the log-likelihood achieves its maximum at the exact same value $\hat\theta$ as the likelihood function does. Hence, maximizing $l_{\mathcal D}(\theta)$ will deliver the MLE estimate for the parameter vector. The advantage is that $l_{\mathcal D}(\theta)$ is more amenable to traditional optimization methods such as gradient ascent or Newton's method.

Log-likelihood Gradient and Hessian

We are now going to code the log-likelihood for logistic regression, in a vectorized way, using numpy matrices.

First of all, let's gather the feature observations into a single matrix called the design matrix:

$$ X := \left( \begin{array}{ccccc} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \\ \end{array} \right) $$

In our student admission example, we obtain:


In [34]:
X = np.matrix(student_data[['exam1', 'exam2']])
ones = np.matrix(np.ones(X.shape[0])).T
X = np.hstack([ones, X])
X[:4]


Out[34]:
matrix([[  1.        ,  34.62365962,  78.02469282],
        [  1.        ,  30.28671077,  43.89499752],
        [  1.        ,  35.84740877,  72.90219803],
        [  1.        ,  60.18259939,  86.3085521 ]])

We will collect the admission status into the column vector $y$:


In [35]:
y = np.matrix(student_data.admitted).T
y[:4]


Out[35]:
matrix([[0],
        [0],
        [0],
        [1]])

In vectorized computation, the sigmoid function sigmoid we defined earlier can take any numpy array and apply the sigmoid transformation to each of the array's elements, hence returning an array of the same dimension. The numpy logarithm np.log has the same property. Thus, if $\theta$ is a column vector, the expression np.log(sigmoid(X * theta)) computes the vector

$$ \log (\operatorname{sigmoid}(X\theta)) = \left(\log \frac1{1 + \exp(-x^{(1)}\theta)}, \dots, \log \frac1{1+\exp(-x^{(m)}\theta)}\right)^T$$

whose inner product with the vector $y = (y_1, \dots, y_m)^T$ yields the first term of $l_{\mathcal D}(\theta)$.

The second term is obtained in a similar way by vectorized operations, yielding:

$$ l_{\mathcal D}(\theta) = y^T\log(\operatorname{sigmoid}(X\theta)) + (1-y)^T\log(1 - \operatorname{sigmoid}(X\theta)) $$

Since the log-likelihood depends on the data, it is convenient to implement it as a method of a class that stores the design matrix and the response vector; the full implementation is given further below.
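Before turning to that class, here is a quick sanity check of the vectorized expression (a sketch using the design matrix X, the response vector y, and the sigmoid lambda defined above): at $\theta = 0$ every predicted probability is $1/2$, so the log-likelihood should equal $m\log\tfrac12 \approx -69.3$ for our $m = 100$ students.


In [ ]:
# Vectorized log-likelihood evaluated at theta = 0:
# every predicted probability is 1/2, so we expect 100 * log(1/2) ~ -69.3
theta_zero = np.matrix(np.zeros(X.shape[1])).T
prob = sigmoid(X * theta_zero)
float(y.T * np.log(prob) + (1 - y).T * np.log(1 - prob))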

Using the fact that the sigmoid function $s$ has the property that $$s'(z) = s(z)(1-s(z))$$ one computes the gradient of the log-likelihood, whose components are called the score functions:

$$\frac{\partial l_{\mathcal D}(\theta)}{\partial \theta_k} = \langle y - s(X\theta),\, \vec x_k\rangle$$
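In detail, writing $s_i = s(\theta^T x^{(i)})$ and applying the chain rule to each term of $l_{\mathcal D}(\theta)$:

$$ \begin{eqnarray} \frac{\partial l_{\mathcal D}(\theta)}{\partial \theta_k} & = & \sum_{i=1}^m \left(\frac{y_i}{s_i} - \frac{1-y_i}{1-s_i}\right) s_i(1-s_i)\, x_k^{(i)} \\ & = & \sum_{i=1}^m (y_i - s_i)\, x_k^{(i)} \end{eqnarray} $$

which is exactly the inner product displayed above.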

In vectorized notation, a straightforward computation yields:

$$\nabla l_{\mathcal D}(\theta) = X^T(\vec y - \operatorname{sigmoid}(X\theta))$$

where $\vec x_k$ is the sample vector of all observations for feature $X_k$ and the difference

$$\hat \epsilon = \vec y - s(X\theta)$$

is the vector of residuals between the predictions $s(X\theta)$ and the response vector $\vec y$.

With that gradient, one can seek the maximum of the log-likelihood by following the direction of steepest ascent in parameter space. The resulting algorithm is called gradient ascent:

$$\theta^{(n+1)} := \theta^{(n)} + \alpha X^T(y - \operatorname{sigmoid}(X\theta^{(n)}))$$

where $\alpha$ is the step size, called the learning rate.
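As a small illustration (the full training loop is implemented in the class further below), a single gradient-ascent update on our student data, starting from $\theta = 0$ and using an arbitrary small step size, looks like this:


In [ ]:
# One gradient-ascent step, theta <- theta + alpha * X^T (y - sigmoid(X theta)),
# starting from theta = 0 with an arbitrary small learning rate.
theta_step = np.matrix(np.zeros(X.shape[1])).T
step_size = 0.001
theta_step = theta_step + step_size * X.T * (y - sigmoid(X * theta_step))
theta_step.T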

Although the log-likelihood function is concave, and hence has a global maximum, which guarantees that gradient ascent (with a suitable step size) converges, the convergence rate is slow and requires many iterations. A better method in this case is Newton's algorithm for finding the zero of a function, which we apply to the gradient of the log-likelihood. This requires us to compute the derivative of the scores, that is, the matrix of second derivatives of the log-likelihood, called its Hessian. Let us compute the Hessian of the log-likelihood:

Using again that $s'(z) = s(z)(1-s(z))$, we obtain that

$$ \begin{eqnarray} \frac{\partial^2 l_{\mathcal D}(\theta)}{\partial \theta_k \partial \theta_l} & = & - \sum_{i=1}^m s(\theta^T x^{(i)})(1 - s(\theta^T x^{(i)})) x_k^{(i)} x_l^{(i)}\\ & = & - \sum_{i=1}^m \sigma_i^2(\theta)\, x_k^{(i)} x_l^{(i)}\\ & = & - \vec x_k^T \Sigma_\theta \vec x_l \end{eqnarray} $$

where $\Sigma_\theta = \operatorname{Diag}\big(\sigma_1^2(\theta), \dots, \sigma_m^2(\theta)\big)$ is a diagonal matrix of weights:

$$ \begin{eqnarray} \sigma_i^2(\theta)& = & s(\theta^T x^{(i)})(1 - s(\theta^T x^{(i)})) \\ & = & \operatorname{var}_\theta (Y\,|\, X=x^{(i)}) \end{eqnarray} $$

which is the predicted variance of $Y \,|\, X=x^{(i)};\, \theta$ given the hypothesis $\theta$ on the conditional probability

$$ P(Y=y\,|\,X=x^{(i)}; \;\theta) = (\operatorname{sigmoid}(\theta^T x^{(i)}))^y (1- \operatorname{sigmoid}(\theta^T x^{(i)}))^{1-y}$$

Therefore, in vector notation, we obtain the gradient and Hessian matrix of the log-likelihood for logistic regression:

$$ \nabla l_{\mathcal D}(\theta) = X^T(\vec y - \operatorname{sigmoid}(X\theta)), \qquad \nabla^2 l_{\mathcal D}(\theta) = -X^T \Sigma_\theta X$$

Optimization packages will often let us supply, along with the function to optimize, both the gradient and the Hessian of the objective; doing so decreases the burden on the optimization algorithm, which then does not have to approximate these derivatives numerically, and reduces the number of function calls to the objective.
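To illustrate this point, here is a minimal sketch (assuming SciPy is installed; the helper names below are ours, not part of the class that follows) that hands the negative log-likelihood, its gradient, and its Hessian to scipy.optimize.minimize with the Newton-CG method:


In [ ]:
import numpy as np
from scipy.optimize import minimize

Xa = np.asarray(X)             # design matrix as a plain ndarray
ya = np.asarray(y).ravel()     # response vector as a 1-d array

# scipy minimizes, so we pass the *negative* log-likelihood, gradient and Hessian
def neg_log_likelihood(theta):
    prob = sigmoid(Xa.dot(theta))
    return -(ya.dot(np.log(prob)) + (1 - ya).dot(np.log(1 - prob)))

def neg_gradient(theta):
    return -Xa.T.dot(ya - sigmoid(Xa.dot(theta)))

def neg_hessian(theta):
    s = sigmoid(Xa.dot(theta))
    return Xa.T.dot(np.diag(s * (1 - s))).dot(Xa)   # X^T Sigma_theta X

result = minimize(neg_log_likelihood, np.zeros(Xa.shape[1]),
                  jac=neg_gradient, hess=neg_hessian, method='Newton-CG')
result.x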


In [16]:
class LogisticRegression(object):
    
    def __init__(self, X, y):
        self.y = y
        self.X = X
        self.theta = np.matrix(np.zeros(X.shape[1])).T
        
    def sigmoid(self, z):
        return 1.0/(1 + np.exp(-z)) 
               
    def log_likelihood(self, theta):
        prob = self.sigmoid(self.X * theta)
        ll = self.y.T * np.log(prob) + (1 - self.y).T * np.log(1 - prob)     
        return float(ll)
    
    def gradient(self, theta):
        return self.X.T * (self.y - self.sigmoid(self.X * theta))
    
    def hessian(self, theta):
        # Hessian of the log-likelihood: -X^T Sigma_theta X, where Sigma_theta is the
        # diagonal matrix of predicted conditional variances sigma_i^2(theta)
        prob = self.sigmoid(self.X * theta)
        weights = np.diagflat(np.multiply(prob, 1 - prob))
        return -self.X.T * weights * self.X
    
    def error(self, theta):
        pass
    
    def train(self, alpha=0.001, n=1000):
        for i in range(n):
            self.theta = self.theta + alpha * self.gradient(self.theta)
        return self.theta
    
    def summary(self):
        pass
    
    def classify(self, x):
        # Classify as admitted (1) when the predicted probability is at least 1/2
        return 1 if self.positive_prob(x) >= 0.5 else 0
    
    def positive_prob(self, x):
        # x is a column feature vector with leading 1, e.g. np.matrix([1., 55., 70.]).T
        return float(self.sigmoid(x.T * self.theta))

In [32]:
lreg = LogisticRegression(X, y)
theta_learned = lreg.train(alpha=0.01, n=500000)
print('Learned parameters:')
print(theta_learned)


Learned parameters:
[[-5373.24906527]
 [   45.92968145]
 [   45.66893581]]

In [38]:
t0, t1, t2 = float(theta_learned[0]), float(theta_learned[1]), float(theta_learned[2])
alpha = -t1/t2; beta  = -t0/t2
print('Learned slope: ', alpha)
print('Learned intercept: ', beta)


Learned slope:  -1.00570947465
Learned intercept:  117.656542011

In [62]:
fig, axes = plt.subplots()
domain = np.linspace(15, 105, 100)

axes.plot(admitted['exam1'], admitted['exam2'], 'go',
          rejected['exam1'], rejected['exam2'], 'ro',
          domain, alpha * domain + beta, '-')


Out[62]:
[<matplotlib.lines.Line2D at 0x1109a28d0>,
 <matplotlib.lines.Line2D at 0x1109a2ad0>,
 <matplotlib.lines.Line2D at 0x1109a7190>]
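
Finally, coming back to the original problem, here is a sketch of how the learned parameters answer it: the predicted admission probability for a hypothetical new student (the scores below are made up for illustration).


In [ ]:
# Predicted admission probability for a hypothetical new student scoring
# 70 on both exams (made-up scores, comfortably above the learned boundary,
# so the predicted probability comes out close to 1).
new_student = np.matrix([1.0, 70.0, 70.0]).T   # prepend the constant feature x0 = 1
float(sigmoid(theta_learned.T * new_student))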

In [ ]: