CSE 6040, Fall 2015 [26]: Logistic regression, Part 2

Maximum likelihood estimation and numerical optimization 101

In Lab 25, we looked at geometric solutions to the binary classification problem. In particular, the data is a set of labeled points in some vector space; our goal is to model how points are labeled by dividing (or partitioning) the space.

Recall that you did a quick experiment in which you partitioned the space for a set of synthetic data points manually; in this lab, you will see a technique to determine this partitioning automatically.

Some of today's code snippets use Plotly, so you may need to refer back to how to make plots in Plotly. The most important one is how to log-in to the service, for which you'll need to look up your Plotly API key. Anyway, you may need to refer to the references below.

Also, this lab builds on the iterative numerical optimization idea from Lab 24, which is known as gradient ascent or gradient descent (also, steepest ascent/descent), depending on whether one is maximizing or minimizing some quantity.

Lastly, several of the routines we used last time to manipulate points and visualize them have been moved into the cse6040utils module; so please update that: cse6040utils.py

Preliminaries


In [ ]:
import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import display

%matplotlib inline

In [ ]:
import plotly.plotly as py
from plotly.graph_objs import *

# @YOUSE: Fill in your credentials (user ID, API key) for Plotly here
py.sign_in ('USERNAME', 'APIKEY')

In [ ]:
%reload_ext autoreload
%autoreload 2
import cse6040utils

Notation

Recall the basic notation used in Lab 25.

Definition 1. There are $m$ labeled data points in a $d$-dimensional vector space. The $i$-th observation is represented by its label, $l_i$, and its augmented coordinates ("point"), $x_i \equiv (1.0, x_{i,1}, x_{i,2}, \ldots, x_{i,d})^T$; that is, the point $x_i$ is a column vector of length $d+1$ with a dummy value of 1.0 in the very first (0-th) entry. The matrix of data coordinates, $X$, stacks these points rowwise into a $m \times (d+1)$ matrix, i.e.,

$$ \begin{array}{rcl} X \equiv \left(\begin{array}{c} x_0^T \\ x_1^T \\ \vdots \\ x_{m-1}^T \end{array}\right) & = & \left(\begin{array}{ccccc} 1 & x_{0,1} & x_{0,2} & \cdots & x_{0,d} \\ 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ & & & \vdots & \\ 1 & x_{m-1,1} & x_{m-1,2} & \cdots & x_{m-1,d} \\ \end{array}\right). \end{array} $$

We will take the labels to be a binary column vector, $l \equiv \left(l_0, l_1, \ldots, l_{m-1}\right)^T$.

Definition 2. A linear discriminant is a function that assigns a "score" to a given point $x$ by linearly transforming its coordinates. It may be specified by a $d+1$-dimensional column vector of coefficients, $\theta \equiv (\theta_0, \theta_1, \ldots, \theta_d)^T$; the score is then taken to be $\theta^T x$.

In the cse6040utils module, use lin_discr (X, theta) to evaluate a linear discriminant given by theta on an augmented coordinate matrix X.


In [ ]:
from cse6040utils import lin_discr

Definition 3. The heaviside function maps strictly positive values to the value 1 and non-positive values to 0:

$$ \begin{array}{rcl} H(y) & \equiv & \left\{\begin{array}{ll} 1 & \mathrm{if}\ y > 0 \\ 0 & \mathrm{if}\ y \leq 0 \end{array}\right.. \end{array} $$

In the cse6040utils module, use heaviside (Y) to apply $H(y)$ elementwise to any Numpy multidimensional array object.


In [ ]:
from cse6040utils import heaviside

Definition 4. The logistic function maps a real number continuously into the interval (0, 1):

$$ \begin{array}{rcl} G(y) & \equiv & \frac{1}{1 + e^{-y}}. \end{array} $$

Recall that $G(y) \rightarrow 0$ as $y \rightarrow -\infty$, and $G(y) \rightarrow 1$ as $y \rightarrow +\infty$.

In the cse6040utils module, use logistic (Y) to apply $G(y)$ elementwise to any Numpy multidimensional array object.


In [ ]:
from cse6040utils import logistic

Properties of the logistic function. The following is a handy list of properties of $G(y)$.

$$ \begin{array}{rcll} G(y) & = & \frac{e^y}{e^y + 1} & \mathrm{(P1)} \\ G(-y) & = & 1 - G(y) & \mathrm{(P2)} \\ \dfrac{G(y)}{G(-y)} = \dfrac{G(y)}{1 - G(y)} & = & e^y & \mathrm{(P3)} \\ \dfrac{dG}{dy} & = & G(y) G(-y) & \mathrm{(P4)} \\ {\dfrac{d}{dy}} {\left[ \ln G(y) \right]} & = & G(-y) & \mathrm{(P5)} \\ {\dfrac{d}{dy}} {\ln \left[ 1 - G(y) \right]} & = & -G(y) & \mathrm{(P6)} \\ \end{array} $$

The sample data

We'll use the same synthetic data set of labeled points as in Lab 25. The following code cells loads them in, creating a matrix points corresponding to $X$ and a column vector labels corresponding to $l$.


In [ ]:
df = pd.read_csv ('http://vuduc.org/cse6040/logreg_points_train.csv')

In [ ]:
points = np.insert (df.as_matrix (['x_1', 'x_2']), 0, 1.0, axis=1)
labels = df.as_matrix (['label'])

In [ ]:
from cse6040utils import make_2d_scatter_traces

In [ ]:
print "Number of points:", len (points)

traces = make_2d_scatter_traces (points, labels)
py.iplot (traces)

Demo: Manual classification

Recall that in the last class you generated label predictions using $H(\theta^T x)$, that is, by using the heaviside function with a linear discriminant. You determined the discriminant coefficients, $\theta$, manually.


In [ ]:
from cse6040utils import check_labels
from cse6040utils import np_col_vec
from cse6040utils import gen_lin_discr_trace

In [ ]:
#theta = np_col_vec ([0., -1., 3.])
#theta = np_col_vec ([-0.55, -2., -0.5])
theta = np_col_vec ([-1.35, -6.5, -1.])

# Generate 0/1 labels for your discriminant:
is_correct = check_labels (points, labels,
                           fun=lambda X: heaviside (lin_discr (X, theta)))

print "Number of misclassified points:", (len (points) - sum (is_correct))[0]
print "\n(Run the code cell below to visualize the results.)"

In [ ]:
# Visually inspect the above results
traces = make_2d_scatter_traces (points, is_correct)
traces.append (gen_lin_discr_trace (points, theta))

# Plot it!
layout = Layout (xaxis=dict (range=[-1.25, 2.25]),
                 yaxis=dict (range=[-3.25, 2.25]))
fig = Figure (data=traces, layout=layout)
py.iplot (fig)

Heaviside vs. logistic functions

Where we ended in the last class was comparing the use of the heaviside function against use of the logistic function, both assuming a linear discriminant scoring function.

Here is that comparison visually.


In [ ]:
# Use Numpy's handy meshgrid() to create a regularly-spaced grid of values.
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html

x1 = np.linspace (-2., +2., 100)
x2 = np.linspace (-2., +2., 100)
x1_grid, x2_grid = np.meshgrid (x1, x2)
h_grid = heaviside (theta[0] + theta[1]*x1_grid + theta[2]*x2_grid)

trace_grid = Contour (x=x1, y=x2, z=h_grid)
py.iplot ([trace_grid])

In [ ]:
x_logit_1d = np.linspace (-6.0, +6.0, 101)
y_logit_1d = logistic (x_logit_1d)
trace_logit_1d = Scatter (x=x_logit_1d, y=y_logit_1d)
py.iplot ([trace_logit_1d])

In [ ]:
g_grid = logistic (theta[0] + theta[1]*x1_grid + theta[2]*x2_grid)

trace_logit_grid = Contour (x=x1, y=x2, z=g_grid)
py.iplot ([trace_logit_grid])

Determining $\theta$ via Maximum Likelihood Estimation

Previously, you determined $\theta$ for our synthetic dataset experimentally. Can you compute a good $\theta$ automatically? One of the standard techniques in statistics is to perform a maximum likelihood estimation (MLE) of a model's parameters, $\theta$.

Indeed, MLE is basis for the "statistical" way to derive the normal equations in the case of linear regression, though that is of course not how we encountered it in this class.

"Likelihood" as an objective function

MLE derives from the following idea. Consider the joint probability of observing all of the labels, given the points and the parameters, $\theta$:

$$ \mathrm{Pr}[l\,|\,X, \theta]. $$

Suppose these observations are independent and identically distributed (i.i.d.). Then the joint probability can be factored as the product of individual probabilities,

$$ \begin{array}{rcl} \mathrm{Pr}[l\,|\,X,\theta] = \mathrm{Pr}[l_0, \ldots, l_{m-1}\,|\,x_0, \ldots, x_{m-1}, \theta] & = & \mathrm{Pr}[l_0\,|\,x_0, \theta] \cdots \mathrm{Pr}[l_{m-1}\,|\,x_{m-1}, \theta] \\ & = & \displaystyle \prod_{i=0}^{m-1} \mathrm{Pr}[l_i\,|\,x_i,\theta]. \end{array} $$

The maximum likelihood principle says that you should try to choose a parameter $\theta$ that maximizes the chances ("likelihood") of seeing these particular observations. Thus, we can simply reinterpret the preceding probability as an objective function to optimize. Mathematically, it is equivalent and convenient to consider the logarithm of the likelihood, or log-likelihood, as the objective function, defining it by,

$$ \begin{array}{rcl} \mathcal{L}(\theta; l, X) & \equiv & \log \left\{ \displaystyle \prod_{i=0}^{m-1} \mathrm{Pr}[l_i\,|\,x_i,\theta] \right\} \\ & = & \displaystyle \sum_{i=0}^{m-1} \log \mathrm{Pr}[l_i\,|\,x_i,\theta]. \end{array} $$

We are using the symbol $\log$, which could be taken in any convenient base, such as the natural logarithm ($\ln y$) or the information theoretic base-two logarithm ($\log_2 y$).

The MLE procedure then consists of two steps:

  • For the problem at hand, determine a suitable choice for $\mathrm{Pr}[l_i\,|\,x_i,\theta]$.
  • Run any optimization procedure to find the $\theta$ that maximizes $\mathcal{L}(\theta; l, X)$.

Example: Logistic regression

Let's say you have decided that the logistic function, $G(\theta^T x_i) = G(x_i^T \theta)$, is a good model of the probability of producing a label $l_i$ given the point $x_i$. Under the i.i.d. assumption, we can interpret the label $l_i$ as being the result of a Bernoulli trial (e.g., a biased coin flip), where the probability of success ($l_i=1$) is defined as $g_i = g_i(\theta) \equiv G(x_i^T \theta)$. Thus,

$$ \begin{array}{rcl} \mathrm{Pr}[l_i \, | \, x_i, \theta] & \equiv & g_i^{l_i} \cdot \left(1 - g_i\right)^{1 - l_i}. \end{array} $$

The log-likelihood in turn becomes,

$$ \begin{array}{rcl} \mathcal{L}(\theta; l, X) & = & \displaystyle \sum_{i=0}^{m-1} l_i \ln g_i + (1-l_i) \ln (1-g_i) \\ & = & \displaystyle \sum_{i=0}^{m-1} l_i \ln \dfrac{g_i}{1-g_i} + \ln (1-g_i) \\ & = & \sum_{i=0}^{m-1} l_i \theta^T x_i + \ln (1-g_i). \end{array} $$

You can write the log-likelihood more compactly in a matrix-vector notation, with the following notational conventions.

Convention 1. Let $u \equiv (1, \ldots, 1)^T$ be a column vector of all ones, with its length inferred from context. Then, the sum of the coordinate vectors ($\{x_i\}$) could be written as

$$\sum_{i=0}^{m-1} x_i = \left(x_0\ x_1\ \cdots\ x_{m-1}\right) \cdot \left(\begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array}\right) = X^T u. $$

Convention 2. Let $A = \left(a_{ij}\right)$ be any matrix and let $f(y)$ be any function that we have defined by default to accept a scalar argument $y$ and produce a scalar result. For instance, $f(y) = \ln y$ or $f(y) = G(y)$. Then, assume that $B = f(A)$ applies $f(\cdot)$ elementwise to $A$, returning a matrix $B$ whose elements $b_{ij} = f(a_{ij})$.

With these notational conventions, here are two different ways to write the log-likelihood.

Exercise. Show the following.

$$ \begin{array}{rrcl} (\mathrm{V1}) & \mathcal{L}(\theta; l, X) & = & l^T \ln G(X \theta) + (1-l)^T \ln [1 - G(X \theta)] \\ (\mathrm{V2}) & \mathcal{L}(\theta; l, X) & = & l^T X \theta + u^T \ln G(-X \theta) \end{array} $$

Exercise. Implement the log-likelihood function in Python.


In [ ]:
def log_likelihood (theta, l, X):
    # @YOUSE: Complete this function to evaluate the log-likelihood
    pass

To optimize the log-likelihood with respect to the parameters, $\theta$, you'd like to do the moral equivalent of taking its derivative, setting it to zero, and then solving for $\theta$.

For example, recall that in the case of linear regression via least squares minimization, carrying out this process produced an analytic solution for the parameters, which was to solve the normal equations.

Unfortunately, for logistic regression---or for most log-likelihoods you are likely to ever write down---you cannot usually derive an analytic solution. Therefore, you will need to resort to numerical optimization procedures.

Numerical optimization via gradient (or steepest) ascent/descent

The simplest procedure to maximize a function is gradient ascent (or steepest ascent). If instead you are minimizing the function, then the equivalent procedure is gradient (or steepest) descent.

The basic idea, in 1-D

Suppose we wish to find the maximum of a scalar function $f(x)$ in one dimension, i.e., $x$ is also a scalar. At the maximum, $\dfrac{df(x)}{dx} = 0$.

Suppose instead that $\dfrac{df}{dx} \neq 0$ and consider the value of $f$ at a nearby point, $x + p$, as given approximately by a truncated Taylor series:

$$ \begin{array}{rcl} f(x + p) & \approx & f(x) + p \dfrac{df(x)}{dx} + \frac{1}{2} p^2 \dfrac{d^2 f(x)}{dx^2}. \end{array} $$

To make progress toward maximizing $f(x)$, you'd like to choose $p$ so that $f(x+p) > f(x)$. One way is to choose $p=\alpha \cdot \mathrm{sign} \left(\dfrac{df}{dx}\right)$, where $0 < \alpha \ll 1$ is "small:"

$$ \begin{array}{rcl} f \left(x + \alpha \cdot \mathrm{sign} \dfrac{df}{dx} \right) & \approx & f(x) - \alpha \left|\dfrac{df}{dx}\right| + \frac{1}{2} \alpha^2 \cdot \mathrm{sign} \left(\dfrac{df}{dx}\right) \dfrac{d^2 f}{dx^2}. \end{array} $$

If $\alpha$ is small enough, then you can neglect the $\mathcal{O}(\alpha^2)$ term and $f(x + p)$ will be larger than $f(x)$, thus making progress toward a maximum.

This scheme is the basic idea: starting from some initial guess $x$, refine the guess by taking a small step $p$ in the direction of the derivative, i.e., $\mathrm{sign} \dfrac{df}{dx}$.

The basic idea in higher dimensions

Now consider $f(x)$ where $x$ is instead a vector, rather than a scalar. Then the value of $f$ at a nearby point $f(x + p)$, where $p$ is a vector, becomes

$$ \begin{array}{rcl} f(x + p) \approx f(x) + p^T \nabla_x f(x), \end{array} $$

where $\nabla_x f(x)$ is the vector derivative, or gradient, of $f$. Just as in the 1-D case, you want a step $p$ such that $f(x + p) > f(x)$. To make as much progress as possible, it would seem reasonable to choose $p$ to be parallel to $\nabla_x\,f(x)$, that is, proportional to the gradient. This intuition motivates the following choice of $p$:

$$ \begin{array}{rcl} p \equiv \alpha \frac{\nabla_x\,f(x)}{\|\nabla_x\,f(x)\|}. \end{array} $$

Again, $\alpha$ is a kind of fudge factor. You need to choose it to be small enough that the high-order terms of the Taylor approximation become negligible, yet large enough that you can make reasonable progress.

Applying the gradient ascent algorithm to MLE

The procedure applied to maximizing the log-likelihood is as follows.

  • Start with some initial guess, $\theta(0)$.
  • At each iteration $t \geq 0$ of the procedure, let $\theta(t)$ be the current guess.
  • Compute the direction of steepest ascent by evaluating the gradient, $\Delta_t \equiv \nabla_{\theta(t)} \left\{\mathcal{L}(\theta(t); l, X)\right\}$.
  • Take a step in the direction of the gradient, $\theta(t+1) \leftarrow \theta(t) + \alpha \dfrac{\Delta_t}{\|\Delta_t\|}$, where $\alpha$ is a suitably chosen fudge factor.

This procedure should smell eerily like the one in Lab 24! And just as in Lab 24, the tricky bit is how to choose $\alpha$.

One additional and slight distinction between this procedure and the Lab 24 procedure is that here we are optimizing using the full dataset, rather than processing data points one at a time. (That is, the step iteration variable $t$ used above is not used in exactly the same way as the step iteration $k$ was used in Lab 24.)

Another question is, how do we know this procedure will converge to the global maximum, rather than, say, a local maximum? For that you need a deeper analysis of a specific $\mathcal{L}(\theta; l, X)$, to show, for instance, that it is convex in $\theta$.

Example: A gradient ascent algorithm for logistic regression

Let's apply the gradient ascent procedure to the logistic regression problem, in order to determine a good $\theta$.

Exercise. Show the following:

$$ \begin{array}{rcl} \nabla_\theta \left\{\mathcal{L}(\theta; l, X)\right\} & = & X^T \left[ l - G(X \cdot \theta)\right]. \end{array} $$

Exercise. Implement the gradient ascent procedure to determine $\theta$, and try it out on the sample data.

In your solution, we'd like you to store all guesses in the matrix thetas, so that you can later see how the $\theta(t)$ values evolve. To extract a particular column t, use the notation, theta[:, t:t+1]. This notation is necessary to preserve the "shape" of the column as a column vector.


In [ ]:
def gradient_log_likelihood (theta, l, X):
    """Returns the gradient of the log-likelihood."""
    # @YOUSE: Implement the gradient for the logistic regression
    #         model's log-likelihood
    pass

In [ ]:
MAX_STEP = 500
ALPHA = 0.5

# Get the data coordinate matrix, X, and labels vector, l
X = points
l = labels.astype (dtype=float)

# Store *all* guesses, for subsequent analysis
thetas = np.zeros ((3, MAX_STEP+1))

for t in range (MAX_STEP):
    theta_t = thetas[:, t:t+1]
    # @YOUSE: Fill in the code to compute thetas[:, t+1:t+2]
    pass
    thetas[:, t+1:t+2] = theta_t + ALPHA*delta_t
    
theta_gd = thetas[:, MAX_STEP:]
print "Your (hand) solution:", theta.T.flatten ()
print "Computed solution:", theta_gd

print "\n=== Comparisons ==="
print "\n\\theta_0/\\theta_2:", \
      "manual =", theta[0]/theta[2], \
      ", vs. MLE (via gradient ascent) =", theta_gd[0]/theta_gd[2]
print "\n\\theta_1/\\theta_2:", \
      "manual =", theta[1]/theta[2], \
      ", vs. MLE (via gradient ascent) =", theta_gd[1]/theta_gd[2]

In [ ]:
# Generate 0/1 labels for computed discriminant using the logistic function
def gen_label_logreg (X, theta):
    return heaviside (logistic (lin_discr (X, theta)) - 0.5)

def check_correct_logreg (l, X, theta):
    return check_labels (X, l, fun=lambda X: gen_label_logreg (X, theta))

def count_correct_logreg (l, X, thetas):
    num_steps = thetas.shape[1]
    num_correct = np.zeros (num_steps, dtype=int)
    for t in range (num_steps):
        theta_t = thetas[:, t:t+1]
        is_correct = check_correct_logreg (l, X, theta_t)
        num_correct[t] = sum (is_correct)[0]
    return num_correct

num_correct_gd = count_correct_logreg (l, X, thetas)
is_correct_gd = check_correct_logreg (l, X, thetas[:, -1:])

print "Number of misclassified points using MLE via gradient ascent:", \
      (len (points) - num_correct_gd[-1])
print "\n(Run the code cell below to visualize the results.)"

In [ ]:
# Visually inspect the above results
traces_gd = make_2d_scatter_traces (points, is_correct_gd)
traces_gd.append (gen_lin_discr_trace (points, theta_gd))

# Plot it!
layout_gd = Layout (xaxis=dict (range=[-1.25, 2.25]),
                    yaxis=dict (range=[-3.25, 2.25]))
fig_gd = Figure (data=traces_gd, layout=layout_gd)
py.iplot (fig_gd)

Exercise. Make a contour plot of the log-likelihood and draw the trajectory taken by the $\theta(t)$ values laid on top of it.

In particular, your function, log_likelihood (theta, l, X), should return the value of the log-likelihood given a column vector of discriminant coefficients, theta; a column vector of labels, l; and a coordinate matrix, X.

The gradient ascent trajectory

Let's take a look at how the gradient ascent algorithm's progress looks. Try changing the $\phi$ parameter and see how it affects the results.


In [ ]:
n1_ll = 100
x1_ll = np.linspace (-20., 0., n1_ll)
n2_ll = 100
x2_ll = np.linspace (-20., 0., n2_ll)
x1_ll_grid, x2_ll_grid = np.meshgrid (x1_ll, x2_ll)

ll_grid = np.zeros ((n1_ll, n2_ll))
for i1 in range (n1_ll):
    for i2 in range (n2_ll):
        theta_i1_i2 = np.array ([[thetas[0, MAX_STEP]],
                                 [x1_ll_grid[i1][i2]],
                                 [x2_ll_grid[i1][i2]]
                                ])
        ll_grid[i1][i2] = log_likelihood (theta_i1_i2, l, X)
        
trace_ll_grid = Contour (x=x1_ll, y=x2_ll, z=ll_grid)
trace_thetas = Scatter (x=thetas[1, :], y=thetas[2, :], mode='markers+lines')
py.iplot ([trace_ll_grid, trace_thetas])

Numerical optimization via Newton's method

The fudge factor in gradient ascent might be troubling. Can you choose the step size or direction in a better or more principled way?

One idea is Newton's method, summarized below.

The basic idea, in 1-D

Suppose you start at a point $x$ and, assuming you are not yet at the optimum, you wish to take a step $x + p$ so that $f(x + p)$ is the maximum.

However, instead of trying to maximize $f(x + p)$ directly, let's replace $f(x + p)$ with some approximation $q(p)$, and then choose a $p$ to maximize $q(p)$. A simple choice for $q(p)$ is a quadratic function in $p$. This choice is motivated by two factors: (a) since it's quadratic, it should have some sort of extreme point (and hopefully an actual maximum), and (b) it is a higher-order approximation than a linear one, and so hopefully more accurate than a linear one as well.

$$ \begin{array}{rcl} f(x + p) & \approx & f(x) + p \dfrac{df}{dx} + \frac{1}{2} p^2 \dfrac{d^2 f}{dx^2} & \equiv & q(p). \end{array} $$

To maximize $q(p)$, take its derivative and then solve for the $p_*$ such that $q(p_*) = 0$:

$$ \begin{array}{rcl} \left.\dfrac{dq}{dp}\right|_{p=p_*} & = & \dfrac{df}{dx} + p_* \dfrac{d^2 f}{dx^2} = 0 \\ \implies p_* & = & -\left(\dfrac{d^2 f}{dx^2}\right)^{-1} \dfrac{df}{dx}. \end{array} $$

Generalizing to higher dimensions

To see how this procedure works in higher dimensions, you will need not only the gradient of $f(x)$, but also its Hessian, which is the moral equivalent of a second derivative.

Definition: the Hessian. Let $f(v)$ be a function that takes a vector $v$ of length $n$ as input and returns a scalar. The Hessian of $f(v)$ is an $n \times n$ matrix, $H_v(f)$, whose entries are all $n^2$ possible second derivatives with respect to the components of $v$. That is, the $(i, j)$ element of $H_v(f)$ is given by $h_{ij}$ such that

$$ \begin{array}{rcl} h_{ij} & \equiv & \dfrac{\partial^2}{\partial v_i \partial v_j} f(v). \end{array} $$

Armed with a Hessian, the Newton step is defined as follows, by direct analogy to the 1-D case. First, the Taylor series approximation of $f(x + p)$ for multidimensional variables becomes,

$$ \begin{array}{rcl} f(x + p) & \approx & f(x) + {p^T \, \nabla_x \, f} + {\frac{1}{2}\,p^T H_x(f) \, p} & \equiv & q(p). \end{array} $$

As in the 1-D case, we want to find an extreme point of $q(p)$. Taking its "derivative" (gradient), $\nabla_p q$, and setting it to 0 yields,

$$ \begin{array}{rcl} \nabla_p \, q(p) & = & \nabla_x \, f(x) + H_x(f) \, p = 0 \\ \implies H_x(f) \cdot p & = & -\, \nabla_x \, f(x). \end{array} $$

In other words, to choose the next step $p$, Newton's method suggests you must solve a system of linear equations, where the matrix is the Hessian of $f$ and the right-hand side is the negative gradient of $f$.

Summary: Newton's method

Summarizing the main ideas from above, Newton's method to maximize the scalar objective function $f(x)$ where $x$ is a vector, consists of the following steps:

  • Start with some initial guess $x(0)$.
  • At step $t$, compute the search direction $p(t)$ by solving $H_{x(t)}(f) \cdot p(t) = -\, \nabla_x \, f(x(t))$.
  • Compute a new (and hopefully improved) guess by the update, $x(t+1) \leftarrow x(t) + p(t)$.

Example: Newton's method for logistic regression

To perform MLE for the logistic regression model using Newton's method, you need both the gradient of the log-likelihood as well as the Hessian. You already know how to compute the gradient from the preceding exercises; so what about the Hessian?

Notationally, that calculation will be a little bit easier to write down and program with the following definition.

Definition: Elementwise product. Let $A \equiv (a_{ij})$ and $B \equiv (b_{ij})$ be $m \times n$ matrices. Denote the elementwise product of $A$ and $B$ by $A \odot B$. That is, if $C = A \odot B$, then element $c_{ij} = a_{ij} \cdot b_{ij}$.

If $A$ is $m \times n$ but $B$ is instead just $m \times 1$, then we will "auto-extend" $B$. Put differently, if $B$ has the same number of rows as $A$ but only 1 column, then we will take $C = A \odot B$ to have elements $c_{ij} = a_{ij} \cdot b_{i}$.

In Python, you can use np.multiply() for elementwise multiplication of Numpy arrays.


In [ ]:
A = np.array ([[1, 2, 3],
               [4, 5, 6]])
B = np.array ([[-1, 2, -3],
               [4, -5, 6]])

print np.multiply (A, B) # elementwise product
print np.multiply (A, B[:, 0:1]) # "auto-extend" version

Exercise. Show that the Hessian of the log-likelihood for logistic regression is,

$$ \begin{array}{rcl} H_{\theta} \left( \mathcal{L}(\theta; l, X) \right) & = & \left( X \odot G(X \theta) \right)^T \left( X \odot G(-X \theta) \right). \end{array} $$

Exercise. Implement a function to compute the Hessian of the log-likelihood.


In [ ]:
def hessian_log_likelihood (theta, l, X):
    """Returns the Hessian of the log-likelihood."""
    # @YOUSE: Implement the Hessian
    pass

Exercise. Complete the code below, which implements Newton's method.


In [ ]:
MAX_STEP = 10

# Get the data coordinate matrix, X, and labels vector, l
X = points
l = labels.astype (dtype=float)

# Store *all* guesses, for subsequent analysis
thetas_newt = np.zeros ((3, MAX_STEP+1))

for t in range (MAX_STEP):
    theta_t = thetas_newt[:, t:t+1]
    # @YOUSE: Fill in this code
    pass
    thetas_newt[:, t+1:t+2] = theta_t + delta_t

theta_newt = thetas_newt[:, MAX_STEP:]
print "Your (hand) solution:", theta.T.flatten ()
print "Computed solution:", theta_newt

num_correct_newt = count_correct_logreg (l, X, thetas_newt)
is_correct_newt = check_correct_logreg (l, X, thetas_newt[:, -1:])

print "\nNumber of misclassified points using MLE:", (len (points) - num_correct_newt[-1])
print "\n(Run the code cell below to visualize the results.)"

print "\n=== Comparisons ==="
print "\n\\theta_0/\\theta_2:", \
      "manual =", theta[0]/theta[2], \
      ", vs. Newton =", theta_newt[0]/theta_newt[2]
print "\n\\theta_1/\\theta_2:", \
      "manual =", theta[1]/theta[2], \
      ", vs. Newton =", theta_newt[1]/theta_newt[2]

In [ ]:
# Visually inspect the above results
traces_newt = make_2d_scatter_traces (points, is_correct_newt)
traces_newt.append (gen_lin_discr_trace (points, theta_newt))

# Plot it!
layout_newt = Layout (xaxis=dict (range=[-1.25, 2.25]),
                      yaxis=dict (range=[-3.25, 2.25]))
fig_newt = Figure (data=traces_newt, layout=layout_newt)
py.iplot (fig_newt)

In [ ]:
trace_thetas_newt = Scatter (x=thetas_newt[1, :], y=thetas_newt[2, :], mode='markers+lines')
py.iplot ([trace_ll_grid, trace_thetas_newt])

In [ ]:
I_gd = range (len (num_correct_gd))
trace_gd_mistakes = Scatter (x=I_gd, y=len (points) - num_correct_gd,
                              mode='markers+lines', name='Gradient descent'
                             )

I_newt = range (len (num_correct_newt))
trace_newt_mistakes = Scatter (x=I_newt, y=len (points) - num_correct_newt,
                               mode='markers+lines', name='Newton'
                              )

layout_mistakes = Layout (xaxis=dict (type='log'), yaxis=dict (type='log'))
fig_mistakes = Figure (data=[trace_gd_mistakes, trace_newt_mistakes],
                       layout=layout_mistakes)
py.iplot (fig_mistakes)

In [ ]: