In Lab 25, we looked at geometric solutions to the binary classification problem. In particular, the data is a set of labeled points in some vector space; our goal is to model how points are labeled by dividing (or partitioning) the space.
Recall that you did a quick experiment in which you partitioned the space for a set of synthetic data points manually; in this lab, you will see a technique to determine this partitioning automatically.
Some of today's code snippets use Plotly, so you may need to refer back to how to make plots in Plotly. The most important one is how to log-in to the service, for which you'll need to look up your Plotly API key. Anyway, you may need to refer to the references below.
Also, this lab builds on the iterative numerical optimization idea from Lab 24, which is known as gradient ascent or gradient descent (also, steepest ascent/descent), depending on whether one is maximizing or minimizing some quantity.
Lastly, several of the routines we used last time to manipulate points and visualize them have been moved into the cse6040utils
module; so please update that: cse6040utils.py
In [ ]:
import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import display
%matplotlib inline
In [ ]:
import plotly.plotly as py
from plotly.graph_objs import *
# @YOUSE: Fill in your credentials (user ID, API key) for Plotly here
py.sign_in ('USERNAME', 'APIKEY')
In [ ]:
%reload_ext autoreload
%autoreload 2
import cse6040utils
Recall the basic notation used in Lab 25.
Definition 1. There are $m$ labeled data points in a $d$-dimensional vector space. The $i$-th observation is represented by its label, $l_i$, and its augmented coordinates ("point"), $x_i \equiv (1.0, x_{i,1}, x_{i,2}, \ldots, x_{i,d})^T$; that is, the point $x_i$ is a column vector of length $d+1$ with a dummy value of 1.0 in the very first (0-th) entry. The matrix of data coordinates, $X$, stacks these points rowwise into a $m \times (d+1)$ matrix, i.e.,
$$ \begin{array}{rcl} X \equiv \left(\begin{array}{c} x_0^T \\ x_1^T \\ \vdots \\ x_{m-1}^T \end{array}\right) & = & \left(\begin{array}{ccccc} 1 & x_{0,1} & x_{0,2} & \cdots & x_{0,d} \\ 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ & & & \vdots & \\ 1 & x_{m-1,1} & x_{m-1,2} & \cdots & x_{m-1,d} \\ \end{array}\right). \end{array} $$We will take the labels to be a binary column vector, $l \equiv \left(l_0, l_1, \ldots, l_{m-1}\right)^T$.
Definition 2. A linear discriminant is a function that assigns a "score" to a given point $x$ by linearly transforming its coordinates. It may be specified by a $d+1$-dimensional column vector of coefficients, $\theta \equiv (\theta_0, \theta_1, \ldots, \theta_d)^T$; the score is then taken to be $\theta^T x$.
In the cse6040utils
module, use lin_discr (X, theta)
to evaluate a linear discriminant given by theta
on an augmented coordinate matrix X
.
In [ ]:
from cse6040utils import lin_discr
Definition 3. The heaviside function maps strictly positive values to the value 1 and non-positive values to 0:
$$ \begin{array}{rcl} H(y) & \equiv & \left\{\begin{array}{ll} 1 & \mathrm{if}\ y > 0 \\ 0 & \mathrm{if}\ y \leq 0 \end{array}\right.. \end{array} $$In the cse6040utils
module, use heaviside (Y)
to apply $H(y)$ elementwise to any Numpy multidimensional array object.
In [ ]:
from cse6040utils import heaviside
Definition 4. The logistic function maps a real number continuously into the interval (0, 1):
$$ \begin{array}{rcl} G(y) & \equiv & \frac{1}{1 + e^{-y}}. \end{array} $$Recall that $G(y) \rightarrow 0$ as $y \rightarrow -\infty$, and $G(y) \rightarrow 1$ as $y \rightarrow +\infty$.
In the cse6040utils
module, use logistic (Y)
to apply $G(y)$ elementwise to any Numpy multidimensional array object.
In [ ]:
from cse6040utils import logistic
Properties of the logistic function. The following is a handy list of properties of $G(y)$.
$$ \begin{array}{rcll} G(y) & = & \frac{e^y}{e^y + 1} & \mathrm{(P1)} \\ G(-y) & = & 1 - G(y) & \mathrm{(P2)} \\ \dfrac{G(y)}{G(-y)} = \dfrac{G(y)}{1 - G(y)} & = & e^y & \mathrm{(P3)} \\ \dfrac{dG}{dy} & = & G(y) G(-y) & \mathrm{(P4)} \\ {\dfrac{d}{dy}} {\left[ \ln G(y) \right]} & = & G(-y) & \mathrm{(P5)} \\ {\dfrac{d}{dy}} {\ln \left[ 1 - G(y) \right]} & = & -G(y) & \mathrm{(P6)} \\ \end{array} $$
In [ ]:
df = pd.read_csv ('http://vuduc.org/cse6040/logreg_points_train.csv')
In [ ]:
points = np.insert (df.as_matrix (['x_1', 'x_2']), 0, 1.0, axis=1)
labels = df.as_matrix (['label'])
In [ ]:
from cse6040utils import make_2d_scatter_traces
In [ ]:
print "Number of points:", len (points)
traces = make_2d_scatter_traces (points, labels)
py.iplot (traces)
In [ ]:
from cse6040utils import check_labels
from cse6040utils import np_col_vec
from cse6040utils import gen_lin_discr_trace
In [ ]:
#theta = np_col_vec ([0., -1., 3.])
#theta = np_col_vec ([-0.55, -2., -0.5])
theta = np_col_vec ([-1.35, -6.5, -1.])
# Generate 0/1 labels for your discriminant:
is_correct = check_labels (points, labels,
fun=lambda X: heaviside (lin_discr (X, theta)))
print "Number of misclassified points:", (len (points) - sum (is_correct))[0]
print "\n(Run the code cell below to visualize the results.)"
In [ ]:
# Visually inspect the above results
traces = make_2d_scatter_traces (points, is_correct)
traces.append (gen_lin_discr_trace (points, theta))
# Plot it!
layout = Layout (xaxis=dict (range=[-1.25, 2.25]),
yaxis=dict (range=[-3.25, 2.25]))
fig = Figure (data=traces, layout=layout)
py.iplot (fig)
In [ ]:
# Use Numpy's handy meshgrid() to create a regularly-spaced grid of values.
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html
x1 = np.linspace (-2., +2., 100)
x2 = np.linspace (-2., +2., 100)
x1_grid, x2_grid = np.meshgrid (x1, x2)
h_grid = heaviside (theta[0] + theta[1]*x1_grid + theta[2]*x2_grid)
trace_grid = Contour (x=x1, y=x2, z=h_grid)
py.iplot ([trace_grid])
In [ ]:
x_logit_1d = np.linspace (-6.0, +6.0, 101)
y_logit_1d = logistic (x_logit_1d)
trace_logit_1d = Scatter (x=x_logit_1d, y=y_logit_1d)
py.iplot ([trace_logit_1d])
In [ ]:
g_grid = logistic (theta[0] + theta[1]*x1_grid + theta[2]*x2_grid)
trace_logit_grid = Contour (x=x1, y=x2, z=g_grid)
py.iplot ([trace_logit_grid])
Previously, you determined $\theta$ for our synthetic dataset experimentally. Can you compute a good $\theta$ automatically? One of the standard techniques in statistics is to perform a maximum likelihood estimation (MLE) of a model's parameters, $\theta$.
Indeed, MLE is basis for the "statistical" way to derive the normal equations in the case of linear regression, though that is of course not how we encountered it in this class.
MLE derives from the following idea. Consider the joint probability of observing all of the labels, given the points and the parameters, $\theta$:
$$ \mathrm{Pr}[l\,|\,X, \theta]. $$Suppose these observations are independent and identically distributed (i.i.d.). Then the joint probability can be factored as the product of individual probabilities,
$$ \begin{array}{rcl} \mathrm{Pr}[l\,|\,X,\theta] = \mathrm{Pr}[l_0, \ldots, l_{m-1}\,|\,x_0, \ldots, x_{m-1}, \theta] & = & \mathrm{Pr}[l_0\,|\,x_0, \theta] \cdots \mathrm{Pr}[l_{m-1}\,|\,x_{m-1}, \theta] \\ & = & \displaystyle \prod_{i=0}^{m-1} \mathrm{Pr}[l_i\,|\,x_i,\theta]. \end{array} $$The maximum likelihood principle says that you should try to choose a parameter $\theta$ that maximizes the chances ("likelihood") of seeing these particular observations. Thus, we can simply reinterpret the preceding probability as an objective function to optimize. Mathematically, it is equivalent and convenient to consider the logarithm of the likelihood, or log-likelihood, as the objective function, defining it by,
$$ \begin{array}{rcl} \mathcal{L}(\theta; l, X) & \equiv & \log \left\{ \displaystyle \prod_{i=0}^{m-1} \mathrm{Pr}[l_i\,|\,x_i,\theta] \right\} \\ & = & \displaystyle \sum_{i=0}^{m-1} \log \mathrm{Pr}[l_i\,|\,x_i,\theta]. \end{array} $$We are using the symbol $\log$, which could be taken in any convenient base, such as the natural logarithm ($\ln y$) or the information theoretic base-two logarithm ($\log_2 y$).
The MLE procedure then consists of two steps:
Let's say you have decided that the logistic function, $G(\theta^T x_i) = G(x_i^T \theta)$, is a good model of the probability of producing a label $l_i$ given the point $x_i$. Under the i.i.d. assumption, we can interpret the label $l_i$ as being the result of a Bernoulli trial (e.g., a biased coin flip), where the probability of success ($l_i=1$) is defined as $g_i = g_i(\theta) \equiv G(x_i^T \theta)$. Thus,
$$ \begin{array}{rcl} \mathrm{Pr}[l_i \, | \, x_i, \theta] & \equiv & g_i^{l_i} \cdot \left(1 - g_i\right)^{1 - l_i}. \end{array} $$The log-likelihood in turn becomes,
$$ \begin{array}{rcl} \mathcal{L}(\theta; l, X) & = & \displaystyle \sum_{i=0}^{m-1} l_i \ln g_i + (1-l_i) \ln (1-g_i) \\ & = & \displaystyle \sum_{i=0}^{m-1} l_i \ln \dfrac{g_i}{1-g_i} + \ln (1-g_i) \\ & = & \sum_{i=0}^{m-1} l_i \theta^T x_i + \ln (1-g_i). \end{array} $$You can write the log-likelihood more compactly in a matrix-vector notation, with the following notational conventions.
Convention 1. Let $u \equiv (1, \ldots, 1)^T$ be a column vector of all ones, with its length inferred from context. Then, the sum of the coordinate vectors ($\{x_i\}$) could be written as
$$\sum_{i=0}^{m-1} x_i = \left(x_0\ x_1\ \cdots\ x_{m-1}\right) \cdot \left(\begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array}\right) = X^T u. $$Convention 2. Let $A = \left(a_{ij}\right)$ be any matrix and let $f(y)$ be any function that we have defined by default to accept a scalar argument $y$ and produce a scalar result. For instance, $f(y) = \ln y$ or $f(y) = G(y)$. Then, assume that $B = f(A)$ applies $f(\cdot)$ elementwise to $A$, returning a matrix $B$ whose elements $b_{ij} = f(a_{ij})$.
With these notational conventions, here are two different ways to write the log-likelihood.
Exercise. Show the following.
$$ \begin{array}{rrcl} (\mathrm{V1}) & \mathcal{L}(\theta; l, X) & = & l^T \ln G(X \theta) + (1-l)^T \ln [1 - G(X \theta)] \\ (\mathrm{V2}) & \mathcal{L}(\theta; l, X) & = & l^T X \theta + u^T \ln G(-X \theta) \end{array} $$Exercise. Implement the log-likelihood function in Python.
In [ ]:
def log_likelihood (theta, l, X):
# @YOUSE: Complete this function to evaluate the log-likelihood
pass
To optimize the log-likelihood with respect to the parameters, $\theta$, you'd like to do the moral equivalent of taking its derivative, setting it to zero, and then solving for $\theta$.
For example, recall that in the case of linear regression via least squares minimization, carrying out this process produced an analytic solution for the parameters, which was to solve the normal equations.
Unfortunately, for logistic regression---or for most log-likelihoods you are likely to ever write down---you cannot usually derive an analytic solution. Therefore, you will need to resort to numerical optimization procedures.
Suppose we wish to find the maximum of a scalar function $f(x)$ in one dimension, i.e., $x$ is also a scalar. At the maximum, $\dfrac{df(x)}{dx} = 0$.
Suppose instead that $\dfrac{df}{dx} \neq 0$ and consider the value of $f$ at a nearby point, $x + p$, as given approximately by a truncated Taylor series:
$$ \begin{array}{rcl} f(x + p) & \approx & f(x) + p \dfrac{df(x)}{dx} + \frac{1}{2} p^2 \dfrac{d^2 f(x)}{dx^2}. \end{array} $$To make progress toward maximizing $f(x)$, you'd like to choose $p$ so that $f(x+p) > f(x)$. One way is to choose $p=\alpha \cdot \mathrm{sign} \left(\dfrac{df}{dx}\right)$, where $0 < \alpha \ll 1$ is "small:"
$$ \begin{array}{rcl} f \left(x + \alpha \cdot \mathrm{sign} \dfrac{df}{dx} \right) & \approx & f(x) - \alpha \left|\dfrac{df}{dx}\right| + \frac{1}{2} \alpha^2 \cdot \mathrm{sign} \left(\dfrac{df}{dx}\right) \dfrac{d^2 f}{dx^2}. \end{array} $$If $\alpha$ is small enough, then you can neglect the $\mathcal{O}(\alpha^2)$ term and $f(x + p)$ will be larger than $f(x)$, thus making progress toward a maximum.
This scheme is the basic idea: starting from some initial guess $x$, refine the guess by taking a small step $p$ in the direction of the derivative, i.e., $\mathrm{sign} \dfrac{df}{dx}$.
Now consider $f(x)$ where $x$ is instead a vector, rather than a scalar. Then the value of $f$ at a nearby point $f(x + p)$, where $p$ is a vector, becomes
$$ \begin{array}{rcl} f(x + p) \approx f(x) + p^T \nabla_x f(x), \end{array} $$where $\nabla_x f(x)$ is the vector derivative, or gradient, of $f$. Just as in the 1-D case, you want a step $p$ such that $f(x + p) > f(x)$. To make as much progress as possible, it would seem reasonable to choose $p$ to be parallel to $\nabla_x\,f(x)$, that is, proportional to the gradient. This intuition motivates the following choice of $p$:
$$ \begin{array}{rcl} p \equiv \alpha \frac{\nabla_x\,f(x)}{\|\nabla_x\,f(x)\|}. \end{array} $$Again, $\alpha$ is a kind of fudge factor. You need to choose it to be small enough that the high-order terms of the Taylor approximation become negligible, yet large enough that you can make reasonable progress.
The procedure applied to maximizing the log-likelihood is as follows.
This procedure should smell eerily like the one in Lab 24! And just as in Lab 24, the tricky bit is how to choose $\alpha$.
One additional and slight distinction between this procedure and the Lab 24 procedure is that here we are optimizing using the full dataset, rather than processing data points one at a time. (That is, the step iteration variable $t$ used above is not used in exactly the same way as the step iteration $k$ was used in Lab 24.)
Another question is, how do we know this procedure will converge to the global maximum, rather than, say, a local maximum? For that you need a deeper analysis of a specific $\mathcal{L}(\theta; l, X)$, to show, for instance, that it is convex in $\theta$.
Exercise. Show the following:
$$ \begin{array}{rcl} \nabla_\theta \left\{\mathcal{L}(\theta; l, X)\right\} & = & X^T \left[ l - G(X \cdot \theta)\right]. \end{array} $$Exercise. Implement the gradient ascent procedure to determine $\theta$, and try it out on the sample data.
In your solution, we'd like you to store all guesses in the matrix
thetas
, so that you can later see how the $\theta(t)$ values evolve. To extract a particular columnt
, use the notation,theta[:, t:t+1]
. This notation is necessary to preserve the "shape" of the column as a column vector.
In [ ]:
def gradient_log_likelihood (theta, l, X):
"""Returns the gradient of the log-likelihood."""
# @YOUSE: Implement the gradient for the logistic regression
# model's log-likelihood
pass
In [ ]:
MAX_STEP = 500
ALPHA = 0.5
# Get the data coordinate matrix, X, and labels vector, l
X = points
l = labels.astype (dtype=float)
# Store *all* guesses, for subsequent analysis
thetas = np.zeros ((3, MAX_STEP+1))
for t in range (MAX_STEP):
theta_t = thetas[:, t:t+1]
# @YOUSE: Fill in the code to compute thetas[:, t+1:t+2]
pass
thetas[:, t+1:t+2] = theta_t + ALPHA*delta_t
theta_gd = thetas[:, MAX_STEP:]
print "Your (hand) solution:", theta.T.flatten ()
print "Computed solution:", theta_gd
print "\n=== Comparisons ==="
print "\n\\theta_0/\\theta_2:", \
"manual =", theta[0]/theta[2], \
", vs. MLE (via gradient ascent) =", theta_gd[0]/theta_gd[2]
print "\n\\theta_1/\\theta_2:", \
"manual =", theta[1]/theta[2], \
", vs. MLE (via gradient ascent) =", theta_gd[1]/theta_gd[2]
In [ ]:
# Generate 0/1 labels for computed discriminant using the logistic function
def gen_label_logreg (X, theta):
return heaviside (logistic (lin_discr (X, theta)) - 0.5)
def check_correct_logreg (l, X, theta):
return check_labels (X, l, fun=lambda X: gen_label_logreg (X, theta))
def count_correct_logreg (l, X, thetas):
num_steps = thetas.shape[1]
num_correct = np.zeros (num_steps, dtype=int)
for t in range (num_steps):
theta_t = thetas[:, t:t+1]
is_correct = check_correct_logreg (l, X, theta_t)
num_correct[t] = sum (is_correct)[0]
return num_correct
num_correct_gd = count_correct_logreg (l, X, thetas)
is_correct_gd = check_correct_logreg (l, X, thetas[:, -1:])
print "Number of misclassified points using MLE via gradient ascent:", \
(len (points) - num_correct_gd[-1])
print "\n(Run the code cell below to visualize the results.)"
In [ ]:
# Visually inspect the above results
traces_gd = make_2d_scatter_traces (points, is_correct_gd)
traces_gd.append (gen_lin_discr_trace (points, theta_gd))
# Plot it!
layout_gd = Layout (xaxis=dict (range=[-1.25, 2.25]),
yaxis=dict (range=[-3.25, 2.25]))
fig_gd = Figure (data=traces_gd, layout=layout_gd)
py.iplot (fig_gd)
Exercise. Make a contour plot of the log-likelihood and draw the trajectory taken by the $\theta(t)$ values laid on top of it.
In particular, your function, log_likelihood (theta, l, X)
, should return the value of the log-likelihood given a column vector of discriminant coefficients, theta
; a column vector of labels, l
; and a coordinate matrix, X
.
In [ ]:
n1_ll = 100
x1_ll = np.linspace (-20., 0., n1_ll)
n2_ll = 100
x2_ll = np.linspace (-20., 0., n2_ll)
x1_ll_grid, x2_ll_grid = np.meshgrid (x1_ll, x2_ll)
ll_grid = np.zeros ((n1_ll, n2_ll))
for i1 in range (n1_ll):
for i2 in range (n2_ll):
theta_i1_i2 = np.array ([[thetas[0, MAX_STEP]],
[x1_ll_grid[i1][i2]],
[x2_ll_grid[i1][i2]]
])
ll_grid[i1][i2] = log_likelihood (theta_i1_i2, l, X)
trace_ll_grid = Contour (x=x1_ll, y=x2_ll, z=ll_grid)
trace_thetas = Scatter (x=thetas[1, :], y=thetas[2, :], mode='markers+lines')
py.iplot ([trace_ll_grid, trace_thetas])
The fudge factor in gradient ascent might be troubling. Can you choose the step size or direction in a better or more principled way?
One idea is Newton's method, summarized below.
Suppose you start at a point $x$ and, assuming you are not yet at the optimum, you wish to take a step $x + p$ so that $f(x + p)$ is the maximum.
However, instead of trying to maximize $f(x + p)$ directly, let's replace $f(x + p)$ with some approximation $q(p)$, and then choose a $p$ to maximize $q(p)$. A simple choice for $q(p)$ is a quadratic function in $p$. This choice is motivated by two factors: (a) since it's quadratic, it should have some sort of extreme point (and hopefully an actual maximum), and (b) it is a higher-order approximation than a linear one, and so hopefully more accurate than a linear one as well.
$$ \begin{array}{rcl} f(x + p) & \approx & f(x) + p \dfrac{df}{dx} + \frac{1}{2} p^2 \dfrac{d^2 f}{dx^2} & \equiv & q(p). \end{array} $$To maximize $q(p)$, take its derivative and then solve for the $p_*$ such that $q(p_*) = 0$:
$$ \begin{array}{rcl} \left.\dfrac{dq}{dp}\right|_{p=p_*} & = & \dfrac{df}{dx} + p_* \dfrac{d^2 f}{dx^2} = 0 \\ \implies p_* & = & -\left(\dfrac{d^2 f}{dx^2}\right)^{-1} \dfrac{df}{dx}. \end{array} $$Definition: the Hessian. Let $f(v)$ be a function that takes a vector $v$ of length $n$ as input and returns a scalar. The Hessian of $f(v)$ is an $n \times n$ matrix, $H_v(f)$, whose entries are all $n^2$ possible second derivatives with respect to the components of $v$. That is, the $(i, j)$ element of $H_v(f)$ is given by $h_{ij}$ such that
$$ \begin{array}{rcl} h_{ij} & \equiv & \dfrac{\partial^2}{\partial v_i \partial v_j} f(v). \end{array} $$Armed with a Hessian, the Newton step is defined as follows, by direct analogy to the 1-D case. First, the Taylor series approximation of $f(x + p)$ for multidimensional variables becomes,
$$ \begin{array}{rcl} f(x + p) & \approx & f(x) + {p^T \, \nabla_x \, f} + {\frac{1}{2}\,p^T H_x(f) \, p} & \equiv & q(p). \end{array} $$As in the 1-D case, we want to find an extreme point of $q(p)$. Taking its "derivative" (gradient), $\nabla_p q$, and setting it to 0 yields,
$$ \begin{array}{rcl} \nabla_p \, q(p) & = & \nabla_x \, f(x) + H_x(f) \, p = 0 \\ \implies H_x(f) \cdot p & = & -\, \nabla_x \, f(x). \end{array} $$In other words, to choose the next step $p$, Newton's method suggests you must solve a system of linear equations, where the matrix is the Hessian of $f$ and the right-hand side is the negative gradient of $f$.
Summarizing the main ideas from above, Newton's method to maximize the scalar objective function $f(x)$ where $x$ is a vector, consists of the following steps:
To perform MLE for the logistic regression model using Newton's method, you need both the gradient of the log-likelihood as well as the Hessian. You already know how to compute the gradient from the preceding exercises; so what about the Hessian?
Notationally, that calculation will be a little bit easier to write down and program with the following definition.
Definition: Elementwise product. Let $A \equiv (a_{ij})$ and $B \equiv (b_{ij})$ be $m \times n$ matrices. Denote the elementwise product of $A$ and $B$ by $A \odot B$. That is, if $C = A \odot B$, then element $c_{ij} = a_{ij} \cdot b_{ij}$.
If $A$ is $m \times n$ but $B$ is instead just $m \times 1$, then we will "auto-extend" $B$. Put differently, if $B$ has the same number of rows as $A$ but only 1 column, then we will take $C = A \odot B$ to have elements $c_{ij} = a_{ij} \cdot b_{i}$.
In Python, you can use np.multiply()
for elementwise multiplication of Numpy arrays.
In [ ]:
A = np.array ([[1, 2, 3],
[4, 5, 6]])
B = np.array ([[-1, 2, -3],
[4, -5, 6]])
print np.multiply (A, B) # elementwise product
print np.multiply (A, B[:, 0:1]) # "auto-extend" version
Exercise. Show that the Hessian of the log-likelihood for logistic regression is,
$$ \begin{array}{rcl} H_{\theta} \left( \mathcal{L}(\theta; l, X) \right) & = & \left( X \odot G(X \theta) \right)^T \left( X \odot G(-X \theta) \right). \end{array} $$Exercise. Implement a function to compute the Hessian of the log-likelihood.
In [ ]:
def hessian_log_likelihood (theta, l, X):
"""Returns the Hessian of the log-likelihood."""
# @YOUSE: Implement the Hessian
pass
Exercise. Complete the code below, which implements Newton's method.
In [ ]:
MAX_STEP = 10
# Get the data coordinate matrix, X, and labels vector, l
X = points
l = labels.astype (dtype=float)
# Store *all* guesses, for subsequent analysis
thetas_newt = np.zeros ((3, MAX_STEP+1))
for t in range (MAX_STEP):
theta_t = thetas_newt[:, t:t+1]
# @YOUSE: Fill in this code
pass
thetas_newt[:, t+1:t+2] = theta_t + delta_t
theta_newt = thetas_newt[:, MAX_STEP:]
print "Your (hand) solution:", theta.T.flatten ()
print "Computed solution:", theta_newt
num_correct_newt = count_correct_logreg (l, X, thetas_newt)
is_correct_newt = check_correct_logreg (l, X, thetas_newt[:, -1:])
print "\nNumber of misclassified points using MLE:", (len (points) - num_correct_newt[-1])
print "\n(Run the code cell below to visualize the results.)"
print "\n=== Comparisons ==="
print "\n\\theta_0/\\theta_2:", \
"manual =", theta[0]/theta[2], \
", vs. Newton =", theta_newt[0]/theta_newt[2]
print "\n\\theta_1/\\theta_2:", \
"manual =", theta[1]/theta[2], \
", vs. Newton =", theta_newt[1]/theta_newt[2]
In [ ]:
# Visually inspect the above results
traces_newt = make_2d_scatter_traces (points, is_correct_newt)
traces_newt.append (gen_lin_discr_trace (points, theta_newt))
# Plot it!
layout_newt = Layout (xaxis=dict (range=[-1.25, 2.25]),
yaxis=dict (range=[-3.25, 2.25]))
fig_newt = Figure (data=traces_newt, layout=layout_newt)
py.iplot (fig_newt)
In [ ]:
trace_thetas_newt = Scatter (x=thetas_newt[1, :], y=thetas_newt[2, :], mode='markers+lines')
py.iplot ([trace_ll_grid, trace_thetas_newt])
In [ ]:
I_gd = range (len (num_correct_gd))
trace_gd_mistakes = Scatter (x=I_gd, y=len (points) - num_correct_gd,
mode='markers+lines', name='Gradient descent'
)
I_newt = range (len (num_correct_newt))
trace_newt_mistakes = Scatter (x=I_newt, y=len (points) - num_correct_newt,
mode='markers+lines', name='Newton'
)
layout_mistakes = Layout (xaxis=dict (type='log'), yaxis=dict (type='log'))
fig_mistakes = Figure (data=[trace_gd_mistakes, trace_newt_mistakes],
layout=layout_mistakes)
py.iplot (fig_mistakes)
In [ ]: