Logistic Regression

two-class/binary-class classification problems: $y$ is a discrete value (e.g., spam or not spam, fraud or not fraud, etc.)

$y \in {0,1}$, 0: "Negative Class" (e.g., benign), 1: "Positive Class" (e.g., malignant)

could use linear regression ($h_{\theta}(x) = \theta^{T}x$) with a particular threshold value; however, for example, extreme $x$ values can skew linear regression, suggesting a different threshold value that would go against the training samples; also, linear regression hypotheses can output values less than 0 and greater than 1 (even when all training samples have $y$ values of 0 or 1

logistic regression (classification algorithm) model:
want $0 \leq h_{\theta}(x) \leq 1$

Sigmoid/Logistic function: $h_{\theta}(x) = g(\theta^{T}x)$, where $g(z) = \frac{1}{1+e^{-z}}$, where $z \in \mathbb{R}$

thus, $h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$

interpretation of hypothesis output:
$h_{\theta}(x) =$ estimated probability that $y = 1$ on input $x$

for example:
for the following features x, $x = \begin{bmatrix} x{0} \ x{1}

\end{bmatrix}

\begin{bmatrix} 1 \\ v \end{bmatrix}

$

$h_{\theta}(x) = 0.7$ (means for those features, the estimated probability that $y = 1$ is 0.7; i.e., there is a 70% chance that y = 1$

$h_{\theta}(x) = P(y=1 \mid x;\theta)$ (p of y = 1 [probability that y = 1] given x parameterized by theta)

$P(y=0 \mid x;\theta) + P(y=1 \mid x;\theta) = 1$
$P(y=0 \mid x;\theta) = 1 - P(y=1 \mid x;\theta)$

thus, $h_{\theta}(x) = g(\theta^{T}x) = P(y=1 \mid x;\theta)$

the sigmoid


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace(-2, 2)
plt.plot(x,(1/(1+np.e**-x)))
plt.title("Sigmoid Function $g(z)$")
plt.xlabel('z')
plt.show()


suppose:
predict $y = 1$ if $h_{\theta}(x) \geq 0.5$
predict $y = 0$ if $h_{\theta}(x) \lt 0.5$

$g(z) \geq 0.5$ when $z \geq 0$
thus: $h_{\theta}(x) = g(\theta^{T}x) \geq 0.5$,
whenever $\theta^{T}x \geq 0$

$g(z) \lt 0.5$ when $z \lt 0$
and: $h_{\theta}(x) = g(\theta^{T}x) \lt 0.5$,
whenever $\theta^{T}x \lt 0$

decision boundary is a property of the hypothesis (not of training set); the training set is used to fit the parameters of $\theta$

example:

$h_{\theta}(x) = g(\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2})$

$\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \\ \end{bmatrix}$

predict $y = 1$ if $\theta^{T}x = -3 + x_{1} + x_{2} \geq 0$

decision boundary: $h_{\theta}(x) = 0.5$ when $x_{1} + x_{2} = 3$

can also write: $x_{1} + x_{2} \geq 3$

which means $x_{1} + x_{2} \lt 3 \implies y = 0$

y = 0 (half plane, region) y = 1 (half plane, region)


In [3]:
x = np.linspace(0, 3)

plt.scatter([1,0.5,0], [1,2,0.5], marker='o')
plt.scatter([3,2,2.5], [1,2.5,2], color='r', marker='x')

plt.plot(x,(-x + 3))
plt.title("Sigmoid Function $g(z)$")
plt.xlabel('z')
plt.show()


non-linear decision boundaries

$h_{\theta}(x) = g(\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{1}^{2} + \theta_{4}x_{2}^{2})$

$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

predict $y = 1$ if $-1 + x_{1}^{2} + x_{2}^{2} \geq 0$
$\implies x_{1}^{2} + x_{2}^{2} \geq 1$

decision boundary: $x_{1}^{2} + x_{2}^{2} = 1$ (a circle)


In [4]:
x1 = np.linspace(-1, 1)

plt.scatter([0.2,0.5,0,0.3,-0.8], [0,-0.5,0.4,-0.3,0], marker='o')
plt.scatter([1.2,0.8,0,0,-0.8], [0,0.8,1.1,-1.3,0.8], color='r', marker='x')

plt.plot(x1,(np.sqrt(-x1**2 + 1)))
plt.plot(x1,-(np.sqrt(-x1**2 + 1)), color='b')
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()


it's also possible for more complex shapes, more complex decision boundaries

from an elipsis to an irregular shape

cost function

supervised learning problem of fitting a logistic regression model

training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$

$m$ training examples: $x \in \begin{bmatrix}x_{0}\\x_{1}\\...\\x_{n}\end{bmatrix}$, where $x_{0} = 1$ and $y \in \{0,1\}$

$h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$

given this training set, how to choose/fit parameters $\theta$?

for linear regression:
$J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}Cost(h_{\theta}(x^{(i)}),y^{(i)})$, where $Cost(h_{\theta}(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$, where $(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$ is the squared-error term

however, this cost function doesn't work well for logistic regression because it is a non-convex function of the parameters theta (i.e., with many local optima) because of the sigmoid component (in the hypothesis) being a relatively complicated non-linearity

gradient descent, on that sort of function, would not be guaranteed to converge on the global minimum

thus, a logistic regression cost function (for a single training example):
$Cost(h_{\theta}(x),y) = \begin{cases} -\log{h_{\theta}(x)} & \text{if } y = 1 \\ -\log{1 - h_{\theta}(x)} & \text{if } y = 0 \end{cases}$

note: $y = 0$ or $1$ always


In [5]:
x = np.linspace(0, 1)

plt.plot(x,-(np.log(x)))
#plt.plot(x,-(np.log(1/(1+np.e**-x))))

plt.title("if $y = 1$")
plt.xlabel(r"$h_{\theta}(x)$")
plt.show()


this cost function has desirable properties

$Cost = 0$ if $y = 1, h_{\theta}(x) = 1$ (i.e., if hypothesis predict h = 1 and y is exactly equal to the prediction)

but, as $h_{\theta}(x) \rightarrow 0$, $Cost \rightarrow \infty$, which captures intuition that if $h_{\theta}(x) = 0$ (predict $P(y=1 \mid x;\theta) = 0$, but $y = 1$ (i.e., the chance that y = 1 is 0; e.g., the probability that the email is spam is 0 [i.e., impossible]; if the email is spam, then the learning algorithm is penalized with a very large cost)


In [6]:
x = np.linspace(0, 1)

plt.plot(x,-(np.log(1 - x)))
#plt.plot(x,-(np.log(1/(1+np.e**-x))))

plt.title("if $y = 0$")
plt.xlabel(r"$h_{\theta}(x)$")
plt.show()


for the case of y = 0, it's saying that if y ends up equaling 0 when it was predicted that it equals 1, a very large cost is paid (while the cost is 0 when the hypothesis accurately predicts that y = 0)

with this choice of cost function (which is convex and local-optima free), we have a convex optimization problem

$Cost(h_{\theta}(x),y) = \begin{cases} -\log{h_{\theta}(x)} & \text{if } y = 1 \\ -\log{1 - h_{\theta}(x)} & \text{if } y = 0 \end{cases}$

simplified cost function (compressed into one equation since y is always 0 or 1):

$Cost(h_{\theta}(x),y) = -y\log{(h_{\theta}(x))} - (1 - y)\log{(1 - h_{\theta}(x))}$

equivalent since:
if $y = 1: -(1)\log{(h_{\theta}(x))} - (1 - (1))\log{(1 - h_{\theta}(x))} \\ = -\log{(h_{\theta}(x))} - (0)\log{(1 - h_{\theta}(x))} = -\log{(h_{\theta}(x))}$

if $y = 0: -(0)\log{(h_{\theta}(x))} - (1 - (0))\log{(1 - h_{\theta}(x))} \\ = -(1)\log{(1 - h_{\theta}(x))} = -\log{(1 - h_{\theta}(x))}$

so, final logistic regression cost function:

$J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}Cost(h_{\theta}(x^{(i)}),y^{(i)}) \\ = -\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{h_{\theta}(x^{(i)})} + (1 - y^{(i)})\log{(1 - h_{\theta}(x^{(i)}))}\big]$

brought out minus sign; also, this particular function was chosen because it can be derived from statistics using the principle of maximum likelihood estimation (how to find paramaters theta for particular models)

to fit parameters $\theta$:
$\underset{\theta}{\text{min}}$ $J(\theta)$, which will give some set of parameters $\theta$

to make a prediction given a new set of features $x$:
output a prediction $h_{\theta(x)} = \frac{1}{1+e^{-\theta^{T}x}}$, interpreted as $P(y=1 \mid x;\theta)$

so, how to minimize $J(\theta)$ as a function of $\theta$ (i.e., fit the parameters to the training set)

gradient descent for logistic regression

$J(\theta) = -\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{h_{\theta}(x^{(i)})} + (1 - y^{(i)})\log{(1 - h_{\theta}(x^{(i)}))}\big]$

$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{\partial}{\partial\theta_{j}} \big(-\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{h_{\theta}(x^{(i)})} + (1 - y^{(i)})\log{(1 - h_{\theta}(x^{(i)}))}\big]\big)$

for example:
$\frac{\partial}{\partial\theta_{0}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{0}} \big(-\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(i)} + \theta_{1} \cdot x_{1}^{(i)})}}\big)} + (1 - y^{(i)})\log{(1 - \big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(i)} + \theta_{1} \cdot x_{1}^{(i)})}}\big))}\big]\big)$

STILL NEED TO CALCULATE DERIVATIVE

for example, where $i = 0$:
$\frac{\partial}{\partial\theta_{0}} \big(-\frac{1}{m}\big[y^{(0)}\log{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big)} + (1 - y^{(0)})\log{(1 - \big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big))}\big]\big)$

$= -\frac{1}{m}y^{(0)} \frac{\partial}{\partial\theta_{0}} \big(\log{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big)} + (1 - y^{(0)})\log{(1 - \big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big))}\big)$

$= -\frac{1}{m}y^{(0)} \frac{\partial}{\partial\theta_{0}} \big(\log{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big)}\big) + \frac{\partial}{\partial\theta_{0}} \big((1 - y^{(0)})\log{(1 - \big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big))}\big)$

$= -\frac{1}{m}y^{(0)} \frac{\partial}{\partial\theta_{0}} \big(\log{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big)}\big) + (1 - y^{(0)}) \frac{\partial}{\partial\theta_{0}} \big(\log{(1 - \big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big))}\big)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{1}{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big)}\Bigg) \cdot \frac{\partial}{\partial\theta_{0}} \Bigg(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1}{\big(1 - \frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\big)}\Bigg) \cdot \frac{\partial}{\partial\theta_{0}} \Bigg(1 - \frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}\Bigg) \cdot \frac{\partial}{\partial\theta_{0}} \Bigg(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot \frac{\partial}{\partial\theta_{0}} \Bigg(1 - \frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}\Bigg) \cdot \Bigg(\frac{0 \cdot (1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1 \cdot (0 + e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot \Bigg(0 - \frac{0 \cdot (1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1 \cdot (0 + e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}\Bigg) \cdot \Bigg(\frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot \Bigg(- \frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot \Bigg(- \frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1}{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) \cdot \Bigg(- \frac{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) + (1 - y^{(0)}) \Bigg(- \frac{x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{x_{0}^{(0)}}{e^{(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot (1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})})}\Bigg) + (1 - y^{(0)}) \Bigg(- \frac{x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{x_{0}^{(0)}}{1+e^{(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) + (1 - y^{(0)}) \Bigg(- \frac{x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m}y^{(0)} \Bigg(\frac{x_{0}^{(0)}}{1+e^{(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) - \Bigg(\frac{(1 - y^{(0)})x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m} \Bigg(\frac{x_{0}^{(0)}y^{(0)}}{1+e^{(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) - \Bigg(\frac{x_{0}^{(0)} + x_{0}^{(0)}y^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)$

$= -\frac{1}{m} \frac{x_{0}^{(0)}y^{(0)}}{1+e^{(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}} + \frac{-x_{0}^{(0)} - x_{0}^{(0)}y^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}$


$= -\frac{1}{m}y^{(0)} \Bigg(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}\Bigg) \cdot \Bigg(\frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot \Bigg(- \frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg)$

$= \Bigg(\frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg) \Bigg[-\frac{1}{m}y^{(0)} \Bigg(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}\Bigg) + (1 - y^{(0)}) \Bigg(\frac{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot (-1)\Bigg]$

$= \Bigg(\frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{[1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}]^{2}}\Bigg) \Bigg(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}\Bigg) \Bigg[-\frac{1}{m}y^{(0)} + (1 - y^{(0)}) \Bigg(\frac{1}{(1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}) - 1}\Bigg) \cdot (-1)\Bigg]$

$= \Bigg(\frac{-(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot -[x_{0}^{(0)}])}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) \Bigg[-\frac{1}{m}y^{(0)} + (1 - y^{(0)}) \Bigg(\frac{1}{(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) \cdot (-1)\Bigg]$

$= \Bigg(\frac{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) \Bigg[-\frac{1}{m}y^{(0)} + (1 - y^{(0)}) \Bigg(-\frac{1}{(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)\Bigg]$

$= \Bigg(\frac{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) \Bigg[-\frac{1}{m}y^{(0)} + \Bigg(-\frac{1 - y^{(0)}}{(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg)\Bigg]$

$= \Bigg(\frac{e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})} \cdot x_{0}^{(0)}}{1+e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg) \Bigg[-\frac{1}{m}y^{(0)} - \frac{1 - y^{(0)}}{(e^{-(\theta_{0} \cdot x_{0}^{(0)} + \theta_{1} \cdot x_{1}^{(0)})}}\Bigg]$

want $\underset{\theta}{\text{min}}$ $J(\theta)$

repeat until convergence {
$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta) = \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$
}

the algorithm looks identical to the one used for linear regression; the hypothesis is the only difference

a vectorized implemention:

of the form: $\theta := \theta - \alpha \delta$, for some vector $d \in \mathbb{R}^{n+1}$

$\theta := \theta - \alpha \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} = \theta - \alpha\delta$

so, for optimization, given $\theta$, need to compute:

  • $J(\theta)$
  • $\frac{\partial}{\partial\theta_{j}} J(\theta)$, for ($j = 0, 1, ..., n$)

and then give those to an optimization alogorithm

more advanced algorithms:

  • conjugate gradient
  • BFGS
  • L-BFGS

advantages:

  • usually do not need to pick manually the learning rate alpha (they have a clever inner-loop called a line-search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good one, even a different one for each iteration)
  • they do other things too, often making them faster than gradient descent

disadvantage:

  • more complex than gradient descent

don't try to implement yourself (just like you don't implement your own square-root function)

example:

given a $\theta$, where $\theta = \begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix}$

given a cost function: $J(\theta) = (\theta_{0} - 5)^{2} + (\theta_{1} - 5)^{2}$

then:
$\frac{\partial}{\partial \theta_{0}} J(\theta) = 2(\theta_{0} - 5)$
$\frac{\partial}{\partial \theta_{1}} J(\theta) = 2(\theta_{1} - 5)$

(for logistic regression, use optimization solvers that find the minimum of unconstrained functions since logistic regression does not have any constraints [as $\theta$ can take any real value])

using SciPy's optimization package:

Nelder-Mead Simplex algorithm


In [7]:
# Nelder-Mead Simplex algorithm
from scipy.optimize import fmin

theta = np.array([[0],
                  [0]])

def J(theta):
    return (theta[0] - 5)**2.0 + (theta[1] - 5)**2.0

theta_optimized = fmin(J, theta, xtol=1e-8)
print theta_optimized

print "\n"

theta_optimized = fmin(J, theta)
print theta_optimized


Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 96
         Function evaluations: 170
[ 5.  5.]


Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 56
         Function evaluations: 95
[ 4.99975895  5.00013968]

Broyden-Fletcher-Goldfarb-Shanno algorithm


In [8]:
# Broyden-Fletcher-Goldfarb-Shanno algorithm
from scipy.optimize import fmin_bfgs

theta = np.array([[0],
                  [0]])

def J(theta):
    return (theta[0] - 5)**2.0 + (theta[1] - 5)**2.0

delta = np.array([[0], 
                  [1]])

def deltas(theta):
    delta[0] = 2.0*(theta[0] - 5)
    delta[1] = 2.0*(theta[1] - 5)
    return delta.flatten()

# gradient estimated
theta_optimized = fmin_bfgs(J, theta)
print theta_optimized

print "\n"

# gradient provided
theta_optimized = fmin_bfgs(J, theta, fprime=deltas)
print theta_optimized


Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 2
         Function evaluations: 16
         Gradient evaluations: 4
[ 4.99999999  4.99999999]


Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 1
         Function evaluations: 3
         Gradient evaluations: 3
[ 4.9995  4.9995]

multiclass classification

for example:
email foldering/tagging: work ($y=1$), friends ($y=2$), family ($y=3$), hobby ($y=4$)

medical diagrams: not ill ($y=0$), cold ($y=1$), flu ($y=2$)

weather: sunny, cloudy, rain, snow

one-vs-all (one-vs-rest)

train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y=1$

on a new input $x$, to male a prediction, pick the class $i$ that maximizes (i.e., gives highest probability):
$\underset{i}{\text{max}}$ $h_{\theta}^{(i)}(x))$


In [10]:
x = np.linspace(0, 3)

plt.scatter([1,0.5,0,0.5], [1,2,0.5,1.0], marker='o')
plt.scatter([3,2,2.5,2.3], [1,2.5,2,1.5], color='r', marker='x')

plt.plot(x,(-x + 3))
plt.title("Binary Classification")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()



In [11]:
plt.scatter([1,0.7,0,0.5], [2,2.5,2,2.3], color='g', marker='^')
plt.scatter([1,0.5,0,0.5], [1,0.5,0.5,1.0], color='b', marker='s')
plt.scatter([3,2.6,2.5,2.3], [1,1.6,1.3,1.5], color='r', marker='x')

plt.legend(["Class 1","Class 2","Class 3"])
plt.title("Multi-class Classification")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()


turn into three binary class classification problems

create a "fake" training set where classes 2 and 3 get assigned to the negative class (value 0), and class 1 gets assigned to the positive class (value 1)

fit a classifier $h_{\theta}^{(i)}(x)$

train a regular logistic regression classifier to compute a decision boundary


In [12]:
plt.scatter([1,0.7,0,0.5], [2,2.5,2,2.3], color='g', marker='^')
plt.scatter([1,0.5,0,0.5,3,2.6,2.5,2.3], [1,0.5,0.5,1.0,1,1.6,1.3,1.5], color='k', marker='o')

x = np.linspace(0,3)

plt.plot(x,((x/2) + 1))

plt.title(r"$h_{\theta}^{(1)}(x)$")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()



In [13]:
plt.scatter([1,0.5,0,0.5], [1,0.5,0.5,1.0], color='b', marker='s')
plt.scatter([3,2.6,2.5,2.3,1,0.7,0,0.5], [1,1.6,1.3,1.5,2,2.5,2,2.3], color='k', marker='o')

x = np.linspace(0,3)

plt.plot(x,((-x/2) + 1.7))

plt.title(r"$h_{\theta}^{(2)}(x)$")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()



In [14]:
plt.scatter([3,2.6,2.5,2.3], [1,1.6,1.3,1.5], color='r', marker='x')
plt.scatter([1,0.7,0,0.5,1,0.5,0,0.5], [2,2.5,2,2.3,1,0.5,0.5,1.0], color='k', marker='o')

x.shape = (50, )
x *= 0 
x += 1.5
y = np.linspace(0,3)

plt.plot(x,y)

plt.title(r"$h_{\theta}^{(3)}(x)$")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()


Regularization

underfitting / "high bias"
(e.g., a linear model that does not fit a training set well; i.e., there is a strong bias/preconception even in the face of contrary data that the model should be linear)
(e.g., for logistic regression, a simple sigmoid function model)

overfitting / "high variance"
(e.g., a 4th-order polynomial model, which might fit a curve that pases through all training samples; i.e., there is too much variance, almost as if it could fit any data (it tries too hard); thus, it would fail to generalize to new examples)
(e.g., for logistic regression, a complicated high-order polynomial that fits all training samples)

deciding if overfitting: if a lot of features, it's very difficult to plot/visualize to see if overfitting

addressing overfitting:

  1. reduce number of features, especially if not much training data (disadvantage: throwing away information about a problem)
    • manually select which features to keep
    • model selection algorithm (later in course)
  2. regularization
    • keep all the features, but reduce the magnitude/values of parameters $\theta_{j}$
    • works well when we have a lot of features, each of which contributes a bit to predicting $y$

regularization: cost function

penalize and make $\theta$ parameter values really small

  • this will correspond to a "simpler" hypothesis
  • less prone to overfitting

modify cost function to shrink all parameters; add a "regularization term"

new optimization objective $\underset{\theta}{\text{min}}$ $J(\theta)$ (minimize this regularized cost function):
$J(\theta) = \frac{1}{2m}\Big[(\sum\limits_{i=0}^{m-1}h_{\theta}(x^{(i)}) - y^{(i)})^{2}) + \lambda \sum\limits_{j=1}^{n} \theta_{j}^{2}\Big]$

by convention, don't penalize/regularize $\theta_{0}$ (in practice, the result is not very different)

$\lambda$ is the regularization parameter; it controls a trade-off between two different goals (a) fit training set well, b)keep parameters small)

if $\lambda$ is set extremely large (e.g., $\lambda = 10^{10}$), then all of the parameters would be close to zero (essentially getting rid of all terms except for $\theta_{0}$, which would result in underfitting)

gradient descent for linear regression

$\frac{\partial}{\partial\theta_{0}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}$

$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$, where $\theta$ is regularized

repeat {
$\theta_{0} := \theta_{0} - \alpha (\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)})$
$\theta_{j} := \theta_{j} - \alpha \Big[(\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}) + \frac{\lambda}{m}\theta_{j}\Big] \\ = \theta_{j}(1 - \alpha\frac{\lambda}{m}) - \alpha (\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)})$
$(j=1,2,3,..,n)$
}

usually, $(1 - \alpha \frac{\lambda}{m}) < 1$, so $\theta_{j}$ will be made smaller

Normal Equation

design matrix ($m \times (n + 1)$):
$X = \begin{bmatrix}(x^{(1)})^{T} \\ ... \\(x^{(m)})^{T} \end{bmatrix}$

each row corresponds to a different training example

m-dimensional vector ($\mathbb{R}^{m}$):
$y = \begin{bmatrix}y^{(1)} \\ ... \\y^{(m)}\end{bmatrix}$

contains labels from training set

regularized:
$\theta = (X^{T}X + \lambda \begin{bmatrix} 0 && 0 && 0 && ... \\ 0 && 1 && 0 && ... \\ 0 && 0 && 1 && ... \\ 0 && 0 && 0 && ... \end{bmatrix})^{-1} X^{T}y$

the matrix is a $(n+1)\times(n+1)$

if $n = 2$, the matrix will be a $3 \times 3$

\begin{bmatrix} 0 && 0 && 0 \\ 0 && 1 && 0 \\ 0 && 0 && 1 \\ \end{bmatrix}

even if $m \leq n$ (which would normally make $X^{T}X$ non-invertible/singular/degenerate), if $\lambda > 0$, then $X^{T}X + \lambda \begin{bmatrix} 0 && 0 && 0 && ... \\ 0 && 1 && 0 && ... \\ 0 && 0 && 1 && ... \\ 0 && 0 && 0 && ... \end{bmatrix}$ will be invertible

gradient descent for logistic regression

$\frac{\partial}{\partial\theta_{0}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}$

$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$, where $\theta$ is regularized

repeat {
$\theta_{0} := \theta_{0} - \alpha [\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}]$
$\theta_{j} := \theta_{j} - \alpha [\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)} + \frac{\lambda}{m}\theta_{j}] \\ = \theta_{j}(1 - \alpha\frac{\lambda}{m}) - \alpha [\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}]$
$(j=1,2,3,..,n)$
}

[the summation does not include the regularization term]

for advanced optimization, add regularization if it doesn't exist

Neural Networks

(an old algorithm; computationally more expensive)

non-linear hypotheses

classification rather than regression
(e.g., predict whether or not a house will be sold in the next 6 months)

also, if n is large, it can be computationally too expensive

computer vision

neurons and the brain

model representation

neuron (cells in brain)

  • computation (in nucleus)
  • input wires (dendrites; accepts messages)
  • output wire (axon; send messages [spikes/electric pulses] to other neurons)

artificial neural network

artificial neuron model: logistic unit

a sigmoid (logistic) activation function


In [14]: