two-class/binary-class classification problems: $y$ is a discrete value (e.g., spam or not spam, fraud or not fraud, etc.)
$y \in \{0,1\}$, 0: "Negative Class" (e.g., benign), 1: "Positive Class" (e.g., malignant)
could use linear regression ($h_{\theta}(x) = \theta^{T}x$) with a particular threshold value; however, extreme $x$ values, for example, can skew the fitted line, shifting the threshold in a way that contradicts the training samples; also, linear regression hypotheses can output values less than 0 and greater than 1 (even when all training samples have $y$ values of 0 or 1)
logistic regression (classification algorithm) model:
want $0 \leq h_{\theta}(x) \leq 1$
Sigmoid/Logistic function: $h_{\theta}(x) = g(\theta^{T}x)$, where $g(z) = \frac{1}{1+e^{-z}}$, where $z \in \mathbb{R}$
thus, $h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$
interpretation of hypothesis output:
$h_{\theta}(x) =$ estimated probability that $y = 1$ on input $x$
for example:
for the following features x,
$x = \begin{bmatrix} x_{0} \\ x_{1} \end{bmatrix}$
$h_{\theta}(x) = 0.7$ (means that, for those features, the estimated probability that $y = 1$ is 0.7; i.e., there is a 70% chance that $y = 1$)
$h_{\theta}(x) = P(y=1 \mid x;\theta)$ (p of y = 1 [probability that y = 1] given x parameterized by theta)
$P(y=0 \mid x;\theta) + P(y=1 \mid x;\theta) = 1$
$P(y=0 \mid x;\theta) = 1 - P(y=1 \mid x;\theta)$
thus, $h_{\theta}(x) = g(\theta^{T}x) = P(y=1 \mid x;\theta)$
the sigmoid function, plotted:
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-2, 2)
plt.plot(x,(1/(1+np.e**-x)))
plt.title("Sigmoid Function $g(z)$")
plt.xlabel('z')
plt.show()
suppose:
predict $y = 1$ if $h_{\theta}(x) \geq 0.5$
predict $y = 0$ if $h_{\theta}(x) \lt 0.5$
$g(z) \geq 0.5$ when $z \geq 0$
thus: $h_{\theta}(x) = g(\theta^{T}x) \geq 0.5$,
whenever $\theta^{T}x \geq 0$
$g(z) \lt 0.5$ when $z \lt 0$
and: $h_{\theta}(x) = g(\theta^{T}x) \lt 0.5$,
whenever $\theta^{T}x \lt 0$
decision boundary is a property of the hypothesis (not of the training set); the training set is used to fit the parameters $\theta$
example:
$h_{\theta}(x) = g(\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2})$
$\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \\ \end{bmatrix}$
predict $y = 1$ if $\theta^{T}x = -3 + x_{1} + x_{2} \geq 0$
decision boundary: $h_{\theta}(x) = 0.5$ when $x_{1} + x_{2} = 3$
i.e., predict $y = 1$ whenever $x_{1} + x_{2} \geq 3$
which means $x_{1} + x_{2} \lt 3 \implies y = 0$
the boundary splits the plane into two half-planes/regions: $y = 0$ on one side and $y = 1$ on the other
In [3]:
x = np.linspace(0, 3)
plt.scatter([1,0.5,0], [1,2,0.5], marker='o')
plt.scatter([3,2,2.5], [1,2.5,2], color='r', marker='x')
plt.plot(x,(-x + 3))
plt.title("Sigmoid Function $g(z)$")
plt.xlabel('z')
plt.show()
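for example, a minimal NumPy sketch (not from the original notes; the points are the ones plotted above) that evaluates $h_{\theta}(x)$ with $\theta = [-3, 1, 1]$ and applies the 0.5 threshold:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])
# rows are [x0 = 1, x1, x2]; the o and x points from the scatter plot above
X = np.array([[1, 1, 1], [1, 0.5, 2], [1, 0, 0.5],    # expected y = 0
              [1, 3, 1], [1, 2, 2.5], [1, 2.5, 2]])   # expected y = 1
h = sigmoid(X @ theta)
print(h)                         # estimated probabilities that y = 1
print((h >= 0.5).astype(int))    # predict y = 1 whenever theta^T x >= 0, i.e., x1 + x2 >= 3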
non-linear decision boundaries
$h_{\theta}(x) = g(\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{1}^{2} + \theta_{4}x_{2}^{2})$
$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$
predict $y = 1$ if $-1 + x_{1}^{2} + x_{2}^{2} \geq 0$
$\implies x_{1}^{2} + x_{2}^{2} \geq 1$
decision boundary: $x_{1}^{2} + x_{2}^{2} = 1$ (a circle)
In [4]:
x1 = np.linspace(-1, 1)
plt.scatter([0.2,0.5,0,0.3,-0.8], [0,-0.5,0.4,-0.3,0], marker='o')
plt.scatter([1.2,0.8,0,0,-0.8], [0,0.8,1.1,-1.3,0.8], color='r', marker='x')
plt.plot(x1,(np.sqrt(-x1**2 + 1)))
plt.plot(x1,-(np.sqrt(-x1**2 + 1)), color='b')
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()
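similarly, a small illustrative sketch (values made up, a few of the points plotted above) for the circular boundary, using the feature vector $[1, x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}]$ and $\theta = [-1, 0, 0, 1, 1]$:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
points = np.array([[0.2, 0.0], [0.5, -0.5],   # inside the unit circle  -> expect y = 0
                   [1.2, 0.0], [0.8, 0.8]])   # outside the unit circle -> expect y = 1
X = np.column_stack([np.ones(len(points)), points, points**2])   # [1, x1, x2, x1^2, x2^2]
print((sigmoid(X @ theta) >= 0.5).astype(int))   # [0 0 1 1]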
higher-order polynomial features make even more complex decision boundary shapes possible,
from an ellipse to an irregular shape
supervised learning problem of fitting a logistic regression model
training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$
$m$ training examples, each with features $x = \begin{bmatrix}x_{0}\\x_{1}\\...\\x_{n}\end{bmatrix} \in \mathbb{R}^{n+1}$, where $x_{0} = 1$ and $y \in \{0,1\}$
$h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$
given this training set, how to choose/fit parameters $\theta$?
for linear regression:
$J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}Cost(h_{\theta}(x^{(i)}),y^{(i)})$, where $Cost(h_{\theta}(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$, where $(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$ is the squared-error term
however, this cost function doesn't work well for logistic regression because it is a non-convex function of the parameters theta (i.e., with many local optima) because of the sigmoid component (in the hypothesis) being a relatively complicated non-linearity
gradient descent, on that sort of function, would not be guaranteed to converge on the global minimum
thus, a logistic regression cost function (for a single training example):
$Cost(h_{\theta}(x),y) =
\begin{cases}
-\log{(h_{\theta}(x))} & \text{if } y = 1 \\
-\log{(1 - h_{\theta}(x))} & \text{if } y = 0
\end{cases}$
note: $y = 0$ or $1$ always
In [5]:
x = np.linspace(0, 1)
plt.plot(x,-(np.log(x)))
#plt.plot(x,-(np.log(1/(1+np.e**-x))))
plt.title("if $y = 1$")
plt.xlabel(r"$h_{\theta}(x)$")
plt.show()
this cost function has desirable properties
$Cost = 0$ if $y = 1$ and $h_{\theta}(x) = 1$ (i.e., the hypothesis predicts $h_{\theta}(x) = 1$ and $y$ is exactly equal to the prediction)
but, as $h_{\theta}(x) \rightarrow 0$, $Cost \rightarrow \infty$; this captures the intuition that if $h_{\theta}(x) = 0$ (predicting $P(y=1 \mid x;\theta) = 0$, e.g., the probability that the email is spam is 0, i.e., impossible) but in fact $y = 1$ (the email is spam), then the learning algorithm is penalized with a very large cost
In [6]:
x = np.linspace(0, 1)
plt.plot(x,-(np.log(1 - x)))
#plt.plot(x,-(np.log(1/(1+np.e**-x))))
plt.title("if $y = 0$")
plt.xlabel(r"$h_{\theta}(x)$")
plt.show()
for the case of $y = 0$: if $h_{\theta}(x) \rightarrow 1$ (predicting $y = 1$ with near certainty) but $y$ ends up being 0, a very large cost is paid (while the cost is 0 when $h_{\theta}(x) = 0$ and the hypothesis correctly predicts $y = 0$)
with this choice of cost function (which is convex and local-optima free), we have a convex optimization problem
$Cost(h_{\theta}(x),y) = \begin{cases} -\log{(h_{\theta}(x))} & \text{if } y = 1 \\ -\log{(1 - h_{\theta}(x))} & \text{if } y = 0 \end{cases}$
simplified cost function (compressed into one equation since y is always 0 or 1):
$Cost(h_{\theta}(x),y) = -y\log{(h_{\theta}(x))} - (1 - y)\log{(1 - h_{\theta}(x))}$
equivalent since:
if $y = 1: -(1)\log{(h_{\theta}(x))} - (1 - (1))\log{(1 - h_{\theta}(x))} \\ = -\log{(h_{\theta}(x))} - (0)\log{(1 - h_{\theta}(x))} = -\log{(h_{\theta}(x))}$
if $y = 0: -(0)\log{(h_{\theta}(x))} - (1 - (0))\log{(1 - h_{\theta}(x))} \\ = -(1)\log{(1 - h_{\theta}(x))} = -\log{(1 - h_{\theta}(x))}$
so, final logistic regression cost function:
$J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}Cost(h_{\theta}(x^{(i)}),y^{(i)}) \\ = -\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{h_{\theta}(x^{(i)})} + (1 - y^{(i)})\log{(1 - h_{\theta}(x^{(i)}))}\big]$
the minus sign was factored out; also, this particular cost function was chosen because it can be derived from statistics using the principle of maximum likelihood estimation (a general approach for finding parameters $\theta$ for particular models)
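a vectorized sketch of this cost function (assuming $X$ is the $m \times (n+1)$ design matrix with $x_{0} = 1$ and $y$ is the vector of 0/1 labels; names are illustrative):
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))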
to fit parameters $\theta$:
$\underset{\theta}{\text{min}}$ $J(\theta)$, which will give some set of parameters $\theta$
to make a prediction given a new set of features $x$:
output a prediction $h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$, interpreted as $P(y=1 \mid x;\theta)$
so, how to minimize $J(\theta)$ as a function of $\theta$ (i.e., fit the parameters to the training set)
$J(\theta) = -\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{h_{\theta}(x^{(i)})} + (1 - y^{(i)})\log{(1 - h_{\theta}(x^{(i)}))}\big]$
$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{\partial}{\partial\theta_{j}} \big(-\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{h_{\theta}(x^{(i)})} + (1 - y^{(i)})\log{(1 - h_{\theta}(x^{(i)}))}\big]\big)$
for example:
$\frac{\partial}{\partial\theta_{0}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{0}} \big(-\frac{1}{m}\big[\sum\limits_{i=0}^{m-1}y^{(i)}\log{\big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(i)} + \theta_{1} \cdot x_{1}^{(i)})}}\big)} + (1 - y^{(i)})\log{(1 - \big(\frac{1}{1+e^{-(\theta_{0} \cdot x_{0}^{(i)} + \theta_{1} \cdot x_{1}^{(i)})}}\big))}\big]\big)$
to compute this derivative, use the identity for the sigmoid: since $g(z) = \frac{1}{1+e^{-z}}$, $g'(z) = \frac{e^{-z}}{(1+e^{-z})^{2}} = g(z)(1 - g(z))$, and therefore $\frac{\partial}{\partial\theta_{j}} h_{\theta}(x) = h_{\theta}(x)(1 - h_{\theta}(x)) \cdot x_{j}$
for a single training example $(x, y)$:
$\frac{\partial}{\partial\theta_{j}} \big[-y\log{(h_{\theta}(x))} - (1 - y)\log{(1 - h_{\theta}(x))}\big]$
$= -y \cdot \frac{1}{h_{\theta}(x)} \cdot h_{\theta}(x)(1 - h_{\theta}(x)) \cdot x_{j} + (1 - y) \cdot \frac{1}{1 - h_{\theta}(x)} \cdot h_{\theta}(x)(1 - h_{\theta}(x)) \cdot x_{j}$
(for the second term, the minus sign from $\frac{\partial}{\partial\theta_{j}}(1 - h_{\theta}(x)) = -h_{\theta}(x)(1 - h_{\theta}(x)) \cdot x_{j}$ cancels the leading minus sign)
$= -y(1 - h_{\theta}(x)) \cdot x_{j} + (1 - y) \cdot h_{\theta}(x) \cdot x_{j}$
$= (h_{\theta}(x) - y) \cdot x_{j}$
summing over the $m$ training examples:
$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$
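a quick numerical sanity check of this result (illustrative; the data is made up): compare the analytic gradient $\frac{1}{m}X^{T}(h_{\theta}(x) - y)$ with a finite-difference approximation of $J(\theta)$:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

np.random.seed(0)
X = np.column_stack([np.ones(5), np.random.randn(5, 2)])   # made-up design matrix
y = np.array([0, 1, 1, 0, 1])
theta = np.random.randn(3)

analytic = X.T @ (sigmoid(X @ theta) - y) / len(y)
eps = 1e-6
numeric = np.array([(cost(theta + eps * np.eye(3)[j], X, y) -
                     cost(theta - eps * np.eye(3)[j], X, y)) / (2 * eps) for j in range(3)])
print(np.allclose(analytic, numeric))   # True (up to numerical precision)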
want $\underset{\theta}{\text{min}}$ $J(\theta)$
repeat until convergence {
$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta) = \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$
}
the algorithm looks identical to the one used for linear regression; the hypothesis is the only difference
a vectorized implementation:
of the form: $\theta := \theta - \alpha \delta$, for some vector $\delta \in \mathbb{R}^{n+1}$
$\theta := \theta - \alpha \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} = \theta - \alpha\delta$
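a minimal vectorized gradient descent sketch (alpha and num_iters are placeholder choices, not values from the notes):
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    m, n_params = X.shape               # X is the m x (n+1) design matrix
    theta = np.zeros(n_params)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                          # hypothesis for all m examples
        theta = theta - alpha * (X.T @ (h - y)) / m     # simultaneous update of every theta_j
    return theta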
so, for optimization, given $\theta$, need code that computes $J(\theta)$ and $\frac{\partial}{\partial\theta_{j}} J(\theta)$ (for $j = 0, 1, ..., n$)
and then give those to an optimization algorithm
more advanced optimization algorithms (e.g., conjugate gradient, BFGS, L-BFGS):
advantages: no need to manually pick a learning rate $\alpha$; often converge faster than gradient descent
disadvantage: more complex
don't try to implement these yourself (just like you don't implement your own square-root function); use well-tested library implementations
example:
given a $\theta$, where $\theta = \begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix}$
given a cost function: $J(\theta) = (\theta_{0} - 5)^{2} + (\theta_{1} - 5)^{2}$
then:
$\frac{\partial}{\partial \theta_{0}} J(\theta) = 2(\theta_{0} - 5)$
$\frac{\partial}{\partial \theta_{1}} J(\theta) = 2(\theta_{1} - 5)$
(for logistic regression, use optimization solvers that find the minimum of unconstrained functions since logistic regression does not have any constraints [as $\theta$ can take any real value])
using SciPy's optimization package:
In [7]:
# Nelder-Mead Simplex algorithm
from scipy.optimize import fmin
theta = np.zeros(2)          # initial guess for [theta_0, theta_1]
def J(theta):
    return (theta[0] - 5)**2.0 + (theta[1] - 5)**2.0
theta_optimized = fmin(J, theta, xtol=1e-8)
print(theta_optimized)
print()
theta_optimized = fmin(J, theta)
print(theta_optimized)
In [8]:
# Broyden-Fletcher-Goldfarb-Shanno algorithm
from scipy.optimize import fmin_bfgs
theta = np.zeros(2)          # initial guess for [theta_0, theta_1]
def J(theta):
    return (theta[0] - 5)**2.0 + (theta[1] - 5)**2.0
def deltas(theta):
    # analytic gradient of J with respect to theta_0 and theta_1
    return np.array([2.0 * (theta[0] - 5), 2.0 * (theta[1] - 5)])
# gradient estimated numerically
theta_optimized = fmin_bfgs(J, theta)
print(theta_optimized)
print()
# gradient provided analytically
theta_optimized = fmin_bfgs(J, theta, fprime=deltas)
print(theta_optimized)
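the same idea applied to logistic regression itself, handing fmin_bfgs the cost and its gradient (a sketch; the tiny training set here is made up for illustration):
In [ ]:
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up, non-separable toy data: columns are [x0 = 1, x1, x2]
X = np.array([[1, 1, 1], [1, 0.5, 2], [1, 0, 0.5],
              [1, 3, 1], [1, 2, 2.5], [1, 2.5, 2]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1], dtype=float)

def J(theta):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

theta_optimized = fmin_bfgs(J, np.zeros(3), fprime=grad)
print(theta_optimized)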
multi-class classification, for example:
email foldering/tagging: work ($y=1$), friends ($y=2$), family ($y=3$), hobby ($y=4$)
medical diagnosis: not ill ($y=0$), cold ($y=1$), flu ($y=2$)
weather: sunny, cloudy, rain, snow
train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
on a new input $x$, to make a prediction, pick the class $i$ that maximizes (i.e., gives the highest probability):
$\underset{i}{\text{max}}$ $h_{\theta}^{(i)}(x)$
In [10]:
x = np.linspace(0, 3)
plt.scatter([1,0.5,0,0.5], [1,2,0.5,1.0], marker='o')
plt.scatter([3,2,2.5,2.3], [1,2.5,2,1.5], color='r', marker='x')
plt.plot(x,(-x + 3))
plt.title("Binary Classification")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()
In [11]:
plt.scatter([1,0.7,0,0.5], [2,2.5,2,2.3], color='g', marker='^')
plt.scatter([1,0.5,0,0.5], [1,0.5,0.5,1.0], color='b', marker='s')
plt.scatter([3,2.6,2.5,2.3], [1,1.6,1.3,1.5], color='r', marker='x')
plt.legend(["Class 1","Class 2","Class 3"])
plt.title("Multi-class Classification")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()
turn into three binary class classification problems
create a "fake" training set where classes 2 and 3 get assigned to the negative class (value 0), and class 1 gets assigned to the positive class (value 1)
fit a classifier $h_{\theta}^{(i)}(x)$
train a regular logistic regression classifier on each "fake" training set to compute a decision boundary (a prediction sketch follows the three plots below)
In [12]:
plt.scatter([1,0.7,0,0.5], [2,2.5,2,2.3], color='g', marker='^')
plt.scatter([1,0.5,0,0.5,3,2.6,2.5,2.3], [1,0.5,0.5,1.0,1,1.6,1.3,1.5], color='k', marker='o')
x = np.linspace(0,3)
plt.plot(x,((x/2) + 1))
plt.title(r"$h_{\theta}^{(1)}(x)$")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()
In [13]:
plt.scatter([1,0.5,0,0.5], [1,0.5,0.5,1.0], color='b', marker='s')
plt.scatter([3,2.6,2.5,2.3,1,0.7,0,0.5], [1,1.6,1.3,1.5,2,2.5,2,2.3], color='k', marker='o')
x = np.linspace(0,3)
plt.plot(x,((-x/2) + 1.7))
plt.title(r"$h_{\theta}^{(2)}(x)$")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()
In [14]:
plt.scatter([3,2.6,2.5,2.3], [1,1.6,1.3,1.5], color='r', marker='x')
plt.scatter([1,0.7,0,0.5,1,0.5,0,0.5], [2,2.5,2,2.3,1,0.5,0.5,1.0], color='k', marker='o')
x = np.full(50, 1.5)    # vertical decision boundary at x1 = 1.5
y = np.linspace(0,3)
plt.plot(x,y)
plt.title(r"$h_{\theta}^{(3)}(x)$")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.show()
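a one-vs-all prediction sketch (the three parameter vectors below are made up to roughly match the hand-drawn boundaries above, not fitted parameters):
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    # thetas: one parameter vector per class; x: feature vector with x0 = 1
    probs = [sigmoid(theta @ x) for theta in thetas]
    return int(np.argmax(probs)) + 1             # classes numbered 1, 2, 3

thetas = [np.array([-1.0, -0.5, 1.0]),   # class 1: above x2 = x1/2 + 1
          np.array([1.7, -0.5, -1.0]),   # class 2: below x2 = -x1/2 + 1.7
          np.array([-1.5, 1.0, 0.0])]    # class 3: right of x1 = 1.5
print(predict_one_vs_all(thetas, np.array([1.0, 0.5, 2.3])))   # -> 1 (a class 1 point)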
underfitting / "high bias"
(e.g., a linear model that does not fit a training set well; i.e., there is a strong bias/preconception even in the face of contrary data that the model should be linear)
(e.g., for logistic regression, a simple sigmoid function model)
overfitting / "high variance"
(e.g., a 4th-order polynomial model, which might fit a curve that passes through all training samples; i.e., there is too much variance, almost as if it could fit any data (it tries too hard); thus, it would fail to generalize to new examples)
(e.g., for logistic regression, a complicated high-order polynomial that fits all training samples)
deciding whether a model is overfitting: with many features, it is very difficult to plot/visualize the hypothesis to see whether it is overfitting
addressing overfitting:
penalize the parameters so that the $\theta_{j}$ values stay small
modify the cost function to shrink all the parameters by adding a "regularization term"
new optimization objective $\underset{\theta}{\text{min}}$ $J(\theta)$ (minimize this regularized cost function):
$J(\theta) = \frac{1}{2m}\Big[\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)})^{2} + \lambda \sum\limits_{j=1}^{n} \theta_{j}^{2}\Big]$
by convention, don't penalize/regularize $\theta_{0}$ (in practice, the result is not very different)
$\lambda$ is the regularization parameter; it controls a trade-off between two different goals: (a) fit the training set well, and (b) keep the parameters small
if $\lambda$ is set extremely large (e.g., $\lambda = 10^{10}$), then all of the parameters would be close to zero (essentially getting rid of all terms except for $\theta_{0}$, which would result in underfitting)
$\frac{\partial}{\partial\theta_{0}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}$
$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$, where $\theta$ is regularized
repeat {
$\theta_{0} := \theta_{0} - \alpha (\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)})$
$\theta_{j} := \theta_{j} - \alpha \Big[(\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}) + \frac{\lambda}{m}\theta_{j}\Big] \\ = \theta_{j}(1 - \alpha\frac{\lambda}{m}) - \alpha (\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)})$
$(j=1,2,3,..,n)$
}
typically $(1 - \alpha \frac{\lambda}{m})$ is a number slightly less than 1, so each update shrinks $\theta_{j}$ a little before applying the usual gradient step
design matrix ($m \times (n + 1)$):
$X = \begin{bmatrix}(x^{(1)})^{T} \\ ... \\(x^{(m)})^{T} \end{bmatrix}$
each row corresponds to a different training example
m-dimensional vector ($\mathbb{R}^{m}$):
$y = \begin{bmatrix}y^{(1)} \\ ... \\y^{(m)}\end{bmatrix}$
contains labels from training set
regularized:
$\theta = \Big(X^{T}X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}\Big)^{-1} X^{T}y$
the matrix is $(n+1)\times(n+1)$
if $n = 2$, the matrix will be $3 \times 3$: $\begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
even if $m \leq n$ (which would normally make $X^{T}X$ non-invertible/singular/degenerate), if $\lambda > 0$, then $X^{T}X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$ will be invertible
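a sketch of the regularized normal equation (illustrative helper, not from the notes):
In [ ]:
import numpy as np

def normal_equation_regularized(X, y, lam):
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0                                      # don't penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)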
$\frac{\partial}{\partial\theta_{0}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}$
$\frac{\partial}{\partial\theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$, where $\theta$ is regularized
repeat {
$\theta_{0} := \theta_{0} - \alpha [\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}]$
$\theta_{j} := \theta_{j} - \alpha [\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)} + \frac{\lambda}{m}\theta_{j}] \\ = \theta_{j}(1 - \alpha\frac{\lambda}{m}) - \alpha [\frac{1}{m}\sum\limits_{i=0}^{m-1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}]$
$(j=1,2,3,..,n)$
}
[the summation does not include the regularization term]
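a regularized gradient descent sketch for logistic regression (alpha, lam, and num_iters are placeholder values; theta_0 is updated without the regularization term):
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_regularized(X, y, alpha=0.1, lam=1.0, num_iters=1000):
    m, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = (X.T @ (h - y)) / m
        grad[1:] += (lam / m) * theta[1:]    # regularize every theta_j except theta_0
        theta = theta - alpha * grad
    return theta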
for advanced optimization algorithms, make sure the supplied cost function and gradients include the regularization term (add it if it isn't there)
neural networks (an old algorithm; computationally somewhat more expensive)
classification rather than regression
(e.g., predict whether or not a house will be sold in the next 6 months)
also, if $n$ (the number of features) is large, adding enough non-linear features to logistic regression can become computationally too expensive
neuron (cell in the brain)
artificial neural network
artificial neuron model: logistic unit
a sigmoid (logistic) activation function
In [14]: