first: Training set -> Learning Algorithm -> h (hypothesis: maps x's to y's)
then: x -> h -> y (an estimated value for y)
how to represent h: $h_{\theta}(x) = \theta_{0} + \theta_{1}x$
shorthand: $h(x)$
h is simply a linear function
univariate linear regression / linear regression with one variable (x)
In [14]:
from IPython.display import Math
import matplotlib.pyplot as plt
import numpy as np
#from pylab import * #MATLAB-like API
%matplotlib inline
## MATLAB-like API
#x = linspace(0, 5, 10)
#y = x ** 2
#figure()
#plot(x, y, 'r')
#xlabel('x')
#ylabel('y')
#title('title')
#show()
## OO API
fig = plt.figure()
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)
x = np.array([1.0, 2.0, 3.5, 4.0, 5.2])
y = np.array([1.5, 3.0, 3.5, 4.3, 5.9])
axes.plot(x, x*1.2, label=r"$h(x) = \theta_{0} + \theta_{1}x$")
axes.scatter(x, y, marker='x', color='r')
axes.legend(loc=2)
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('title')
plt.show()
Squared-Error Cost Function: helps to figure out how to fit the best possible straight line to data
how to choose $\theta_i\text{'s}$:
choose parameters $\theta_{0}, \theta_{1}$ so that $h_{\theta}(x)$ is close to $y$ for the training examples $(x,y)$
formally, the (optimization) goal/objective is to minimize the cost function: $\underset{\theta_{0},\theta_{1}}{\text{minimize}}\ J(\theta_{0},\theta_{1})$:
use the average of the sum of the squared difference to find the smallest values for $\theta_{0},\theta_{1}$
$J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$, where $h_{\theta}(x^{(i)}) = \theta_{0} + \theta_{1}x^{(i)}$
simplified version:
$h_{\theta}(x) = \theta_{1}x$
(as if $\theta_{0} = 0$)
$h_{\theta}(x^{(i)}) = \theta_{1}x^{(i)}$
$\underset{\theta_{1}}{\text{minimize}}\ J(\theta_{1})$
hypothesis function: $h_{\theta}(x)$ (for a fixed value of $\theta_{1}$, this is a function of $x$)
cost function: $J(\theta_{1})$ (is a function of the parameter $\theta_{1}$)
each value of $\theta_{1}$ corresponds to a different hypothesis
$J(\theta_{1}) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
$= \frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{1}x^{(i)} - y^{(i)})^2$
$= \frac{1}{2m}(0^2 + 0^2 + 0^2) = 0$
In [27]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 4)
theta_1 = 1
h_theta = x * theta_1
x_i = np.array([1, 2, 3])
y_i = np.array([1, 2, 3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.plot(x, h_theta)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.yticks(np.arange(min(x), max(x)+1, 1.0))
plt.xlabel('x')
plt.ylabel('y')
plt.title(r"$h_{\theta}(x)$")
plt.show()
so, when $\theta_{1} = 1$, $h_{\theta}(x^{(i)}) = y^{(i)}$, which explains the $0$'s
and for $\theta_{1} = 1$, $J(\theta_{1}) = J(1) = 0$
In [25]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_i = np.array([1])
y_i = np.array([0])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.xticks(np.arange(0, 4.5, 0.5))
plt.yticks(np.arange(0, 5, 1))
plt.xlabel(r"$\theta_{1}$")
plt.ylabel(r"$J(\theta_{1})$")
plt.title(r"$J(\theta_{1})$")
plt.show()
In [26]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 4)
theta_1 = 0.5
h_theta = x * theta_1
x_i = np.array([1, 2, 3])
y_i = np.array([1, 2, 3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.plot(x, h_theta)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.yticks(np.arange(min(x), max(x)+1, 1.0))
plt.xlabel('x')
plt.ylabel('y')
plt.title(r"$h_{\theta}(x)$")
plt.show()
In [24]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_i = np.array([1, 0.5])
y_i = np.array([0, 0.58])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.xticks(np.arange(0, 4.5, 0.5))
plt.yticks(np.arange(0, 5, 1))
plt.xlabel(r"$\theta_{1}$")
plt.ylabel(r"$J(\theta_{1})$")
plt.title(r"$J(\theta_{1})$")
plt.show()
In [29]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 4)
theta_1 = 0
h_theta = x * theta_1
x_i = np.array([1, 2, 3])
y_i = np.array([1, 2, 3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.plot(x, h_theta, linewidth=2.0)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.yticks(np.arange(min(x), max(x)+1, 1.0))
plt.xlabel('x')
plt.ylabel('y')
plt.title(r"$h_{\theta}(x)$")
plt.show()
In [30]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_i = np.array([1, 0.5, 0])
y_i = np.array([0, 0.58, 2.3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.xticks(np.arange(0, 4.5, 0.5))
plt.yticks(np.arange(0, 5, 1))
plt.xlabel(r"$\theta_{1}$")
plt.ylabel(r"$J(\theta_{1})$")
plt.title(r"$J(\theta_{1})$")
plt.show()
clearly, the optimization objective is met when $\theta_{1} = 1$
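a quick numerical check of the values above (a minimal sketch, assuming the same toy training set $\{(1,1),(2,2),(3,3)\}$ used in the plots):
In [ ]:
import numpy as np
# toy training set from the plots above
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)
def J(theta_1):
    # squared-error cost for the simplified hypothesis h(x) = theta_1 * x
    return (1.0 / (2 * m)) * np.sum((theta_1 * x - y) ** 2)
for theta_1 in [0.0, 0.5, 1.0]:
    print(theta_1, J(theta_1))
# prints roughly 2.33, 0.58, and 0.0 -- matching the scattered (theta_1, J) points above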
have some function $J(\theta_{0},\theta_{1},...,\theta_{n})$
want $\underset{\theta_{0},\theta_{1},...,\theta_{n}}{\text{min}}$ $J(\theta_{0},\theta_{1},...,\theta_{n})$
start with some $\theta_{0},\theta_{1},...,\theta_{n}$ (e.g., set all to $0$)
keep changing $\theta_{0},\theta_{1},...,\theta_{n}$ to reduce $J(\theta_{0},\theta_{1},...,\theta_{n})$ until we hopefully converge on/reach a minimum
property of gradient descent: depending on the starting point, it can arrive at different local optima
algorithm:
repeat until convergence {
$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1}) \text{, (for } j = 0 \text{ and } j = 1 \text{)}$
}
in other words, update $\theta_{j}$ by subtracting $\alpha \frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1})$
$\alpha$ = learning rate (controls the size of the steps taken when updating $\theta_{j}$)
$\frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1})$ = derivative
simultaneous update:
$\text{temp0} := \theta_{0} - \alpha \frac{\partial}{\partial\theta_{0}} J(\theta_{0},\theta_{1})$
$\text{temp1} := \theta_{1} - \alpha \frac{\partial}{\partial\theta_{1}} J(\theta_{0},\theta_{1})$
$\theta_{0} := \text{temp0}$
$\theta_{1} := \text{temp1}$
$\underset{\theta_{1}}{\text{min}}$ $J(\theta_{1})$, $\theta_{1} \in \mathbb{R}$
$\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}} J(\theta_{1})$
example 1.1:
$f(x) = x^{2}$
$f'(x) = 2x$
starting point: $(8,64)$
slope at point: $m = f'(8) = 2(8) = 16$
tangent line (via point-slope): $y - y_{1} = m(x - x_{1}) \rightarrow y - 64 = 16(x - 8) \rightarrow y = 16x - 64$
slope is a positive number (so $\theta_{1}$ will be decreased)
$\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}} J(\theta_{1}) = \theta_{1} - \alpha (\text{positive_number})$
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-10,10,100)
y = x**2
plt.scatter(8,64, color='r')
plt.plot(x, y)
plt.legend([r"$J(\theta_{1})$"])
# theta_1 start at x = 8
tangent_line_x = np.linspace(5,10,5)
tangent_line_y = (16*tangent_line_x - 64)
plt.plot(tangent_line_x, tangent_line_y, color='r')
plt.xticks([8], [r"$\theta_{1}$"])
plt.annotate('min', xy=(0,0), xytext=(0,20), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.annotate('', xy=(2,-10), xytext=(8,-10), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.show()
example 1.2:
$f(x) = x^{2}$
$f'(x) = 2x$
starting point: $(-8,64)$
slope at point: $m = f'(-8) = 2(-8) = -16$
tangent line (via point-slope): $y - y_{1} = m(x - x_{1}) \rightarrow y - 64 = -16(x - (-8)) \rightarrow y = -16x - 128 + 64 = -16x - 64$
In [110]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-10,10,100)
y = x**2
plt.scatter(-8,64, color='r')
plt.plot(x, y)
plt.legend([r"$J(\theta_{1})$"])
# theta_1 start at x = -8
tangent_line_x = np.linspace(-5,-10,5)
tangent_line_y = (-16*tangent_line_x - 64)
plt.plot(tangent_line_x, tangent_line_y, color='r')
plt.xticks([-8], [r"$\theta_{1}$"])
plt.annotate('min', xy=(0,0), xytext=(0,20), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.annotate('', xy=(-2,-10), xytext=(-8,-10), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.show()
slope is a negative number (so $\theta_{1}$ will be increased)
$\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}} J(\theta_{1}) = \theta_{1} - \alpha (\text{negative_number})$
if $\alpha$ is too small, gradient descent can be slow
if $\alpha$ is too large, gradient descent can overshoot the minimum, fail to converge (i.e., of a series, "approximate in the sum of its terms toward a definite limit"), or even diverge (i.e., of a series, "increase indefinitely as more terms are added")
since the derivative term shrinks as $\theta_{j}$ approaches a local minimum, gradient descent automatically takes smaller steps (so there is no need to decrease $\alpha$ over time)
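a tiny sketch of how $\alpha$ affects the update (assuming the $J = \theta_{1}^{2}$ curve from examples 1.1/1.2, so the derivative is $2\theta_{1}$):
In [ ]:
def run(alpha, steps=10, theta_1=8.0):
    # repeatedly apply theta_1 := theta_1 - alpha * dJ/dtheta_1, with J = theta_1**2
    for _ in range(steps):
        theta_1 = theta_1 - alpha * 2 * theta_1
    return theta_1
print(run(0.01))  # too small: still far from the minimum at 0 after 10 steps
print(run(0.5))   # reasonable: jumps straight to the minimum here
print(run(1.1))   # too large: |theta_1| grows every step (divergence)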
$\frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{j}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\right] = \frac{\partial}{\partial\theta_{j}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right]$
partial derivatives for $\theta_{0}$ case and $\theta_{1}$ case:
$j = 0: \frac{\partial}{\partial\theta_{0}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{0}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right]=$
$= \left[0 \cdot \sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right] + \left[\frac{1}{2m} \cdot \sum\limits_{i=1}^{m}\left(2[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}] \cdot [1 + 0 + 0]\right)\right] = \frac{1}{m} \sum\limits_{i=1}^{m}[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}]$
$ = \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]$
$j = 1: \frac{\partial}{\partial\theta_{1}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{1}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right]=$
$= \left[0 \cdot \sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right] + \left[\frac{1}{2m} \cdot \sum\limits_{i=1}^{m}\left(2[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}] \cdot [0 + x^{(i)} + 0]\right)\right] = \frac{1}{m} \sum\limits_{i=1}^{m}[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}] \cdot x^{(i)}$
$= \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}] \cdot x^{(i)}$
(sometimes called "batch" gradient descent: in every step of gradient descent, all training examples are used; other non-"batch" versions of gradient descent exist)
repeat until convergence (update simultaneously) {
$\theta_{0} := \theta_{0} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]$
$\theta_{1} := \theta_{1} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}] \cdot x^{(i)}$
}
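a minimal sketch of these two update rules on a small made-up data set (the data and $\alpha$ are assumptions for illustration only):
In [ ]:
import numpy as np
# made-up training data for h(x) = theta_0 + theta_1 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
m = len(x)
alpha = 0.05
theta_0, theta_1 = 0.0, 0.0
for _ in range(2000):
    h = theta_0 + theta_1 * x
    grad_0 = (1.0 / m) * np.sum(h - y)
    grad_1 = (1.0 / m) * np.sum((h - y) * x)
    # simultaneous update: both gradients are computed before either parameter changes
    theta_0, theta_1 = theta_0 - alpha * grad_0, theta_1 - alpha * grad_1
print(theta_0, theta_1)  # approaches the least-squares line for this data (about 1.0 and 0.85)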
the cost function for linear regression is always "bowl-shaped" (i.e., a convex function); thus, it has only one global/local optimum
In [61]:
# THIS IS JUST A VISUAL EXAMPLE OF A CONVEX FUNCTION
# (i.e., it is not based on the regression/cost functions detailed here)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
%matplotlib inline
#fig = plt.figure()
ax = plt.axes(projection='3d')
x = np.linspace(-10,10)
y = np.linspace(-10,10)
x,y = np.meshgrid(x,y)
z = (x**2) + (y**2)
#ax.set_zlim(0,100)
ax.plot_surface(x,y,z, rstride=2, cstride=2, cmap=cm.jet)
plt.show()
solving the cost function with one computation (i.e., without an iterative algorithm like gradient descent and, thus, without the need for $\alpha$):
with more features ($x_{1}, x_{2}, ..., x_{n}$), it becomes difficult to plot the data and notationally problematic
linear algebra notation is better suited
for example, with 4 features:
$X = \begin{bmatrix} 2104 & 5 & 1 & 45 \\ 1416 & 3 & 2 & 40 \\ 1534 & 3 & 2 & 30 \\ 852 & 2 & 1 & 36 \end{bmatrix}$
$y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 172 \end{bmatrix}$
matrix: rectangular array of numbers
$\begin{bmatrix} 1402 & 191 \\ 949 & 1437 \\ 1371 & 821 \end{bmatrix}$
dimension of matrix: number of rows $\times$ number of columns (e.g., a $3 \times 2$ matrix, or this matrix is an element of the set $\mathbb{R}^{3 \times 2}$)
matrix elements
$A = \begin{bmatrix} 1402 & 191 \\ 949 & 1437 \\ 1371 & 821 \end{bmatrix}$
$A_{ij} =$ "$i,j$ entry" in the $i$-th row, $j$-th column
[commas must be used for indices when a matrix has more than 9 rows/columns]
$A_{11} = 1402$
$A_{12} = 191$
$A_{32} = 821$
$A_{43} = \text{undefined}$
vector: an $n \times 1$ matrix
$y = \begin{bmatrix} 1402 \\ 949 \\ 1371 \\ 178 \end{bmatrix}$
$n = 4$
$n \times 1$ matrix
4-dimensional vector
a vector in the set $\mathbb{R}^{4}$
$y_{i} = i$-th element
$y_{1} = 1402$
$y_{2} = 949$
$y_{3} = 1371$
1-indexed vs. 0-indexed
$y = \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \end{bmatrix}$
$y = \begin{bmatrix} y_{0} \\ y_{1} \\ y_{2} \\ y_{3} \end{bmatrix}$
matrix addition (can only add matrices of the same dimension; result is a matrix of the same dimension):
$\begin{bmatrix}1&&0\\2&&5\\3&&1\end{bmatrix} + \begin{bmatrix}4 && 0.5 \\2 && 5 \\0 && 1\end{bmatrix} = \begin{bmatrix}5 && 0.5 \\4 && 10 \\3 && 2\end{bmatrix}$
scalar (real number) multiplication (result is a matrix of the same dimension):
$3\times\begin{bmatrix}1&&0\\2&&5\\3&&1\end{bmatrix}=\begin{bmatrix}3&&0\\6&&15\\9&&3\end{bmatrix}$
scalar division:
$\begin{bmatrix}4&&0\\6&&3\end{bmatrix}/4=\begin{bmatrix}1&&0\\\frac{3}{2}&&\frac{3}{4}\end{bmatrix} = \frac{1}{4}\begin{bmatrix}4&&0\\6&&3\end{bmatrix}$
the commutative property applies to scalar multiplication but not to scalar division
combination of operations, follow normal precedence rules
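these elementwise operations map directly to NumPy (a small sketch using the matrices above):
In [ ]:
import numpy as np
A = np.array([[1.0, 0.0],
              [2.0, 5.0],
              [3.0, 1.0]])
B = np.array([[4.0, 0.5],
              [2.0, 5.0],
              [0.0, 1.0]])
print(A + B)  # matrix addition (same dimensions required)
print(3 * A)  # scalar multiplication
print(A / 4)  # scalar division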
matrix-vector multiplication
$\begin{bmatrix}1&&3\\4&&0\\2&&1\end{bmatrix} \begin{bmatrix}1\\5\end{bmatrix} = \begin{bmatrix}1 \times 1 + 3 \times 5 = 16 \\ 4 \times 1 + 0 \times 5 = 4 \\ 2 \times 1 + 1 \times 5 = 7 \end{bmatrix}$
a $3 \times 2$ matrix multiplied by a $2 \times 1$ vector/matrix results in a $3 \times 1$ vector/matrix
$A \times x = y$
$m \times n$ $\cdot$ $n \times 1$ $=$ $m \times 1$ (m-dimensional vector)
($n$ has to match $n$; the number of columns in the first matrix has to match the number of rows in the second matrix)
to get $y_{i}$, multiply each element (i.e., elementwise) of $A$'s $i$-th row with each element of vector $x$ (1:1, $A_{in} \cdot x_{n}$), then sum the products
[can simply pretend to stand the first matrix's rows vertically, multiplying the (then) horizontally adjacent elements]
suppose the hypothesis: $h_{\theta}(x) = -40 + 0.25x$
and (for $x$) the house sizes: 2104, 1416, 1534, 852
$\begin{bmatrix} 1 && 2104 \\ 1 && 1416 \\ 1 && 1534 \\ 1 && 852 \end{bmatrix} \times \begin{bmatrix} -40 \\ 0.25 \end{bmatrix}= \begin{bmatrix} -40 \times 1 + 0.25 \times 2104 = h_{\theta}(2104) \\ -40 \times 1 + 0.25 \times 1416 = h_{\theta}(1416) \\ -40 \times 1 + 0.25 \times 1534 = h_{\theta}(1534) \\ -40 \times 1 + 0.25 \times 852 = h_{\theta}(852) \end{bmatrix}$
a $4 \times 2$ matrix times a $2 \times 1$ vector results in a $4 \times 1$ vector
prediction = DataMatrix * parameters
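a minimal NumPy version of this prediction-by-matrix-multiplication (using the hypothesis and house sizes above):
In [ ]:
import numpy as np
# data matrix: a leading column of ones for the intercept term, then the house sizes
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
theta = np.array([-40.0, 0.25])  # h(x) = -40 + 0.25x
print(X.dot(theta))  # all four predictions from one matrix-vector multiplication: [486., 314., 343.5, 173.]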
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 1 && 3 \\ 0 && 1 \\ 5 && 2 \end{bmatrix}=$
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 1 \\ 0 \\ 5 \end{bmatrix}= \begin{bmatrix} 1 \times 1 + 3 \times 0 + 2 \times 5 = 11 \\ 4 \times 1 + 0 \times 0 + 1 \times 5 = 9 \end{bmatrix}$
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix}= \begin{bmatrix} 1 \times 3 + 3 \times 1 + 2 \times 2 = 10 \\ 4 \times 3 + 0 \times 1 + 1 \times 2 = 14 \end{bmatrix}$
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 1 && 3 \\ 0 && 1 \\ 5 && 2 \end{bmatrix}= \begin{bmatrix} 11 && 10 \\ 9 && 14 \end{bmatrix}$
a $2 \times 3$ matrix times a $3 \times 2$ matrix results in a $2 \times 2$ matrix
a $m \times n$ matrix times a $n \times o$ matrix results in a $m \times o$ matrix
($n$'s must match [i.e., number of first matrix columns must match number of second matrix rows])
$A \times B = C$
to get the $j$-th column of $C$, multiply $A$ by the $j$-th column of $B$ (equivalently, $C_{ij}$ is the sum of the elementwise products of the $i$-th row of $A$ and the $j$-th column of $B$)
suppose three competing hypotheses:
$h_{\theta}(x) = -40 + 0.25x$
$h_{\theta}(x) = 200 + 0.1x$
$h_{\theta}(x) = -150 + 0.4x$
and (for $x$) the house sizes: 2104, 1416, 1534, 852
$\begin{bmatrix} 1 && 2104 \\ 1 && 1416 \\ 1 && 1534 \\ 1 && 852 \end{bmatrix} \times \begin{bmatrix} -40 && 200 && -150 \\ 0.25 && 0.1 && 0.4 \end{bmatrix}= \begin{bmatrix} 486 && 410 && 692 \\ 314 && 342 && 416 \\ 344 && 353 && 464 \\ 173 && 285 && 191 \end{bmatrix}$
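the same idea in NumPy: one matrix-matrix multiplication evaluates every hypothesis on every house size (a sketch based on the numbers above):
In [ ]:
import numpy as np
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
# one column of (theta_0, theta_1) per competing hypothesis
Theta = np.array([[-40.0, 200.0, -150.0],
                  [ 0.25,   0.1,    0.4]])
print(X.dot(Theta))  # 4 x 3 result: column j holds the predictions of hypothesis j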
in general, not commutative (i.e., $A \times B \neq B \times A$)
dimensions might not even match ($m \times n$ and $n \times m$ results in $m \times m$, whereas $n \times m$ and $m \times n$ results in $n \times n)$
associative property does hold: $A \times (B \times C) = (A \times B) \times C$
identity matrix:
denoted $I$ or $I_{n \times n}$
examples ($1 \times 1$, $2 \times 2$, $3 \times 3$, $4 \times 4$):
$\begin{bmatrix} 1 \end{bmatrix}$
$\begin{bmatrix} 1 && 0 \\ 0 && 1 \end{bmatrix}$
$\begin{bmatrix} 1 && 0 && 0 \\ 0 && 1 && 0 \\ 0 && 0 && 1 \end{bmatrix}$
$\begin{bmatrix} 1 && 0 && 0 && 0 \\ 0 && 1 && 0 && 0 \\ 0 && 0 && 1 && 0 \\ 0 && 0 && 0 && 1 \end{bmatrix}$
for any matrix $A$:
$A \cdot I = I \cdot A = A$ (commutative property does hold)
($m \times n \cdot n \times n = m \times m \cdot m \times n = m \times n)$
(dimensions of identity matrix are implicit/dependent on context)
in the space of real numbers, the identity number is $1$
a number multiplied by its inverse results in $1$: $3(3^{-1}) = 1$
in the space of real numbers, not all numbers have an inverse (e.g., $0^{-1}$ is undefined); those that don't are called "singular" or "degenerate"
matrix inverse:
if $A$ is an $m \times m$ (i.e., a square) matrix, and if it has an inverse $A^{-1}$ (e.g., the zero matrix does not):
$A(A^{-1}) = A^{-1}(A) = I$
only square matrices (i.e., # of rows = # of columns) have inverses
$A = \begin{bmatrix} 3 && 4 \\ 2 && 16 \end{bmatrix}$
$A^{-1} = \begin{bmatrix} 0.4 && -0.1 \\ -0.05 && 0.075 \end{bmatrix}$
$A \times A^{-1} = \begin{bmatrix} 1 && 0 \\ 0 && 1 \end{bmatrix}= I_{2 \times 2}$
(inverses of matrices usually aren't computed by hand)
matrix transpose (operation)
(first row becomes first column; second row becomes second column...)
(also, imagine flipping/mirroring matrix along 45-degree diagonal axis)
$A = \begin{bmatrix} 1 && 2 && 0 \\ 3 && 5 && 9 \end{bmatrix}$
$A^{T} = \begin{bmatrix} 1 && 3 \\ 2 && 5 \\ 0 && 9 \end{bmatrix}$
let $A$ be a $m \times n$ matrix
let $B = A^{T}$
then $B$ is a $n \times m$ matrix
and $B_{ij} = A_{ji}$
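identity, inverse, and transpose in NumPy (a small sketch using the matrices above):
In [ ]:
import numpy as np
A = np.array([[3.0,  4.0],
              [2.0, 16.0]])
A_inv = np.linalg.inv(A)
print(A_inv)         # [[ 0.4, -0.1], [-0.05, 0.075]]
print(A.dot(A_inv))  # the 2 x 2 identity, up to floating-point error
print(np.eye(2))     # the identity matrix itself
B = np.array([[1.0, 2.0, 0.0],
              [3.0, 5.0, 9.0]])
print(B.T)           # transpose: a 3 x 2 matrix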
now, multiple features/variables: $x_{1}, x_{2}, x_{3}, ..., x_{n}$; the prediction/output is still $y$
notation:
$m =$ number of training examples
$n =$ number of features
$x^{(i)} =$ input (features) of the $i$-th training example (the superscript is an index into the training set)
$x_{j}^{(i)} =$ value of feature $j$ in the $i$-th training example
for example:
$x_{1}$: size (feet^2)
$x_{2}$: # of bedrooms
$x_{3}$: # of floors
$x_{4}$: age of home (years)
$y$: price ($1000)
$x^{(2)} = \begin{bmatrix} x_{1} = 1416 \\ x_{2} = 3 \\ x_{3} = 2 \\ x_{4} = 40 \end{bmatrix}$
thus, $x_{3}^{(2)} = 2$
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$
for example: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} + \theta_{4}x_{4}$
thus, as an example: $h_{\theta}(x) = 80 + 0.1x_{1} + 0.01x_{2} + 3x_{3} - 2x_{4}$
for convenience of notation, define $x_{0} = 1$, so that $x_{0}^{(i)} = 1$ (defining a zeroth feature); now a zero-indexed feature vector:
$x = \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix} \in \mathbb{R}^{n+1}$
$\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \theta_{2} \\ ... \\ \theta_{n} \end{bmatrix} \in \mathbb{R}^{n+1}$
thus: $h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$
$\theta^{T} = \begin{bmatrix}\theta_{0}&&\theta_{1}&&\theta_{2}&&...&&\theta_{n}\end{bmatrix}$
(a $1 \times (n+1)$ row vector)
use the inner/dot/scalar product of vectors
thus: $h_{\theta}(x) = \theta^{T}x = \begin{bmatrix}\theta_{0}&&\theta_{1}&&\theta_{2}&&...&&\theta_{n}\end{bmatrix} \times \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix}$
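a quick numeric check of the inner product form (a sketch assuming the example parameters above and the training example $x^{(2)}$ from the earlier table, with $x_{0} = 1$):
In [ ]:
import numpy as np
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # example parameters from above
x = np.array([1.0, 1416.0, 3.0, 2.0, 40.0])     # x_0 = 1, then the features of x^(2)
print(theta.dot(x))  # theta^T x = 80 + 141.6 + 0.03 + 6 - 80 = 147.63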
hypothesis: $h_{\theta}(x) = \theta^{T}x$
parameters: $\theta$ (a ($n+1$)-dimensional vector)
cost function: $J(\theta_{0},\theta_{1},...,\theta_{n}) = J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
new algorithm ($n \geq 1$):
gradient descent (simultaneously update $\theta_{j}$ for every $j = 0, ..., n$):
repeat {
$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta) = \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}] \cdot x_{j}^{(i)}$
}
vectorized implementation:
$\theta := \theta - \alpha \delta$
$\theta$: $\mathbb{R}^{n+1}$
$-$: elementwise vector subtraction
$\alpha$: $\mathbb{R}$
$\delta$: $\mathbb{R}^{n+1}$
$h_{\theta}(x^{(i)}) - y^{(i)}$: $\mathbb{R}$
the final multiplicand $x^{(i)}$: $\mathbb{R}^{n+1}$
$\delta = \frac{1}{m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
$\delta = \begin{bmatrix} \delta_{0} \\ \delta_{1} \\ \delta_{2} \\ ... \\ \delta_{n} \end{bmatrix}$
$\delta_{0} = \frac{1}{m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}$
$\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} = [(h_{\theta}(x^{(1)}) - y^{(1)}) \cdot x^{(1)}] + [(h_{\theta}(x^{(2)}) - y^{(2)}) \cdot x^{(2)}] + [(h_{\theta}(x^{(3)}) - y^{(3)}) \cdot x^{(3)}] + ...$
$x^{(i)} = \begin{bmatrix} x_{0}^{(i)} \\ x_{1}^{(i)} \\ x_{2}^{(i)} \\ ... \\ x_{n}^{(i)} \end{bmatrix}$
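one way to write $\delta$ and the update in NumPy (a sketch, assuming a design matrix $X$ whose rows are $(x^{(i)})^{T}$ with $x_{0} = 1$, and a target vector $y$); $X^{T}(X\theta - y)$ stacks the $n+1$ sums $\sum_{i}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$ into one vector:
In [ ]:
import numpy as np
def vectorized_gradient_descent(X, y, alpha=0.01, iterations=1000):
    # X: m x (n+1) design matrix (first column all ones); y: m-vector of targets
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        delta = (1.0 / m) * X.T.dot(X.dot(theta) - y)  # the delta vector defined above
        theta = theta - alpha * delta                  # simultaneous vector update
    return theta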
if features are on a similar scale, gradient descent can converge more quickly (contours of the cost function $J$ will become much less skewed):
for example:
$x_{1}$: size (0-2000 feet^2)
$x_{2}$: # of bedrooms (1-5)
get every feature into approximately a ($-1 \leq x_{i} \leq 1$) range
$x_{1} = \frac{\text{size (feet}^{2}\text{)}}{2000}$
$x_{2} = \frac{\text{# of bedrooms}}{5}$
another alternative:
replace $x_{i}$ with $x_{i} - \mu_{i}$ to make features have approximately zero mean (this is not applied to $x_{0} = 1$)
for example:
$x_{1} = \frac{\text{size}-1000}{2000}$
$x_{2} = \frac{\text{#bedrooms}-2}{5}$
thus, creating new ranges: $-0.5 \leq x_{i} \leq 0.5$
could also use $x_{i} := \frac{x_{i} - \mu_{i}}{s_{i}}$, where $s_{i}$ is the range (max $-$ min) or the standard deviation of feature $i$
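a minimal sketch of mean normalization in NumPy (using the size and # of bedrooms columns from the earlier table; the $x_{0} = 1$ column would be left unscaled):
In [ ]:
import numpy as np
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
mu = X.mean(axis=0)   # per-feature mean
s = X.std(axis=0)     # per-feature standard deviation (the range max - min also works)
X_norm = (X - mu) / s
print(X_norm)         # each column now has roughly zero mean and unit scale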
Plot $J(\theta)$ as a function of the number of iterations
$J(\theta)$ should decrease after every iteration, flattening out eventually when gradient descent converges
the number of iterations necessary for convergence varies immensely
example of an automatic convergence test: declare convergence if $J(\theta)$ decreases by less than some small value $\epsilon$ (e.g., $10^{-3}$) in one iteration
if $J(\theta)$ increases, or alternately increases and decreases, try using a smaller $\alpha$
for sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration
but if $\alpha$ is too small, gradient descent can be slow to converge
to choose $\alpha$, try a range of values spaced roughly 3-fold apart (about a factor of 10 every two steps): $0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1$
In [4]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# THIS IS JUST A VISUAL EXAMPLE
x = np.linspace(10,400)
y = (1/x)
plt.plot(x,y)
plt.legend([r"$J(\theta)$"])
plt.show()
sometimes, it's best to combine (defining new) features based on insight into a particular problem
for example:
$h_{\theta}(x) = \theta_{0} + \theta_{1} \times \text{frontage} + \theta_{2} \times \text{depth}$
$x_{1}=$ frontage
$x_{2}=$ depth
instead, area $x = $ $frontage \times depth$
$h_{\theta}(x) = \theta_{0} + \theta_{1}x$
for example, for the housing data, a straight line might not fit the data very well, while a quadratic model ($\theta_{0} + \theta_{1}x + \theta_{2}x^{2}$) doesn't make sense because it eventually comes back down (implying that price decreases for large enough sizes)
thus, perhaps a cubic function ($\theta_{0} + \theta_{1} x + \theta_{2}x^{2} + \theta_{3}x^{3}$) would be more appropriate:
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} = \theta_{0} + \theta_{1}(size) + \theta_{2}(size)^{2} + \theta_{3}(size)^{3}$
$x_{1} = (size)$
$x_{2} = (size)^{2}$
$x_{3} = (size)^{3}$
if choosing features like this, feature scaling becomes increasingly important (e.g., size = 1-1000, size^2 = 1-1,000,000, size^3 = 10^9)
another reasonable choice would be a square-root function:
$h_{\theta}(x) = \theta_{0} + \theta_{1}(size) + \theta_{2}\sqrt{(size)}$
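a sketch of building polynomial and square-root features (the size values here are assumptions for illustration; note the scale differences, which is why feature scaling matters):
In [ ]:
import numpy as np
size = np.array([500.0, 1000.0, 1500.0, 2000.0])  # assumed house sizes
# cubic-model features: 1, size, size^2, size^3
X_cubic = np.column_stack([np.ones_like(size), size, size**2, size**3])
# square-root-model features: 1, size, sqrt(size)
X_sqrt = np.column_stack([np.ones_like(size), size, np.sqrt(size)])
print(X_cubic)
print(X_sqrt)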
In [36]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# THIS IS JUST A VISUAL EXAMPLE
x_i = np.linspace(1,1000)
x_i = np.random.choice(x_i, 10)
y_i = np.log(x_i**3)
plt.scatter(x_i, y_i, marker='x', color='r')
x_i = np.linspace(1,1000)
x_i = np.random.choice(x_i, 10)
y_i = np.log(x_i**2.8)
plt.scatter(x_i, y_i, marker='x', color='r')
x_i = np.linspace(1,1000)
x_i = np.random.choice(x_i, 10)
y_i = np.log(x_i**2.9)
plt.scatter(x_i, y_i, marker='x', color='r')
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # tick labels along the bottom edge are off
plt.xlabel('size (x)')
plt.ylabel('price (y)')
plt.show()
the normal equation: solves for the optimal parameter values in one step and works for many linear regression problems
solves for $\theta$ analytically (as opposed to iteratively, as with gradient descent)
if using the normal equation method, feature scaling is not necessary
intuition for a 1-dimensional real valued number/scalar ($\theta \in \mathbb{R}$):
$J(\theta) = a\theta^{2} + b\theta + c$
to minimize this cost function, take the derivative and then set the derivative to zero:
$\frac{d}{d\theta}J(\theta)=0$, then solve for $\theta$
when $\theta$ is a ($n+1$)-dimensional parameter vector ($\theta \in \mathbb{R}^{n+1}$):
and the cost function $J$ is a function of the vector $\theta$:
$J(\theta_{0},\theta_{1},...,\theta_{n}) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
to minimize this cost function, take the partial derivative with respect to every parameter $\theta_{j}$, and then set each partial derivative to zero:
$\frac{\partial}{\partial\theta_{j}}J(\theta)=0$, for every $j$, then solve for $\theta_{0},\theta_{1},...,\theta_{n}$
to implement the normal equation method:
for example, place all features in a $m\times(n+1)$ matrix $X$ (n + 1 because of extra feature $x_{0}$):
$X =
\begin{bmatrix}
1 && 2104 && 5 && 1 && 45 \\
1 && 1416 && 3 && 2 && 40 \\
1 && 1534 && 3 && 2 && 30 \\
1 && 852 && 2 && 1 && 36
\end{bmatrix}$
and all $y$ values into a $m$-dimensional vector $y$:
$y =
\begin{bmatrix}
460 \\
232 \\
315 \\
178
\end{bmatrix}$
now, to get the value of $\theta$ that minimizes the cost function:
$\theta = (X^{T}X)^{-1}X^{T}y$
more generally, $m$ examples $(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})$, $n$ features:
$x^{(i)} =
\begin{bmatrix}
x_{0}^{(i)} \\
x_{1}^{(i)} \\
x_{2}^{(i)} \\
... \\
x_{n}^{(i)}
\end{bmatrix} \in \mathbb{R}^{n+1}$
to construct the ($m\times(n+1)$) design matrix $X$:
$X =
\begin{bmatrix}
---(x^{(1)})^{T}--- \\
---(x^{(2)})^{T}--- \\
---(x^{(3)})^{T}--- \\
... \\
---(x^{(m)})^{T}---
\end{bmatrix}$
for example, if just one feature:
$x^{(i)} =
\begin{bmatrix}
1 \\
x_{1}^{(i)}
\end{bmatrix} \rightarrow
X =
\begin{bmatrix}
1 && x_{1}^{(1)} \\
1 && x_{1}^{(2)} \\
... \\
1 && x_{1}^{(m)}
\end{bmatrix}$ (an $m\times2$-dimensional matrix)
$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ ... \\ y^{(m)} \end{bmatrix}$ (a $m$-dimensional vector)
now, $\theta = (X^{T}X)^{-1}X^{T}y$
$(X^{T}X)^{-1}$: inverse of $X^{T}X$
what if $X^{T}X$ is non-invertible (singular/degenerate)?
it's rare, but most programming environments can still perform the computation (in Octave, pinv() vs. inv());
pinv() is short for pseudo-inverse
if $X^{T}X$ is non-invertible, that's usually because of:
1) redundant features (linearly dependent)
for example, $x_{1} = $ size in feet^2, $x_{2} = $ size in m^2:
since 1 meter $\approx$ 3.28 feet
then, always, $x_{1} = (3.28)^{2}x_{2}$
if two features are related via a linear equation, then $X^{T}X$ will be non-invertible (can be proven in linear algebra)
2) too many features (e.g., $m \leq n$)
for example, $m=10$ and $n=100$:
trying to fit parameter vector $\theta \in \mathbb{R}^{101}$ (i.e., 101 parameters); simply, it's too little data
either delete some features or use regularization
In [87]:
import numpy as np
# number of training examples
m = 4
# number of features (not including x_0)
n = 4
#x_0 = np.array([[1],
# [1],
# [1],
# [1]])
#x_1 = np.array([[2104],
# [1416],
# [1534],
# [852]])
#x_2 = np.array([[5],
# [3],
# [3],
# [2]])
#x_3 = np.array([[1],
# [2],
# [2],
# [1]])
#x_4 = np.array([[45],
# [40],
# [30],
# [36]])
# training examples:
xindex1 = np.array([[1],
[2104],
[5],
[1],
[45]])
xindex2 = np.array([[1],
[1416],
[3],
[2],
[40]])
xindex3 = np.array([[1],
[1534],
[3],
[2],
[30]])
xindex4 = np.array([[1],
[852],
[2],
[1],
[36]])
# y values
y = np.array([[460],
[232],
[315],
[178]])
# design matrix
X = np.zeros(shape=(m,(n+1)), dtype=np.int16)
X[0,:] = np.transpose(xindex1)
X[1,:] = np.transpose(xindex2)
X[2,:] = np.transpose(xindex3)
X[3,:] = np.transpose(xindex4)
#print X
#print X.shape
# X.T = X transpose
#print X.T
#print X.T.shape
#print np.dot(X.T,X)
#print np.linalg.pinv(np.dot(X.T,X))
#print np.dot(X.T,y)
#print np.dot(np.linalg.pinv(np.dot(X.T,X)), np.dot(X.T,y))
theta = np.dot(np.linalg.pinv(np.dot(X.T,X)), np.dot(X.T,y))
print(theta)
In [140]:
# gradient descent attempt 1
import numpy as np
# number of training examples
m = 4
# number of features (not including x_0)
n = 4
# training examples:
xindex1 = np.array([[1],
[2104],
[5],
[1],
[45]])
xindex2 = np.array([[1],
[1416],
[3],
[2],
[40]])
xindex3 = np.array([[1],
[1534],
[3],
[2],
[30]])
xindex4 = np.array([[1],
[852],
[2],
[1],
[36]])
# y values
y = np.array([[460],
[232],
[315],
[178]])
#print y
#print y.shape
theta = np.zeros(((n+1),1),dtype=np.int16)
#print theta.shape
#theta_transpose = np.transpose(theta)
#print theta_transpose
#hypothesis = theta_transpose.dot(x)
# design matrix
#X = np.concatenate((xindex1.T,xindex2.T,xindex3.T,xindex4.T))
X = np.concatenate((xindex1,xindex2,xindex3,xindex4), axis=1)
#print X
#print X.shape
#Xindex0 = X[:,0]
#print Xindex0.T
#Xindex0.shape = (5,1)
#print Xindex0.shape
#print Xindex0
#print theta.T.dot(X[:,0])
#Xtranspose = np.transpose(X)
alpha = 0.1
temp = np.zeros(((n+1),1),dtype=np.int16)
#for j in xrange(0,n):
#Xsuperscripti
#temp[j] = theta[j] - alpha * (1/m) * theta.T *
simultaneously update $\theta_{j}$ for $j=0,1,...,n$
$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$
$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{j}^{(i)}$
where $j = 0$:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{0}^{(i)}$
where $m = 4$:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{4}\sum\limits_{i=1}^{4}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{0}^{(i)}$
expand summation:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{4} \left( \left[(\theta^{T}x^{(1)} - y^{(1)}) \cdot x_{0}^{(1)}\right] + \left[(\theta^{T}x^{(2)} - y^{(2)}) \cdot x_{0}^{(2)}\right] + \left[(\theta^{T}x^{(3)} - y^{(3)}) \cdot x_{0}^{(3)}\right] + \left[(\theta^{T}x^{(4)} - y^{(4)}) \cdot x_{0}^{(4)}\right] \right)$
where $\theta$, $x$, and $y$ are vectors:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{4} \left( \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\2104\\5\\1\\45\end{bmatrix} - 460) \cdot 1\right] + \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\1416\\3\\2\\40\end{bmatrix} - 232) \cdot 1\right] + \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\1534\\3\\2\\30\end{bmatrix} - 315) \cdot 1\right] + \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\852\\2\\1\\36\end{bmatrix} - 178) \cdot 1\right] \right)$
each of the $\theta^{T}x^{(i)}$ operations results in a single value; the differences from $y^{(i)}$ are summed and averaged, multiplied by the learning rate $\alpha$, and the result is subtracted from the relevant $\theta$ parameter
In [251]:
# gradient descent attempt 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# number of training examples
m = 4
# number of features (not including x_0)
n = 4
# learning rate
# NOTE: with these unscaled features, alpha = 0.1 makes J(theta) grow without bound;
# rescale the features or use a much smaller alpha (e.g., around 1e-7) to see J decrease
alpha = 0.1
# training samples matrix
X = np.array([[1,2104,5,1,45],
[1,1416,3,2,40],
[1,1534,3,2,30],
[1,852, 2,1,36]])
# y value vector
y = np.array([[460],
[232],
[315],
[178]])
# theta parameters
theta = np.zeros(((n+1),1))
#print theta
#print theta.T
#x = (X[0])
#x.shape = (((n+1),1))
#print x
#print x.shape
#print theta.T.dot(x)
iterations = 10
J = np.zeros((iterations,1))
def gradient_descent(it):
    # temp theta parameters (for the simultaneous update)
    temp = np.zeros(((n+1),1))
    # deltas (one per parameter)
    delta = np.zeros(((n+1),1))
    # cost J(theta) for the current theta, stored per iteration
    temp_J = 0.0
    for i in range(0,m):
        x_i = X[i].reshape(((n+1),1))
        temp_J += ((theta.T.dot(x_i)) - y[i]).item()**2
    J[it] = (1.0/(2*m)) * temp_J
    # compute delta_j for every parameter j = 0, ..., n
    for j in range(0,(n+1)):
        # training-example summation and average to create delta_j
        temp_delta = 0.0
        for i in range(0,m):
            x_i = X[i].reshape(((n+1),1))
            temp_delta += ((theta.T.dot(x_i)) - y[i]).item() * x_i[j].item()
        delta[j] = (1.0/m) * temp_delta
        temp[j] = theta[j] - (alpha * delta[j])
    # simultaneously update theta_0 through theta_n
    for j in range(0,(n+1)):
        theta[j] = temp[j]
for it in range(iterations):
    gradient_descent(it)
print(theta)
x = np.arange(1,(iterations+1))
x.shape = (((iterations),1))
plt.plot(x,J)
plt.legend([r"$J(\theta)$"])
plt.show()
$m$ training examples, $n$ features
gradient descent disadvantages: need to choose $\alpha$; needs many iterations
gradient descent advantages: works well even when the number of features $n$ is large
normal equation advantages: no need to choose $\alpha$; no need to iterate
normal equation disadvantages: need to compute $(X^{T}X)^{-1}$, which is roughly $O(n^{3})$, so it becomes slow when $n$ is very large (e.g., on the order of $10^{4}$ or more)