first: Training set -> Learning Algorithm -> h (hypothesis: maps x's to y's)
then: x -> h -> y (an estimated value for y)
how to represent h: $h_{\theta}(x) = \theta_{0} + \theta_{1}x$
shorthand: $h(x)$
h is simply a linear function
univariate linear regression / linear regression with one variable (x)
In [14]:
from IPython.display import Math
import matplotlib.pyplot as plt
import numpy as np
#from pylab import * #MATLAB-like API
%matplotlib inline
## MATLAB-like API
#x = linspace(0, 5, 10)
#y = x ** 2
#figure()
#plot(x, y, 'r')
#xlabel('x')
#ylabel('y')
#title('title')
#show()
## OO API
fig = plt.figure()
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)
x = np.array([1.0, 2.0, 3.5, 4.0, 5.2])
y = np.array([1.5, 3.0, 3.5, 4.3, 5.9])
axes.plot(x, x*1.2, label=r"$h(x) = \theta_{0} + \theta_{1}x$")
axes.scatter(x, y, marker='x', color='r')
axes.legend(loc=2)
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('title')
plt.show()
Squared-Error Cost Function: helps to figure out how to fit the best possible straight line to data
how to choose $\theta_i\text{'s}$:
choose parameters $\theta_{0}, \theta_{1}$ so that $h_{\theta}(x)$ is close to $y$ for the training examples $(x,y)$
formally, the (optimization) goal/objective is to minimize the cost function: $\underset{\theta_{0},\theta_{1}}{\text{minimize}}\ J(\theta_{0},\theta_{1})$:
use the average of the sum of the squared difference to find the smallest values for $\theta_{0},\theta_{1}$
$J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$, where $h_{\theta}(x^{(i)}) = \theta_{0} + \theta_{1}x^{(i)}$
simplified version:
$h_{\theta}(x) = \theta_{1}x$
(as if $\theta_{0} = 0$)
$h_{\theta}(x^{(i)}) = \theta_{1}x^{(i)}$
$\underset{\theta_{1}}{\text{minimize}}\ J(\theta_{1})$
hypothesis function: $h_{\theta}(x)$ (for a fixed value of $\theta_{1}$, this is a function of $x$)
cost function: $J(\theta_{1})$ (is a function of the parameter $\theta_{1}$)
each value of $\theta_{1}$ corresponds to a different hypothesis
$J(\theta_{1}) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
$= \frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{1}x^{(i)} - y^{(i)})^2$
$= \frac{1}{2m}(0^2 + 0^2 + 0^2) = 0$
In [27]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 4)
theta_1 = 1
h_theta = x * theta_1
x_i = np.array([1, 2, 3])
y_i = np.array([1, 2, 3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.plot(x, h_theta)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.yticks(np.arange(min(x), max(x)+1, 1.0))
plt.xlabel('x')
plt.ylabel('y')
plt.title(r"$h_{\theta}(x)$")
plt.show()
so, when $\theta_{1} = 1$, $h_{\theta}(x^{(i)}) = y^{(i)}$, which explains the $0$'s
and for $\theta_{1} = 1$, $J(\theta_{1}) = J(1) = 0$
In [25]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_i = np.array([1])
y_i = np.array([0])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.xticks(np.arange(0, 4.5, 0.5))
plt.yticks(np.arange(0, 5, 1))
plt.xlabel(r"$\theta_{1}$")
plt.ylabel(r"$J(\theta_{1})$")
plt.title(r"$J(\theta_{1})$")
plt.show()
In [26]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 4)
theta_1 = 0.5
h_theta = x * theta_1
x_i = np.array([1, 2, 3])
y_i = np.array([1, 2, 3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.plot(x, h_theta)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.yticks(np.arange(min(x), max(x)+1, 1.0))
plt.xlabel('x')
plt.ylabel('y')
plt.title(r"$h_{\theta}(x)$")
plt.show()
In [24]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_i = np.array([1, 0.5])
y_i = np.array([0, 0.58])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.xticks(np.arange(0, 4.5, 0.5))
plt.yticks(np.arange(0, 5, 1))
plt.xlabel(r"$\theta_{1}$")
plt.ylabel(r"$J(\theta_{1})$")
plt.title(r"$J(\theta_{1})$")
plt.show()
In [29]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0, 4)
theta_1 = 0
h_theta = x * theta_1
x_i = np.array([1, 2, 3])
y_i = np.array([1, 2, 3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.plot(x, h_theta, linewidth=2.0)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.yticks(np.arange(min(x), max(x)+1, 1.0))
plt.xlabel('x')
plt.ylabel('y')
plt.title(r"$h_{\theta}(x)$")
plt.show()
In [30]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_i = np.array([1, 0.5, 0])
y_i = np.array([0, 0.58, 2.3])
plt.scatter(x_i, y_i, marker='x', color='r')
plt.xlim([0, 4])
plt.ylim([0, 4])
plt.xticks(np.arange(0, 4.5, 0.5))
plt.yticks(np.arange(0, 5, 1))
plt.xlabel(r"$\theta_{1}$")
plt.ylabel(r"$J(\theta_{1})$")
plt.title(r"$J(\theta_{1})$")
plt.show()
clearly, the optimization objective is met when $\theta_{1} = 1$
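a quick numerical check of the values above (a minimal sketch, assuming the same toy training set $\{(1,1),(2,2),(3,3)\}$ used in the plots):
In [ ]:
import numpy as np
# toy training set from the plots above
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)
def J(theta_1):
    # squared-error cost for the simplified hypothesis h(x) = theta_1 * x
    return (1.0 / (2 * m)) * np.sum((theta_1 * x - y) ** 2)
for theta_1 in [0.0, 0.5, 1.0]:
    print(theta_1, J(theta_1))
# prints roughly 2.33, 0.58, and 0.0 -- matching the scattered (theta_1, J) points above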
have some function $J(\theta_{0},\theta_{1},...,\theta_{n})$
want $\underset{\theta_{0},\theta_{1},...,\theta_{n}}{\text{min}}$ $J(\theta_{0},\theta_{1},...,\theta_{n})$
start with some $\theta_{0},\theta_{1},...,\theta_{n}$ (e.g., set all to $0$)
keep changing $\theta_{0},\theta_{1},...,\theta_{n}$ to reduce $J(\theta_{0},\theta_{1},...,\theta_{n})$ until we hopefully converge on/reach a minimum
property of gradient descent: depending on the starting point, it can arrive at different local optima
algorithm:
repeat until convergence {
$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1}) \text{, (for } j = 0 \text{ and } j = 1 \text{)}$
}
in other words, update $\theta_{j}$ by subtracting $\alpha \frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1})$
$\alpha$ = learning rate (controls the size of the steps taken when updating $\theta_{j}$)
$\frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1})$ = derivative
simultaneous update:
$\text{temp0} := \theta_{0} - \alpha \frac{\partial}{\partial\theta_{0}} J(\theta_{0},\theta_{1})$
$\text{temp1} := \theta_{1} - \alpha \frac{\partial}{\partial\theta_{1}} J(\theta_{0},\theta_{1})$
$\theta_{0} := \text{temp0}$
$\theta_{1} := \text{temp1}$
$\underset{\theta_{1}}{\text{min}}$ $J(\theta_{1})$, $\theta_{1} \in \mathbb{R}$
$\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}} J(\theta_{1})$
example 1.1:
$f(x) = x^{2}$
$f'(x) = 2x$
starting point: $(8,64)$
slope at point: $m = f'(8) = 2(8) = 16$
tangent line (via point-slope): $y - y_{1} = m(x - x_{1}) \rightarrow y - 64 = 16(x - 8) \rightarrow y = 16x - 64$
slope is a positive number (so $\theta_{1}$ will be decreased)
$\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}} J(\theta_{1}) = \theta_{1} - \alpha (\text{positive_number})$
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-10,10,100)
y = x**2
plt.scatter(8,64, color='r')
plt.plot(x, y)
plt.legend([r"$J(\theta_{1})$"])
# theta_1 start at x = 8
tangent_line_x = np.linspace(5,10,5)
tangent_line_y = (16*tangent_line_x - 64)
plt.plot(tangent_line_x, tangent_line_y, color='r')
plt.xticks([8], [r"$\theta_{1}$"])
plt.annotate('min', xy=(0,0), xytext=(0,20), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.annotate('', xy=(2,-10), xytext=(8,-10), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.show()
example 1.2:
$f(x) = x^{2}$
$f'(x) = 2x$
starting point: $(-8,64)$
slope at point: $m = f'(-8) = 2(-8) = -16$
tangent line (via point-slope): $y - y_{1} = m(x - x_{1}) \rightarrow y - 64 = -16(x - (-8)) \rightarrow y = -16x - 128 + 64 = -16x - 64$
In [110]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-10,10,100)
y = x**2
plt.scatter(-8,64, color='r')
plt.plot(x, y)
plt.legend([r"$J(\theta_{1})$"])
# theta_1 start at x = -8
tangent_line_x = np.linspace(-5,-10,5)
tangent_line_y = (-16*tangent_line_x - 64)
plt.plot(tangent_line_x, tangent_line_y, color='r')
plt.xticks([-8], [r"$\theta_{1}$"])
plt.annotate('min', xy=(0,0), xytext=(0,20), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.annotate('', xy=(-2,-10), xytext=(-8,-10), arrowprops=dict(facecolor='black', width=1, headwidth=5))
plt.show()
slope is a negative number (so $\theta_{1}$ will be increased)
$\theta_{1} := \theta_{1} - \alpha \frac{d}{d\theta_{1}} J(\theta_{1}) = \theta_{1} - \alpha (\text{negative_number})$
if $\alpha$ is too small, gradient descent can be slow
if $\alpha$ is too large, gradient descent can overshoot the minimum, fail to converge (i.e., of a series, "approximate in the sum of its terms toward a definite limit"), or even diverge (i.e., of a series, "increase indefinitely as more terms are added")
since the derivative term shrinks as $\theta_{j}$ approaches a local minimum, gradient descent automatically takes smaller steps (so there is no need to decrease $\alpha$ over time)
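a tiny sketch of how $\alpha$ affects the update (assuming the $J = \theta_{1}^{2}$ curve from examples 1.1/1.2, so the derivative is $2\theta_{1}$):
In [ ]:
def run(alpha, steps=10, theta_1=8.0):
    # repeatedly apply theta_1 := theta_1 - alpha * dJ/dtheta_1, with J = theta_1**2
    for _ in range(steps):
        theta_1 = theta_1 - alpha * 2 * theta_1
    return theta_1
print(run(0.01))  # too small: still far from the minimum at 0 after 10 steps
print(run(0.5))   # reasonable: jumps straight to the minimum here
print(run(1.1))   # too large: |theta_1| grows every step (divergence)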
$\frac{\partial}{\partial\theta_{j}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{j}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\right] = \frac{\partial}{\partial\theta_{j}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right]$
partial derivatives for $\theta_{0}$ case and $\theta_{1}$ case:
$j = 0: \frac{\partial}{\partial\theta_{0}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{0}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right]=$
$= \left[0 \cdot \sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right] + \left[\frac{1}{2m} \cdot \sum\limits_{i=1}^{m}\left(2[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}] \cdot [1 + 0 + 0]\right)\right] = \frac{1}{m} \sum\limits_{i=1}^{m}[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}]$
$ = \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]$
$j = 1: \frac{\partial}{\partial\theta_{1}} J(\theta_{0},\theta_{1}) = \frac{\partial}{\partial\theta_{1}} \left[\frac{1}{2m}\sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right]=$
$= \left[0 \cdot \sum\limits_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^{2}\right] + \left[\frac{1}{2m} \cdot \sum\limits_{i=1}^{m}\left(2[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}] \cdot [0 + x^{(i)} + 0]\right)\right] = \frac{1}{m} \sum\limits_{i=1}^{m}[\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}] \cdot x^{(i)}$
$= \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}] \cdot x^{(i)}$
(sometimes called "batch" gradient descent: in every step of gradient descent, all training examples are used; other non-"batch" versions of gradient descent exist)
repeat until convergence (update simultaneously) {
$\theta_{0} := \theta_{0} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]$
$\theta_{1} := \theta_{1} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}] \cdot x^{(i)}$
}
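a minimal sketch of these two update rules on a small made-up data set (the data and $\alpha$ are assumptions for illustration only):
In [ ]:
import numpy as np
# made-up training data for h(x) = theta_0 + theta_1 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
m = len(x)
alpha = 0.05
theta_0, theta_1 = 0.0, 0.0
for _ in range(2000):
    h = theta_0 + theta_1 * x
    grad_0 = (1.0 / m) * np.sum(h - y)
    grad_1 = (1.0 / m) * np.sum((h - y) * x)
    # simultaneous update: both gradients are computed before either parameter changes
    theta_0, theta_1 = theta_0 - alpha * grad_0, theta_1 - alpha * grad_1
print(theta_0, theta_1)  # approaches the least-squares line for this data (about 1.0 and 0.85)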
the cost function for linear regression is always "bowl-shaped" (i.e., a convex function); thus, it has only one global/local optimum
In [61]:
# THIS IS JUST A VISUAL EXAMPLE OF A CONVEX FUNCTION
# (i.e., it is not based on the regression/cost functions detailed here)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
%matplotlib inline
#fig = plt.figure()
ax = plt.axes(projection='3d')
x = np.linspace(-10,10)
y = np.linspace(-10,10)
x,y = np.meshgrid(x,y)
z = (x**2) + (y**2)
#ax.set_zlim(0,100)
ax.plot_surface(x,y,z, rstride=2, cstride=2, cmap=cm.jet)
plt.show()
solving the cost function with one computation (i.e., without an iterative algorithm like gradient descent and, thus, without the need for $\alpha$):
with more features ($x_{1}, x_{2}, ..., x_{n}$), it becomes difficult to plot the data and notationally problematic
linear algebra notation is better suited
for example, with 4 features:
$X = \begin{bmatrix} 2104 & 5 & 1 & 45 \\ 1416 & 3 & 2 & 40 \\ 1534 & 3 & 2 & 30 \\ 852 & 2 & 1 & 36 \end{bmatrix}$
$y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 172 \end{bmatrix}$
matrix: rectangular array of numbers
$\begin{bmatrix} 1402 & 191 \\ 949 & 1437 \\ 1371 & 821 \end{bmatrix}$
dimension of matrix: number of rows $\times$ number of columns (e.g., a $3 \times 2$ matrix, or this matrix is an element of the set $\mathbb{R}^{3 \times 2}$)
matrix elements
$A = \begin{bmatrix} 1402 & 191 \\ 949 & 1437 \\ 1371 & 821 \end{bmatrix}$
$A_{ij} =$ "$i,j$ entry" in the $i$-th row, $j$-th column
[commas must be used for indices when a matrix has more than 9 rows/columns]
$A_{11} = 1402$
$A_{12} = 191$
$A_{32} = 821$
$A_{43} = \text{undefined}$
vector: an $n \times 1$ matrix
$y = \begin{bmatrix} 1402 \\ 949 \\ 1371 \\ 178 \end{bmatrix}$
$n = 4$
$n \times 1$ matrix
4-dimensional vector
a vector in the set $\mathbb{R}^{4}$
$y_{i} = i$-th element
$y_{1} = 1402$
$y_{2} = 949$
$y_{3} = 1371$
1-indexed vs. 0-indexed
$y = \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \end{bmatrix}$
$y = \begin{bmatrix} y_{0} \\ y_{1} \\ y_{2} \\ y_{3} \end{bmatrix}$
matrix addition (can only add matrices of the same dimension; result is a matrix of the same dimension):
$\begin{bmatrix}1&&0\\2&&5\\3&&1\end{bmatrix} + \begin{bmatrix}4 && 0.5 \\2 && 5 \\0 && 1\end{bmatrix} = \begin{bmatrix}5 && 0.5 \\4 && 10 \\3 && 2\end{bmatrix}$
scalar (real number) multiplication (result is a matrix of the same dimension):
$3\times\begin{bmatrix}1&&0\\2&&5\\3&&1\end{bmatrix}=\begin{bmatrix}3&&0\\6&&15\\9&&3\end{bmatrix}$
scalar division:
$\begin{bmatrix}4&&0\\6&&3\end{bmatrix}/4=\begin{bmatrix}1&&0\\\frac{3}{2}&&\frac{3}{4}\end{bmatrix} = \frac{1}{4}\begin{bmatrix}4&&0\\6&&3\end{bmatrix}$
the commutative property applies to scalar multiplication but not to scalar division
combination of operations, follow normal precedence rules
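these elementwise operations map directly to NumPy (a small sketch using the matrices above):
In [ ]:
import numpy as np
A = np.array([[1.0, 0.0],
              [2.0, 5.0],
              [3.0, 1.0]])
B = np.array([[4.0, 0.5],
              [2.0, 5.0],
              [0.0, 1.0]])
print(A + B)  # matrix addition (same dimensions required)
print(3 * A)  # scalar multiplication
print(A / 4)  # scalar division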
matrix-vector multiplication
$\begin{bmatrix}1&&3\\4&&0\\2&&1\end{bmatrix} \begin{bmatrix}1\\5\end{bmatrix} = \begin{bmatrix}1 \times 1 + 3 \times 5 = 16 \\ 4 \times 1 + 0 \times 5 = 4 \\ 2 \times 1 + 1 \times 5 = 7 \end{bmatrix}$
a $3 \times 2$ matrix multiplied by a $2 \times 1$ vector/matrix results in a $3 \times 1$ vector/matrix
$A \times x = y$
$m \times n$ $\cdot$ $n \times 1$ $=$ $m \times 1$ (m-dimensional vector)
($n$ has to match $n$; the number of columns in the first matrix has to match the number of rows in the second matrix)
to get $y_{i}$, multiply each element (i.e., elementwise) of $A$'s $i$-th row with each element of vector $x$ (1:1, $A_{in} \cdot x_{n}$), then sum the products
[can simply pretend to stand the first matrix's rows vertically, multiplying the (then) horizontally adjacent elements]
suppose the hypothesis: $h_{\theta}(x) = -40 + 0.25x$
and (for $x$) the house sizes: 2104, 1416, 1534, 852
$\begin{bmatrix} 1 && 2104 \\ 1 && 1416 \\ 1 && 1534 \\ 1 && 852 \end{bmatrix} \times \begin{bmatrix} -40 \\ 0.25 \end{bmatrix}= \begin{bmatrix} -40 \times 1 + 0.25 \times 2104 = h_{\theta}(2104) \\ -40 \times 1 + 0.25 \times 1416 = h_{\theta}(1416) \\ -40 \times 1 + 0.25 \times 1534 = h_{\theta}(1534) \\ -40 \times 1 + 0.25 \times 852 = h_{\theta}(852) \end{bmatrix}$
a $4 \times 2$ matrix times a $2 \times 1$ vector results in a $4 \times 1$ vector
prediction = DataMatrix * parameters
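a minimal NumPy version of this prediction-by-matrix-multiplication (using the hypothesis and house sizes above):
In [ ]:
import numpy as np
# data matrix: a leading column of ones for the intercept term, then the house sizes
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
theta = np.array([-40.0, 0.25])  # h(x) = -40 + 0.25x
print(X.dot(theta))  # all four predictions from one matrix-vector multiplication: [486., 314., 343.5, 173.]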
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 1 && 3 \\ 0 && 1 \\ 5 && 2 \end{bmatrix}=$
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 1 \\ 0 \\ 5 \end{bmatrix}= \begin{bmatrix} 1 \times 1 + 3 \times 0 + 2 \times 5 = 11 \\ 4 \times 1 + 0 \times 0 + 1 \times 5 = 9 \end{bmatrix}$
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix}= \begin{bmatrix} 1 \times 3 + 3 \times 1 + 2 \times 2 = 10 \\ 4 \times 3 + 0 \times 1 + 1 \times 2 = 14 \end{bmatrix}$
$\begin{bmatrix} 1 && 3 && 2 \\ 4 && 0 && 1 \end{bmatrix} \times \begin{bmatrix} 1 && 3 \\ 0 && 1 \\ 5 && 2 \end{bmatrix}= \begin{bmatrix} 11 && 10 \\ 9 && 14 \end{bmatrix}$
a $2 \times 3$ matrix times a $3 \times 2$ matrix results in a $2 \times 2$ matrix
a $m \times n$ matrix times a $n \times o$ matrix results in a $m \times o$ matrix
($n$'s must match [i.e., number of first matrix columns must match number of second matrix rows])
$A \times B = C$
to get the $j$-th column of $C$, multiply $A$ by the $j$-th column of $B$ (equivalently, $C_{ij}$ is the sum of the elementwise products of the $i$-th row of $A$ and the $j$-th column of $B$)
suppose three competing hypotheses:
$h_{\theta}(x) = -40 + 0.25x$
$h_{\theta}(x) = 200 + 0.1x$
$h_{\theta}(x) = -150 + 0.4x$
and (for $x$) the house sizes: 2104, 1416, 1534, 852
$\begin{bmatrix} 1 && 2104 \\ 1 && 1416 \\ 1 && 1534 \\ 1 && 852 \end{bmatrix} \times \begin{bmatrix} -40 && 200 && -150 \\ 0.25 && 0.1 && 0.4 \end{bmatrix}= \begin{bmatrix} 486 && 410 && 692 \\ 314 && 342 && 416 \\ 344 && 353 && 464 \\ 173 && 285 && 191 \end{bmatrix}$
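the same idea in NumPy: one matrix-matrix multiplication evaluates every hypothesis on every house size (a sketch based on the numbers above):
In [ ]:
import numpy as np
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
# one column of (theta_0, theta_1) per competing hypothesis
Theta = np.array([[-40.0, 200.0, -150.0],
                  [ 0.25,   0.1,    0.4]])
print(X.dot(Theta))  # 4 x 3 result: column j holds the predictions of hypothesis j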
in general, not commutative (i.e., $A \times B \neq B \times A$)
dimensions might not even match ($m \times n$ and $n \times m$ results in $m \times m$, whereas $n \times m$ and $m \times n$ results in $n \times n)$
associative property does hold: $A \times (B \times C) = (A \times B) \times C$
identity matrix:
denoted $I$ or $I_{n \times n}$
examples ($1 \times 1$, $2 \times 2$, $3 \times 3$, $4 \times 4$):
$\begin{bmatrix} 1 \end{bmatrix}$
$\begin{bmatrix} 1 && 0 \\ 0 && 1 \end{bmatrix}$
$\begin{bmatrix} 1 && 0 && 0 \\ 0 && 1 && 0 \\ 0 && 0 && 1 \end{bmatrix}$
$\begin{bmatrix} 1 && 0 && 0 && 0 \\ 0 && 1 && 0 && 0 \\ 0 && 0 && 1 && 0 \\ 0 && 0 && 0 && 1 \end{bmatrix}$
for any matrix $A$:
$A \cdot I = I \cdot A = A$ (commutative property does hold)
($m \times n \cdot n \times n = m \times m \cdot m \times n = m \times n)$
(dimensions of identity matrix are implicit/dependent on context)
in the space of real numbers, the identity number is $1$
a number multiplied by its inverse results in $1$: $3(3^{-1}) = 1$
in the space of real numbers, not all numbers have an inverse (e.g., $0^{-1}$ is undefined); those that don't are called "singular" or "degenerate"
matrix inverse:
if $A$ is an $m \times m$ (i.e., a square) matrix, and if it has an inverse $A^{-1}$ (e.g., the zero matrix does not):
$A(A^{-1}) = A^{-1}(A) = I$
only square matrices (i.e., # of rows = # of columns) have inverses
$A = \begin{bmatrix} 3 && 4 \\ 2 && 16 \end{bmatrix}$
$A^{-1} = \begin{bmatrix} 0.4 && -0.1 \\ -0.05 && 0.075 \end{bmatrix}$
$A \times A^{-1} = \begin{bmatrix} 1 && 0 \\ 0 && 1 \end{bmatrix}= I_{2 \times 2}$
(inverses of matrices usually aren't computed by hand)
matrix transpose (operation)
(first row becomes first column; second row becomes second column...)
(also, imagine flipping/mirroring matrix along 45-degree diagonal axis)
$A = \begin{bmatrix} 1 && 2 && 0 \\ 3 && 5 && 9 \end{bmatrix}$
$A^{T} = \begin{bmatrix} 1 && 3 \\ 2 && 5 \\ 0 && 9 \end{bmatrix}$
let $A$ be a $m \times n$ matrix
let $B = A^{T}$
then $B$ is a $n \times m$ matrix
and $B_{ij} = A_{ji}$
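identity, inverse, and transpose in NumPy (a small sketch using the matrices above):
In [ ]:
import numpy as np
A = np.array([[3.0,  4.0],
              [2.0, 16.0]])
A_inv = np.linalg.inv(A)
print(A_inv)         # [[ 0.4, -0.1], [-0.05, 0.075]]
print(A.dot(A_inv))  # the 2 x 2 identity, up to floating-point error
print(np.eye(2))     # the identity matrix itself
B = np.array([[1.0, 2.0, 0.0],
              [3.0, 5.0, 9.0]])
print(B.T)           # transpose: a 3 x 2 matrix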
now, multiple features/variables: $x_{1}, x_{2}, x_{3}, ..., x_{n}$; the prediction/output is still $y$
notation:
$m =$ number of training examples
$n =$ number of features
$x^{(i)} =$ input (features) of the $i$-th training example (the superscript is an index into the training set)
$x_{j}^{(i)} =$ value of feature $j$ in the $i$-th training example
for example:
$x_{1}$: size (feet^2)
$x_{2}$: # of bedrooms
$x_{3}$: # of floors
$x_{4}$: age of home (years)
$y$: price ($1000)
$x^{(2)} = \begin{bmatrix} x_{1} = 1416 \\ x_{2} = 3 \\ x_{3} = 2 \\ x_{4} = 40 \end{bmatrix}$
thus, $x_{3}^{(2)} = 2$
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$
for example: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} + \theta_{4}x_{4}$
thus, as an example: $h_{\theta}(x) = 80 + 0.1x_{1} + 0.01x_{2} + 3x_{3} - 2x_{4}$
for convenience of notation, define $x_{0} = 1$, so that $x_{0}^{(i)} = 1$ (defining a zeroth feature); now a zero-indexed feature vector:
$x = \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix} \in \mathbb{R}^{n+1}$
$\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \theta_{2} \\ ... \\ \theta_{n} \end{bmatrix} \in \mathbb{R}^{n+1}$
thus: $h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$
$\theta^{T} = \begin{bmatrix}\theta_{0}&&\theta_{1}&&\theta_{2}&&...&&\theta_{n}\end{bmatrix}$
(a $1 \times (n+1)$ row vector)
use the inner/dot/scalar product of vectors
thus: $h_{\theta}(x) = \theta^{T}x = \begin{bmatrix}\theta_{0}&&\theta_{1}&&\theta_{2}&&...&&\theta_{n}\end{bmatrix} \times \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix}$
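a quick numeric check of the inner product form (a sketch assuming the example parameters above and the training example $x^{(2)}$ from the earlier table, with $x_{0} = 1$):
In [ ]:
import numpy as np
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # example parameters from above
x = np.array([1.0, 1416.0, 3.0, 2.0, 40.0])     # x_0 = 1, then the features of x^(2)
print(theta.dot(x))  # theta^T x = 80 + 141.6 + 0.03 + 6 - 80 = 147.63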
hypothesis: $h_{\theta}(x) = \theta^{T}x$
parameters: $\theta$ (a ($n+1$)-dimensional vector)
cost function: $J(\theta_{0},\theta_{1},...,\theta_{n}) = J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
new algorithm ($n \geq 1$):
gradient descent (simultaneously update $\theta_{j}$ for every $j = 0, ..., n$):
repeat {
$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial\theta_{j}} J(\theta) = \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}] \cdot x_{j}^{(i)}$
}
vectorized implementation:
$\theta := \theta - \alpha \delta$
$\theta$: $\mathbb{R}^{n+1}$
$-$: elementwise vector subtraction
$\alpha$: $\mathbb{R}$
$\delta$: $\mathbb{R}^{n+1}$
$h_{\theta}(x^{(i)}) - y^{(i)}$: $\mathbb{R}$
the final multiplicand $x^{(i)}$: $\mathbb{R}^{n+1}$
$\delta = \frac{1}{m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
$\delta = \begin{bmatrix} \delta_{0} \\ \delta_{1} \\ \delta_{2} \\ ... \\ \delta_{n} \end{bmatrix}$
$\delta_{0} = \frac{1}{m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{0}^{(i)}$
$\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} = [(h_{\theta}(x^{(1)}) - y^{(1)}) \cdot x^{(1)}] + [(h_{\theta}(x^{(2)}) - y^{(2)}) \cdot x^{(2)}] + [(h_{\theta}(x^{(3)}) - y^{(3)}) \cdot x^{(3)}] + ...$
$x^{(i)} = \begin{bmatrix} x_{0}^{(i)} \\ x_{1}^{(i)} \\ x_{2}^{(i)} \\ ... \\ x_{n}^{(i)} \end{bmatrix}$
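one way to write $\delta$ and the update in NumPy (a sketch, assuming a design matrix $X$ whose rows are $(x^{(i)})^{T}$ with $x_{0} = 1$, and a target vector $y$); $X^{T}(X\theta - y)$ stacks the $n+1$ sums $\sum_{i}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$ into one vector:
In [ ]:
import numpy as np
def vectorized_gradient_descent(X, y, alpha=0.01, iterations=1000):
    # X: m x (n+1) design matrix (first column all ones); y: m-vector of targets
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        delta = (1.0 / m) * X.T.dot(X.dot(theta) - y)  # the delta vector defined above
        theta = theta - alpha * delta                  # simultaneous vector update
    return theta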
if features are on a similar scale, gradient descent can converge more quickly (contours of the cost function $J$ will become much less skewed):
for example:
$x_{1}$: size (0-2000 feet^2)
$x_{2}$: # of bedrooms (1-5)
get every feature into approximately a ($-1 \leq x_{i} \leq 1$) range
$x_{1} = \frac{\text{size (feet}^{2}\text{)}}{2000}$
$x_{2} = \frac{\text{# of bedrooms}}{5}$
another alternative:
replace $x_{i}$ with $x_{i} - \mu_{i}$ to make features have approximately zero mean (this is not applied to $x_{0} = 1$)
for example:
$x_{1} = \frac{\text{size}-1000}{2000}$
$x_{2} = \frac{\text{#bedrooms}-2}{5}$
thus, creating new ranges: $-0.5 \leq x_{i} \leq 0.5$
could also use $x_{i} := \frac{x_{i} - \mu_{i}}{s_{i}}$, where $s_{i}$ is the range (max $-$ min) or the standard deviation of feature $i$
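a minimal sketch of mean normalization in NumPy (using the size and # of bedrooms columns from the earlier table; the $x_{0} = 1$ column would be left unscaled):
In [ ]:
import numpy as np
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
mu = X.mean(axis=0)   # per-feature mean
s = X.std(axis=0)     # per-feature standard deviation (the range max - min also works)
X_norm = (X - mu) / s
print(X_norm)         # each column now has roughly zero mean and unit scale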
Plot $J(\theta)$ as a function of the number of iterations
$J(\theta)$ should decrease after every iteration, flattening out eventually when gradient descent converges
the number of iterations necessary for convergence varies immensely
example of an automatic convergence test: declare convergence if $J(\theta)$ decreases by less than some small value $\epsilon$ (e.g., $10^{-3}$) in one iteration
if $J(\theta)$ increases, or alternately increases and decreases, try using a smaller $\alpha$
for sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration
but if $\alpha$ is too small, gradient descent can be slow to converge
to choose $\alpha$, try a range of values spaced roughly 3-fold apart (about a factor of 10 every two steps): $0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1$
In [4]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# THIS IS JUST A VISUAL EXAMPLE
x = np.linspace(10,400)
y = (1/x)
plt.plot(x,y)
plt.legend([r"$J(\theta)$"])
plt.show()
sometimes, it's best to combine (defining new) features based on insight into a particular problem
for example:
$h_{\theta}(x) = \theta_{0} + \theta_{1} \times \text{frontage} + \theta_{2} \times \text{depth}$
$x_{1}=$ frontage
$x_{2}=$ depth
instead, area $x = $ $frontage \times depth$
$h_{\theta}(x) = \theta_{0} + \theta_{1}x$
for example, for the housing data, a straight line might not fit the data very well, while a quadratic model ($\theta_{0} + \theta_{1}x + \theta_{2}x^{2}$) doesn't make sense because it eventually comes back down (implying that price decreases for large enough sizes)
thus, perhaps a cubic function ($\theta_{0} + \theta_{1} x + \theta_{2}x^{2} + \theta_{3}x^{3}$) would be more appropriate:
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} = \theta_{0} + \theta_{1}(size) + \theta_{2}(size)^{2} + \theta_{3}(size)^{3}$
$x_{1} = (size)$
$x_{2} = (size)^{2}$
$x_{3} = (size)^{3}$
if choosing features like this, feature scaling becomes increasingly important (e.g., size = 1-1000, size^2 = 1-1,000,000, size^3 = 10^9)
another reasonable choice would be a square-root function:
$h_{\theta}(x) = \theta_{0} + \theta_{1}(size) + \theta_{2}\sqrt{(size)}$
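a sketch of building polynomial and square-root features (the size values here are assumptions for illustration; note the scale differences, which is why feature scaling matters):
In [ ]:
import numpy as np
size = np.array([500.0, 1000.0, 1500.0, 2000.0])  # assumed house sizes
# cubic-model features: 1, size, size^2, size^3
X_cubic = np.column_stack([np.ones_like(size), size, size**2, size**3])
# square-root-model features: 1, size, sqrt(size)
X_sqrt = np.column_stack([np.ones_like(size), size, np.sqrt(size)])
print(X_cubic)
print(X_sqrt)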
In [36]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# THIS IS JUST A VISUAL EXAMPLE
x_i = np.linspace(1,1000)
x_i = np.random.choice(x_i, 10)
y_i = np.log(x_i**3)
plt.scatter(x_i, y_i, marker='x', color='r')
x_i = np.linspace(1,1000)
x_i = np.random.choice(x_i, 10)
y_i = np.log(x_i**2.8)
plt.scatter(x_i, y_i, marker='x', color='r')
x_i = np.linspace(1,1000)
x_i = np.random.choice(x_i, 10)
y_i = np.log(x_i**2.9)
plt.scatter(x_i, y_i, marker='x', color='r')
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # tick labels along the bottom edge are off
plt.xlabel('size (x)')
plt.ylabel('price (y)')
plt.show()
the normal equation: solves for the optimal parameter values in one step and works for many linear regression problems
solves for $\theta$ analytically (as opposed to iteratively, as with gradient descent)
if using the normal equation method, feature scaling is not necessary
intuition for a 1-dimensional real valued number/scalar ($\theta \in \mathbb{R}$):
$J(\theta) = a\theta^{2} + b\theta + c$
to minimize this cost function, take the derivative and then set the derivative to zero:
$\frac{d}{d\theta}J(\theta)=0$, then solve for $\theta$
when $\theta$ is a ($n+1$)-dimensional parameter vector ($\theta \in \mathbb{R}^{n+1}$):
and the cost function $J$ is a function of the vector $\theta$:
$J(\theta_{0},\theta_{1},...,\theta_{n}) = \frac{1}{2m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
to minimize this cost function, take the partial derivative with respect to every parameter $\theta_{j}$, and then set each partial derivative to zero:
$\frac{\partial}{\partial\theta_{j}}J(\theta)=0$, for every $j$, then solve for $\theta_{0},\theta_{1},...,\theta_{n}$
to implement the normal equation method:
for example, place all features in a $m\times(n+1)$ matrix $X$ (n + 1 because of extra feature $x_{0}$):
$X =
\begin{bmatrix}
1 && 2104 && 5 && 1 && 45 \\
1 && 1416 && 3 && 2 && 40 \\
1 && 1534 && 3 && 2 && 30 \\
1 && 852 && 2 && 1 && 36
\end{bmatrix}$
and all $y$ values into a $m$-dimensional vector $y$:
$y =
\begin{bmatrix}
460 \\
232 \\
315 \\
178
\end{bmatrix}$
now, to get the value of $\theta$ that minimizes the cost function:
$\theta = (X^{T}X)^{-1}X^{T}y$
more generally, $m$ examples $(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})$, $n$ features:
$x^{(i)} =
\begin{bmatrix}
x_{0}^{(i)} \\
x_{1}^{(i)} \\
x_{2}^{(i)} \\
... \\
x_{n}^{(i)}
\end{bmatrix} \in \mathbb{R}^{n+1}$
to construct the ($m\times(n+1)$) design matrix $X$:
$X =
\begin{bmatrix}
---(x^{(1)})^{T}--- \\
---(x^{(2)})^{T}--- \\
---(x^{(3)})^{T}--- \\
... \\
---(x^{(m)})^{T}---
\end{bmatrix}$
for example, if just one feature:
$x^{(i)} =
\begin{bmatrix}
1 \\
x_{1}^{(i)}
\end{bmatrix} \rightarrow
X =
\begin{bmatrix}
1 && x_{1}^{(1)} \\
1 && x_{1}^{(2)} \\
... \\
1 && x_{1}^{(m)}
\end{bmatrix}$ (an $m\times2$-dimensional matrix)
$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ ... \\ y^{(m)} \end{bmatrix}$ (a $m$-dimensional vector)
now, $\theta = (X^{T}X)^{-1}X^{T}y$
$(X^{T}X)^{-1}$: inverse of $X^{T}X$
what if $X^{T}X$ is non-invertible (singular/degenerate)?
it's rare, but most programming environments can still perform the computation (in Octave, pinv() vs. inv());
pinv() is short for pseudo-inverse
if $X^{T}X$ is non-invertible, that's usually because of:
1) redundant features (linearly dependent)
for example, $x_{1} = $ size in feet^2, $x_{2} = $ size in m^2:
since 1 meter $\approx$ 3.28 feet
then, always, $x_{1} = (3.28)^{2}x_{2}$
if two features are related via a linear equation, then $X^{T}X$ will be non-invertible (can be proven in linear algebra)
2) too many features (e.g., $m \leq n$)
for example, $m=10$ and $n=100$:
trying to fit parameter vector $\theta \in \mathbb{R}^{101}$ (i.e., 101 parameters); simply, it's too little data
either delete some features or use regularization
In [87]:
import numpy as np
# number of training examples
m = 4
# number of features (not including x_0)
n = 4
#x_0 = np.array([[1],
# [1],
# [1],
# [1]])
#x_1 = np.array([[2104],
# [1416],
# [1534],
# [852]])
#x_2 = np.array([[5],
# [3],
# [3],
# [2]])
#x_3 = np.array([[1],
# [2],
# [2],
# [1]])
#x_4 = np.array([[45],
# [40],
# [30],
# [36]])
# training examples:
xindex1 = np.array([[1],
[2104],
[5],
[1],
[45]])
xindex2 = np.array([[1],
[1416],
[3],
[2],
[40]])
xindex3 = np.array([[1],
[1534],
[3],
[2],
[30]])
xindex4 = np.array([[1],
[852],
[2],
[1],
[36]])
# y values
y = np.array([[460],
[232],
[315],
[178]])
# design matrix
X = np.zeros(shape=(m,(n+1)), dtype=np.int16)
X[0,:] = np.transpose(xindex1)
X[1,:] = np.transpose(xindex2)
X[2,:] = np.transpose(xindex3)
X[3,:] = np.transpose(xindex4)
#print X
#print X.shape
# X.T = X transpose
#print X.T
#print X.T.shape
#print np.dot(X.T,X)
#print np.linalg.pinv(np.dot(X.T,X))
#print np.dot(X.T,y)
#print np.dot(np.linalg.pinv(np.dot(X.T,X)), np.dot(X.T,y))
theta = np.dot(np.linalg.pinv(np.dot(X.T,X)), np.dot(X.T,y))
print(theta)
In [140]:
# gradient descent attempt 1
import numpy as np
# number of training examples
m = 4
# number of features (not including x_0)
n = 4
# training examples:
xindex1 = np.array([[1],
[2104],
[5],
[1],
[45]])
xindex2 = np.array([[1],
[1416],
[3],
[2],
[40]])
xindex3 = np.array([[1],
[1534],
[3],
[2],
[30]])
xindex4 = np.array([[1],
[852],
[2],
[1],
[36]])
# y values
y = np.array([[460],
[232],
[315],
[178]])
#print y
#print y.shape
theta = np.zeros(((n+1),1),dtype=np.int16)
#print theta.shape
#theta_transpose = np.transpose(theta)
#print theta_transpose
#hypothesis = theta_transpose.dot(x)
# design matrix
#X = np.concatenate((xindex1.T,xindex2.T,xindex3.T,xindex4.T))
X = np.concatenate((xindex1,xindex2,xindex3,xindex4), axis=1)
#print X
#print X.shape
#Xindex0 = X[:,0]
#print Xindex0.T
#Xindex0.shape = (5,1)
#print Xindex0.shape
#print Xindex0
#print theta.T.dot(X[:,0])
#Xtranspose = np.transpose(X)
alpha = 0.1
temp = np.zeros(((n+1),1),dtype=np.int16)
#for j in xrange(0,n):
#Xsuperscripti
#temp[j] = theta[j] - alpha * (1/m) * theta.T *
simultaneously update $\theta_{j}$ for $j=0,1,...,n$
$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$
$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{j}^{(i)}$
where $j = 0$:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{0}^{(i)}$
where $m = 4$:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{4}\sum\limits_{i=1}^{4}(\theta^{T}x^{(i)} - y^{(i)}) \cdot x_{0}^{(i)}$
expand summation:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{4} \left( \left[(\theta^{T}x^{(1)} - y^{(1)}) \cdot x_{0}^{(1)}\right] + \left[(\theta^{T}x^{(2)} - y^{(2)}) \cdot x_{0}^{(2)}\right] + \left[(\theta^{T}x^{(3)} - y^{(3)}) \cdot x_{0}^{(3)}\right] + \left[(\theta^{T}x^{(4)} - y^{(4)}) \cdot x_{0}^{(4)}\right] \right)$
where $\theta$, $x$, and $y$ are vectors:
$\theta_{0} := \theta_{0} - \alpha \frac{1}{4} \left( \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\2104\\5\\1\\45\end{bmatrix} - 460) \cdot 1\right] + \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\1416\\3\\2\\40\end{bmatrix} - 232) \cdot 1\right] + \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\1534\\3\\2\\30\end{bmatrix} - 315) \cdot 1\right] + \left[(\begin{bmatrix}0&&0&&0&&0&&0\end{bmatrix} \times \begin{bmatrix}1\\852\\2\\1\\36\end{bmatrix} - 178) \cdot 1\right] \right)$
each of the $\theta^{T}x^{(i)}$ operations results in a single value; the differences from $y^{(i)}$ are summed and averaged, multiplied by the learning rate $\alpha$, and the result is subtracted from the relevant $\theta$ parameter
In [251]:
# gradient descent attempt 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# number of training examples
m = 4
# number of features (not including x_0)
n = 4
# learning rate
# NOTE: with these unscaled features, alpha = 0.1 makes J(theta) grow without bound;
# rescale the features or use a much smaller alpha (e.g., around 1e-7) to see J decrease
alpha = 0.1
# training samples matrix
X = np.array([[1,2104,5,1,45],
[1,1416,3,2,40],
[1,1534,3,2,30],
[1,852, 2,1,36]])
# y value vector
y = np.array([[460],
[232],
[315],
[178]])
# theta parameters
theta = np.zeros(((n+1),1))
#print theta
#print theta.T
#x = (X[0])
#x.shape = (((n+1),1))
#print x
#print x.shape
#print theta.T.dot(x)
iterations = 10
J = np.zeros((iterations,1))
def gradient_descent(it):
    # temp theta parameters (for the simultaneous update)
    temp = np.zeros(((n+1),1))
    # deltas (one per parameter)
    delta = np.zeros(((n+1),1))
    # cost J(theta) for the current theta, stored per iteration
    temp_J = 0.0
    for i in range(0,m):
        x_i = X[i].reshape(((n+1),1))
        temp_J += ((theta.T.dot(x_i)) - y[i]).item()**2
    J[it] = (1.0/(2*m)) * temp_J
    # compute delta_j for every parameter j = 0, ..., n
    for j in range(0,(n+1)):
        # training-example summation and average to create delta_j
        temp_delta = 0.0
        for i in range(0,m):
            x_i = X[i].reshape(((n+1),1))
            temp_delta += ((theta.T.dot(x_i)) - y[i]).item() * x_i[j].item()
        delta[j] = (1.0/m) * temp_delta
        temp[j] = theta[j] - (alpha * delta[j])
    # simultaneously update theta_0 through theta_n
    for j in range(0,(n+1)):
        theta[j] = temp[j]
for it in range(iterations):
    gradient_descent(it)
print(theta)
x = np.arange(1,(iterations+1))
x.shape = (((iterations),1))
plt.plot(x,J)
plt.legend([r"$J(\theta)$"])
plt.show()
$m$ training examples, $n$ features
gradient descent disadvantages: need to choose $\alpha$; needs many iterations
gradient descent advantages: works well even when the number of features $n$ is large
normal equation advantages: no need to choose $\alpha$; no need to iterate
normal equation disadvantages: need to compute $(X^{T}X)^{-1}$, which is roughly $O(n^{3})$, so it becomes slow when $n$ is very large (e.g., on the order of $10^{4}$ or more)