gradient descent for line fitting:
We need to minimize $sse$ with respect to $m$ and $b$.
So we take the derivatives of $sse$ with respect to $m$ and $b$ and use gradient descent to step toward the point where both derivatives are zero.
$$ \begin{align} \frac{\partial sse}{\partial m} & = \frac{2}{n} \sum_i^n - x_i (y_i - (m x_i + b)) \\ \frac{\partial sse}{\partial b} & = \frac{2}{n} \sum_i^n - (y_i - (m x_i + b)) \end{align} $$

Note that the gradients depend on all $n$ points $(x_i, y_i)$.
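For completeness, a quick derivation, taking $sse$ to be the mean of the squared residuals (which is what compute_error_for_line_given_points below computes):

$$ \begin{align} sse(m,b) & = \frac{1}{n} \sum_i^n (y_i - (m x_i + b))^2 \\ \frac{\partial sse}{\partial m} & = \frac{1}{n} \sum_i^n 2 (y_i - (m x_i + b)) \cdot (-x_i) \\ \frac{\partial sse}{\partial b} & = \frac{1}{n} \sum_i^n 2 (y_i - (m x_i + b)) \cdot (-1) \end{align} $$

which simplifies to the expressions above.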
So, the code to calculate gradients will loop over all points:
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # accumulate the partial derivatives across all points
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
In [11]:
from numpy import *

# y = mx + b
# m is slope, b is y-intercept
def compute_error_for_line_given_points(b, m, points):
    # mean squared error of the line y = mx + b over all points
    totalError = 0
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))

def step_gradient(b_current, m_current, points, learningRate):
    # one gradient descent update for b and m
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    # repeat the gradient step for a fixed number of iterations
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

def run(points):
    learning_rate = 0.0001
    initial_b = 0  # initial y-intercept guess
    initial_m = 0  # initial slope guess
    num_iterations = 1000
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    return [b, m]
In [12]:
points = genfromtxt("../../../study_data/math_of_intelligence/01_data.csv", delimiter=",")
run(points)
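If the CSV is not at hand, the same pipeline can be sanity-checked on synthetic data. This is only a sketch with made-up values (slope 1.5, intercept 4, Gaussian noise); with this learning rate and 1000 iterations the estimates should move toward those values but will not match them exactly:

# sanity check on synthetic data (hypothetical values); assumes the functions
# and the numpy wildcard import from the cell above are in scope
random.seed(0)
xs = random.uniform(20, 80, 100)                   # 100 random x values
ys = 1.5 * xs + 4 + random.normal(0, 5, size=100)  # true slope 1.5, intercept 4, plus noise
synthetic_points = column_stack((xs, ys))
b_fit, m_fit = run(synthetic_points)
print(b_fit, m_fit)
print(compute_error_for_line_given_points(b_fit, m_fit, synthetic_points))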
use cases for svm:
svm is a good choice when the dataset is small and the classes can be separated with a clear margin.
with much more data, other algorithms (random forest, dnn, etc.) are often the better tool.
Knuth: "Premature optimization is the root of all evil in programming"
svm is a discriminative classifier.
the opposite of discriminative is generative: a generative model can produce new data rather than just separate it.
svm tries to maximise the gap (margin) between the two classes of points.
the points closest to the separating boundary, which determine that margin, are the support vectors.
hyperplane: the decision surface that separates the two classes of points.
the hyperplane is $(n-1)$-dimensional in $\rm I\!R^n$, e.g. a line in $\rm I\!R^2$, a plane in $\rm I\!R^3$.
to do non-linear classification: kernel trick
Minimize a loss function to maximize the separation between the classes.
Hinge loss function: used for maximum-margin classification.
$$c(x,y,f(x)) = (1 - y \times f(x))_+$$

$c$ is the loss function, $x$ the sample, $y$ the true label, and $f(x)$ the classifier's predicted output.
So:
$$ c(x,y,f(x)) = \begin{cases} 0,& \text{if } y \times f(x) \geq 1 \\ 1 - y \times f(x),& \text{else} \end{cases} $$

note that the prediction here is $f(x) = \langle x, w \rangle$, the dot product of the weight vector $w$ with the sample.
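As a quick illustration, here is a minimal hinge-loss helper (the name hinge_loss and the sample values are mine, not from the notes), assuming labels are in $\{-1, +1\}$ and $f(x)$ has already been computed:

def hinge_loss(y, fx):
    # y: true label in {-1, +1}; fx: predicted value f(x) = <x, w>
    return max(0.0, 1 - y * fx)

print(hinge_loss(1, 2.5))   # comfortably on the correct side -> 0.0
print(hinge_loss(1, 0.3))   # correct side, but inside the margin -> 0.7
print(hinge_loss(-1, 0.3))  # wrong side of the boundary -> 1.3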
objective function = regularizer + hinge loss
what does the regularizer do?
it penalizes large weights, so it controls the trade-off between a low training error and a low testing error (fitting the training set vs. generalizing).
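Written out, a common form of this objective (the regularization strength $\lambda$ is my notation; the notes do not name it) is:

$$ \min_{w} \; \lambda \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \left( 1 - y_i \langle x_i, w \rangle \right)_+ $$

The first term is the regularizer, the second the average hinge loss over the training set.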