Video 01: Gradient Descent

gradient descent for line fitting:

Sum of Squares Error Function

$$sse(m,b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))^2 $$

We need to minimize $sse$ wrt $m$ and $b$.

Thus we take the partial derivatives of $sse$ with respect to $m$ and $b$ and use gradient descent to step toward the point where both slopes are zero (the minimum).

$$ \begin{align} \frac{\partial sse}{\partial m} & = \frac{2}{n} \sum_{i=1}^{n} - x_i (y_i - (m x_i + b)) \\ \frac{\partial sse}{\partial b} & = \frac{2}{n} \sum_{i=1}^{n} - (y_i - (m x_i + b)) \end{align} $$

Note that the gradients depend on all $n$ points $(x_i, y_i)$.

So, the code to calculate gradients will loop over all points:

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # accumulate the partial derivatives derived above
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    # (the parameter update that uses these gradients appears in the full cell below)

In [11]:
import numpy as np

# y = mx + b
# m is slope, b is y-intercept
def compute_error_for_line_given_points(b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, np.array(points), learning_rate)
    return [b, m]

def run(points):
    learning_rate = 0.0001
    initial_b = 0 # initial y-intercept guess
    initial_m = 0 # initial slope guess
    num_iterations = 1000
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    return [b,m]

In [12]:
points = np.genfromtxt("../../../study_data/math_of_intelligence/01_data.csv", delimiter=",")
run(points)


Out[12]:
[0.088936519937413458, 1.4777440851894448]
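
For reference, the same gradient step can be written without an explicit Python loop, using numpy's vectorized column slices. This is a sketch; step_gradient_vectorized is not part of the original notebook.

def step_gradient_vectorized(b_current, m_current, points, learning_rate):
    x = points[:, 0]                       # all x values at once
    y = points[:, 1]                       # all y values at once
    n = float(len(points))
    residual = y - (m_current * x + b_current)
    b_gradient = -(2 / n) * residual.sum()
    m_gradient = -(2 / n) * (x * residual).sum()
    return [b_current - learning_rate * b_gradient,
            m_current - learning_rate * m_gradient]

It returns the same [new_b, new_m] as step_gradient above, just without iterating point by point in Python.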

Video 02: Support Vector Machines

use cases for svm:

  • classification, regression, outlier detection, clustering

Comparison to other methods

svm works well when:

  • the dataset is small (roughly < 1000 rows)

other algorithms (random forest, deep neural nets, etc.) are usually preferred when:

  • there is much more data
  • they tend to be more robust at that scale

Knuth: "Premature optimization is the root of all evil in programming"

What is svm?

a discriminative classifier: it models the decision boundary between the classes directly

the opposite of discriminative is generative: a generative model learns the data distribution and can generate new data

svm finds the boundary that maximises the gap (the margin) between the two classes of points

the points closest to that boundary are the support vectors

hyperplane: the decision boundary that separates the two classes (a line when the data is 2-dimensional)

a hyperplane is $(n-1)$-dimensional in $\mathbb{R}^n$
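
As a minimal sketch (the weight vector w and bias b below are made-up placeholders, not learned values), the classifier's decision rule is simply which side of the hyperplane a point falls on:

import numpy as np

def predict(w, b, x):
    # sign of w.x + b: +1 on one side of the hyperplane, -1 on the other
    return np.sign(np.dot(w, x) + b)

w = np.array([1.0, -1.0])                    # hyperplane x1 - x2 = 0, a line in R^2
b = 0.0
print(predict(w, b, np.array([2.0, 0.5])))   #  1.0
print(predict(w, b, np.array([0.5, 2.0])))   # -1.0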

Linear vs nonlinear classification

to do non-linear classification, use the kernel trick: implicitly map the data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly
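
A quick sketch of the idea with scikit-learn (an assumption here, since sklearn is not used elsewhere in these notes): an RBF kernel lets the classifier separate data that no single line in the original space can, such as an XOR-like pattern.

import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in R^2
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel='rbf', gamma=2.0, C=10.0)   # kernel trick: implicit higher-dimensional features
clf.fit(X, y)
print(clf.predict(X))                        # expected: [-1  1  1 -1]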

Loss function

Minimize the loss function to maximize the margin between the two classes.

Hinge loss function: for maximum margin classification

$$c(x,y,f(x)) = (1 - y \times f(x))_+$$

$c$ is the loss function, $x$ the sample, $y$ the true label ($\pm 1$), and $f(x)$ the raw predicted score (not the thresholded class label).

So:

$$ c(x,y,f(x)) = \begin{cases} 0,& \text{if } y \times f(x) \geq 1 \\ 1 - y \times f(x),& \text{else} \end{cases} $$
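
A minimal numpy sketch of this hinge loss, vectorized over samples (the names y_true and scores are illustrative):

import numpy as np

def hinge_loss(y_true, scores):
    # (1 - y * f(x))_+ : zero when a sample is on the correct side with margin >= 1
    return np.maximum(0.0, 1.0 - y_true * scores)

y_true = np.array([1, -1, 1])
scores = np.array([2.0, -0.5, 0.3])      # f(x) values
print(hinge_loss(y_true, scores))        # [0.  0.5 0.7]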

Objective function

$$\min_w \; \lambda \| w \|^2 + \sum_{i=1}^n (1 - y_i \langle x_i, w \rangle )_+$$

note that for a linear svm the prediction is $f(x_i) = \langle x_i, w \rangle$, so the weight vector $w$ determines $f$

objective function = regularizer + hinge loss

what does regularizer do?

  • too high => the regularization term dominates and the model underfits
  • too low => the model fits the training data too closely and overfits

so, the regularizer $\lambda$ controls the trade-off between low training error and low test (generalization) error
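
Putting the objective together, here is a minimal sketch of training a linear svm with (sub)gradient descent. This is not the video's code: the data, hyperparameters, and the name svm_sgd are made up for illustration, and the bias is absorbed into w via a constant 1 feature.

import numpy as np

def svm_sgd(X, y, lam=0.01, learning_rate=0.01, epochs=1000):
    # minimizes  lam * ||w||^2  +  sum_i (1 - y_i <x_i, w>)_+
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * X.dot(w)
        active = margins < 1                       # samples where the hinge loss is non-zero
        hinge_grad = -(y[active][:, None] * X[active]).sum(axis=0)
        w -= learning_rate * (2 * lam * w + hinge_grad)
    return w

# toy linearly separable data; the last column is the constant 1 bias feature
X = np.array([[2.0, 1.0, 1.0], [3.0, 4.0, 1.0], [-2.0, -1.0, 1.0], [-3.0, -4.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = svm_sgd(X, y)
print(np.sign(X.dot(w)))                           # expected: [ 1.  1. -1. -1.]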