In [1]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

import numpy as np

An Intro to Univariate and Multivariate Linear Regression

Nelson Liu

December 5th, 2016

The Problem

  • Let's say we want to predict how much a house in Boston will cost given the number of rooms that it has.
  • This is a regression problem, meaning that we are predicting a real-valued output
    • To contrast, classification predicts...?

Let's get our data and (crudely) visualize it


In [2]:
boston_dataset = load_boston()
# extract prices (in thousands)
house_prices = boston_dataset.target
# column 5 is 'RM': the average number of rooms per dwelling
num_rooms = boston_dataset.data[:,5].reshape(-1, 1)

In [3]:
plt.scatter(num_rooms, house_prices)
plt.show()


Seems pretty "linear"...

A Note About Notation

  • X = "input" variable (features). We want to make predictions given this information.
  • Y = "output" variable (target). We want to guess this value, given X.
  • n = number of training samples we have.
  • (x, y) = one specific training example (note the lowercase)
  • ($x^{(i)}, y^{(i)}$) = the i-th training example, where $i \in [1, n]$
    • e.g., ($x^{(1)}, y^{(1)}$) is the first training example
    • the superscript $(i)$ is an index, not an exponent!

Machine Learning and Functions

  • There is an unknown function $f(x) = y$ that takes input from the world and predicts a value from it
    • In our case, $f(x)$ is a function from average # of rooms/house in a suburb to average price of a house in that suburb.
  • The goal of machine learning: guess (approximate) this function $f(x) = y$ given inputs and outputs.
    • Our machine learning system guesses a "hypothesis" $h(x)$, where we hope $h(x) \approx f(x)$
  • In univariate linear regression, we want a hypothesis of the form $h(x) = \theta_0 + \theta_1x$
    • This corresponds to a straight line in the Cartesian plane
    • The $\theta$'s are known as parameters of our function, and we want to find the values that best fit the data

How do we fit this straight line to our data?

  • $h(x) = \theta_0 + \theta_1x$
  • We need to "tune" parameters! By changing the values of $\theta_0$ and $\theta_1$, we generate different lines.
    • So we want to choose values of $\theta_0$ and $\theta_1$ that make $h(x)$ predict $y$ as accurately as possible for our data $x$ (see the sketch below)
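
To build intuition, here is a quick sketch that overlays a few hand-picked lines $h(x) = \theta_0 + \theta_1x$ on the scatter plot from earlier (it reuses the num_rooms and house_prices arrays loaded above; the specific $(\theta_0, \theta_1)$ values are arbitrary guesses, not fitted parameters):

In [ ]:
# overlay a few hand-picked candidate lines h(x) = theta_0 + theta_1 * x;
# the (theta_0, theta_1) pairs below are arbitrary guesses chosen only to
# show how changing the parameters moves the line around.
candidate_thetas = [(-10, 5), (-30, 9), (10, 2)]

plt.scatter(num_rooms, house_prices, alpha=0.5)
x_line = np.linspace(num_rooms.min(), num_rooms.max(), 100)
for theta_0, theta_1 in candidate_thetas:
    plt.plot(x_line, theta_0 + theta_1 * x_line,
             label="$h(x) = {} + {}x$".format(theta_0, theta_1))
plt.xlabel("average number of rooms")
plt.ylabel("price (thousands)")
plt.legend()
plt.show()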

Formalizing this intuition

  • We want to find the values of $\theta_0$ and $\theta_1$ that minimize the error between our predicted value $h(x)$ and the true value $y$, averaged over the $(x, y)$ pairs in our training set.
  • cost = $ \frac{1}{2n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})^2$
  • $h(x) = \theta_0 + \theta_1x$
  • Note that the values of $\theta_0$ and $\theta_1$ control the values of $h(x^{(i)})$, while the $y^{(i)}$ are constant.
    • Thus, the cost function varies with $\theta_0$ and $\theta_1$
    • We want to find the values of $\theta_0$ and $\theta_1$ that minimize the cost function!

Cost Function

  • Our goal is to minimize the cost function.
    • We'll represent the cost function as $C(\theta_0,\theta_1)$
    • The idea is that the values of the parameters that minimize the cost function should give us a good model.

Visualizing the Cost Function

  • So let's return to our task above. We want to find a function $h(x) = y$ to approximate housing prices based on the number of rooms, which would entail minimizing our cost function $C(\theta_0,\theta_1)$.
  • Since $C(\theta_0,\theta_1)$ depends on two variables and outputs one value, we can visualize it in 3-d.
    • Let's plug in a bunch of values of $\theta_0$ and $\theta_1$ into $C(\theta_0,\theta_1)$ and see what it looks like

In [4]:
# theta_0 = y-intercept
# theta_1 = slope
# num_rooms = array with the average number of rooms in the suburb
# prices = array with the average price in the suburb
# n = number of training examples

def compute_cost(theta_0, theta_1, num_rooms, prices):
    total_squared_error = 0.0
    n = len(num_rooms)
    for i in range(n):
        # compute h(x_i), where x_i = num_rooms[i] for the values of 
        # theta_0 and theta_1
        h_x = (theta_1 * num_rooms[i][0]) + theta_0
        
        # compute the difference between our guess
        # for h(x_i) and the true value y_i
        raw_error = h_x - prices[i]
        
        # square the error so it's always positive
        # (and so the cost is smooth and differentiable)
        squared_error = raw_error**2
        
        # add it to the total
        total_squared_error += squared_error
    # note: this is half of the mean squared error; the extra factor
    # of 1/2 just makes the gradient cleaner later on
    MSE = total_squared_error / (2.0 * n)
    return MSE
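
The loop above mirrors the formula term by term; as a side note, the same computation can be written without an explicit Python loop. Here is a minimal vectorized sketch (compute_cost_vectorized is just an illustrative name; it assumes num_rooms is the (n, 1) array created above):

In [ ]:
# vectorized version of compute_cost: identical math, no explicit loop
def compute_cost_vectorized(theta_0, theta_1, num_rooms, prices):
    predictions = theta_0 + theta_1 * num_rooms.ravel()  # h(x^(i)) for every i
    errors = predictions - prices                        # h(x^(i)) - y^(i)
    return np.sum(errors ** 2) / (2.0 * len(prices))

# should agree with the loop-based version up to floating point error, e.g.
# compute_cost_vectorized(-30, 9, num_rooms, house_prices)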

In [5]:
# you don't really need to know what this code does, but you should
# know what the visualization means. This might take a while to run.
def plot_cost_function(rotate=False):
    fig = plt.figure(figsize=(16, 16))
    ax = fig.gca(projection='3d')

    # values of theta_0 (intercept) and theta_1 (slope) to try
    theta_0s = np.linspace(-50, -20, 100)
    theta_1s = np.linspace(0, 20, 100)

    # create meshgrid out of them
    theta_0_mesh, theta_1_mesh = np.meshgrid(theta_0s, theta_1s)
    mses = np.array([compute_cost(theta_0, theta_1, num_rooms, house_prices) 
                       for theta_0, theta_1 in zip(np.ravel(theta_0_mesh), 
                                                   np.ravel(theta_1_mesh))])
    mse = mses.reshape(theta_0_mesh.shape)

    if rotate is True:
        surf = ax.plot_surface(theta_1_mesh, theta_0_mesh, mse, 
                               rstride=1, cstride=1, linewidth=0.1, 
                               cmap=cm.coolwarm, alpha=0.5)
    else:
        surf = ax.plot_surface(theta_0_mesh, theta_1_mesh, mse, 
                               rstride=1, cstride=1, linewidth=0.1, 
                               cmap=cm.coolwarm, alpha=0.5)

    if rotate is True:
        ax.set_xlabel('theta 1 (slope)')
        ax.set_ylabel('theta 0 (intercept)')
    else:
        ax.set_xlabel('theta 0 (intercept)')
        ax.set_ylabel('theta 1 (slope)')
    ax.set_zlabel('MSE')
    plt.show()

In [6]:
plot_cost_function()



In [7]:
plot_cost_function(rotate=True)


What do you think the optimal parameters are, based on the previous figures?

  • Wouldn't it be nice to have an algorithm that figures this out for you automatically, instead of eyeballing a plot?
    • This algorithm is called Gradient Descent

Gradient Descent

  • Method for optimizing the value of the parameters to minimize the cost function $C(\theta_0, \theta_1)$.
    • Used broadly throughout machine learning (e.g. neural nets) and numerical optimization.

Gradient Descent -- Intuition in Pictures

  • Imagine walking down a hill, taking a step in the steepest downhill direction each time.
    • Keep doing this until you reach a minimum

Gradient Descent Algorithm

  • repeat until you reach a local min (convergence):
    • for j in {0,1}:
      • $\theta_j = \theta_j - \alpha(\frac{\partial}{\partial\theta_j} C(\theta_0, \theta_1))$
  • $\alpha$ refers to the learning rate
    • how big of a step we take in a particular direction when updating $\theta_j$
  • $\frac{\partial}{\partial\theta_j} C(\theta_0, \theta_1)$ is the partial derivative of $C$
    • find the slope of the function, and adjust the parameter to minimize $C$ (a tiny worked example follows below)
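
As a concrete (toy) illustration of the update rule before we specialize it to linear regression, here is a minimal sketch that runs gradient descent on the one-dimensional function $f(\theta) = (\theta - 3)^2$, whose derivative is $2(\theta - 3)$; the starting point and learning rate are arbitrary choices for the demo:

In [ ]:
# gradient descent on f(theta) = (theta - 3)^2, which is minimized at theta = 3.
theta = 10.0   # arbitrary starting point
alpha = 0.1    # learning rate (arbitrary, but small enough to converge)
for step in range(25):
    gradient = 2 * (theta - 3)      # derivative of f at the current theta
    theta = theta - alpha * gradient
print(theta)  # should end up close to 3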

Gradient Descent on Linear Regression

$$\frac{\partial}{\partial\theta_j} C(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j} \left(\frac{1}{2n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})^2\right)$$
$$\frac{\partial}{\partial\theta_j} C(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j} \left(\frac{1}{2n} \sum \limits_{i=1}^{n} ((\theta_0 + \theta_1x^{(i)}) - y^{(i)})^2\right)$$
  • Working out the derivatives (derivation left as an exercise):
$$j=0 : \frac{\partial}{\partial\theta_0} C(\theta_0, \theta_1) = \frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})$$

$$j=1 : \frac{\partial}{\partial\theta_1} C(\theta_0, \theta_1) = \frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) \times x^{(i)}$$

Gradient Descent on Linear Regression (cont.)

  • repeat until you reach a local min (convergence):
    • $\theta_0 = \theta_0 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})$
    • $\theta_1 = \theta_1 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) \times x^{(i)}$
  • make sure to simultaneously update $\theta_0$ and $\theta_1$

In pseudocode

  • repeat until you reach a local min (convergence):
    • temp_0 $= \theta_0 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})$
    • temp_1 $= \theta_1 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) \times x^{(i)}$
    • $\theta_0 =$ temp_0
    • $\theta_1 =$ temp_1
  • (this loop is translated into runnable Python in the sketch below)
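
Translating the pseudocode above into Python (reusing num_rooms and house_prices), a minimal sketch might look like the following; gradient_descent is just an illustrative name, and the learning rate and iteration count are arbitrary choices that happen to work reasonably on this un-normalized data:

In [ ]:
# batch gradient descent for univariate linear regression, following the
# pseudocode above. alpha and num_iterations are arbitrary choices; this
# may take a few seconds to run.
def gradient_descent(x, y, alpha=0.02, num_iterations=50000):
    theta_0, theta_1 = 0.0, 0.0
    n = len(y)
    for _ in range(num_iterations):
        errors = (theta_0 + theta_1 * x) - y    # h(x^(i)) - y^(i) for every i
        # compute both updates before applying them (simultaneous update)
        temp_0 = theta_0 - alpha * np.sum(errors) / n
        temp_1 = theta_1 - alpha * np.sum(errors * x) / n
        theta_0, theta_1 = temp_0, temp_1
    return theta_0, theta_1

theta_0, theta_1 = gradient_descent(num_rooms.ravel(), house_prices)
# the result should land near the bottom of the "bowl" in the surface plots above
print(theta_0, theta_1, compute_cost(theta_0, theta_1, num_rooms, house_prices))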

How do we extend this to multiple variables?

  • There are a lot more factors than just # of rooms in pricing a house
    • location, age of house, etc
  • It'd be great if we could incorporate information about multiple variables in our models
    • Multivariate Linear Regression!

Adjusting the Hypothesis

  • Previously (in the univariate case), $h(x) = \theta_0 + \theta_1x$
    • To generalize this to more variables, we just add more terms!
  • Say we have 3 input variables (features)
    • $h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3$
    • $x_j$ is the $j$th feature.
  • generally, $h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$ (here $n$ is the number of features; note that we're overloading the $n$ we used earlier for the number of training examples)

Vector Representation

  • To make the notation a bit cleaner, let's define $x_0 = 1$.
    • $h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n = \theta_0 \cdot 1 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$

$X = \begin{pmatrix} x_0\\ x_1\\ \vdots\\ x_n \end{pmatrix}$ and $\theta = \begin{pmatrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{pmatrix}$, so $h(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n = \theta^TX$

  • $X$ and $\theta$ are (n+1) x 1, or "column vectors"

$\theta^T = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix}$

  • a 1 x (n+1) matrix, or a "row vector" (a small NumPy sketch of $\theta^TX$ follows below)
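
As a small concreteness check, here is a sketch of computing $h(x) = \theta^TX$ with NumPy for a single example with three features plus the leading $x_0 = 1$ (all of the numbers below are made up for illustration):

In [ ]:
# h(x) = theta^T x for one example, with x_0 = 1 prepended.
theta = np.array([-30.0, 9.0, 0.5, -0.1])   # theta_0 ... theta_3 (made up)
x = np.array([1.0, 6.5, 2.0, 45.0])         # x_0 = 1, then three made-up features
h_x = np.dot(theta, x)                      # equivalently: theta @ x
print(h_x)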

Summary So Far: Multivariate Linear Regression

  • hypothesis: $h(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n = \theta^TX$
    • $x_0 = 1$, $x_j$ indicates the $j$th feature (input variable)
  • Parameters of the model: $\theta_0, \theta_1, ... \theta_n$, represented simply as $\theta$
    • $\theta$ is an (n+1)-dimensional column vector, whose elements are $\theta_0, \theta_1, ... \theta_n$
  • Cost function: $C(\theta_0, \theta_1, ... \theta_n) = \frac{1}{2n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})^2$
  • Optimization by gradient descent:
    • repeat $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} C(\theta)$ until convergence
    • simultaneously update for every j = 0...n

Univariate vs Multivariate Gradient Descent

Univariate

  • repeat until you reach a local min (convergence):
    • $\theta_0 = \theta_0 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)})$
    • $\theta_1 = \theta_1 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) \times x^{(i)}$

Multivariate

  • repeat until you reach a local min (convergence):
    • for j in {0...n}:
      • $\theta_j = \theta_j - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) x_j^{(i)}$ (a vectorized NumPy sketch follows below)
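
In NumPy, the whole multivariate update can be written as one vectorized step over a design matrix whose rows are training examples. Here is a minimal sketch (gradient_descent_step is an illustrative name; it assumes the matrix has a leading column of ones for $x_0$, and alpha / the iteration count are arbitrary choices):

In [ ]:
# one vectorized gradient descent step for multivariate linear regression.
def gradient_descent_step(theta, X, y, alpha):
    errors = X.dot(theta) - y              # h(x^(i)) - y^(i) for every example
    gradient = X.T.dot(errors) / len(y)    # partial derivatives for every theta_j
    return theta - alpha * gradient

# example usage: reuse the Boston room counts with a leading column of ones
# (any number of feature columns would work the same way)
X = np.hstack([np.ones_like(num_rooms), num_rooms])
theta = np.zeros(X.shape[1])
for _ in range(1000):
    theta = gradient_descent_step(theta, X, house_prices, alpha=0.01)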

A closer look at the multivariate partial derivative

$$\theta_j = \theta_j - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

$$\theta_0 = \theta_0 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) x_0^{(i)}$$

$$\theta_1 = \theta_1 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) x_1^{(i)}$$

$$\theta_2 = \theta_2 - \alpha\frac{1}{n} \sum \limits_{i=1}^{n} (h(x^{(i)}) - y^{(i)}) x_2^{(i)}$$
  • Recall $x_0 = 1$
  • The partial derivatives are the same in uni vs multivariate! Just taking them over more variables.

Parting Remarks: The Normal Equation

  • Solve for the values of our parameter vector $\theta$ in one shot, with no iterative optimization! Here $X$ is the matrix whose rows are the training examples (with a leading column of 1s for $x_0$), and $Y$ is the vector of targets (a NumPy sketch follows below).

$\theta = (X^T X)^{-1} X^T Y$
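
On the single-feature Boston data loaded above, a sketch of the normal equation might look like the following; it uses np.linalg.solve on $X^TX\theta = X^TY$ rather than forming the inverse explicitly (numerically preferable), and compares the result against the LinearRegression class imported at the top, which solves the same least-squares problem:

In [ ]:
# solve (X^T X) theta = X^T y directly, then compare with scikit-learn.
X = np.hstack([np.ones_like(num_rooms), num_rooms])   # leading column of ones
theta = np.linalg.solve(X.T.dot(X), X.T.dot(house_prices))
print(theta)  # [theta_0, theta_1]

model = LinearRegression().fit(num_rooms, house_prices)
print(model.intercept_, model.coef_)  # should match theta up to numerical error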

Gradient Descent vs. Normal Equation

  • In gradient descent, you need to pick a good value of $\alpha$
    • Not necessary when using normal equations
  • Gradient descent needs many iterations
    • No iteration in normal equation
  • Gradient descent scales very well when we have many features (just compute some more derivatives)
    • The normal equation must calculate $(X^T X)^{-1}$, which can be slow when the number of features is very large (10,000+).