Ordinary Least Squares

Course recap (very short)

This lab consists of implementing the Ordinary Least Squares (OLS) algorithm, i.e., linear regression with a least-squares loss. Given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right),\; x^{(i)} \in \mathcal{X},\; y^{(i)} \in \mathcal{Y},\; i \in \{1, \dots, n \} \right\}$, recall (from lectures 1 and 2) that OLS aims at minimizing the following cost function $J$: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where $$h(x) = \sum_{j = 0}^{d} \theta_j x_j = \theta^T x.$$

For the sake of simplicity, we will be working on a small training set (the one we used in lectures 1 and 2) $D$:

living area (m$^2$)    price (1000's BGN)
 50                     30
 76                     48
 26                     12
102                     90

Defining the training set

Exercise 1: Define variables X and Y that will contain the features $x^{(i)}$ and the labels $y^{(i)}$ of the training set.

Note: Do not forget the intercept!


In [ ]:
X = ?
Y = ?
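
One possible solution (among many), assuming we use NumPy and encode the intercept as a constant first feature $x_0 = 1$:

In [ ]:
import numpy as np

# One row per sample; the first column is the intercept feature x_0 = 1
X = np.array([[1., 50.],
              [1., 76.],
              [1., 26.],
              [1., 102.]])

# Labels: prices in 1000's of BGN
Y = np.array([30., 48., 12., 90.])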

In this simple example, the dimensionality is $d = 1$ and the number of samples is $n = 4$.

Prediction function

Exercise: Define a function predict that takes as parameters the feature vector $x$ and the model $\theta$, and returns the predicted label: $$ \hat{y} = h(x) = \theta^T x = \sum_{j = 0}^d \theta_j x_j$$


In [ ]:
def predict(x, theta):
    #TODO
    return
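
One possible solution, assuming x and theta are NumPy arrays of the same length (as defined above):

In [ ]:
def predict(x, theta):
    # theta^T x: dot product between the model and the feature vector
    return np.dot(theta, x)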

Defining the cost function

Cost function on a single sample

Exercise: Define a function cost_function that takes as parameters the actual label $y$ and the predicted label $\hat{y}$ of a single sample and returns the loss of this pair, given by: $$ \ell \left( y, \hat{y} \right) = \dfrac{1}{2}\left( y - \hat{y} \right)^2$$


In [ ]:
def cost_function(y, y_hat):
    #TODO
    return
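
One possible solution:

In [ ]:
def cost_function(y, y_hat):
    # Squared-error loss on a single sample
    return 0.5 * (y - y_hat) ** 2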

Cost function on the whole training set

Now that we are able to compute the cost function for a single sample, we can easily compute the cost function for the whole training set by summing the cost function values over all the samples in the training set. Recall that the total cost function is given by: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where, for all $i \in \{ 1, \dots, n \}$: $$h\left(x^{(i)}\right) = \sum_{j = 0}^{d} \theta_j x^{(i)}_j = \theta^T x^{(i)}.$$


In [ ]:
def cost_function_total(X, Y, theta):
    #TODO
    return
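
One possible solution, reusing the helpers sketched above:

In [ ]:
def cost_function_total(X, Y, theta):
    # Sum of the per-sample losses over the whole training set
    return sum(cost_function(y, predict(x, theta)) for x, y in zip(X, Y))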

Let's now test the code written above and check the value of the total cost function when $\theta = [0, 0]$.


In [ ]:
#TODO
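
As a sanity check (assuming the sketches above): with $\theta = [0, 0]$ every prediction is 0, so $J(\theta) = \frac{1}{2}(30^2 + 48^2 + 12^2 + 90^2) = 5724$.

In [ ]:
theta = np.array([0., 0.])
print(cost_function_total(X, Y, theta))  # expected: 5724.0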

Defining the gradient of the cost function

Gradient on a single sample

Exercise: Define a function gradient that implements the gradient of the cost function for a given sample $(x, y)$. Recall from lectures 1 and 2 that the gradient is given by: $$\nabla J(\theta) = \left[ \dfrac{\partial}{\partial \theta_0} J(\theta), \dots, \dfrac{\partial}{\partial \theta_d} J(\theta) \right]^T$$ where, for all $j \in \{0, \dots, d \}$: $$ \dfrac{\partial}{\partial \theta_j} J(\theta) = \left( h\left(x\right) - y \right) x_j $$


In [ ]:
def gradient(x, y, theta):
    #TODO
    return
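
One possible solution:

In [ ]:
def gradient(x, y, theta):
    # Per-sample gradient: (h(x) - y) * x, one component per theta_j
    return (predict(x, theta) - y) * x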

In our application, the dimensionality of the data (including the intercept) is 2, so the gradient has two components (corresponding to $\theta_0$ and $\theta_1$).


In [ ]:
#TODO
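
For instance (assuming the sketches above), on the first sample $(x = [1, 50],\ y = 30)$ with $\theta = [0, 0]$, the gradient is $(0 - 30) \cdot [1, 50] = [-30, -1500]$:

In [ ]:
print(gradient(X[0], Y[0], np.array([0., 0.])))  # expected: [-30., -1500.]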

Gradient on the whole training set

Now that we are able to compute the gradient of the cost function on a single sample, we can easily compute gradient_total, the gradient of the cost function on the whole training set, by summing the gradients over all the samples in the training set.


In [ ]:
def gradient_total(X, Y, theta):
    #TODO
    return
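
One possible solution:

In [ ]:
def gradient_total(X, Y, theta):
    # Sum of the per-sample gradients over the whole training set
    return sum(gradient(x, y, theta) for x, y in zip(X, Y))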

Let's now test the code written above and check the value of the total gradient when $\theta = [0, 0]$.


In [ ]:
#TODO
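
With $\theta = [0, 0]$ all predictions are 0, so the total gradient is $\sum_i \left(-y^{(i)}\right) x^{(i)} = [-180, -14640]$ (assuming the sketches above):

In [ ]:
print(gradient_total(X, Y, np.array([0., 0.])))  # expected: [-180., -14640.]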

Question: What is the sign of the gradient values? What would it mean if we had such a gradient when applying a gradient descent?

Hint: Recall the gradient descent update.

Applying a gradient descent

Gradient descent step implementation

We now have all the building blocks needed for the gradient descent algorithm, that is:

  • The loss function
  • The gradient

Indeed, the iterative update scheme of this algorithm is given by the following formula: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$$ for all $j \in \{0, \dots, d \}$. Recall that $\alpha$ is a parameter called the learning rate (or step size).

Exercise: Define a function called gradient_descent_step that performs an update on theta by applying the formula above.


In [ ]:
def gradient_descent_step(theta, X, Y, alpha):
    #TODO
    return
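
One possible solution, written as a vectorized update over all the components of theta:

In [ ]:
def gradient_descent_step(theta, X, Y, alpha):
    # One batch update: move theta in the direction opposite to the gradient
    return theta - alpha * gradient_total(X, Y, theta)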

Try to run a few iterations manually. Play with the value of $\alpha$ to see how it impacts the algorithm.


In [ ]:
#TODO
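
For example (the value of alpha here is only a suggestion; since the features are on the order of 100, larger learning rates can diverge):

In [ ]:
theta = np.array([0., 0.])
for _ in range(3):
    theta = gradient_descent_step(theta, X, Y, alpha=1e-5)
    print(theta, cost_function_total(X, Y, theta))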

Iterating gradient descent steps

The function gradient_descent_step implements a single step of the gradient descent algorithm. Implement a function called gradient_descent that starts from a given $\theta$ (e.g., $\theta = [0, 0]$) and applies 100 gradient descent iterations. Display the total cost function $J(\theta)$ at each iteration.


In [ ]:
def gradient_descent(theta, X, Y, alpha, n_iterations=100):
    #TODO
    return
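
One possible solution, assuming the step function sketched above:

In [ ]:
def gradient_descent(theta, X, Y, alpha, n_iterations=100):
    for i in range(n_iterations):
        theta = gradient_descent_step(theta, X, Y, alpha)
        print(i, cost_function_total(X, Y, theta))  # monitor J(theta)
    return theta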

Play with the function you've just implemented. Try different values of $\alpha$ and see the impact it has.


In [ ]:
#TODO
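
For example (the learning rates below are only suggestions):

In [ ]:
theta_trained = gradient_descent(np.array([0., 0.]), X, Y, alpha=1e-5)
# Also try e.g. alpha=1e-6 (slow convergence) or alpha=1e-4 (oscillation)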

Question: Look at the evolution of the cost function over the iterations. What comments can you make?

Question: What does the value theta_trained represent?

Homework: Batch gradient descent vs. stochastic gradient descent

As we saw in lecture 1, gradient descent methods are often split into two subfamilies:

  • Batch methods update $\theta$ after having computed the gradient on the whole training set
  • Stochastic methods update $\theta$ after having computed the gradient on a single sample

The gradient descent we have implemented above (gradient_descent_step and gradient_descent) corresponds to the batch version because it sums the gradients of all the samples in the training set.

During the next session, we will work on the following:

  • Implementing stochastic gradient descent and seeing how it compares to the batch version of gradient descent
  • Visualizing the loss function, its evolution over the iterations, as well as the steps taken by the gradient descent algorithm

Exercise: Try to implement the stochastic version of the gradient descent algorithm.
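
A possible sketch, assuming the helpers defined above; the only change with respect to the batch version is that each update uses the gradient of a single, randomly drawn sample instead of the total gradient:

In [ ]:
def stochastic_gradient_descent(theta, X, Y, alpha, n_iterations=100):
    n = len(Y)
    for i in range(n_iterations):
        j = np.random.randint(n)  # draw one sample at random
        theta = theta - alpha * gradient(X[j], Y[j], theta)
        print(i, cost_function_total(X, Y, theta))  # monitor J(theta)
    return theta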