Ordinary Least Squares -- Part II

Course recap

This lab consists of implementing the Ordinary Least Squares (OLS) algorithm, i.e. linear regression with a least-squares cost. Given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right), x^{(i)} \in \mathcal{X}, y^{(i)} \in \mathcal{Y}, i \in \{1, \dots, n \} \right\}$, recall (from lectures 1 and 2) that OLS aims at minimizing the following cost function $J$: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where $$h(x) = \sum_{j = 0}^{d} \theta_j x_j = \theta^T x.$$

For the sake of simplicity, we will be working on a small training set (the one we used in lectures 1 and 2):

living area (m$^2$)    price (1000's BGN)
50                     30
76                     48
26                     12
102                    90

Defining the training set

Exercise 1: Define variables X and Y that will contain the features $\mathcal{X}$ and labels $\mathcal{Y}$ of the training set.

Hint: Do not forget the intercept!


In [ ]:
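A possible definition, sketched with NumPy (using NumPy here is an assumption; plain Python lists would work too):

```python
import numpy as np

# Each row is a sample; the first column is the intercept feature x_0 = 1.
X = np.array([[1., 50.],
              [1., 76.],
              [1., 26.],
              [1., 102.]])

# Labels: prices in 1000's BGN.
Y = np.array([30., 48., 12., 90.])
```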

In this simple example, the dimensionality is $d = 1$ (which means 2 features: don't forget the intercept!) and the number of samples is $n = 4$.

Remark: 1. is written instead of 1 in order to avoid integer operations. For example, in some languages (including Python 2), the result of 1/2 is 0 and not 0.5:


In [ ]:
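For instance, the following cell illustrates the pitfall (its output depends on the Python version):

```python
# Integer division: Python 2 prints 0 here, while Python 3 prints 0.5.
print(1 / 2)
```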

Instead, writing 1./2 forces a float operation and gives 0.5 as a result, which is what we want:


In [ ]:
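A minimal check:

```python
# Float division: prints 0.5 in both Python 2 and Python 3.
print(1. / 2)
```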

Prediction function

Exercise: Define a function predict that takes as parameter the feature vector $x$ and the model $\theta$ and returns the predicted label: $$ \hat{y} = h(x) = \theta^T x = \sum_{j = 0}^d \theta_j x_j$$


In [ ]:
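One possible sketch, assuming X and Y were defined as NumPy arrays above:

```python
def predict(x, theta):
    """Predicted label h(x) = theta^T x for a single feature vector x."""
    return np.dot(theta, x)
```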

Defining the cost function

Cost function on a single sample

Exercise: Define a function cost_function that takes as parameters the actual label $y$ and the predicted label $\hat{y}$ of a single sample and returns the value of the cost function for this pair. Recall from lectures 1 and 2 that it is given by: $$ \ell \left( y, \hat{y} \right) = \dfrac{1}{2}\left( y - \hat{y} \right)^2$$


In [ ]:
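A minimal sketch:

```python
def cost_function(y, y_hat):
    """Squared-error cost on a single sample: 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2
```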

Cost function on the whole training set

We are now able to compute the cost function for a single sample. We can easily compute the cost function for the whole training set by summing the cost function values of all the samples in the training set. Recall that the total cost function is given by: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where, for all $i \in \{ 1, \dots, n \}$, $$h\left(x^{(i)}\right) = \sum_{j = 0}^{d} \theta_j x^{(i)}_j = \theta^T x^{(i)}$$ is the prediction for $x^{(i)}$ given the model $\theta$.


In [ ]:
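One possible sketch; the name cost_function_total is an assumption (any name works):

```python
def cost_function_total(X, Y, theta):
    """Total cost J(theta): sum of per-sample costs over the training set."""
    return sum(cost_function(y, predict(x, theta)) for x, y in zip(X, Y))
```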

Let's now test the code written above and check the total cost function we would have when $\theta = [0, 0]$.


In [ ]:
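For example (the expected value follows from the data: $\frac{1}{2}(30^2 + 48^2 + 12^2 + 90^2) = 5724$):

```python
theta = np.array([0., 0.])
print(cost_function_total(X, Y, theta))  # 5724.0
```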

Note that this error is big, which is to be expected because having $\theta = [0, 0]$ means always predicting $\hat{y} = 0$.

Defining the gradient of the cost function

Gradient on a single sample

Exercise: Define a function gradient that implements the gradient of the cost function for a given sample $(x, y)$. Recall from lectures 1 and 2 that the gradient is given by: $$\nabla J(\theta) = \left[ \dfrac{\partial}{\partial \theta_0} J(\theta), \dots, \dfrac{\partial}{\partial \theta_d} J(\theta) \right]^T$$ where, for all $j \in \{0, \dots, d \}$: $$ \dfrac{\partial}{\partial \theta_j} J(\theta) = \left( h\left(x\right) - y \right) x_j. $$ Hint: Recall that $d = 1$, hence the gradient is of size $2$ (one value for $j = 0$ and another one for $j = 1$). Its two values are given by: $$ \dfrac{\partial}{\partial \theta_0} J(\theta) = \left( h\left(x\right) - y \right) x_0 \quad \text{and} \quad \dfrac{\partial}{\partial \theta_1} J(\theta) = \left( h\left(x\right) - y \right) x_1. $$


In [ ]:
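One possible sketch, assuming x is a NumPy array so that the product broadcasts over its components:

```python
def gradient(x, y, theta):
    """Gradient of the per-sample cost: (h(x) - y) * x, one entry per theta_j."""
    return (predict(x, theta) - y) * x
```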

Let's try the gradient function on a simple example: $\theta = [0, 0]$ on the first sample of the training set, i.e. $\left(x^{(1)}, y^{(1)}\right)$.


In [ ]:
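For example (with $\theta = [0, 0]$ the prediction is 0, so the gradient is $-y^{(1)} \cdot x^{(1)}$):

```python
# The first sample is X[0] in Python's 0-based indexing.
print(gradient(X[0], Y[0], np.array([0., 0.])))  # [-30., -1500.]
```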

Gradient on the whole training set

Now that we are able to compute the gradient of the cost function on a single sample, we can easily compute gradient_total, the gradient of the cost function on the whole training set, by summing the gradients of all the samples in the training set.


In [ ]:
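A possible sketch:

```python
def gradient_total(X, Y, theta):
    """Gradient of J(theta): sum of the per-sample gradients."""
    return sum(gradient(x, y, theta) for x, y in zip(X, Y))
```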

Let's now test the code written above and check the total gradient we would have when $\theta = [0, 0]$.


In [ ]:
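For example:

```python
print(gradient_total(X, Y, np.array([0., 0.])))  # [-180., -14640.]
```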

Question: What is the sign of the gradient values? What would it mean if we had such a gradient when applying a gradient descent?

Hint: Recall the gradient descent update: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta) \quad \text{for all } j \in \{0, \dots, d \}$$

Answer: Both values are negative, which means this gradient step would increase the values of $\theta$, due to the fact that we subtract the gradient. This makes sense because:

  • we start with $\theta = [0, 0]$,
  • we expect $\theta_0 > 0$ and $\theta_1 > 0$ because otherwise we could predict a negative price.

Applying a gradient descent

Gradient descent step implementation

We now have all the building blocks needed for the gradient descent algorithm, that is:

  • The loss function
  • The gradient

Indeed, the iterative update scheme of this algorithm is given by the following formula: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$$ for all $j \in \{0, \dots, d \}$. Recall that $\alpha$ is a parameter called the learning rate (or step size).

Exercise: Define a function called gradient_descent_step that performs an update on theta by applying the formula above.


In [ ]:
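One possible sketch; passing X, Y and alpha explicitly is an assumption about the signature:

```python
def gradient_descent_step(X, Y, theta, alpha):
    """One batch update: theta := theta - alpha * grad J(theta)."""
    return theta - alpha * gradient_total(X, Y, theta)
```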

Exercise: Run a few iterations manually. Play with the value of $\alpha$ to see how it impacts the algorithm.


In [ ]:
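For example (alpha = 1e-5 is just one value that happens to be stable on this dataset; larger values may diverge):

```python
alpha = 1e-5
theta = np.array([0., 0.])
for _ in range(3):
    theta = gradient_descent_step(X, Y, theta, alpha)
    print(theta, cost_function_total(X, Y, theta))
```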

Iterating gradient descent steps

Exercise: The function gradient_descent_step implements a single step of the gradient descent algorithm. Implement a function called gradient_descent that starts from a given $\theta$ (for example $\theta = [0, 0]$) and applies 100 gradient descent iterations. Display the total cost function $J(\theta)$ at each iteration.


In [ ]:
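One possible sketch; returning the loss history as well is an addition (handy for plotting later):

```python
def gradient_descent(X, Y, theta, alpha, n_iterations=100):
    """Iterate gradient descent steps, displaying J(theta) at each iteration."""
    loss_history = []
    for _ in range(n_iterations):
        theta = gradient_descent_step(X, Y, theta, alpha)
        loss = cost_function_total(X, Y, theta)
        loss_history.append(loss)
        print(loss)
    return theta, loss_history

theta_trained, loss_history = gradient_descent(X, Y, np.array([0., 0.]), alpha=1e-5)
```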

Exercise: Play with the code you've just run. Try different values of $\alpha$ and see the impact it has.


In [ ]:

Exercise: Plot the loss over iterations.


In [ ]:
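A minimal sketch with matplotlib (assuming the gradient_descent cell above was run and returned loss_history):

```python
import matplotlib.pyplot as plt

plt.plot(loss_history)
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.title("Batch gradient descent: loss over iterations")
plt.show()
```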

Let us see what the trained regression looks like.


In [ ]:
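One way to visualize it (reusing plt from the cell above; areas is a hypothetical helper variable):

```python
areas = np.linspace(0., 110., 100)

plt.scatter(X[:, 1], Y, label="training set")
plt.plot(areas, theta_trained[0] + theta_trained[1] * areas, label="trained regression")
plt.xlabel("living area (m^2)")
plt.ylabel("price (1000's BGN)")
plt.legend()
plt.show()
```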

Question: Look at the evolution of the cost function over time. What comment can you make?

Answer: The loss function steadily decreases over time with this initial value of $\theta$ and this $\alpha$. It ends up reaching a plateau at around 186. It seems like the algorithm has converged to the optimal value of the cost function.

Question: What does the value theta_trained represent?

Answer: Recall that the model $\theta$ has two values, $\theta_0$ and $\theta_1$. Hence, with the model theta_trained we've learnt, the price prediction would be: $$ \text{price} = \theta_0 + \theta_1 \times \text{area}.$$

Batch gradient descent vs. stochastic gradient descent

As we have seen in lecture 1, gradient descent methods are often split into two subfamilies:

  • Batch methods update $\theta$ after having computed the gradient on the whole training set
  • Stochastic methods update $\theta$ after having computed the gradient on a single sample

The gradient descent we have implemented above (gradient_descent_step and gradient_descent) corresponds to the batch version because it sums the gradients of all the samples in the training set.

Exercise: Try to implement the stochastic version of the gradient descent algorithm. You will need to define a function stochastic_gradient_descent_step that computes a stochastic gradient step (on a single $(x, y)$ sample) and a function stochastic_gradient_descent that iterates 100 stochastic gradient descent steps and returns the trained model $\theta$.

Exercise: Apply the algorithm with the same parameters.


In [ ]:
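One possible sketch; sweeping the samples in order (rather than picking them at random) is a simplifying assumption:

```python
def stochastic_gradient_descent_step(x, y, theta, alpha):
    """One stochastic update: gradient computed on a single sample."""
    return theta - alpha * gradient(x, y, theta)

def stochastic_gradient_descent(X, Y, theta, alpha, n_iterations=100):
    """Iterate stochastic gradient steps, recording J(theta) after each."""
    loss_history = []
    for it in range(n_iterations):
        i = it % len(X)  # cycle through the training set
        theta = stochastic_gradient_descent_step(X[i], Y[i], theta, alpha)
        loss_history.append(cost_function_total(X, Y, theta))
    return theta, loss_history

theta_sgd, loss_history_sgd = stochastic_gradient_descent(X, Y, np.array([0., 0.]), alpha=1e-5)
```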

Exercise: Plot the loss history over iterations.


In [ ]:
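For example:

```python
plt.plot(loss_history_sgd)
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.title("Stochastic gradient descent: loss over iterations")
plt.show()
```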

Exercise: Plot the obtained regression.


In [ ]:
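For example (reusing areas from the batch regression plot above):

```python
plt.scatter(X[:, 1], Y, label="training set")
plt.plot(areas, theta_sgd[0] + theta_sgd[1] * areas, label="SGD regression")
plt.xlabel("living area (m^2)")
plt.ylabel("price (1000's BGN)")
plt.legend()
plt.show()
```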

Question: Compare the results obtained when solving the OLS problem with stochastic gradient descent and batch gradient descent. Are the results the same? Why?

Hint: Plot the loss function.


In [ ]:
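One way to compare the two, as the hint suggests, is to plot both loss histories on the same figure:

```python
plt.plot(loss_history, label="batch gradient descent")
plt.plot(loss_history_sgd, label="stochastic gradient descent")
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```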