This lab consists of implementing the Ordinary Least Squares (OLS) algorithm, i.e. linear regression with a least-squares loss. Given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right), x^{(i)} \in \mathcal{X}, y^{(i)} \in \mathcal{Y}, i \in \{1, \dots, n \} \right\}$, recall (from lectures 1 and 2) that OLS aims at minimizing the following cost function $J$: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where $$h(x) = \sum_{j = 0}^{d} \theta_j x_j = \theta^T x.$$
For the sake of simplicity, we will be working on a small training set (the one we used in lectures 1 and 2):
living area (m$^2$) | price (1000's BGN)
--- | ---
50 | 30
76 | 48
26 | 12
102 | 90
In [1]:
X = [[1., 50.], [1., 76.], [1., 26.], [1., 102.]]  # feature vectors: [intercept x_0 = 1, living area x_1]
Y = [30., 48., 12., 90.]  # labels: prices in 1000's BGN
In this simple example, the dimensionality is $d = 1$ (which means 2 features: don't forget the intercept!) and the number of samples is $n = 4$.
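As a quick sanity check (a minimal sketch reusing the `X` and `Y` defined above), we can verify these dimensions directly:

n = len(Y)         # number of samples
d = len(X[0]) - 1  # dimensionality, excluding the intercept feature x_0
print(n, d)        # expected: 4 1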
Remark: `1.` is written instead of `1` in order to avoid integer operations. For example, in some languages (including Python 2), the result of `1/2` is `0` and not `0.5`:
In [2]:
1/2
Out[2]:
Instead, writing `1./2` forces a float operation and gives `0.5` as a result, which is what we want:
In [3]:
1./2
Out[3]:
In [4]:
def predict(x, theta):
y_hat = x[0] * theta[0] + x[1] * theta[1] # compute the dot product between 'theta' and 'x'
return y_hat
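For instance, with an arbitrary (hypothetical, not trained) model $\theta = [5, 0.5]$, the prediction for the first sample would be $5 + 0.5 \times 50 = 30$:

predict(X[0], [5., 0.5])  # 1.0 * 5.0 + 50.0 * 0.5 = 30.0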
Exercise: Define a function `cost_function` that takes as parameters the actual label $y$ and the predicted label $\hat{y}$ of a single sample and returns the value of the cost function for this pair. Recall from lectures 1 and 2 that it is given by:
$$ \ell \left( y, \hat{y} \right) = \dfrac{1}{2}\left( y - \hat{y} \right)^2$$
In [5]:
def cost_function(y, y_hat):
loss = (y - y_hat) ** 2 / 2
return loss
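For example, with an actual price of 30 and a (hypothetical) prediction of 25, the cost of this pair is $\frac{1}{2}(30 - 25)^2 = 12.5$:

cost_function(30., 25.)  # (30 - 25)**2 / 2 = 12.5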
We are now able to compute the cost function for a single sample. We can easily compute the cost function for the whole training set by summing the cost function values over all the samples in the training set. Recall that the total cost function is given by: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where, for all $i \in \{ 1, \dots, n \}$, $$h\left(x^{(i)}\right) = \sum_{j = 0}^{d} \theta_j x^{(i)}_j = \theta^T x^{(i)}$$ is the prediction for $x^{(i)}$ given the model $\theta$.
In [6]:
def cost_function_total(X, Y, theta):
cost = 0 # initialize the cost with 0
n = len(Y)
for i in range(n): # iterate over the training set (n = 4 in our case)
x = X[i] # get the ith feature vector
y = Y[i] # get the ith label
y_hat = predict(x, theta) # predict the ith label
cost += cost_function(y, y_hat) # add the cost of the current sample to the total cost
return cost
Let's now test the code written above and check the total cost function we would have when $\theta = [0, 0]$.
In [7]:
theta_0 = [0, 0]
cost_function_total(X, Y, theta_0)
Out[7]:
Note that this error is large, which is to be expected: having $\theta = [0, 0]$ means always predicting $\hat{y} = 0$.
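We can cross-check this value by hand using the training data defined above: since every prediction is $0$, $J(\theta) = \frac{1}{2}\left(30^2 + 48^2 + 12^2 + 90^2\right) = \frac{11448}{2} = 5724$. A one-line check (a minimal sketch reusing `Y`):

sum((y - 0.) ** 2 for y in Y) / 2  # 5724.0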
Exercise: Define a function `gradient` that implements the gradient of the cost function for a given sample $(x, y)$. Recall from lectures 1 and 2 that the gradient is given by:
$$\nabla J(\theta) = \left[ \dfrac{\partial}{\partial \theta_0} J(\theta), \dots, \dfrac{\partial}{\partial \theta_d} J(\theta) \right]^T$$
where, for all $j \in \{0, \dots, d \}$:
$$ \dfrac{\partial}{\partial \theta_j} J(\theta) = \left( h\left(x\right) - y \right) x_j. $$
Hint: Recall that $d = 1$, hence the gradient is of size $2$ (one value for $j = 0$, and another one for $j = 1$). Its two values are given by:
$$ \dfrac{\partial}{\partial \theta_0} J(\theta) = \left( h\left(x\right) - y \right) x_0 \quad \text{and} \quad
\dfrac{\partial}{\partial \theta_1} J(\theta) = \left( h\left(x\right) - y \right) x_1. $$
In [8]:
def gradient(x, y, theta):
grad = [0, 0]
grad[0] = (predict(x, theta) - y) * x[0] # first value of the gradient
grad[1] = (predict(x, theta) - y) * x[1] # second value of the gradient
return grad
Let's try the `gradient` function on a simple example: $\theta = [0, 0]$ on the first sample of the training set, i.e. $\left(x^{(1)}, y^{(1)}\right)$ (`X[0]` and `Y[0]` in the code).
In [9]:
gradient(X[0], Y[0], theta_0)
Out[9]:
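Since the prediction with $\theta = [0, 0]$ is $0$, the two components of the gradient on this sample are $(0 - 30) \times 1 = -30$ and $(0 - 30) \times 50 = -1500$, so the call above should return `[-30.0, -1500.0]`.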
In [10]:
def gradient_total(X, Y, theta):
grad_total = [0, 0] # initialize the gradient with zeros
n = len(Y)
for i in range(n): # iterate over the training set
x = X[i] # get the ith feature vector
y = Y[i] # get the ith label
        grad = gradient(x, y, theta) # compute the gradient of the ith sample given 'theta'
grad_total[0] += grad[0] # add the gradient corresponding to theta[0]
grad_total[1] += grad[1] # add the gradient corresponding to theta[1]
return grad_total
Let's now test the code written above and check the total gradient we would have when $\theta = [0, 0]$.
In [11]:
gradient_total(X, Y, theta_0)
Out[11]:
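Again, we can check this by hand by summing the per-sample gradients: the first component is $-(30 + 48 + 12 + 90) = -180$ and the second is $-(30 \times 50 + 48 \times 76 + 12 \times 26 + 90 \times 102) = -14640$, so the call above should return `[-180.0, -14640.0]`.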
Question: What is the sign of the gradient values? What would it mean if we had such a gradient when applying a gradient descent?
Hint: Recall the gradient descent update: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta) \quad \text{for all } j \in \{0, \dots, d \}$$
Answer: Both values are negative, which means this gradient step would increase the value of $\theta$, since we subtract the gradient. This makes sense: with $\theta = [0, 0]$, every prediction is $0$, which is below all the actual prices, so $\theta$ has to increase for the predictions to get closer to the labels.
We now have all the building blocks needed for the gradient descent algorithm: the prediction function `predict`, the total cost function `cost_function_total`, and the total gradient `gradient_total`.
Exercise: Define a function called `gradient_descent_step` that performs an update on `theta` by applying the formula above.
In [12]:
def gradient_descent_step(X, Y, theta, alpha):
theta_updated = [0, 0]
grad = gradient_total(X, Y, theta)
theta_updated[0] = theta[0] - alpha * grad[0]
theta_updated[1] = theta[1] - alpha * grad[1]
return theta_updated
Try to run a few iterations manually. Play with the value of $\alpha$ to see how it impacts the algorithm.
In [13]:
alpha = 0.0001
theta_1 = gradient_descent_step(X, Y, theta_0, alpha)
theta_2 = gradient_descent_step(X, Y, theta_1, alpha)
theta_2
Out[13]:
The `gradient_descent_step` function implements a single step of the gradient descent algorithm. Implement a function called `gradient_descent` that starts from a given $\theta$ (for example $\theta = [0, 0]$) and applies 100 gradient descent iterations. Display the total cost function $J(\theta)$ at each iteration.
In [14]:
def gradient_descent(X, Y, alpha):
theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
n_iteration_max = 100
for i_iteration in range(n_iteration_max):
loss = cost_function_total(X, Y, theta)
print("Iteration {:>2}. Current loss = {}".format(i_iteration, loss))
theta = gradient_descent_step(X, Y, theta, alpha)
loss = cost_function_total(X, Y, theta)
print("Optimization complete. Final loss = {}".format(loss))
return theta
Play with the code you've just run. Try different values of $\alpha$ and see the impact it has.
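One possible way to compare several learning rates is the following sketch; it reuses the functions defined above and calls `gradient_descent_step` directly so that only the final loss is printed for each candidate $\alpha$ (the candidate values are arbitrary, and a value that is too large can make the updates diverge):

for alpha_candidate in [1e-6, 1e-5, 1e-4]:
    theta_candidate = [0, 0]
    for _ in range(100):  # same number of iterations as in gradient_descent
        theta_candidate = gradient_descent_step(X, Y, theta_candidate, alpha_candidate)
    print("alpha = {}: final loss = {}".format(
        alpha_candidate, cost_function_total(X, Y, theta_candidate)))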
In [15]:
theta_trained = gradient_descent(X, Y, alpha)
In [16]:
theta_trained
Out[16]:
Question: Look at the evolution of the cost function over the iterations. What comment can you make?
Answer: With this initial value of $\theta$ and this $\alpha$, the loss decreases at every iteration and ends up reaching a plateau at around 186. It seems the algorithm has converged to the optimal value of the cost function.
Question: What does the value `theta_trained` represent?
Answer: Recall that the model $\theta$ has two values, $\theta_0$ and $\theta_1$. Hence, with the model `theta_trained` we have learnt, the price prediction would be:
$$ \text{price} = \theta_0 + \theta_1 \times \text{area}.$$
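For instance, to predict the price (in 1000's BGN) of a hypothetical 60 m$^2$ flat with the trained model, we can reuse the `predict` function, remembering the leading intercept feature $x_0 = 1$:

predict([1., 60.], theta_trained)  # predicted price for a 60 m^2 flat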
As we have seen in lecture 1, gradient descent methods are often split into two subfamilies: batch gradient descent, which uses the whole training set at each update, and stochastic gradient descent, which uses a single sample at each update. The version we have implemented here (`gradient_descent_step` and `gradient_descent`) corresponds to the batch version because it sums the gradients of all the samples in the training set. During the next session, we will work on the following:
Exercise: Try to implement the stochastic version of the gradient descent algorithm.
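As a starting point, here is one possible sketch of a stochastic version. It is only one variant among several: it picks a single sample uniformly at random at each update and reuses the `gradient` and `cost_function_total` functions defined above (sweeping a shuffled training set epoch by epoch would be another valid choice):

import random

def stochastic_gradient_descent(X, Y, alpha, n_iteration_max=100):
    theta = [0, 0]  # same initialization as before
    n = len(Y)
    for i_iteration in range(n_iteration_max):
        i = random.randrange(n)             # pick one sample at random
        grad = gradient(X[i], Y[i], theta)  # gradient of that single sample
        theta = [theta[0] - alpha * grad[0],
                 theta[1] - alpha * grad[1]]  # update theta using this sample only
        print("Iteration {:>2}. Current loss = {}".format(
            i_iteration, cost_function_total(X, Y, theta)))
    return theta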