Ordinary Least Squares

 Course recap

This lab consists of implementing the Ordinary Least Squares (OLS) algorithm, i.e. linear regression with a least-squares cost. Given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right), x^{(i)} \in \mathcal{X}, y^{(i)} \in \mathcal{Y}, i \in \{1, \dots, n \} \right\}$, recall (from lectures 1 and 2) that OLS aims at minimizing the following cost function $J$: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where $$h(x) = \sum_{j = 0}^{d} \theta_j x_j = \theta^T x.$$

For the sake of simplicity, we will be working on a small training set (the one we used in lectures 1 and 2):

living area (m$^2$)    price (1000's BGN)
 50                     30
 76                     48
 26                     12
102                     90

Defining the training set

Exercise 1: Define variables X and Y that will contain the features $\mathcal{X}$ and labels $\mathcal{Y}$ of the training set.

Hint: Do not forget the intercept!


In [1]:
X = [[1., 50.], [1., 76.], [1., 26.], [1., 102.]]
Y = [30., 48., 12., 90.]

In this simple example, the dimensionality is $d = 1$ (which means 2 features: don't forget the intercept!) and the number of samples is $n = 4$.
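
As a quick sanity check (not required by the exercise), these dimensions can be verified directly:

len(X)     # 4 samples, i.e. n = 4
len(X[0])  # 2 entries per feature vector: the intercept and the living area (d + 1 = 2)
len(Y)     # 4 labels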

Remark: 1. is written instead of 1 in order to avoid integer operations. For example, in some languages (including Python 2), the result of 1/2 is 0, not 0.5:


In [2]:
1/2


Out[2]:
0

Instead, writing 1./2 forces a float operation and gives 0.5 as a result, which is what we want:


In [3]:
1./2


Out[3]:
0.5

Prediction function

Exercise: Define a function predict that takes as parameters the feature vector $x$ and the model $\theta$ and returns the predicted label: $$ \hat{y} = h(x) = \theta^T x = \sum_{j = 0}^d \theta_j x_j$$


In [4]:
def predict(x, theta):
    y_hat = x[0] * theta[0] + x[1] * theta[1] # compute the dot product between 'theta' and 'x'
    return y_hat
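
For instance, with a hypothetical model $\theta = [0, 1]$ (zero intercept and unit slope, chosen only for illustration), the predicted price of the first sample is simply its living area:

predict(X[0], [0., 1.])  # returns 50.0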

Defining the cost function

Cost function on a single sample

Exercise: Define a function cost_function that takes as parameters the actual label $y$ and the predicted label $\hat{y}$ of a single sample and returns the value of the cost function for this pair. Recall from lectures 1 and 2 that it is given by: $$ \ell \left( y, \hat{y} \right) = \dfrac{1}{2}\left( y - \hat{y} \right)^2$$


In [5]:
def cost_function(y, y_hat):
    loss = (y - y_hat) ** 2 / 2
    return loss

Cost function on the whole training set

We are now able to compute the cost function for a single sample. We can easily compute the cost function for the whole training set by summing the cost function values for all the samples in the training set. Recall that the total cost function is given by: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where, for all $i \in \{ 1, \dots, n \}$ $$h\left(x^{(i)}\right) = \sum_{j = 0}^{d} \theta_j x^{(i)}_j = \theta^T x^{(i)}$$ is the prediction of $x$ given the model $\theta$.


In [6]:
def cost_function_total(X, Y, theta):
    cost = 0 # initialize the cost with 0
    n = len(Y)
    for i in range(n): # iterate over the training set (n = 4 in our case)
        x = X[i] # get the ith feature vector
        y = Y[i] # get the ith label
        y_hat = predict(x, theta) # predict the ith label
        cost += cost_function(y, y_hat) # add the cost of the current sample to the total cost
    return cost

Let's now test the code written above and check the total cost function we would have when $\theta = [0, 0]$.


In [7]:
theta_0 = [0, 0]
cost_function_total(X, Y, theta_0)


Out[7]:
5724.0

Note that this error is large, which is to be expected: $\theta = [0, 0]$ means always predicting $\hat{y} = 0$.
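
Indeed, since every prediction is $0$, the cost reduces to half the sum of the squared labels: $$J([0, 0]) = \dfrac{1}{2}\left(30^2 + 48^2 + 12^2 + 90^2\right) = \dfrac{11448}{2} = 5724.$$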

Defining the gradient of the cost function

Gradient on a single sample

Exercise: Define a function gradient that implements the gradient of the cost function for a given sample $(x, y)$. Recall from lectures 1 and 2 that the gradient is given by: $$\nabla J(\theta) = \left[ \dfrac{\partial}{\partial \theta_0} J(\theta), \dots, \dfrac{\partial}{\partial \theta_d} J(\theta) \right]^T$$ where, for all $j \in \{0, \dots, d \}$: $$ \dfrac{\partial}{\partial \theta_j} J(\theta) = \left( h\left(x\right) - y \right) x_j. $$ Hint: Recall that $d = 1$, hence the gradient is of size $2$ (one value for $j = 0$ and another one for $j = 1$). Its two values are given by: $$ \dfrac{\partial}{\partial \theta_0} J(\theta) = \left( h\left(x\right) - y \right) x_0 \quad \text{and} \quad \dfrac{\partial}{\partial \theta_1} J(\theta) = \left( h\left(x\right) - y \right) x_1. $$


In [8]:
def gradient(x, y, theta):
    grad = [0, 0]
    grad[0] = (predict(x, theta) - y) * x[0] # first value of the gradient 
    grad[1] = (predict(x, theta) - y) * x[1] # second value of the gradient
    return grad

Let's try the gradient function on a simple example: $\theta = [0, 0]$ on the first sample of the training set, i.e. $\left(x^{(1)}, y^{(1)}\right)$ (X[0] and Y[0] in our 0-indexed Python code).


In [9]:
gradient(X[0], Y[0], theta_0)


Out[9]:
[-30.0, -1500.0]
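
This matches a quick hand computation: since $h\left(x^{(1)}\right) = 0$, the two components are $(0 - 30) \times 1 = -30$ and $(0 - 30) \times 50 = -1500$.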

Gradient on the whole training set

Now that we are able to compute the gradient of the cost function on a single sample, we can easily compute gradient_total, the gradient of the cost function on the whole training set, by summing the gradients of all the samples in the training set.


In [10]:
def gradient_total(X, Y, theta):
    grad_total = [0, 0] # initialize the gradient with zeros
    n = len(Y)
    for i in range(n): # iterate over the training set
        x = X[i] # get the ith feature vector
        y = Y[i] # get the ith label
        grad = gradient(x, y, theta) # compute the gradient of the cost function for the ith sample
        grad_total[0] += grad[0] # add the gradient corresponding to theta[0]
        grad_total[1] += grad[1] # add the gradient corresponding to theta[1]
    return grad_total

Let's now test the code written above and check the total gradient we would have when $\theta = [0, 0]$.


In [11]:
gradient_total(X, Y, theta_0)


Out[11]:
[-180.0, -14640.0]
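
Again, this can be checked by hand: all predictions are $0$, so the first component is $-(30 + 48 + 12 + 90) = -180$ and the second is $-(30 \times 50 + 48 \times 76 + 12 \times 26 + 90 \times 102) = -14640$.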

Question: What is the sign of the gradient values? What would it mean if we had such a gradient when applying a gradient descent?

Hint: Recall the gradient descent update: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta) \quad \text{for all } j \in \{0, \dots, d \}$$

Answer: Both values are negative, which means this gradient step would increase the value of $\theta$, since we subtract the gradient. This makes sense, because:

  • we start with $\theta = [0, 0]$,
  • we expect $\theta_0 > 0$ and $\theta_1 > 0$ because otherwise we could predict a negative price.

Applying a gradient descent

Gradient descent step implementation

We now have all the building blocks needed for the gradient descent algorithm, that is:

  • The loss function
  • The gradient

Indeed, the iterative update scheme of this algorithm is given by the following formula: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$$ for all $j \in \{0, \dots, d \}$. Recall that $\alpha$ is a parameter called the learning rate (or step size).

Exercise: Define a function called gradient_descent_step that performs an update on theta by applying the formula above.


In [12]:
def gradient_descent_step(X, Y, theta, alpha):
    theta_updated = [0, 0]
    grad = gradient_total(X, Y, theta)
    theta_updated[0] = theta[0] - alpha * grad[0]
    theta_updated[1] = theta[1] - alpha * grad[1]
    return theta_updated

Try to run a few iterations manually. Play with the value of $\alpha$ to see how it impacts the algorithm.


In [13]:
alpha = 0.0001
theta_1 = gradient_descent_step(X, Y, theta_0, alpha)
theta_2 = gradient_descent_step(X, Y, theta_1, alpha)
theta_2


Out[13]:
[-0.0011927999999999973, 0.0938243999999997]
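
The first step can be checked by hand: starting from $\theta = [0, 0]$ with total gradient $[-180, -14640]$, the update gives $$\theta^{(1)} = \left[0 - 0.0001 \times (-180),\; 0 - 0.0001 \times (-14640)\right] = [0.018, 1.464],$$ and the second step then brings both components back down, which shows how strongly $\alpha$ scales each move.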

Iterating gradient descent steps

The gradient_descent_step function implements a single step of the gradient descent algorithm. Implement a function called gradient_descent that starts from a given $\theta$ (for example $\theta = [0, 0]$) and applies 100 gradient descent iterations. Display the total cost function $J(\theta)$ at each iteration.


In [14]:
def gradient_descent(X, Y, alpha):
    theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
    n_iteration_max = 100
    for i_iteration in range(n_iteration_max):
        loss = cost_function_total(X, Y, theta)
        print("Iteration {:>2}. Current loss = {}".format(i_iteration, loss))
        theta = gradient_descent_step(X, Y, theta, alpha)
    loss = cost_function_total(X, Y, theta)
    print("Optimization complete. Final loss = {}".format(loss))
    return theta

Play with the code you've just run. Try different values of $\alpha$ and see the impact it has.


In [15]:
theta_trained = gradient_descent(X, Y, alpha)


Iteration  0. Current loss = 5724.0
Iteration  1. Current loss = 5037.312744
Iteration  2. Current loss = 4435.7926733
Iteration  3. Current loss = 3908.87660327
Iteration  4. Current loss = 3447.31148851
Iteration  5. Current loss = 3042.99192789
Iteration  6. Current loss = 2688.81782352
Iteration  7. Current loss = 2378.56969424
Iteration  8. Current loss = 2106.79945376
Iteration  9. Current loss = 1868.73473546
Iteration 10. Current loss = 1660.19508368
Iteration 11. Current loss = 1477.51853971
Iteration 12. Current loss = 1317.49733324
Iteration 13. Current loss = 1177.32154991
Iteration 14. Current loss = 1054.52978568
Iteration 15. Current loss = 946.965921541
Iteration 16. Current loss = 852.741259375
Iteration 17. Current loss = 770.201354036
Iteration 18. Current loss = 697.896959252
Iteration 19. Current loss = 634.558577041
Iteration 20. Current loss = 579.074163751
Iteration 21. Current loss = 530.469601184
Iteration 22. Current loss = 487.891589865
Iteration 23. Current loss = 450.59266404
Iteration 24. Current loss = 417.918065244
Iteration 25. Current loss = 389.294243928
Iteration 26. Current loss = 364.218787212
Iteration 27. Current loss = 342.251595891
Iteration 28. Current loss = 323.007155739
Iteration 29. Current loss = 306.147767403
Iteration 30. Current loss = 291.377615979
Iteration 31. Current loss = 278.437576137
Iteration 32. Current loss = 267.10066155
Iteration 33. Current loss = 257.168038739
Iteration 34. Current loss = 248.465535303
Iteration 35. Current loss = 240.840581238
Iteration 36. Current loss = 234.159529618
Iteration 37. Current loss = 228.305309583
Iteration 38. Current loss = 223.175370436
Iteration 39. Current loss = 218.679880719
Iteration 40. Current loss = 214.740150664
Iteration 41. Current loss = 211.287250302
Iteration 42. Current loss = 208.260798966
Iteration 43. Current loss = 205.607904929
Iteration 44. Current loss = 203.282236566
Iteration 45. Current loss = 201.243208706
Iteration 46. Current loss = 199.455269908
Iteration 47. Current loss = 197.887278137
Iteration 48. Current loss = 196.511953863
Iteration 49. Current loss = 195.305401008
Iteration 50. Current loss = 194.246687295
Iteration 51. Current loss = 193.31747665
Iteration 52. Current loss = 192.501707203
Iteration 53. Current loss = 191.785309219
Iteration 54. Current loss = 191.155958013
Iteration 55. Current loss = 190.602857518
Iteration 56. Current loss = 190.11655069
Iteration 57. Current loss = 189.688753431
Iteration 58. Current loss = 189.312209109
Iteration 59. Current loss = 188.980561123
Iteration 60. Current loss = 188.688241277
Iteration 61. Current loss = 188.430371987
Iteration 62. Current loss = 188.202680633
Iteration 63. Current loss = 188.001424523
Iteration 64. Current loss = 187.823325164
Iteration 65. Current loss = 187.665510694
Iteration 66. Current loss = 187.525465442
Iteration 67. Current loss = 187.400985754
Iteration 68. Current loss = 187.290141292
Iteration 69. Current loss = 187.191241135
Iteration 70. Current loss = 187.102804087
Iteration 71. Current loss = 187.02353266
Iteration 72. Current loss = 186.952290295
Iteration 73. Current loss = 186.888081396
Iteration 74. Current loss = 186.830033851
Iteration 75. Current loss = 186.777383715
Iteration 76. Current loss = 186.729461795
Iteration 77. Current loss = 186.685681895
Iteration 78. Current loss = 186.645530527
Iteration 79. Current loss = 186.608557887
Iteration 80. Current loss = 186.574369962
Iteration 81. Current loss = 186.542621608
Iteration 82. Current loss = 186.513010487
Iteration 83. Current loss = 186.485271761
Iteration 84. Current loss = 186.459173439
Iteration 85. Current loss = 186.434512303
Iteration 86. Current loss = 186.411110342
Iteration 87. Current loss = 186.388811622
Iteration 88. Current loss = 186.367479547
Iteration 89. Current loss = 186.346994468
Iteration 90. Current loss = 186.327251571
Iteration 91. Current loss = 186.308159044
Iteration 92. Current loss = 186.289636463
Iteration 93. Current loss = 186.271613377
Iteration 94. Current loss = 186.254028075
Iteration 95. Current loss = 186.236826499
Iteration 96. Current loss = 186.219961295
Iteration 97. Current loss = 186.203390982
Iteration 98. Current loss = 186.187079226
Iteration 99. Current loss = 186.170994198
Optimization complete. Final loss = 186.155108016

In [16]:
theta_trained


Out[16]:
[-0.11087633725023004, 0.7567940140789401]

Question: Look at the evolution of the cost function over the iterations. What comment can you make?

Answer: With this initial value of $\theta$ and this $\alpha$, the loss decreases at every iteration. It ends up reaching a plateau at around 186, which suggests the algorithm has converged to the optimal value of the cost function.

Question: What does the value theta_trained represent?

Answer: Recall that the model $\theta$ has two values, $\theta_0$ and $\theta_1$. Hence, with the model theta_trained we have learnt, the price prediction would be: $$ \text{price} = \theta_0 + \theta_1 \times \text{area}.$$
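
For instance (purely as an illustration, a 70 m$^2$ flat is not in the training set), the trained model predicts a price of roughly 52.9 thousand BGN:

predict([1., 70.], theta_trained)  # approximately 52.9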

Homework: Batch gradient descent vs. stochastic gradient descent

As we have seen in lecture 1, gradient descent methods are often split into 2 different subfamilies:

  • Batch methods update $\theta$ after having computed the gradient on the whole training set
  • Stochastic methods update $\theta$ after having computed the gradient on a single sample

The gradient descent we have implemented above (gradient_descent_step and gradient_descent) corresponds to the batch version because it sums the gradients of all the samples in the training set.

During the next session, we will work on the following:

  • Stochastic gradient descent implementation and see how it compares to the batch version of gradient descent
  • Visualizing the loss function, its evolution over the iterations, as well as the steps taken by the gradient descent algorithm.

Exercise: Try to implement the stochastic version of the gradient descent algorithm.
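
Below is a minimal sketch of one possible stochastic version, reusing the gradient and cost_function_total functions defined above and updating $\theta$ after each sample. The number of epochs and the fact that samples are visited in order (no shuffling) are arbitrary choices made here for illustration:

def stochastic_gradient_descent(X, Y, alpha, n_epochs=100):
    theta = [0, 0]  # initialize theta with zeros, as in the batch version
    n = len(Y)
    for i_epoch in range(n_epochs):  # one epoch = one full pass over the training set
        for i in range(n):
            grad = gradient(X[i], Y[i], theta)  # gradient on a single sample
            theta = [theta[0] - alpha * grad[0],  # update theta immediately,
                     theta[1] - alpha * grad[1]]  # without summing over all samples
        loss = cost_function_total(X, Y, theta)
        print("Epoch {:>2}. Current loss = {}".format(i_epoch, loss))
    return theta

It can be called like the batch version, e.g. stochastic_gradient_descent(X, Y, alpha); shuffling the samples at each epoch is a common refinement.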