Ordinary Least Squares -- Part II

Course recap

This lab consists of implementing the Ordinary Least Squares (OLS) algorithm, i.e. a linear regression with a least-squares loss. Given a training set $ D = \left\{ \left(x^{(i)}, y^{(i)}\right), x^{(i)} \in \mathcal{X}, y^{(i)} \in \mathcal{Y}, i \in \{1, \dots, n \} \right\}$, recall (from lectures 1 and 2) that OLS aims at minimizing the following cost function $J$: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where $$h(x) = \sum_{j = 0}^{d} \theta_j x_j = \theta^T x.$$

For the sake of simplicity, we will be working on a small training set (the one we used in lectures 1 and 2):

living area (m$^2$)    price (1000's BGN)
50                     30
76                     48
26                     12
102                    90

Defining the training set

Exercise 1: Define variables X and Y that will contain the features $\mathcal{X}$ and labels $\mathcal{Y}$ of the training set.

Hint: Do not forget the intercept!


In [31]:
X = [[1., 50.], [1., 76.], [1., 26.], [1., 102.]]
Y = [30., 48., 12., 90.]

# Y[3] = 300 # Outlier. Uncomment this line if you want to introduce an outlier.



In this simple example, the dimensionality is $d = 1$ (which means 2 features: don't forget the intercept!) and the number of samples is $n = 4$.

Remark: 1. is written instead of 1 in order to avoid integer operations. In some languages (including Python 2), the result of 1/2 is 0 and not 0.5, because both operands are integers. In Python 3, the / operator always performs float division, so 1/2 gives 0.5:


In [2]:
1/2


Out[2]:
0.5

Writing 1./2 forces a float operation in any version (including Python 2) and also gives 0.5 as a result, which is what we want:


In [3]:
1./2


Out[3]:
0.5
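Remark: in Python 2, the Python 3 division behavior can also be enabled for a whole file with a future import placed at the top:

# Python 2 only: makes / perform float division, as in Python 3
from __future__ import division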

Prediction function

Exercise: Define a function predict that takes as parameters the feature vector $x$ and the model $\theta$ and returns the predicted label: $$ \hat{y} = h(x) = \theta^T x = \sum_{j = 0}^d \theta_j x_j$$


In [4]:
def predict(x, theta):
    y_hat = x[0] * theta[0] + x[1] * theta[1] # compute the dot product between 'theta' and 'x'
    return y_hat
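For reference, here is a sketch of a predict version that works for a feature vector of any dimensionality $d$ (predict_general is a name introduced here, not required by the lab):

def predict_general(x, theta):
    # dot product between 'theta' and 'x', whatever the number of features
    return sum(theta_j * x_j for theta_j, x_j in zip(theta, x))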

Defining the cost function

Cost function on a single sample

Exercise: Define a function cost_function that takes as parameters the actual label $y$ and the predicted label $\hat{y}$ of a single sample and returns the value of the cost function for this pair. Recall from lectures 1 and 2 that it is given by: $$ \ell \left( y, \hat{y} \right) = \dfrac{1}{2}\left( y - \hat{y} \right)^2$$


In [5]:
def cost_function(y, y_hat):
    loss = (y - y_hat) ** 2 / 2
    return loss
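For instance, with $\theta = [0, 0]$ the first sample is predicted as $\hat{y} = 0$, so its cost is:

cost_function(30., 0.) # (30 - 0) ** 2 / 2 = 450.0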

Cost function on the whole training set

We are now able to compute the cost function for a single sample. We can easily compute the cost function for the whole training set by summing the cost function values for all the samples in the training set. Recall that the total cost function is given by: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ where, for all $i \in \{ 1, \dots, n \}$, $$h\left(x^{(i)}\right) = \sum_{j = 0}^{d} \theta_j x^{(i)}_j = \theta^T x^{(i)}$$ is the prediction for $x^{(i)}$ given the model $\theta$.


In [6]:
def cost_function_total(X, Y, theta):
    cost = 0 # initialize the cost with 0
    n = len(Y)
    for i in range(n): # iterate over the training set (n = 4 in our case)
        x = X[i] # get the ith feature vector
        y = Y[i] # get the ith label
        y_hat = predict(x, theta) # predict the ith label
        cost += cost_function(y, y_hat) # add the cost of the current sample to the total cost
    return cost

Let's now test the code written above and check the total cost function we would have when $\theta = [0, 0]$.


In [7]:
theta_0 = [0, 0]
cost_function_total(X, Y, theta_0)


Out[7]:
5724.0

Note that this error is big, which is expected because having $\theta = [0, 0]$ means always predicting $\hat{y} = 0$.
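We can check this value by hand: predicting $\hat{y} = 0$ for every sample gives $J(\theta) = \frac{1}{2} \left( 30^2 + 48^2 + 12^2 + 90^2 \right) = 5724$, or in code:

sum(y ** 2 for y in Y) / 2 # 0.5 * (900 + 2304 + 144 + 8100) = 5724.0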

Defining the gradient of the cost function

Gradient on a single sample

Exercise: Define a function gradient that implements the gradient of the cost function for a given sample $(x, y)$. Recall from lectures 1 and 2 that the gradient is given by: $$\nabla J(\theta) = \left[ \dfrac{\partial}{\partial \theta_0} J(\theta), \dots, \dfrac{\partial}{\partial \theta_d} J(\theta) \right]^T$$ where, for all $j \in \{0, \dots, d \}$: $$ \dfrac{\partial}{\partial \theta_j} J(\theta) = \left( h\left(x\right) - y \right) x_j. $$ Hint: Recall that $d = 1$, hence the gradient is of size $2$ (one value for $j = 0$, and another one for $j = 1$). Its two values are given by: $$ \dfrac{\partial}{\partial \theta_0} J(\theta) = \left( h\left(x\right) - y \right) x_0 \quad \text{and} \quad \dfrac{\partial}{\partial \theta_1} J(\theta) = \left( h\left(x\right) - y \right) x_1. $$


In [8]:
def gradient(x, y, theta):
    grad = [0, 0]
    grad[0] = (predict(x, theta) - y) * x[0] # first value of the gradient 
    grad[1] = (predict(x, theta) - y) * x[1] # second value of the gradient
    return grad

Let's try the gradient function on a simple example: $\theta = [0, 0]$ on the first sample of the training set, i.e. $\left(x^{(1)}, y^{(1)}\right)$, stored as X[0] and Y[0].


In [9]:
gradient(X[0], Y[0], theta_0)


Out[9]:
[-30.0, -1500.0]
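Sanity check: with $\theta = [0, 0]$ we have $h\left(x^{(1)}\right) = 0$, so the gradient is $\left( (0 - 30) \times 1, (0 - 30) \times 50 \right) = (-30, -1500)$, which matches the output above.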

Gradient on the whole training set

Now that we are able to compute the gradient of the cost function on a single sample, we can easily compute gradient_total, the gradient of the cost function on the whole training set, by summing the gradients of all the samples in the training set.


In [10]:
def gradient_total(X, Y, theta):
    grad_total = [0, 0] # initialize the gradient with zeros
    n = len(Y)
    for i in range(n): # iterate over the training set
        x = X[i] # get the ith feature vector
        y = Y[i] # get the ith label
        grad = gradient(x, y, theta) # compute the gradient on the ith sample given 'theta'
        grad_total[0] += grad[0] # add the gradient corresponding to theta[0]
        grad_total[1] += grad[1] # add the gradient corresponding to theta[1]
    return grad_total

Let's now test the code written above and check the total gradient we would have when $\theta = [0, 0]$.


In [11]:
gradient_total(X, Y, theta_0)


Out[11]:
[-180.0, -14640.0]

Question: What is the sign of the gradient values? What would it mean if we had such a gradient when applying a gradient descent?

Hint: Recall the gradient descent update: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta) \quad \text{for all } j \in \{0, \dots, d \}$$

Answer: Both values are negative, which means this gradient step would increase the values of $\theta$, due to the fact that we subtract the gradient. This makes sense, because:

  • we start with $\theta = [0, 0]$,
  • we expect $\theta_0 > 0$ and $\theta_1 > 0$ because otherwise we could predict a negative price.

Applying a gradient descent

Gradient descent step implementation

We now have all the building blocks needed for the gradient descent algorithm, that is:

  • The loss function
  • The gradient

Indeed, the iterative update scheme of this algorithm is given by the following formula: $$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$$ for all $j \in \{0, \dots, d \}$. Recall that $\alpha$ is a parameter called the learning rate (or step size).

Exercise: Define a function called gradient_descent_step that performs an update on theta by applying the formula above.


In [12]:
def gradient_descent_step(X, Y, theta, alpha):
    theta_updated = [0, 0]
    grad = gradient_total(X, Y, theta)
    theta_updated[0] = theta[0] - alpha * grad[0]
    theta_updated[1] = theta[1] - alpha * grad[1]
    return theta_updated

Try to run a few iterations manually. Play with the value of $\alpha$ to see how it impacts the algorithm.


In [13]:
alpha = 0.0001
theta_1 = gradient_descent_step(X, Y, theta_0, alpha)
theta_2 = gradient_descent_step(X, Y, theta_1, alpha)
theta_2


Out[13]:
[-0.0011927999999999973, 0.0938243999999997]

Iterating gradient descent steps

The function gradient_descent_step implements a single step of the gradient descent algorithm. Implement a function called gradient_descent that starts from a given $\theta$ (for example $\theta = [0, 0]$) and applies 100 gradient descent iterations. Display the total cost function $J(\theta)$ at each iteration.


In [14]:
def gradient_descent(X, Y, alpha):
    theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
    n_iteration_max = 100
    loss_history = []
    for i_iteration in range(n_iteration_max):
        loss = cost_function_total(X, Y, theta)
        loss_history.append(loss)
        print("Iteration {:>2}. Current loss = {}".format(i_iteration, loss))
        theta = gradient_descent_step(X, Y, theta, alpha)
    loss = cost_function_total(X, Y, theta)
    loss_history.append(loss)
    print("Optimization complete. Final loss = {}".format(loss))
    return theta, loss_history

Play with the code you've just run. Try different values of $\alpha$ and see the impact it has.

Note: $\alpha = 0.0001$ works well in this case.


In [15]:
theta_trained_gd, loss_history = gradient_descent(X, Y, alpha)


Iteration  0. Current loss = 5724.0
Iteration  1. Current loss = 5037.312744000001
Iteration  2. Current loss = 4435.792673300704
Iteration  3. Current loss = 3908.876603267039
Iteration  4. Current loss = 3447.311488513006
Iteration  5. Current loss = 3042.9919278860875
Iteration  6. Current loss = 2688.8178235154955
Iteration  7. Current loss = 2378.569694239707
Iteration  8. Current loss = 2106.799453761523
Iteration  9. Current loss = 1868.7347354588892
Iteration 10. Current loss = 1660.1950836758394
Iteration 11. Current loss = 1477.5185397081189
Iteration 12. Current loss = 1317.497333241855
Iteration 13. Current loss = 1177.3215499067294
Iteration 14. Current loss = 1054.5297856756622
Iteration 15. Current loss = 946.9659215407085
Iteration 16. Current loss = 852.7412593745793
Iteration 17. Current loss = 770.2013540363196
Iteration 18. Current loss = 697.8969592515267
Iteration 19. Current loss = 634.5585770405551
Iteration 20. Current loss = 579.0741637509263
Iteration 21. Current loss = 530.469601184068
Iteration 22. Current loss = 487.89158986501747
Iteration 23. Current loss = 450.5926640395761
Iteration 24. Current loss = 417.9180652435737
Iteration 25. Current loss = 389.2942439277548
Iteration 26. Current loss = 364.21878721247606
Iteration 27. Current loss = 342.2515958910323
Iteration 28. Current loss = 323.0071557387844
Iteration 29. Current loss = 306.1477674026388
Iteration 30. Current loss = 291.3776159792998
Iteration 31. Current loss = 278.43757613668663
Iteration 32. Current loss = 267.1006615499776
Iteration 33. Current loss = 257.1680387386946
Iteration 34. Current loss = 248.46553530284635
Iteration 35. Current loss = 240.84058123839284
Iteration 36. Current loss = 234.159529617723
Iteration 37. Current loss = 228.30530958295333
Iteration 38. Current loss = 223.17537043567705
Iteration 39. Current loss = 218.67988071882633
Iteration 40. Current loss = 214.74015066426756
Iteration 41. Current loss = 211.28725030235205
Iteration 42. Current loss = 208.26079896569684
Iteration 43. Current loss = 205.6079049293753
Iteration 44. Current loss = 203.28223656627523
Iteration 45. Current loss = 201.24320870595525
Iteration 46. Current loss = 199.45526990844746
Iteration 47. Current loss = 197.8872781366419
Iteration 48. Current loss = 196.51195386328519
Iteration 49. Current loss = 195.30540100847043
Iteration 50. Current loss = 194.24668729467783
Iteration 51. Current loss = 193.31747664988472
Iteration 52. Current loss = 192.5017072032791
Iteration 53. Current loss = 191.7853092187849
Iteration 54. Current loss = 191.15595801295416
Iteration 55. Current loss = 190.60285751816338
Iteration 56. Current loss = 190.11655069020554
Iteration 57. Current loss = 189.68875343080742
Iteration 58. Current loss = 189.3122091085363
Iteration 59. Current loss = 188.98056112330914
Iteration 60. Current loss = 188.68824127657257
Iteration 61. Current loss = 188.4303719867952
Iteration 62. Current loss = 188.20268063306162
Iteration 63. Current loss = 188.0014245225231
Iteration 64. Current loss = 187.82332516404898
Iteration 65. Current loss = 187.66551069383956
Iteration 66. Current loss = 187.52546544192012
Iteration 67. Current loss = 187.4009857538448
Iteration 68. Current loss = 187.29014129177935
Iteration 69. Current loss = 187.19124113536765
Iteration 70. Current loss = 187.1028040870657
Iteration 71. Current loss = 187.0235326604702
Iteration 72. Current loss = 186.9522902948442
Iteration 73. Current loss = 186.88808139569693
Iteration 74. Current loss = 186.83003385090893
Iteration 75. Current loss = 186.77738371535878
Iteration 76. Current loss = 186.7294617950999
Iteration 77. Current loss = 186.6856818954842
Iteration 78. Current loss = 186.64553052685784
Iteration 79. Current loss = 186.60855788704697
Iteration 80. Current loss = 186.57436996227526
Iteration 81. Current loss = 186.54262160779513
Iteration 82. Current loss = 186.51301048672116
Iteration 83. Current loss = 186.4852717606192
Iteration 84. Current loss = 186.45917343862135
Iteration 85. Current loss = 186.43451230337817
Iteration 86. Current loss = 186.4111103423159
Iteration 87. Current loss = 186.38881162151648
Iteration 88. Current loss = 186.3674795473284
Iteration 89. Current loss = 186.34699446761687
Iteration 90. Current loss = 186.32725157052732
Iteration 91. Current loss = 186.3081590438632
Iteration 92. Current loss = 186.28963646275298
Iteration 93. Current loss = 186.27161337729555
Iteration 94. Current loss = 186.25402807537478
Iteration 95. Current loss = 186.2368264989228
Iteration 96. Current loss = 186.2199612945987
Iteration 97. Current loss = 186.2033909822071
Iteration 98. Current loss = 186.1870792262614
Iteration 99. Current loss = 186.17099419788883
Optimization complete. Final loss = 186.15510801588317

Let's have a more visual interpretation by plotting the loss over iterations.


In [16]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(loss_history)
plt.ylabel('loss')
plt.show()



In [17]:
theta_trained_gd


Out[17]:
[-0.11087633725023004, 0.7567940140789401]

Let us see what the trained regression looks like.


In [18]:
houses_sizes = range(150)
estimated_prices = [theta_trained_gd[0] + house_size * theta_trained_gd[1] for house_size in houses_sizes]
x1s = [x[1] for x in X]
plt.plot(houses_sizes, estimated_prices)
plt.ylabel("price")
plt.xlabel("house size")
plt.scatter(x1s, Y, color="red")
plt.show()


Question: Look at the evolution of the cost function over iterations. What comment can you make?

Answer: The loss decreases steadily over iterations with this initial value of $\theta$ and this $\alpha$. It ends up reaching a plateau at around 186: the algorithm seems to have converged to the optimal value of the cost function.

Question: What does the value theta_trained_gd represent?

Answer: Recall that the model $\theta$ has two values, $\theta_0$ and $\theta_1$. Hence, with the model theta_trained_gd we have learned, the price prediction is: $$ \text{price} = \theta_0 + \theta_1 \times \text{area}.$$
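For instance, the predicted price of an 80 m$^2$ house under this trained model is:

predict([1., 80.], theta_trained_gd) # about -0.11 + 0.76 * 80, i.e. roughly 60.4 (thousands of BGN)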

Batch gradient descent vs. stochastic gradient descent

As we have seen during lecture 1, gradient descent methods are often split into 2 different subfamilies:

  • Batch methods update $\theta$ after having computed the gradient on the whole training set
  • Stochastic methods update $\theta$ after having computed the gradient on a single sample

The gradient descent we have implemented above (gradient_descent_step and gradient_descent) corresponds to the batch version because it sums the gradients of all the samples in the training set.

Exercise: Try to implement the stochastic version of the gradient descent algorithm. You will need to define a function stochastic_gradient_descent_step that computes a stochastic gradient step (on a single $(x, y)$ sample) and a function stochastic_gradient_descent that iterates 100 stochastic gradient descent steps and returns the trained model $\theta$.

Solution: We first define the function stochastic_gradient_descent_step that implements a stochastic gradient step.


In [19]:
def stochastic_gradient_descent_step(x, y, theta, alpha):
    theta_updated = [0, 0]
    grad = gradient(x, y, theta)
    theta_updated[0] = theta[0] - alpha * grad[0]
    theta_updated[1] = theta[1] - alpha * grad[1]
    return theta_updated

Then the stochastic_gradient_descent function that will do the iterations.


In [20]:
def stochastic_gradient_descent(X, Y, alpha):
    theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
    n_iteration_max = 100
    loss_history = []
    n_samples = len(Y)
    for i_iteration in range(n_iteration_max):
        i_sample = i_iteration % n_samples
        loss = cost_function_total(X, Y, theta)
        loss_history.append(loss)
        print("Iteration {:>2}. Current loss = {}".format(i_iteration, loss))
        theta = stochastic_gradient_descent_step(X[i_sample], Y[i_sample], theta, alpha) # run the gradient update on a single sample
    loss = cost_function_total(X, Y, theta)
    loss_history.append(loss)
    print("Optimization complete. Final loss = {}".format(loss))
    return theta, loss_history

Let's now apply the algorithm with the same parameters.


In [21]:
theta_trained_sgd, loss_history_sgd = stochastic_gradient_descent(X, Y, alpha)


Iteration  0. Current loss = 5724.0
Iteration  1. Current loss = 3745.329318
Iteration  2. Current loss = 1229.5885055588851
Iteration  3. Current loss = 1215.4002506952156
Iteration  4. Current loss = 389.2488988697621
Iteration  5. Current loss = 233.84872707672636
Iteration  6. Current loss = 205.38990003797537
Iteration  7. Current loss = 222.35389730665707
Iteration  8. Current loss = 360.2446248124806
Iteration  9. Current loss = 223.65082333311133
Iteration 10. Current loss = 208.31517423690133
Iteration 11. Current loss = 226.11952840036483
Iteration 12. Current loss = 360.5442624543617
Iteration 13. Current loss = 223.74419111203997
Iteration 14. Current loss = 208.25999331497226
Iteration 15. Current loss = 226.05126940341967
Iteration 16. Current loss = 360.50675808520907
Iteration 17. Current loss = 223.721958030496
Iteration 18. Current loss = 208.24178739067673
Iteration 19. Current loss = 226.03001538194087
Iteration 20. Current loss = 360.4732920259364
Iteration 21. Current loss = 223.7011117481283
Iteration 22. Current loss = 208.22314266007874
Iteration 23. Current loss = 226.0082029487174
Iteration 24. Current loss = 360.4397826157614
Iteration 25. Current loss = 223.68025225863477
Iteration 26. Current loss = 208.20450634314338
Iteration 27. Current loss = 225.9864006950348
Iteration 28. Current loss = 360.40627866132445
Iteration 29. Current loss = 223.65939630261482
Iteration 30. Current loss = 208.18587309127957
Iteration 31. Current loss = 225.96460182117386
Iteration 32. Current loss = 360.37277957826257
Iteration 33. Current loss = 223.63854367935122
Iteration 34. Current loss = 208.16724296793694
Iteration 35. Current loss = 225.94280640789367
Iteration 36. Current loss = 360.3392853728212
Iteration 37. Current loss = 223.61769439071062
Iteration 38. Current loss = 208.14861597184492
Iteration 39. Current loss = 225.92101445367686
Iteration 40. Current loss = 360.3057960441828
Iteration 41. Current loss = 223.59684843613806
Iteration 42. Current loss = 208.12999210250692
Iteration 43. Current loss = 225.89922595799038
Iteration 44. Current loss = 360.27231159161306
Iteration 45. Current loss = 223.57600581510619
Iteration 46. Current loss = 208.111371359418
Iteration 47. Current loss = 225.87744092029027
Iteration 48. Current loss = 360.23883201437906
Iteration 49. Current loss = 223.55516652708909
Iteration 50. Current loss = 208.09275374207272
Iteration 51. Current loss = 225.85565934003165
Iteration 52. Current loss = 360.205357311747
Iteration 53. Current loss = 223.53433057155982
Iteration 54. Current loss = 208.07413924996584
Iteration 55. Current loss = 225.8338812166708
Iteration 56. Current loss = 360.1718874829827
Iteration 57. Current loss = 223.5134979479919
Iteration 58. Current loss = 208.05552788259214
Iteration 59. Current loss = 225.81210654966304
Iteration 60. Current loss = 360.13842252735407
Iteration 61. Current loss = 223.49266865585912
Iteration 62. Current loss = 208.03691963944726
Iteration 63. Current loss = 225.79033533846427
Iteration 64. Current loss = 360.10496244412576
Iteration 65. Current loss = 223.4718426946345
Iteration 66. Current loss = 208.0183145200253
Iteration 67. Current loss = 225.76856758253032
Iteration 68. Current loss = 360.0715072325656
Iteration 69. Current loss = 223.45102006379193
Iteration 70. Current loss = 207.99971252382204
Iteration 71. Current loss = 225.74680328131802
Iteration 72. Current loss = 360.03805689193996
Iteration 73. Current loss = 223.43020076280555
Iteration 74. Current loss = 207.9811136503327
Iteration 75. Current loss = 225.725042434283
Iteration 76. Current loss = 360.0046114215164
Iteration 77. Current loss = 223.4093847911493
Iteration 78. Current loss = 207.96251789905267
Iteration 79. Current loss = 225.7032850408812
Iteration 80. Current loss = 359.9711708205613
Iteration 81. Current loss = 223.38857214829648
Iteration 82. Current loss = 207.94392526947695
Iteration 83. Current loss = 225.68153110056898
Iteration 84. Current loss = 359.93773508834147
Iteration 85. Current loss = 223.36776283372149
Iteration 86. Current loss = 207.92533576110145
Iteration 87. Current loss = 225.65978061280353
Iteration 88. Current loss = 359.90430422412567
Iteration 89. Current loss = 223.3469568468986
Iteration 90. Current loss = 207.906749373421
Iteration 91. Current loss = 225.63803357704077
Iteration 92. Current loss = 359.8708782271795
Iteration 93. Current loss = 223.3261541873017
Iteration 94. Current loss = 207.88816610593256
Iteration 95. Current loss = 225.61628999273688
Iteration 96. Current loss = 359.8374570967709
Iteration 97. Current loss = 223.30535485440527
Iteration 98. Current loss = 207.86958595813053
Iteration 99. Current loss = 225.5945498593494
Optimization complete. Final loss = 359.80404083216865

Here again, let's plot the loss history over iterations.


In [22]:
plt.plot(loss_history_sgd)
plt.ylabel('loss')
plt.show()


Again, let's look at the regression we obtain.


In [23]:
houses_sizes = range(150)
estimated_prices = [theta_trained_sgd[0] + house_size * theta_trained_sgd[1] for house_size in houses_sizes]
x1s = [x[1] for x in X]
plt.plot(houses_sizes, estimated_prices)
plt.ylabel("price")
plt.xlabel("house size")
plt.scatter(x1s, Y, color="red")
plt.show()


Question: Compare the results obtained when solving the OLS problem with stochastic gradient descent and batch gradient descent. Are the results the same? Why?

Hint: Plot the loss function.

Answer: The results are not the same: the loss values differ a lot. With this learning rate, the stochastic version keeps oscillating (between roughly 208 and 360 over the last iterations) instead of settling down, while the batch version converges to a loss of about 186.

To understand what happens, let's look at what the function looks like. It is a function of 2 variables ($\theta_0$ and $\theta_1$), so we can use either a 3D plot or a contour plot to visualize it.

The following code shows the contour plot. First, we need to define a grid of $\theta$ values on which we will evaluate the loss function. theta0_vector and theta1_vector define the values of $\theta_0$ and $\theta_1$ on which the function will be evaluated.


In [24]:
import numpy as np
dtheta = 0.001
theta0_vector = np.arange(-1,+1, dtheta)
theta1_vector = np.arange(-1,+1, dtheta)

Let's compute the value of the loss function for each pair $(\theta_0, \theta_1)$.


In [25]:
n0 = len(theta0_vector)
n1 = len(theta1_vector)
loss = np.zeros([n0, n1])
for i0 in range(n0):
    for i1 in range(n1):
        loss[i0, i1] = cost_function_total(X, Y, [theta0_vector[i0], theta1_vector[i1]])
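Remark: the double Python loop above can be slow for fine grids. A vectorized equivalent using NumPy broadcasting would be the following sketch (loss_vec is a name introduced here; it holds the same values as loss):

Xa, Ya = np.array(X), np.array(Y)
T0, T1 = np.meshgrid(theta0_vector, theta1_vector, indexing="ij")
# predictions of shape (n0, n1, n): one per grid point and per sample
preds = T0[..., None] * Xa[:, 0] + T1[..., None] * Xa[:, 1]
loss_vec = ((preds - Ya) ** 2).sum(axis=-1) / 2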

In [26]:
fig = plt.figure()
plt.contourf(theta0_vector, theta1_vector, loss)
plt.colorbar()
plt.show()


We can also look at the function from "further away":


In [27]:
dtheta = 1.
# from -100 to +100 on both axes
theta0_vector = np.arange(-100,100, dtheta)
theta1_vector = np.arange(-100,100, dtheta)
n0 = len(theta0_vector)
n1 = len(theta1_vector)
loss = np.zeros([n0, n1])
for i0 in range(n0):
    for i1 in range(n1):
        loss[i0, i1] = cost_function_total(X, Y, [theta0_vector[i0], theta1_vector[i1]])
fig = plt.figure()
plt.contourf(theta0_vector, theta1_vector, loss)
plt.colorbar()
plt.show()


It appears that the function behaves very differently on the two axes:

  • It varies very fast on the horizontal axis
  • It varies very slowly on the vertical axis

Question: What could this be due to? How could we solve this issue?

Answer: This is due to the fact that the two features (the size and the intercept) are on very different scales: the intercept is constantly equal to 1, while the size of the house varies on a much wider range (up to 102). In our algorithm, we make no difference between $\theta_0$ and $\theta_1$: the same step size $\alpha$ is applied to both coordinates. A way to address this issue is to normalize (or rescale) the features. There are several ways to do it:

  1. The first way consists in, for each feature, dividing all the feature values by its maximal value. In the case of the house size, it would consist in dividing all the house sizes by 102, so that the biggest house would have a size of 1 (see the sketch right after this list).
  2. Another way consists in min-max scaling the feature: $$ z = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}} $$ so that the feature has $0$ as minimum and $1$ as maximum value. Note that neither transformation applies to the intercept $x_0$, because it is constantly equal to $1$.
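A minimal sketch of the first option, assuming X as defined above (x_max_value and X_scaled are names introduced here):

x_max_value = max(x[1] for x in X) # 102 in our training set
X_scaled = [[x[0], x[1] / x_max_value] for x in X] # the intercept is left untouched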

Homework:

  1. Implement the feature normalization of your choice. Run the OLS algorithm on it, and compare the result with the non-normalized regression.
  2. The gradient descent we have implemented seems to lower the loss smoothly. The implementation we proposed did 100 iterations, and we can see that the loss value is barely changing after 40-50 iterations. Could you think of a way to stop the algorithm early, to avoid many iterations that give only a marginal gain in the loss function? (One possible starting point is sketched right after this list.)
  3. Add an outlier to the training set (for example, a house with a normal size but a very small or very big price), and run the OLS algorithm on it. What impact does the outlier have on the quality of the regression? How to correct this issue?
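For the second homework, one possible stopping criterion (a sketch; the tolerance value is an arbitrary choice) is to stop as soon as an iteration improves the loss by less than a small threshold:

def has_converged(loss_history, tolerance=1e-3):
    # stop once the last iteration improved the loss by less than 'tolerance'
    return len(loss_history) >= 2 and loss_history[-2] - loss_history[-1] < tolerance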

In the next session, we will talk about regularization and how to define a more complex model when the data is not linearly separable.

Let's now apply the following transformation (the min-max scaling from above, which addresses the first homework): $$ z = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}} $$ To do so, we need to compute the min and max values of $X$. Let's also define a function that computes $z$ given $x$.


In [28]:
x_min = X[0][1]
x_max = X[0][1]
for i in range(len(X)):
    x_min = min(x_min, X[i][1])
    x_max = max(x_max, X[i][1])
    
def rescale(x):
    return (x - x_min)/(x_max - x_min)

Let's now apply this transformation to all the elements of $X$.

Remark: We keep the intercept as it is, without applying the transformation to it.


In [29]:
Z = [[x[0], rescale(x[1])] for x in X]
print(Z)


[[1.0, 0.3157894736842105], [1.0, 0.6578947368421053], [1.0, 0.0], [1.0, 1.0]]

The intercept remains unchanged, and the second value of each instance (which corresponds to the area of the house) is now between 0 and 1.
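Note that a model trained on Z expects rescaled inputs: to predict the price of a new house, its area must first go through rescale. A minimal sketch (predict_from_area is a hypothetical helper introduced here):

def predict_from_area(area, theta):
    # rescale the raw area before applying a model trained on Z
    return predict([1., rescale(area)], theta)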


In [30]:
dtheta = 1.
# from -100 to +100 on both axes
theta0_vector = np.arange(-100,100, dtheta)
theta1_vector = np.arange(-100,100, dtheta)
n0 = len(theta0_vector)
n1 = len(theta1_vector)
loss = np.zeros([n0, n1])
for i0 in range(n0):
    for i1 in range(n1):
        loss[i0, i1] = cost_function_total(Z, Y, [theta0_vector[i0], theta1_vector[i1]])
fig = plt.figure()
plt.contourf(theta0_vector, theta1_vector, loss)
plt.colorbar()
plt.show()


We can now apply the gradient descent algorithm we defined previously (slightly modified below to also record the history of $\theta$ values).


In [31]:
def gradient_descent(X, Y, alpha):
    theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
    n_iteration_max = 100
    loss_history = []
    theta_history = [theta]
    for i_iteration in range(n_iteration_max):
        loss = cost_function_total(X, Y, theta)
        loss_history.append(loss)
        # print("Iteration {:>2}. Current loss = {}".format(i_iteration, loss))
        theta = gradient_descent_step(X, Y, theta, alpha)
        theta_history.append(theta)
    loss = cost_function_total(X, Y, theta)
    loss_history.append(loss)
    print("Optimization complete. Final loss = {}".format(loss))
    plt.plot(loss_history)
    plt.ylabel('loss')
    plt.xlabel('iterations')
    plt.show()
    return theta_history, loss_history

alpha = 0.0001
theta_history_gd, loss_history_gd = gradient_descent(Z, Y, alpha)


Optimization complete. Final loss = 5252.219720169256

The loss is constantly going down and does not seem to have reached its minimal value. This is because we chose $\alpha = 0.0001$, which isn't appropriate in this case: $\alpha$ is way too small, so we are taking tiny steps and it would take a very long time for the algorithm to converge.

Note: As previously, the chosen value of $\alpha$ is still important: if we take $\alpha = 1$, the algorithm diverges:


In [32]:
alpha = 1.
theta_history_gd, loss_history_gd = gradient_descent(Z, Y, alpha)


Optimization complete. Final loss = 1.2779108922343823e+126

Instead, let us try with a higher value, say $\alpha = 0.1$.


In [33]:
alpha = 0.1
theta_history_gd, loss_history = gradient_descent(Z, Y, alpha)


Optimization complete. Final loss = 77.56352172981322

$\alpha = 0.1$ works much better: the loss decreases fast at the beginning and settles at a value of about 77.

Note: This loss value is different from the one we obtained without rescaling the $x$ values (77 vs. 186). This is completely normal, because rescaling $X$ into $Z$ changes the loss function, which becomes $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(z^{(i)}\right) - y^{(i)} \right)^2$$ instead of: $$J(\theta) = \dfrac{1}{2} \sum_{i = 1}^{n} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^2$$ ($x^{(i)}$ is replaced by $z^{(i)}$ in the formula). As a consequence, having 77 rather than 186 doesn't mean we're doing a better job than before!
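The two models can nonetheless be compared on the original scale: since $z = \left(x - x_{\min}\right)/\left(x_{\max} - x_{\min}\right)$, a model trained on $Z$ can be mapped back as in the following sketch (to_original_scale is a name introduced here):

def to_original_scale(theta_z):
    # price = theta_z[0] + theta_z[1] * (x - x_min) / (x_max - x_min)
    slope = theta_z[1] / (x_max - x_min)
    return [theta_z[0] - slope * x_min, slope]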


In [34]:
dtheta = 1.
# from 0 to +100 on both axes
theta0_vector = np.arange(0,100, dtheta)
theta1_vector = np.arange(0,100, dtheta)
n0 = len(theta0_vector)
n1 = len(theta1_vector)
loss = np.zeros([n0, n1])
for i0 in range(n0):
    for i1 in range(n1):
        loss[i0, i1] = cost_function_total(Z, Y, [theta0_vector[i0], theta1_vector[i1]])
fig = plt.figure()
plt.contourf(theta0_vector, theta1_vector, loss)
plt.colorbar()
for theta in theta_history_gd:
    plt.scatter(theta[1], theta[0], color="red")
plt.show()


Let's do the same for stochastic gradient descent:


In [35]:
def stochastic_gradient_descent(X, Y, alpha):
    theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
    n_iteration_max = 100
    loss_history = []
    n_samples = len(Y)
    theta_history = [theta]
    for i_iteration in range(n_iteration_max):
        i_sample = i_iteration % n_samples
        loss = cost_function_total(X, Y, theta)
        loss_history.append(loss)
        print("Iteration {:>2}. Current loss = {}".format(i_iteration, loss))
        theta = stochastic_gradient_descent_step(X[i_sample], Y[i_sample], theta, alpha) # run the gradient update on a single sample
        theta_history.append(theta)
    loss = cost_function_total(X, Y, theta)
    loss_history.append(loss)
    print("Optimization complete. Final loss = {}".format(loss))
    return theta_history, loss_history

In [36]:
alpha = .5
theta_history_sgd, loss_history_sgd = stochastic_gradient_descent(Z, Y, alpha)
plt.plot(loss_history_sgd)
plt.ylabel('loss')
plt.show()


Iteration  0. Current loss = 5724.0
Iteration  1. Current loss = 3010.6532446804435
Iteration  2. Current loss = 1241.9025020501365
Iteration  3. Current loss = 1685.279818913434
Iteration  4. Current loss = 1531.9766999537965
Iteration  5. Current loss = 561.6195267010057
Iteration  6. Current loss = 561.6950873394483
Iteration  7. Current loss = 695.3182321565688
Iteration  8. Current loss = 931.8346044543953
Iteration  9. Current loss = 330.2952436540648
Iteration 10. Current loss = 345.2981694177922
Iteration 11. Current loss = 417.82247243762936
Iteration 12. Current loss = 602.6180880809652
Iteration 13. Current loss = 211.2669451858648
Iteration 14. Current loss = 238.35061763263207
Iteration 15. Current loss = 269.80037270918945
Iteration 16. Current loss = 418.37493939016014
Iteration 17. Current loss = 149.91153342330858
Iteration 18. Current loss = 186.35699699370403
Iteration 19. Current loss = 189.76918378198243
Iteration 20. Current loss = 312.8279513004007
Iteration 21. Current loss = 118.20691670069019
Iteration 22. Current loss = 161.71909972666782
Iteration 23. Current loss = 145.76124061473206
Iteration 24. Current loss = 250.76723445227606
Iteration 25. Current loss = 101.76863999861726
Iteration 26. Current loss = 150.52615505433388
Iteration 27. Current loss = 121.06195124820532
Iteration 28. Current loss = 213.254642288688
Iteration 29. Current loss = 93.20643928610741
Iteration 30. Current loss = 145.814144689401
Iteration 31. Current loss = 106.86625787677015
Iteration 32. Current loss = 189.94336293986785
Iteration 33. Current loss = 88.71889884214102
Iteration 34. Current loss = 144.13139376144792
Iteration 35. Current loss = 98.48970924226285
Iteration 36. Current loss = 175.07094578163267
Iteration 37. Current loss = 86.34740181551697
Iteration 38. Current loss = 143.79247658929688
Iteration 39. Current loss = 93.40804945660534
Iteration 40. Current loss = 165.3546426031773
Iteration 41. Current loss = 85.08049511257698
Iteration 42. Current loss = 143.99318164746717
Iteration 43. Current loss = 90.23898381794234
Iteration 44. Current loss = 158.87598385797403
Iteration 45. Current loss = 84.39421019660546
Iteration 46. Current loss = 144.3638645076205
Iteration 47. Current loss = 88.2105416041328
Iteration 48. Current loss = 154.48268130141545
Iteration 49. Current loss = 84.01594715100458
Iteration 50. Current loss = 144.7454481849383
Iteration 51. Current loss = 86.88154567000558
Iteration 52. Current loss = 151.4631193322166
Iteration 53. Current loss = 83.80305827048062
Iteration 54. Current loss = 145.07810690060668
Iteration 55. Current loss = 85.99326603006479
Iteration 56. Current loss = 149.36591513064735
Iteration 57. Current loss = 83.68031682422452
Iteration 58. Current loss = 145.34669518675065
Iteration 59. Current loss = 85.38973629688371
Iteration 60. Current loss = 147.8976824704998
Iteration 61. Current loss = 83.60764408252136
Iteration 62. Current loss = 145.5545468411729
Iteration 63. Current loss = 84.97429147320415
Iteration 64. Current loss = 146.86364210078062
Iteration 65. Current loss = 83.56340419140346
Iteration 66. Current loss = 145.71130737950176
Iteration 67. Current loss = 84.685411332905
Iteration 68. Current loss = 146.13217530376218
Iteration 69. Current loss = 83.5357226578277
Iteration 70. Current loss = 145.82759833410626
Iteration 71. Current loss = 84.48299101275114
Iteration 72. Current loss = 145.6130704561316
Iteration 73. Current loss = 83.51795012529058
Iteration 74. Current loss = 145.91292594711132
Iteration 75. Current loss = 84.34033807981587
Iteration 76. Current loss = 145.24380606439095
Iteration 77. Current loss = 83.50627483612044
Iteration 78. Current loss = 145.97506922228146
Iteration 79. Current loss = 84.23937870145936
Iteration 80. Current loss = 144.98068216103007
Iteration 81. Current loss = 83.49845387500828
Iteration 82. Current loss = 146.0200951455565
Iteration 83. Current loss = 84.16770504373368
Iteration 84. Current loss = 144.79295896443543
Iteration 85. Current loss = 83.4931304929281
Iteration 86. Current loss = 146.0526017003134
Iteration 87. Current loss = 84.11670711128062
Iteration 88. Current loss = 144.65891082822105
Iteration 89. Current loss = 83.48946095342595
Iteration 90. Current loss = 146.07601068372122
Iteration 91. Current loss = 84.0803612093535
Iteration 92. Current loss = 144.56312956025602
Iteration 93. Current loss = 83.48690659451836
Iteration 94. Current loss = 146.09283816901842
Iteration 95. Current loss = 84.05442713380813
Iteration 96. Current loss = 144.49465972833522
Iteration 97. Current loss = 83.48511529329402
Iteration 98. Current loss = 146.10491926351492
Iteration 99. Current loss = 84.03590653123382
Optimization complete. Final loss = 144.44569756315704

As was the case previously, stochastic gradient descent seems to oscillate around the optimal position. Let us see how the weight vector $\theta$ evolves.


In [37]:
dtheta = 1.
# from 0 to +100 on both axes
theta0_vector = np.arange(0,100, dtheta)
theta1_vector = np.arange(0,100, dtheta)
n0 = len(theta0_vector)
n1 = len(theta1_vector)
loss = np.zeros([n0, n1])
for i0 in range(n0):
    for i1 in range(n1):
        loss[i0, i1] = cost_function_total(Z, Y, [theta0_vector[i0], theta1_vector[i1]])
fig = plt.figure()
plt.contourf(theta0_vector, theta1_vector, loss)
plt.colorbar()
for theta in theta_history_sgd:
    plt.scatter(theta[1], theta[0], color="red")
plt.show()


Let's have the learning rate decrease over time, which damps these oscillations, and put everything together:


In [38]:
from math import sqrt
def stochastic_gradient_descent(X, Y, alpha):
    theta = [0, 0] # initializing theta with zeros (it can be initialized in another manner)
    n_iteration_max = 100
    loss_history = []
    n_samples = len(Y)
    theta_history = [theta]
    for i_iteration in range(n_iteration_max):
        i_sample = i_iteration % n_samples
        loss = cost_function_total(X, Y, theta)
        loss_history.append(loss)
        # decay the learning rate over iterations
        alpha_sgd = alpha / sqrt(i_iteration + 1)
        theta = stochastic_gradient_descent_step(X[i_sample], Y[i_sample], theta, alpha_sgd) # run the gradient update on a single sample
        theta_history.append(theta)
    loss = cost_function_total(X, Y, theta)
    loss_history.append(loss)
    print("Optimization complete. Final loss = {}".format(loss))
    return theta_history, loss_history

alpha = 1.
theta_history_sgd, loss_history_sgd = stochastic_gradient_descent(Z, Y, alpha)
plt.plot(loss_history_sgd)
plt.ylabel('loss')
plt.show()

dtheta = 1.
# from 0 to +100 on both axes
theta0_vector = np.arange(0,100, dtheta)
theta1_vector = np.arange(0,100, dtheta)
n0 = len(theta0_vector)
n1 = len(theta1_vector)
loss = np.zeros([n0, n1])
for i0 in range(n0):
    for i1 in range(n1):
        loss[i0, i1] = cost_function_total(Z, Y, [theta0_vector[i0], theta1_vector[i1]])
fig = plt.figure()
plt.contourf(theta0_vector, theta1_vector, loss)
plt.colorbar()
for theta in theta_history_sgd:
    plt.scatter(theta[1], theta[0], color="red")
plt.show()


Optimization complete. Final loss = 89.96861903899075