Linear Regression (Gradient Descent)

So far we have looked at direct matrix method for solving the $Ax = b$ problem. But most machine learning algorithms may not be directly computatable. So the alternative way to do is to define a cost function that needs to be minimised and use a gradient descent approach to solve it.



In [1]:

    
import numpy as np
import pandas as pd



In [2]:

    
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (9, 6)



In [3]:

    
pop = pd.read_csv('data/cars_small.csv')
pop.head()









    Out[3]:







  
    
      
      brand
      model
      price
      kmpl
      bhp
      type
    
  
  
    
      0
      Chevrolet
      Beat
      421
      18.6
      79
      Hatchback
    
    
      1
      Chevrolet
      Sail
      551
      18.2
      82
      Sedan
    
    
      2
      Chevrolet
      Sail Hatchback
      468
      18.2
      82
      Hatchback
    
    
      3
      Chevrolet
      Spark
      345
      16.2
      62
      Hatchback
    
    
      4
      Fiat
      Linea Classic
      612
      14.9
      89
      Sedan

Solving using Sklearn



In [4]:

    
from sklearn import linear_model
from sklearn import metrics



In [5]:

    
X = pop['kmpl'].values.reshape(-1,1)
y = pop['price']



In [6]:

    
model = linear_model.LinearRegression(fit_intercept=True)
model.fit(X,y)









    Out[6]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [7]:

    
model.coef_









    Out[7]:





array([-36.12619236])



In [8]:

    
beta0 = round(model.intercept_)
beta1 = round(model.coef_[0])
print("b0 = ", beta0, "b1 =", beta1)









    



b0 =  1158.0 b1 = -36.0

Plotting the Regression Line

$$ price = 1158 - 36 * kmpl ~~~~ \textit{(population = 42)}$$



In [9]:

    
x = np.arange(min(pop.kmpl),max(pop.kmpl),1)

plt.xlabel('kmpl')
plt.ylabel('price')

y_p = beta0 + beta1 * x
plt.plot(x, y_p, '-')
plt.scatter(pop.kmpl, pop.price, s = 150, alpha = 0.5 )









    Out[9]:





<matplotlib.collections.PathCollection at 0x11acb1518>

Calculating the Error term

The error term as we saw is defined as:

$$ E(\beta)= \frac {1}{n} {||y - X\beta||}^2 $$



In [10]:

    
# Solving using Ax = b approach
n = pop.shape[0]
x0 = np.ones(n)
x1 = pop.kmpl
X = np.c_[x0, x1]
X = np.asmatrix(X)
y = np.asmatrix(pop.price.values.reshape(-1,1))
b = np.asmatrix([[beta0],
                 [beta1]])



In [11]:

    
# Error calculation
def error_term(X,y,b,n):
    M = (y - X*b)
    error = M.T*M / n
    return error[0,0]



In [12]:

    
round(error_term(X,y,b,n))









    Out[12]:





13065.0

Shape of the error term

Lets plot the error term and see that the error surface is CONVEX. And there is one absolute minimum point.



In [14]:

    
# Calculate a range of values for b0 and b1
b0_min, b0_max = -500, 2000
b1_min, b1_max = -100, 100
bb0, bb1 = np.meshgrid(np.arange(b0_min, b0_max, (b0_max - b0_min)/100), 
                       np.arange(b1_min, b1_max, (b1_max - b1_min)/100))

# Calculate a mesh of values for b0 and b1 and error term for each
bb = np.c_[bb0.ravel(), bb1.ravel()]
errors = [error_term(X,y,np.asmatrix(i).T,n) for i in bb]

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.gca(projection='3d')
trisurf = ax.plot_trisurf(bb[:,0], bb[:,1], errors, cmap=plt.cm.viridis, linewidth=0)

ax.view_init(azim=20)
fig.colorbar(trisurf, shrink=0.3, aspect=5)

ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')
ax.set_zlabel("Error")









    Out[14]:





<matplotlib.text.Text at 0x11ae80630>

Let's plot the error term as a 2D plot - as a contour plot and make it more clear.



In [15]:

    
# Get the errors
Z = np.asmatrix(errors).reshape(bb0.shape)

# Get the contour
e_min, e_max = np.min(errors), np.max(errors)
levels = np.logspace(0, np.log10(e_max), 40)

# Plot the contour
fig = plt.figure()
cs = plt.contourf(bb0, bb1, Z, cmap=plt.cm.viridis, levels = levels, linewidth=0.3, alpha = 0.5)
fig.colorbar(cs, shrink=0.5, aspect=5)

plt.xlabel('Intercept')
plt.ylabel('Slope')









    Out[15]:





<matplotlib.text.Text at 0x11b708fd0>

Gradient Descent

In our linear regression problem, there is only one minimum. Our error surface is convex. So we can start from one point on the error surface and gradually move down in the error surface in the direction of the minimum.

"At a theoretical level, gradient descent is an algorithm that minimizes the cost functions. Given a cost function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the cost function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient."

So the gradient in linear regression is: $$ \nabla E(\beta)= \frac {2}{n} X^T(X\beta−y) $$

Lets start at an arbitary point for the features (0,0) and calculate the gradient



In [16]:

    
b_initial = np.asmatrix([[500],
                      [-100]])



In [17]:

    
def gradient(X,y,b,n):
    g = (2/n)*X.T*(X*b - y)
    return g



In [18]:

    
error_term(X,y,b_initial,n)









    Out[18]:





3371573.9285714286



In [19]:

    
gradient(X,y,b_initial,n)









    Out[19]:





matrix([[ -3651.28571429],
        [-67557.49047619]])



In [20]:

    
b_next = b_initial - 0.01 * gradient(X,y,b_initial,n)
b_next









    Out[20]:





matrix([[ 536.51285714],
        [ 575.57490476]])

Learning Rate and Epochs

Now we need to update our parameters in the direction of the gradient. And keep repeating the process.

$$ \beta_{i+1} = \beta_{i} - \eta * \nabla E(\beta)$$

Learning rate ($\eta$) - how far we need to move towards the gradient in each step
Epoch - Number of times we want to execute this process



In [29]:

    
def gradient_descent (eta, epochs, X, y, n):
    
    # Set Initial Values 
    b = np.asmatrix([[-250],[-50]])
    e = error_term(X,y,b,n)
    b0s, b1s, errors = [], [], []
    b0s.append(b.item(0)) 
    b1s.append(b.item(1)) 
    errors.append(e)

    # Run the calculation for those many epochs
    for i in range(epochs):
        g = gradient(X,y,b,n)
        b = b - eta * g
        e = error_term(X,y,b,n)
        
        b0s.append(b.item(0)) 
        b1s.append(b.item(1)) 
        errors.append(e)
    
    print('error =', round(errors[-1]), ' b0 =', round(b0s[-1]), 'b1 =', round(b1s[-1]))
          
    return errors, b0s, b1s

Lets see the error rate with different Learning Rate and Epochs



In [35]:

    
errors, b0s, b1s = gradient_descent (0.001, 100, X, y, n)









    



error = 48985.0  b0 = -240 b1 = 39

Exercise

Try with learning rate = 0.001, Epochs = 1000



In [ ]:

Try with learning rate = 0.02, Epochs = 1000



In [ ]:

Try with learning rate = 0.001, Epochs = 50000



In [ ]:



In [ ]:



In [ ]:

Plotting Epochs and Learning Rate



In [38]:

    
def plot_gradient_descent(eta, epoch, gradient_func):
    
    es, b0s, b1s = gradient_func(eta, epoch, X, y, n)
    
    # Plot the intercept and coefficients
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    #plt.tight_layout()
    
    # Plot the contour
    cs = plt.contourf(bb0, bb1, Z, cmap=plt.cm.viridis, levels = levels, 
                      linewidth=0.3, alpha = 0.5)
    
    # Plot the intercept and coefficients
    plt.plot(b0s, b1s, '-b', linewidth = 2)
    plt.xlim([-500,2000])
    plt.ylim([-100,100])
    plt.xlabel('Intercept')
    plt.ylabel('Slope')
    
    # Plot the error rates
    plt.subplot(1, 2, 2)
    plt.plot(np.log10(es))
    plt.ylim([3,10])
    plt.xlabel('Epochs')
    plt.ylabel('log(Error)')



In [39]:

    
plot_gradient_descent(0.0005, 15, gradient_descent)









    



error = 49239.0  b0 = -245 b1 = 39

Exercise

Plot with learning rate = 0.001, Epochs = 1000



In [ ]:

Plot with learning rate = 0.005, Epochs = 10000



In [ ]:

Plot with learning rate = 0.001, Epochs = 50000



In [ ]:



In [ ]:

Challenges with Simple Gradient Descent

Convexity

In our linear regression problem, there was only one minimum. Our error surface was convex. Regardless of where we started, we would eventually arrive at the absolute minimum. In general, this need not be the case. It’s possible to have a problem with local minima that a gradient search can get stuck in. There are several approaches to mitigate this (e.g., stochastic gradient search).

Performance

We used vanilla gradient descent with a learning rate of 0.001 in the above example, and ran it for 50000 iterations. There are approaches such a line search, that can reduce the number of iterations required and also avoid the overshooting problem with large learning rate.

Convergence

We didn’t talk about how to determine when the search finds a solution. This is typically done by looking for small changes in error iteration-to-iteration (e.g., where the gradient is near zero).

Line Search Gradient Descent

To avoid the problem of setting the learning rate and overshooting, we can adaptively choose the step size

$$ \beta_{i+1} = \beta_{i} - \eta * \nabla E(\beta)$$

First, fix a parameter $ 0 < \gamma < 1$, then start with $\eta = 1$, and while

$$ E(\beta_{i+1}) > E(\beta) - \frac{\eta}{2} * {||\nabla E(\beta)||}^2 $$

then

$$ \eta = \gamma * \eta $$

Typically, we take $\gamma = [0.1, 0.8] $



In [64]:

    
def gradient_descent_line (eta, epochs, X, y, n):
    
    gamma = 0.5
        
    # Set Initial Values 
    b = np.asmatrix([[-250],[-50]])
    es = error_term(X,y,b,n)
    b0s, b1s, errors = [], [], []
    b0s.append(b.item(0)) 
    b1s.append(b.item(1)) 
    errors.append(es)

    # Run the calculation for those many epochs
    for i in range(epochs):
        es_old = error_term(X,y,b,n)
        g = gradient(X,y,b,n)
        b = b - eta * g
        es = error_term(X,y,b,n)
        #print(e,e_old)
        if es > es_old - eta/2*(g.T*g):
            eta = eta * gamma
        
        b0s.append(b.item(0)) 
        b1s.append(b.item(1)) 
        errors.append(es)
    
    print('error =', round(errors[-1]), ' b0 =', round(b0s[-1]), 'b1 =', round(b1s[-1]))

    return errors, b0s, b1s

Avoid Overshooting

Let see the performance difference between Simple Gradient Descent and Line Search Gradient Descent



In [65]:

    
errors, b0s, b1s = gradient_descent_line (1, 50000, X, y, n)









    



error = 13928.0  b0 = 940 b1 = -24



In [66]:

    
errors, b0s, b1s = gradient_descent (0.0005, 50000, X, y, n)









    



error = 18845.0  b0 = 597 b1 = -6

Plotting the Line Search Gradient Descent



In [68]:

    
plot_gradient_descent(1, 10000, gradient_descent_line)









    



error = 28314.0  b0 = 247 b1 = 13



In [ ]:

	brand	model	price	kmpl	bhp	type
0	Chevrolet	Beat	421	18.6	79	Hatchback
1	Chevrolet	Sail	551	18.2	82	Sedan
2	Chevrolet	Sail Hatchback	468	18.2	82	Hatchback
3	Chevrolet	Spark	345	16.2	62	Hatchback
4	Fiat	Linea Classic	612	14.9	89	Sedan