Linear regression

Build a linear model that predicts the price of a house from its size (in square feet) and its number of bedrooms. In this notebook we will:

  • implement a fully-vectorized loss function for linear regression
  • implement the fully-vectorized expression for its analytic gradient
  • check the implementation against a numerically computed gradient
  • use a validation set to tune the learning rate
  • optimize the loss function with batch gradient descent (BGD) and stochastic gradient descent (SGD)
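
For reference, the formulation assumed throughout the notebook (and consistent with the loss values printed below) is the usual mean-squared-error loss. With W of shape [1, D], X of shape [D, m] after the transpose performed later, and m training examples, a sketch of the loss and its analytic gradient is:

$$ J(W) = \frac{1}{2m} \sum_{i=1}^{m} \left( W x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \lVert W X - y \rVert^2, \qquad \nabla_W J(W) = \frac{1}{m} \left( W X - y \right) X^T $$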

In [1]:
# Setup code for this notebook
import random 
import numpy as np
import matplotlib.pyplot as plt

# make matplotlib figures appear inline
%matplotlib inline 

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
# Load the data 
# The data comes from the linear regression exercise of the Machine Learning course at https://www.coursera.org/course/ml
# The file (ex1data2.txt in the original exercise, saved here as a CSV) contains a training set of housing prices
# in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the
# number of bedrooms, and the third column is the price of the house, which we want to predict.
file_name = 'datasets/ex1data2.csv'
house_data = np.loadtxt(file_name, delimiter=',')

num_sample = house_data.shape[0] # total number of training examples
X = house_data[:, :2]
y = house_data[:, 2]
print 'X shape: ', X.shape
print 'y shape: ', y.shape
print 'First 10 examples from the dataset'
print house_data[0:10, :]


X shape:  (47, 2)
y shape:  (47,)
First 10 examples from the dataset
[[  2.10400000e+03   3.00000000e+00   3.99900000e+05]
 [  1.60000000e+03   3.00000000e+00   3.29900000e+05]
 [  2.40000000e+03   3.00000000e+00   3.69000000e+05]
 [  1.41600000e+03   2.00000000e+00   2.32000000e+05]
 [  3.00000000e+03   4.00000000e+00   5.39900000e+05]
 [  1.98500000e+03   4.00000000e+00   2.99900000e+05]
 [  1.53400000e+03   3.00000000e+00   3.14900000e+05]
 [  1.42700000e+03   3.00000000e+00   1.98999000e+05]
 [  1.38000000e+03   3.00000000e+00   2.12000000e+05]
 [  1.49400000e+03   3.00000000e+00   2.42500000e+05]]

In [3]:
# Feature normalization
# The features differ by orders of magnitude (square feet vs. number of bedrooms),
# so we scale them so that gradient descent converges much more quickly.
# Normalization: (x - mean) / std
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
# Add the bias dimension and transpose so that each column is one example: X is [D, m]
X = np.hstack((X, np.ones((num_sample, 1))))
X = X.T
print 'First 10 normalized examples (with bias term)'
print X[:, :10].T


First 10 normalized examples (with bias term)
[[ 0.13141542 -0.22609337  1.        ]
 [-0.5096407  -0.22609337  1.        ]
 [ 0.5079087  -0.22609337  1.        ]
 [-0.74367706 -1.5543919   1.        ]
 [ 1.27107075  1.10220517  1.        ]
 [-0.01994505  1.10220517  1.        ]
 [-0.59358852 -0.22609337  1.        ]
 [-0.72968575 -0.22609337  1.        ]
 [-0.78946678 -0.22609337  1.        ]
 [-0.64446599 -0.22609337  1.        ]]
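
One practical point: the training-set mean and standard deviation must be kept and reused to scale any new house the same way before prediction. A minimal sketch, assuming a trained weight vector W of shape [1, 3]; mu, sigma and predict_price are illustrative names, not part of the exercise code:

    # Keep the training-set statistics so new inputs are scaled exactly like the training data.
    mu = np.mean(house_data[:, :2], axis=0)
    sigma = np.std(house_data[:, :2], axis=0)

    def predict_price(W, size_sqft, num_bedrooms):
        x = (np.array([size_sqft, num_bedrooms]) - mu) / sigma  # same scaling as training
        x = np.append(x, 1.0)                                   # append the bias term
        return W.dot(x)[0]                                      # predicted price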

The LinearRegression class and the loss/gradient helpers used below are implemented in the backend module algorithms.regression (with the gradient checker in algorithms.gradient_check); see those files for the full source.
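
As a reading aid, here is a minimal sketch of what a fully-vectorized loss/gradient function of this kind could look like, assuming the mean-squared-error formulation sketched above; it is illustrative only, not the actual code in algorithms.regression:

    import numpy as np

    def linear_loss_grad_sketch(W, X, y):
        """Illustrative sketch: MSE loss and gradient for W [1, D], X [D, m], y [m,]."""
        m = X.shape[1]                   # number of training examples
        residual = W.dot(X) - y          # [1, m] prediction errors
        loss = np.sum(residual ** 2) / (2.0 * m)
        grad = residual.dot(X.T) / m     # [1, D] analytic gradient
        return loss, grad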


In [4]:
# Implement linear_loss_grad(W, X, y) to compute the loss and its gradient,
# and use numerical gradient checking as a debugging tool to verify the analytic gradient
from algorithms.regression import linear_loss_grad
from algorithms.gradient_check import grad_check_sparse

test_W = np.random.randn(1, X.shape[0]) * 0.001 # [1, D]
grad = linear_loss_grad(test_W, X, y)[1]
f = lambda W: linear_loss_grad(W, X, y)[0]
grad_numerical = grad_check_sparse(f, test_W, grad, 10)


(1, 3)
(0, 0)
numerical: -105765.533447 analytic: -105764.132202, relative error: 6.624344e-06
(0, 2)
numerical: -340412.902832 analytic: -340412.659393, relative error: 3.575640e-07
(0, 2)
numerical: -340412.902832 analytic: -340412.659393, relative error: 3.575640e-07
(0, 2)
numerical: -340412.902832 analytic: -340412.659393, relative error: 3.575640e-07
(0, 0)
numerical: -105765.533447 analytic: -105764.132202, relative error: 6.624344e-06
(0, 2)
numerical: -340412.902832 analytic: -340412.659393, relative error: 3.575640e-07
(0, 2)
numerical: -340412.902832 analytic: -340412.659393, relative error: 3.575640e-07
(0, 0)
numerical: -105765.533447 analytic: -105764.132202, relative error: 6.624344e-06
(0, 1)
numerical: -54709.625244 analytic: -54708.820509, relative error: 7.354656e-06
(0, 2)
numerical: -340412.902832 analytic: -340412.659393, relative error: 3.575640e-07
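
The relative errors above are all well below 1e-5, which indicates that the analytic gradient matches the numerical one. For reference, numerical gradient checking approximates each partial derivative by a centred difference at a few randomly chosen coordinates of W and compares it with the analytic value. A minimal sketch in the spirit of grad_check_sparse (the real helper lives in algorithms.gradient_check; the names and defaults here are assumptions):

    import numpy as np

    def grad_check_sparse_sketch(f, W, analytic_grad, num_checks=10, h=1e-5):
        """Illustrative sketch of a sparse centred-difference gradient check."""
        for _ in range(num_checks):
            ix = tuple(np.random.randint(d) for d in W.shape)  # random coordinate of W
            old = W[ix]
            W[ix] = old + h
            fxph = f(W)                                        # f(W + h) at this coordinate
            W[ix] = old - h
            fxmh = f(W)                                        # f(W - h)
            W[ix] = old                                        # restore W
            grad_numerical = (fxph - fxmh) / (2.0 * h)
            grad_analytic = analytic_grad[ix]
            rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
            print('numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error))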

In [5]:
# Compare the naive (explicit for-loop) and vectorized versions of the loss and gradient computation
import time
from algorithms.regression import linear_loss_grad, linear_loss_grad_naive

tic = time.time()
loss_naive, grad_naive = linear_loss_grad_naive(test_W, X, y)
toc = time.time()
print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)

tic = time.time()
loss_vectorized, grad_vectorized = linear_loss_grad(test_W, X, y)
toc = time.time()
print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

# Use Frobenius norm to compare the two versions of the gradients
grad_diff = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)
print 'Gradient difference: %f' % grad_diff


naive loss: 6.559155e+10 computed in 0.000647s
vectorized loss: 6.559155e+10 computed in 0.000313s
Loss difference: 0.000023
Gradient difference: 0.000000
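
For contrast with the vectorized sketch above, the naive version computes the same quantities with an explicit loop over the training examples. A hedged sketch of that style (not the actual linear_loss_grad_naive):

    def linear_loss_grad_naive_sketch(W, X, y):
        """Illustrative sketch: same loss/gradient, computed one example at a time."""
        D, m = X.shape
        loss = 0.0
        grad = np.zeros_like(W)                       # [1, D]
        for i in range(m):
            residual = W.dot(X[:, i])[0] - y[i]       # scalar error for example i
            loss += residual ** 2
            grad += residual * X[:, i]                # accumulate the per-example gradient
        return loss / (2.0 * m), grad / m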

In [7]:
# Now train the linear regression model with batch gradient descent (BGD) and stochastic gradient descent (SGD)
from algorithms.regression import LinearRegression
lr_bgd = LinearRegression()
tic = time.time()
losses_bgd = lr_bgd.train(X, y, method='bgd', learning_rate=1e-2, num_iters=1000, verbose=True)
toc = time.time()
print 'Training time for BGD with vectorized version is %f \n' % (toc - tic)

lr_sgd = LinearRegression()
tic = time.time()
losses_sgd = lr_sgd.train(X, y, method='sgd', learning_rate=1e-2, num_iters=1000, verbose=True)
toc = time.time()
print 'Training time for SGD with vectorized version is %f' % (toc - tic)


iteration 0 / 1000 : loss 65591548180.890114
iteration 100 / 1000 : loss 10596969360.070049
iteration 200 / 1000 : loss 3344770640.251828
iteration 300 / 1000 : loss 2288004376.085593
iteration 400 / 1000 : loss 2105448289.315550
iteration 500 / 1000 : loss 2063782403.954787
iteration 600 / 1000 : loss 2051066484.892333
iteration 700 / 1000 : loss 2046409416.209262
iteration 800 / 1000 : loss 2044562883.389050
iteration 900 / 1000 : loss 2043809396.298192
Training time for BGD with vectorized version is 0.030501 

iteration 0 / 1000 : loss 101205005517.462662
iteration 100 / 1000 : loss 12255168635.716475
iteration 200 / 1000 : loss 4565851031.978956
iteration 300 / 1000 : loss 250209621.960446
iteration 400 / 1000 : loss 2306220008.436017
iteration 500 / 1000 : loss 39285209.429809
iteration 600 / 1000 : loss 2889925121.826027
iteration 700 / 1000 : loss 1486683144.941179
iteration 800 / 1000 : loss 1273608695.117948
iteration 900 / 1000 : loss 25680393.305553
Training time for SGD with vectorized version is 0.017963
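
The qualitative difference between the two loss traces comes from the update rule: BGD computes the gradient over all m examples at every iteration, so its loss decreases smoothly, while SGD uses a single randomly chosen example per step, so the per-example loss it prints bounces around. A minimal sketch of the two update steps, reusing the illustrative linear_loss_grad_sketch above (this is an assumption about the structure of LinearRegression.train, not its actual code):

    def bgd_step(W, X, y, learning_rate):
        # One batch gradient descent step: gradient over all m examples.
        loss, grad = linear_loss_grad_sketch(W, X, y)
        return W - learning_rate * grad, loss

    def sgd_step(W, X, y, learning_rate):
        # One stochastic gradient descent step: gradient of a single random example.
        i = np.random.randint(X.shape[1])
        Xi, yi = X[:, i:i + 1], y[i:i + 1]            # keep the [D, 1] and [1,] shapes
        loss, grad = linear_loss_grad_sketch(W, Xi, yi)
        return W - learning_rate * grad, loss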

In [8]:
# A useful debugging strategy is to plot the loss
# as a function of iteration number:
from ggplot import *
# plot losses from BGD
qplot(xrange(len(losses_bgd)), losses_bgd) + labs(x='Iteration number', y='BGD loss value')


Out[8]:
[Figure: BGD loss value vs. iteration number]

In [9]:
# plot losses from SGD
qplot(xrange(len(losses_sgd)), losses_sgd) + labs(x='Iteration number', y='SGD loss value')


Out[9]:
[Figure: SGD loss value vs. iteration number]