Experiment with Gradient Descent for Linear Regression.

Enough irises. Let's use the Boston housing data set and try training some regression models. This notebook features the seaborn package which makes all plots prettier and has some nice built-in features like the correlation plot you'll see below. You'll probably have to install it (ideally with pip).



In [1]:

    
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import time
from numpy.linalg import inv
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

np.set_printoptions(precision=4, suppress=True)

Load the Boston housing data. This data set is pretty out-of-date since it was collected in the 1970s. Each of the 506 entries represents a local district in the Boston area. The standard target variable is the median home value in the district (in 1000s of dollars). Let's print out the description along with a histogram of the target. Notice that the distribution of median value is roughly Gaussian with a significant outlier -- there are around 15 very wealthy districts.



In [10]:

    
boston = load_boston()
X, Y = boston.data, boston.target
plt.hist(Y, 50)
plt.xlabel('Median value (in $1000)')
print boston.DESCR









    



Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

As usual, let's create separate training and test data.



In [11]:

    
# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# Split into train and test.
train_data, train_labels = X[:350], Y[:350]
test_data, test_labels = X[350:], Y[350:]

Before we start making any predictions, let's get some intuition about the data by examining the correlations. Seaborn makes it easy to visualize a correlation matrix. Note, for example, that value and crime rate are negatively correlated: districts with lower crime rates tend to be higher valued.



In [12]:

    
# Combine all the variables (features and target) into a single matrix so we can easily compute all the correlations.
# Is there a better way to do this??
train_labels_as_matrix = np.array([train_labels]).T
all_data = np.hstack((train_data, train_labels_as_matrix))
all_labels = np.append(boston.feature_names, 'VALUE')

# Use seaborn to create a pretty correlation heatmap.
fig, ax = plt.subplots(figsize=(9, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.corrplot(all_data, names=all_labels, annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
fig.tight_layout()

Ok. Let's implement gradient descent. It's more efficient to implement it with vector calculations, though this may be a bit more difficult to understand at first glance. Try to think through each step and make sure you understand how it works.



In [13]:

    
# eta is the learning rate; smaller values will tend to give slower but more precise convergence.
# num_iters is the number of iterations to run.
def gradient_descent(train_data, target_data, eta, num_iters):
    # Add a 1 to each feature vector so we learn an intercept.
    X = np.c_[np.ones(train_data.shape[0]), train_data]
    
    # m = number of samples, k = number of features
    m, k = X.shape
    
    # Initially, set all the parameters to 1.
    theta = np.ones(k)
    
    # Keep track of costs after each step.
    costs = []
    
    for iter in range(0, num_iters):
        # Get the current predictions for the training examples given the current estimate of theta.
        hypothesis = np.dot(X, theta)
        
        # The loss is the difference between the predictions and the actual target values.
        loss = hypothesis - target_data
        
        # In standard linear regression, we want to minimize the sum of squared losses.
        cost = np.sum(loss ** 2) / (2 * m)
        costs.append(cost)
        
        # Compute the gradient.
        gradient = np.dot(X.T, loss) / m

        # Update theta, scaling the gradient by the learning rate.
        theta = theta - eta * gradient
        
    return theta, costs

# Run gradient descent and plot the cost vs iterations.
theta, costs = gradient_descent(train_data[:,0:1], train_labels, .01, 500)
plt.plot(costs)
plt.xlabel('Iteration'), plt.ylabel('Cost')
plt.show()

Let's compare our results to sklearn's regression as well as the algebraic solution to "ordinary least squares". Try increasing the number of iterations above to see whether we get closer.



In [14]:

    
def OLS(X, Y):
    # Add the intercept.
    X = np.c_[np.ones(X.shape[0]), X]
    
    # We use np.linalg.inv() to compute a matrix inverse.
    return np.dot(inv(np.dot(X.T, X)), np.dot(X.T, Y))

ols_solution = OLS(train_data[:,0:1], train_labels)

lr = LinearRegression(fit_intercept=True)
lr.fit(train_data[:,0:1], train_labels)

print 'Our estimated theta:     %.4f + %.4f*CRIM' %(theta[0], theta[1])
print 'OLS estimated theta:     %.4f + %.4f*CRIM' %(ols_solution[0], ols_solution[1])
print 'sklearn estimated theta: %.4f + %.4f*CRIM' %(lr.intercept_, lr.coef_[0])









    



Our estimated theta:     23.3397 + -0.3466*CRIM
OLS estimated theta:     23.6380 + -0.3575*CRIM
sklearn estimated theta: 23.6380 + -0.3575*CRIM

Ok, let's try fitting a model that uses more of the variables. Let's run just a few iterations and check the cost function.



In [15]:

    
num_feats = 5
theta, costs = gradient_descent(train_data[:,0:num_feats], train_labels, .01, 10)
plt.plot(map(np.log, costs))
plt.xlabel('Iteration'), plt.ylabel('Log Cost')
plt.show()

The cost is increasing and fast! This can happen when the learning rate is too large. The updated parameters skip over the optimum and the cost ends up larger than it was before. Let's reduce the learning rate and try again.



In [16]:

    
start_time = time.time()
theta, costs = gradient_descent(train_data[:,0:num_feats], train_labels, .001, 100000)
train_time = time.time() - start_time
plt.plot(map(np.log, costs))
plt.xlabel('Iteration'), plt.ylabel('Log Cost')
plt.show()

print 'Training time: %.2f secs' %train_time
print 'Our estimated theta:', theta
print 'OLS estimated theta:', OLS(train_data[:,0:num_feats], train_labels)









    












    



Training time: 2.22 secs
Our estimated theta: [ 26.0983  -0.2143   0.0345  -0.4745   4.9007   2.8408]
OLS estimated theta: [ 32.5831  -0.1926   0.0218  -0.3213   5.4266 -11.8824]

This is getting pretty slow, and it looks like it hasn't yet converged (see the last value of theta, especially). The scale of the parameters can also make convergence difficult. Let's examine the distributions of the features we're using.



In [17]:

    
plt.figure(figsize=(15, 3))
for feature in range(num_feats):
    plt.subplot(1, num_feats, feature+1)
    plt.hist(train_data[:,feature])
    plt.title(boston.feature_names[feature])

Clearly, the distribution of the feature values varies a great deal. Let's apply the standard scaler -- subtract the mean, divide by the standard deviation -- for each feature. This is built in as a preprocessor in sklearn. We run the fit() function on the training data and then apply the transformation to both train and test data. We don't fit on the test data because this would be cheating -- we shouldn't know in advance the mean and variance of the feature values in the test data, so we assume they are the same as the training data.



In [18]:

    
scaler = preprocessing.StandardScaler()
scaler.fit(train_data)
scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)

plt.figure(figsize=(15, 3))

for feature in range(5):
    plt.subplot(1, 5, feature+1)
    plt.hist(scaled_train_data[:,feature])
    plt.title(boston.feature_names[feature])

Ok, let's try gradient descent again. We can increase the learning rate and decrease the number of iterations.



In [19]:

    
start_time = time.time()
theta, costs = gradient_descent(scaled_train_data[:,0:5], train_labels, .01, 5000)
train_time = time.time() - start_time
plt.plot(map(np.log, costs))
plt.xlabel('Iteration'), plt.ylabel('Log Cost')
plt.show()

print 'Training time: %.2f secs' %train_time
print 'Our estimated theta:', theta
print 'OLS estimated theta:', OLS(scaled_train_data[:,0:5], train_labels)









    












    



Training time: 0.13 secs
Our estimated theta: [ 22.2951  -1.8194   0.5028  -2.1295   1.3976  -1.3729]
OLS estimated theta: [ 22.2951  -1.8194   0.5028  -2.1295   1.3976  -1.3729]

Why do we even bother with gradient descent when the closed form solution works just fine?

There are many answers. First, gradient descent is a general-purpose tool; we are applying it to the least squares problem which happens to have a closed form solution, but most other machine learning objectives do not have this luxury.

Also, if we want to add regularization, we can no longer use the algebraic formula.

Here's one more reason: You can't take the inverse of a singular matrix. That is, if two of our features are co-linear, the inverse function will fail. Gradient descent doesn't have this problem and should learn instead to share the weight appropriately between the two co-linear features. Let's test this by simply adding a copy of the crime feature.



In [27]:

    
# Create an augmented training set that has 2 copies of the crime variable.
augmented_train_data = np.c_[scaled_train_data[:,0], scaled_train_data]

# Run gradient descent and OLS and compare the results.
theta, costs = gradient_descent(augmented_train_data[:,0:6], train_labels, .01, 5000)
print 'Our estimated theta:', theta

print 'OLS estimated theta:',
try: print OLS(augmented_train_data[:,0:6], train_labels)
except: print 'ERROR, singular matrix not invertible'









    



Our estimated theta: [ 22.2951  -0.9097  -0.9097   0.5028  -2.1295   1.3976  -1.3729]
OLS estimated theta: ERROR, singular matrix not invertible

One final note about regression models. As you may know, they have become extremely common in all sorts of analysis. When all the various assumptions of the model are assumed to be true, it is tempting to interpret the estimated coefficients as causal. For example, in our 5-variable model above, we might conclude that a reduction in (standardized) per-capita crime rate of 1 would result in an increase in median value of about $18,194.

Statistically, though, this conclusion is almost impossible to justify. You can see just how brittle the coefficients are in these experiments with gradient descent. Try adding a variable, for example. Or if we haven't quite found the global minimum, one coefficient can still change quite dramatically.

This is why regression is much more useful for prediction than it is for understanding causal relationships. If we only care about prediction, we don't need to worry about whether the many model assumptions (like whether any meaningful variables are missing) are true. It is possible to study causation with statistical analysis, but simply running regression is not sufficient.