Linear Regression with One Variable

In this part of the exercise, you will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities, and you have data for profits and populations from those cities.

You would like to use this data to help you select which city to expand to next.


In [ ]:
# Import numpy for linear algebra and numerical computing functions, and matplotlib for plotting graphs
import numpy as np
from numpy import ones, zeros, newaxis, r_, c_, mat, dot
import matplotlib.pyplot as plt

# Enable matplotlib inline plotting for this notebook
%matplotlib inline

The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.


In [ ]:
data = np.loadtxt('../data/ex1data1.txt', delimiter=',')
X = mat(data[:, :1]) # training example inputs
y = c_[data[:, 1]]   # training example outputs
m = X.shape[0]

data[:5] # First 5 rows of training examples (just for viewing)

Plotting the Data

Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, you can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). (Many other problems that you will encounter in real life are multi-dimensional and can’t be plotted on a 2-d plot.)

Your job is to complete plot_data(X,y) to draw the plot.


In [ ]:
def plot_data(X, y):
    """"Plots the data points x and y into a new figure
    plots the data points and gives the figure axes labels of
    population and profit.
    """

    ### YOUR CODE HERE ###
    # Instructions: Plot the training data into a figure using the
    #               "plot" command. Set the axes labels using
    #               the "xlabel" and "ylabel" commands. Assume the
    #               population and revenue data have been passed in
    #               as the X and y arguments of this function.
    #
    # Hint: You can use the 'rx' option with plot to have the markers
    #       appear as red crosses. Furthermore, you can make the
    #       markers larger by using plt.plot(..., 'rx', markersize=7)
    #
    #       To get help with plotting in IPython notebooks, start typing
    #       plt.plot( and then press Shift-tab
    #       Pressing Shift-tab multiple times will give more verbose help options
    pass
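
If you get stuck, the cell below is a minimal sketch of one way to implement the plot (under the separate name plot_data_sketch so it does not overwrite your own function); treat it as a reference rather than the required solution.


In [ ]:
def plot_data_sketch(X, y):
    """Example sketch: scatter plot of population vs. profit (not the graded solution)."""
    plt.figure()
    plt.plot(X, y, 'rx', markersize=7)  # red crosses for each training example
    plt.xlabel('Population of city')
    plt.ylabel('Profit')

# Example usage (after the data-loading cell above has been run):
# plot_data_sketch(X.A, y)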

In [ ]:
# When you are done implementing plot_data(X, y) run this cell
# This will take the X array (Population of the city) and plot it against
# the output array y (Profit of a food truck in that city)
plot_data(X.A, y)

Gradient Descent

In this part, you will fit the linear regression parameters θ to our dataset using gradient descent.

Update Equations

The objective of linear regression is to minimize the cost function

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where the hypothesis $h_\theta(x)$ is given by the linear model

$$h_\theta(x) = \theta^Tx = \theta_0 + \theta_1x_1$$

Recall that the parameters of your model are the $\theta_j$ values. These are the values you will adjust to minimize cost $J(\theta)$. One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

With each step of gradient descent, your parameters $\theta_j$ come closer to the optimal values that achieve the lowest cost $J(\theta)$. (Remember that all of the $\theta_j$ values are updated simultaneously within a single iteration.)
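
For reference, when the training examples are stacked as the rows of a design matrix $X$ (with a leading column of ones for the intercept) and the targets form a column vector $y$, the same batch update can be written in vectorized form as

$$\theta := \theta - \frac{\alpha}{m} X^T\left(X\theta - y\right)$$

which performs the simultaneous update in a single matrix expression.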

Implementation

We have already set up the data for linear regression. In the following lines, we add another column to our data to accommodate the $\theta_0$ intercept term. We also initialize the parameters $\theta$ to 0 and the learning rate alpha to 0.01.


In [ ]:
X = mat(data[:, :1]) # Original input data
X = c_[np.ones(m), X] # Add a column of ones to X for the theta_0 intercept term

theta = c_[np.zeros(2)] # initialize fitting parameters to 0

iterations = 1500
alpha = 0.01

Computing the Cost of $J(\theta)$

As you perform gradient descent to minimize the cost function $J(\theta)$, it is helpful to monitor the convergence by computing the cost. In this section, you will implement a function to calculate $J(\theta)$ so you can check the convergence of your gradient descent implementation.

Your next task is to complete the function compute_cost(X, y, theta), which computes $J(\theta)$. As you are doing this, remember that the variables X and y are not scalar values, but matrices whose rows represent the examples from the training set.
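
For reference, with the training examples stacked as the rows of $X$ (including the column of ones added earlier) and the targets as the column vector $y$, the cost can also be written in vectorized form as

$$J(\theta) = \frac{1}{2m}\left(X\theta - y\right)^T\left(X\theta - y\right)$$

which is often the easiest version to translate into numpy.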


In [ ]:
def compute_cost(X, y, theta):
    """Compute cost for linear regression
    computes the cost of using theta as the parameter
    for linear regression to fit the data points X and y
    """
    # Initialize some useful variables
    m = X.shape[0] # number of training examples

    # You need to return the following variables correctly
    J = 0

    ### YOUR CODE HERE ###
    # Instructions: Compute the cost of a particular choice of theta
    # You should set J to the cost.
    
    
    return J
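
If you want a reference point, the cell below is a minimal vectorized sketch of the cost computation (under the separate name compute_cost_sketch so it does not overwrite your own function); it assumes X already contains the column of ones.


In [ ]:
def compute_cost_sketch(X, y, theta):
    """Example sketch: J(theta) = 1/(2m) * sum((X.dot(theta) - y)^2), vectorized."""
    m = X.shape[0]
    errors = np.asarray(X.dot(theta) - y)  # column vector of prediction errors
    return float((errors ** 2).sum() / (2 * m))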

Once you have completed the function, the next cell will run compute_cost once using $\theta$ initialized to zeros, and you will see the cost printed to the screen.

You should expect to see a cost of 32.07


In [ ]:
# Initial cost using theta = [0,0]
compute_cost(mat(X), y, theta)

Gradient descent

Next, you will implement gradient descent in the function gradient_descent(X, y, theta, alpha, num_iters). The loop structure has been written for you, and you only need to supply the updates to $\theta$ within each iteration.

As you program, make sure you understand what you are trying to optimize and what is being updated. Keep in mind that the cost $J(\theta)$ is parameterized by the vector $\theta$, not X and y. That is, we minimize the value of $J(\theta)$ by changing the values of the vector $\theta$, not by changing X or y. Refer to the equations in this handout and to the video lectures if you are uncertain.

A good way to verify that gradient descent is working correctly is to look at the value of $J(\theta)$ and check that it is decreasing with each step. The starter code for gradient_descent calls compute_cost on every iteration and stores the cost in J_history. Assuming you have implemented gradient descent and compute_cost correctly, your value of $J(\theta)$ should never increase, and it should converge to a steady value by the end of the algorithm.


In [ ]:
def gradient_descent(X, y, theta, alpha, num_iters):
    """Performs gradient descent to learn theta
    updates theta by taking num_iters gradient steps
    with learning rate alpha
    """
    # Initialize some useful values
    m = X.shape[0] # number of training examples
    J_history = zeros(num_iters)
    
    for i in range(num_iters):
        ### YOUR CODE HERE ###
        # Instructions: Perform a single gradient step on the parameter vector
        #               theta. 
        #
        # Hint: While debugging, it can be useful to print out the values
        #       of the cost function (compute_cost) and gradient here.
        #
        
        
        # Save the cost J in every iteration
        J_history[i] = compute_cost(X, y, theta)
        
    return theta, J_history
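
For comparison, the cell below sketches one vectorized way to write the update step (again under a separate name, gradient_descent_sketch); it assumes your compute_cost is implemented and accepts plain numpy arrays.


In [ ]:
def gradient_descent_sketch(X, y, theta, alpha, num_iters):
    """Example sketch: batch gradient descent using the vectorized update rule."""
    m = X.shape[0]
    X = np.asarray(X)                        # work with plain arrays to avoid matrix/array surprises
    y = np.asarray(y)
    theta = np.asarray(theta, dtype=float).copy()
    J_history = np.zeros(num_iters)

    for i in range(num_iters):
        errors = X.dot(theta) - y                       # (m, 1) prediction errors
        theta = theta - (alpha / m) * X.T.dot(errors)   # simultaneous update of all theta_j
        J_history[i] = compute_cost(X, y, theta)

    return theta, J_history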

After you are finished, the next cell will use your final parameters to plot the linear fit.


In [ ]:
theta, J_history = gradient_descent(mat(X), y, theta, alpha, iterations)

print('Theta found by gradient descent:\n', theta)

x_feature = X.A[:, 1]
plot_data(x_feature, y)
plt.plot(x_feature, (X*theta).A, '-')
plt.legend(['Training data', 'Linear regression'], loc='lower right')
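
As a quick convergence check (not part of the original exercise), you can also plot the cost recorded in J_history; it should decrease on every iteration and flatten out well before the final step.


In [ ]:
plt.figure()
plt.plot(np.arange(1, iterations + 1), J_history, '-')
plt.xlabel('Iteration')
plt.ylabel(r'Cost $J(\theta)$')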

Your final values for $\theta$ will also be used to make predictions on profits in areas of 35,000 and 70,000 people. Note how the lines in the following cell use matrix multiplication, rather than explicit summation or looping, to calculate the predictions. This is an example of vectorized code in NumPy.


In [ ]:
predict1 = dot([1, 3.5], theta)
# predict1 = predict1.A[0][0] # If predict1 is a matrix, pull out the first value
print('For population = 35,000, we predict a profit of ${:.2f}'.format(predict1 * 10000))

predict2 = dot([1, 7.0], theta)
# predict2 = predict2.A[0][0] # If predict2 is a matrix, pull out the first value
print('For population = 70,000, we predict a profit of ${:.2f}'.format(predict2 * 10000))

Visualizing $J(\theta)$

To understand the cost function $J(\theta)$ better, you will now plot the cost over a 2-dimensional grid of $\theta_0$ and $\theta_1$ values. You will not need to code anything new for this part, but you should understand how the code you have written already is creating these images.

In the next cell, there is code set up to calculate $J(\theta)$ over a grid of values using the compute_cost function that you wrote.


In [ ]:
theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-1, 4, 100)

J_vals = np.zeros((len(theta0_vals), len(theta1_vals)))
 
for i in range(len(theta0_vals)):
    for j in range(len(theta1_vals)):
        t = r_[theta0_vals[i], theta1_vals[j]][:, newaxis]
        J_vals[i, j] = compute_cost(X, y, t)

J_vals = J_vals.T
plt.contour(theta0_vals, theta1_vals, J_vals, np.logspace(-2, 3, 20))
plt.plot(theta.A[0][0], theta.A[1][0], 'rx', markersize=10, linewidth=5)
plt.xlabel(r'$\Theta_0$'); plt.ylabel(r'$\Theta_1$')

The purpose of this graph is to show you how $J(\theta)$ varies with changes in $\theta_0$ and $\theta_1$. The cost function $J(\theta)$ is bowl-shaped and has a global minimum, which is easier to see in a contour plot than in a 3D surface plot. This minimum is the optimal point for $\theta_0$ and $\theta_1$, and each step of gradient descent moves closer to this point.
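
The contour plot flattens the bowl into level curves. If you would like to see the surface itself, the optional cell below sketches a 3D surface plot of the same J_vals grid; it assumes theta0_vals, theta1_vals, and the (transposed) J_vals from the previous cell are still in scope.


In [ ]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the '3d' projection on older matplotlib

T0, T1 = np.meshgrid(theta0_vals, theta1_vals)  # grids matching the transposed J_vals
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(T0, T1, J_vals, cmap='viridis')
ax.set_xlabel(r'$\Theta_0$'); ax.set_ylabel(r'$\Theta_1$')
ax.set_zlabel(r'$J(\Theta)$')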