AM 207: Homework 7


Pavlos Protopapas
Handed out: Thursday, March 27th, 2014
Due: 11.59 P.M. Wednesday April 3rd, 2014

Instructions:

  • Upload your answers in an ipython notebook to the dropbox.

  • Your individual submission should use the following filename: AM207_YOURNAME_HM7.ipynb

  • Your code should be in code cells as part of your ipython notebook. Do not use a different language (or format) unless you get permission from the TFs. If you use any special libraries you must include them with your code (program should run as is).

  • If you have multiple files (e.g. you've added code files and images), bundle all of them into a single archive and name it: AM207_YOURNAME_HM7.tar or AM207_YOURNAME_HM7.zip



In [1]:
%matplotlib inline
import seaborn as sns
import IPython.display as display
from IPython.display import Image
!pip install prettyplotlib #JIC you need the reqs :)

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors
import prettyplotlib as pplt
from copy import copy

from matplotlib import rcParams
import scipy.stats as stats
import scipy.misc as misc
from sklearn.preprocessing import normalize

from itertools import combinations

from collections import OrderedDict
from copy import deepcopy

import pymc

from mpl_toolkits.mplot3d import Axes3D
c1, c2 = pplt.brewer2mpl.get_map('Dark2', 'Qualitative', 8).mpl_colors[3:5]

blue_red = pplt.brewer2mpl.get_map('RdBu', 'Diverging', 9, reverse=True).mpl_colormap

rcParams["figure.figsize"] = (12,8)


Requirement already satisfied (use --upgrade to upgrade): prettyplotlib in /Users/bull/anaconda/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): matplotlib>=1.2.1 in /Users/bull/anaconda/lib/python2.7/site-packages (from prettyplotlib)
Requirement already satisfied (use --upgrade to upgrade): brewer2mpl>=1.3.1 in /Users/bull/anaconda/lib/python2.7/site-packages (from prettyplotlib)
Cleaning up...

Question 1: The perfect configuration

Consider a system consisting of 7 particles of type A and 12 particles of type B, which exist in a 2-dimensional (x, y) space. Two particles of type A interact among themselves with the Lennard-Jones (LJ) potential

$$ V_{AA}(r_{12} )= \epsilon_{A} \left[ \left( \frac{\sigma_{A}}{r_{12}} \right) ^{12} - \left( \frac{\sigma_{A}}{r_{12}} \right) ^{6} \right] $$

where $r_{12} = \sqrt{(x_1 - x_2)^2 + (y_1-y_2)^2 }$ is the distance between the particles at positions ($x_1$, $y_1$) and ($x_2$, $y_2$), and $\sigma_A$, $\epsilon_A$ are constants. Similarly particles of type B interact among themselves with a LJ potential $V_{BB}(r_{12})$ with different constants $\sigma_B$, $\epsilon_B$ and particles of type A and type B interact with a LJ potential $V_{AB}(r_{12})$ with constants which are the geometric and arithmetic averages of the constants corresponding to $A - A$ and $B - B$ interactions, that is:

$$ \sigma_{AB} = \sqrt{ \sigma_A \, \sigma_B}, \,\,\, \epsilon_{AB} = \frac{1}{2} (\epsilon_A + \epsilon_B). $$

Using the GA method, find the optimal configuration of the cluster of 19 $A$ and $B$ particles. Use the following values for the constants:

$$ \sigma_A=1.0, \,\, \sigma_B=1.125, \,\, \epsilon_A=2.0, \,\, \epsilon_B = 1.5 $$

Answer 1

We can start by graphing the Lennard-Jones potential for different radii, $r$, representing the distance between two particles. We will examine the function for A-A interactions, B-B interactions, and A-B interactions.
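
Plugging the given constants into the mixing rules, the A-B interaction uses $\sigma_{AB} = \sqrt{1.0 \times 1.125} \approx 1.0607$ and $\epsilon_{AB} = (2.0 + 1.5)/2 = 1.75$; these are the values used in the cell below.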

First we define a function calc_energy that takes a distance between two particles, $r$, and values for $\sigma$ and $\epsilon$. It returns the Lennard-Jones potential between the two particles.
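
Note that in this form of the potential (written without the conventional factor of 4), the minimum sits at $r_{\min} = 2^{1/6}\sigma$ with depth $V(r_{\min}) = -\epsilon/4$; for the A-A pair that is $r_{\min} \approx 1.12$ with a depth of $-0.5$, which matches the plots below.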


In [2]:
def calc_energy(r, sig, eps):
    term1 = (sig/r)**12
    term2 = (sig/r)**6
    
    return eps*(term1 - term2)

rs = np.linspace(0.75, 2, 1000)

sigmas=[1.0, 1.125, np.sqrt(1.125)]
epses=[2.0, 1.5, (3.5)/2]
titles = ["L-J for A-A interactions",
         "L-J for B-B interactions",
         "L-J for A-B interactions"]

fig, axes = plt.subplots(3, 1)

for i, vals in enumerate(zip(sigmas, epses)):
    s, e = vals
    energies = (calc_energy(rs, s, e))
    
    pplt.plot(axes[i], rs, energies)
    
    axes[i].set_ylim(-1, 1)
    axes[i].set_xlim(0.75, 2)
    
    axes[i].set_ylabel("Leonard-Jones Potential")
    axes[2].set_xlabel("Distance bettween two particles, $r$")
    
    axes[i].set_title(titles[i])


Answer 1 - continued

Now we'll make some utility functions to handle parts of this problem. A Particle object will hold the $x$ position, $y$ position, $\sigma$ and $\epsilon$ for a particle. The particle_factory will take a set of parameters about the number of particles of each type we want, the values of $\sigma$ and $\epsilon$, and the starting positions of the particles. Given this set of positions and particle values it creates a list of all the Particle objects in the system.

The distance function takes two particles and calculates the Euclidean distance between them.


In [3]:
class Particle(object):
    def __init__(self, x, y, sigma, eps, fixed=False):
        self.x = x
        self.y = y
        self.sigma = sigma
        self.eps = eps
        self.fixed = fixed
        
    def __repr__(self):
        type_str = "A" if self.sigma == 1.0 else "B"
        return "[x:{}, y:{}, type:{}, fixed:{}]".format(self.x, self.y, type_str, self.fixed)
    
    def __eq__(self, other):
        return (isinstance(other, self.__class__)
            and self.__dict__ == other.__dict__)
    
def particle_factory(num_a=7,
                     sigma_a=1.0,
                     eps_a=2.0,
                     num_b=12,
                     sigma_b=1.125,
                     eps_b=1.5, 
                     start_x=None, 
                     start_y=None):
    
    particle_list = []
    
    for i in range(num_a + num_b):

        x = start_x[i] if start_x else 0
        y = start_y[i] if start_y else 0
        
        if i < num_a:
            sig = sigma_a
            eps = eps_a
        else:
            sig = sigma_b
            eps = eps_b
            
        particle_list.append(Particle(x, y, sig, eps))
            
    return particle_list

def distance(part1, part2):
    xsq = (part1.x - part2.x)**2
    ysq = (part1.y - part2.y)**2
    return np.sqrt(xsq + ysq)

test_parts = particle_factory(num_a=1, 
                              num_b=1, 
                              start_x=[0., 1], 
                              start_y=[0., 1])

# verify distance is working
assert distance(test_parts[0], test_parts[1]) == np.sqrt(2)

Answer 1 - continued

Now we can define a function energy that calculates the energy between two particles. If the two particles are in exactly the same position, it returns numpy.inf. Otherwise it computes the mixed $\sigma$ and $\epsilon$ values for the pair and calculates the energy between the Particle objects according to the Lennard-Jones potential.


In [4]:
def energy(part1, part2):
    # get sigma and epsilon between two particles
    sig = np.sqrt(part1.sigma * part2.sigma)
    eps = (part1.eps + part2.eps) / 2
        
    # get distance
    r = distance(part1, part2)
    if r == 0:
        return np.inf
    
    # return energy
    return calc_energy(r, sig, eps)

energy(test_parts[0], test_parts[1])


Out[4]:
-0.25602878630161263

Answer 1 - continued

Next we can define a system_energy function. This function takes a list of all of the particles in the system, generates all $\binom{N}{2}$ unique pairs of particles, calculates the energy of each pair, and returns the sum of these pairwise potentials over the whole system.

In class a fitness of the form $f(x) = -1000 \times \text{energy}(x)$ was suggested; here we instead use $$ f(x) = \exp(-\,\text{energy}(x)) $$ which likewise keeps the relative differences between fitnesses at tractable orders of magnitude, and means that the minimum of the potential energy is now the maximum of the fitness.
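
For example, the starting grid configuration shown below has total energy $\approx 20.31$, which gives a fitness of $e^{-20.31} \approx 1.5 \times 10^{-9}$.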

We'll also make a plot_particles function so that we can get a sense of what's going on in the system.


In [5]:
def system_energy(particles):
    pairs = combinations(particles, 2)
    
    total_energy = 0
    pair_cnt = 0
    for pair in pairs:
        pair_cnt += 1
        total_energy += energy(pair[0], pair[1])
        
    return total_energy

def plot_particles(particles, ax=None):
    if not ax:
        _, ax = plt.subplots(1)
        
    for p in particles:
        color = c1 if p.sigma == 1.0 else c2
        label = "A" if p.sigma == 1.0 else "B"
        pplt.scatter(ax, p.x, p.y, color=color, s=100, label=label)
        
    # remove duplicate labels from the legend
    handles, labels = ax.get_legend_handles_labels()
    by_label = OrderedDict(zip(labels, handles))
    ax.legend(by_label.values(), by_label.keys())
    ax.set_aspect('equal')
    

start_x = [3, 4, 5,
           2.5, 3.5, 4.5, 5.5,
           2, 3, 4, 5, 6,
           2.5, 3.5, 4.5, 5.5,
           3, 4, 5]

start_y = [5, 5, 5,
           4, 4, 4, 4,
           3, 3, 3, 3, 3,
           2, 2, 2, 2,
           1, 1, 1]

def initialize_individual():
    the_particles = particle_factory(start_x=start_x, start_y=start_y)
    
    return the_particles

the_particles = initialize_individual()
print system_energy(the_particles)
plot_particles(particle_factory(start_x=start_x, start_y=start_y))


20.3149460394

Answer 1 - continued

Now we need to define mutation functions. mutate_swap_positions randomly picks two different particles that are not of the same type and are not "fixed", and swaps their positions. Note that, as in the lab, we can fix two interior particles of type A (via the fixed flag) so that our solution is invariant to translation and rotation; this should help us converge faster.

The function mutate_move_position takes a list of particles, a number of particles to move (n_moves), and a step size. It then randomly picks n_moves particles and moves each one by step_size (or not at all) along the x and y axes, with the direction along each axis chosen at random.


In [6]:
def mutate_swap_positions(particle_list, n_swaps=1):
    for i in range(n_swaps):
        # get index for first particle to swap
        while True:
            a_ix = np.random.choice(np.arange(len(particle_list)), replace=False)
            
            if not particle_list[a_ix].fixed:
                break
        
        # get index for second particle to swap
        while True:
            b_ix = np.random.choice(np.arange(len(particle_list)), replace=False)
            
            if (not particle_list[b_ix].fixed) and (a_ix != b_ix) and (particle_list[a_ix].sigma != particle_list[b_ix].sigma):
                break
            
        # swap particles xs and ys
        particle_list[a_ix].x, particle_list[b_ix].x = particle_list[b_ix].x, particle_list[a_ix].x
        particle_list[a_ix].y, particle_list[b_ix].y = particle_list[b_ix].y, particle_list[a_ix].y
    
    return particle_list

def mutate_move_position(particle_list, n_moves=1, step_size=0.1):
    for i in range(n_moves):
        a_ix = np.random.choice(np.arange(len(particle_list)), replace=False)
        
        # get direction of change along each axis (-1, 0, or +1);
        # note np.arange(-1, 1) would only yield [-1, 0] and could
        # never move a particle in the positive direction
        signx = np.random.choice([-1, 0, 1])
        signy = np.random.choice([-1, 0, 1])
        
        particle_list[a_ix].x += signx * step_size
        particle_list[a_ix].y += signy * step_size
        
    return particle_list
        
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(12, 4)

the_particles = initialize_individual()

# original
plot_particles(the_particles, ax=axes[0])

# swap some positions
plot_particles(mutate_swap_positions(the_particles, n_swaps=100), ax=axes[1])

# jitter the points by some amount.
plot_particles(mutate_move_position(the_particles, n_moves=100), ax=axes[2])


Answer 1 - continued

Finally we can start coding a genetic algorithm to maximize the fitness of the configurations. This algorithm has the following steps:

  • setup an initial population of random configurations
  • for each generation, the fittest individuals are more likely to reproduce, so we identify them and create a new "population" that is proportional to the relative fitness of the current population
  • we then introduce random mutations into this new generation:
    • with some probability we have a small number of swaps (1)
    • with some probability we have a large number of swaps (100)
    • with some probability we have a small "jitter" (n_moves = 100)
    • with some probability we have a large "jitter" (n_moves = 3)
  • we then see if any of our current generation has the best fitness we have seen yet; if so, we store this configuration
  • finally, we use the new generation as the current generation and start again
  • we iterate through a fixed number of generations

Implementation Details

These are the utility functions that we define to operate this procedure:

  • initialize_random_individual: initializes a configuration and then makes a number of random swaps and jitters to get a pseudo-random configuration
  • get_next_generation: takes a current population, which is a set of configurations, and then gets a new population of the same size, but with individuals distributed according to the relative fitnesses of the individuals in the original population.
  • fitness: function to convert the total LJ potential (which we want to minimize) into a fitness (which we want to maximize). Currently we do this as $\exp(-V_{\text{total}})$ where $V_{\text{total}}$ is the sum of all of the LJ potentials for all of the unique particle interactions.
  • get_fitnesses: gets the fitness for each of the configurations in a population. Optionally takes the normed boolean argument, which returns fitnesses that sum to 1.
  • get_max_individual: given a population, this returns the individual (configuration) with the maximum fitness and the value of that maximum fitness.
  • mutate_generation: this is passed an unmutated version of the "next generation" population in order to introduce mutations into this generation. For each individual, this mutates the individual with large and small swaps and jitters according to the probabilities of these events that are passed in.
  • genetic_LJ: this function runs the entire algorithm as described above utilizing the utility functions we describe here. It returns after a certain number of iterations.

In [ ]:
def initialize_random_individual():
    the_particles = initialize_individual()
    
    the_particles = mutate_swap_positions(the_particles, n_swaps=10)
    the_particles = mutate_move_position(the_particles, n_moves=5)
    
    return the_particles

def get_next_generation(current_generation):
    pop_size = len(current_generation)
    
    # get fitness of each member of the population
    fitnesses = get_fitnesses(current_generation, normed=True)
    
    # get the counts of individuals in the next generation
    next_gen_counts = np.random.multinomial(pop_size, fitnesses, size=1)
    
    # create the un-mutated next generation from the counts
    next_generation = []
    for idx, count in enumerate(next_gen_counts[0].tolist()):
        next_generation += [deepcopy(current_generation[idx]) for j in range(count)]
        
    return next_generation

def fitness(particles):
    # lower energy => higher fitness; the exponential also keeps
    # relative fitnesses at tractable orders of magnitude
    return np.exp(-system_energy(particles))

def get_fitnesses(population, normed=False):
    fitnesses = np.array(map(fitness, population), dtype=np.float64)
    if normed:
        pop_size = len(population)
        fitnesses = normalize(fitnesses.reshape(pop_size, -1), axis=0, norm="l1").reshape(-1)

    return fitnesses
    
def get_max_individual(population):
    fitnesses = get_fitnesses(population)
    return fitnesses.max(), population[np.argmax(fitnesses)]
    
def mutate_generation(population, 
                      p_small_swap,
                      p_large_swap,
                      p_small_jitter,
                      p_large_jitter):
    
    new_population = []
    for individual in population:
        new_individual = deepcopy(individual)
        swap_prob = np.random.rand()
        if swap_prob < p_large_swap:
            new_individual = mutate_swap_positions(new_individual, n_swaps=100)
        elif swap_prob < p_small_swap:
            new_individual = mutate_swap_positions(new_individual, n_swaps=1)
            
        jitter_prob = np.random.rand()
        if jitter_prob < p_large_jitter:
            new_individual = mutate_move_position(new_individual, n_moves=100)
        elif jitter_prob < p_small_jitter:
            new_individual = mutate_move_position(new_individual, n_moves=3)
            
        new_population.append(new_individual)
        
    return new_population


def genetic_LJ(starting_generation,
               n_generations=1000,
               p_small_swap=0.3,
               p_large_swap=0.01,
               p_small_jitter=0.4,
               p_large_jitter=0.1):
    
    # keep track of the maximum fitness at each generation
    max_fitnesses = np.zeros(n_generations + 1, dtype=np.float64)
    
    # random starting best fitness
    best_fitness, most_fit_individual = get_max_individual(starting_generation)
    max_fitnesses[0] = best_fitness
    
    current_generation = starting_generation
    
    for g in np.arange(n_generations):
        # get the un-mutated next generation based on the relative fitness
        # of the current population
        next_generation = get_next_generation(current_generation)
        
        # mutate the next generation
        next_generation = mutate_generation(next_generation, 
                                            p_small_swap, 
                                            p_large_swap, 
                                            p_small_jitter, 
                                            p_large_jitter)
        
        # get maxes for the next generation
        curr_max, curr_max_ind = get_max_individual(next_generation)
        
        # store the maximum for next generation
        max_fitnesses[g+1] = curr_max
        
        # keep the maximum if it is the best we have seen yet
        if curr_max > best_fitness:
            most_fit_individual = deepcopy(curr_max_ind)
            best_fitness = curr_max

        # next generation becomes the current generation
        current_generation = next_generation
            
    return max_fitnesses, most_fit_individual
        
    
initial_generation = [initialize_random_individual() for i in range(500)]

# ignore warnings since there are some c-level deprecations
# that don't disappear
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fits, final = genetic_LJ(initial_generation)

In [10]:
plt.plot(fits)
plot_particles(final)


Answer 1 - continued

Using the genetic algorithm, we can observe an upward trend in the best fitness over generations. While we don't find the exact optimal configuration within the number of generations we ran, we can see steady improvement toward it and a leveling off once the population settles into a stable region. As the plot shows, this configuration consists of a hexagonal cluster of type-A particles surrounded by a perimeter of type-B particles.
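
To quantify the final result, a small follow-up cell (our addition, reusing the fits and final variables from above) reports the best fitness seen and the total energy of the corresponding configuration:

In [ ]:
# our addition: report the best fitness found and the total energy of the
# most fit configuration (uses fits, final, and system_energy from above)
print "Best fitness found: ", fits.max()
print "Energy of best configuration: ", system_energy(final)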

Question 2: The Cost of Housing

Here we ask you to use your knowledge of gradient descent to model home prices. The text file files/houses.txt gives data on the living area (in sq. ft), number of bedrooms, and price (first through third columns) for 47 homes in the Portland, Oregon area.

Consider the following simple model of home prices

$$\text{Price} = \theta_0+\theta_1 \text{Area} +\theta_2 \text{Bedrooms} + \epsilon$$

a) Write out the least-squares cost function (the function you are trying to minimize) for this regression problem. Note that this is equivalent to MLE for gaussian $\epsilon$.

b) Use batch gradient descent to find the optimal values of $\theta$. Experiment with several different learning rates $\alpha$ before running the algorithm to convergence. Plot cost as a function of iteration for the different values of $\alpha$ you experiment with and explain what learning rate is optimal (for this plot you only need to go out to 50 iterations). Now using the learning rate you selected, run the batch gradient descent algorithm to convergence and record the optimal $\theta$ and minimum value of the cost function. Record the time the algorithm takes to converge.

c) Now minimize the cost function using stochastic gradient descent. Produce a plot of the cost as a function of iteration and record the optimal $\theta$ and minimum value of the cost function. Record the time the algorithm takes to converge.

d) Which algorithm took longer to converge? Why? Are your results in b) and c) similar?

e) What is the predicted price of a house with 2500 square feet and 2 bedrooms?

Hint: Before beginning the optimization, you may want to normalize both the living area and bedroom inputs by subtracting the mean and dividing by the standard deviation.

Answer 2a

$$\text{Price} = \theta_0+\theta_1 \text{Area} +\theta_2 \text{Bedrooms} + \epsilon$$ a) Write out the least-squares cost function (the function you are trying to minimize) for this regression problem. Note that this is equivalent to MLE for gaussian $\epsilon$.

If we consider our intercept, area, and bedrooms as elements in a vector, $\mathbf{x}$, we can rewrite our function in matrix notation:

$$ \text{Price} = h(\mathbf{x}) = \theta^{\text{T}}\mathbf{x} $$

For any set of parameters $\theta$, we can estimate the price of the house as:

$$ \hat{y}^{(i)} = h(\mathbf{x}^{(i)}) $$

Now we can write our cost function in terms of $\theta$ as one half the sum of squared differences between our predicted values and the actual values:

$$ J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 $$
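
Equivalently, stacking the $m$ inputs as the rows of a design matrix $X$ and collecting the prices into a vector $y$, this is

$$ J(\theta) = \frac{1}{2} (X\theta - y)^{\text{T}} (X\theta - y) = \frac{1}{2} \lVert X\theta - y \rVert^2 $$

which is the form the cost function in the code below computes.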

Answer 2b

b) Use **batch gradient descent** to find the optimal values of $\theta$. Experiment with several different learning rates $\alpha$ before running the algorithm to convergence. Plot cost as a function of iteration for the different values of $\alpha$ you experiment with and explain what learning rate is optimal (for this plot you only need to go out to 50 iterations). Now using the learning rate you selected, run the batch gradient descent algorithm to convergence and record the optimal $\theta$ and minimum value of the cost function. Record the time the algorithm takes to converge.

First, we need to know what the gradient of our cost function is. We can take the partial derivative with respect to each parameter, $\theta_j$ in the vector of parameters $\theta$:

$$ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 \\ = \sum_{i=1}^{m} 2 \cdot \frac{1}{2} (\theta^{\text{T}}\mathbf{x}^{(i)} - y^{(i)}) \frac{\partial}{\partial \theta_j} (\theta^{\text{T}}\mathbf{x}^{(i)} - y^{(i)}) \\ = \sum_{i=1}^{m} (\theta^{\text{T}}\mathbf{x}^{(i)} - y^{(i)}) \, x_j^{(i)} $$

Since we want to step in the direction opposite the gradient, we flip the sign of this term to $(y^{(i)} - \theta^{\text{T}}\mathbf{x}^{(i)})$. We can then update each parameter according to:

$$ \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^{m} (y^{(i)} - \theta^{\text{T}}\mathbf{x}^{(i)}) \, x_j^{(i)} $$
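
Collecting all the parameters into a single step, this update can also be written in vectorized form. Below is a minimal sketch (the helper name batch_update is ours and is not used elsewhere in this notebook); it performs the same simultaneous update as the per-parameter loop inside batch_descent below:

In [ ]:
def batch_update(theta, X, y, alpha):
    # one simultaneous batch step: theta <- theta + alpha * X^T (y - X theta)
    # X is the m x 3 design matrix and y is the vector of prices
    # (illustrative sketch only -- batch_descent below does the same thing
    # with an explicit loop over the parameters j)
    residuals = y - np.dot(X, theta)
    return theta + alpha * np.dot(X.T, residuals)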

In [218]:
# load the data
orig_data = np.loadtxt("houses.txt", delimiter=",")

# add an intercept term
prices = orig_data[:, 2].copy()
orig_data[:, 2] = 1

# standardize the data by de-meaning and
# dividing by the standard deviation
def standardize(x):
    return (x - np.mean(x))/np.std(x)

# keep the original data so we can standardize
# any new data points (e.g., in part e) the same way
data = orig_data.copy()

data[:, 0] = standardize(data[:, 0]) 
data[:, 1] = standardize(data[:, 1])

# print one row to verify the standardization
print data[10,:]

In [323]:
def cost(theta, data, prices):
    return 0.5 * ((np.dot(data, theta.T) - prices)**2).sum()

def batch_descent(theta, data, price, alpha=1e-2, max_iter=None, tol=1e-15):
    cnt = 0
    num_params = len(theta)
    costs = []
    
    while True:
        # break based on a number of iterations
        if max_iter and cnt > max_iter:
            break
            
        # batch update thetas
        theta_new = theta.copy()
        for j in range(num_params):
            theta_new[j] = theta[j] + alpha * ((price - np.dot(data, theta.T)) * data[:, j]).sum()           
                    
        # break once we've converged
        if np.linalg.norm(theta_new - theta) <= tol:
            break
        
        # increment count, costs, and theta
        cnt += 1
        theta = theta_new.copy()
        costs.append(cost(theta, data, price))
        
    return theta, costs

alphas_test = 10**np.arange(-7., -1)
alpha_costs = []
initial_theta = np.ones(3, dtype=np.float64)

for alpha in alphas_test:
    best_theta, cost_list = batch_descent(initial_theta, data, prices, alpha=alpha, max_iter=50) 
    pplt.plot(cost_list, label=r"$\alpha={}$".format(alpha))
    
plt.title(r"Cost at every iteration for different values of $\alpha$")
plt.xlabel("Iteration")
plt.ylabel(r"Cost $J(\theta)$")
pplt.legend()
plt.show()


Answer 2b - continued

Observing the convergence for different values of the learning rate, we can see that a good candidate is 0.001. The cost for larger values of $\alpha$ grows extremely large within 50 iterations (and I suspect 0.01 would diverge as well given more iterations), while the smaller values will take a long time to converge. Therefore, 0.001 strikes a nice balance for our learning rate, $\alpha$.

We'll run batch gradient descent 100 times to get a good sense of the time that it takes.


In [325]:
from time import time

times = []
for i in range(100):
    start = time()
    batch_thetas, batch_costs = batch_descent(initial_theta, data, prices, alpha=0.001)
    times.append(time() - start)
    
print "Average time for batch descent: ", np.mean(times)
print "Cost minimum is ", batch_costs[-1]
print "Values for theta are: ", batch_thetas


Average time for batch descent:  0.0847205996513
Cost minimum is  96034182437.5
Values for theta are:  [ 109447.76551898   -6578.27679028  340412.76595745]

Answer 2c

c) Now minimize the cost function using **stochastic gradient descent**. Produce a plot of the cost as a function of iteration and record the optimal $\theta$ and minimum value of the cost function. Record the time the algorithm takes to converge.

In stochastic gradient descent, we update theta according to only a single data point for each iteration (rather than the entire data set as in batch).
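
Concretely, at each iteration we pick a single example $i$ (here we simply cycle through the rows) and apply the batch update with the sum over examples replaced by that one term:

$$ \theta_j \leftarrow \theta_j + \alpha \, (y^{(i)} - \theta^{\text{T}}\mathbf{x}^{(i)}) \, x_j^{(i)} $$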

We'll run it 100 times to get a good estimate of the average time that it runs for.


In [329]:
def stochastic_descent(theta, data, price, alpha=0.001, max_iter=None, tol=0.005):
    cnt = 0
    num_params = len(theta)
    rows, _ = data.shape
    costs = []
    
    while True:
        # break based on a number of iterations
        if max_iter and cnt > max_iter:
            break
            
        # update thetas according to a single data point
        theta_new = theta.copy()
        row = cnt % rows 
        
        for j in range(num_params):
            theta_new[j] = theta[j] + alpha * (price[row] - np.dot(data[row, :], theta.T)) * data[row, j]     
                    
        # break once we've converged
        if np.linalg.norm(theta_new - theta) <= tol:
            break
        
        # increment count, costs, and theta
        cnt += 1
        theta = theta_new.copy()
        costs.append(cost(theta, data, price))
        
    return theta, costs

In [330]:
times = []
stoch_thetas_sum = np.zeros(3, dtype=np.float64)
stoch_cost_sum = 0

for i in range(100):
    start = time()
    stoch_thetas, stoch_costs = stochastic_descent(initial_theta, data, prices, max_iter=10000)
    times.append(time() - start)
    
    stoch_thetas_sum += stoch_thetas
    stoch_cost_sum += stoch_costs[-1]
    
print "Average time for stochastic descent: ", np.mean(times)
print "Average cost minimum is ", stoch_cost_sum / 100.
print "Average values for theta are: ", stoch_thetas_sum / 100.


Average time for stochastic descent:  0.332632708549
Average cost minimum is  96049425310.4
Average values for theta are:  [ 108904.49884932   -5611.23643796  340494.59994362]

In [331]:
pplt.plot(stoch_costs)
    
plt.title(r"Cost at every iteration")
plt.xlabel("Iteration")
plt.ylabel(r"Cost $J(\theta)$")
plt.show()

# plot the linear components individually
fig, axes = plt.subplots(1, 2)
for i in range(2):
    pplt.scatter(axes[i], data[:,i], prices)
    
    xs = np.linspace(data[:,i].min(), data[:,i].max(), 1000)
    pplt.plot(axes[i], xs, stoch_thetas[i]*xs + stoch_thetas[2])


Answer 2d

d) Which algorithm took longer to converge? Why? Are your results in b) and c) similar?

Batch descent took about 0.08 seconds; stochastic gradient descent took on the order of 0.33 seconds. In stochastic gradient descent we update $\theta$ using only a single data point at a time. Even though each update is far cheaper than an update over the entire dataset (as in batch), the algorithm still takes longer to reach the minimum. This is likely because the stochastic version is less stable than batch: its movement depends only on the data point it is currently looking at, so it wanders around the minimum more than the batch version does.

We can see how if we had a much larger dataset, stochastic gradient descent would be very advantageous. Given that it can make progress with each point in the dataset, we can start moving towards the minimum more quickly.

As we can see by the results, the minimum cost and the theta values are very close to the same for both methods. The theta values are:

| Method | Standardized Area | Standardized Number of Bedrooms | Intercept |
| --- | --- | --- | --- |
| Batch Gradient Descent | 109448 | -6578 | 340413 |
| Stochastic Gradient Descent | 108904 | -5611 | 340495 |

We might initially be surprised that the coefficient for number of bedrooms is negative. However, this is likely because it is highly collinear with square footage. Also, its term is small in magnitude relative to the others, so its overall effect on the predicted price is likely to be very small.
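
As a quick sanity check on the collinearity claim, a short cell (our addition; the houses variable name is ours) can compute the correlation between living area and number of bedrooms in the raw data:

In [ ]:
# our addition: correlation between living area (column 0) and number of
# bedrooms (column 1) in the raw data, to check the collinearity claim
houses = np.loadtxt("houses.txt", delimiter=",")
print np.corrcoef(houses[:, 0], houses[:, 1])[0, 1]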

Answer 2e

e) What is the predicted price of a house with 2500 square feet and 2 bedrooms?

To calculate this price, we need to perform the standardization procedure that we used on the original data to transform our input. Then we can get a point estimate for the price according to the original formula.

As we can see below, our regression tells us that the price of this house is approximately $420,000.


In [338]:
new_point = np.array([2500, 2, 1], dtype=np.float64).reshape(3, -1)
new_point[0,:] = (new_point[0, :] - np.mean(orig_data[:,0]))/np.std(orig_data[:,0])
new_point[1,:] = (new_point[1, :] - np.mean(orig_data[:,1]))/np.std(orig_data[:,1])

print "Price for House (Batch):", np.dot(new_point.T, batch_thetas)
print "Price for House (Stochastic):", np.dot(new_point.T, stoch_thetas)


Price for House (Batch): [ 420148.52174648]
Price for House (Stochastic): [ 418382.16608885]

END



In [1]:
# Little hack to add styling to my answers to distinguish them
# from the questions and make them easier for me to find. To 
# use the "answer" styling, add <div class="answer" /> to a markdown cell.
# 
# NOTE: rerun this cell last to make sure all of your cells are updated.
#
from IPython.core.display import HTML
def add_answer_styling():
    styles = """
    <style>
        .answerstyle {
            /* Use !important to override ipython nb defaults */
            color: #e2e2e2 !important;
            border-left: 13px solid #822222 !important;
            padding: 10px !important;
            
            background-color: #323232;
            
        }
        
        .answerstyle table, td, th, tr
        {
            border:3px solid #822222 !important;
        }
        
        .answerstyle code
        {
            background-color: #e2e2e2 !important;
        }
        
        /* used to note a markdown cell as an answer */
        .answer{}
    </style>
    
    <script>
      // Get all the empty "answer" divs
      var elems = document.getElementsByClassName('answer');
      
      for (var i = 0; i < elems.length; i++) {
        
        // if the parent is not styled as an answer, apply that style
        if(elems[i].parentNode.className.indexOf('answerstyle') === -1)
        {
            elems[i].parentNode.className = 'answerstyle ' + elems[i].parentNode.className;
        }
        }
    </script>
    """
    return HTML(styles)
add_answer_styling()


Out[1]:
