Lab 9: The Poisson Process


In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab09.ok')

Today's lab has two topics:

  1. The Poisson process model. This is a model for events that happen "randomly" in time. It pops up in Homework 5.
  2. Solving least-squares problems with a computer.

The Poisson process

Homework 5 uses a Poisson process model in the context of website updates: A website could be updated at any time, and you can't really predict when an update will happen based on previous update times.

Here are a few more situations where events could be modeled by a Poisson process:

  1. Cars are arriving at an intersection when traffic is flowing freely.
  2. Customers are visiting an online store.
  3. Rain drops are falling on a field during a storm.
  4. Radiation particles are being emitted by a radioactive object.

We use the term "random variable" for numbers or single quantities that are random, but of course other things can be random, too. A Poisson process is a random set of numbers. For example:

  1. The times when cars or customers arrive, or when radiation particles are emitted.
  2. The places where rain drops fall.

A Poisson process arises whenever the random set has just two properties:

A. Every possible number has an equal chance to appear in the set.

B. Knowing that certain numbers appear in the set doesn't tell you anything about whether other numbers appear in the set. In other words, numbers appear independently in the set.

The second assumption is often problematic. For example, it's violated to some degree in the following situations:

  1. Cars are arriving at an intersection, but they all arrive around the same time because they were waiting for a light at a previous intersection.
  2. Customers are visiting an online store, but sometimes a big event (like a new product launch) makes many customers visit at the same time.

Question 1

Explain why assumption B (independence) is violated in each of the two situations above.

SOLUTION: Situation 1: knowing that one car has arrived, you could predict that many other cars are about to arrive. Situation 2: knowing that an unusually-high number of customers have visited in the last minute, you could predict that many other customers are about to visit in the next minute.

Ooober

Suppose you are working in the Public Relations office at the ridesharing company Ooober. Your company is somewhat prone to scandal. Normally, one or two scandals are manageable, but it is very bad when 3 or more scandals happen in one month, because then the media starts to write "pile-on" stories about how scandal-prone Ooober is.

You would like to model scandals and estimate how likely it is for 3 or more scandals to happen in a month.

Run the next cell to load the scandals table. It corresponds to the following schema:

scandals(primary key (scandal_id), time)

The times are given in fractions of a month from January 2014 to December 2015. For simplicity, in this lab, just assume every month has 30 days and every day has 24 hours.


In [43]:
scandals = pd.read_csv('scandals.csv')
scandals.set_index('scandal_id', inplace=True)
scandals

The following cell defines a function for viewing timelines of events.


In [3]:
def display_points(points, xlim, title):
    """Displays a timeline with points on it.
    
    Args:
      points (ndarray): A list of floats in the range [xlim[0], xlim[1]],
                        each one a point to display in the timeline.
      xlim (list-like): A list/tuple/array with 2 elements giving the
                        start and end of the timeline.
      title (str): The name to display on the plot.
    """
    fig, ax = plt.subplots(1)
    fig.set_size_inches(8, 1)
    ax.scatter(points, np.repeat(0, len(points)))
    ax.axhline(0, color="grey", zorder=-1, lw=.5)
    ax.yaxis.set_visible(False)
    ax.xaxis.set_visible(True)
    ax.set_xlim(xlim)
    ax.set_title("{} ({:d} total points)".format(title, len(points)))
    plt.show()

Question 2

Display the timeline of scandals at Ooober.


In [4]:
num_months = 24
display_points(scandals['time'], [0, num_months], 'Scandals at Ooober') #SOLUTION

Question 3

What was the average number of scandals per month at Ooober over the last 2 years?


In [ ]:
average_scandals_per_month = len(scandals) / num_months #SOLUTION
average_scandals_per_month

In [6]:
_ = ok.grade('q3')
_ = ok.backup()

Our goal is to simulate possible future scandal patterns. To do that, we need a model for the scandals, and we need to learn the parameters of that model from our data.

The scandals look somewhat spread out. Suppose we are willing to assume that they follow a Poisson process. Remember, that means:

A. Scandals are equally likely to happen at any time.

B. Scandals happen independently. That means a scandal is no more or less likely to happen tomorrow if one happened today.

How would we simulate future scandals under this model?

If the average number of scandals per month is $r$, then the average number of scandals per day must be $\frac{r}{30}$ (remember, we're assuming there are 30 days in each month), and the average number of scandals per hour must be $\frac{r}{30 \times 24}$. Notice that under assumption A, this is true for all hours, even 3 AM on a Sunday.
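For concreteness: if $r = 2$ scandals per month (a made-up rate, not the one estimated from our data), that works out to $\frac{2}{30} \approx 0.067$ scandals per day and $\frac{2}{30 \times 24} \approx 0.0028$ scandals per hour.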

Question 4

What was the average number of scandals per second at Ooober over the last 2 years?


In [8]:
average_scandals_per_second = average_scandals_per_month / (30*24*60*60) #SOLUTION
average_scandals_per_second

In [9]:
_ = ok.grade('q4')
_ = ok.backup()

It seems okay to neglect the chance that there are 2 scandals in one second! In that case, whether a scandal happens in a particular second is a Bernoulli($p$) random variable: 1 if there is a scandal and 0 otherwise.

Question 5

In order for the average number of scandals per second to match your answer to the previous question, what should $p$ be?


In [10]:
p = average_scandals_per_second #SOLUTION
p

In [11]:
_ = ok.grade('q5')
_ = ok.backup()

Question 6

Draw from a Bernoulli($p$) distribution for each second in two years. Count each 1 you find as a scandal at the start of that second. Set q6_simulated_scandals to a list of those seconds, converted to months since the start of the first year.

Hint: np.random.random(size=n) will generate an array of n numbers uniformly distributed between 0 and 1. Therefore, each element of the array np.random.random(size=n) <= p has an independent chance p to be True and chance 1-p to be False.

Hint 2: You may find the function np.where useful. np.where(array_of_booleans)[0] is an array of indices where array_of_booleans is True. Note the [0].
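
For example, here is a minimal demonstration of the two hints together, using a made-up p of 0.3 and just 10 draws (not the lab's values):

In [ ]:
# Ten Bernoulli(0.3) draws; np.where picks out the indices of the 1s.
demo_draws = np.random.random(size=10) <= 0.3
np.where(demo_draws)[0]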


In [12]:
seconds_in_a_month = 30*24*60*60

q6_simulated_scandals = np.where(np.random.random(size=num_months*seconds_in_a_month) <= p)[0] / seconds_in_a_month #SOLUTION

display_points(q6_simulated_scandals, [0, num_months], "Simulated scandals in 2 years")
q6_simulated_scandals

Question 7

Encapsulate your work in a function called draw_approximate_poisson_process, as described below.


In [13]:
def draw_approximate_poisson_process(rate, length, steps_per_unit_length):
    """Draws from an approximate Poisson process on [0, length] with the given rate.
    
    This function simulates the times at which a thing happens, where at
    each time step the thing either happens or doesn't according to an
    independent Bernoulli draw.  The number of time steps simulated
    is:
    
        length * steps_per_unit_length.
        
    The average number of things that happen is:
    
        rate * length.
    
    Args:
      rate (float): The average number of times the thing happens per
                    unit length.
      length (float): The length of time to be simulated.
      steps_per_unit_length (float): The number of Bernoulli draws per unit length.
                                     length*steps_per_unit_length must be an integer.
                                     (That guarantee isn't strictly necessary but
                                     simplifies the implementation of this function.)
     
    Returns:
      ndarray: A NumPy array containing the times when the thing happened.
               (If an event happened during a time step, this function may
               report any time for the event between the start of that time
               step and the start of the next one.)
    """
    # We suggest computing the number of steps like this:
    num_steps = int(np.round(length * steps_per_unit_length))
    p = rate / steps_per_unit_length #SOLUTION
    return np.where(np.random.random(size=num_steps) <= p)[0] / steps_per_unit_length #SOLUTION

draw_approximate_poisson_process(average_scandals_per_month, num_months, seconds_in_a_month)
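
Before moving on, here is a quick sanity check (not required by the lab): averaged over many draws, the number of events should be close to rate * length.

In [ ]:
# With rate 2 and length 3, the mean count should be near 2 * 3 = 6.
np.mean([len(draw_approximate_poisson_process(2, 3, 1000)) for _ in range(1000)])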

Question 8

You might notice that your function takes some time to run. To speed things up, let's switch the granularity of our simulations from seconds to minutes. This will give us a slightly worse approximation to the Poisson process, but it still seems plausible that 2 scandals never happen in the same minute.

Now, by calling your function 100-1000 times, estimate the chance that (over the next 24 months, assuming the rate of scandals remains the same) Ooober experiences 3 or more scandals in at least one month. This is called a Monte Carlo simulation.

Note: This may take up to a minute. Try using fewer simulations when you first run it.


In [14]:
num_simulations = 1000
minutes_in_a_month = 30*24*60
# For each simulated 2-year history: bucket the scandal times by integer
# month with np.bincount, then check whether the busiest month had 3 or
# more scandals.  minlength=num_months guards against the (vanishingly
# rare) simulation with no scandals at all.
three_or_more_scandals_count = np.count_nonzero([
        max(np.bincount(draw_approximate_poisson_process(average_scandals_per_month, num_months, minutes_in_a_month).astype(int),
                        minlength=num_months)) >= 3
        for _ in range(num_simulations)])
three_or_more_scandals_chance = three_or_more_scandals_count / num_simulations
three_or_more_scandals_chance

In [16]:
_ = ok.grade('q8')
_ = ok.backup()

The Poisson distribution

Often, we don't care about exactly when events happen, only how many of them happen in a certain interval. A Poisson($\lambda$) random variable (not a process) is the number of events that happen in some Poisson process that has rate*length equal to $\lambda$.

Since draw_approximate_poisson_process draws approximately from a Poisson process, an expression like:

len(draw_approximate_poisson_process(2, 3, 1000))

...has approximately a Poisson($2 \times 3$) distribution.
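
As an aside, NumPy can draw from a Poisson distribution directly, which is handy for sanity-checking counts (though, unlike your function, it doesn't give you event times):

In [ ]:
# Five draws from Poisson(6):
np.random.poisson(lam=6, size=5)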

However, we can characterize the distribution of your approximation exactly. Your approximation draws from many Bernoulli random variables. The number of events that happen is a sum of those Bernoulli random variables. Therefore, the number of events follows exactly a Binomial distribution!

Recall that the Binomial distribution has two parameters: $n$, the number of Bernoulli random variables that are summed, and $p$, the probability that any one of them is 1. It's written Binomial($n$, $p$). The expected value is $n \times p$.

Question 9

What is the exact distribution of the results of this call?

len(draw_approximate_poisson_process(r, l, d))

Give the Binomial parameters $n$ and $p$.


In [17]:
# Example values for r, l, and d so that this cell can run:
r = .2
l = 10
d = 3

n = l*d #SOLUTION
p = r/d #SOLUTION

In [18]:
_ = ok.grade('q9')
_ = ok.backup()

As d increases, the time steps get narrower, and the Binomial(l*d, r/d) distribution converges to the Poisson(l*r) distribution. In Homework 5, you'll see another way to simulate data from a Poisson process that takes advantage of this fact. The cell below uses your draw_approximate_poisson_process function to show by simulation that:

$$\text{Binomial}(ld, \frac{r}{d}) \to \text{Poisson}(lr) \text{ as }d \to \infty.$$

In [72]:
def plot_poisson_approximation(r, l, num_step_sizes, num_simulations):
    import math
    max_output = int(r*l + 6*(r*l)**0.5)
    true_poisson_pmf = [math.exp(-r*l) * (r*l)**k / math.factorial(k) for k in range(max_output)]
    min_steps_per_length = r*2
    steps_per_length = min_steps_per_length * 2**np.arange(num_step_sizes)
    def approximate_pmf(s):
        draws = [len(draw_approximate_poisson_process(r, l, s)) for _ in range(num_simulations)]
        return np.bincount(draws, minlength=max_output) / num_simulations
    
    approximate_pmfs = [approximate_pmf(s) for s in steps_per_length]
    approximate_pmf_names = ["Approximation to Poisson PMF (Binomial({:d}, {:.4f}))".format(int(np.round(l*s)), r/s)
                             for s in steps_per_length]
    from matplotlib import cm
    colors = [cm.jet(x) for x in np.linspace(0, 1, len(steps_per_length)+1)]
    
    pmfs = approximate_pmfs + [true_poisson_pmf]
    names = approximate_pmf_names + ["True Poisson({:.4f}) PMF".format(r*l)]
    for pmf, name, color in zip(pmfs, names, colors):
        plt.scatter(range(len(pmf)), pmf, c=color, label=name, s=40)
        plt.plot(range(len(pmf)), pmf, c=color, linestyle='dashed')
    # The legend, title, and axis labels only need to be set once, after
    # all the PMFs have been plotted.
    plt.legend()
    plt.title("Approximations to the Poisson distribution\n(dashed lines are for contrast purposes only)")
    plt.xlabel("Count")
    plt.ylabel("Probability")

plot_poisson_approximation(.2, 10, 5, 40000)
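
The convergence can also be checked exactly, without simulation: SciPy's scipy.stats module provides PMFs for both distributions, so we can compare them directly. (This is a quick aside, not part of the required lab work.)

In [ ]:
import scipy.stats

# P(count = k) under Binomial(l*d, r/d) versus Poisson(l*r),
# with r = .2, l = 10, d = 100 (so n = 1000, p = .002, lambda = 2):
[(k, scipy.stats.binom.pmf(k, 1000, .002), scipy.stats.poisson.pmf(k, 2))
 for k in range(5)]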

Least squares

Now let's turn to a new topic entirely: least-squares linear regression. To save time, we won't look at a real-world or even a fictitious problem. Our goal is to get some concrete experience solving least-squares problems with a computer.

In last week's lab, you saw that you could maximize a function on a computer by plotting its values over a range of inputs. A computer can automate that process and improve on it in various ways; this is called numerical optimization. The main numerical optimization tool we'll use is the function scipy.optimize.minimize from the SciPy package. There are many, many others.
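
As a tiny illustration of the interface before we use it for real (a toy one-dimensional function, not this lab's problem):

In [ ]:
import scipy.optimize

# Minimize f(x) = (x - 3)^2 starting from x = 0; the minimizer is 3.
toy_result = scipy.optimize.minimize(lambda x: (x[0] - 3)**2, x0=np.zeros(1))
toy_result.x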

The problem in least-squares linear regression is:

For $n$ entities $i$ = 0 through $n-1$, find a prediction function that predicts their outcomes $y_0$, $y_1$, ..., $y_{n-1}$ well. To help us make predictions, we have available for each entity a vector of $p$ numerical features, $x_i \in \mathcal{R}^p$. We limit ourselves to linear prediction functions: functions that look like $f(x) = \theta \boldsymbol{\cdot} x$ for some parameter vector $\theta$. We would like to find the $f$ that produces the least squared prediction error, averaged over the $n$ entities.

In more succinct mathematical terms (in case you prefer such language), the problem is to find:

$$\text{least squares prediction function} = \arg\min_{f \in \text{linear functions}} L(f; (x_i)_{i=0}^{n-1}, (y_i)_{i=0}^{n-1})$$

where the loss function is:

$$L(f; (x_i)_{i=0}^{n-1}, (y_i)_{i=0}^{n-1}) = \frac{1}{n} \sum_{i=0}^{n-1} (f(x_i) - y_i)^2.$$

It helps to make the dependence of $f$ on parameters $\theta$ more explicit. So we define $f_\theta$ as the linear function $f_\theta(x) = \theta \boldsymbol{\cdot} x$. Then the problem becomes:

$$\theta^{\text{least squares}} = \arg\min_{\theta \in \mathcal{R}^p} \frac{1}{n} \sum_{i=0}^{n-1} (f_\theta(x_i) - y_i)^2.$$

To help us visualize the data and our prediction function, we will take $p = 2$ and $n = 100$, but nothing would change except computation time and accuracy if we increased $p$ or $n$.

Running the three cells below will create an interactive visualization of the dataset.


In [19]:
data = pd.read_csv("least_squares_data.csv")
data.head()

In [20]:
def display_linear_prediction(data, theta=None, orient_x = 45, orient_y = 45):
    import math
    from matplotlib import patches, cm
    from matplotlib.ticker import LinearLocator, FormatStrFormatter
    from mpl_toolkits.mplot3d import Axes3D
    plot_predictions = theta is not None
    fig = plt.figure()
    ax = fig.gca(projection='3d')
    if plot_predictions:
        actual_xs = data[['x0', 'x1']].as_matrix()
        predicted_ys = np.dot(actual_xs, theta)
        
        def plot_points_and_residuals(point_indicators, ax, occluded):
            to_plot = data.loc[point_indicators]
            ax.hold(True)
            ax.scatter(to_plot["x0"], to_plot["x1"], to_plot["y"], 
                       c='black',
                       zorder=-1 if occluded else 100)
            
            for i in np.where(point_indicators)[0]:
                x0, x1 = actual_xs[i,:]
                y = data['y'][i]
                predicted_y = predicted_ys[i]
                ax.hold(True)
                ax.plot([x0, x0], [x1, x1], [y, predicted_y],
                        c='red',
                        linestyle='dotted' if occluded else 'solid',
                        lw=1 if occluded else 2)
        
        # Figuring out which points are in front of the surface:
        orient_x_rad = orient_x*math.pi/180
        orient_y_rad = orient_y*math.pi/180
        viewpoint_coords = [
            np.sin(orient_x_rad)*np.cos(orient_y_rad),
            np.cos(orient_x_rad)*np.cos(orient_y_rad),
            np.sin(orient_y_rad)]
        prediction_surface_normal = [theta[0], theta[1], -1]
        viewpoint_above_surface = np.dot(viewpoint_coords, prediction_surface_normal) >= 0
        point_in_front_of_surface = ((predicted_ys - data['y']) >= 0) == viewpoint_above_surface
        
        plot_points_and_residuals(~point_in_front_of_surface, ax, True)
        
        # Plotting the surface:
        xs = np.array(np.meshgrid(
            np.linspace(min(data['x0']), max(data['x0']), 5),
            np.linspace(min(data['x1']), max(data['x1']), 5)))
        ys = np.tensordot(theta, xs, axes=1)
        ax.hold(True)
        prediction_plane = ax.plot_surface(xs[0], xs[1], ys,
                                           cmap=cm.coolwarm)
        
        plot_points_and_residuals(point_in_front_of_surface, ax, False)
        
        squared_loss = np.mean((predicted_ys - data['y'])**2)
        plt.title("data, predictions, and residuals\n(current average squared prediction error = {:.2f})".format(squared_loss))
    else:
        prediction_plane = ax.scatter(data["x0"], data["x1"], data["y"], 
                                      cmap=cm.coolwarm)
        plt.title("raw data")
        
    ax.zaxis.set_major_locator(LinearLocator(5))
    ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))
    ax.view_init(orient_y, orient_x)
    ax.set_xlabel("x0 (feature 0)")
    ax.set_ylabel("x1 (feature 1)")
    ax.set_zlabel("y (the thing we're predicting)")
    plt.show()

In [21]:
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

def plot_predictor(theta0, theta1, orient_x, orient_y, plot_predictions):
    theta = [theta0, theta1]
    display_linear_prediction(data, theta if plot_predictions else None, orient_x, orient_y)

theta0 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
theta1 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
orient_x = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="x rotation")
orient_y = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="y rotation")
plot_predictions = widgets.Checkbox(value=True, description="Plot predictions on top of data?")

interact(plot_predictor, theta0=theta0, theta1=theta1, orient_x=orient_x, orient_y=orient_y, plot_predictions=plot_predictions);

Question 10

Play around with the graph above to find a good predictor $\theta = (\theta_0, \theta_1)$. It should have squared prediction error around 1 or less. (If the graph draws slowly on your computer, just move on to the next question.)

Question 11

Write the objective function least_squares_loss as described in its docstring.

Big hint: The function np.dot and the DataFrame method as_matrix will be useful. Despite the name, np.dot can perform matrix-vector multiplication. And as_matrix converts a DataFrame to a 2d NumPy array. For example,

np.dot(data[['x0', 'x1']].as_matrix(), [-5, 3])

...is an array with one element for each entity. For each entity, the value is -5 times the x0 feature for the entity plus 3 times the x1 feature.
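
If you want to see this concretely, here is the same computation on a tiny made-up DataFrame (hypothetical values, not the lab's data):

In [ ]:
tiny = pd.DataFrame({'x0': [1., 2.], 'x1': [0., 1.]})
# Row by row: -5*1 + 3*0 = -5 and -5*2 + 3*1 = -7.
np.dot(tiny[['x0', 'x1']].as_matrix(), [-5, 3])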


In [22]:
def least_squares_loss(theta):
    """The average squared prediction error when the function
        f: x => theta . x
    is used to predict y for our dataset.
    
    The dataset is the DataFrame named data.
    
    Args:
      theta (ndarray): A vector of p numbers.  The prediction function
                       is f: x => theta . x, the dot product with theta.
    
    Returns:
      float: The average (over the DataFrame named data) of the
             squared prediction error.
    """
    # Our solution defined an array called predictions; you
    # don't have to.
    predictions = np.dot(data[['x0', 'x1']].as_matrix(), theta) #SOLUTION
    return np.mean((data['y'] - predictions)**2) #SOLUTION

In [23]:
_ = ok.grade('q11')
_ = ok.backup()

Question 12

Import the scipy.optimize module as scopt and call scopt.minimize on least_squares_loss. Name the result optimization_result and study it.

Hint: You'll need to provide one additional argument: an initial guess. It's best to start with the origin (np.zeros((2))) or a random location near the origin (np.random.normal(0, 1, size=2)).


In [24]:
import scipy.optimize as scopt
optimization_result = scopt.minimize(least_squares_loss, x0=np.random.normal(0, 1, size=2)) #SOLUTION
optimization_result

In [25]:
_ = ok.grade('q12')
_ = ok.backup()

Question 13

Interpret the output of scopt.minimize. Which field do you think is $\theta^{\text{least squares}}$? Which tells you the average squared prediction error for that value of $\theta$?

SOLUTION: x is the optimal $\theta$, and fun is the average squared prediction error at that $\theta$.

Question 14

The three cells below create two interactive graphs; we've added a heat map displaying your least_squares_loss function. Plug in the value you found for $\theta^{\text{least squares}}$ (or as close as you can get) in both. Rotate the data graph to see how well the prediction function fits.

Note: Unfortunately, the heat map is a little slow to display and update.


In [299]:
import math

windowsize = 5
maxloss = least_squares_loss([windowsize, -windowsize])
losses_flattened = pd.DataFrame.from_records(
    [(t0, t1)
     for t0 in np.linspace(-windowsize, windowsize, 30)
     for t1 in np.linspace(-windowsize, windowsize, 30)
     for _ in range(int(math.ceil((maxloss-least_squares_loss([t0, t1]))/(maxloss/10))))],
    columns=["theta0", "theta1"])

def display_loss_heatmap(theta):
    sns.kdeplot(losses_flattened['theta0'], losses_flattened['theta1'], shade=True, n_levels=100, color='red')
    plt.scatter(theta[0], theta[1], color='orange', s=100, label="proposed theta")
    plt.xlim([-windowsize, windowsize])
    plt.ylim([-windowsize, windowsize])
    plt.gca().set_aspect('equal', adjustable='box')
    plt.xlabel("theta0")
    plt.ylabel("theta1")
    plt.title("average squared prediction error by theta\n(darker is lower)")
    plt.legend()

def plot_loss(theta0, theta1):
    theta = [theta0, theta1]
    display_loss_heatmap(theta)

In [300]:
theta0 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
theta1 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
orient_x = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="x rotation")
orient_y = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="y rotation")
plot_predictions = widgets.Checkbox(value=True, description="Plot predictions on top of data?")

interact(plot_predictor, theta0=theta0, theta1=theta1, orient_x=orient_x, orient_y=orient_y, plot_predictions=plot_predictions);

In [301]:
theta0 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
theta1 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)

interact(plot_loss, theta0=theta0, theta1=theta1);

Submitting your assignment

If you made a good-faith effort to complete the lab, change i_finished_the_lab to True in the cell below. In any case, run the cells below to submit the lab.


In [ ]:
i_finished_the_lab = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()

Now, run these commands in your terminal to make a git commit that saves a snapshot of your changes. The last line runs git push, which will send your work to your personal GitHub repo.

# Tell git to commit your changes to this notebook
git add sp17/lab/lab09/lab09.ipynb

# Tell git to make the commit
git commit -m "lab09 finished"

# Send your updates to your personal private repo
git push origin master