In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab09.ok')
Today's lab has two topics:
1. Poisson processes, a model for events that happen at random times. You'll use one to model scandals at a ridesharing company and to simulate possible futures.
2. Least-squares linear regression, which we'll solve by numerical optimization.
Homework 5 uses a Poisson process model in the context of website updates: A website could be updated at any time, and you can't really predict when an update will happen based on previous update times.
Here are a few more situations where events could be modeled by a Poisson process:
We use the term "random variable" for numbers or single quantities that are random, but of course other things can be random, too. A Poisson process is a random set of numbers. For example, the set of times at which a website gets updated (as in Homework 5) can be modeled as a Poisson process.
It arises any time the set has just two properties:
A. All numbers that are possible have equal chances to appear.
B. Knowing that certain numbers appear in the set doesn't tell you anything about whether other numbers appear in the set. In other words, numbers appear independently in the set.
The second assumption is often problematic. For example, it's violated to some degree in the following situations:
SOLUTION: Situation 1: knowing that one car has arrived, you could predict that many other cars are about to arrive. Situation 2: knowing that an unusually-high number of customers have visited in the last minute, you could predict that many other customers are about to visit in the next minute.
Suppose you are working in the Public Relations office at the ridesharing company Ooober. Your company is somewhat prone to scandal. Normally, one or two scandals are okay, but it is very bad when more than 3 scandals happen in one month, because then the media starts to write "pile-on" stories about how scandal-prone Ooober is.
You would like to model scandals and estimate how likely it is for more than 3 scandals to happen in a month.
Run the next cell to load the scandals table. It corresponds to the following schema:
scandals(primary key (scandal_id), time)
The times are given in fractions of a month from January 2014 to December 2015. For simplicity, in this lab, just assume every month has 30 days and every day has 24 hours.
In [43]:
scandals = pd.read_csv('scandals.csv')
scandals.set_index('scandal_id', inplace=True)
scandals
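The times in the table are in those fractional-month units. As a quick illustration (with a made-up value, not a row from the table), converting one of these times into days and hours under the lab's simplification looks like this:
In [ ]:
# Purely illustrative: convert a hypothetical time of 1.5 (months) into
# days and hours, using the lab's 30-day months and 24-hour days.
t_in_months = 1.5
t_in_days = t_in_months * 30   # 45.0 days after the start of January 2014
t_in_hours = t_in_days * 24    # 1080.0 hours
t_in_days, t_in_hours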
The following cell defines a function for viewing timelines of events.
In [3]:
def display_points(points, xlim, title):
    """Displays a timeline with points on it.

    Args:
        points (ndarray): A list of floats in the range [xlim[0], xlim[1]],
            each one a point to display in the timeline.
        xlim (list-like): A list/tuple/array with 2 elements giving the
            start and end of the timeline.
        title (str): The name to display on the plot.
    """
    fig, ax = plt.subplots(1)
    fig.set_size_inches(8, 1)
    ax.scatter(points, np.repeat(0, len(points)))
    ax.axhline(0, color="grey", zorder=-1, lw=.5)
    ax.yaxis.set_visible(False)
    ax.xaxis.set_visible(True)
    ax.set_xlim(xlim)
    ax.set_title("{} ({:d} total points)".format(title, len(points)))
    plt.show()
In [4]:
num_months = 24
display_points(scandals['time'], [0, num_months], 'Scandals at Ooober') #SOLUTION
In [ ]:
average_scandals_per_month = len(scandals) / num_months #SOLUTION
average_scandals_per_month
In [6]:
_ = ok.grade('q3')
_ = ok.backup()
Our goal is to simulate possible future scandal patterns. To do that, we need a model for the scandals, and we need to learn the parameters of that model from our data.
The scandals look somewhat spread out. Suppose we are willing to assume that they follow a Poisson process. Remember, that means:
A. Scandals are equally likely to happen at any time.
B. Scandals happen independently. That means a scandal is no more or less likely to happen tomorrow if one happened today.
How would we simulate future scandals under this model?
If the average number of scandals per month is $r$, then the average number of scandals per day must be $\frac{r}{30}$ (remember, we're assuming there are 30 days in each month), and the average number of scandals per hour must be $\frac{r}{30 \times 24}$. Notice that under assumption A, this is true for all hours, even 3 AM on a Sunday.
In [8]:
average_scandals_per_second = average_scandals_per_month / (30*24*60*60) #SOLUTION
average_scandals_per_second
In [9]:
_ = ok.grade('q4')
_ = ok.backup()
It seems okay to neglect the chance that there are 2 scandals in one second! In that case, whether a scandal happens in a particular second is a Bernoulli($p$) random variable: 1 if there is a scandal and 0 otherwise.
In order for the average number of scandals per second to match your answer to the previous question, what should $p$ be?
In [10]:
p = average_scandals_per_second #SOLUTION
p
In [11]:
_ = ok.grade('q5')
_ = ok.backup()
Draw from a Bernoulli($p$) distribution for each second in two years. Count each 1 you find as a scandal at the start of that second. Set q6_simulated_scandals to a list of those seconds, converted to months since the start of the first year.
Hint: np.random.random(size=n) will generate an array of n numbers uniformly distributed between 0 and 1. Therefore, each element of the array np.random.random(size=n) <= p has an independent chance p to be True and chance 1-p to be False.
Hint 2: You may find the function np.where useful. np.where(array_of_booleans)[0] is an array of indices where array_of_booleans is True. Note the [0].
In [12]:
seconds_in_a_month = 30*24*60*60
q6_simulated_scandals = np.where(np.random.random(size=num_months*seconds_in_a_month) <= p)[0] / seconds_in_a_month #SOLUTION
display_points(q6_simulated_scandals, [0, num_months], "Simulated scandals in 2 years")
q6_simulated_scandals
In [13]:
def draw_approximate_poisson_process(rate, length, steps_per_unit_length):
    """Draws from an approximate Poisson process on [0, length] with the given rate.

    This function simulates the number of times a thing happens, if the
    thing happens or doesn't happen randomly according to a Bernoulli
    distribution at each time step.  The number of time steps simulated is:

        length * steps_per_unit_length.

    The average number of things that happen is:

        rate * length.

    Args:
        rate (float): The average number of times the thing happens per
            unit length.
        length (float): The length of time to be simulated.
        steps_per_unit_length (float): The number of Bernoulli draws per unit length.
            length*steps_per_unit_length must be an integer.
            (That guarantee isn't strictly necessary but
            simplifies the implementation of this function.)

    Returns:
        ndarray: A NumPy array containing the times when the thing happened.
            (If an event happened during a time step, this function can
            choose any time for the event between the time step and the
            next one.)
    """
    # We suggest computing the number of steps like this:
    num_steps = int(np.round(length * steps_per_unit_length))
    p = rate / steps_per_unit_length #SOLUTION
    return np.where(np.random.random(size=num_steps) <= p)[0] / steps_per_unit_length #SOLUTION

draw_approximate_poisson_process(average_scandals_per_month, num_months, seconds_in_a_month)
You might notice that your function takes some time to run. To speed things up, let's switch the granularity of our simulations from seconds to minutes. This will give us a slightly worse approximation to the Poisson process, but it still seems plausible that 2 scandals never happen in the same minute.
Now, by calling your function 100-1000 times, estimate the chance that (over the next 24 months, assuming the rate of scandals remains the same) Ooober experiences 3 or more scandals in at least one month. This is called a Monte Carlo simulation.
Note: This may take up to a minute. Try using fewer simulations when you first run it.
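As an aside, here is the same Monte Carlo pattern applied to a simpler toy question (our example, not one of the lab's questions): estimating the chance of getting 7 or more heads in 10 fair coin flips by simulating many sets of flips.
In [ ]:
# A minimal Monte Carlo sketch (illustrative only), using nothing beyond NumPy:
# estimate the chance of 7 or more heads in 10 fair coin flips.
flips = np.random.random(size=(100000, 10)) < 0.5   # True means heads
num_heads = flips.sum(axis=1)                        # heads per set of 10 flips
np.count_nonzero(num_heads >= 7) / 100000            # close to the exact value, about 0.172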
In [14]:
num_simulations = 1000
minutes_in_a_month = 30*24*60
three_or_more_scandals_count = np.count_nonzero([
    max(np.bincount(draw_approximate_poisson_process(
        average_scandals_per_month, num_months, minutes_in_a_month).astype(int))) >= 3
    for _ in range(num_simulations)])
three_or_more_scandals_chance = three_or_more_scandals_count / num_simulations
three_or_more_scandals_chance
In [16]:
_ = ok.grade('q8')
_ = ok.backup()
Often, we don't care about exactly when events happen, only how many of them happen in a certain interval. A Poisson($\lambda$) random variable (not a process) is the number of events that happen in some Poisson process that has rate*length equal to $\lambda$.
Since draw_approximate_poisson_process has approximately a Poisson process distribution, an expression like:
len(draw_approximate_poisson_process(2, 3, 1000))
...has approximately a Poisson($2 \times 3$) distribution.
However, we can characterize the distribution of your approximation exactly. Your approximation draws from many Bernoulli random variables. The number of events that happen is a sum of those Bernoulli random variables. Therefore, the number of events follows exactly a Binomial distribution!
Recall that the Binomial distribution has two parameters: $n$, the number of Bernoulli random variables that are summed, and $p$, the probability that any one of them is 1. It's written Binomial($n$, $p$). The expected value is $n \times p$. In the cell below, suppose draw_approximate_poisson_process is called with rate r, length l, and steps_per_unit_length d; define n and p so that the number of events it returns follows a Binomial($n$, $p$) distribution.
In [17]:
# Example values for r, l, and d so that this cell can run:
r = .2
l = 10
d = 3
n = l*d #SOLUTION
p = r/d #SOLUTION
In [18]:
_ = ok.grade('q9')
_ = ok.backup()
As d increases, the time steps get narrower, and the Binomial(l*d, r/d) distribution becomes the same as the Poisson(l*r) distribution. In homework 5, you'll see another way to simulate data from the Poisson process that takes advantage of this fact. The cell below uses your draw_approximate_poisson_process function to show this convergence by simulation:
In [72]:
def plot_poisson_approximation(r, l, num_step_sizes, num_simulations):
    import math
    max_output = int(r*l + 6*(r*l)**0.5)
    true_poisson_pmf = [math.exp(-r*l) * (r*l)**k / math.factorial(k) for k in range(max_output)]
    min_steps_per_length = r*2
    steps_per_length = min_steps_per_length * 2**np.arange(num_step_sizes)

    def approximate_pmf(s):
        draws = [len(draw_approximate_poisson_process(r, l, s)) for _ in range(num_simulations)]
        return np.bincount(draws, minlength=max_output) / num_simulations

    approximate_pmfs = [approximate_pmf(s) for s in steps_per_length]
    approximate_pmf_names = ["Approximation to Poisson PMF (Binomial({:d}, {:.4f}))".format(int(np.round(l*s)), r/s)
                             for s in steps_per_length]

    from matplotlib import cm
    colors = [cm.jet(x) for x in np.linspace(0, 1, len(steps_per_length)+1)]
    for pmf, name, color in zip(approximate_pmfs + [true_poisson_pmf], approximate_pmf_names + ["True Poisson({:.4f}) PMF".format(r*l)], colors):
        plt.scatter(range(len(pmf)), pmf, c=color, label=name, s=40)
        plt.plot(range(len(pmf)), pmf, c=color, linestyle='dashed')
    plt.legend()
    plt.title("Approximations to the Poisson distribution\n(dashed lines are for contrast purposes only)")
    plt.xlabel("Count")
    plt.ylabel("Probability")

plot_poisson_approximation(.2, 10, 5, 40000)
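If you want to check the same fact without simulation noise, the exact PMFs can be compared directly. This optional cross-check uses scipy.stats, which the lab doesn't otherwise rely on, with the same rate 0.2 and length 10:
In [ ]:
# Optional cross-check (not one of the lab's questions): the exact
# Binomial(length*steps, rate/steps) PMF is nearly identical to the exact
# Poisson(rate*length) PMF once steps is large.
from scipy import stats
rate, length, steps = 0.2, 10, 1000
ks = np.arange(10)
binomial_pmf = stats.binom.pmf(ks, n=length*steps, p=rate/steps)
poisson_pmf = stats.poisson.pmf(ks, mu=rate*length)
np.max(np.abs(binomial_pmf - poisson_pmf))   # a very small number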
Now let's turn to a new topic entirely: least-squares linear regression. To save time, we won't look at a real-world or even a fictitious problem. Our goal is to get some concrete experience solving least-squares problems with a computer.
In last week's lab, you saw that you could maximize a function on a computer by plotting its values over a range of inputs. A computer can automate that process and improve on it in various ways. This is called numerical optimization. The main numerical optimization tool we'll use is the function scipy.optimize.minimize from the SciPy package. There are many, many others.
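To get a feel for how minimize works before applying it to regression, here is a small sketch on a toy function of our own (not part of the lab's problems) whose minimum we know in advance:
In [ ]:
import scipy.optimize as scopt
# Toy quadratic with its minimum at exactly (1, -2).
def toy_quadratic(z):
    return (z[0] - 1)**2 + (z[1] + 2)**2
toy_result = scopt.minimize(toy_quadratic, x0=np.zeros(2))
toy_result.x   # should be very close to [1., -2.]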
The problem in least-squares linear regression is:
For $n$ entities $i$ = 0 through $n-1$, find a prediction function that predicts their outcomes $y_0$, $y_1$, ..., $y_{n-1}$ well. To help us make predictions, we have available for each entity a vector of $p$ numerical features, $x_i \in \mathcal{R}^p$. We limit ourselves to linear prediction functions: functions that look like $f(x) = \theta \boldsymbol{\cdot} x$ for some parameter vector $\theta$. We would like to find the $f$ that produces the least squared prediction error, averaged over the $n$ entities.
In more succinct mathematical terms (in case you prefer such language), the problem is to find:
$$\text{least squares prediction function} = \arg\min_{f \in \text{linear functions}} L(f; (x_i)_{i=0}^{n-1}, (y_i)_{i=0}^{n-1})$$
where the loss function is:
$$L(f; (x_i)_{i=0}^{n-1}, (y_i)_{i=0}^{n-1}) = \frac{1}{n} \sum_{i=0}^{n-1} (f(x_i) - y_i)^2.$$
It helps to make the dependence of $f$ on parameters $\theta$ more explicit. So we define $f_\theta$ as the linear function $f_\theta(x) = \theta \boldsymbol{\cdot} x$. Then the problem becomes:
$$\theta^{\text{least squares}} = \arg\min_{\theta \in \mathcal{R}^p} \frac{1}{n} \sum_{i=0}^{n-1} (f_\theta(x_i) - y_i)^2.$$
To help us visualize the data and our prediction function, we will take $p = 2$ and $n = 100$, but nothing would change except computation time and accuracy if we increased $p$ or $n$. (With $p = 2$, the prediction for entity $i$ is $f_\theta(x_i) = \theta_0 x_{i0} + \theta_1 x_{i1}$.)
Running the three cells below will create an interactive visualization of the dataset.
In [19]:
data = pd.read_csv("least_squares_data.csv")
data.head()
In [20]:
def display_linear_prediction(data, theta=None, orient_x = 45, orient_y = 45):
    import math
    from matplotlib import patches, cm
    from matplotlib.ticker import LinearLocator, FormatStrFormatter
    from mpl_toolkits.mplot3d import Axes3D

    plot_predictions = theta is not None
    fig = plt.figure()
    ax = fig.gca(projection='3d')
    if plot_predictions:
        actual_xs = data[['x0', 'x1']].as_matrix()
        predicted_ys = np.dot(actual_xs, theta)

        def plot_points_and_residuals(point_indicators, ax, occluded):
            to_plot = data.loc[point_indicators]
            ax.hold(True)
            ax.scatter(to_plot["x0"], to_plot["x1"], to_plot["y"],
                       c='black',
                       zorder=-1 if occluded else 100)
            for i in np.where(point_indicators)[0]:
                x0, x1 = actual_xs[i,:]
                y = data['y'][i]
                predicted_y = predicted_ys[i]
                ax.hold(True)
                ax.plot([x0, x0], [x1, x1], [y, predicted_y],
                        c='red',
                        linestyle='dotted' if occluded else 'solid',
                        lw=1 if occluded else 2)

        # Figuring out which points are in front of the surface:
        orient_x_rad = orient_x*math.pi/180
        orient_y_rad = orient_y*math.pi/180
        viewpoint_coords = [
            np.sin(orient_x_rad)*np.cos(orient_y_rad),
            np.cos(orient_x_rad)*np.cos(orient_y_rad),
            np.sin(orient_y_rad)]
        prediction_surface_normal = [theta[0], theta[1], -1]
        viewpoint_above_surface = np.dot(viewpoint_coords, prediction_surface_normal) >= 0
        point_in_front_of_surface = ((predicted_ys - data['y']) >= 0) == viewpoint_above_surface

        plot_points_and_residuals(~point_in_front_of_surface, ax, True)

        # Plotting the surface:
        xs = np.array(np.meshgrid(
            np.linspace(min(data['x0']), max(data['x0']), 5),
            np.linspace(min(data['x1']), max(data['x1']), 5)))
        ys = np.tensordot(theta, xs, axes=1)
        ax.hold(True)
        prediction_plane = ax.plot_surface(xs[0], xs[1], ys,
                                           cmap=cm.coolwarm)

        plot_points_and_residuals(point_in_front_of_surface, ax, False)

        squared_loss = np.mean((predicted_ys - data['y'])**2)
        plt.title("data, predictions, and residuals\n(current average squared prediction error = {:.2f})".format(squared_loss))
    else:
        prediction_plane = ax.scatter(data["x0"], data["x1"], data["y"],
                                      cmap=cm.coolwarm)
        plt.title("raw data")

    ax.zaxis.set_major_locator(LinearLocator(5))
    ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))
    ax.view_init(orient_y, orient_x)
    ax.set_xlabel("x0 (feature 0)")
    ax.set_ylabel("x1 (feature 1)")
    ax.set_zlabel("y (the thing we're predicting)")
    plt.show()
In [21]:
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
def plot_predictor(theta0, theta1, orient_x, orient_y, plot_predictions):
    theta = [theta0, theta1]
    display_linear_prediction(data, theta if plot_predictions else None, orient_x, orient_y)
theta0 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
theta1 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
orient_x = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="x rotation")
orient_y = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="y rotation")
plot_predictions = widgets.Checkbox(value=True, description="Plot predictions on top of data?")
interact(plot_predictor, theta0=theta0, theta1=theta1, orient_x=orient_x, orient_y=orient_y, plot_predictions=plot_predictions);
Write the objective function least_squares_loss as described in its docstring.
Big hint: The function np.dot and the DataFrame method as_matrix will be useful. Despite the name, np.dot can perform matrix-vector multiplication. And as_matrix converts a DataFrame to a 2d NumPy array. For example,
np.dot(data[['x0', 'x1']].as_matrix(), [-5, 3])
...is an array with one element for each entity. For each entity, the value is -5 times the x0 feature for the entity plus 3 times the x1 feature.
In [22]:
def least_squares_loss(theta):
    """The average squared prediction error when the function

        f: x => theta . x

    is used to predict y for our dataset.

    The dataset is the DataFrame named data.

    Args:
        theta (ndarray): A vector of p numbers.  The prediction function
            is f: x => theta . x, the dot product with theta.

    Returns:
        float: The average (over the DataFrame named data) of the
            squared prediction error.
    """
    # Our solution defined an array called predictions; you
    # don't have to.
    predictions = np.dot(data[['x0', 'x1']].as_matrix(), theta) #SOLUTION
    return np.mean((data['y'] - predictions)**2) #SOLUTION
In [23]:
_ = ok.grade('q11')
_ = ok.backup()
Import the scipy.optimize module as scopt and call scopt.minimize on least_squares_loss. Name the result optimization_result and study it.
Hint: You'll need to provide one additional argument: an initial guess. It's best to start with the origin (np.zeros((2))) or a random location near the origin (np.random.normal(0, 1, size=2)).
In [24]:
import scipy.optimize as scopt
optimization_result = scopt.minimize(least_squares_loss, x0=np.random.normal(0, 1, size=2)) #SOLUTION
optimization_result
In [25]:
_ = ok.grade('q12')
_ = ok.backup()
SOLUTION: In optimization_result, x is the optimal $\theta$ and fun is the corresponding prediction error (the value of least_squares_loss at that $\theta$).
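As a sanity check on the optimizer, this particular problem also has a direct solution. The sketch below is optional and uses np.linalg.lstsq, which isn't introduced in this lab; it should produce nearly the same $\theta$ as optimization_result.x.
In [ ]:
# Optional cross-check: solve the same least-squares problem directly.
X = data[['x0', 'x1']].as_matrix()
theta_direct = np.linalg.lstsq(X, data['y'])[0]
theta_direct   # should be very close to optimization_result.x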
The three cells below create two interactive graphs; we've added a heat map displaying your least_squares_loss function. Plug in the value you found for $\theta^{\text{least squares}}$ (or as close as you can get) in both. Rotate the data graph to see how well the prediction function fits.
Note: Unfortunately, the heat map is a little slow to display and update.
In [299]:
import math  # math.ceil is used below

windowsize = 5
maxloss = least_squares_loss([windowsize, -windowsize])
losses_flattened = pd.DataFrame.from_records(
    [(t0, t1)
     for t0 in np.linspace(-windowsize, windowsize, 30)
     for t1 in np.linspace(-windowsize, windowsize, 30)
     for _ in range(int(math.ceil((maxloss-least_squares_loss([t0, t1]))/(maxloss/10))))],
    columns=["theta0", "theta1"])

def display_loss_heatmap(theta):
    sns.kdeplot(losses_flattened['theta0'], losses_flattened['theta1'], shade=True, n_levels=100, color='red')
    plt.scatter(theta[0], theta[1], color='orange', s=100, label="proposed theta")
    plt.xlim([-windowsize, windowsize])
    plt.ylim([-windowsize, windowsize])
    plt.gca().set_aspect('equal', adjustable='box')
    plt.xlabel("theta0")
    plt.ylabel("theta1")
    plt.title("average squared prediction error by theta\n(darker is lower)")
    plt.legend()

def plot_loss(theta0, theta1):
    theta = [theta0, theta1]
    display_loss_heatmap(theta)
In [300]:
theta0 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
theta1 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
orient_x = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="x rotation")
orient_y = widgets.FloatSlider(min=0, max=90, step=1, value=45, description="y rotation")
plot_predictions = widgets.Checkbox(value=True, description="Plot predictions on top of data?")
interact(plot_predictor, theta0=theta0, theta1=theta1, orient_x=orient_x, orient_y=orient_y, plot_predictions=plot_predictions);
In [301]:
theta0 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
theta1 = widgets.FloatSlider(min=-3, max=3, step=.1, value=0)
interact(plot_loss, theta0=theta0, theta1=theta1);
In [ ]:
i_finished_the_lab = False
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()
Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line of the cell runs git push, which will send your work to your personal GitHub repo.
# Tell git to commit your changes to this notebook
git add sp17/lab/lab09/lab09.ipynb
# Tell git to make the commit
git commit -m "lab09 finished"
# Send your updates to your personal private repo
git push origin master