In [3]:
import numpy as np
import scipy.stats as ss  # used below for the hypergeometric and normal likelihoods
import matplotlib.pyplot as plt
%matplotlib inline
Technically... officially...
$ P(hypothesis\ |\ data) = \frac{P(data\ |\ hypothesis)\ *\ P(hypothesis)}{P(data)}\ $
But in practice we use it like this:
$ P(hypothesis\ |\ data) \propto P(data\ |\ hypothesis) * P(hypothesis) $
Why? At the end of the day, $P(data)$ is just a normalizing constant: it's what makes the posterior probabilities sum to one.
The entire output of a factory is produced on three machines. The three machines account for different amounts of the factory output, namely 20%, 30%, and 50%. The fraction of defective items produced is this: for the first machine, 5%; for the second machine, 3%; for the third machine, 1%. If an item is chosen at random from the total output and is found to be defective, what is the probability that it was produced by the third machine?
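Working it through with the full denominator (the sum of likelihood times prior over all three machines):
$$ P(machine_3\ |\ defective) = \frac{0.01 \cdot 0.50}{0.05 \cdot 0.20 + 0.03 \cdot 0.30 + 0.01 \cdot 0.50} = \frac{0.005}{0.024} \approx 0.21 $$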
But $P(D)$ isn't necessary... what if we did this instead?
In [2]:
# Numerator of Bayes' theorem for each machine: P(defective | machine) * P(machine).
# The commented-out denominator is P(data); we skip it.
p_m1 = (0.05*0.2)  # /(.2*0.05+0.3*0.03+0.5*0.01)
p_m2 = (0.03*0.3)
p_m3 = (0.01*0.5)
p_total = p_m1 + p_m2 + p_m3
p_m1 / p_total, p_m2 / p_total, p_m3 / p_total
Out[2]:
Exactly the same answer... notice that $P(D)$ is literally the sum of those numerators (likelihood times prior for each machine), and dividing by it is what normalizes them into probabilities.
How many people in this room are from NYC?
Assuming there are 1,000 people at the conference, how many people do we think are locals?
First our assumptions:
In [4]:
conference_population = 1000
room_population = 50
locals_in_room_population = 15
hypothesis_count = conference_population + 1
Let's also assume that for our room it is a completely random assignment of locals and non-locals. IID.
Naively, I'm going to just include a hypothesis for every possible value even if it's logically impossible. We'll handle those cases with our likelihood function. Let's create a list of all of our hypotheses:
In [5]:
hypothesis_values = np.linspace(0,1000, num=hypothesis_count)
Let's count the number of hypotheses we have
In [6]:
hypothesis_count = len(hypothesis_values)
And let's weight each hypothesis according to our prior belief of how likely each is. For now, we'll just weight every hypothesis equally (known as an uninformative prior).
In [7]:
hypothesis_likelihoods = np.ones(hypothesis_count) / hypothesis_count
We divide by the hypothesis count so the weights sum to one. This is a visualization of our prior distribution:
In [8]:
plt.ylim(0,0.0011)
plt.plot(hypothesis_likelihoods);
Next, we need to come up with a likelihood function we can use to evaluate each hypothesis. With some thought we can rule out certain hypotheses: any hypothesis with fewer than 15 locals at the entire conference is impossible, because we already see 15 locals in our room. The hypergeometric likelihood below handles those cases for us by assigning them zero probability.
Now let's say we observe 15 locals out of the 50 people in our room, at a conference with 1,000 attendees... How likely is that observation if there were exactly 15 locals at the entire conference?
Well, for each attendee in our room we would calculate the probability of them having ended up in the room. How so?
For each person in the room we calculate the probability of choosing that person at random out of the 1,000. So the probability you're a local and she's a local and he's a local and he's not and she's not and... Mathematically (this is the hypergeometric distribution) that looks like this:
In [39]:
ss.hypergeom.pmf(n=15, M=conference_population, k=locals_in_room_population, N=room_population)
Out[39]:
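As an aside (not part of the original flow), the same number falls out of the counting argument directly, using scipy.special.comb:
from scipy.special import comb
# P(all 15 hypothesized locals end up among the 50 people in the room)
# = C(15, 15) * C(985, 35) / C(1000, 50)
comb(15, 15) * comb(985, 35) / comb(1000, 50)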
What if there were 16 locals at the conference? We'd expect that to increase the likelihood, right?
In [42]:
ss.hypergeom.pmf(n=16, M=conference_population, k=locals_in_room_population, N=room_population)
Out[42]:
Quite a bit bigger! We could keep going one hypothesis at a time, but it would take a while. Instead, let's code this into a likelihood function and write a loop.
The code below should look a lot like the code above.
Next we loop through our hypotheses and evaluate each one.
In [35]:
hypotheses = np.linspace(0, 1000, num=1001)
hypothesis_likelihoods = ss.hypergeom.pmf(n=hypotheses, M=conference_population, k=locals_in_room_population, N=room_population)
Lastly we normalize our probabilities and plot the answer.
In [36]:
hypothesis_probabilities = hypothesis_likelihoods / hypothesis_likelihoods.sum()
plt.figure(figsize=(8,8))
plt.title('Local Attendee Probability Distribution', size=18)
plt.xlabel('Hypothesized number of local attendees', size=16)
plt.ylabel('Probability of each hypothesis', size=16)
plt.plot(hypothesis_probabilities);
Now we can estimate how many locals we think are here!
In [37]:
lower_bound = np.argmax(hypothesis_probabilities.cumsum() > 0.005)
upper_bound = np.argmax(hypothesis_probabilities.cumsum() > 0.995)
mode_index = np.argmax(hypothesis_probabilities)
hypothesis_values[lower_bound], hypothesis_values[mode_index], hypothesis_values[upper_bound]
Out[37]:
So given our assumptions there's a 99% chance that there are between 164 and 477 locals at PyData! If we had to guess a single number, the most likely estimate would be 300, but as the result below shows, there's only about a 0.6% chance of it being exactly that.
In [38]:
max(hypothesis_probabilities)
Out[38]:
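As a quick sanity check (an illustrative aside, reusing the variables defined above), we can confirm how much probability mass sits between those bounds:
# Roughly 0.99 by construction of the bounds above
hypothesis_probabilities[lower_bound:upper_bound + 1].sum()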
There are well-known likelihood distributions for many common situations like this one. Want to learn more and/or see the math behind them? A good reference is Mathematical Statistics with Applications (7th edition).
In the previous example we had too many hypotheses to calculate by hand, but still a finite number of them, because we were restricted to integer solutions. What happens when we care about solutions at arbitrary granularity?
We'll start by approaching the problem the same way and note the challenges.
A common business problem is to find the min/max of some function so that you can optimize revenue or costs. For this reason, my favorite toy example is modeling a parabola to some data.
Imagine we are tasked with optimizing our revenue for creating Super Widgets. We want to know what the optimum number of hours are for our workers to work in a day in order to maximize our revenue. We have a fixed staff size and are subject to overtime laws which is where the optimization comes into play.
Now! We've run an experiment in which we varied how many hours the average worker works in a day. Using this data, let's see if we can recommend an optimum.
Let's make up some hypothetical data so we can visualize the problem better.
In [18]:
import numpy as np, scipy as sp, seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
true_a = -3
true_h = 9.25
true_k = 25000
true_sigma = 10
x_data = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12])
observed_y = true_a * (x_data - true_h)**2 + true_k + np.random.normal(loc=0, scale=true_sigma, size=len(x_data))
Then visualize it
In [19]:
observed_y
Out[19]:
In [20]:
plt.figure(figsize=[8,8])
plt.xlabel('Hours Worked', size=14)
plt.ylabel('Net Revenue', size=14)
plt.title('Hours worked vs net revenue', size=16)
plt.scatter(x_data, observed_y);
With the naked eye we might say that 8 hour days are perfect... or maybe 11 hours? #eek Let's see what happens when we model this.
Since we're optimizing revenue, maybe we could assume that $$ y = a(x - h)^2 + k $$
On top of that... we can see the data looks parabolic and we want to find the maximum point. The nice thing about modeling this with a parabola is that we can write the equation such that the exact information we want is just a parameter.
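To make that concrete, the vertex form puts the maximum (when $a < 0$) exactly at $x = h$:
$$ \frac{dy}{dx} = 2a(x - h) = 0 \quad \Rightarrow \quad x = h $$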
In [21]:
import itertools
# Come up with a hypothesis for each parameter
a_hypotheses = np.array([-4, -3, -2, -1])
h_hypotheses = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
k_hypotheses = np.array([18000, 20000, 23000, 25000, 26000, 28000, 30000])
sigma_hypotheses = np.array([5, 10, 15, 20, 25])
Any parameter value we aren't explicitly listing here, we are implicitly saying is impossible.
That's fine for now.
How do we use what we've gathered so far to evaluate probabilities? We use all of our hypotheses above to generate EVERY unique hypothesis in our hypothesis space.
In [22]:
# Create list of every possible combination of them.
hypotheses = list(itertools.product(a_hypotheses, h_hypotheses, k_hypotheses, sigma_hypotheses))
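It's worth checking how quickly this grid grows (this line is just an illustrative check, not part of the original cells):
# 4 values of a * 16 of h * 7 of k * 5 of sigma = 2240 hypotheses
len(hypotheses)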
Now that we have all of our hypotheses, we can continue the way we did in the previous example. We just need two things: a prior over the hypotheses and a likelihood function to evaluate them against the data.
We can start with a simple non-informative prior.
In [23]:
prior = np.ones(len(hypotheses))
prior
Out[23]:
Next we need a way to assign a probability to each hypothesis, i.e. a likelihood function. We can set up a probabilistic model around the error term: our mental model is that if we fit the model perfectly, we'd expect the errors to be normally distributed.
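In symbols (matching the code that follows), the likelihood of the data under a particular hypothesis is the product of normal densities of the residuals:
$$ P(data\ |\ a, h, k, \sigma) = \prod_{i} \mathcal{N}\!\left(y_i - \left(a(x_i - h)^2 + k\right)\ ;\ 0,\ \sigma\right) $$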
The following cells illustrate the computation for a single hypothesis and a single data point; afterwards we'll wrap the same steps in a loop over every hypothesis and every observation.
In [24]:
for i, hypothesis in enumerate(hypotheses):
    if hypothesis == (-3, 9, 25000, 5):
        print(i)
        break
In [25]:
a_hypothesis, h_hypothesis, k_hypothesis, sigma_hypothesis = hypotheses[890]
x_datum, y_datum = list(zip(x_data, observed_y))[0]
predicted_y = a_hypothesis * (x_datum - h_hypothesis)**2 + k_hypothesis
prediction_error = y_datum - predicted_y
# On average, the error should be centered around zero
# just like in a standard regression (aka normal residuals).
y_probability = ss.norm.pdf(0, loc=prediction_error, scale=sigma_hypothesis)
y_probability
Out[25]:
This is a probability density, NOT a probability! It needs to be normalized... and before it can be normalized we need to compute likelihoods for all of our hypotheses.
We know this is (essentially) the best hypothesis in our grid because we generated the data. For each hypothesis we will multiply together the likelihood of every data point under that hypothesis... but watch what happens even at the most likely hypothesis:
In [26]:
y_probability ** len(observed_y)
Out[26]:
And if we had 100x as many observations...
In [27]:
y_probability ** (100*len(observed_y))
Out[27]:
Completely zero. In probability zero and one are certainty and no amount of data will ever change them. These are bad values we want to avoid. We'll come back to this.
In [28]:
# Start with an uninformative prior
# Every hypothesis gets an equal footing
hypothesis_likelihoods = np.array(prior)
for i, hypotheses_tuple in enumerate(hypotheses):
    a_hypothesis, h_hypothesis, k_hypothesis, sigma_hypothesis = hypotheses_tuple
    for x_datum, y_datum in zip(x_data, observed_y):
        # For the given hypothetical value of each parameter
        # let's compute a prediction and the error
        predicted_y = a_hypothesis * (x_datum - h_hypothesis)**2 + k_hypothesis
        prediction_error = y_datum - predicted_y
        # On average, the error should be centered around zero
        # just like in a standard regression (aka normal residuals).
        y_probability = ss.norm.pdf(0, loc=prediction_error, scale=sigma_hypothesis)
        # Multiply probability density
        hypothesis_likelihoods[i] *= y_probability
In [29]:
evaluated_hypotheses = list(zip(hypotheses, hypothesis_likelihoods))
In [30]:
max(evaluated_hypotheses, key=lambda x: x[-1])
Out[30]:
So this means the most likely estimates for the parameter values are:
a = -4
h = 9
k = 25000
sigma = 5
These aren't perfect! But that's fine... the only value we really care about (h) is pretty darn close to the true 9.25. Also, the last value in that tuple is supposed to represent the likelihood of this solution, but there's no way that's a probability! We never normalized.
In [31]:
normalized_hypothesis_likelihoods = hypothesis_likelihoods / hypothesis_likelihoods.sum()
evaluated_normalized_hypotheses = list(zip(hypotheses, normalized_hypothesis_likelihoods))
max(evaluated_normalized_hypotheses, key=lambda x: x[-1])
Out[31]:
Same solution, but now we have a proper probability across all of our hypotheses. The underflow problem is still lurking, though: multiplying many small densities drives the product to exactly zero, and we can't just switch to log probabilities and exponentiate at the end either, because very negative log values also underflow:
In [32]:
np.exp([-5000, -5001, -5002])
Out[32]:
In [33]:
def log_normalize(log_p):
    # Shift by the max before exponentiating to avoid underflow, then normalize.
    shifted_p = np.exp(log_p - np.max(log_p))
    normalized_p = shifted_p / shifted_p.sum()
    return normalized_p
In [34]:
log_normalize([-5000, -5001, -5002])
Out[34]:
In [35]:
def fit_data(prior, hypotheses, x_data, observed_y):
    # Work in log space: start from the log of the prior
    hypothesis_likelihoods = np.log(prior)
    for i, hypotheses_tuple in enumerate(hypotheses):
        a_hypothesis, h_hypothesis, k_hypothesis, sigma_hypothesis = hypotheses_tuple
        for x_datum, y_datum in zip(x_data, observed_y):
            predicted_y = a_hypothesis * (x_datum - h_hypothesis)**2 + k_hypothesis
            prediction_error = y_datum - predicted_y
            y_probability = ss.norm.pdf(0, loc=prediction_error, scale=sigma_hypothesis)
            # Add log densities instead of multiplying raw densities
            hypothesis_likelihoods[i] += np.log(y_probability)
    # Normalize in log space (shift by the max, exponentiate, divide by the sum)
    normalized_hypothesis_likelihoods = log_normalize(hypothesis_likelihoods)
    evaluated_normalized_hypotheses = list(zip(hypotheses, normalized_hypothesis_likelihoods))
    return evaluated_normalized_hypotheses
fixed_evaluated_normalized_hypotheses = fit_data(np.ones(len(hypotheses)), hypotheses, x_data, observed_y)
max(fixed_evaluated_normalized_hypotheses, key=lambda x: x[-1])
Out[35]:
Let's look at the different hypotheses weighted by their probability
In [36]:
X = np.linspace(4, 12, num=9)
for hypothesis, probability in fixed_evaluated_normalized_hypotheses:
    if probability == 0:
        continue
    a, h, k, sigma = hypothesis
    Y = a * (X - h)**2 + k
    plt.plot(X, Y, alpha=probability);
Challenges of the above: the grid of hypotheses grows combinatorially as we add parameters or make the grid finer, and we can only ever consider the values we explicitly enumerated. This is what pushes us toward sampling methods.
Let's use MCMC to solve a very simple problem: Inferring the mean of data we assume to be gaussian.
For simplicity, we'll start the chain at our prior mean.
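Concretely, the model matches the code below: the data standard deviation is fixed at 1 and we put a normal prior on the unknown mean.
$$ \mu \sim \mathcal{N}(\mu_{0},\ 1), \qquad x_i\ |\ \mu \sim \mathcal{N}(\mu,\ 1) $$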
In [37]:
prior_m = 0
prior_m_distribution = ss.norm(loc=prior_m, scale=1.0)
m_current = prior_m
proposal_width = 0.5
When choosing the next proposal for a variable, we draw from a proposal distribution centered at the current value with some limited width. Here we'll use a normal distribution with a standard deviation of 0.5. Technically it's unbounded, which lets us explore the whole space, but most proposals stay at reasonable values.
In [38]:
m_proposal = ss.norm(loc=m_current, scale=proposal_width).rvs()
We want to find $P_{accept} = \frac{P(\mu_{proposed}\ |\ data)}{P(\mu_{current}\ |\ data)}$. If $P_{accept} \geq 1$ we accept the proposal and add it to our trace; otherwise we accept it with probability $P_{accept}$.
Find the likelihood of our data at both the current value of the variable and at the proposed value. Let's also evaluate how probable the current and proposed values are under our prior. NOTE: What would happen if you chose a prior that was wildly different from the true value?
Here we need $P(\mu_{current}\ |\ data) = \frac{P(data\ |\ \mu_{current})\ *\ P(\mu_{current})}{P(data)}$. When we divide $P(\mu_{proposed}\ |\ data)$ by $P(\mu_{current}\ |\ data)$ the $P(data)$ cancels out, so we can avoid computing it!
In [39]:
prior_distribution = ss.norm(loc=prior_m, scale=1.0)
# NOTE: `data` is the observed sample whose mean we're inferring;
# it gets generated a few cells below (In [41]), so run that cell first.
# Make a proposal
m_proposal = ss.norm(loc=m_current, scale=proposal_width).rvs()
# Log-likelihood of the data given the proposal for m
m_proposal_likelihood = ss.norm(loc=m_proposal, scale=1).logpdf(data).sum()
# Log-likelihood of the data given the current m
m_current_likelihood = ss.norm(loc=m_current, scale=1).logpdf(data).sum()
# Log prior density of m_current and m_proposal
m_current_prior_likelihood = prior_distribution.logpdf(m_current)
m_proposal_prior_likelihood = prior_distribution.logpdf(m_proposal)
# Compute the (log of the) proportional probability for each
m_current_probability = m_current_likelihood + m_current_prior_likelihood
m_proposal_probability = m_proposal_likelihood + m_proposal_prior_likelihood
# Difference of logs = log of the acceptance ratio
m_accept_probability = m_proposal_probability - m_current_probability
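To finish the walk-through (a sketch; the full sampler below does exactly this), we'd exponentiate the log ratio and compare it with a uniform draw:
# m_accept_probability above is a log ratio, so exponentiate before comparing.
# Ratios >= 1 are always accepted; smaller ratios are accepted with that probability.
if np.random.rand() < np.exp(m_accept_probability):
    m_current = m_proposal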
In [40]:
import numpy as np
import scipy as sp
import pandas as pd
import scipy.stats as ss
def sampler(data, samples=1000, mu_prior_mu=2, mu_prior_sd=1.):
    mu_current = mu_prior_mu  # start the chain at the prior mean
    posterior = [mu_current]
    prior_distribution = ss.norm(loc=mu_prior_mu, scale=mu_prior_sd)
    for i in range(samples):
        # Suggest a new position.
        # How we come up with a proposal can be uniform random,
        # normal (like it is here) or something else.
        # Doesn't matter as far as making a functional algo.
        proposal_width = 1
        mu_proposal = ss.norm(loc=mu_current, scale=proposal_width).rvs()
        # What is the likelihood of the observed data given mu_current?
        # Compute the log-likelihood by summing the log probability of each data point.
        likelihood_current = ss.norm(loc=mu_current, scale=1).logpdf(data).sum()
        # What is the likelihood of the observed data given mu_proposal?
        likelihood_proposal = ss.norm(loc=mu_proposal, scale=1).logpdf(data).sum()
        # Compute the (log) prior probability of the current and proposed mu
        # using the same prior distribution for both.
        prior_current = prior_distribution.logpdf(mu_current)
        prior_proposal = prior_distribution.logpdf(mu_proposal)
        # Here we're finding log(P(data | mu) * P(mu))
        # for both mu_current and mu_proposal.
        p_current = likelihood_current + prior_current
        p_proposal = likelihood_proposal + prior_proposal
        # Accept proposal?
        # By dividing these two the P(data) normalizing term cancels out.
        p_accept = np.exp(p_proposal - p_current)
        # Choose a real between 0 and 1 and accept mu_proposal
        # if the ratio above is greater than this random number
        # (ratios >= 1 are always accepted).
        accept = np.random.rand() < p_accept
        if accept:
            # Update position
            mu_current = mu_proposal
        # This posterior list is actually our sampler trace: it records
        # the current position every iteration, accepted or not.
        posterior.append(mu_current)
    return posterior
In [41]:
data = np.random.normal(loc=5.0, scale=1.0, size=20)
samples = sampler(data, samples=10000)
In [42]:
plt.hist(samples[200:], bins=50);
In [43]:
plt.plot(samples[200:]);
This is a VERY simple implementation of the Metropolis algorithm. As you work on more and more complex models you will find the simpler algorithms run into problems. In particular, Metropolis assumes a symmetric proposal distribution; Metropolis-Hastings adds a correction that handles asymmetric proposals.
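For reference, Metropolis-Hastings multiplies the acceptance ratio by a correction term for the (possibly asymmetric) proposal density $q$:
$$ P_{accept} = \min\!\left(1,\ \frac{P(\mu_{proposed}\ |\ data)\ q(\mu_{current}\ |\ \mu_{proposed})}{P(\mu_{current}\ |\ data)\ q(\mu_{proposed}\ |\ \mu_{current})}\right) $$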
Now with some tooling, let's return to our revenue optimization problem from above. Here's how we can model it in PyMC3.
In [44]:
import pymc3 as pm
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=10)
    h = pm.Normal('h', mu=8, sd=10)
    k = pm.Normal('k', mu=24000, sd=1000)
    sigma = pm.Lognormal('sigma', mu=2, sd=5)
    y_pred = a * (x_data - h)**2 + k
    error = observed_y - y_pred
    pm.Normal('error', mu=0, sd=sigma, observed=error)
    trace = pm.sample(draws=15000, n_init=25000)
In [45]:
pm.traceplot(trace[500:]);
In [46]:
pm.summary(trace[500:])
Out[46]:
In [47]:
plt.figure(figsize=(8,8))
plt.title('Observed vs Predicted', size=16)
sampled_a = trace[14000:]['a']
sampled_h = trace[14000:]['h']
sampled_k = trace[14000:]['k']
x_eval = np.linspace(4, 12, num=50)
for a_sample, h_sample, k_sample in zip(sampled_a, sampled_h, sampled_k):
    y_eval = a_sample * (x_eval - h_sample)**2 + k_sample
    plt.plot(x_eval, y_eval, alpha=0.0065, c='k', linewidth=3.0)
plt.scatter(x_data, observed_y);
We get pretty close to what we know the optimal setting to be! The true optimum is 9.25 and this gives us roughly 9.75. We also get a sense of the uncertainty around that estimate.
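To put a number on that uncertainty (an illustrative check using the trace from above, not part of the original cells), we can read a 95% credible interval for h straight off the samples:
# 2.5th / 50th / 97.5th percentiles of the posterior for h,
# discarding the first 500 draws as warm-up (same cut as the traceplot above).
np.percentile(trace[500:]['h'], [2.5, 50, 97.5])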
Imagine you have a logistic regression model that predicts which delivery drivers will be best to deliver a product to your customer. You build the model, but you can't pick the probability threshold to use, even through cross-validation, because the downstream effect depends on complex human behavior.
Instead we design an experiment:
Try many thresholds randomly and infer the optimum using a mathematical model.
Think through the effect of the model. When the threshold was close to one, the eligible drivers would be the same as in production; in other words, a sort of control. As the threshold was lowered, drivers would start to be filtered out... but lower it too much and nearly all drivers would be filtered out and delivery performance would degrade.
We can develop a simple mathematical model to encapsulate this:
In [48]:
X = np.linspace(0.13, 1, num=101)
Y = 5*((X - 0.36)**2) / (X - 0.12) + 10
plt.figure(figsize=(8,8))
plt.title('Hypothetical Response', size=16)
plt.ylim(0, 20)
plt.xlim(0,1)
plt.xlabel('Threshold value', size=14)
plt.ylabel('Delivery performance', size=14)
plt.plot(X, Y);
In [49]:
X_ = np.linspace(0.13, 1, num=30)
Y_ = 3*((X_ - 0.36)**2) / (X_ - 0.12) + 10 + np.random.normal(loc=0, scale=2, size=len(X_))
In [50]:
plt.figure(figsize=(8,8))
plt.xlim(0,1)
plt.ylim(0,30)
plt.scatter(X_, Y_, s=10.0);
In [51]:
import pymc3 as pm
with pm.Model() as m:
    a = pm.Uniform('a', lower=0.0, upper=20.0)
    h = pm.Uniform('h', lower=0.0, upper=1.0)   # pm.Normal('h', mu=0.36, sd=0.05)
    k = pm.Uniform('k', lower=0.0, upper=1.0)   # pm.Normal('k', mu=0.5, sd=0.05)
    C = pm.Uniform('C', lower=1.0, upper=30.)   # pm.Normal('C', mu=8, sd=1)
    sigma = pm.Lognormal('sigma', mu=2, sd=2)
    y_pred = a * ((X_ - h)**2) / (X_ - k) + C
    error = Y_ - y_pred
    pm.Normal('error', mu=0, sd=sigma, observed=error)
    trace = pm.sample(draws=15000, tune=1000)
In [52]:
pm.traceplot(trace);
In [53]:
pm.summary(trace)
Out[53]:
In [54]:
X_ = np.linspace(0.13, 1, num=30)
Y_ = 3*((X_ - 0.36)**2) / (X_ - 0.12) + 10 + np.random.normal(loc=0, scale=2, size=len(X_))
In [55]:
plt.scatter(X_, Y_);
In [56]:
import numpy as np
import pymc3 as pm
with pm.Model() as m:
    a = pm.Uniform('a', lower=0.0, upper=10.0)
    h = pm.Normal('h', mu=0.5, sd=0.05)
    k = pm.Normal('k', mu=0.2, sd=0.05)
    C = pm.Normal('C', mu=8, sd=1)
    sigma = pm.Uniform('sigma', lower=1, upper=30)
    y_pred = a * ((X_ - h)**2) / (X_ - k) + C
    error = Y_ - y_pred
    pm.Normal('error', mu=0, sd=sigma, observed=error)
    trace = pm.sample(draws=15000, tune=1500)
In [57]:
pm.traceplot(trace);
In [58]:
pm.summary(trace)
Out[58]:
In [59]:
plt.figure(figsize=(8,8))
plt.title('Observed vs Predicted', size=16)
sampled_a = trace[14000:]['a']
sampled_h = trace[14000:]['h']
sampled_k = trace[14000:]['k']
sampled_C = trace[14000:]['C']
x_eval = np.linspace(0.13, 1.0, num=50)
for a_sample, h_sample, k_sample, C_sample in list(zip(sampled_a, sampled_h, sampled_k, sampled_C))[:1000]:
    y_eval = a_sample * ((x_eval - h_sample)**2) / (x_eval - k_sample) + C_sample
    plt.plot(x_eval, y_eval, alpha=0.02, c='k', linewidth=1.0)
plt.scatter(X_, Y_);
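Since the model was written so that the quantity we care about is a parameter (the fitted curve's turning point sits at h for thresholds above k), the posterior for h is a direct read-out of the candidate optimal threshold. A quick sketch of how we might summarize it, reusing the trace from above:
# Posterior summary for the turning-point parameter h (the candidate optimal threshold),
# using the same post-warm-up slice of the trace as the plot above.
np.percentile(trace[14000:]['h'], [2.5, 50, 97.5])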