In [3]:
import numpy as np
import scipy.stats as ss  # used below for the hypergeometric and normal likelihoods
import matplotlib.pyplot as plt
%matplotlib inline
Technically... officially...
$ P(hypothesis\ |\ data) = \frac{P(data\ |\ hypothesis)\ *\ P(hypothesis)}{P(data)}\ $
But in practice we use it like this:
$ P(hypothesis\ |\ data) \propto P(data\ |\ hypothesis) * P(hypothesis) $
Why? At the end of the day, $P(data)$ is just a normalizing constant: it's what makes the posterior probabilities sum to one.
The entire output of a factory is produced on three machines. The three machines account for different amounts of the factory output, namely 20%, 30%, and 50%. The fraction of defective items produced is this: for the first machine, 5%; for the second machine, 3%; for the third machine, 1%. If an item is chosen at random from the total output and is found to be defective, what is the probability that it was produced by the third machine?
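Working it through with the full denominator (the sum of likelihood times prior over all three machines):
$$ P(machine_3\ |\ defective) = \frac{0.01 \cdot 0.50}{0.05 \cdot 0.20 + 0.03 \cdot 0.30 + 0.01 \cdot 0.50} = \frac{0.005}{0.024} \approx 0.21 $$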
But $P(D)$ isn't necessary... what if we did this instead?
In [2]:
# Numerator of Bayes' theorem for each machine: P(defective | machine) * P(machine).
# The commented-out denominator is P(data); we skip it.
p_m1 = (0.05*0.2)  # /(.2*0.05+0.3*0.03+0.5*0.01)
p_m2 = (0.03*0.3)
p_m3 = (0.01*0.5)
p_total = p_m1 + p_m2 + p_m3
p_m1 / p_total, p_m2 / p_total, p_m3 / p_total
Out[2]:
Exactly the same answer... notice that $P(D)$ is literally the sum of those numerators (likelihood times prior for each machine), and dividing by it is what normalizes them into probabilities.
How many people in this room are from NYC?
Assuming there are 1,000 people at the conference, how many people do we think are locals?
First our assumptions:
In [4]:
conference_population = 1000
room_population = 50
locals_in_room_population = 15
hypothesis_count = conference_population + 1
Let's also assume that for our room it is a completely random assignment of locals and non-locals. IID.
Naively, I'm going to just include a hypothesis for every possible value even if it's logically impossible. We'll handle those cases with our likelihood function. Let's create a list of all of our hypotheses:
In [5]:
hypothesis_values = np.linspace(0,1000, num=hypothesis_count)
Let's count the number of hypotheses we have
In [6]:
hypothesis_count = len(hypothesis_values)
And let's weight each hypothesis according to our prior belief of how likely each is. For now, we'll just weight every hypothesis equally (known as an uninformative prior).
In [7]:
hypothesis_likelihoods = np.ones(hypothesis_count) / hypothesis_count
We divide by the hypothesis count so the weights sum to one. This is a visualization of our prior distribution:
In [8]:
plt.ylim(0,0.0011)
plt.plot(hypothesis_likelihoods);
Next, we need to come up with a likelihood function we can use to evaluate each hypothesis. With some thought we can rule out certain hypotheses: any hypothesis with fewer than 15 locals at the entire conference is impossible, because we already see 15 locals in our room. The hypergeometric likelihood below handles those cases for us by assigning them zero probability.
Now let's say we observe 15 locals out of the 50 people in our room, at a conference with 1,000 attendees... How likely is that observation if there were exactly 15 locals at the entire conference?
Well, for each attendee in our room we would calculate the probability of them having ended up in the room. How so?
For each person in the room we calculate the probability of choosing that person at random out of the 1,000. So the probability you're a local and she's a local and he's a local and he's not and she's not and... Mathematically (this is the hypergeometric distribution) that looks like this:
In [39]:
ss.hypergeom.pmf(n=15, M=conference_population, k=locals_in_room_population, N=room_population)
Out[39]:
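As an aside (not part of the original flow), the same number falls out of the counting argument directly, using scipy.special.comb:
from scipy.special import comb
# P(all 15 hypothesized locals end up among the 50 people in the room)
# = C(15, 15) * C(985, 35) / C(1000, 50)
comb(15, 15) * comb(985, 35) / comb(1000, 50)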
What if there were 16 locals at the conference? We'd expect that to increase the likelihood, right?
In [42]:
ss.hypergeom.pmf(n=16, M=conference_population, k=locals_in_room_population, N=room_population)
Out[42]:
Quite a bit bigger! We could keep going one hypothesis at a time, but it would take a while. Instead, let's code this into a likelihood function and write a loop.
The code below should look a lot like the code above.
Next we loop through our hypotheses and evaluate each one.
In [35]:
hypotheses = np.linspace(0, 1000, num=1001)
hypothesis_likelihoods = ss.hypergeom.pmf(n=hypotheses, M=conference_population, k=locals_in_room_population, N=room_population)
Lastly we normalize our probabilities and plot the answer.
In [36]:
hypothesis_probabilities = hypothesis_likelihoods / hypothesis_likelihoods.sum()
plt.figure(figsize=(8,8))
plt.title('Local Attendee Probability Distribution', size=18)
plt.xlabel('Hypothesized number of local attendees', size=16)
plt.ylabel('Probability of each hypothesis', size=16)
plt.plot(hypothesis_probabilities);
Now we can estimate how many locals we think are here!
In [37]:
lower_bound = np.argmax(hypothesis_probabilities.cumsum() > 0.005)
upper_bound = np.argmax(hypothesis_probabilities.cumsum() > 0.995)
mode_index = np.argmax(hypothesis_probabilities)
hypothesis_values[lower_bound], hypothesis_values[mode_index], hypothesis_values[upper_bound]
Out[37]:
So given our assumptions there's a 99% chance that there are between 164 and 477 locals at PyData! If we had to guess a single number, the most likely estimate would be 300, but as the result below shows, there's only about a 0.6% chance of it being exactly that.
In [38]:
max(hypothesis_probabilities)
Out[38]:
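As a quick sanity check (an illustrative aside, reusing the variables defined above), we can confirm how much probability mass sits between those bounds:
# Roughly 0.99 by construction of the bounds above
hypothesis_probabilities[lower_bound:upper_bound + 1].sum()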
There are well-known likelihood distributions for many common situations like this one. Want to learn more and/or see the math behind them? A good reference is Mathematical Statistics with Applications (7th edition).
In the previous example we had too many hypotheses to calculate by hand, but still a finite number of them, because we were restricted to integer solutions. What happens when we care about solutions at arbitrary granularity?
We'll start by approaching the problem the same way and note the challenges.
A common business problem is to find the min/max of some function so that you can optimize revenue or costs. For this reason, my favorite toy example is modeling a parabola to some data.
Imagine we are tasked with optimizing our revenue for creating Super Widgets. We want to know what the optimum number of hours are for our workers to work in a day in order to maximize our revenue. We have a fixed staff size and are subject to overtime laws which is where the optimization comes into play.
Now! We've run an experiment in which we varied how many hours the average worker works in a day. Using this data, let's see if we can recommend an optimum.
Let's make up some hypothetical data so we can visualize the problem better.
In [18]:
import numpy as np, scipy as sp, seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
true_a = -3
true_h = 9.25
true_k = 25000
true_sigma = 10
x_data = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12])
observed_y = true_a * (x_data - true_h)**2 + true_k + np.random.normal(loc=0, scale=true_sigma, size=len(x_data))
Then visualize it
In [19]:
observed_y
Out[19]:
In [20]:
plt.figure(figsize=[8,8])
plt.xlabel('Hours Worked', size=14)
plt.ylabel('Net Revenue', size=14)
plt.title('Hours worked vs net revenue', size=16)
plt.scatter(x_data, observed_y);
With the naked eye we might say that 8 hour days are perfect... or maybe 11 hours? #eek Let's see what happens when we model this.
Since we're optimizing revenue, maybe we could assume that $$ y = a(x - h)^2 + k $$
On top of that... we can see the data looks parabolic and we want to find the maximum point. The nice thing about modeling this with a parabola is that we can write the equation such that the exact information we want is just a parameter.
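To make that concrete, the vertex form puts the maximum (when $a < 0$) exactly at $x = h$:
$$ \frac{dy}{dx} = 2a(x - h) = 0 \quad \Rightarrow \quad x = h $$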
In [21]:
import itertools
# Come up with a hypothesis for each parameter
a_hypotheses = np.array([-4, -3, -2, -1])
h_hypotheses = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
k_hypotheses = np.array([18000, 20000, 23000, 25000, 26000, 28000, 30000])
sigma_hypotheses = np.array([5, 10, 15, 20, 25])
Any parameter value we aren't explicitly listing here, we are implicitly saying is impossible.
That's fine for now.
How do we use what we've gathered so far to evaluate probabilities? We use all of our hypotheses above to generate EVERY unique hypothesis in our hypothesis space.
In [22]:
# Create list of every possible combination of them.
hypotheses = list(itertools.product(a_hypotheses, h_hypotheses, k_hypotheses, sigma_hypotheses))
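It's worth checking how quickly this grid grows (this line is just an illustrative check, not part of the original cells):
# 4 values of a * 16 of h * 7 of k * 5 of sigma = 2240 hypotheses
len(hypotheses)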
Now that we have all of our hypotheses, we can continue the way we did in the previous example. We just need two things: a prior over the hypotheses and a likelihood function to evaluate them against the data.
We can start with a simple non-informative prior.
In [23]:
prior = np.ones(len(hypotheses))
prior
Out[23]:
Next we need a way to assign a probability to each hypothesis, i.e. a likelihood function. We can set up a probabilistic model around the error term: our mental model is that if we fit the model perfectly, we'd expect the errors to be normally distributed.
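In symbols (matching the code that follows), the likelihood of the data under a particular hypothesis is the product of normal densities of the residuals:
$$ P(data\ |\ a, h, k, \sigma) = \prod_{i} \mathcal{N}\!\left(y_i - \left(a(x_i - h)^2 + k\right)\ ;\ 0,\ \sigma\right) $$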
The following cells illustrate the computation for a single hypothesis and a single data point; afterwards we'll wrap the same steps in a loop over every hypothesis and every observation.
In [24]:
for i, hypothesis in enumerate(hypotheses):
    if hypothesis == (-3, 9, 25000, 5):
        print(i)
        break
In [25]:
a_hypothesis, h_hypothesis, k_hypothesis, sigma_hypothesis = hypotheses[890]
x_datum, y_datum = list(zip(x_data, observed_y))[0]
predicted_y = a_hypothesis * (x_datum - h_hypothesis)**2 + k_hypothesis
prediction_error = y_datum - predicted_y
# On average, the error should be centered around zero
# just like in a standard regression (aka normal residuals).
y_probability = ss.norm.pdf(0, loc=prediction_error, scale=sigma_hypothesis)
y_probability
Out[25]:
This is a probability density, NOT a probability! It needs to be normalized... and before it can be normalized we need to compute likelihoods for all of our hypotheses.
We know this is (essentially) the best hypothesis in our grid because we generated the data. For each hypothesis we will multiply together the likelihood of every data point under that hypothesis... but watch what happens even at the most likely hypothesis:
In [26]:
y_probability ** len(observed_y)
Out[26]:
And if we had 100x as many observations...
In [27]:
y_probability ** (100*len(observed_y))
Out[27]:
Completely zero. In probability zero and one are certainty and no amount of data will ever change them. These are bad values we want to avoid. We'll come back to this.
In [28]:
# Start with an uninformative prior
# Every hypothesis gets an equal footing
hypothesis_likelihoods = np.array(prior)
for i, hypotheses_tuple in enumerate(hypotheses):
    a_hypothesis, h_hypothesis, k_hypothesis, sigma_hypothesis = hypotheses_tuple
    for x_datum, y_datum in zip(x_data, observed_y):
        # For the given hypothetical value of each parameter
        # let's compute a prediction and the error
        predicted_y = a_hypothesis * (x_datum - h_hypothesis)**2 + k_hypothesis
        prediction_error = y_datum - predicted_y
        # On average, the error should be centered around zero
        # just like in a standard regression (aka normal residuals).
        y_probability = ss.norm.pdf(0, loc=prediction_error, scale=sigma_hypothesis)
        # Multiply probability density
        hypothesis_likelihoods[i] *= y_probability
In [29]:
evaluated_hypotheses = list(zip(hypotheses, hypothesis_likelihoods))
In [30]:
max(evaluated_hypotheses, key=lambda x: x[-1])
Out[30]:
So this means the most likely estimates for the parameter values are:
a = -4
h = 9
k = 25000
sigma = 5
These aren't perfect! But that's fine... the only value we really care about (h) is pretty darn close to the true 9.25. Also, the last value in that tuple is supposed to represent the likelihood of this solution, but there's no way that's a probability! We never normalized.
In [31]:
normalized_hypothesis_likelihoods = hypothesis_likelihoods / hypothesis_likelihoods.sum()
evaluated_normalized_hypotheses = list(zip(hypotheses, normalized_hypothesis_likelihoods))
max(evaluated_normalized_hypotheses, key=lambda x: x[-1])
Out[31]:
Same solution, but now we have a proper probability across all of our hypotheses. The underflow problem is still lurking, though: multiplying many small densities drives the product to exactly zero, and we can't just switch to log probabilities and exponentiate at the end either, because very negative log values also underflow:
In [32]:
np.exp([-5000, -5001, -5002])
Out[32]:
In [33]:
def log_normalize(log_p):
    # Shift by the max before exponentiating to avoid underflow, then normalize.
    shifted_p = np.exp(log_p - np.max(log_p))
    normalized_p = shifted_p / shifted_p.sum()
    return normalized_p
In [34]:
log_normalize([-5000, -5001, -5002])
Out[34]:
In [35]:
def fit_data(prior, hypotheses, x_data, observed_y):
    # Work in log space: start from the log of the prior
    hypothesis_likelihoods = np.log(prior)
    for i, hypotheses_tuple in enumerate(hypotheses):
        a_hypothesis, h_hypothesis, k_hypothesis, sigma_hypothesis = hypotheses_tuple
        for x_datum, y_datum in zip(x_data, observed_y):
            predicted_y = a_hypothesis * (x_datum - h_hypothesis)**2 + k_hypothesis
            prediction_error = y_datum - predicted_y
            y_probability = ss.norm.pdf(0, loc=prediction_error, scale=sigma_hypothesis)
            # Add log densities instead of multiplying raw densities
            hypothesis_likelihoods[i] += np.log(y_probability)
    # Normalize in log space (shift by the max, exponentiate, divide by the sum)
    normalized_hypothesis_likelihoods = log_normalize(hypothesis_likelihoods)
    evaluated_normalized_hypotheses = list(zip(hypotheses, normalized_hypothesis_likelihoods))
    return evaluated_normalized_hypotheses
fixed_evaluated_normalized_hypotheses = fit_data(np.ones(len(hypotheses)), hypotheses, x_data, observed_y)
max(fixed_evaluated_normalized_hypotheses, key=lambda x: x[-1])
Out[35]:
Let's look at the different hypotheses weighted by their probability
In [36]:
X = np.linspace(4, 12, num=9)
for hypothesis, probability in fixed_evaluated_normalized_hypotheses:
    if probability == 0:
        continue
    a, h, k, sigma = hypothesis
    Y = a * (X - h)**2 + k
    plt.plot(X, Y, alpha=probability);
Challenges of the above: the grid of hypotheses grows combinatorially as we add parameters or make the grid finer, and we can only ever consider the values we explicitly enumerated. This is what pushes us toward sampling methods.
Let's use MCMC to solve a very simple problem: Inferring the mean of data we assume to be gaussian.
For simplicity, we'll start the chain at our prior mean.
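Concretely, the model matches the code below: the data standard deviation is fixed at 1 and we put a normal prior on the unknown mean.
$$ \mu \sim \mathcal{N}(\mu_{0},\ 1), \qquad x_i\ |\ \mu \sim \mathcal{N}(\mu,\ 1) $$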
In [37]:
prior_m = 0
prior_m_distribution = ss.norm(loc=prior_m, scale=1.0)
m_current = prior_m
proposal_width = 0.5
When choosing the next proposal for a variable, we draw from a proposal distribution centered at the current value with some limited width. Here we'll use a normal distribution with a standard deviation of 0.5. Technically it's unbounded, which lets us explore the whole space, but most proposals stay at reasonable values.
In [38]:
m_proposal = ss.norm(loc=m_current, scale=proposal_width).rvs()
We want to find $P_{accept} = \frac{P(\mu_{proposed}\ |\ data)}{P(\mu_{current}\ |\ data)}$. If $P_{accept} \geq 1$ we accept the proposal and add it to our trace; otherwise we accept it with probability $P_{accept}$.
Find the likelihood of our data at both the current value of the variable and at the proposed value. Let's also evaluate how probable the current and proposed values are under our prior. NOTE: What would happen if you chose a prior that was wildly different from the true value?
Here we need $P(\mu_{current}\ |\ data) = \frac{P(data\ |\ \mu_{current})\ *\ P(\mu_{current})}{P(data)}$. When we divide $P(\mu_{proposed}\ |\ data)$ by $P(\mu_{current}\ |\ data)$ the $P(data)$ cancels out, so we can avoid computing it!
In [39]:
prior_distribution = ss.norm(loc=prior_m, scale=1.0)
# NOTE: `data` is the observed sample whose mean we're inferring;
# it gets generated a few cells below (In [41]), so run that cell first.
# Make a proposal
m_proposal = ss.norm(loc=m_current, scale=proposal_width).rvs()
# Log-likelihood of the data given the proposal for m
m_proposal_likelihood = ss.norm(loc=m_proposal, scale=1).logpdf(data).sum()
# Log-likelihood of the data given the current m
m_current_likelihood = ss.norm(loc=m_current, scale=1).logpdf(data).sum()
# Log prior density of m_current and m_proposal
m_current_prior_likelihood = prior_distribution.logpdf(m_current)
m_proposal_prior_likelihood = prior_distribution.logpdf(m_proposal)
# Compute the (log of the) proportional probability for each
m_current_probability = m_current_likelihood + m_current_prior_likelihood
m_proposal_probability = m_proposal_likelihood + m_proposal_prior_likelihood
# Difference of logs = log of the acceptance ratio
m_accept_probability = m_proposal_probability - m_current_probability
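To finish the walk-through (a sketch; the full sampler below does exactly this), we'd exponentiate the log ratio and compare it with a uniform draw:
# m_accept_probability above is a log ratio, so exponentiate before comparing.
# Ratios >= 1 are always accepted; smaller ratios are accepted with that probability.
if np.random.rand() < np.exp(m_accept_probability):
    m_current = m_proposal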
In [40]:
import numpy as np
import scipy as sp
import pandas as pd
import scipy.stats as ss
def sampler(data, samples=1000, mu_prior_mu=2, mu_prior_sd=1.):
    mu_current = mu_prior_mu  # start the chain at the prior mean
    posterior = [mu_current]
    prior_distribution = ss.norm(loc=mu_prior_mu, scale=mu_prior_sd)
    for i in range(samples):
        # Suggest a new position.
        # How we come up with a proposal can be uniform random,
        # normal (like it is here) or something else.
        # Doesn't matter as far as making a functional algo.
        proposal_width = 1
        mu_proposal = ss.norm(loc=mu_current, scale=proposal_width).rvs()
        # What is the likelihood of the observed data given mu_current?
        # Compute the log-likelihood by summing the log probability of each data point.
        likelihood_current = ss.norm(loc=mu_current, scale=1).logpdf(data).sum()
        # What is the likelihood of the observed data given mu_proposal?
        likelihood_proposal = ss.norm(loc=mu_proposal, scale=1).logpdf(data).sum()
        # Compute the (log) prior probability of the current and proposed mu
        # using the same prior distribution for both.
        prior_current = prior_distribution.logpdf(mu_current)
        prior_proposal = prior_distribution.logpdf(mu_proposal)
        # Here we're finding log(P(data | mu) * P(mu))
        # for both mu_current and mu_proposal.
        p_current = likelihood_current + prior_current
        p_proposal = likelihood_proposal + prior_proposal
        # Accept proposal?
        # By dividing these two the P(data) normalizing term cancels out.
        p_accept = np.exp(p_proposal - p_current)
        # Choose a real between 0 and 1 and accept mu_proposal
        # if the ratio above is greater than this random number
        # (ratios >= 1 are always accepted).
        accept = np.random.rand() < p_accept
        if accept:
            # Update position
            mu_current = mu_proposal
        # This posterior list is actually our sampler trace: it records
        # the current position every iteration, accepted or not.
        posterior.append(mu_current)
    return posterior
In [41]:
data = np.random.normal(loc=5.0, scale=1.0, size=20)
samples = sampler(data, samples=10000)
In [42]:
plt.hist(samples[200:], bins=50);
In [43]:
plt.plot(samples[200:]);
This is a VERY simple implementation of the Metropolis algorithm. As you work on more and more complex models you will find the simpler algorithms run into problems. In particular, Metropolis assumes a symmetric proposal distribution; Metropolis-Hastings adds a correction that handles asymmetric proposals.
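For reference, Metropolis-Hastings multiplies the acceptance ratio by a correction term for the (possibly asymmetric) proposal density $q$:
$$ P_{accept} = \min\!\left(1,\ \frac{P(\mu_{proposed}\ |\ data)\ q(\mu_{current}\ |\ \mu_{proposed})}{P(\mu_{current}\ |\ data)\ q(\mu_{proposed}\ |\ \mu_{current})}\right) $$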
Now with some tooling, let's return to our revenue optimization problem from above. Here's how we can model it in PyMC3.
In [44]:
import pymc3 as pm
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=10)
    h = pm.Normal('h', mu=8, sd=10)
    k = pm.Normal('k', mu=24000, sd=1000)
    sigma = pm.Lognormal('sigma', mu=2, sd=5)
    y_pred = a * (x_data - h)**2 + k
    error = observed_y - y_pred
    pm.Normal('error', mu=0, sd=sigma, observed=error)
    trace = pm.sample(draws=15000, n_init=25000)
In [45]:
pm.traceplot(trace[500:]);
In [46]:
pm.summary(trace[500:])
Out[46]:
In [47]:
plt.figure(figsize=(8,8))
plt.title('Observed vs Predicted', size=16)
sampled_a = trace[14000:]['a']
sampled_h = trace[14000:]['h']
sampled_k = trace[14000:]['k']
x_eval = np.linspace(4, 12, num=50)
for a_sample, h_sample, k_sample in zip(sampled_a, sampled_h, sampled_k):
    y_eval = a_sample * (x_eval - h_sample)**2 + k_sample
    plt.plot(x_eval, y_eval, alpha=0.0065, c='k', linewidth=3.0)
plt.scatter(x_data, observed_y);
We get pretty close to what we know the optimal setting to be! The true optimum is 9.25 and this gives us roughly 9.75. We also get a sense of the uncertainty around that estimate.
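To put a number on that uncertainty (an illustrative check using the trace from above, not part of the original cells), we can read a 95% credible interval for h straight off the samples:
# 2.5th / 50th / 97.5th percentiles of the posterior for h,
# discarding the first 500 draws as warm-up (same cut as the traceplot above).
np.percentile(trace[500:]['h'], [2.5, 50, 97.5])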
Imagine you have a logistic regression model that predicts which delivery drivers will be best to deliver a product to your customer. You build the model, but you can't pick the probability threshold to use, even through cross-validation, because the downstream effect depends on complex human behavior.
Instead we design an experiment:
Try many thresholds randomly and infer the optimum using a mathematical model.
Think through the effect of the model. When the threshold was close to one, the eligible drivers would be the same as in production; in other words, a sort of control. As the threshold was lowered, drivers would start to be filtered out... but lower it too much and nearly all drivers would be filtered out and delivery performance would degrade.
We can develop a simple mathematical model to encapsulate this:
In [48]:
X = np.linspace(0.13, 1, num=101)
Y = 5*((X - 0.36)**2) / (X - 0.12) + 10
plt.figure(figsize=(8,8))
plt.title('Hypothetical Response', size=16)
plt.ylim(0, 20)
plt.xlim(0,1)
plt.xlabel('Threshold value', size=14)
plt.ylabel('Delivery performance', size=14)
plt.plot(X, Y);
In [49]:
X_ = np.linspace(0.13, 1, num=30)
Y_ = 3*((X_ - 0.36)**2) / (X_ - 0.12) + 10 + np.random.normal(loc=0, scale=2, size=len(X_))
In [50]:
plt.figure(figsize=(8,8))
plt.xlim(0,1)
plt.ylim(0,30)
plt.scatter(X_, Y_, s=10.0);
In [51]:
import pymc3 as pm
with pm.Model() as m:
    a = pm.Uniform('a', lower=0.0, upper=20.0)
    h = pm.Uniform('h', lower=0.0, upper=1.0)   # pm.Normal('h', mu=0.36, sd=0.05)
    k = pm.Uniform('k', lower=0.0, upper=1.0)   # pm.Normal('k', mu=0.5, sd=0.05)
    C = pm.Uniform('C', lower=1.0, upper=30.)   # pm.Normal('C', mu=8, sd=1)
    sigma = pm.Lognormal('sigma', mu=2, sd=2)
    y_pred = a * ((X_ - h)**2) / (X_ - k) + C
    error = Y_ - y_pred
    pm.Normal('error', mu=0, sd=sigma, observed=error)
    trace = pm.sample(draws=15000, tune=1000)
In [52]:
pm.traceplot(trace);
In [53]:
pm.summary(trace)
Out[53]:
In [54]:
X_ = np.linspace(0.13, 1, num=30)
Y_ = 3*((X_ - 0.36)**2) / (X_ - 0.12) + 10 + np.random.normal(loc=0, scale=2, size=len(X_))
In [55]:
plt.scatter(X_, Y_);
In [56]:
import numpy as np
import pymc3 as pm
with pm.Model() as m:
    a = pm.Uniform('a', lower=0.0, upper=10.0)
    h = pm.Normal('h', mu=0.5, sd=0.05)
    k = pm.Normal('k', mu=0.2, sd=0.05)
    C = pm.Normal('C', mu=8, sd=1)
    sigma = pm.Uniform('sigma', lower=1, upper=30)
    y_pred = a * ((X_ - h)**2) / (X_ - k) + C
    error = Y_ - y_pred
    pm.Normal('error', mu=0, sd=sigma, observed=error)
    trace = pm.sample(draws=15000, tune=1500)
In [57]:
pm.traceplot(trace);
In [58]:
pm.summary(trace)
Out[58]:
In [59]:
plt.figure(figsize=(8,8))
plt.title('Observed vs Predicted', size=16)
sampled_a = trace[14000:]['a']
sampled_h = trace[14000:]['h']
sampled_k = trace[14000:]['k']
sampled_C = trace[14000:]['C']
x_eval = np.linspace(0.13, 1.0, num=50)
for a_sample, h_sample, k_sample, C_sample in list(zip(sampled_a, sampled_h, sampled_k, sampled_C))[:1000]:
    y_eval = a_sample * ((x_eval - h_sample)**2) / (x_eval - k_sample) + C_sample
    plt.plot(x_eval, y_eval, alpha=0.02, c='k', linewidth=1.0)
plt.scatter(X_, Y_);
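Since the model was written so that the quantity we care about is a parameter (the fitted curve's turning point sits at h for thresholds above k), the posterior for h is a direct read-out of the candidate optimal threshold. A quick sketch of how we might summarize it, reusing the trace from above:
# Posterior summary for the turning-point parameter h (the candidate optimal threshold),
# using the same post-warm-up slice of the trace as the plot above.
np.percentile(trace[14000:]['h'], [2.5, 50, 97.5])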