Think Bayes

Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)


In [1]:
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

In [2]:
# Get utils.py and create directories

import os

if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py
        
if not os.path.exists('figs'):
    !mkdir figs
        
if not os.path.exists('tables'):
    !mkdir tables

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from empiricaldist import Pmf
from utils import decorate, savefig

The Price is Right problem

On November 1, 2007, contestants named Letia and Nathaniel appeared on The Price is Right, an American game show. They competed in a game called "The Showcase", where the objective is to guess the price of a collection of prizes. The contestant who comes closest to the actual price, without going over, wins the prizes.

Nathaniel went first. His showcase included a dishwasher, a wine cabinet, a laptop computer, and a car. He bid $26,000.

Letia’s showcase included a pinball machine, a video arcade game, a pool table, and a cruise of the Bahamas. She bid $21,500.

The actual price of Nathaniel’s showcase was $25,347. His bid was too high, so he lost.

The actual price of Letia’s showcase was $21,578.

She was only off by $78, so she won her showcase and, because her bid was off by less than 250, she also won Nathaniel’s showcase.

For a Bayesian thinker, this scenario suggests several questions:

  1. Before seeing the prizes, what prior beliefs should the contestant have about the price of the showcase?

  2. After seeing the prizes, how should the contestant update those beliefs?

  3. Based on the posterior distribution, what should the contestant bid?

The third question demonstrates a common use of Bayesian methods: decision analysis.

This problem is inspired by this example in Cameron Davidson-Pilon’s book, Probablistic Programming and Bayesian Methods for Hackers.

The prior

To choose a prior distribution of prices, we can take advantage of data from previous episodes. Fortunately, fans of the show keep detailed records.

For this example, I downloaded files containing the price of each showcase from the 2011 and 2012 seasons and the bids offered by the contestants.

The following cells load the data files.


In [4]:
# Load the data files
import os

if not os.path.exists('showcases.2011.csv'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/showcases.2011.csv

if not os.path.exists('showcases.2012.csv'):
    !wget http://github.com/AllenDowney/ThinkBayes2/raw/master/showcases.2012.csv

The following function reads the data and cleans it up a little.


In [5]:
def read_data(filename):
    """Read the showcase price data.
    
    filename: string
    
    returns: DataFrame
    """
    df = pd.read_csv(filename, index_col=0, skiprows=[1])
    return df.dropna().transpose()

I'll read both files and concatenate them.


In [6]:
df2011 = read_data('showcases.2011.csv')
df2011.shape


Out[6]:
(191, 6)

In [7]:
df2012 = read_data('showcases.2012.csv')
df2012.shape


Out[7]:
(122, 6)

In [8]:
df = pd.concat([df2011, df2012], ignore_index=True)
df.shape


Out[8]:
(313, 6)

Here's what the dataset looks like:


In [9]:
df.head()


Out[9]:
Showcase 1 Showcase 2 Bid 1 Bid 2 Difference 1 Difference 2
0 50969.0 45429.0 42000.0 34000.0 8969.0 11429.0
1 21901.0 34061.0 14000.0 59900.0 7901.0 -25839.0
2 32815.0 53186.0 32000.0 45000.0 815.0 8186.0
3 44432.0 31428.0 27000.0 38000.0 17432.0 -6572.0
4 24273.0 22320.0 18750.0 23000.0 5523.0 -680.0

Kernel density estimation

This dataset contains the prices for 313 previous showcases, which we can think of as a sample from the population of possible prices.

We can use this sample to estimate the prior distribution of showcase prices. One way to do that is kernel density estimation (KDE), which uses the sample to estimate a smooth distribution.

SciPy provides gaussian_kde, which takes a sample and returns an object that represents the estimated distribution.

The following function takes a sample, makes a KDE, evaluates it at a given sequence of quantities, and returns the result as a normalized PMF.


In [10]:
from scipy.stats import gaussian_kde

def make_kde(qs, sample):
    """Make a PMF based on KDE:
    
    qs: quantities where we should evaluate the KDE
    sample: sequence of values
    
    returns: Series that represents a normalized PMF
    """
    kde = gaussian_kde(sample)
    ps = kde(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf

We can use it to estimate the distribution of values for Showcase 1:


In [11]:
qs = np.linspace(0, 80000, 81)
prior1 = make_kde(qs, df['Showcase 1'])

Here's what it looks like:


In [12]:
def decorate_value(title=''):
    decorate(xlabel='Showcase value ($)',
        ylabel='PMF',
        title=title)

In [13]:
prior1.plot(label='Prior 1')
decorate_value('Prior distribution of showcase value')
savefig('fig08-01')


Exercise: Use this function to make a Pmf that represents the prior distribution for Showcase 2, and plot it.


In [14]:
# Solution

qs = np.linspace(0, 80000, 81)
prior2 = make_kde(qs, df['Showcase 2'])

In [15]:
# Solution

prior1.plot(label='Prior 1')
prior2.plot(label='Prior 2')

decorate_value('Prior distributions of showcase value')


Distribution of error

To update these priors, we have to answer these questions:

  • What data should we consider and how should we quantify it?

  • Can we compute a likelihood function; that is, for each hypothetical price, can we compute the conditional likelihood of the data?

To answer these questions, I will model the contestant as a price-guessing instrument with known error characteristics. In other words, when the contestant sees the prizes, they guess the price of each prize --- ideally without taking into consideration the fact that the prize is part of a showcase --- and add up the prices. Let’s call this total guess.

Under this model, the question we have to answer is, “If the actual price is price, what is the likelihood that the contestant’s estimate would be guess?”

Equivalently, if we define error = guess - price, we can ask, “What is the likelihood that the contestant’s estimate is off by error?”

To answer this question, I'll use the historical data again. For each showcase in the dataset, let's look at the difference between the contestant's bid and the actual price:


In [16]:
sample_diff1 = df['Bid 1'] - df['Showcase 1']
sample_diff2 = df['Bid 2'] - df['Showcase 2']

To visualize the distribution of these differences, we can use KDE again.


In [17]:
qs = np.linspace(-40000, 20000, 61)
kde_diff1 = make_kde(qs, sample_diff1)
kde_diff2 = make_kde(qs, sample_diff2)

In [18]:
kde_diff1.plot(label='Diff 1')
kde_diff2.plot(label='Diff 2')

decorate(xlabel='Difference in value ($)',
        ylabel='PMF',
        title='Difference between bid and actual value')

savefig('fig08-02')


It looks like the bids are too low more often than too high, which makes sense. Remember that under the rules of the game, you lose if you overbid, so contestants probably underbid to some degree deliberately.

Here is the mean and standard deviation of Diff for Player 1.


In [19]:
mean_diff1 = sample_diff1.mean()
std_diff1 = sample_diff1.std()

mean_diff1, std_diff1


Out[19]:
(-4116.3961661341855, 6899.909806377117)

We can use the observed distribution of differences to model the contestant's distribution of errors.

This step is a little tricky because we don’t actually know the contestant’s guesses; we only know what they bid.

So we have to make some assumptions:

  • I'll assume that contestants underbid because they are being strategic, and that on average their guesses are accurate. In other words, the mean of their errors is 0.

  • But I'll assume that the spread of the differences reflects the actual spread of their errors. So, I'll use the standard deviation of the differences as the standard deviation of their errors.

Based on these assumptions, I'll make a normal distribution with parameters 0 and std_diff1.


In [20]:
from scipy.stats import norm

error_dist1 = norm(0, std_diff1)
error_dist1


Out[20]:
<scipy.stats._distn_infrastructure.rv_frozen at 0x7f4e927407f0>

We'll use this distribution to do the update.

Update

Suppose you are Player 1. You see the prizes in your showcase and your estimate of the total price is $23,000.

For each hypothetical price in the prior distribution, I'll subtract away your guess. The result is your error under each hypothesis.


In [21]:
guess1 = 23000

qs = prior1.index
error1 = guess1 - qs

Now suppose you know based on past performance that your estimation error is well modeled by error_dist1.

Under that assumption we can compute the likelihood of your estimate under each hypothesis.


In [22]:
likelihood1 = error_dist1.pdf(error1)

And we can use that likelihood to update the prior.


In [23]:
posterior1 = prior1 * likelihood1
posterior1.normalize()


Out[23]:
3.3889812097254624e-05

Here's what the posterior distribution looks like:


In [24]:
prior1.plot(color='gray', label='Prior 1')
posterior1.plot(color='C0', label='Posterior 1')

decorate_value('Prior and posterior distribution of showcase value')
savefig('fig08-03')


Because your estimate is in the lower end of the range, the posterior distribution has shifted to the left. We can use the posterior mean to see by how much.


In [25]:
prior1.mean(), posterior1.mean()


Out[25]:
(30299.488817891375, 26192.024002392536)

Before you saw the prizes, you expected to see a showcase with a value close to $30,000.

After making an estimate of $23,000, you updated the prior distribution.

Based on the combination of the prior and your estimate, you now expect the actual price to be about $26,000.

Exercise: Now suppose you are Player 2. When you see your showcase, you estimte that the total price is $38,000.

Use diff2 to construct a normal distribution that represents the distribution of your estimation errors.

Compute the likelihood of your estimate for each actual price and use it to update prior2.

Plot the posterior distribution and compute the posterior mean. Based on your estimate, what do you expect the actual price of the showcase to be?


In [26]:
# Solution

mean_diff2 = sample_diff2.mean()
std_diff2 = sample_diff2.std()

mean_diff2, std_diff2


Out[26]:
(-3675.891373801917, 6886.260711323408)

In [27]:
# Solution

error_dist2 = norm(0, std_diff2)

In [28]:
# Solution

guess2 = 38000
qs = prior2.index
error2 = guess2 - qs

likelihood2 = error_dist2.pdf(error2)

In [29]:
# Solution

posterior2 = prior2 * likelihood2
posterior2.normalize()


Out[29]:
2.6978123219107028e-05

In [30]:
# Solution

prior2.plot(color='gray', label='Prior 2')
posterior2.plot(color='C1', label='Posterior 2')

decorate_value('Prior and posterior distribution of showcase value')



In [31]:
# Solution

prior2.mean(), posterior2.mean()


Out[31]:
(31047.623719122515, 34305.20161642469)

Probability of winning

Now that we have a posterior distribution for each player, let's think about strategy.

First, from the point of view of Player 1, let's compute the probability that Player 2 overbids. To keep it simple, I'll use only the performance of past players, ignoring the estimated value of the showcase.

The following function takes a sequence of past bids and returns the fraction that overbid.


In [32]:
def prob_overbid(sample_diff):
    """Returns the probability this player overbids.

    sample_diff: sequence of differences
    """
    return np.mean(sample_diff > 0)

Here's an estimate for the probability that Player 2 overbids.


In [33]:
prob_overbid(sample_diff2)


Out[33]:
0.29073482428115016

Now suppose Player 1 underbids by $5000. What is the probability that Player 2 underbids by more?

The following function uses past performance to estimate the probability that a player underbids by more than a given amount, diff:


In [34]:
def prob_worse_than(diff, sample_diff):
    """Probability the opponents's diff is worse than the given diff.

    diff: how much the oppenent is off by (always negative)
    sample_diff: sequence of differences for the opponent
    """
    return np.mean(sample_diff < diff)

Here's the probability that Player 2 underbids by more than $5000.


In [35]:
prob_worse_than(-5000, sample_diff2)


Out[35]:
0.38338658146964855

And here's the probability they are off by more than $10,000.


In [36]:
prob_worse_than(-10000, sample_diff2)


Out[36]:
0.14376996805111822

We can combine these functions to compute the probability that Player 1 wins, given the difference between their bid and the actual price:


In [37]:
def compute_prob_win(diff, sample_diff):
    """Computes the probability of winning for a given diff.

    diff: how much your bid was off by
    sample_diff: sequence of differences for the opponent
    """
    # if you overbid you lose
    if diff > 0:
        return 0
    
    # if the opponent over bids, you win
    p1 = prob_overbid(sample_diff)
    
    # or of their bid is worse than yours, you win
    p2 = prob_worse_than(diff, sample_diff)
    return p1 + p2

Here's the probability that you win, given that you underbid by $5000.


In [38]:
compute_prob_win(-5000, sample_diff2)


Out[38]:
0.6741214057507987

Now let's look at the probability of winning for a range of possible differences.


In [39]:
xs = np.linspace(-30000, 5000, 121)
ys = [compute_prob_win(x, sample_diff2) for x in xs]
plt.plot(xs, ys)

decorate(xlabel='Difference between bid and actual price ($)',
    ylabel='Probability of winning',
    title='Player 1')

savefig('fig08-04')


If you underbid by $30,000, the chance of winning is about 30%, which is mostly the chance your opponent overbids.

As your bids gets closer to the actual price, your chance of winning approaches 1.

And, of course, if you overbid, you lose (even if your opponent also overbids).

Exercise: Run the same analysis from the point of view of Player 2. Using the sample of differences from Player 1, compute:

  1. The probability that Player 1 overbids.

  2. The probability that Player 1 underbids by more than $5000.

  3. The probability that Player 2 wins, given that they underbid by $5000.

Then plot the probability that Player 2 wins for a range of possible differences between their bid and the actual price.


In [40]:
prob_overbid(sample_diff1)


Out[40]:
0.24600638977635783

In [41]:
prob_worse_than(-5000, sample_diff1)


Out[41]:
0.3993610223642173

In [42]:
compute_prob_win(-5000, sample_diff1)


Out[42]:
0.6453674121405751

In [43]:
xs = np.linspace(-30000, 5000, 121)
ys = [compute_prob_win(x, sample_diff1) for x in xs]
plt.plot(xs, ys)

decorate(xlabel='Difference between bid and actual price ($)',
    ylabel='Probability of winning',
    title='Player 2')


Decision analysis

In the previous section we computed the probability of winning, given that we have underbid by a particular amount.

In reality the contestants don't know how much they have underbid by, because they don't know the actual price.

But they do have a posterior distribution that represents their beliefs about the actual price, and they can use that to estimate their probability of winning with a given bid.

The following function take a possible bid, a posterior distribution of actual prices, and a sample of differences for the opponent.

It loops through the hypothetical prices in the posterior distribution and for each price:

  1. Computes the difference between the bid and the hypothetical price.

  2. Computes the probability that the player wins, given that difference.

  3. Adds up the weighted sum of the probabilities, where the weights are the probabilities in the posterior distribution.


In [44]:
def total_prob_win(bid, posterior, sample_diff):
    """Computes the total probability of winning with a given bid.

    bid: your bid
    posterior: Pmf of showcase value
    sample_diff: sequence of differences for the opponent
    
    returns: probability of winning
    """
    total = 0
    for price, prob in posterior.items():
        diff = bid - price
        total += prob * compute_prob_win(diff, sample_diff)
    return total

This loop implements the law of total probability:

$P(win) = \sum_{price} P(price) ~ P(win ~|~ price)$


In [45]:
total_prob_win(25000, posterior1, sample_diff2)


Out[45]:
0.48422109454398116

In [46]:
bids = posterior1.index

probs = [total_prob_win(bid, posterior1, sample_diff2) 
         for bid in bids]

prob_win_series = pd.Series(probs, index=bids)

In [47]:
prob_win_series.plot(color='C1')

decorate(xlabel='Bid ($)',
    ylabel='Probability of winning',
    title='Player 1')

savefig('fig08-05')


And here's the bid that maximizes your chance of winning.


In [48]:
prob_win_series.idxmax()


Out[48]:
21000.0

In [49]:
prob_win_series.max()


Out[49]:
0.6136807192359474

Recall that your estimate was $23,000.

After using your estimate to compute the posterior distribution, the posterior mean is about $26,000.

But the bid that maximizes your chance of winning is $21,000.

Exercise: Do the same analysis for Player 2.


In [50]:
# Solution

bids = posterior2.index

probs = [total_prob_win(bid, posterior2, sample_diff1) 
         for bid in bids]

prob_win_series = pd.Series(probs, index=bids)

In [51]:
# Solution

prob_win_series.plot(color='C1')

decorate(xlabel='Bid ($)',
    ylabel='Probability of winning',
    title='Player 2')



In [52]:
# Solution

prob_win_series.idxmax()


Out[52]:
29000.0

Maximizing expected gain

In the previous section we computed the bid that maximizes your chance of winning. And if that's your goal, the bid we computed is optimal.

But winning isn't everything. Remember that if your bid is off by $250 or less, you win both showcases. So it might be a good idea to increase your bid a little: it increases the chance you overbid and lose, but it also increases the chance of winning both showcases.

Let's see how that works out. The following function computes how much you will win, on average, given your bid, the actual price, and a sample of errors for your opponent.


In [53]:
def compute_gain(bid, price, sample_diff):
    """Computes expected gain given a bid and actual price.

    bid: number
    price: actual price
    sample_diff: sequence of differences for the opponent
    """
    diff = bid - price
    prob = compute_prob_win(diff, sample_diff)

    # if you are within 250 dollars, you win both showcases
    if -250 <= diff <= 0:
        return 2 * price * prob
    else:
        return price * prob

For example, if the actual price is $35000

and you bid $30000,

you will win about $23,600 worth of prizes on average.


In [54]:
compute_gain(30000, 35000, sample_diff2)


Out[54]:
23594.249201277955

In reality we don't know the actual price, but we have a posterior distribution that represents what we know about it. By averaging over the prices and probabilities in the posterior distribution, we can compute the "expected gain" for a particular bid.


In [55]:
def expected_gain(bid, posterior, sample_diff):
    """Computes the expected return of a given bid.

    bid: your bid
    posterior: distribution of showcase values
    sample_diff: distribution of differences for the opponent
    """
    total = 0
    for price, prob in posterior.items():
        total += prob * compute_gain(bid, price, sample_diff)
    return total

For the posterior we computed earlier, based on an estimate of $23,000,

the expected gain for a bid of $21,000

is about $16,900.


In [56]:
expected_gain(21000, posterior1, sample_diff2)


Out[56]:
16923.59933856512

But can we do any better?

To find out, we can loop through a range of bids and find the one that maximizes expected gain.


In [57]:
bids = posterior1.index

gains = [expected_gain(bid, posterior1, sample_diff2) for bid in bids]

expected_gain_series = pd.Series(gains, index=bids)

Here are the results.


In [58]:
expected_gain_series.plot(color='C2')

decorate(xlabel='Bid ($)',
    ylabel='Expected gain ($)',
    title='Player 1')

savefig('fig08-06')


And here is the optimal bid.


In [59]:
expected_gain_series.idxmax()


Out[59]:
22000.0

With that bid, the expected gain is about $17,400.


In [60]:
expected_gain_series.max()


Out[60]:
17384.899584430794

Recall that the estimated value of the prizes was $23,000.

The bid that maximizes the chance of winning is $21,000.

And the bid that maximizes your expected gain is $22,000.

Exercise: Do the same analysis for Player 2.


In [61]:
bids = posterior2.index

gains = [expected_gain(bid, posterior2, sample_diff1) for bid in bids]

expected_gain_series = pd.Series(gains, index=bids)

Here are the results.


In [62]:
expected_gain_series.plot(color='C2')

decorate(xlabel='Bid ($)',
    ylabel='Expected gain ($)',
    title='Player 2')


And here is the optimal bid.


In [63]:
expected_gain_series.idxmax()


Out[63]:
30000.0

Exercises

Exercise: This exercise is inspired by a true story. In 2001 I created Green Tea Press to publish my books, starting with Think Python. I ordered 100 copies from a short run printer and made the book available for through a distributor. After the first week, the distributor reported that 12 copies were sold. Based that report, I thought I would run out of copies in about 8 weeks, so I got ready to order more. My printer offered me a discount if I ordered more than 1000 copies, so I went a little crazy and ordered 2000 copies. A few days later, my mother called to tell me that her copies of the book had arrived. Surprised, I asked how many copies. She said ten.

It turned out I had sold only two copies to non-relatives. And it took a lot longer than I expected to sell 2000 copies.

The details of this story are unique, but the general problem is something almost every retailer has to figure out. Based on past sales, how do you predict future sales? And based on those predictions, how do you decide how much to order and when?

Often the cost of a bad decision is complicated. If you place a lot of small orders rather than one big one, your costs are likely to be higher. If you run out of inventory, you might lose customers. And if you order too much, you have to pay the various costs of holding inventory.

So, let's solve a version of the problem I faced. Suppose you start selling books online. During the first week you sell 12 copies (and let's assume that none of the customers are your mother). During the second week you sell 8 copies.

Assuming that the arrival of orders is a Poisson process, we can think of the weekly orders as samples from a Poisson distribution with an unknown rate. Choose a prior you think is appropriate and use the data to compute the posterior distribution of the order rate. Then generate a posterior predictive distribution for the number of copies you expect during the next 8 weeks.

  • Suppose the cost of printing the book is $5 per copy,

  • But if you order 100 or more, it's $4.50 per copy.

  • For every book you sell, you get $10.

  • But if you run out of books before the end of 8 weeks, you lose $50 in future sales for every week you are out of stock.

  • If you have books left over at the end of 8 weeks, you lose $2 in inventory costs per extra book.

For example, suppose you get orders for 10 books per week, every week.

If you order 60 books,

  • The total cost is $300.

  • You sell all 30 books, so you make $600.

  • But the book is out of stock for two weeks, so you lose $100 in future sales.

In total, your profit is $200.

If you order 100 books,

  • The total cost is $450.

  • You sell 80 books, again, so you make $800.

  • But you have 20 books left over at the end, so you lose $40.

In total, your profit is $310.

Combining these costs with your predictive distribution, how many books should you order to maximize your expected profit?

To get you started, the following functions compute profits and costs according to the speficiation of the problem:


In [64]:
def print_cost(printed):
    if printed < 100:
        return printed * 5
    else:
        return printed * 4.5

In [65]:
def total_income(printed, orders):
    sold = min(printed, np.sum(orders))
    return sold * 10

In [66]:
def inventory_cost(printed, orders):
    excess = printed - np.sum(orders)
    if excess > 0:
        return excess * 2
    else:
        return 0

In [67]:
def out_of_stock_cost(printed, orders):
    weeks = len(orders)
    total_orders = np.cumsum(orders)
    for i, total in enumerate(total_orders):
        if total > printed:
            return (weeks-i) * 50
    return 0

In [68]:
def compute_profit(printed, orders):
    return (total_income(printed, orders) -
            print_cost(printed)-
            out_of_stock_cost(printed, orders) -
            inventory_cost(printed, orders))

To test these function, suppose we get exactly 10 orders per week for eight weeks:


In [69]:
always_10 = [10] * 8
always_10


Out[69]:
[10, 10, 10, 10, 10, 10, 10, 10]

If you print 60 books, your net profit is $200, as in the example.


In [70]:
compute_profit(60, always_10)


Out[70]:
200

If you print 100 books, your net profit is $310.


In [71]:
compute_profit(100, always_10)


Out[71]:
310.0

Of course, in the context of the problem you don't know how many books will be ordered in any given week. You don't even know the average rate of orders. However, given the data and some assumptions about the prior, you can compute the distribution of the rate of orders.

You'll have a chance to do that, but to demonstrate the decision analysis part of the problem, I'll start with the arbitrary assumption that order rates comes from a gamma distribution with mean 10.

Here's a Pmf that represents this distribution.


In [72]:
from scipy.stats import gamma

α = 10
qs = np.linspace(0, 25, 101)
ps = gamma.pdf(qs, α)
pmf = Pmf(ps, qs)
pmf.normalize()
pmf.mean()


Out[72]:
9.996616662114423

In [73]:
pmf.plot(color='C1')
decorate(xlabel='Book ordering rate (λ)',
        ylabel='PMF')


Now, we could generate a predictive distribution for the number of books ordered in a given week, but in this example we have to deal with a complicated cost function. In particular, out_of_stock_cost depends on a sequence of orders.

So, rather than generate a predictive distribution, I suggest we run simulations. I'll demonstrate the steps.

First, from our hypothetical distribution of rates, we can draw a random sample of 1000 values.


In [74]:
rates = pmf.choice(1000)
np.mean(rates)


Out[74]:
9.987

For each possible rate, we can generate a sequence of 8 orders.


In [75]:
rng = np.random.default_rng()
order_array = rng.poisson(rates, size=(8, 1000)).transpose()
order_array[:5, :]


Out[75]:
array([[ 4,  7,  7,  4,  6,  8,  8,  7],
       [ 9, 10,  8, 12, 13, 11,  9, 10],
       [19, 13, 10, 16, 13, 13, 15, 10],
       [14,  9, 10,  7,  3,  4, 16,  8],
       [ 8,  9,  9, 11,  8,  6, 14,  7]])

Each row of this array is a hypothetical sequence of orders based on a different hypothetical order rate.

Now, if you tell me how many books you printed, I can compute your expected profits, averaged over these 1000 possible sequences.


In [76]:
def compute_expected_profits(printed, order_array):
    """
    """
    profits = [compute_profit(printed, orders)
               for orders in order_array]
    return np.mean(profits)

For example, here are the expected profits if you order 70, 80, or 90 books.


In [77]:
compute_expected_profits(70, order_array)


Out[77]:
203.794

In [78]:
compute_expected_profits(80, order_array)


Out[78]:
225.108

In [79]:
compute_expected_profits(90, order_array)


Out[79]:
219.948

Now, let's sweep through a range of values and compute expected profits as a function of the number of books you print.


In [80]:
printed_array = np.arange(70, 110)
t = [compute_expected_profits(printed, order_array)
                    for printed in printed_array]
expected_profits = pd.Series(t, printed_array)

In [81]:
expected_profits.plot(label='')

decorate(xlabel='Number of books printed',
         ylabel='Expected profit ($)')


The peak is at 80 books, where your expected profit is about $250.


In [82]:
expected_profits.idxmax(), expected_profits.max()


Out[82]:
(100, 242.934)

In this example, it turns out to be optimal to order the number of books you expect to sell on averagewhich is not too surprising. But notice that you do almost as well if you order 100 books. So it's not obvious that ordering the mean is always optimal.

Now it's your turn. Choose a prior that you think is reasonable, update it with the data you are given, and then use the posterior distribution to do the analysis I just demonstrated.


In [83]:
# Solution

# For a prior I chose a log-uniform distribution; 
# that is, a distribution that is uniform in log-space
# from 1 to 100 books per week.

qs = np.logspace(0, 2, 101)
prior = Pmf(1, qs)
prior.normalize()


Out[83]:
101

In [84]:
# Solution

prior.make_cdf().plot(color='C1')
decorate(xlabel='Book ordering rate (λ)',
        ylabel='CDF')



In [85]:
# Solution

from scipy.stats import poisson

def update_book(pmf, data):
    """Update book ordering rate.
    
    pmf: Pmf of book ordering rates
    data: observed number of orders in one week
    """
    k = data
    lams = pmf.index
    likelihood = poisson.pmf(k, lams)
    pmf *= likelihood
    pmf.normalize()

In [86]:
# Solution

posterior1 = prior.copy()
update_book(posterior1, 12)

In [87]:
# Solution

posterior2 = posterior1.copy()
update_book(posterior2, 8)

In [88]:
# Solution

posterior1.plot()
posterior2.plot()



In [89]:
# Solution

prior.mean(), posterior1.mean(), posterior2.mean()


Out[89]:
(21.78849107458653, 12.000000007076508, 10.000000000000375)

In [90]:
# Solution

rates = posterior2.choice(1000)
np.mean(rates)


Out[90]:
10.057732560331226

In [91]:
# Solution

order_array = rng.poisson(rates, size=(8, 1000)).transpose()
order_array[:5, :]


Out[91]:
array([[10, 14, 10,  8, 16,  9, 15,  8],
       [12,  4,  6,  8,  6, 10,  3,  5],
       [ 5, 11,  8, 11,  8,  3, 10, 10],
       [10, 10, 22,  9,  9, 11, 14, 12],
       [11, 13, 13, 11,  8, 11, 13, 13]])

In [92]:
# Solution

printed_array = np.arange(70, 110)
t = [compute_expected_profits(printed, order_array)
                    for printed in printed_array]
expected_profits = pd.Series(t, printed_array)

In [93]:
# Solution

expected_profits.plot(label='')

decorate(xlabel='Number of books printed',
         ylabel='Expected profit ($)')



In [ ]: