In 1814, Pierre-Simon Laplace wrote:
Probability theory is nothing but common sense reduced to calculation. ... [Probability] is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible ... when nothing leads us to expect that any one of these cases should occur more than any other.
Laplace nailed it. To untangle a probability problem, all you have to do is define exactly what the cases are, and carefully count the favorable and total cases. Let's be clear on our vocabulary words:
- Outcome: the result of a single trial of an experiment; for example: 4.
- Sample Space: the set of all possible outcomes; for example: {1, 2, 3, 4, 5, 6}.
- Event: a subset of outcomes that together have some property we are interested in; for example, the event "even die roll" is the set of outcomes {2, 4, 6}.
This notebook will explore these concepts in a concrete way using Python code. The code is meant to be succinct and explicit, and fast enough to handle sample spaces with millions of outcomes. If you need to handle trillions, you'll want a more efficient implementation. I also have another notebook that covers paradoxes in Probability Theory.
In [1]:
from fractions import Fraction
def P(event, space):
"The probability of an event, given a sample space."
return Fraction(cases(favorable(event, space)),
cases(space))
favorable = set.intersection # Outcomes that are in the event and in the sample space
cases = len # The number of cases is the length, or size, of a set
What's the probability of rolling an even number with a single six-sided fair die? Mathematicians traditionally use a single capital letter to denote a sample space; I'll use D
for the die:
In [2]:
D = {1, 2, 3, 4, 5, 6} # a sample space
even = { 2, 4, 6} # an event
P(even, D)
Out[2]:
Good to confirm what we already knew. We can explore some other events:
In [3]:
prime = {2, 3, 5, 7, 11, 13}
odd = {1, 3, 5, 7, 9, 11, 13}
In [4]:
P(odd, D)
Out[4]:
In [5]:
P((even | prime), D) # The probability of an even or prime die roll
Out[5]:
In [6]:
P((odd & prime), D) # The probability of an odd prime die roll
Out[6]:
Now consider dealing a hand of five playing cards. We can define deck as the 52 cards formed by concatenating each rank with each suit:
In [7]:
suits = u'♥♠♦♣'
ranks = u'AKQJT98765432'
deck = [r + s for r in ranks for s in suits]
len(deck)
Out[7]:
Now I want to define Hands
as the sample space of all 5-card combinations from deck
. The function itertools.combinations
does most of the work; we then concatenate each combination into a space-separated string:
In [8]:
import itertools
def combos(items, n):
"All combinations of n items; each combo as a space-separated str."
return set(map(' '.join, itertools.combinations(items, n)))
Hands = combos(deck, 5)
len(Hands)
Out[8]:
There are too many hands to look at them all, but we can sample:
In [9]:
import random
random.sample(Hands, 7)
Out[9]:
In [10]:
random.sample(deck, 7)
Out[10]:
Now we can answer questions like the probability of being dealt a flush (5 cards of the same suit):
In [11]:
flush = {hand for hand in Hands if any(hand.count(suit) == 5 for suit in suits)}
P(flush, Hands)
Out[11]:
Or the probability of four of a kind:
In [12]:
four_kind = {hand for hand in Hands if any(hand.count(rank) == 4 for rank in ranks)}
P(four_kind, Hands)
Out[12]:
Around 1700, Jacob Bernoulli wrote about removing colored balls from an urn in his landmark treatise Ars Conjectandi, and ever since then, explanations of probability have relied on urn problems. (You'd think the urns would be empty by now.)
For example, here is a three-part problem adapted from mathforum.org:
An urn contains 6 blue, 9 red, and 8 white balls. We select six balls at random. What is the probability of each of these outcomes:
- All balls are red.
- 3 are blue, 1 is red, and 2 are white.
- Exactly 4 balls are white.
We'll start by defining the contents of the urn. A set
can't contain multiple objects that are equal to each other, so I'll call the blue balls 'B1'
through 'B6'
, rather than trying to have 6 balls all called 'B'
:
In [13]:
def balls(color, n):
"A set of n numbered balls of the given color."
return {color + str(i)
for i in range(1, n + 1)}
urn = balls('B', 6) | balls('R', 9) | balls('W', 8)
urn
Out[13]:
Now we can define the sample space, U6
, as the set of all 6-ball combinations:
In [14]:
U6 = combos(urn, 6)
random.sample(U6, 5)
Out[14]:
Define select
such that select('R', 6)
is the event of picking 6 red balls from the urn:
In [15]:
def select(color, n, space=U6):
"The subset of the sample space with exactly `n` balls of given `color`."
return {s for s in space if s.count(color) == n}
Now I can answer the three questions:
In [16]:
P(select('R', 6), U6)
Out[16]:
In [17]:
P(select('B', 3) & select('R', 1) & select('W', 2), U6)
Out[17]:
In [18]:
P(select('W', 4), U6)
Out[18]:
Let's verify these calculations using basic arithmetic, rather than exhaustive counting. First, how many ways can I choose 6 out of 9 red balls? It could be any of the 9 for the first ball, any of 8 remaining for the second, and so on down to any of the remaining 4 for the sixth and final ball. But we don't care about the order of the six balls, so divide that product by the number of permutations of 6 things, which is 6!, giving us 9 × 8 × 7 × 6 × 5 × 4 / 6! = 84. In general, the number of ways of choosing c out of n items is (n choose c) = n! / ((n - c)! × c!). We can translate that to code:
In [19]:
from math import factorial
def choose(n, c):
"Number of ways to choose c items from a list of n items."
return factorial(n) // (factorial(n - c) * factorial(c))
In [20]:
choose(9, 6)
Out[20]:
Now we can verify the answers to the three problems. (Since P
computes a ratio and choose
computes a count,
I multiply the left-hand-side by N
, the length of the sample space, to make both sides be counts.)
In [21]:
N = len(U6)
N * P(select('R', 6), U6) == choose(9, 6)
Out[21]:
In [22]:
N * P(select('B', 3) & select('W', 2) & select('R', 1), U6) == choose(6, 3) * choose(8, 2) * choose(9, 1)
Out[22]:
In [23]:
N * P(select('W', 4), U6) == choose(8, 4) * choose(6 + 9, 2) # (6 + 9 non-white balls)
Out[23]:
We can solve all these problems just by counting; all you ever needed to know about probability problems you learned from Sesame Street:
So far, we have accepted Laplace's assumption that nothing leads us to expect that any one of these cases should occur more than any other. In real life, we often get outcomes that are not equiprobable--for example, a loaded die favors one side over the others. We will introduce three more vocabulary items:
- Frequency: a non-negative number describing how often an outcome occurs. It can be a count like 5, or a ratio like 1/6.
- Distribution: a mapping from outcome to frequency of that outcome. We will allow sample spaces to be distributions.
- Probability Distribution: a distribution whose frequencies sum to 1.
I could implement distributions with Dist = dict
, but instead I'll make Dist
a subclass of collections.Counter
:
In [24]:
from collections import Counter
class Dist(Counter):
"A Distribution of {outcome: frequency} pairs."
Because a Dist
is a Counter
, we can initialize it in any of the following ways:
In [25]:
# A set of equiprobable outcomes:
Dist({1, 2, 3, 4, 5, 6})
Out[25]:
In [26]:
# A collection of outcomes, with repetition indicating frequency:
Dist('THHHTTHHT')
Out[26]:
In [27]:
# A mapping of {outcome: frequency} pairs:
Dist({'H': 5, 'T': 4})
Out[27]:
In [28]:
# Keyword arguments:
Dist(H=5, T=4) == Dist({'H': 5}, T=4) == Dist('TTTT', H=5)
Out[28]:
Now I will modify the code to handle distributions. Here's my plan:
- A sample space can be specified as either a set or a Dist.
- The sample space need not be a true probability distribution: with Dist(H=50, T=50), the results will be the same as if the sample space had been a true probability distribution like Dist(H=1/2, T=1/2).
- cases now sums the frequencies in a distribution (it previously counted the length).
- favorable now returns a Dist of favorable outcomes and their frequencies (not a set).
- Fraction is redefined to use "/", not fractions.Fraction, because frequencies might be floats.
- P is unchanged.
In [29]:
def cases(outcomes):
"The total frequency of all the outcomes."
return sum(Dist(outcomes).values())
def favorable(event, space):
"A distribution of outcomes from the sample space that are in the event."
space = Dist(space)
return Dist({x: space[x]
for x in space if x in event})
def Fraction(n, d): return n / d
For example, here's the probability of rolling an even number with a crooked die that is loaded to prefer 6:
In [30]:
Crooked = Dist({1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5})
P(even, Crooked)
Out[30]:
As another example, an article gives the following counts for two-child families in Denmark, where GB
means a family where the first child is a girl and the second a boy (I'm aware that not all births can be classified as the binary "boy" or "girl," but the data was reported that way):
GG: 121801 GB: 126840
BG: 127123 BB: 135138
In [31]:
DK = Dist(GG=121801, GB=126840,
BG=127123, BB=135138)
In [32]:
first_girl = {'GG', 'GB'}
P(first_girl, DK)
Out[32]:
In [33]:
second_girl = {'GG', 'BG'}
P(second_girl, DK)
Out[33]:
This says that the probability of a girl is somewhere between 48% and 49%. The probability of a girl is very slightly higher for the second child.
Given the first child, are you more likely to have a second child of the same sex?
In [34]:
same = {'GG', 'BB'}
P(same, DK)
Out[34]:
Yes, but only by about 0.3%.
To calculate the probability of an even die roll, I originally said
even = {2, 4, 6}
But that's inelegant—I had to explicitly enumerate all the even numbers from one to six. If I ever wanted to deal with a twelve or twenty-sided die, I would have to go back and redefine even
. I would prefer to define even
once and for all like this:
In [35]:
def even(n): return n % 2 == 0
Now in order to make P(even, D)
work, I'll allow an Event
to be either a collection of outcomes or a callable
predicate (that is, a function that returns true for outcomes that are part of the event). I don't need to modify P
, but favorable
will have to convert a callable event
to a set
:
In [36]:
def favorable(event, space):
"A distribution of outcomes from the sample space that are in the event."
if callable(event):
event = {x for x in space if event(x)}
space = Dist(space)
return Dist({x: space[x]
for x in space if x in event})
In [37]:
favorable(even, D)
Out[37]:
In [38]:
P(even, D)
Out[38]:
I'll define die
to make a sample space for an n-sided die:
In [39]:
def die(n): return set(range(1, n + 1))
In [40]:
favorable(even, die(12))
Out[40]:
In [41]:
P(even, die(12))
Out[41]:
In [42]:
P(even, die(2000))
Out[42]:
In [43]:
P(even, die(2001))
Out[43]:
We can define more interesting events using predicates; for example we can determine the probability that the sum of rolling d 6-sided dice is prime:
In [44]:
def sum_dice(d): return Dist(sum(dice) for dice in itertools.product(D, repeat=d))
def is_prime(n): return (n > 1 and not any(n % i == 0 for i in range(2, n)))
for d in range(1, 9):
p = P(is_prime, sum_dice(d))
print("P(is_prime, sum_dice({})) = {}".format(d, round(p, 3)))
Consider a gambling game consisting of tossing a coin repeatedly. Player H wins the game as soon as a total of 10 heads come up, and T wins if a total of 10 tails come up before H wins. If the game is interrupted when H has 8 heads and T has 7 tails, how should the pot of money (which happens to be 100 Francs) be split?
In 1654, Blaise Pascal and Pierre de Fermat corresponded on this problem, with Fermat writing:
Dearest Blaise,
As to the problem of how to divide the 100 Francs, I think I have found a solution that you will find to be fair. Seeing as I needed only two points to win the game, and you needed 3, I think we can establish that after four more tosses of the coin, the game would have been over. For, in those four tosses, if you did not get the necessary 3 points for your victory, this would imply that I had in fact gained the necessary 2 points for my victory. In a similar manner, if I had not achieved the necessary 2 points for my victory, this would imply that you had in fact achieved at least 3 points and had therefore won the game. Thus, I believe the following list of possible endings to the game is exhaustive. I have denoted 'heads' by an 'h', and tails by a 't.' I have starred the outcomes that indicate a win for myself.
h h h h *   h h h t *   h h t h *   h h t t *
h t h h *   h t h t *   h t t h *   h t t t
t h h h *   t h h t *   t h t h *   t h t t
t t h h *   t t h t     t t t h     t t t t
I think you will agree that all of these outcomes are equally likely. Thus I believe that we should divide the stakes by the ratio 11:5 in my favor, that is, I should receive (11/16)×100 = 68.75 Francs, while you should receive 31.25 Francs.
I hope all is well in Paris,
Your friend and colleague,
Pierre
Pascal agreed with this solution, and replied with a generalization that made use of his previous invention, Pascal's Triangle. There's even a book about it.
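Here's a quick sketch of that binomial-coefficient idea (my own illustration, reusing the choose function defined earlier, not Pascal's exact derivation): H needs 2 more heads out of at most 4 more tosses, so we count the 4-toss sequences with at least 2 heads and divide by the 16 equally likely sequences.
# Count 4-toss sequences with at least 2 heads, out of 2**4 = 16 equally likely ones:
sum(choose(4, k) for k in range(2, 5)) / 2**4   # (6 + 4 + 1) / 16 = 11/16 = 0.6875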
We can solve the problem with the tools we have:
In [45]:
def win_unfinished_game(h, t):
"The probability that H will win the unfinished game, given the number of points needed by H and T to win."
return P(at_least(h, 'h'), finishes(h, t))
def at_least(n, item):
"The event of getting at least n instances of item in an outcome."
return lambda outcome: outcome.count(item) >= n
def finishes(h, t):
"All finishes of a game where player H needs h points to win and T needs t."
tosses = ['ht'] * (h + t - 1)
return set(itertools.product(*tosses))
We can generate the 16 equiprobable finishes that Pierre wrote about:
In [46]:
finishes(2, 3)
Out[46]:
And we can find the 11 of them that are favorable to player H
:
In [47]:
favorable(at_least(2, 'h'), finishes(2, 3))
Out[47]:
Finally, we can answer the question:
In [48]:
100 * win_unfinished_game(2, 3)
Out[48]:
We agree with Pascal and Fermat; we're in good company!
Let's jump ahead from 1654 all the way to 1693, when Samuel Pepys wrote to Isaac Newton posing the problem:
Which of the following three propositions has the greatest chance of success?
- Six fair dice are tossed independently and at least one “6” appears.
- Twelve fair dice are tossed independently and at least two “6”s appear.
- Eighteen fair dice are tossed independently and at least three “6”s appear.
Newton was able to answer the question correctly (although his reasoning was not quite right); let's see how we can do. Since we're only interested in whether a die comes up as "6" or not, we can define a single die like this:
In [49]:
die6 = Dist({6: 1/6, '-': 5/6})
Next we can define the joint distribution formed by combining two independent distributions like this:
In [50]:
def joint(A, B, combine='{}{}'.format):
"""The joint distribution of two independent distributions.
Result is all entries of the form {'ab': frequency(a) * frequency(b)}"""
return Dist({combine(a, b): A[a] * B[b]
for a in A for b in B})
joint(die6, die6)
Out[50]:
And the joint distribution from rolling n dice:
In [51]:
def dice(n, die):
"Joint probability distribution from rolling `n` dice."
if n == 1:
return die
else:
return joint(die, dice(n - 1, die))
dice(4, die6)
Out[51]:
Now we are ready to determine which proposition is most likely to get the required number of sixes:
In [52]:
P(at_least(1, '6'), dice(6, die6))
Out[52]:
In [53]:
P(at_least(2, '6'), dice(12, die6))
Out[53]:
In [54]:
P(at_least(3, '6'), dice(18, die6))
Out[54]:
We reach the same conclusion Newton did, that the best chance is rolling six dice.
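As a cross-check (my own addition, reusing the choose function defined earlier; at_least_k_sixes is just an illustrative helper name), the binomial formula gives the same three probabilities:
def at_least_k_sixes(k, n):
    "Probability of at least k sixes when rolling n fair dice, by the binomial formula."
    return sum(choose(n, j) * (1/6)**j * (5/6)**(n - j) for j in range(k, n + 1))
[at_least_k_sixes(1, 6), at_least_k_sixes(2, 12), at_least_k_sixes(3, 18)]
# ≈ [0.665, 0.619, 0.597]: six dice gives the best chance, as Newton concluded.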
Here's another urn problem (actually a "bag" problem) from prolific Python/Probability pundit Allen Downey:
The blue M&M was introduced in 1995. Before then, the color mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan). Afterward it was (24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown). A friend of mine has two bags of M&Ms, and he tells me that one is from 1994 and one from 1996. He won't tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow M&M came from the 1994 bag?
To solve this problem, we'll first create distributions for each bag: bag94
and bag96
:
In [55]:
bag94 = Dist(brown=30, yellow=20, red=20, green=10, orange=10, tan=10)
bag96 = Dist(blue=24, green=20, orange=16, yellow=14, red=13, brown=13)
Next, define MM
as the joint distribution—the sample space for picking one M&M from each bag. The outcome '94:yellow 96:green'
means that a yellow M&M was selected from the 1994 bag and a green one from the 1996 bag. In this problem we don't get to see the actual outcome; we just see some evidence about the outcome, that it contains a yellow and a green.
In [56]:
MM = joint(bag94, bag96, '94:{} 96:{}'.format)
MM
Out[56]:
We observe that "One is yellow and one is green":
In [57]:
def yellow_and_green(outcome): return 'yellow' in outcome and 'green' in outcome
favorable(yellow_and_green, MM)
Out[57]:
Given this observation, we want to know "What is the probability that the yellow M&M came from the 1994 bag?"
In [58]:
def yellow94(outcome): return '94:yellow' in outcome
P(yellow94, favorable(yellow_and_green, MM))
Out[58]:
So there is a 74% chance that the yellow comes from the 1994 bag.
Answering this question was straightforward: just like all the other probability problems, we simply create a sample space, and use P
to pick out the probability of the event in question, given what we know about the outcome.
But in a sense it is curious that we were able to solve this problem with the same methodology as the others: this problem comes from a section titled My favorite Bayes's Theorem Problems, so one would expect that we'd need to invoke Bayes Theorem to solve it. The computation above shows that that is not necessary.
Of course, we could solve it using Bayes Theorem. Why is Bayes Theorem recommended? Because we are asked about the probability of an outcome given the evidence—the probability the yellow came from the 94 bag, given that there is a yellow and a green. But the problem statement doesn't directly tell us the probability of that outcome given the evidence; it just tells us the probability of the evidence given the outcome.
Before we see the colors of the M&Ms, there are two hypotheses, A
and B
, both with equal probability:
A: first M&M from 94 bag, second from 96 bag
B: first M&M from 96 bag, second from 94 bag
P(A) = P(B) = 0.5
Then we get some evidence:
E: first M&M yellow, second green
We want to know the probability of hypothesis A
, given the evidence:
P(A | E)
That's not easy to calculate (except by enumerating the sample space, which our P
function does). But Bayes Theorem says:
P(A | E) = P(E | A) * P(A) / P(E)
The quantities on the right-hand-side are easier to calculate:
P(E | A) = 0.20 * 0.20 = 0.04
P(E | B) = 0.10 * 0.14 = 0.014
P(A) = 0.5
P(B) = 0.5
P(E) = P(E | A) * P(A) + P(E | B) * P(B)
= 0.04 * 0.5 + 0.014 * 0.5 = 0.027
And we can get a final answer:
P(A | E) = P(E | A) * P(A) / P(E)
= 0.04 * 0.5 / 0.027
= 0.7407407407
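We can check that arithmetic directly (a quick sketch; the 0.20, 0.14, and 0.10 figures are the bag percentages quoted in the problem):
p_E_given_A = 0.20 * 0.20   # yellow from the 94 bag (20%) and green from the 96 bag (20%)
p_E_given_B = 0.10 * 0.14   # green from the 94 bag (10%) and yellow from the 96 bag (14%)
p_A = p_B = 0.5
p_E = p_E_given_A * p_A + p_E_given_B * p_B
p_E_given_A * p_A / p_E     # ≈ 0.7407, the same answer the enumeration gave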
You have a choice: Bayes Theorem allows you to do less calculation at the cost of more algebra; that is a great trade-off if you are working with pencil and paper. Enumerating the sample space allows you to do less algebra at the cost of more calculation; usually a good trade-off if you have a computer. But regardless of the approach you use, it is important to understand Bayes theorem and how it works.
There is one important question that Allen Downey does not address: would you eat twenty-year-old M&Ms? 😨
Sometimes it is inconvenient, difficult, or even impossible to explicitly enumerate a sample space. Perhaps the sample space is infinite, or perhaps it is just very large and complicated (perhaps with a bunch of low-probability outcomes that don't seem very important). In that case, we might feel more confident in writing a program to simulate a random outcome. Random sampling from such a simulation can give an accurate estimate of probability.
Consider problem 84 from the excellent Project Euler, which asks for the probability that a player in the game Monopoly ends a roll on each of the squares on the board. To answer this we need to take into account die rolls, chance and community chest cards, and going to jail (from the "go to jail" space, from a card, or from rolling doubles three times in a row). We do not need to take into account anything about acquiring properties or exchanging money or winning or losing the game, because these events don't change a player's location.
A game of Monopoly can go on forever, so the sample space is infinite. Even if we limit the sample space to say, 1000 rolls, there are $21^{1000}$ such sequences of rolls, and even more possibilities when we consider drawing cards. So it is infeasible to explicitly represent the sample space. There are techniques for representing the problem as a Markov decision problem (MDP) and solving it, but the math is complex (a paper on the subject runs 15 pages).
The simplest approach is to implement a simulation and run it for, say, a million rolls. Here is the code for a simulation:
In [59]:
from collections import deque as Deck # a Deck of community chest or chance cards
# The Monopoly board, as specified by https://projecteuler.net/problem=84
(GO, A1, CC1, A2, T1, R1, B1, CH1, B2, B3,
JAIL, C1, U1, C2, C3, R2, D1, CC2, D2, D3,
FP, E1, CH2, E2, E3, R3, F1, F2, U2, F3,
G2J, G1, G2, CC3, G3, R4, CH3, H1, T2, H2) = board = range(40)
# A card is either a square, a set of squares meaning advance to the nearest,
# a -3 to go back 3 spaces, or None meaning no change to location.
CC_deck = Deck([GO, JAIL] + 14 * [None])
CH_deck = Deck([GO, JAIL, C1, E3, H2, R1, -3, {U1, U2}]
+ 2 * [{R1, R2, R3, R4}] + 6 * [None])
def monopoly(rolls):
"""Simulate given number of dice rolls of a Monopoly game,
and return the counts of how often each square is visited."""
counts = [0] * len(board)
doubles = 0 # Number of consecutive doubles rolled
random.shuffle(CC_deck)
random.shuffle(CH_deck)
goto(GO)
for _ in range(rolls):
d1, d2 = random.randint(1, 6), random.randint(1, 6)
doubles = (doubles + 1 if d1 == d2 else 0)
goto(here + d1 + d2)
if here == G2J or doubles == 3:
goto(JAIL)
doubles = 0
elif here in (CC1, CC2, CC3):
do_card(CC_deck)
elif here in (CH1, CH2, CH3):
do_card(CH_deck)
counts[here] += 1
return counts
def goto(square):
"Update 'here' to be this square (and handle passing GO)."
global here
here = square % len(board)
def do_card(deck):
"Take the top card from deck and do what it says."
card = deck.popleft() # The top card
deck.append(card) # Move top card to bottom of deck
if card == None: # Don't move
pass
elif card == -3: # Go back 3 spaces
goto(here - 3)
elif isinstance(card, set): # Advance to next railroad or utility
next1 = min({place for place in card if place > here} or card)
goto(next1)
else: # Go to destination named on card
goto(card)
Let's run the simulation for a million dice rolls:
In [60]:
counts = monopoly(10**6)
And print a table of square names and their percentages:
In [61]:
property_names = """
GO, A1, CC1, A2, T1, R1, B1, CH1, B2, B3,
JAIL, C1, U1, C2, C3, R2, D1, CC2, D2, D3,
FP, E1, CH2, E2, E3, R3, F1, F2, U2, F3,
G2J, G1, G2, CC3, G3, R4, CH3, H1, T2, H2""".replace(',', ' ').split()
for (c, n) in sorted(zip(counts, property_names), reverse=True):
print('{:4} {:.2%}'.format(n, c / sum(counts)))
There is one square far above average: JAIL
, at a little over 6%. There are four squares far below average: the three chance squares, CH1
, CH2
, and CH3
, at around 1% (because 10 of the 16 chance cards send the player away from the square), and the "Go to Jail" square, which has a frequency of 0 because you can't end a turn there. The other squares are around 2% to 3% each, which you would expect, because 100% / 40 = 2.5%.
We have covered the concept of distributions of outcomes. You may have heard of the normal distribution, the bell-shaped curve. In Python it is called random.normalvariate
(also random.gauss
). We can plot it with the help of the repeated_hist
function defined below, which samples a distribution n
times and displays a histogram of the results. (Note: in this section I am using "distribution" to mean a function that, each time it is called, returns a random sample from a distribution. I am not using it to mean a mapping of type Dist
.)
In [62]:
%matplotlib inline
import matplotlib.pyplot as plt
from statistics import mean
from random import normalvariate, triangular, choice, vonmisesvariate, uniform
def normal(mu=0, sigma=1): return random.normalvariate(mu, sigma)
def repeated_hist(dist, n=10**6, bins=100):
"Sample the distribution n times and make a histogram of the results."
samples = [dist() for _ in range(n)]
plt.hist(samples, bins=bins, density=True)
plt.title('{} (μ = {:.1f})'.format(dist.__name__, mean(samples)))
plt.grid(axis='x')
plt.yticks([], '')
plt.show()
In [63]:
# Normal distribution
repeated_hist(normal)
Why is this distribution called normal? The Central Limit Theorem says that it is the ultimate limit of other distributions, as follows (informally): take k independent random variables, each drawn from (almost) any distribution with finite mean and variance; as k grows, the distribution of their sum tends toward a normal distribution.
Here's a simple example: summing ten independent die rolls:
In [64]:
def sum10dice(): return sum(random.randint(1, 6) for _ in range(10))
repeated_hist(sum10dice, bins=range(10, 61))
As another example, let's take just k = 5 component distributions representing the per-game scores of 5 basketball players, and then sum them together to form the new distribution, the team score. I'll be creative in defining the distributions for each player, but historically accurate in the mean for each distribution.
In [65]:
def SC(): return max(0, normal(12.1, 3) + 3 * triangular(1, 13, 4)) # 30.1
def KT(): return max(0, triangular(8, 22, 15.3) + choice((0, 3 * triangular(1, 9, 4)))) # 22.1
def DG(): return max(0, vonmisesvariate(30, 2) * 3.08) # 14.0
def HB(): return max(0, choice((normal(6.7, 1.5), normal(16.7, 2.5)))) # 11.7
def BE(): return max(0, normal(17, 3) + uniform(0, 40)) # 37.0
team = (SC, KT, DG, HB, BE)
def Team(team=team): return sum(player() for player in team)
In [66]:
for player in team:
repeated_hist(player, bins=range(70))
We can see that none of the players have a distribution that looks like a normal distribution: SC
is skewed to one side (the mean is 5 points to the right of the peak); the next three players have bimodal distributions; and BE
is too flat on top.
Now we define the team score to be the sum of the k = 5 players, and display this new distribution:
In [67]:
repeated_hist(Team, bins=range(50, 180))
Sure enough, this looks very much like a normal distribution. The Central Limit Theorem appears to hold in this case. But I have to say: "Central Limit" is not a very evocative name, so I propose we re-name this as the Strength in Numbers Theorem, to indicate the fact that if you have a lot of numbers, you tend to get the expected result.
We've had an interesting tour and met some giants of the field: Laplace, Bernoulli, Fermat, Pascal, Bayes, Newton, ... even Mr. Monopoly and The Count.
The conclusion is: be methodical in defining the sample space and the event(s) of interest, be careful in counting the number of outcomes in the numerator and denominator, and you can't go wrong. Easy as 1-2-3.
Everything up to here has been about discrete, finite sample spaces, where we can enumerate all the possible outcomes.
But a reader asked about continuous sample spaces, such as the space of real numbers. The principles are the same: probability is still the ratio of the favorable cases to all the cases, but now instead of counting cases, we have to (in general) compute integrals to compare the sizes of cases. Here we will cover a simple example, which we first solve approximately by simulation, and then exactly by calculation.
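As a minimal warm-up (my own example, separate from the problem below): for a uniform random number x in [0, 1], the "size" of the favorable cases is the length of an interval rather than a count, and a simulation should agree with the exact ratio.
import random
from statistics import mean
# P(x > 0.6) is the length of the favorable interval [0.6, 1] over the length of [0, 1]:
exact = (1 - 0.6) / 1
approx = mean(random.random() > 0.6 for _ in range(100_000))
exact, approx   # approx should be close to the exact value 0.4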
Oliver Roeder posed this problem in the 538 Riddler blog:
Two players go on a hot new game show called Higher Number Wins. The two go into separate booths, and each presses a button, and a random number between zero and one appears on a screen. (At this point, neither knows the other’s number, but they do know the numbers are chosen from a standard uniform distribution.) They can choose to keep that first number, or to press the button again to discard the first number and get a second random number, which they must keep. Then, they come out of their booths and see the final number for each player on the wall. The lavish grand prize — a case full of gold bullion — is awarded to the player who kept the higher number. Which number is the optimal cutoff for players to discard their first number and choose another? Put another way, within which range should they choose to keep the first number, and within which range should they reject it and try their luck with a second number?
We'll use this notation:
- A, B: the two players, and also the cutoff value each of them chooses.
- a, b: the final number each player ends up keeping.
For example, if player A chooses a cutoff of A = 0.6, that means that A would accept any first number greater than 0.6, and reject any number below that cutoff. The question is: What cutoff, A, should player A choose to maximize the chance of winning, that is, maximize P(a > b)?
First, simulate the number that a player with a given cutoff gets (note that random.random()
returns a float sampled uniformly from the interval [0..1]):
In [68]:
number = random.random
def strategy(cutoff):
"Play the game with given cutoff, returning the first or second random number."
first = number()
return first if first > cutoff else number()
In [69]:
strategy(.5)
Out[69]:
Now compare the numbers returned with a cutoff of A versus a cutoff of B, and repeat for a large number of trials; this gives us an estimate of the probability that cutoff A is better than cutoff B:
In [70]:
def Pwin(A, B, trials=20000):
"The probability that cutoff A wins against cutoff B."
return mean(strategy(A) > strategy(B)
for _ in range(trials))
In [71]:
Pwin(0.6, 0.9)
Out[71]:
Now define a function, top, that considers a collection of possible cutoffs, estimates the probability of each cutoff playing against each other cutoff, and returns a list of the N top cutoffs (the ones that defeat the most opponent cutoffs), along with the number of opponents each defeats:
In [72]:
def top(N, cutoffs):
"Return the N best cutoffs and the number of opponent cutoffs they beat."
winners = Counter(A if Pwin(A, B) > 0.5 else B
for (A, B) in itertools.combinations(cutoffs, 2))
return winners.most_common(N)
In [73]:
from numpy import arange
top(10, arange(0.5, 1.0, 0.01))
Out[73]:
We get a good idea of the top cutoffs, but they are close to each other, so we can't quite be sure which is best, only that the best is somewhere around 0.60. We could get a better estimate by increasing the number of trials, but that would consume more time.
More promising is the possibility of making Pwin(A, B)
an exact calculation. But before we get to Pwin(A, B)
, let's solve a simpler problem: assume that both players A and B have chosen a cutoff, and have each received a number above the cutoff. What is the probability that A gets the higher number? We'll call this Phigher(A, B)
. We can think of this as a two-dimensional sample space of points in the (a, b) plane, where a ranges from the cutoff A to 1 and b ranges from the cutoff B to 1. Here is a diagram of that two-dimensional sample space, with the cutoffs A=0.5 and B=0.6:
The total area of the sample space is 0.5 × 0.4 = 0.20, and in general it is (1 - A) · (1 - B). What about the favorable cases, where A beats B? That corresponds to the shaded triangle below:
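We can sketch that region with matplotlib (a rough rendering of the same picture, with the cutoffs A=0.5 and B=0.6; the variable names cutoff_A and cutoff_B are mine):
import matplotlib.pyplot as plt
cutoff_A, cutoff_B = 0.5, 0.6
fig, ax = plt.subplots(figsize=(5, 5))
# The sample space: all points with a in [A, 1] and b in [B, 1]
ax.add_patch(plt.Rectangle((cutoff_A, cutoff_B), 1 - cutoff_A, 1 - cutoff_B, fill=False, linewidth=2))
# The favorable cases (a > b): the triangle with corners (B, B), (1, B), (1, 1)
ax.fill([cutoff_B, 1, 1], [cutoff_B, cutoff_B, 1], alpha=0.3)
ax.plot([0, 1], [0, 1], 'k--')   # the diagonal a == b
ax.set_xlabel('a'); ax.set_ylabel('b')
ax.set_xlim(0, 1); ax.set_ylim(0, 1)
plt.show()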
The area of a triangle is 1/2 the base times the height, or in this case, 0.4² / 2 = 0.08, and in general, (1 - B)² / 2. So in general we have:
Phigher(A, B) = favorable / total
favorable = ((1 - B) ** 2) / 2
total = (1 - A) * (1 - B)
Phigher(A, B) = (((1 - B) ** 2) / 2) / ((1 - A) * (1 - B))
Phigher(A, B) = (1 - B) / (2 * (1 - A))
And in this specific case we have:
A = 0.5; B = 0.6
favorable = 0.4 ** 2 / 2 = 0.08
total = 0.5 * 0.4 = 0.20
Phigher(0.5, 0.6) = 0.08 / 0.20 = 0.4
But note that this only works when the cutoff A ≤ B; when A > B, we need to reverse things. That gives us the code:
In [74]:
def Phigher(A, B):
"Probability that a sample from [A..1] is higher than one from [B..1]."
if A <= B:
return (1 - B) / (2 * (1 - A))
else:
return 1 - Phigher(B, A)
In [75]:
Phigher(0.5, 0.6)
Out[75]:
We're now ready to tackle the full game. There are four cases to consider, depending on whether A and B each get a first number that is above or below their cutoff choices:
first a | first b | P(a, b) | P(A wins given a, b) | Comment |
---|---|---|---|---|
a > A | b > B | (1 - A) · (1 - B) | Phigher(A, B) | Both above cutoff; both keep first numbers |
a < A | b < B | A · B | Phigher(0, 0) | Both below cutoff, both get new numbers from [0..1] |
a > A | b < B | (1 - A) · B | Phigher(A, 0) | A keeps number; B gets new number from [0..1] |
a < A | b > B | A · (1 - B) | Phigher(0, B) | A gets new number from [0..1]; B keeps number |
For example, the first row of this table says that the event of both first numbers being above their respective cutoffs has probability (1 - A) · (1 - B), and if this does occur, then the probability of A winning is Phigher(A, B).
We're ready to replace the old simulation-based Pwin
with a new calculation-based version:
In [76]:
def Pwin(A, B):
"With what probability does cutoff A win against cutoff B?"
return ((1-A) * (1-B) * Phigher(A, B) # both above cutoff
+ A * B * Phigher(0, 0) # both below cutoff
+ (1-A) * B * Phigher(A, 0) # A above, B below
+ A * (1-B) * Phigher(0, B)) # A below, B above
In [77]:
Pwin(0.5, 0.6)
Out[77]:
Pwin
relies on a lot of algebra. Let's define a few tests to check for obvious errors:
In [78]:
def test():
assert Phigher(0.5, 0.5) == Phigher(0.75, 0.75) == Phigher(0, 0) == 0.5
assert Pwin(0.5, 0.5) == Pwin(0.75, 0.75) == 0.5
assert Phigher(.6, .5) == 0.6
assert Phigher(.5, .6) == 0.4
return 'ok'
test()
Out[78]:
Let's repeat the calculation with our new, exact Pwin
:
In [79]:
top(10, arange(0.5, 1.0, 0.01))
Out[79]:
It is good to see that the simulation and the exact calculation are in rough agreement; that gives me more confidence in both of them. We see here that 0.62 defeats all the other cutoffs, and 0.61 defeats all cutoffs except 0.62. The great thing about the exact calculation code is that it runs fast, regardless of how much accuracy we want. We can zero in on the range around 0.6:
In [80]:
top(10, arange(0.5, 0.7, 0.001))
Out[80]:
This says 0.618 is best, better than 0.620. We can get even more accuracy:
In [81]:
top(10, arange(0.617, 0.619, 0.000001))
Out[81]:
So 0.618034 is best. Does that number look familiar? Can we prove that it is what I think it is?
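Here's a hint (a numeric check of my own, not a proof): the value is suspiciously close to the golden ratio minus 1, that is, 1/φ.
from math import sqrt
phi = (1 + sqrt(5)) / 2   # the golden ratio, ≈ 1.6180339887
1 / phi, phi - 1          # both ≈ 0.618034, matching the best cutoff found above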
To understand the strategic possibilities, it is helpful to draw a 3D plot of Pwin(A, B)
for values of A and B between 0 and 1:
In [82]:
import numpy as np
from mpl_toolkits.mplot3d.axes3d import Axes3D
def map2(fn, A, B):
"Map fn to corresponding elements of 2D arrays A and B."
return [list(map(fn, Arow, Brow))
for (Arow, Brow) in zip(A, B)]
cutoffs = arange(0.00, 1.00, 0.02)
A, B = np.meshgrid(cutoffs, cutoffs)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.set_xlabel('A')
ax.set_ylabel('B')
ax.set_zlabel('Pwin(A, B)')
ax.plot_surface(A, B, map2(Pwin, A, B));
What does this Pringle of Probability show us? The highest win percentage for A, the peak of the surface, occurs when A is around 0.5 and B is 0 or 1. We can confirm that, finding the maximum Pwin(A, B)
for many different cutoff values of A
and B
:
In [83]:
cutoffs = (set(arange(0.00, 1.00, 0.01)) |
set(arange(0.500, 0.700, 0.001)) |
set(arange(0.61803, 0.61804, 0.000001)))
def Pwin_summary(A, B): return [Pwin(A, B), 'A:', A, 'B:', B]
In [84]:
max(Pwin_summary(A, B) for A in cutoffs for B in cutoffs)
Out[84]:
So A could win 62.5% of the time if only B would choose a cutoff of 0. But, unfortunately for A, a rational player B is not going to do that. We can ask what happens if the game is changed so that player A has to declare a cutoff first, and then player B gets to respond with a cutoff, with full knowledge of A's choice. In other words, what cutoff should A choose to maximize Pwin(A, B)
, given that B is going to take that knowledge and pick a cutoff that minimizes Pwin(A, B)
?
In [85]:
max(min(Pwin_summary(A, B) for B in cutoffs)
for A in cutoffs)
Out[85]:
And what if we run it the other way around, where B chooses a cutoff first, and then A responds?
In [86]:
min(max(Pwin_summary(A, B) for A in cutoffs)
for B in cutoffs)
Out[86]:
In both cases, the rational choice for both players is a cutoff of 0.618034, which corresponds to the "saddle point" in the middle of the plot. This is a stable equilibrium; consider fixing B = 0.618034, and notice that if A changes to any other value, we slip off the saddle to the right or left, resulting in a worse win probability for A. Similarly, if we fix A = 0.618034, then if B changes to another value, we ride up the saddle to a higher win percentage for A, which is worse for B. So neither player will want to move from the saddle point.
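A quick numeric spot-check of that argument (my own check, using the exact Pwin; 0.55 is just an arbitrary alternative cutoff):
golden = 0.618034
(Pwin(golden, golden),   # 0.5: at the saddle point the game is symmetric
 Pwin(0.55, golden),     # slightly below 0.5: A does worse by moving off the saddle
 Pwin(golden, 0.55))     # slightly above 0.5: if B moves off instead, A does better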
The moral for continuous spaces is the same as for discrete spaces: be careful about defining your sample space; measure carefully, and let your code take care of the rest.