License: Attribution 4.0 International (CC BY 4.0)
In [1]:
from thinkbayes2 import Pmf, Suite
import thinkplot
import math
% matplotlib inline
Suppose we are asked the question: Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?
In order to make our problem easier, we will clarify a few facts about identical twins. Identical twins are known as monozygotic twins, meaning that they both devolop from a single zygote. As a result, monozygotic twins are the same gender, either male-male or female-female. So, we rephrase our question: What percentage of male-male twins are monozygotic.
In addition, here is an important fact: .08% of twins are monozygotic.
We use a tree to visualize the problem.
Assuming we have 100 twins, lets calculate the number of male-male dizygotic, male-male monozygotic, and total number of male-male twins.
In [2]:
# calculate number of male-male dizygotic twins using the percentage of dizygotic and percentage of male-male
DiMM = 100 * .92 * .25
# calculate number of male-male monozygotic twins using the percentage of monozygotic and percentage of male-male
MoMM = 100 * .08 * .5
# calculate total number of male-male twins
TotalMM = DiMM + MoMM
print("Number of male-male dizygotic twins: {}".format(DiMM))
print("Number of male-male monozygotic twins: {}".format(MoMM))
print("Total number of male-male twins: {}".format(TotalMM))
In [3]:
# next we can calculate the fraction of male-male twins that are monozygotic
fractionMoMM = MoMM / TotalMM
percentMoMM = fractionMoMM * 100
print("Percentage of male-male monozygotic twins: {0:.1f}%".format(percentMoMM))
So, we can conclude that Elvis had a 14.8% chance to identical twins with his brother.
However, rather than using a huge tree, we can use Bayes' theorem to make a much more eligant solution. First, assuming we are only dealing with twins, we find must find P(male-male|monozygotic), P(male-male), and P(monozygotic). Then we can calculate P(monozygotic|male-male).
In [4]:
twins = dict()
# first calculate the total percentage of male-male twins. We can do this by adding the percentage of male-male
# monozygotic and the percentage of male-male dizygotic
twins['male-male'] = (.08*.50 + .92*.25)
twins['male-male|monozygotic'] = (.50)
twins['monozygotic'] = (.08)
print(twins['male-male'])
print(twins['male-male|monozygotic'])
print(twins['monozygotic'])
In [5]:
# now using bayes theorem
temp = twins['male-male|monozygotic'] * twins['monozygotic'] / twins['male-male']
print("P(monozygotic|male-male): {0:.3f}".format(temp))
We are given dice 4, 6, 8, 12, and 20 sides. If we roll a a die many times at random, what is the probability that we roll each die.
First we must define the Likelihood function for the dice. In this case, if we roll a number greater than that dice (roll 5 for 4 sided dice), the probability of that being the chosen dice goes to zero. Else, the probability is multiplied by 1 over the number of sides.
In [6]:
class Dice(Suite):
def Likelihood(self, data, hypo):
if hypo < data:
return 0
else:
return 1 / hypo
Next create a dice object with dice of 4, 6, 8, 12 and 20 sides
In [7]:
suite = Dice([4, 6, 8, 12, 20])
Roll a 6 and see the probabilities of being each dice
In [8]:
suite.Update(6)
suite.Print()
Now roll a series of numbers
In [9]:
for roll in [6, 8, 7, 7, 5, 4]:
suite.Update(roll)
suite.Print()
For these roles, we see that the 8 sided dice is most probable. It is still possible for the 20 sided dice, but only with a .1% chance.
Railroads number trains from 1 to N. One day you see a train numbered 60. How many trains does the railroad have?
First define the Train suite. The likelihood is the same as the above dice problem. We can think of it like this: each number correspond to a number of trains. If we see train N, then all hypothesis less than N are 0. Else, they are 1 / N.
In [10]:
class Train(Suite):
# hypo is the number of trains
# data is an observed serial number
def Likelihood(self, data, hypo):
if data > hypo:
return 0
else:
return 1 / hypo
Create train object and update with train number 60
In [11]:
hypos = range(1, 1001)
train = Train(hypos)
train.Update(60)
Out[11]:
Plot current probabilities of numbers of trains
In [12]:
thinkplot.Pdf(train)
Because 60 is not actually a good guess, we will compute the mean of the posterior distribution
In [13]:
def Mean(suite):
total = 0
for hypo, prob in suite.Items():
total += hypo * prob
return total
print(Mean(train))
The mean of the posterior distribution is the value that minimizes error. In simpler terms, we get the smallest number (error) when we subtract the actual number of trains from the mean of posterior distribution.
Next, update the train with two more sightings, 50 and 90
In [14]:
for data in [50, 90]:
train.Update(data)
print(Mean(train))
thinkplot.Pdf(train)
After the two updates, the error minimizing value has gone down to 164.
At the start of the problem, we assumed that there was an equal chance to any number of trains. However, most rail companies don't have thousands of trains. To better represent this fact, we can give each hypotheses greater for smaller numbers of trains.
In [15]:
class Train2(Dice):
def __init__(self, hypos, alpha=1.0):
Pmf.__init__(self)
for hypo in hypos:
self.Set(hypo, hypo**(-alpha))
self.Normalize()
In [16]:
hypos2 = range(1, 1001)
train2 = Train2(hypos2)
In [17]:
thinkplot.Pmf(train2)
In [18]:
for data in [50, 60, 90]:
train2.Update(data)
thinkplot.Pmf(train2)
We initally thought that givin lower number of trains higher probabilities would give us a more accurate result. However, over just a few data points, we get a nearly identical graph to the one with linearly represented hypotheses.
Suppose you are a student who goes to various classes. Every morning you wake up and put on one of two watches. The first watch is on time. The second watch is 5 minutes slow. If you arrive to class 3 minutes late, what is the probability you wore the slow watch. Assume that arrival times follow the Gaussian function where b is an offset in minutes: $$f(x) = e^{-\frac{(x-b)^2}{32}}$$
First we want to make sure that our gaussian function is a reasonable approximation of arrival time. Below is a plot of the function from 15 minutes late to 15 minutes early. With some quick looks at the graph, you can see that you arrive to class +- 2 minutes around 45% of the time which is reasonable most students.
Next we define our Watch Suite. Our hypotheses will be the watches described above: 'watch 1' is that you used the on time watch 'watch 2' is that you used the 5 minute slow watch
In [19]:
class Watch(Suite):
"""
Maps watch hypotheses to probabilities
"""
def f(x, b):
"""
f is a function that returns a Gaussian Function.
Args:
x (int): the primary variable
b (int): a constant offset used to make fast or slow clocks
"""
return math.exp((-1 * (x-b)**2) / (32))
watch1_probs = dict()
for i in range(-15,15):
watch1_probs[i] = f(i, 0)
watch2_probs = dict()
for i in range(-15,15):
watch2_probs[i] = f(i, -5)
hypotheses = {
'watch 1':watch1_probs,
'watch 2':watch2_probs
}
def __init__(self, hypos):
Pmf.__init__(self)
for hypo in hypos:
self.Set(hypo, 1)
self.Normalize()
def Likelihood(self, data, hypo):
time = self.hypotheses[hypo]
like = time[data]
return like
Next create the two hypotheses. As expected, before we see any class arival data, both watches have equal chances of being worn.
In [20]:
watches = Watch(['watch 1', 'watch 2'])
watches.Print()
As a sanity check, suppose we arrive to class exactly on time, and the next 5 minutes late
In [21]:
for arrival_time in [0,-5]:
watches.Update(arrival_time)
watches.Print()
Our model says that both hypotheses have still have the same probabilility, this makes sense because one hypotheses is centered at 0 and the other at 5.
Now, lets see what happens if we visit 5 more classes
In [22]:
for arrival_time in [0,-2,-2,-3,-5]:
watches.Update(arrival_time)
watches.Print()
After this series of updates, we have a slightly increased chance of using the on time watch. This makes sense because on average, the times have been slightly closer to 0 than -5.
Even though our model performs reasonable close data, it falls apart if you arrive either really late or early. Suppose you arrive to class 11 minutes late
In [23]:
watches.Update(-11)
watches.Print()
Now, the slow watch, is significantly more probable than the on time watch. If we didn't want extreme values to have as big of an impact in updates, we would simply modify the denominator of our Gaussian function