In [1]:
# format the book
from __future__ import division, print_function
%matplotlib inline
import sys
sys.path.insert(0, '../code')
import book_format
book_format.load_style('../code')
Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (http://bbdata.yourwildlife.org). The project might seem whimsical, but it is part of an increasing interest in the human microbiome, the set of microorganisms that live on human skin and other parts of the body.
In their pilot study, BBB2 researchers collected swabs from the navels of 60 volunteers, used multiplex pyrosequencing to extract and sequence fragments of 16S rDNA, then identified the species or genus the fragments came from. Each identified fragment is called a “read.”
We can use these data to answer several related questions:
Based on the number of species observed, can we estimate the total number of species in the environment?
Can we estimate the prevalence of each species; that is, the fraction of the total population belonging to each species?
If we are planning to collect additional samples, can we predict how many new species we are likely to discover?
How many additional reads are needed to increase the fraction of observed species to a given threshold?
These questions make up what is called the Unseen Species problem.
I’ll start with a simplified version of the problem where we know that there are exactly three species. Let’s call them lions, tigers and bears. Suppose we visit a wild animal preserve and see 3 lions, 2 tigers and one bear.
If we have an equal chance of observing any animal in the preserve, the number of each species we see is governed by the multinomial distribution. If the prevalences of lions, tigers and bears are p_lion, p_tiger and p_bear, the likelihood of seeing 3 lions, 2 tigers and one bear is proportional to
p_lion**3 * p_tiger**2 * p_bear**1
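As a quick illustration (a sketch, using made-up prevalences rather than estimates from the data), we can evaluate this expression directly:

# Hypothetical prevalences, just for illustration
p_lion, p_tiger, p_bear = 0.5, 0.33, 0.17

# Unnormalized multinomial likelihood of 3 lions, 2 tigers, 1 bear
likelihood = p_lion**3 * p_tiger**2 * p_bear**1
print(likelihood)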
An approach that is tempting, but not correct, is to use beta
distributions, as in Section [beta], to describe the prevalence of each
species separately. For example, we saw 3 lions and 3 non-lions; if we
think of that as 3 “heads” and 3 “tails,” then the posterior
distribution of p_lion
is:
In [3]:
import thinkbayes
beta = thinkbayes.Beta()
beta.Update((3, 3))
print(beta.MaximumLikelihood())
The maximum likelihood estimate for p_lion
is the observed rate, 50%.
Similarly the MLEs for p_tiger
and p_bear
are 33% and 17%.
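For completeness, here is the same computation for the other two species, mirroring the cell above:

# 2 tigers vs. 4 non-tigers
beta_tiger = thinkbayes.Beta()
beta_tiger.Update((2, 4))
print(beta_tiger.MaximumLikelihood())

# 1 bear vs. 5 non-bears
beta_bear = thinkbayes.Beta()
beta_bear.Update((1, 5))
print(beta_bear.MaximumLikelihood())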
But there are two problems:
We have implicitly used a prior for each species that is uniform from 0 to 1, but since we know that there are three species, that prior is not correct. The right prior should have a mean of 1/3, and there should be zero likelihood that any species has a prevalence of 100%.
The distributions for each species are not independent, because the prevalences have to add up to 1. To capture this dependence, we need a joint distribution for the three prevalences.
We can use a Dirichlet distribution to solve both of these problems (see
http://en.wikipedia.org/wiki/Dirichlet_distribution). In the same way
we used the beta distribution to describe the distribution of bias for a
coin, we can use a Dirichlet distribution to describe the joint
distribution of p_lion
, p_tiger
and p_bear
.
The Dirichlet distribution is the multi-dimensional generalization of the beta distribution. Instead of two possible outcomes, like heads and tails, the Dirichlet distribution handles any number of outcomes: in this example, three species.
If there are n
outcomes, the Dirichlet distribution is
described by n
parameters, written $\alpha_1$ through
$\alpha_n$.
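For reference, the Dirichlet density over the prevalences $p_1, \ldots, p_n$ is proportional to
$$p_1^{\alpha_1 - 1} \cdots p_n^{\alpha_n - 1}$$
where each $p_i$ is non-negative and the prevalences sum to 1. With all parameters equal to 1, this distribution is uniform over the allowed prevalences.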
Here’s the definition, from thinkbayes.py
, of a class that
represents a Dirichlet distribution:
class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
n
is the number of dimensions; initially the parameters are
all 1. I use a numpy
array to store the parameters so I can
take advantage of array operations.
Given a Dirichlet distribution, the marginal distribution for each prevalence is a beta distribution, which we can compute like this:
In [4]:
def MarginalBeta(self, i):
    alpha0 = self.params.sum()
    alpha = self.params[i]
    return Beta(alpha, alpha0-alpha)
i
is the index of the marginal distribution we want.
alpha0
is the sum of the parameters; alpha
is
the parameter for the given species.
In the example, the prior marginal distribution for each species is
Beta(1, 2)
. We can compute the prior means like this:
In [6]:
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print(beta.Mean())
As expected, the prior mean prevalence for each species is 1/3.
To update the Dirichlet distribution, we add the observations to the parameters like this:
def Update(self, data):
    m = len(data)
    self.params[:m] += data
data
is a sequence of counts in the same order as
params
, so in this example, it should be the number of
lions, tigers and bears.
data
can be shorter than params
; in that case
there are some species that have not been observed.
Here’s code that updates dirichlet
with the observed data
and computes the posterior marginal distributions.
In [8]:
data = [3, 2, 1]
dirichlet.Update(data)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    pmf = beta.MakePmf()
    print(i, pmf.Mean())
[fig.species1]
Figure [fig.species1] shows the results. The posterior mean prevalences are 44%, 33%, and 22%: the updated parameters are 4, 3, and 2, and each mean is the corresponding parameter divided by their total, 9.
We have solved a simplified version of the problem: if we know how many species there are, we can estimate the prevalence of each.
Now let’s get back to the original problem, estimating the total number of species. To solve this problem I’ll define a meta-Suite, which is a Suite that contains other Suites as hypotheses. In this case, the top-level Suite contains hypotheses about the number of species; the bottom level contains hypotheses about prevalences.
Here’s the class definition:
class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
__init__
takes a list of possible values for n
and makes
a list of Dirichlet objects.
Here’s the code that creates the top-level suite:
In [19]:
from species import Species
ns = range(3, 30)
suite = Species(ns)
ns
is the list of possible values for n
. We
have seen 3 species, so there have to be at least that many. I chose an
upper bound that seems reasonable, but we will check later that the
probability of exceeding this bound is low. And at least initially we
assume that any value in this range is equally likely.
To update a hierarchical model, you have to update all levels. Usually you have to update the bottom level first and work up, but in this case we can update the top level first:
# class Species

def Update(self, data):
    thinkbayes.Suite.Update(self, data)
    for hypo in self.Values():
        hypo.Update(data)
Species.Update
invokes Update
in the parent
class, then loops through the sub-hypotheses and updates them.
Now all we need is a likelihood function:
# class Species
def Likelihood(self, data, hypo):
    dirichlet = hypo
    like = 0
    for i in range(1000):
        like += dirichlet.Likelihood(data)
    return like
data
is a sequence of observed counts; hypo
is
a Dirichlet object. Species.Likelihood
calls
Dirichlet.Likelihood
1000 times and returns the total.
Why call it 1000 times? Because Dirichlet.Likelihood
doesn’t actually compute the likelihood of the data under the whole
Dirichlet distribution. Instead, it draws one sample from the
hypothetical distribution and computes the likelihood of the data under
the sampled set of prevalences.
Here’s what it looks like:
# class Dirichlet
def Likelihood(self, data):
    m = len(data)
    if self.n < m:
        return 0

    x = data
    p = self.Random()
    q = p[:m]**x
    return q.prod()
The length of data
is the number of species observed. If we
see more species than we thought existed, the likelihood is 0.
Otherwise we select a random set of prevalences, p
, and
compute the multinomial PMF, which is
$$c_x p_1^{x_1} \cdots p_n^{x_n}$$
where $p_i$ is the prevalence of the $i$th species, and $x_i$ is the observed number. The first term, $c_x$, is the multinomial coefficient; I leave it out of the computation because it is a multiplicative factor that depends only on the data, not the hypothesis, so it gets normalized away (see http://en.wikipedia.org/wiki/Multinomial_distribution).
m
is the number of observed species. We only need the first
m
elements of p
; for the others, $x_i$ is 0,
so $p_i^{x_i}$ is 1, and we can leave them out of the product.
There are two ways to generate a random sample from a Dirichlet distribution. One is to use the marginal beta distributions, but in that case you have to select one at a time and scale the rest so they add up to 1 (see http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation).
A less obvious, but faster, way is to select values from n
gamma distributions, then normalize by dividing through by the total.
Here’s the code:
# class Dirichlet
def Random(self):
    p = numpy.random.gamma(self.params)
    return p / p.sum()
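As a quick sketch of how this is used (not a cell from the original text), we can draw one random set of prevalences and confirm that they sum to 1:

# Draw one random set of prevalences for the running example
dirichlet = thinkbayes.Dirichlet(3)
dirichlet.Update([3, 2, 1])
p = dirichlet.Random()
print(p, p.sum())    # three prevalences that sum to 1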
Now we’re ready to look at some results. Here is the code that extracts
the posterior distribution of n
:
def DistOfN(self):
    pmf = thinkbayes.Pmf()
    for hypo, prob in self.Items():
        pmf.Set(hypo.n, prob)
    return pmf
DistOfN
iterates through the top-level hypotheses and
accumulates the probability of each n
.
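Here is a minimal usage sketch, assuming the Species class above is in scope (with 1000 samples per hypothesis, the update can take a little while):

# Update the top-level suite with the running example and
# extract the posterior distribution of n
suite = Species(range(3, 30))
suite.Update([3, 2, 1])
pmf_n = suite.DistOfN()
for n, prob in sorted(pmf_n.Items()):
    print(n, prob)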
[fig.species2]
Figure [fig.species2] shows the result. The most likely value is 4. Values from 3 to 7 are reasonably likely; after that the probabilities drop off quickly. The probability that there are 29 species is low enough to be negligible; if we chose a higher bound, we would get nearly the same result.
Remember that this result is based on a uniform prior for
n
. If we have background information about the number of
species in the environment, we might choose a different prior.
I have to admit that I am proud of this example. The Unseen Species problem is not easy, and I think this solution is simple and clear, and takes surprisingly few lines of code (about 50 so far).
The only problem is that it is slow. It’s good enough for the example with only 3 observed species, but not good enough for the belly button data, with more than 100 species in some samples.
The next few sections present a series of optimizations we need to make this solution scale. Before we get into the details, here’s a road map.
The first step is to recognize that if we update the Dirichlet
distributions with the same data, the first m
parameters are the same for all of them. The only difference is the
number of hypothetical unseen species. So we don’t really need
n
Dirichlet objects; we can store the parameters in the
top level of the hierarchy. Species2
implements this
optimization.
Species2
also uses the same set of random values for
all of the hypotheses. This saves time generating random values, but
it has a second benefit that turns out to be more important: by
giving all hypotheses the same selection from the sample space, we
make the comparison between the hypotheses more fair, so it takes
fewer iterations to converge.
Even with these changes there is a major performance problem. As the number of observed species increases, the array of random prevalences gets bigger, and the chance of choosing one that is approximately right becomes small. So the vast majority of iterations yield small likelihoods that don’t contribute much to the total, and don’t discriminate between hypotheses.
The solution is to do the updates one species at a time.
Species4
is a simple implementation of this strategy
using Dirichlet objects to represent the sub-hypotheses.
Finally, Species5
combines the sub-hypotheses into the
top level and uses numpy
array operations to speed
things up.
If you are not interested in the details, feel free to skip to Section [belly] where we look at results from the belly button data.
All of the bottom-level Dirichlet distributions are updated with the
same data, so the first m
parameters are the same for all
of them. We can eliminate them and merge the parameters into the
top-level suite. Species2
implements this optimization:
class Species2(object):

    def __init__(self, ns):
        self.ns = ns
        self.high = max(ns)
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        self.params = numpy.ones(self.high, dtype=numpy.int)
ns
is the list of hypothetical values for n
;
probs
is the list of corresponding probabilities. And
params
is the sequence of Dirichlet parameters, initially
all 1.
Species2.Update
updates both levels of the hierarchy: first
the probability for each value of n
, then the Dirichlet
parameters:
# class Species2
def Update(self, data):
    like = numpy.zeros(len(self.ns), dtype=numpy.double)
    for i in range(1000):
        like += self.SampleLikelihood(data)

    self.probs *= like
    self.probs /= self.probs.sum()

    m = len(data)
    self.params[:m] += data
SampleLikelihood
returns an array of likelihoods, one for
each value of n
. like
accumulates the total
likelihood for 1000 samples. self.probs
is multiplied by
the total likelihood, then normalized. The last two lines, which update
the parameters, are the same as in Dirichlet.Update
.
Now let’s look at SampleLikelihood
. There are two
opportunities for optimization here:
When the hypothetical number of species, n
, exceeds the
observed number, m
, we only need the first
m
terms of the multinomial PMF; the rest are 1.
If the number of species is large, the likelihood of the data might be too small for floating-point (see [underflow]). So it is safer to compute log-likelihoods.
Again, the multinomial PMF is
$$c_x p_1^{x_1} \cdots p_n^{x_n}$$
So the log-likelihood is
$$\log c_x + x_1 \log p_1 + \cdots + x_n \log p_n$$
which is fast and easy to compute. Again, $c_x$ is the same for all hypotheses, so we can drop it. Here’s the code:
# class Species2
def SampleLikelihood(self, data):
    gammas = numpy.random.gamma(self.params)

    m = len(data)
    row = gammas[:m]
    col = numpy.cumsum(gammas)

    log_likes = []
    for n in self.ns:
        ps = row / col[n-1]
        terms = data * numpy.log(ps)
        log_like = terms.sum()
        log_likes.append(log_like)

    log_likes -= numpy.max(log_likes)
    likes = numpy.exp(log_likes)

    coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
    likes *= coefs

    return likes
gammas
is an array of values from a gamma distribution; its
length is the largest hypothetical value of n
.
row
is just the first m
elements of
gammas
; since these are the only elements that depend on
the data, they are the only ones we need.
For each value of n
we need to divide row
by
the total of the first n
values from gamma
.
cumsum
computes these cumulative sums and stores them in
col
.
The loop iterates through the values of n
and accumulates a
list of log-likelihoods.
Inside the loop, ps
contains the row of probabilities,
normalized with the appropriate cumulative sum. terms
contains the terms of the summation, $x_i \log p_i$, and log_like
contains their sum.
After the loop, we want to convert the log-likelihoods to linear likelihoods, but first it’s a good idea to shift them so the largest log-likelihood is 0; that way the linear likelihoods are not too small (see [underflow]).
Finally, before we return the likelihood, we have to apply a correction
factor, which is the number of ways we could have observed these
m
species, if the total number of species is
n
. BinomialCoefficient
computes “n choose m”,
which is written $\binom{n}{m}$.
As often happens, the optimized version is less readable and more error-prone than the original. But that’s one reason I think it is a good idea to start with the simple version; we can use it for regression testing. I plotted results from both versions and confirmed that they are approximately equal, and that they converge as the number of iterations increases.
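Here is a sketch of that kind of regression check (illustrative only; both versions are Monte Carlo estimates, so the posterior probabilities should agree only approximately):

# Compare the posterior probability of each n from both versions
ns = range(3, 30)
data = [3, 2, 1]

suite = Species(ns)
suite.Update(data)
pmf = suite.DistOfN()

suite2 = Species2(ns)
suite2.Update(data)

for n, prob in zip(suite2.ns, suite2.probs):
    print(n, prob, pmf.Prob(n))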
There’s more we could do to optimize this code, but there’s another problem we need to fix first. As the number of observed species increases, this version gets noisier and takes more iterations to converge on a good answer.
The problem is that if the prevalences we choose from the Dirichlet
distribution, the ps
, are not at least approximately right,
the likelihood of the observed data is close to zero and almost equally
bad for all values of n
. So most iterations don’t provide
any useful contribution to the total likelihood. And as the number of
observed species, m
, gets large, the probability of
choosing ps
with non-negligible likelihood gets small.
Really small.
Fortunately, there is a solution. Remember that if you observe a set of data, you can update the prior distribution with the entire dataset, or you can break it up into a series of updates with subsets of the data, and the result is the same either way.
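Here is a quick illustration of that principle, using the Beta class from earlier in the chapter: one batch update and the same data split into two updates yield identical posteriors.

b1 = thinkbayes.Beta()
b1.Update((3, 3))          # all of the data at once

b2 = thinkbayes.Beta()
b2.Update((2, 1))          # first subset
b2.Update((1, 2))          # second subset

print(b1.Mean(), b2.Mean())    # the two posteriors are the same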
For this example, the key is to perform the updates one species at a
time. That way when we generate a random set of ps
, only
one of them affects the computed likelihood, so the chance of choosing a
good one is much better.
Here’s a new version that updates one species at a time:
class Species4(Species):

    def Update(self, data):
        m = len(data)
        for i in range(m):
            one = numpy.zeros(i+1)
            one[i] = data[i]
            Species.Update(self, one)
This version inherits __init__
from Species
, so it
represents the hypotheses as a list of Dirichlet objects (unlike
Species2
).
Update
loops through the observed species and makes an
array, one
, with all zeros and one species count. Then it
calls Update
in the parent class, which computes the
likelihoods and updates the sub-hypotheses.
So in the running example, we do three updates. The first is something like “I have seen three lions.” The second is “I have seen two tigers and no additional lions.” And the third is “I have seen one bear and no more lions and tigers.”
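In code, the three partial-data arrays passed to Species.Update look like this (a small sketch of what Species4.Update constructs for data = [3, 2, 1]):

import numpy

data = [3, 2, 1]
for i in range(len(data)):
    one = numpy.zeros(i+1)
    one[i] = data[i]
    print(one)    # prints [3.], then [0. 2.], then [0. 0. 1.]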
Here’s the new version of Likelihood
:
# class Species4
def Likelihood(self, data, hypo):
    dirichlet = hypo
    like = 0
    for i in range(self.iterations):
        like += dirichlet.Likelihood(data)

    # correct for the number of unseen species the new one
    # could have been
    m = len(data)
    num_unseen = dirichlet.n - m + 1
    like *= num_unseen

    return like
This is almost the same as Species.Likelihood
. The
difference is the factor, num_unseen
. This correction is necessary
because each time we see a species for the first time, we have to
consider that there were some number of other unseen species that we
might have seen. For larger values of n
there are more
unseen species that we could have seen, which increases the likelihood
of the data.
This is a subtle point and I have to admit that I did not get it right the first time. But again I was able to validate this version by comparing it to the previous versions.
Performing the updates one species at a time solves one problem, but it creates another. Each update takes time proportional to $k m$, where $k$ is the number of hypotheses and $m$ is the number of observed species. So if we do $m$ updates, the total run time is proportional to $k m^2$.
But we can speed things up using the same trick we used in
Section [collapsing]: we’ll get rid of the Dirichlet objects and
collapse the two levels of the hierarchy into a single object. So here’s
yet another version of Species
:
class Species5(Species2):

    def Update(self, data):
        m = len(data)
        for i in range(m):
            self.UpdateOne(i+1, data[i])
            self.params[i] += data[i]
This version inherits __init__
from Species2
, so it uses
ns
and probs
to represent the distribution of
n
, and params
to represent the parameters of
the Dirichlet distribution.
Update
is similar to what we saw in the previous section.
It loops through the observed species and calls UpdateOne
:
# class Species5
def UpdateOne(self, i, count):
    likes = numpy.zeros(len(self.ns), dtype=numpy.double)
    for _ in range(self.iterations):
        likes += self.SampleLikelihood(i, count)

    unseen_species = [n-i+1 for n in self.ns]
    likes *= unseen_species

    self.probs *= likes
    self.probs /= self.probs.sum()
This function is similar to Species2.Update
, with two
changes:
The interface is different. Instead of the whole dataset, we get
i
, the index of the observed species, and
count
, how many of that species we’ve seen.
We have to apply a correction factor for the number of unseen
species, as in Species4.Likelihood
. The difference here
is that we update all of the likelihoods at once with array
multiplication.
Finally, here’s SampleLikelihood
:
# class Species5
def SampleLikelihood(self, i, count):
    gammas = numpy.random.gamma(self.params)

    sums = numpy.cumsum(gammas)[self.ns[0]-1:]

    ps = gammas[i-1] / sums
    log_likes = numpy.log(ps) * count

    log_likes -= numpy.max(log_likes)
    likes = numpy.exp(log_likes)

    return likes
This is similar to Species2.SampleLikelihood
; the
difference is that each update only includes a single species, so we
don’t need a loop.
The runtime of this function is proportional to the number of hypotheses, $k$. It runs $m$ times, so the run time of the update is proportional to $k m$. And the number of iterations we need to get an accurate result is usually small.
That’s enough about lions and tigers and bears. Let’s get back to belly buttons. To get a sense of what the data look like, consider subject B1242, whose sample of 400 reads yielded 61 species with the following counts:
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
There are a few dominant species that make up a large fraction of the whole, but many species that yielded only a single read. The number of these “singletons” suggests that there are likely to be at least a few unseen species.
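As a sanity check (a sketch that just summarizes the list above), the counts add up as described:

counts = [92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
          4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2] + [1] * 30

print(len(counts))                        # 61 species
print(sum(counts))                        # 400 reads
print(sum(1 for c in counts if c == 1))   # 30 singletons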
In the example with lions and tigers, we assume that each animal in the preserve is equally likely to be observed. Similarly, for the belly button data, we assume that each bacterium is equally likely to yield a read.
In reality, each step in the data-collection process might introduce biases. Some species might be more likely to be picked up by a swab, or to yield identifiable amplicons. So when we talk about the prevalence of each species, we should remember this source of error.
I should also acknowledge that I am using the term “species” loosely. First, bacterial species are not well defined. Second, some reads identify a particular species, others only identify a genus. To be more precise, I should say “operational taxonomic unit”, or OTU.
Now let’s process some of the belly button data. I define a class called
Subject
to represent information about each subject in the
study:
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
Each subject has a string code, like “B1242”, and a list of (count,
species name) pairs, sorted in increasing order by count.
Subject
provides several methods to make it easy to access
these counts and species names. You can see the details in
http://thinkbayes.com/species.py. For more information see
Section [download].
[species-ndist]
Subject
provides a method named Process
that
creates and updates a Species5
suite, which represents the
distributions of n
and the prevalences.
And Species2 provides DistOfN, which returns the posterior distribution of n.
# class Species2

def DistOfN(self):
    items = zip(self.ns, self.probs)
    pmf = thinkbayes.MakePmfFromItems(items)
    return pmf
Figure [species-ndist] shows the distribution of n
for
subject B1242. The probability that there are exactly 61 species, and no
unseen species, is nearly zero. The most likely value is 72, with 90%
credible interval 66 to 79. At the high end, it is unlikely that there
are as many as 87 species.
Next we compute the posterior distribution of prevalence for each
species. Species2
provides DistOfPrevalence
:
# class Species2
def DistOfPrevalence(self, index):
    metapmf = thinkbayes.Pmf()

    for n, prob in zip(self.ns, self.probs):
        beta = self.MarginalBeta(n, index)
        pmf = beta.MakePmf()
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return metapmf, mix
index
indicates which species we want. For each
n
, we have a different posterior distribution of
prevalence.
[species-prev]
The loop iterates through the possible values of n
and
their probabilities. For each value of n
it gets a Beta
object representing the marginal distribution for the indicated species.
Remember that Beta objects contain the parameters alpha
and
beta
; they don’t have values and probabilities like a Pmf,
but they provide MakePmf
, which generates a discrete
approximation to the continuous beta distribution.
metapmf
is a meta-Pmf that contains the distributions of
prevalence, conditioned on n
. MakeMixture
combines the meta-Pmf into mix
, which combines the
conditional distributions into a single distribution of prevalence.
Figure [species-prev] shows results for the five species with the most reads. The most prevalent species accounts for 23% of the 400 reads, but since there are almost certainly unseen species, the most likely estimate for its prevalence is 20%, with 90% credible interval between 17% and 23%.
[species-rare]
I introduced the hidden species problem in the form of four related
questions. We have answered the first two by computing the posterior
distribution for n
and the prevalence of each species.
The other two questions are:
If we are planning to collect additional reads, can we predict how many new species we are likely to discover?
How many additional reads are needed to increase the fraction of observed species to a given threshold?
To answer predictive questions like this we can use the posterior distributions to simulate possible future events and compute predictive distributions for the number of species, and fraction of the total, we are likely to see.
The kernel of these simulations looks like this:
Choose n
from its posterior distribution.
Choose a prevalence for each species, including possible unseen species, using the Dirichlet distribution.
Generate a random sequence of future observations.
Compute the number of new species, num_new
, as a function of the
number of additional reads, k
.
Repeat the previous steps and accumulate the joint distribution of
num_new
and k
.
And here’s the code. RunSimulation
runs a single
simulation:
# class Subject
def RunSimulation(self, num_reads):
    m, seen = self.GetSeenSpecies()
    n, observations = self.GenerateObservations(num_reads)

    curve = []
    for k, obs in enumerate(observations):
        seen.add(obs)
        num_new = len(seen) - m
        curve.append((k+1, num_new))

    return curve
num_reads
is the number of additional reads to simulate.
m
is the number of seen species, and seen
is a
set of strings with a unique name for each species. n
is a
random value from the posterior distribution, and
observations
is a random sequence of species names.
Each time through the loop, we add the new observation to
seen
and record the number of reads and the number of new
species so far.
The result of RunSimulation
is a rarefaction
curve, represented as a list of pairs with the number of reads
and the number of new species.
Before we see the results, let’s look at GetSeenSpecies
and
GenerateObservations
.
# class Subject

def GetSeenSpecies(self):
    names = self.GetNames()
    m = len(names)
    seen = set(SpeciesGenerator(names, m))
    return m, seen
GetNames
returns the list of species names that appear in
the data files, but for many subjects these names are not unique. So I
use SpeciesGenerator
to extend each name with a serial
number:
In [20]:
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
Given a name like Corynebacterium, SpeciesGenerator yields Corynebacterium-0. When the list of names is exhausted, it yields names like unseen-62.
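For example, here is a small sketch of the generator’s behavior (the names are just placeholders):

names = ['Corynebacterium', 'Staphylococcus']
for name in SpeciesGenerator(names, 4):
    print(name)
# Corynebacterium-0
# Staphylococcus-1
# unseen-2
# unseen-3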
Here is GenerateObservations
:
# class Subject
def GenerateObservations(self, num_reads):
    n, prevalences = self.suite.SamplePosterior()

    names = self.GetNames()
    name_iter = SpeciesGenerator(names, n)

    d = dict(zip(name_iter, prevalences))
    cdf = thinkbayes.MakeCdfFromDict(d)
    observations = cdf.Sample(num_reads)

    return n, observations
Again, num_reads
is the number of additional reads to generate.
n
and prevalences
are samples from the
posterior distribution.
cdf
is a Cdf object that maps species names, including the
unseen, to cumulative probabilities. Using a Cdf makes it efficient to
generate a random sequence of species names.
Finally, here is Species2.SamplePosterior
:
def SamplePosterior(self):
    pmf = self.DistOfN()
    n = pmf.Random()
    prevalences = self.SamplePrevalences(n)
    return n, prevalences
And SamplePrevalences
, which generates a sample of
prevalences conditioned on n
:
# class Species2
def SamplePrevalences(self, n):
    params = self.params[:n]
    gammas = numpy.random.gamma(params)
    gammas /= gammas.sum()
    return gammas
We saw this algorithm for generating random values from a Dirichlet distribution in Section [randomdir].
Figure [species-rare] shows 100 simulated rarefaction curves for subject B1242. The curves are “jittered;” that is, I shifted each curve by a random offset so they would not all overlap. By inspection we can estimate that after 400 more reads we are likely to find 2–6 new species.
In [21]:
def MakeJointPredictive(curves):
    joint = thinkbayes.Joint()
    for curve in curves:
        for k, num_new in curve:
            joint.Incr((k, num_new))
    joint.Normalize()
    return joint
MakeJointPredictive
makes a Joint object, which is a Pmf
whose values are tuples.
curves
is a list of rarefaction curves created by
RunSimulation
. Each curve contains a list of pairs of
k
and num_new
.
The resulting joint distribution is a map from each pair to its
probability of occurring. Given the joint distribution, we can use
Joint.Conditional to get the distribution of num_new
conditioned on k
(see Section [conditional]).
Subject.MakeConditionals
takes a list of ks
and computes the conditional distribution of num_new
for each
k
. The result is a list of Cdf objects.
In [22]:
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)

    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)

    return cdfs
Figure [species-cond] shows the results. After 100 reads, the median predicted number of new species is 2; the 90% credible interval is 0 to
[species-frac]
The last question we want to answer is, “How many additional reads are needed to increase the fraction of observed species to a given threshold?”
To answer this question, we need a version of RunSimulation
that computes the fraction of observed species rather than the number of
new species.
# class Subject
def RunSimulation(self, num_reads):
    m, seen = self.GetSeenSpecies()
    n, observations = self.GenerateObservations(num_reads)

    curve = []
    for k, obs in enumerate(observations):
        seen.add(obs)
        frac_seen = len(seen) / float(n)
        curve.append((k+1, frac_seen))

    return curve
Next we loop through each curve and make a dictionary, d
,
that maps from the number of additional reads, k
, to a list
of fracs
; that is, a list of values for the coverage
achieved after k
reads.
def MakeFracCdfs(self, curves):
    d = {}
    for curve in curves:
        for k, frac in curve:
            d.setdefault(k, []).append(frac)

    cdfs = {}
    for k, fracs in d.iteritems():
        cdf = thinkbayes.MakeCdfFromList(fracs)
        cdfs[k] = cdf

    return cdfs
Then for each value of k
we make a Cdf of
fracs
; this Cdf represents the distribution of coverage
after k
reads.
Remember that the CDF tells you the probability of falling below a given
threshold, so the complementary CDF tells you the
probability of exceeding it. Figure [species-frac] shows complementary
CDFs for a range of values of k
.
To read this figure, select the level of coverage you want to achieve along the $x$-axis. As an example, choose 90%.
Now you can read up the chart to find the probability of achieving 90%
coverage after k
reads. For example, with 200 reads, you
have about a 40% chance of getting 90% coverage. With 1000 reads, you
have a 90% chance of getting 90% coverage.
With that, we have answered the four questions that make up the unseen species problem. To validate the algorithms in this chapter with real data, I had to deal with a few more details. But this chapter is already too long, so I won’t discuss them here.
You can read about the problems, and how I addressed them, at http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html.
You can download the code in this chapter from http://thinkbayes.com/species.py. For more information see Section [download].
The Unseen Species problem is an area of active research, and I believe the algorithm in this chapter is a novel contribution. So in fewer than 200 pages we have made it from the basics of probability to the research frontier. I’m very happy about that.
My goal for this book is to present three related ideas:
Bayesian thinking: The foundation of Bayesian analysis is the idea of using probability distributions to represent uncertain beliefs, using data to update those distributions, and using the results to make predictions and inform decisions.
A computational approach: The premise of this book is that it is easier to understand Bayesian analysis using computation rather than math, and easier to implement Bayesian methods with reusable building blocks that can be rearranged to solve real-world problems quickly.
Iterative modeling: Most real-world problems involve modeling decisions and trade-offs between realism and complexity. It is often impossible to know ahead of time what factors should be included in the model and which can be abstracted away. The best approach is to iterate, starting with simple models and adding complexity gradually, using each model to validate the others.
These ideas are versatile and powerful; they are applicable to problems in every area of science and engineering, from simple examples to topics of current research.
If you made it this far, you should be prepared to apply these tools to new problems relevant to your work. I hope you find them useful; let me know how it goes!