In [1]:
import pandas as pd
import numpy as np
from collections import Counter
The formula for calculating the probability that a held-out sequence will be assigned to a given environment is reported incorrectly in Knights et al. 2011. The corrected formula is:
$$P(z_{i} = v \mid z^{\neg i}, x) \propto P(x_{i} \mid v) \times P(v \mid x^{\neg i}) = \begin{cases} \frac{m_{x_{i}v} + \alpha_{1}}{m_{v} + \tau \alpha_{1}} \times \frac{n_{v}^{\neg i} + \beta}{n - 1 + \beta V} & v < V \\ \\ \frac{m_{x_{i}v} + n \alpha_{2}}{m_{v} + \tau n \alpha_{2}} \times \frac{n_{v}^{\neg i} + \beta}{n - 1 + \beta V} & v = V \end{cases}$$
This updated formula is what is actually calculated by both the R SourceTracker code and this repository (personal communication with Dan Knights).
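To make the notation concrete, here is a direct translation of the corrected formula into Python. The function and argument names are ours, and the symbol reading is our interpretation of Knights et al.: m_xiv is the count of taxon x_i in source v, m_v is the total count in source v, tau is the number of taxa, n_v is the number of sink sequences currently assigned to v (excluding the held-out sequence), n is the total number of sink sequences, and V is the number of environments, with v = V being the Unknown.
In [ ]:
def assignment_probability(m_xiv, m_v, n_v, n, tau, V, is_unknown,
                           alpha1, alpha2, beta):
    """Unnormalized P(z_i = v | z^{-i}, x); an illustrative sketch,
    not the sourcetracker2 implementation."""
    if is_unknown:
        # v = V: the Unknown source uses the alpha2 prior, scaled by n
        p_xi_given_v = (m_xiv + n * alpha2) / (m_v + tau * n * alpha2)
    else:
        # v < V: known sources use the alpha1 prior
        p_xi_given_v = (m_xiv + alpha1) / (m_v + tau * alpha1)
    # P(v | x^{-i}): held-out assignment counts smoothed by beta
    p_v_given_rest = (n_v + beta) / (n - 1 + beta * V)
    return p_xi_given_v * p_v_given_rest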
The Gibbs sampler works in four basic steps (a minimal code sketch follows the list):
1. Randomly assign the sequences in a sink to source environments. These random assignments represent which source a given sink sequence came from.
2. Select one of the sequences from step 1, calculate the *actual* probabilities of that sequence having come from any of the source environments, and update the assigned source environment of the sequence based on a random draw with the calculated probabilities. Repeat many times.
3. At intervals in the repeats of step 2, take the source environment assignments of all the sequences in a sink and record them.
4. After doing step 3 a certain number of times (i.e. recording the full assignments of source environments for each sink sequence), terminate the iteration and move to the next sink.
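Before stepping through the details, here is a minimal, self-contained sketch of those four steps. The names (gibbs_sketch, p_tv, etc.) and the simplifications are ours: the real sourcetracker2 implementation walks the sequences in a shuffled order, applies the alpha priors, and grows the Unknown source, all of which are described later in this notebook.
In [ ]:
def gibbs_sketch(p_tv, sink, num_sources, burnin, delay, num_draws, beta=0.01):
    """Sketch of the four steps above; not the sourcetracker2 implementation.

    p_tv: array of shape (num_sources, num_taxa) of P(taxon | source).
    sink: array of taxon indices, one per sink sequence.
    """
    n = len(sink)
    # step 1: randomly assign each sink sequence to a source environment
    assignments = np.random.randint(0, num_sources, size=n)
    draws = []
    for it in range(burnin + delay * num_draws):
        # step 2: hold out one sequence and compute its actual assignment
        # probabilities from the remaining n - 1 assignments
        i = np.random.randint(n)
        counts = np.bincount(np.delete(assignments, i), minlength=num_sources)
        p_v = (counts + beta) / (n - 1 + beta * num_sources)
        probs = p_tv[:, sink[i]] * p_v
        assignments[i] = np.random.choice(num_sources, p=probs / probs.sum())
        # step 3: at intervals after the burn-in, record the full assignment
        if it >= burnin and (it - burnin) % delay == 0:
            draws.append(np.bincount(assignments, minlength=num_sources) / n)
    # step 4: stop after num_draws recordings and average them
    return np.mean(draws, axis=0)

# toy example: 2 sources, 3 taxa, a 5-sequence sink
p_tv_demo = np.array([[0.00, 0.09, 0.91],
                      [0.17, 0.33, 0.50]])
gibbs_sketch(p_tv_demo, np.array([2, 2, 1, 0, 2]), num_sources=2,
             burnin=50, delay=5, num_draws=10)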
In [20]:
table = pd.DataFrame(data=[[0, 10, 100], [3, 6, 9]],
                     columns=['F1', 'F2', 'F3'],
                     index=['Source1', 'Source2'])
table
Out[20]:
Imagine that the above table represents a pair of urns (Source1 and Source2) and that each urn contains billiards of 3 colors (F1 through F3). Now, let's say you close your eyes and randomly select an urn, say Source2. You reach into that urn and withdraw a random billiard. What is the probability that you have selected a billiard of color F1?
We can make this conditional probability calculation very simply by dividing each feature count within a source by the total sequence count for that source. The result is just the relative abundance of each feature within each source.
In [26]:
# normalize each source (row) to relative abundances
table = table.apply(lambda row: row / row.sum(), axis=1)
table
Out[26]:
Now we can calculate $P(t \mid v)$. Let's imagine that we saw F1 in our sink. What is the probability that F1 came from either Source1 or Source2? From the table above we can see that F1 was never observed in Source1, so all of its probability should land on Source2. To get these per-feature probabilities, we divide each entry of the table by its column sum:
In [28]:
# calculate p_tv by dividing each column by its sum
table_p_tv = table.div(table.sum(axis=0), axis=1)
table_p_tv
Out[28]:
This table shows $P(t \mid v)$ for each $t, v$ (i.e. each feature and source). To check our understanding, we can think about the values in this matrix. If we see F1 in the sink sample, it could only have come from Source2, and therefore the p_tv for Source2 (the probability that F1 came from Source2) is 100%. If F3 was seen in our sink, there would be a 64.5% chance of it coming from Source1 and a 35.5% chance of it coming from Source2.
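As a sanity check, we can reproduce the F3 column by hand from the relative abundances computed above:
In [ ]:
# F3's relative abundance is 100/110 in Source1 and 9/18 in Source2
f3 = np.array([100 / 110, 9 / 18])
f3 / f3.sum()  # approximately [0.645, 0.355]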
The calculation for p_tv really is about this simple. We add a few tuning parameters to this calculation to ensure that more of the probability space is explored. In the biological context, the idea is that because we sequence only a small fraction of each source environment, we are likely to miss taxa that are truly present in some sources. The parameters described below help us correct for this.
Alpha1
You may note above that F1's abundance in Source1 is 0, and therefore no probability can be assigned to Source1 for that feature. SourceTracker2 therefore allows the user to specify an alpha1 value on the command line, which is a prior count added to each entry of the source table. This provides a small amount of probability for F1 in Source1; assuming an alpha1 of 0.1, the table would look like:
In [44]:
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1]],
                     columns=['F1', 'F2', 'F3'],
                     index=['Source1', 'Source2'])
table
Out[44]:
Unknown and Alpha2
A key feature of SourceTracker2 is the creation of an Unknown source environment. The basic premise is to capture sequences that don't have a high probability of being derived from any of the known, user-specified source environments. The Unknown community is seeded with counts via the alpha2 parameter, which specifies what percentage of the total sink sequence count to add to each OTU in the Unknown. For example, a sink with 100 sequences and an alpha2 of 0.001 would add 100 * 0.001 = 0.1 counts to each OTU, and the table would look like this:
In [45]:
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1], [0.1, 0.1, 0.1]],
                     columns=['F1', 'F2', 'F3'],
                     index=['Source1', 'Source2', 'Unknown'])
table
Out[45]:
In [46]:
# calculate p_tv for each feature, this time including the Unknown source
table_p_tv = table.div(table.sum(axis=0), axis=1)
table_p_tv
Out[46]:
These are the precalculated p_tv values we can use.
The second probability that the Gibbs sampler must calculate is $P(v)$, the probability that a sequence came from a given source. In the code, we refer to this value as p_v. In the internal loops of the Gibbs function, we randomly assign the sequences from a sink to source environments. Then, one at a time, we remove a sequence and reassign its environment of origin. p_v is the probability that the currently held-out sequence came from each of the environments.
We'll walk through an example below. Each sink sample can be thought of as a collection of sequences, each identifying a taxon:
In [47]:
sink_otus = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
Our sink sample contains 10 sequences, each assigned to one of the features in our table. Let's forget for a moment that each feature has a certain probability of being found in a given source environment (the p_tv), and instead focus on the number of possible source environments we have. If our sink sample came with no feature information at all, it would look like:
In [48]:
sink = ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X']
And if we only knew that we had 3 source environments that each sink sequence must have come from, then we would say that each sink sequence had a 1/3 probability of being assigned to each source.
Let's randomly assign each sink sequence to a source environment:
In [49]:
sink_source = [3, 1, 2, 3, 2, 3, 3, 1, 2, 1]
# where 1 = Source1, 2 = Source2, and 3 = Unknown
To calculate our p_v, we count the number of sequences assigned to each environment and divide by the total.
In [50]:
# count assignments to each environment (1-3) and divide by the total
p_v = np.bincount(sink_source, minlength=4)[1:] / len(sink_source)
print(p_v)
During the internal loops of the Gibbs function, we'll end up withdrawing a sequence from the sink_source vector and then recalculating p_v. For example, let's say we are on the 5th iteration, so we withdraw the 5th sequence (index 4):
In [51]:
sink_source_copy = sink_source[:]
env_of_fifth_sequence = sink_source_copy.pop(4)
# recalculate p_v; the denominator is now n - 1, matching the formula
p_v = np.bincount(sink_source_copy, minlength=4)[1:] / len(sink_source_copy)
print(p_v)
We are calculating $P(v \mid x^{\neg i})$ in the overall SourceTracker2 formula (seen in the first cell). In the actual code, p_v is multiplied by p_tv to create a new probability vector for identifying where the withdrawn sequence actually came from.
As with p_tv, there are tuning parameters that we add to make the model explore a larger portion of probability space.
Beta
Beta is added to the count of each environment, preventing any environment from ending up with zero probability mass. In many ways it is like the alpha1 prior that we add to the source matrix data.
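As a small numeric sketch (beta = 0.01 is an illustrative value here, not a recommended setting), this is the second factor of the formula applied to the held-out counts from above:
In [ ]:
beta = 0.01  # illustrative value
V = 3        # number of environments, including the Unknown
counts = np.bincount(sink_source_copy, minlength=4)[1:]  # [3, 2, 4]
# (n_v + beta) / (n - 1 + beta * V); len(sink_source_copy) is n - 1
(counts + beta) / (len(sink_source_copy) + beta * V)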
Restarts
When we restart the Gibbs sampler, we reshuffle the assignments of each sink sequence to a source environment and begin our calculations again, essentially redoing all the steps above. Each restart produces an independent Markov chain in which the proportions can converge. Since there is high correlation between adjacent states of a Markov chain, the restarts and delay parameters of SourceTracker2 are used to avoid recording correlated draws.
In [52]:
sink_otus = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
sink_otus_positions_shuffled = [4, 1, 8, 2, 7, 9, 0, 3, 5, 6]
The next thing we do is create a shuffled order in which to walk through our sink sequences. The order above says that the first sequence whose source environment we want to predict is the one at index 4, which corresponds to F2 (sink_otus[4] == 2).
In [53]:
# p_tv for F2, taken from the table above
# (order: Source1, Source2, Unknown)
p_tv = np.array([0.619632, 0.374233, 0.006135])
# p_v recalculated after withdrawing the sequence at index 4
p_v = np.array([3, 2, 4]) / 9
combined_probability = p_tv * p_v
combined_probability / combined_probability.sum()
Out[53]:
We have now calculated the probability that the sink sequence at index 4 belongs to each source environment and to the Unknown. We can then randomly assign that sequence to one of the environments according to those probabilities, and we update our p_v counts by adding 1 to the selected source.
If the sequence gets assigned to the Unknown community, the Unknown community also gets an extra count in the table. In this way, the Unknown community gets built up through repeated iterations. No changes are made to the table if the sequence gets assigned to an identified source (Source1 or Source2).
Let's assume the unlikely case that our sequence did get assigned to the Unknown.
In [54]:
# update the table with the new F2 count in the Unknown source
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1], [0.1, 1.1, 0.1]],
                     columns=['F1', 'F2', 'F3'],
                     index=['Source1', 'Source2', 'Unknown'])
table
Out[54]:
In [55]:
# and recalculate the p_tv values, again including the Unknown row
table_p_tv = table.div(table.sum(axis=0), axis=1)
table_p_tv
Out[55]:
In [56]:
# update the p_v counts:
# the sequence at index 4 is now assigned to environment 3 (the Unknown)
sink_source = [3, 1, 2, 3, 3, 3, 3, 1, 2, 1]
Counter(sink_source)
Out[56]:
And now:
In [57]:
sink_otus = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
sink_otus_positions_shuffled = [4, 1, 8, 2, 7, 9, 0, 3, 5, 6]
We'd move on to index position 1, and repeat the entire process
After so many iterations, or burn ins, we can start to record the p_v source environment counts as the percent contribution of each Source to our sink sample. The delay determines how many iterations to skip between draws after the burnin period. The draws_per_restart indicate how many draws to save from each restart. All draws are then averaged together to provide a final mixing proportion.