The program TreeMix by Pickrell & Pritchard (2012) is used to infer population splits and admixture from allele frequency data. From the TreeMix documentation: "In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph."
In [18]:
# conda install treemix ipyrad ipcoal -c conda-forge -c bioconda
In [19]:
import ipyrad.analysis as ipa
import toytree
import toyplot
import ipcoal
In [20]:
print('ipyrad', ipa.__version__)
print('toytree', toytree.__version__)
! treemix --version | grep 'TreeMix v. '
In [76]:
# network model
tree = toytree.rtree.unittree(7, treeheight=4e6, seed=123)
tree.draw(ts='o', admixture_edges=(3, 2));
In [78]:
# simulation model
model = ipcoal.Model(tree, Ne=1e4, nsamples=4, admixture_edges=(3, 2, 0.5, 0.2))
model.sim_snps(1000)
model.write_snps_to_hdf5(name="test-treemix", outdir="/tmp", diploid=True)
In [110]:
# the path to your HDF5 formatted snps file
SNPS = "/tmp/test-treemix.snps.hdf5"
In [111]:
IMAP = {
"r0": ["r0-0", "r0-1"],
"r1": ["r1-0", "r1-1"],
"r2": ["r2-0", "r2-1"],
"r3": ["r3-0", "r3-1"],
"r4": ["r4-0", "r4-1"],
"r5": ["r5-0", "r5-1"],
"r6": ["r6-0", "r6-1"],
}
In [139]:
tmx = ipa.treemix(SNPS, imap=IMAP, workdir="/tmp")
In [140]:
tmx.params.root = "r4,r5,r6"
tmx.params.m = 1
tmx.params.global_ = 1
tmx.params
Out[140]:
In [141]:
# the command that will be run
tmx.command
Out[141]:
In [142]:
# execute command
tmx.run()
In [143]:
tmx.results
Out[143]:
In [144]:
canvas1, axes1 = tmx.draw_tree();
canvas2, axes2 = tmx.draw_cov();
In [145]:
# save your plots
import toyplot.svg
toyplot.svg.render(canvas1, "/tmp/treemix-m1.svg")
m
As with structure plots there is no True best value, but you can use model selection methods to decide whether one is a statistically better fit to your data than another. Adding additional admixture edges will always improve the likelihood score, but with diminishing returns as you add additional edges that explain little variation in the data. You can look at the log likelihood score of each model fit by running a for-loop like below. You may want to run this within another for-loop that iterates over different subsampled SNPs.
In [120]:
tests = {}
nadmix = [0, 1, 2, 3, 4, 5]
# iterate over n admixture edges and store results in a dictionary
for adm in nadmix:
tmx.params.m = adm
tmx.run()
tests[adm] = tmx.results.llik
In [122]:
# plot the likelihood for different values of m
toyplot.plot(
nadmix,
[tests[i] for i in nadmix],
width=350,
height=275,
stroke_width=3,
xlabel="n admixture edges",
ylabel="ln(likelihood)",
);