ipyrad-analysis toolkit: treemix

The program TreeMix by Pickrell & Pritchard (2012) is used to infer population splits and admixture from allele frequency data. From the TreeMix documentation: "In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph."

Required software


In [18]:
# conda install treemix ipyrad ipcoal -c conda-forge -c bioconda

In [19]:
import ipyrad.analysis as ipa
import toytree
import toyplot
import ipcoal

In [20]:
print('ipyrad', ipa.__version__)
print('toytree', toytree.__version__)
! treemix --version | grep 'TreeMix v. '


ipyrad 0.9.54
toytree 2.0.1
TreeMix v. 1.12

Simulate example data


In [76]:
# network model
tree = toytree.rtree.unittree(7, treeheight=4e6, seed=123)
tree.draw(ts='o', admixture_edges=(3, 2));


r0r1r2r3r4r5r6

In [78]:
# simulation model
model = ipcoal.Model(tree, Ne=1e4, nsamples=4, admixture_edges=(3, 2, 0.5, 0.2))
model.sim_snps(1000)
model.write_snps_to_hdf5(name="test-treemix", outdir="/tmp", diploid=True)


wrote 1000 SNPs to /tmp/test-treemix.snps.hdf5

Input data file


In [110]:
# the path to your HDF5 formatted snps file
SNPS = "/tmp/test-treemix.snps.hdf5"

Population assignments


In [111]:
IMAP = {
    "r0": ["r0-0", "r0-1"],
    "r1": ["r1-0", "r1-1"],
    "r2": ["r2-0", "r2-1"],
    "r3": ["r3-0", "r3-1"],
    "r4": ["r4-0", "r4-1"],
    "r5": ["r5-0", "r5-1"],
    "r6": ["r6-0", "r6-1"],
}

Load tool and filter missing data


In [139]:
tmx = ipa.treemix(SNPS, imap=IMAP, workdir="/tmp")


Samples: 14
Sites before filtering: 1000
Filtered (indels): 0
Filtered (bi-allel): 58
Filtered (mincov): 0
Filtered (minmap): 0
Filtered (subsample invariant): 0
Filtered (combined): 58
Sites after filtering: 942
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)
subsampled 942 unlinked SNPs

Set parameters


In [140]:
tmx.params.root = "r4,r5,r6"
tmx.params.m = 1
tmx.params.global_ = 1
tmx.params


Out[140]:
bootstrap   0                   
climb       0                   
cormig      0                   
g           (None, None)        
global_     1                   
k           0                   
m           1                   
noss        0                   
root        r4,r5,r6            
se          0                   
seed        381302990           

Run analysis


In [141]:
# the command that will be run
tmx.command


Out[141]:
'/home/deren/miniconda3/envs/py36/bin/treemix -i /tmp/test.treemix.in.gz -o /tmp/test -m 1 -seed 381302990 -root r4,r5,r6 -global'

In [142]:
# execute command
tmx.run()

Parse results

The result here is not accurate. Perhaps it would improve with more samples per lineage or more SNPs.


In [143]:
tmx.results


Out[143]:
admixture   [(4, 0.123026, 7, 0.0657704, 0.039121)]
cov         [[ 0.193246    0.01681     0.0151671  -0.0440155  -0.0540499  -0.0633512
  -0.0638062 ]
 [ 0.01681     0.169992    0.0569728  -0.0510421  -0.0578917  -0.0671931
  -0.0676481 ]
 [ 0.0151671   0.0569728   0.178207   -0.051358   -0.0613924  -0.0685706
  -0.0690256 ]
 [-0.0440155  -0.0510421  -0.051358    0.156445    0.00681355 -0.00779569
  -0.00904682]
 [-0.0540499  -0.0578917  -0.0613924   0.00681355  0.118506    0.0246328
   0.0233816 ]
 [-0.0633512  -0.0671931  -0.0685706  -0.00779569  0.0246328   0.114058
   0.0682204 ]
 [-0.0638062  -0.0676481  -0.0690256  -0.00904682  0.0233816   0.0682204
   0.117925  ]]
llik        125.367             
tree        (r6:0.0853888,((r3:0.140711,(r2:0.0776395,(r1:0.0450637,r0:0.0504777):0.0657704):0.0430514):0.127177,(r5:0.111235,r4:0.123026):0.0474548):0.0853888);

In [144]:
canvas1, axes1 = tmx.draw_tree();
canvas2, axes2 = tmx.draw_cov();


r0r1r2r3r4r5r60.00.20.4
0.1932460.0168100.015167-0.044015-0.054050-0.063351-0.063806r60.0168100.1699920.056973-0.051042-0.057892-0.067193-0.067648r50.0151670.0569730.178207-0.051358-0.061392-0.068571-0.069026r4-0.044015-0.051042-0.0513580.1564450.006814-0.007796-0.009047r3-0.054050-0.057892-0.0613920.0068140.1185060.0246330.023382r2-0.063351-0.067193-0.068571-0.0077960.0246330.1140580.068220r1-0.063806-0.067648-0.069026-0.0090470.0233820.0682200.117925r0r6r5r4r3r2r1r0

In [145]:
# save your plots
import toyplot.svg
toyplot.svg.render(canvas1, "/tmp/treemix-m1.svg")

Finding the best value for m

As with structure plots there is no True best value, but you can use model selection methods to decide whether one is a statistically better fit to your data than another. Adding additional admixture edges will always improve the likelihood score, but with diminishing returns as you add additional edges that explain little variation in the data. You can look at the log likelihood score of each model fit by running a for-loop like below. You may want to run this within another for-loop that iterates over different subsampled SNPs.


In [120]:
tests = {}
nadmix = [0, 1, 2, 3, 4, 5]

# iterate over n admixture edges and store results in a dictionary
for adm in nadmix:
    tmx.params.m = adm
    tmx.run()
    tests[adm] = tmx.results.llik

In [122]:
# plot the likelihood for different values of m
toyplot.plot(
    nadmix,
    [tests[i] for i in nadmix],
    width=350, 
    height=275,
    stroke_width=3,
    xlabel="n admixture edges",
    ylabel="ln(likelihood)",
);


012345n admixture edges124.5125.0125.5ln(likelihood)