ipyrad-analysis toolkit: construct

The program construct is a STRUCTURE-like tool that incorporates expectations of isolation by distance. It is available as an R package. This notebook demonstrates how to convert data to the proper format for use in construct using simulated data as an example. For details on running construct see their documentation.

Required software


In [1]:
# conda install ipyrad ipcoal -c conda-forge -c bioconda

In [3]:
import ipyrad.analysis as ipa
import toytree
import ipcoal

In [22]:
print('ipyrad', ipa.__version__)
print('toytree', toytree.__version__)
print('ipcoal', ipcoal.__version__)


ipyrad 0.9.54
toytree 2.0.1
ipcoal 0.1.4

Simulate example data


In [5]:
# network model
tree = toytree.rtree.unittree(7, treeheight=3e6, seed=123)
tree.draw(ts='o', admixture_edges=(3, 2));


r0r1r2r3r4r5r6

In [15]:
# simulation model with admixture and missing data
model = ipcoal.Model(tree, Ne=1e4, nsamples=4, admixture_edges=(3, 2, 0.5, 0.1))
model.sim_snps(250)
model.write_snps_to_hdf5(name="test-construct", outdir="/tmp", diploid=True)


wrote 250 SNPs to /tmp/test-construct.snps.hdf5

Input data file


In [16]:
# the path to your HDF5 formatted snps file
SNPS = "/tmp/test-construct.snps.hdf5"

Population assignments


In [17]:
IMAP = {
    "r0": ["r0-0", "r0-1"],
    "r1": ["r1-0", "r1-1"],
    "r2": ["r2-0", "r2-1"],
    "r3": ["r3-0", "r3-1"],
    "r4": ["r4-0", "r4-1"],
    "r5": ["r5-0", "r5-1"],
    "r6": ["r6-0", "r6-1"],
}

Filter missing data and convert to genotype frequencies


In [18]:
# apply filtering to the SNPs file
tool = ipa.snps_extracter(data=SNPS, imap=IMAP, minmap={i:2 for i in IMAP})
tool.parse_genos_from_hdf5();


Samples: 14
Sites before filtering: 250
Filtered (indels): 0
Filtered (bi-allel): 9
Filtered (mincov): 0
Filtered (minmap): 0
Filtered (subsample invariant): 0
Filtered (combined): 9
Sites after filtering: 241
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)

In [19]:
# convert SNP data to genotype frequencies
df = tool.get_population_geno_frequency()
df.head()


Out[19]:
0 1 2 3 4 5 6 7 8 9 ... 231 232 233 234 235 236 237 238 239 240
r0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.25 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.00
r1 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.00
r2 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
r3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.00 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
r4 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.00 0.0 ... 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.25

5 rows × 241 columns

Write data to file


In [20]:
# write to a file
df.to_csv("/tmp/freqs.csv")