ipyrad-analysis toolkit: STRUCTURE

Structure v.2.3.4 is a standard tool for examining population genetic structure based on allele frequencies within and among populations. Although many new implementations of the structure algorithm have been developed in recent years offering improvements to speed, the classic tool offers a number of useful options that keep it relevant to this day.

Required software


In [1]:
# conda install ipyrad -c bioconda
# conda install structure clumpp -c ipyrad
# conda install toyplot -c eaton-lab

In [2]:
import ipyrad.analysis as ipa
import toyplot

Required input data files

Your input data should be a .snps.hdf database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the vcf2hdf5 tool tutorial. The database file contains the genotype calls information as well as linkage information that is used for subsampling unlinked SNPs and bootstrap resampling.


In [3]:
# the path to your .snps.hdf5 database file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"

Note: missing data in STRUCTURE analyses:

Structure infers the values of missing data while it runs the MCMC chain. No imputation is required, but it will perform more accurately if there is less missing data and when base calls are more accurate. I recommend not imputing data and simply filtering fairly stringently.

Approximate run times

This example data set should probably be run for a longer burnin and number of reps if it were to be used in a publication. For reference, this data set takes about 2.5 hours to run 12 jobs on a 4 core laptop for a data set with 27 samples and ~125K SNPs. If your data set has more samples or SNPs then it will take longer. If you have 2X as many cores then it will run 2X faster.

Input data file and population assignments

If you are using the "sample" input method then population assignments (imap dictionary) are used for for filtering, color coding plots, and for imputation. If you are using the "kmeans" imputing method then population assignments are only used for filtering and color coding plots.


In [4]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
}

# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}

Enter data file and params

The struct analysis object takes input data as the .snps.hdf5 file produced by ipyrad. All other parameters are optional. The imap dictionary groups individuals into populations and minmap can be used to filter SNPs to only include those that have data for at least some proportion of samples in every group. The mincov option works similarly, it filters SNPs that are shared across less than some proportion of all samples (in contrast to minmap this does not use imap groupings).

When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining amount of missing data after filtering. These remaining missing values are the ones that will be filled with imputation.


In [6]:
# init analysis object with input data and (optional) parameter options
struct = ipa.structure(
    name="test",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.9,
)


Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 222081
Filtered (minmap): 112898
Filtered (combined): 226418
Sites after filtering: 123496
Sites containing missing values: 96001 (77.74%)
Missing values in SNP matrix: 142017 (4.26%)

Run STRUCTURE and plot results.

The burnin and numreps parameters determine the length of the run. For analyses with many samples and with larger values of K you should use much larger values than these.


In [7]:
struct.mainparams.burnin = 5000
struct.mainparams.numreps = 10000

In [8]:
struct.run(nreps=3, kpop=[2, 3, 4, 5], auto=True)


Parallel connection | oud: 4 cores
[####################] 100% 2:26:57 | running 12 structure jobs 

Analyze results: Choosing K


In [83]:
etable = struct.get_evanno_table([2, 3, 4, 5])
etable


Out[83]:
Nreps lnPK lnPPK deltaK estLnProbMean estLnProbStdev
2 3 0.000000 0.000000 0.000000 -254535.766667 2023.420259
3 3 25229.900000 35261.166667 30.892635 -229305.866667 1141.410147
4 3 -10031.266667 1451.800000 1.675614 -239337.133333 866.428568
5 3 -8579.466667 0.000000 0.000000 -247916.600000 8537.460208

In [90]:
# get canvas object and set size
canvas = toyplot.Canvas(width=400, height=300)

# plot the mean log probability of the models in red
axes = canvas.cartesian(ylabel="estLnProbMean")
axes.plot(etable.estLnProbMean * -1, color="darkred", marker="o")
axes.y.spine.style = {"stroke": "darkred"}

# plot delta K with its own scale bar of left side and in blue
axes = axes.share("x", ylabel="deltaK", ymax=etable.deltaK.max() + etable.deltaK.max() * .25)
axes.plot(etable.deltaK, color="steelblue", marker="o");
axes.y.spine.style = {"stroke": "steelblue"}

# set x labels
axes.x.ticks.locator = toyplot.locator.Explicit(range(len(etable.index)), etable.index)
axes.x.label.text = "K (N ancestral populations)"


2345K (N ancestral populations)230000235000240000245000250000255000estLnProbMean010203040deltaK

Analyze results: Barplots


In [7]:
k = 3
table = struct.get_clumpp_table(k)


[K3] 3/3 results permuted across replicates (max_var=0).

In [8]:
# sort list by columns
table.sort_values(by=list(range(k)), inplace=True)

# or, sort by a list of names (here taken from imap)
import itertools
onames = list(itertools.chain(*imap.values()))
table = table.loc[onames]

In [9]:
# build barplot
canvas = toyplot.Canvas(width=500, height=250)
axes = canvas.cartesian(bounds=("10%", "90%", "10%", "45%"))
axes.bars(table)

# add labels to x-axis
ticklabels = [i for i in table.index.tolist()]
axes.x.ticks.locator = toyplot.locator.Explicit(labels=ticklabels)
axes.x.ticks.labels.angle = -60
axes.x.ticks.show = True
axes.x.ticks.labels.offset = 10
axes.x.ticks.labels.style = {"font-size": "12px"}


TXWV2LALC2SCCU3FLSF33FLBA140FLSF47FLMO62FLSA185FLCK216FLCK18FLSF54FLWO6FLAB109CUVN10CUCA4CUSV6CRL0030HNDA09BZBB1MXSA3017MXED8MXGT4TXGR3TXMD3BJSL25BJSB3BJVL190.00.51.0

Cookbook

Advanced: Load existing results

Results files can be loaded by providing the name and workdir combination that leads to the path where your previous results were stored.


In [12]:
rerun = ipa.structure(
    data=data, 
    name="test", 
    workdir="analysis-structure",
    imap=imap,
    load_only=True,
)


12 previous results loaded for run [test]

In [13]:
rerun.get_clumpp_table(3)


[K3] 3/3 results permuted across replicates (max_var=0).
Out[13]:
0 1 2
BJSB3 0.0000 1.0000 0.0000
BJSL25 0.0000 1.0000 0.0000
BJVL19 0.0000 1.0000 0.0000
BZBB1 0.0000 0.0000 1.0000
CRL0030 0.0000 0.0000 1.0000
CUCA4 0.3450 0.0000 0.6550
CUSV6 0.4098 0.0000 0.5902
CUVN10 0.3408 0.0000 0.6592
FLAB109 1.0000 0.0000 0.0000
FLBA140 1.0000 0.0000 0.0000
FLCK18 1.0000 0.0000 0.0000
FLCK216 1.0000 0.0000 0.0000
FLMO62 0.9987 0.0010 0.0003
FLSA185 1.0000 0.0000 0.0000
FLSF33 1.0000 0.0000 0.0000
FLSF47 1.0000 0.0000 0.0000
FLSF54 1.0000 0.0000 0.0000
FLWO6 1.0000 0.0000 0.0000
HNDA09 0.0000 0.0000 1.0000
LALC2 1.0000 0.0000 0.0000
MXED8 0.1760 0.6953 0.1287
MXGT4 0.1531 0.8093 0.0377
MXSA3017 0.0477 0.0013 0.9510
SCCU3 1.0000 0.0000 0.0000
TXGR3 0.3649 0.6267 0.0083
TXMD3 0.3987 0.6010 0.0003
TXWV2 1.0000 0.0000 0.0000

Advanced: Add replicates or additional K values

You can continue an analysis with the same name and workdir by setting additional replicates or values of K and calling .run() again. Here I will increase the number of replicates per K value from 3 to 5, and run one additional K value. Be sure to use all of the same parameter and filtering values that you used in the previous run or you might cause unexpected problems.

Here because we already finished 3 replicates for K=2,3,4,5 it will run 2 more for each of those, and it will run 5 replicates for K=6 since we do not have any finished replicates of those yet. You can see which result files exist for a named analysis object by accessing the .result_files attribute, or by looking in the working directory. To overwrite existing files instead of adding more replicates you can use force=True in the run command. You could also simply create a new object with a different name.


In [15]:
# init analysis object with same params as previously
struct = ipa.structure(
    name="test",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.9,
)

# use the same params as before 
struct.mainparams.burnin = 5000
struct.mainparams.numreps = 10000

# call run for all K values you want to have 5 finished replicates
struct.run(nreps=5, kpop=[2, 3, 4, 5, 6], auto=True)


12 previous results loaded for run [test]
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 222081
Filtered (minmap): 112898
Filtered (combined): 226418
Sites after filtering: 123496
Sites containing missing values: 96001 (77.74%)
Missing values in SNP matrix: 142017 (4.26%)
Parallel connection | oud: 4 cores
[####################] 100% 3:39:43 | running 13 structure jobs 

In [16]:
struct.get_evanno_table([2, 3, 4, 5, 6])


Out[16]:
Nreps lnPK lnPPK deltaK estLnProbMean estLnProbStdev
2 5 0.00 0.00 0.000000 -254950.12 1675.280397
3 5 25807.62 38146.76 39.701878 -229142.50 960.830118
4 5 -12339.14 7180.72 2.018760 -241481.64 3556.995749
5 5 -5158.42 8885.00 1.413531 -246640.06 6285.676647
6 5 3726.58 0.00 0.000000 -242913.48 2164.870641