With reference mapped RAD loci you can select windows of loci located close together on scaffolds and automate extracting and filtering and concatenating the RAD data to write to phylip format (see also the window_extracter
tool.) The treeslider
tool here automates this process across many windows, distributes the tree inference jobs in parallel, and organizes the results.
Key features:
clade_weights
).
In [1]:
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda
# conda install toytree -c eaton-lab
In [3]:
import ipyrad.analysis as ipa
import toytree
In [4]:
# the path to your HDF5 formatted seqs file
data = "/home/deren/Downloads/ref_pop2.seqs.hdf5"
In [5]:
# check scaffold idx (row) against scaffold names
ipa.treeslider(data).scaffold_table.head()
Out[5]:
Here I select the scaffold Qrob_Chr03 (scaffold_idx
=2), and run 2Mb windows (window_size
) non-overlapping (2Mb slide_size
) across the entire scaffold. I use the default inference method "raxml", and modify its default arguments to run 100 bootstrap replicates. More details on modifying raxml params later. I set for it to skip windows with <10 SNPs (minsnps
), and to filter sites within windows (mincov
) to only include those that have coverage across all 9 clades, with samples grouped into clades using an imap
dictionary.
In [43]:
# select a scaffold idx, start, and end positions
ts = ipa.treeslider(
name="test2",
data="/home/deren/Downloads/ref_pop2.seqs.hdf5",
workdir="analysis-treeslider",
scaffold_idxs=2,
window_size=250000,
slide_size=250000,
inference_method="raxml",
inference_args={"N": 100, "T": 4},
minsnps=10,
consensus_reduce=True,
mincov=5,
imap={
"reference": ["reference"],
"virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
"mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
"gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
"bran": ["BJSL25", "BJSB3", "BJVL19"],
"fusi-N": ["TXGR3", "TXMD3"],
"fusi-S": ["MXED8", "MXGT4"],
"sagr": ["CUVN10", "CUCA4", "CUSV6"],
"oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
},
)
In [44]:
ts.show_inference_command()
In [45]:
ts.run(auto=True, force=True)
The main result of a tree slider analysis is the tree_table
. This is a pandas dataframe that includes information about the size and informativeness of each window in addition to the inferred tree for that window. This table is also saved as a CSV file. You can later re-load this CSV to perform further analysis on the tree results. For example, see the clade_weights
tool for how to analyze the support for clades throughout the genome, or see the example tutorial for running ASTRAL species tree or SNAQ species network analyses using the list of trees inferred here.
In [46]:
ts.tree_table.head()
Out[46]:
In [79]:
# example: remove any rows where the tree is NaN
df = ts.tree_table.loc[ts.tree_table.tree.notna()]
In [93]:
mtre = toytree.mtree(df.tree)
mtre.treelist = [i.root("reference") for i in mtre.treelist]
mtre.draw_tree_grid(
nrows=3, ncols=4, start=20,
tip_labels_align=True,
tip_labels_style={"font-size": "9px"},
);
In [ ]:
In [5]:
# select a scaffold idx, start, and end positions
ts = ipa.treeslider(
name="test",
data="/home/deren/Downloads/ref_pop2.seqs.hdf5",
workdir="analysis-treeslider",
scaffold_idxs=2,
window_size=1000000,
slide_size=1000000,
inference_method="mb",
inference_args={"N": 0, "T": 4},
minsnps=10,
mincov=9,
consensus_reduce=True,
imap={
"reference": ["reference"],
"virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
"mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
"gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
"bran": ["BJSL25", "BJSB3", "BJVL19"],
"fusi-N": ["TXGR3", "TXMD3"],
"fusi-S": ["MXED8", "MXGT4"],
"sagr": ["CUVN10", "CUCA4", "CUSV6"],
"oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
},
)
In [5]:
# select a scaffold idx, start, and end positions
ts = ipa.treeslider(
name="test",
data="/home/deren/Downloads/ref_pop2.seqs.hdf5",
workdir="analysis-treeslider",
scaffold_idxs=2,
window_size=2000000,
slide_size=2000000,
inference_method="raxml",
inference_args={"N": 100, "T": 4},
minsnps=10,
mincov=9,
imap={
"reference": ["reference"],
"virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
"mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
"gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
"bran": ["BJSL25", "BJSB3", "BJVL19"],
"fusi-N": ["TXGR3", "TXMD3"],
"fusi-S": ["MXED8", "MXGT4"],
"sagr": ["CUVN10", "CUCA4", "CUSV6"],
"oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
},
)
In [6]:
# this is the tree inference command that will be used
ts.show_inference_command()
In [7]:
ts.run(auto=True, force=True)
Our goal is to fill the .tree_table
, a pandas DataFrame where rows are genomic windows and the information content of each window is recorded, and a newick string tree is inferred and filled in for each. The tree table is also saved as a CSV formatted file in the workdir. You can re-load it later using Pandas. Below I demonstrate how to plot results from the tree_able. To examine how phylogenetic relationships vary across the genome see also the clade_weights()
tool, which takes the tree_table as input.
In [8]:
# the tree table is automatically saved to disk as a CSV during .run()
ts.tree_table.head()
Out[8]:
In [9]:
# filter to only windows with >50 SNPS
trees = ts.tree_table[ts.tree_table.snps > 50].tree.tolist()
# load all trees into a multitree object
mtre = toytree.mtree(trees)
# root trees and collapse nodes with <50 bootstrap support
mtre.treelist = [
i.root("reference").collapse_nodes(min_support=50)
for i in mtre.treelist
]
# draw the first 12 trees in a grid
mtre.draw_tree_grid(
nrows=3, ncols=4, start=0,
tip_labels_align=True,
tip_labels_style={"font-size": "9px"},
);
Using toytree you can easily draw a cloud tree of overlapping gene trees to visualize discordance. These typically look much better if you root the trees, order tips by their consensus tree order, and do not use edge lengths. See below for an example, and see the toytree documentation.
In [10]:
# filter to only windows with >50 SNPS (this could have been done in run)
trees = ts.tree_table[ts.tree_table.snps > 50].tree.tolist()
# load all trees into a multitree object
mtre = toytree.mtree(trees)
# root trees
mtre.treelist = [i.root("reference") for i in mtre.treelist]
# infer a consensus tree to get best tip order
ctre = mtre.get_consensus_tree()
# draw the first 12 trees in a grid
mtre.draw_cloud_tree(
width=400,
height=400,
fixed_order=ctre.get_tip_labels(),
use_edge_lengths=False,
);
In this analysis I entered multiple scaffolds to create windows across each scaffold. I also entered a smaller slide size than window size so that windows are partially overlapping. The raxml command string was modified to perform 10 full searches with no bootstraps.
In [11]:
# select a scaffold idx, start, and end positions
ts = ipa.treeslider(
name="chr1_w500K_s100K",
data=data,
workdir="analysis-treeslider",
scaffold_idxs=[0, 1, 2],
window_size=500000,
slide_size=100000,
minsnps=10,
inference_method="raxml",
inference_args={"m": "GTRCAT", "N": 10, "f": "d", 'x': None},
)
In [12]:
# this is the tree inference command that will be used
ts.show_inference_command()