ipyrad-analysis toolkit: window_extracter

View as notebook

Extract all sequence data within a genomic window, concatenate, and write to a phylip file. Useful for inferring the phylogeny near a specific gene/region of interest. Follow up with downstream phylogenetic analysis of the region.

Key features:

Automatically concatenates ref-mapped RAD loci in sliding windows.
Filter to remove sites by missing data.
Optionally remove samples from alignments.
Optionally use consensus seqs to represent clades of multiple samples.

Required software



In [1]:

    
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda
# conda install toytree -c eaton-lab



In [2]:

    
import ipyrad.analysis as ipa
import toytree

Required input data files

Your input data should be a .seqs.hdf database file produced by ipyrad. This file contains the full sequence alignment for your samples as well as associated meta-data of the genomic positions of RAD loci relative to a reference genome.



In [3]:

    
# the path to your HDF5 formatted seqs file
seqfile = "/home/deren/Downloads/ref_pop2.seqs.hdf5"

The scaffold table

The window_extracter() tool takes the .seqs.hdf5 database file from ipyrad as its input file. You select scaffolds by their index (integer) which can be found in the .scaffold_table. We can see from the table below that this genome has 12 large scaffolds (chromosome-scale linkage blocks) and many other smaller unplaced scaffolds. If you are working with a high quality reference genome then it will likely look similar to this, whereas many other reference genomes will be composed of many more scaffolds that are mostly smaller in size. Here I will focus just on the large chromosomes.



In [4]:

    
# first load the data file with no other arguments to see scaffold table
ext = ipa.window_extracter(seqfile)

# the scaffold table shows scaffold names and lens in length-order
ext.scaffold_table.head(15)









    Out[4]:







  
    
      
      scaffold_name
      scaffold_length
    
  
  
    
      0
      Qrob_Chr01
      55068941
    
    
      1
      Qrob_Chr02
      115639695
    
    
      2
      Qrob_Chr03
      57474983
    
    
      3
      Qrob_Chr04
      44977106
    
    
      4
      Qrob_Chr05
      70629082
    
    
      5
      Qrob_Chr06
      57352617
    
    
      6
      Qrob_Chr07
      51661711
    
    
      7
      Qrob_Chr08
      71345938
    
    
      8
      Qrob_Chr09
      50221317
    
    
      9
      Qrob_Chr10
      50368918
    
    
      10
      Qrob_Chr11
      52130961
    
    
      11
      Qrob_Chr12
      39860516
    
    
      12
      Qrob_H2.3_Sc0000024
      2943817
    
    
      13
      Qrob_H2.3_Sc0000026
      2906018
    
    
      14
      Qrob_H2.3_Sc0000030
      2801502

Selecting scaffolds

The scaffold_idxs designates the scaffold to extract sequence data from. This is the index (row) of the named scaffold from the scaffold table (e.g., above). The window_extracter tool will select all RAD data within this window and exclude any sites that have no data (e.g., the space between RAD markers, or the space between paired reads) to create a clean concise alignment.

The .stats attribute shows the information content of the selected window before and after filtering. The stats are returned as a dataframe, showing the size, information content, missingness, and number of samples in the alignment. You can see that the 55Mbp scaffold is reduced to a 450Kbp alignment that includes 13K snps and has 20% missing data across 30 samples (NB: this dataset already had some minimum sample filtering applied during assembly). The default filtering applied to sites only reduced the number of sites by a few thousand.



In [5]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
)

# show stats of the window
ext.stats









    Out[5]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr01
      0
      55068941
      456515
      13133
      0.20
      30
    
    
      postfilter
      Qrob_Chr01
      0
      55068941
      449832
      13131
      0.19
      30

Subsetting scaffold windows

You can use the start and end arguments to select subsets of scaffolds as smaller window sizes to be extracted. As with the example above the selected window will be filtered to reduce missing data. If there is no data in the selected window the stats will show no sites, and a warning will be printed. An example with no data and with some data are both shown below.



In [6]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=0, 
    end=10000,
)

# show stats of the window
ext.stats









    



No data in selected window.






    Out[6]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr01
      0
      10000
      0
      0
      0
      0
    
    
      postfilter
      Qrob_Chr01
      0
      10000
      0
      0
      0
      0



In [7]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=500000, 
    end=800000,
)

# show stats of the window
ext.stats









    Out[7]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr01
      500000
      800000
      3431
      103
      0.14
      30
    
    
      postfilter
      Qrob_Chr01
      500000
      800000
      3422
      103
      0.14
      30

Filtering missing data with `mincov`

You can filter sites from the alignment by using mincov, which applies a filter to all sites in the alignment. For example, mincov=0.5 will require that 50% of samples contain a site that is not N or - for the site to be included in the alignment. This value can be a proportion like 0.5, or it can be a number, like 10.



In [8]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=500000, 
    end=800000,
    mincov=0.8,
    rmincov=0.5,
)

# show stats of the window
ext.stats









    Out[8]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr01
      500000
      800000
      3431
      103
      0.14
      30
    
    
      postfilter
      Qrob_Chr01
      500000
      800000
      2720
      94
      0.07
      29

Filtering missing data with `imap` and `minmap`

An imap dictionary can be used to group samples into populations/species, as in the example below. It takes key,value pairs where the key is the name of the group, and the value is a list of sample names. One way to use an imap is to apply a minmap filter. This acts just like the global mincov filter, but applies to each group separately. Only if a site meets the minimum coverage argument for each group will it be retained in the data set. In this case the imap sampling selected 28/30 samples and required 75% of data in each group which reduced the number of SNPs from 92 to 86.



In [10]:

    
# assign samples to groups/taxa
imap = {
    "reference": ["reference"],
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
}

# set a simple minmap requiring 1 sample from each group
minmap = {name: 0.75 for name in imap}



In [11]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=500000, 
    end=800000,
    mincov=0.8,
    imap=imap,
    minmap=minmap,
)

# show stats of the window
ext.stats









    Out[11]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr01
      500000
      800000
      3431
      92
      0.12
      28
    
    
      postfilter
      Qrob_Chr01
      500000
      800000
      2904
      86
      0.08
      28

Subsample taxa with `imap`

You can also use an imap dictionary to select which samples to include/exclude from an analysis. This is an easy way to remove rogue taxa, hybrids, or technical replicates from phylogenetic analyses. Here I select a subset ot taxa to include in the analyses and keep only sites that have 80% coverage from scaffold 2 (Qrob_Chr03).



In [12]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=2,
    mincov=0.8,
    imap={
        "include": [
            "TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140",
            "FLSF47", "FLMO62", "FLSA185", "FLCK216",
            "FLCK18", "FLSF54", "FLWO6", "FLAB109",
        ]
    },
)

# show stats of the window
ext.stats









    Out[12]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr03
      0
      57474983
      419278
      7283
      0.22
      13
    
    
      postfilter
      Qrob_Chr03
      0
      57474983
      302300
      5441
      0.10
      13

Concatenate multiple scaffolds together

You can also concatenate multiple scaffolds together using window_extracter. This can be useful for creating genome-wide alignments, or smaller subsets of the genome. For example, you may want to combine multiple scaffolds from the same chromosome together, or, if you are working with denovo data, you could even combine a random sample of anonymous loci together as a sort of pseudo bootstrapping procedure. To select multiple scaffolds you simply provide a list or range of scaffold idxs.



In [13]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=[0, 1, 2, 3, 4, 5],
    mincov=0.5,
)

# show stats of the window
ext.stats









    Out[13]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      0
      concatenated
      0
      2835162
      2835162
      90452
      0.152085
      30

Consensus reduction with `imap`

You can further reduce missing data by condensing data from multiple samples into a single "consensus" representative using the consensus_reduce=True option. This uses the imap dictionary to group samples into groups and sample the most frequent allele. This can be particularly useful for analyses in which you want dense species-level coverage with little missing data, but it is not particularly important which individual represents the sampled allele for a species at a given locus. For example, if you want to construct many gene trees with one representative per species to use as input to a two-step species tree inference program like ASTRAL.



In [22]:

    
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=200000, 
    end=5000000,
    mincov=0.8,
    imap=imap,
    minmap=minmap,
    consensus_reduce=True,
)

# show stats of the window
ext.stats









    Out[22]:







  
    
      
      scaffold
      start
      end
      sites
      snps
      missing
      samples
    
  
  
    
      prefilter
      Qrob_Chr01
      200000
      5000000
      51288
      1622
      0.19
      28
    
    
      postfilter
      Qrob_Chr01
      200000
      5000000
      48927
      651
      0.01
      8

Write selected window to a file

Once you've chosen the final set of arguments to select the window of interest you can write the alignment to .phy format by calling the .run() command. If you want to write to nexus format you can simply add the argument nexus=True.



In [24]:

    
ext.run(force=True)









    



Wrote data to /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-200000-5000000.phy

Accessing the output files

The output files created by the .run() command will be written to the working directory (defaults to "./analysis-window_extracter"). You can either find the full path to that file or access it easily from the extracter object itself as an attribute like below.



In [26]:

    
# path to the phylip file output
ext.outfile









    Out[26]:





'/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-200000-5000000.phy'

Advanced: Infer tree from phy output

You can pass in the file path that was created above to a number of inference programs. The ipyrad tools for raxml and mrbayes both accept phylip format (ipyrad converts it to nexus under the hood for mrbayes).



In [19]:

    
# run raxml on the phylip file 
rax = ipa.raxml(data=ext.outfile, name="test", N=50, T=4)

# show the raxml command
print(rax.command)









    



/home/deren/miniconda3/envs/py36/bin/raxmlHPC-PTHREADS-AVX2 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-raxml -s /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-500000-800000.phy -p 54321 -N 50 -x 12345



In [20]:

    
# run job and wait to finish
rax.run(force=True)









    



job test finished successfully



In [21]:

    
# plot the tree for this genome window
print(rax.trees.bipartitions)
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");









    



/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-raxml/RAxML_bipartitions.test

Advanced: Infer trees from many windows

Now that you've seen how easy it is to extract a single window from the genome, and to apply a number of filters to it and then infer a tree, you can imagine how easy it is to apply this framework to many hundreds or thousands of windows along the genome. Head over to the tree_slider tutorial next to see this in action.

	scaffold_name	scaffold_length
0	Qrob_Chr01	55068941
1	Qrob_Chr02	115639695
2	Qrob_Chr03	57474983
3	Qrob_Chr04	44977106
4	Qrob_Chr05	70629082
5	Qrob_Chr06	57352617
6	Qrob_Chr07	51661711
7	Qrob_Chr08	71345938
8	Qrob_Chr09	50221317
9	Qrob_Chr10	50368918
10	Qrob_Chr11	52130961
11	Qrob_Chr12	39860516
12	Qrob_H2.3_Sc0000024	2943817
13	Qrob_H2.3_Sc0000026	2906018
14	Qrob_H2.3_Sc0000030	2801502

	scaffold	start	end	sites	snps	missing	samples
prefilter	Qrob_Chr01	0	55068941	456515	13133	0.20	30
postfilter	Qrob_Chr01	0	55068941	449832	13131	0.19	30

	scaffold	start	end	sites	snps	missing	samples
prefilter	Qrob_Chr01	500000	800000	3431	103	0.14	30
postfilter	Qrob_Chr01	500000	800000	3422	103	0.14	30

	scaffold	start	end	sites	snps	missing	samples
prefilter	Qrob_Chr01	200000	5000000	51288	1622	0.19	28
postfilter	Qrob_Chr01	200000	5000000	48927	651	0.01	8