After you mapped your reads against the reference library during the reference-based assembly step, you are ready to phase your reads into the two different allele sequences (in case of diploid organisms). This step is simple to execute, since the function only requires the path to the reference-based assembly output and the user-set minimal read depth for generating the consensus sequence:
In [1]:
%%bash
source activate secapr_env
secapr phase_alleles -h
We can run the command simply like this:
secapr phase_alleles --input ../../data/processed/remapped_reads/ --output ../../data/processed/allele_sequences --min_coverage 3
We can also choose to phase only the selected loci that were produced with the secapr locus_selection
function:
secapr phase_alleles --input ../../data/processed/selected_loci --output ../../data/processed/allele_sequences_selected_loci --min_coverage 3
Now all we need to do is to run the secapr align_sequences
function in order to align the extracted allele sequences of all samples for each locus. We can run the command like this:
secapr align_sequences --sequences ../../data/processed/allele_sequences/joined_allele_fastas.fasta --output ../../data/processed/alignments/allele_alignments --aligner mafft --output-format fasta --no-trim --ambiguous
Or like this if we want to instead build allele alignments from only the selected loci:
secapr align_sequences --sequences ../../data/processed/allele_sequences_selected_loci/joined_allele_fastas.fasta --output ../../data/processed/alignments/selected_loci_allele_alignments --aligner mafft --output-format fasta --no-trim --ambiguous
Before using these alignments for phylogenetic analyses it usually is a good idea to make sure that all taxa contain the same number of sequences. As of right now, some alignments may be missing one of the two allele sequences for some samples, because not enough reads were present that were supporting both haplotypes (controlled by the --min_coverage
flag int he phase_alleles
command. In order to add missing sequences as dummy sequences containing n's we can use the secapr add_missing_sequences
function:
secapr add_missing_sequences --input ../../data/processed/alignments/selected_loci_allele_alignments/ --output ../../data/processed/alignments/selected_loci_allele_alignments_complete
We provide a tutorial of how to use the generated allele sequence alignments for phylogeny estimation under the Multispecies Coalescent (MSC) model.
We provide a tool for extracting SNPs from phased allele alignments. The tool can be applied to any type of allignments, but is usually the most useful for fully phased allele alignments, which contain the heterozygous information. See here for more information.