Align contigs

We can use SECAPR to produce Multiple Sequence Alignments (MSAs) from the contig data. The alignment function align_sequences looks as follows:

In [3]:
source activate secapr_env
secapr align_sequences -h

usage: secapr align_sequences [-h] --sequences SEQUENCES --output OUTPUT
                              [--aligner {muscle,mafft}]
                              [--output-format {fasta,nexus,phylip,clustal,emboss,stockholm}]
                              [--no-trim] [--window WINDOW]
                              [--proportion PROPORTION]
                              [--threshold THRESHOLD]
                              [--max-divergence MAX_DIVERGENCE]
                              [--min-length MIN_LENGTH] [--exclude_ambiguous]
                              [--cores CORES]

Align sequences and produce separate alignment file for each locus, containing
the seqeunces of all taxa.

optional arguments:
  -h, --help            show this help message and exit
  --sequences SEQUENCES
                        The fasta file containing individual sequences for
                        several samples and loci
  --output OUTPUT       The directory in which to store the resulting
  --aligner {muscle,mafft}
                        The alignment engine to use.
  --output-format {fasta,nexus,phylip,clustal,emboss,stockholm}
                        The output alignment format.
  --no-trim             Align, but DO NOT trim alignments.
  --window WINDOW       Sliding window size for trimming.
  --proportion PROPORTION
                        The proportion of taxa required to have sequence at
                        alignment ends.
  --threshold THRESHOLD
                        The proportion of residues required across the window
                        in proportion of taxa.
  --max-divergence MAX_DIVERGENCE
                        The max proportion of sequence divergence allowed
                        between any row of the alignment and the alignment
  --min-length MIN_LENGTH
                        The minimum length of alignments to keep.
  --exclude_ambiguous   Don't allow reads in alignments containing N-bases.
  --cores CORES         Process alignments in parallel using --cores for
                        alignment. This is the number of PHYSICAL CPUs.

Let's now create alignment from our contig data. In the example command below we added the flag --no-trim, which avoids the algorithm to cut the alignment at the ends (= full contig sequence length is being preserved) and the flag --ambiguous, which allows sequences with ambiguous bases ('N') to be included into the alignments. You can decide to not use the --no-trim flag if you want all sequences in the alignments to be of the same length. In that case there are a bunch of additional flags (see above) that you can use to adjust the trimming process.

secapr align_sequences --sequences ../../data/processed/target_contigs/extracted_target_contigs_all_samples.fasta --output ../../data/processed/alignments/contig_alignments/ --aligner mafft --output-format fasta --no-trim

The align_sequences function by default creates multiple sequence alignments (MSAs) for all loci that are present in at least 3 samples. This leads to alignments for most of the targeted exons. We can use the SECAPR plotting function from the previous step again to create an overview showing which loci we have MSAs for. Therefore we need to provide the function the path to the target contigs as well as the path to the alignment folder. We can provide the same output folder as before, since the function will automatically name this plot differently than the previous one, so that it's not being overwritten.

secapr plot_sequence_yield --extracted_contigs ../../data/processed/target_contigs --alignments ../../data/processed/alignments/contig_alignments/ --output ../../data/processed/plots

In [2]:
from IPython.display import Image, display
img1 = Image("../../data/processed/plots/contig_alignment_yield_overview.png",width=1000)


Filling in missing sequences in alignments

Some applications (such as e.g. BEAST) require the same samples/taxa being present in every alignment. If you review your alignments you may find that many-most loci could not be assembled for all of your samples. Even though this is not the optimal turn-out it is quite normal and you can still proceed using your multilocus dataset. It is okay to have missing sequence data for some samples in the alignments, as long as it is correctly coded. Secapr has a function that adds dummy sequences consisting of ?'s for the missing taxa to your alignments, so that all alignments have the same set of taxa. All you have to do is to provide the path to the folder containign all alignments you want to sync and provide the link to the output folder.

secapr add_missing_sequences --input ../../data/processed/alignments/contig_alignments/ --output ../../data/processed/alignments/contig_alignments_no_missing

Congratulations! You now hopefully have a whole folder full with alignments, which you can use for your downstream analyses. However, there is a lot more you can get out of your sequence capture data if you stick with us! Keep in mind that contig sequences (which your alignments consist of) constitute consensus sequences of the reads that were merged during assembly. Even though the algorithms behind assembly softwares (such as ABySS and Trinity) are well developed, they still may produce chimeric sequences. This means that the resulting sequence may be a mixture between the sequences of the two possible alleles at the respective locus (for diploid organisms) or even worse a mixture between paralogous sequences from different sites. The rest of this tutorial will take you through the steps of reference-based assembly, phasing and compiling of allele sequences. Further we provide a workflow for SNP extraction.

The next step is to generate a new reference library from our contig sequences and to remap the reads to this library (reference-based assembly).