If you applied the --keep-paralogs
flag in the SECAPR find_target_contigs
function, the function will print a text file with paralogous information into the subfolder of each sample. This file contains the information about the locus id (first column) in the reference file, followed by a list of contig headers that were matched to this locus:
In [8]:
%%bash
head -n 10 ../../data/processed/target_contigs_paralogs/1061/info_paralogous_loci.txt
We can use the SECAPR paralogs_to_ref
function to extract the sequences of all these contigs and align them with their respective reference sequence. This will give an idea of where the contigs map on the reference and can help decide if these are truly potential paralogous or more likely the result of non-optimal contig assembly (clusters of homologous reads are broken into separate contigs) or if contigs from other parts of the genome map to the reference due to the presence of e.g. repetitive regions or other common sequence patterns.
We need to provide the function the following input items:
find_target_contigs
function)The command looks as follows:
secapr paralogs_to_ref --contigs ../../data/processed/contigs --reference ../../data/raw/palm_reference_sequences.fasta --target_contigs ../../data/processed/target_contigs_paralogs_info --output ../../data/processed/paralogs_to_reference
Depending on how many paralogous loci were identified in your samples this can take several minutes. The script will store the final alignments of the contigs and the reference sequences for each sample in the paralog_alignments
folder in the provided output path. Let's look at one exemplarly alignment. You can view alignments using alignment viewers such as e.g. AliView:
In [10]:
from IPython.display import Image, display
img1 = Image("../../images/paralog_contig_alignment.png",width=1000)
display(img1)
In this example you see the reference sequence in the first line starts with a repetative sequence (CTCTCTCTC....). Many contigs that only consist of this repetitive sequence appear to map to it. Only one of the contigs, however, apears to be a true match, as it exceeds the repetative region and has a lot of overlap with the reference sequence. A locus like this doesn't need to be excluded, since we see that even though it's flagged as paralogous, there are no true signs of paralogy.