In order to extract the contigs representing your target sequences (the sequences that were being captured during the sequence capture process), you need to provide a fasta file containing the reference sequences for all loci of interest. Ususally all sequences of interest should be present in the file that was used to design the RNA baits for sequence capture. If you are using some standarad RNA bait library that was not specifically designed for your organism group/project (e.g. Ultraconserved Elements - UCEs), you can usually find the reference library on the webpage of the developer or in the respective publication. If all else fails, you can try to extract sequences of the same loci that you captured, for organisms that are closely related to your taxa, e.g. from NCBI GenBank.
The reference library should be in simple fasta format, containign one sequence per locus of interest. Here is an example of a reference library that was used in our example dataset of palms, extracted from Heyduk et al., 2016:
Citation: Heyduk, K., Trapnell, D. W., Barrett, C. F., & Leebens-Mack, J. (2016). Phylogenomic analyses of species relationships in the genus Sabal (Arecaceae) using targeted sequence capture. Biological Journal of the Linnean Society, 117, 106–120.
In [2]:
%%bash
head ../../data/raw/palm_reference_sequences.fasta
Once you got your reference fasta files ready you are good to start with extracting the contigs of interest. For this purpose we want to create an overview over which contigs represent which reference locus in each sample. At the same time we also have to be somewhat selective and discard potential duplicates that match several loci. Let's check the function that helps you do this:
In [4]:
%%bash
source activate secapr_env
secapr find_target_contigs -h
Note that in this step SECAPR will index all your locus names stored in the reference file, so in all downstream steps your loci will carry a numerical index. The translation table holding the information which index corresponds to which locus name is stored in a text file in the output folder (reference_fasta_header_info.txt
).
The sensitivity of the blast algorithm (LASTZ) can be altered with the flags --min-coverage
and --min-identity
. High values mean conservative matching requirements, while low values will return more matches but also possibly non-orthologous sequences.
However, being to lenient with the --min-coverage
and --min-identity
flags can also lead to an increase of loci that are identified as paralogous, which happens when more than 1 contig matches the reference sequence.
You can choose to add the --keep-duplicates
flag, in order to also keep contigs which span across multiple loci. These will be extracted independently for each contig thhey match and may hence be present in several copies in the FASTA file containing your extracted contigs. If this flag is used a txt file with the duplicate information for each sample is being printed into the samples output folder (reference_fasta_header_info.txt
). This file contains the contig ID in the first column, followed by the list of reference loci indeces that the contig matched to.
The --keep-paralogs
flag should be used with caution. This will lead to SECAPR keeping loci for which indications of paralogy were found (multiple matching contigs). These loci should generally not be used for phylogenetic inferences (unless you are certain for other reasons that you are not concerned about paralogous loci in your dataset), but they may be of interest in some cases. If this flag is used, a txt file with the paralog information for each sample is being printed into the samples output folder (info_paralogous_loci.txt
). This file contains the reference locus index in the first column, followed by a list of contig names that matched to the locus.
Sometimes it turns out that despite being flagged, loci are not truly paralogous, but that multiple non-homologous/non-paralogous contigs were matched to the reference for other reasons (repetitive regions or other repeated sequence patterns). SECAPR provides an extra function to inspect where the flagged contigs end up on the reference (see documentation of paralogs_to_ref
function). In those cases the paralogous loci can be included in further steps, i.e. apply the --keep-paralogs
flag, which will extract only the longest (and usually best) contig for these loci.
Now let's run the script.
secapr find_target_contigs --contigs ../../data/processed/contigs/ --reference ../../data/raw/palm_reference_sequences.fasta --output ../../data/processed/target_contigs --keep-duplicates
To get a first idea of the resulting matches, you can have a look at the file match_table.txt
in the output folder.
In [4]:
import pandas as pd
table = pd.read_csv('../../data/processed/target_contigs/match_table.txt', delimiter = '\t',index_col=0)
table.head()
Out[4]:
Those fields containing a '1' indicate that a unique match was extracted from the contig sequences for the respective exon and sample. If the output reveals a very low harvest of target sequences, you can try to reduce the values for the flags --min-coverage
and --min-identity
in order to be more generous in the matching step. If on the other hand your output turns out to capture a lot of non-homologous sequences between the different samples (can be identified after the alignment step), you may want to turn up the values for these flags in order to be more conservative in your search.
The script also prints out summary stats in a textfile in the output folder:
In [5]:
%%bash
cat ../../data/processed/target_contigs/summary_stats.txt
In [9]:
%%bash
source activate secapr_env
secapr plot_sequence_yield -h
For now we only want to print the contig data, so we can execute the function like this:
secapr plot_sequence_yield --extracted_contigs ../../data/processed/target_contigs --output ../../data/processed/plots
The resulting plot looks like this and shows which contigs could be extracted (blue) and which were not present (white) for each sample and locus:
In [11]:
from IPython.display import Image, display
img1 = Image("../../data/processed/plots/contig_yield_overview.png",width=1000)
display(img1)
Blue means "presence" and white "absence" of the respective sequence in the final contig file. SECAPR also prints a separate text file with an overview which exon index coressponds to which locus. It appears that most loci were recovered for most samples. Further there are gaps in some places for all samples, which could indicate that the bait sequences for that locus are not very suitable for our sample data of the genus Geonoma.
If you are satisfied with your contig yield you are ready to continue to the alignment step.