extract_contigs-checkpoint


Extracting target contigs

In order to extract the contigs representing your target sequences (the sequences that were being captured during the sequence capture process), you need to provide a fasta file containing the reference sequences for all loci of interest. Ususally all sequences of interest should be present in the file that was used to design the RNA baits for sequence capture. If you are using some standarad RNA bait library that was not specifically designed for your organism group/project (e.g. Ultraconserved Elements - UCEs), you can usually find the reference library on the webpage of the developer or in the respective publication. If all else fails, you can try to extract sequences of the same loci that you captured, for organisms that are closely related to your taxa, e.g. from NCBI GenBank.

The reference library should be in simple fasta format, containign one sequence per locus of interest. Here is an example of a reference library that was used in our example dataset of palms, extracted from Heyduk et al., 2016:

Citation: Heyduk, K., Trapnell, D. W., Barrett, C. F., & Leebens-Mack, J. (2016). Phylogenomic analyses of species relationships in the genus Sabal (Arecaceae) using targeted sequence capture. Biological Journal of the Linnean Society, 117, 106–120.


In [2]:
%%bash
head ../../data/raw/palm_reference_sequences.fasta


>Elaeis_1007_0
TGGGAGTCGCCGGGCATTTCTGGGATCTCCTCAAGCCCTACGCCCGGAACGAGGGCGTCGACTTCCTCCGGAACAAGCGCGTCGCCATCGACCTCTCCTTCTGGATCGTCCAGCACGGGGCTGCCATCCGCAACAAAACCTCTCGCCTCCGCAATCCCCACATCCGCACCACCTTTTTCCGTACTGTCGCCTTGTTT
>Elaeis_1007_1
TCGAAGATGGGGGCGTTCCCGGTGTTCGTCGTTGACGGCGAGCCATCGCCGTTGAAGACGCAGGCAAGGATGGAGCGCTTCTTCTGCAACTCTGGTGTTGATCCTTCGACGCTGTCGAAGGCGGAGGAAGGGGAGGAGTCTCCTGTGAAACAGAGGAATCAGGCATTCACCAGATGCGTTCGGGAGTGCGTG
>Elaeis_1007_2
GAACTCCTCGAAATCCTAGGGATGCCAGTTCTAAGAGCATGTGGTGAGGCTGAAGCCCTGTGTGCACAGTTAAATAGTGAAGGCCATGTTGATGCTTGCATCACTGCCGACAGTGATGCTTTCCTGTTTGGGGCGAAATGTGTCATAAAGTGCCTTCGCTCCAATTGCAAG
>Elaeis_1007_3
GACCCATTTGAGTGCTACAACATATCAGATGTTGAAGCTGGTCTTGGTTTGAAGAGAAAACAAATGGTAGCCATTGCTCTTCTGGTCGGTAATGACCATGATTTGCATGGGGTATCTGGGTTTGGGGGTTGATACGGCTGTCCGATTTGTAAAGATGTTTAGTGAGGATGAAATTTTGGCTA
>Elaeis_1007_4
GGTTATGTGAGGTTGGTAAAGGGGTTTTCCCTTTTTCAGAGGGAAGCATCAGTTTGGCCATGGATCCCCACATGCCTATTTCAAATGAGGTTTCACCGAGTGCAAGATCTCCACACTGCTCACATTGTGGTCACCCAGGCAGCAAGAAAGCTCATCAGAAGACTGCTTGTGAATACTGTATTCTTAATGGTTCTGAAAATTGCATGGAAAAGCCAACAGGGTTCAAATGCTTGTGCTCCACCTGCGA

Find and extract all target contigs

Once you got your reference fasta files ready you are good to start with extracting the contigs of interest. For this purpose we want to create an overview over which contigs represent which reference locus in each sample. At the same time we also have to be somewhat selective and discard potential duplicates that match several loci. Let's check the function that helps you do this:


In [4]:
%%bash
source activate secapr_env
secapr find_target_contigs -h


usage: secapr find_target_contigs [-h] --contigs CONTIGS --reference REFERENCE
                                  --output OUTPUT
                                  [--min-coverage MIN_COVERAGE]
                                  [--min-identity MIN_IDENTITY]
                                  [--keep-duplicates] [--keep-paralogs]
                                  [--disable_stats]

Extract the contigs that match the reference database

optional arguments:
  -h, --help            show this help message and exit
  --contigs CONTIGS     The directory containing the assembled contigs in
                        fasta format.
  --reference REFERENCE
                        The fasta-file containign the reference sequences
                        (probe-order-file).
  --output OUTPUT       The directory in which to store the extracted target
                        contigs and lastz results.
  --min-coverage MIN_COVERAGE
                        The minimum percent coverage required for a match
                        [default=80].
  --min-identity MIN_IDENTITY
                        The minimum percent identity required for a match
                        [default=80].
  --keep-duplicates     Use this flag in case you want to keep those contigs
                        that span across multiple exons.
  --keep-paralogs       Use this flag in case you want to keep loci with signs
                        of paralogy (multiple contig matches). The longest
                        contig will be selected for these loci.
  --disable_stats       Use this flag if you want to disable generation of
                        stats (can be necessary because previous stats files
                        can't be found if files are used that were not
                        previously processed with SECAPR)

Note that in this step SECAPR will index all your locus names stored in the reference file, so in all downstream steps your loci will carry a numerical index. The translation table holding the information which index corresponds to which locus name is stored in a text file in the output folder (reference_fasta_header_info.txt).

The sensitivity of the blast algorithm (LASTZ) can be altered with the flags --min-coverage and --min-identity. High values mean conservative matching requirements, while low values will return more matches but also possibly non-orthologous sequences.

However, being to lenient with the --min-coverage and --min-identity flags can also lead to an increase of loci that are identified as paralogous, which happens when more than 1 contig matches the reference sequence.

You can choose to add the --keep-duplicates flag, in order to also keep contigs which span across multiple loci. These will be extracted independently for each contig thhey match and may hence be present in several copies in the FASTA file containing your extracted contigs. If this flag is used a txt file with the duplicate information for each sample is being printed into the samples output folder (reference_fasta_header_info.txt). This file contains the contig ID in the first column, followed by the list of reference loci indeces that the contig matched to.

The --keep-paralogs flag should be used with caution. This will lead to SECAPR keeping loci for which indications of paralogy were found (multiple matching contigs). These loci should generally not be used for phylogenetic inferences (unless you are certain for other reasons that you are not concerned about paralogous loci in your dataset), but they may be of interest in some cases. If this flag is used, a txt file with the paralog information for each sample is being printed into the samples output folder (info_paralogous_loci.txt). This file contains the reference locus index in the first column, followed by a list of contig names that matched to the locus.

Sometimes it turns out that despite being flagged, loci are not truly paralogous, but that multiple non-homologous/non-paralogous contigs were matched to the reference for other reasons (repetitive regions or other repeated sequence patterns). SECAPR provides an extra function to inspect where the flagged contigs end up on the reference (see documentation of paralogs_to_ref function). In those cases the paralogous loci can be included in further steps, i.e. apply the --keep-paralogs flag, which will extract only the longest (and usually best) contig for these loci.

Now let's run the script.

secapr find_target_contigs --contigs ../../data/processed/contigs/ --reference ../../data/raw/palm_reference_sequences.fasta --output ../../data/processed/target_contigs --keep-duplicates

To get a first idea of the resulting matches, you can have a look at the file match_table.txt in the output folder.


In [4]:
import pandas as pd
table = pd.read_csv('../../data/processed/target_contigs/match_table.txt', delimiter = '\t',index_col=0)
table.head()


Out[4]:
1082 1061 1086 1140 1065 1087 1074 1064 1083 1070 1166 1073 1080 1063 1085 1079 1068
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1

Those fields containing a '1' indicate that a unique match was extracted from the contig sequences for the respective exon and sample. If the output reveals a very low harvest of target sequences, you can try to reduce the values for the flags --min-coverage and --min-identity in order to be more generous in the matching step. If on the other hand your output turns out to capture a lot of non-homologous sequences between the different samples (can be identified after the alignment step), you may want to turn up the values for these flags in order to be more conservative in your search.

The script also prints out summary stats in a textfile in the output folder:


In [5]:
%%bash
cat ../../data/processed/target_contigs/summary_stats.txt


Total number of samples: 17
Total number of targeted exons: 837

120 exons are shared by all samples.

1082: 556 extracted contigs
1061: 545 extracted contigs
1086: 516 extracted contigs
1140: 469 extracted contigs
1065: 544 extracted contigs
1087: 562 extracted contigs
1074: 531 extracted contigs
1064: 543 extracted contigs
1083: 534 extracted contigs
1070: 539 extracted contigs
1166: 544 extracted contigs
1073: 529 extracted contigs
1080: 531 extracted contigs
1063: 525 extracted contigs
1085: 512 extracted contigs
1079: 550 extracted contigs
1068: 563 extracted contigs
mean: 534.882353 stdev: 21.605779

Plotting with SECAPR

SECAPR provides a plotting function that can be used to get a visual overview of the extracted target contigs. The function can be called as follows:


In [9]:
%%bash
source activate secapr_env
secapr plot_sequence_yield -h


usage: secapr plot_sequence_yield [-h] --extracted_contigs EXTRACTED_CONTIGS
                                  [--alignments ALIGNMENTS]
                                  [--read_cov READ_COV]
                                  [--coverage_norm COVERAGE_NORM] --output
                                  OUTPUT

Plot overview of extracted sequences

optional arguments:
  -h, --help            show this help message and exit
  --extracted_contigs EXTRACTED_CONTIGS
                        The directory containing the extracted target contigs
                        (output from find_target_contigs function).
  --alignments ALIGNMENTS
                        The directory containing the contig alignments.
                        Provide this path if you want to add a line to the
                        plot showing for which loci alignments could be
                        created.
  --read_cov READ_COV   The directory containing the reference assembly
                        results. Provide this path if you want to display the
                        read coverage for each locus and sample.
  --coverage_norm COVERAGE_NORM
                        Here you can adjust the color scale of the read-
                        coverage plot. This value will define the maximum of
                        the color scale, e.g. if set to '10' read-coverage of
                        10 and above will be colored black, while everything
                        below (0-10) will be stretched across the color
                        spectrum from yellow, red to black.
  --output OUTPUT       The directory in which to store the plots.

For now we only want to print the contig data, so we can execute the function like this:

secapr plot_sequence_yield --extracted_contigs ../../data/processed/target_contigs --output ../../data/processed/plots

The resulting plot looks like this and shows which contigs could be extracted (blue) and which were not present (white) for each sample and locus:


In [11]:
from IPython.display import Image, display
img1 = Image("../../data/processed/plots/contig_yield_overview.png",width=1000)

display(img1)


Blue means "presence" and white "absence" of the respective sequence in the final contig file. SECAPR also prints a separate text file with an overview which exon index coressponds to which locus. It appears that most loci were recovered for most samples. Further there are gaps in some places for all samples, which could indicate that the bait sequences for that locus are not very suitable for our sample data of the genus Geonoma.

If you are satisfied with your contig yield you are ready to continue to the alignment step.