After cleaning and trimming the reads in the previous step, we are now ready to use the FASTQ reads for de novo contig assembly. In this step, the overlap between reads is used to build long, uninterrupted sequences, among which we hope to find the target regions that were selected for during sequence capture. In a way, a contig can be seen as a consensus of several reads:
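To make the idea of building a contig from overlapping reads concrete, here is a minimal toy sketch in Python. It is not how ABySS or Trinity work internally (both are built around de Bruijn graphs and handle sequencing errors, reverse complements, and coverage). The function names, the example reads, and the `min_overlap` parameter are made up for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)  # (overlap size, index of left, index of right)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            # No remaining pair overlaps by at least min_overlap bases
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical error-free reads; the last one overlaps with nothing
reads = ["ATGGCGTAC", "CGTACGTTA", "GTTAGCCA", "TTTTTTTT"]
contigs = greedy_assemble(reads)
print(contigs)
```

The three overlapping reads collapse into a single longer sequence, while the unrelated read remains as its own short "contig", which mirrors why real contig files contain many short sequences next to the longer ones of interest.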
However, the underlying algorithms for building contigs are much more complex than in the simplified image above. In our pipeline you can choose between the assembly programs ABySS and Trinity. We tested both assemblers and they appear to produce very similar results. If you are working with sequence capture data of DNA regions (exons, introns, UCEs, mitochondrial markers, etc.), we recommend using ABySS, since this assembler was built for assembling DNA sequences. Trinity, on the other hand, was built for assembling transcriptome (RNA) sequences.
%%bash
secapr assemble_reads -h
Now we run the assembly with the default options, just like this:
secapr assemble_reads --input ../../data/processed/cleaned_trimmed_reads/ --output ../../data/processed/contigs
The assembly step is very time-intensive and may take several hours or even days, depending on the number of samples and the size of the files. For our example dataset, the assembly took approximately 45 min per sample.
The assembly step produces a FASTA file for each sample, containing all assembled contig sequences. A contig FASTA file commonly contains thousands or even hundreds of thousands of sequences, many of which are short, random sequences that were picked up during sequencing. You may also find some very long sequences in the file, which may represent the mitochondrial genome or, in plants, large parts of the chloroplast genome. In sequence capture datasets, we are mostly interested in the contigs that represent our enriched target sequences. In the next step we show how these can be extracted from the contig file using a reference FASTA file containing templates for the sequences of interest (often the file used to design the RNA baits). Go to the manual for extracting target contigs.
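Before moving on, it can be useful to get a quick overview of how many contigs a sample file contains and how long they are. The following is a minimal sketch in plain Python, not part of SECAPR. The helper names and the `long_threshold` cutoff of 10,000 bp are assumptions chosen for illustration, and the in-memory example stands in for a real contig file.

```python
import io


def read_fasta_lengths(handle):
    """Return {header: sequence length} for an open FASTA file handle.

    Assumes headers are unique; a repeated header would overwrite the earlier one.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # Sequences may be wrapped over several lines, so add them up
            lengths[name] += len(line)
    return lengths


def summarize(lengths, long_threshold=10000):
    """Basic counts; `long_threshold` is an arbitrary cutoff for 'very long' contigs."""
    sizes = sorted(lengths.values())
    return {
        "n_contigs": len(sizes),
        "shortest": sizes[0] if sizes else 0,
        "longest": sizes[-1] if sizes else 0,
        "n_long": sum(1 for s in sizes if s >= long_threshold),
    }


# Tiny made-up example standing in for a real contig FASTA file
example = io.StringIO(
    ">contig1\nACGTACGT\nACGT\n>contig2\nACG\n>contig3\n" + "A" * 12000 + "\n"
)
lengths = read_fasta_lengths(example)
summary = summarize(lengths)
print(summary)
```

For a real sample, you would pass an opened contig file from the output folder (for example `with open(path_to_contig_fasta) as handle:`, where `path_to_contig_fasta` is a placeholder for one of the files written to `../../data/processed/contigs`) instead of the `io.StringIO` object.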