Before starting to process your fastq files with the secapr
pipeline unzip all of your fastq files and make sure they are named properly (the file names coming from the sequencing faciilities are often very long and difficult to work with). A simple numerical ID is enough followed by _R1
for the forward reads and _R2
for the backward reads. All fastq files should be in the same directory as in this example:
In [2]:
%%bash
# example of the input file structure and naming: a plain folder with unzipped backward and forward fastq files
ls ../../data/raw/fastq/ | head -n 20
To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. To generate a quality overview over all samples you can use the secapr quality_check
function:
secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw
In [11]:
from IPython.display import Image, display
img1 = Image("../../data/processed/fastqc_results/raw/quality_summary_all_samples_1.png",height=400,width=200)
img2 = Image("../../data/processed/fastqc_results/raw/quality_summary_all_samples_2.png",height=100,width=400)
print("Fastqc results of uncleaned fastq-files:")
display(img1)
display(img2)
The two plots produced by the R-script show summary statistics for each individual test (tests shown on x-axis). The test names carry 3-letter acronyms, and the corresponding full test-name can be found by opening one of the html files. The first plot shows how many occurrences of each test-result (fail,pass,warn) were found for each test among all samples (per-test basis). The second plot shows for each sample (y-axis) which test had which result (per-sample basis). Eventually we want to get rid of all the red in these plots (see below).
secapr
(default settings)As we see above, the raw reads that you recevied from the sequencing facility are usually not fit for further processing, since all files fail several quality tests (red fields). One of the main issues with raw read files is that they contain low quality reads. Another issue is that most if not all reads will probably contain parts of the Illumina adapter sequences, which are attached on both ends of the read. In order to prepare the reads for further steps, we need to make sure that we properly filter out low-quality reads and that we clip off all remaining adapter contaminations. We can use the secapr clean_reads
function to do exactly that for all of our files. This function applies the cleaning and trimming software Trimmomatic (Bolger et al. 2014).
Here is an overview of all available settings for secapr clean_reads
:
In [12]:
%%bash
source activate secapr_env
secapr clean_reads -h
The script requires a config file which should contain the adapter and barcode information. You see an example of the config file below. The file consists of 3 sections:
Here you provide the adapter sequences that were used during library preparation in the lab. Check which Illumina kit was used and find the corresponding sequences either in the Illumina manual or on their webpage. Add a * in the spot where the sample specific barcode will be inserted.
Here you provide a unique string for each sample, as it occurs in the filename (this is how the program recognizes which files to process). You usually have two file for each sample (forward and backward reads). Make sure that this unique identifier occurs in both of those files. Add ':_' after each identifier (don't ask why).
Here you specify the sample specific barcode for each sample. First you state which adapter (i7 or i5) the barcode will be inserted in, followed by a dash (-) and the name of the sample, as stated in the section [names].
.
Note: If you are working with double indexed adapters (both adapters containing barcodes), don't forget to add the * in the position where the barcode is inserted, also for the i5 adapter. In that case make sure in the [barcodes]-section to assign both barcodes for each sample to the correct adapter, e.g.:
i7-sampleID:AAACCC
i5-sampleID:AATTCC
In [16]:
%%bash
cat ../../data/raw/adapter_info.txt
secapr clean_reads
functionAfter preparing your config file in the previous step, you are ready to start with cleaning and trimming of your fastq files. For now let's just run the script with default settings. We will show below how to customize the settings of the script in order to achive the best cleaning results for your dataset. Let's run the script as in this example command:
secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single
secapr clean_reads
produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample.
After cleaning the reads with secapr clean_reads
with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads.
secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings
In [16]:
from IPython.display import Image, display
img1 = Image("../../data/processed/fastqc_results/cleaned_default_settings/quality_summary_all_samples_1.png",height=400,width=200)
img2 = Image("../../data/processed/fastqc_results/cleaned_default_settings/quality_summary_all_samples_2.png",height=100,width=400)
print("Fastqc results of fastq-files cleaned with default settings:")
display(img1,img2)
We ran secapr clean_reads
with default settings and we see a clear improve in comparison to the quality test results of the raw reads (see plots further up in this document). However, there are still quite a few failed tests and I'm convinced we can do better than that. Check the secapr clean_reads
documentation (by adding -h
to the command) in order to see the available options and try some different settings in order to see if and how the results improve. It helps to check out in one of the html files what the different tests mean and try to find a settings in secapr clean_reads
that could be taking care of the specific problem. Preferably all samples should pass all tests (there may still be some warnings) before you continue with further processing of the reads. Below we show an example of how the results can be further improved:
As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the intial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (secapr clean_reads -h
).
secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10
Let's check the final quality of the data:
secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_settings
In [17]:
from IPython.display import Image, display
img1 = Image("../../data/processed/fastqc_results/custom_settings/quality_summary_all_samples_1.png",height=400,width=200)
img2 = Image("../../data/processed/fastqc_results/custom_settings/quality_summary_all_samples_2.png",height=100,width=400)
print("Fastqc results of fastq-files cleaned with default settings:")
display(img1,img2)
You can see how the overall quality of the data improves (compare R-plots to the ones resulting from the deafult settings). Once all samples are properly cleaned, move on to the next step, the contig assembly