In an additional test of PyNAST alignment of V2 sequences with the core set versus the PyNAST 85% OTU alignment added in biocore/qiime-default-reference#15, I find that only slightly more sequences fail to align with the PyNAST 85% OTU alignment than with the old core set. Details of my test follow. Since we don't have an inherent way to say which is better (i.e., it's possible that we want those additional sequences to fail), and we have complete understanding of how the PyNAST 85% OTU alignment was generated (which is not true for the core set), we'll move forward with the plan to use the PyNAST 85% OTU alignment as the default reference template alignment.
My first test was performed with the V2 sequences deposited under study id 449
in the QIIME database (the Costello Whole Body) dataset. I ran the following command to align representative sequences against the PyNAST 85% OTU alignment:
In [1]:
## Uncomment to remove output from previous run.
!rm -r output
import pandas as pd
from os.path import join, abspath
from qiime_default_reference import get_reference_sequences
from bfillings.formatdb import build_blast_db_from_fasta_path
gg_97_reference = get_reference_sequences()
db_name, db_files = build_blast_db_from_fasta_path(gg_97_reference, output_dir='./')
!wget http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed
core_set_template_fp = abspath("core_set_aligned.fasta.imputed")
rep_set_449_fp = abspath("data/rep_set.449.fna")
rep_set_449_out_dir = abspath("output/449_out")
rep_set_550_fp = abspath("data/rep_set.550.1.fna")
rep_set_550_out_dir = abspath("output/550_out")
In [3]:
pyn_out_qdr = join(rep_set_449_out_dir, "qdr012")
!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_qdr --jobs_to_start 8
fasta_fps = join(pyn_out_qdr, "*fasta")
!count_seqs.py -i "$fasta_fps"
This tells us that 204 sequences failed to align with PyNAST out of 10404.
Next, we perform the same alignment against the core set.
In [4]:
pyn_out_core = join(rep_set_449_out_dir, "core-set")
!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp
fasta_fps = join(pyn_out_core, "*fasta")
!count_seqs.py -i "$fasta_fps"
This tells us that 171 sequences failed to align with PyNAST out of 10404, so slightly less when aligning against the core set.
However, it's important to note that different sequences failed to align with the two data sets (i.e., the sequences that fail to align against the core set are not a subset of the sequences that fail to align against the 85% PyNAST-aligned OTUs). We therefore don't immediately know which is better.
In [5]:
# find the sequences that didn't align against the 85% PyNAST-aligned OTUs
# but did align against the core set
!filter_fasta.py -f $pyn_out_qdr/rep_set.449_failures.fasta -o $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -a $pyn_out_core/rep_set.449_aligned.fasta
!count_seqs.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta
In [6]:
# find the sequences that didn't align against the core set
# but did align against the 85% PyNAST-aligned OTUs
!filter_fasta.py -f $pyn_out_core/rep_set.449_failures.fasta -o $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -a $pyn_out_qdr/rep_set.449_aligned.fasta
!count_seqs.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta
So what do the BLAST alignments look like if we query these sequences against the Greengenes 13_8 97% OTUs?
In [7]:
blast_out_qdr = join(pyn_out_qdr, "failures-that-aligned-v-core-set-blast-results")
!parallel_blast.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -b $db_name -o $blast_out_qdr -O 8
In [8]:
blast_headers = !head $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt
blast_headers = blast_headers[3].strip('# Fields: ').split(', ')
blast_out = !grep -v '^#' $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt
blast_out = [l.split('\t') for l in blast_out]
blast_results_qdr = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)
In [9]:
blast_results_qdr
Out[9]:
In [10]:
blast_results_qdr.mean()
Out[10]:
Here we see that the mean alignment length is $163$, but the mean/standard deviation of the sequences provided as input above was $256 +/- 16$ ($163/256 \approx 63\%$ of sequence length is aligned). It therefore looks like we're getting partial alignments, which may be indicative of chimeras or sequence quality drop-off. This suggests that these sequences may be aligning to sequences that were in the core set, but which were dropped from in Greengenes 13_8 (or some earlier version of the database).
In [11]:
blast_out_core = join(pyn_out_core, "failures-that-aligned-v-pyn-85-otus-blast-results")
!parallel_blast.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -b $db_name -o $blast_out_core -O 8
In [12]:
blast_headers = !head $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt
blast_headers = blast_headers[3].strip('# Fields: ').split(', ')
blast_out = !grep -v '^#' $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt
blast_out = [l.split('\t') for l in blast_out]
blast_results_core = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)
In [13]:
blast_results_core
Out[13]:
In [14]:
blast_results_core.mean()
Out[14]:
Here it also looks like these are partial alignments, but average a little longer ($181/236 \approx 77\% $ of sequence length aligned). These are within the threshold for PyNAST to align them (default is 75% of the median input sequence length), which is why we see them align to the 85% OTU alignment.
My next test was performed with the V4 sequences deposited under study id 550
in the QIIME database (the Moving Pictures) dataset. These tests are less extensive than the ones above, just because I've run out of time for now.
In [15]:
pyn_out_qdr = join(rep_set_550_out_dir, "qdr012")
!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_qdr --jobs_to_start 8
fasta_fps = join(pyn_out_qdr, "*fasta")
!count_seqs.py -i "$fasta_fps"
This is 9.8% failing to align against the 85% PyNAST-aligned OTUs.
In [16]:
pyn_out_core = join(rep_set_550_out_dir, "core-set")
!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp
fasta_fps = join(pyn_out_core, "*fasta")
!count_seqs.py -i "$fasta_fps"
This is 10.6% failing to align against the core set.
So, unlike what we saw with V2 above, we're now having fewer sequences failing to align to the Greengenes 85% OTUs than failed to align to the core set.
The analyses presented here were originally described in biocore/qiime-default-reference#14. These were ported to a separate repository for reproducibility.