In an additional test of PyNAST alignment of V2 sequences with the core set versus the PyNAST 85% OTU alignment added in biocore/qiime-default-reference#15, I find that only slightly more sequences fail to align with the PyNAST 85% OTU alignment than with the old core set. Details of my test follow. Since we don't have an inherent way to say which is better (i.e., it's possible that we want those additional sequences to fail), and we have complete understanding of how the PyNAST 85% OTU alignment was generated (which is not true for the core set), we'll move forward with the plan to use the PyNAST 85% OTU alignment as the default reference template alignment.

V2 sequence collection

My first test was performed with the V2 sequences deposited under study id 449 in the QIIME database (the Costello Whole Body) dataset. I ran the following command to align representative sequences against the PyNAST 85% OTU alignment:


In [1]:
## Uncomment to remove output from previous run.
!rm -r output
import pandas as pd
from os.path import join, abspath
from qiime_default_reference import get_reference_sequences
from bfillings.formatdb import build_blast_db_from_fasta_path  

gg_97_reference = get_reference_sequences()
db_name, db_files = build_blast_db_from_fasta_path(gg_97_reference, output_dir='./')

!wget http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed
core_set_template_fp = abspath("core_set_aligned.fasta.imputed")
rep_set_449_fp = abspath("data/rep_set.449.fna")
rep_set_449_out_dir = abspath("output/449_out")
rep_set_550_fp = abspath("data/rep_set.550.1.fna")
rep_set_550_out_dir = abspath("output/550_out")


rm: output: No such file or directory
--2015-04-16 09:15:50--  http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed
Resolving greengenes.lbl.gov... 128.32.248.7
Connecting to greengenes.lbl.gov|128.32.248.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37975992 (36M) [text/plain]
Saving to: 'core_set_aligned.fasta.imputed'

100%[======================================>] 37,975,992  1.47MB/s   in 14s    

2015-04-16 09:16:07 (2.62 MB/s) - 'core_set_aligned.fasta.imputed' saved [37975992/37975992]


In [3]:
pyn_out_qdr = join(rep_set_449_out_dir, "qdr012")
!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_qdr --jobs_to_start 8
fasta_fps = join(pyn_out_qdr, "*fasta")
!count_seqs.py -i "$fasta_fps"


204  : /home/ubuntu/qdr-issue14/output/449_out/qdr012/rep_set.449_failures.fasta (Sequence lengths (mean +/- std): 181.9706 +/- 56.2609)
10200  : /home/ubuntu/qdr-issue14/output/449_out/qdr012/rep_set.449_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
10404  : Total

This tells us that 204 sequences failed to align with PyNAST out of 10404.

Next, we perform the same alignment against the core set.


In [4]:
pyn_out_core = join(rep_set_449_out_dir, "core-set")
!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp
fasta_fps = join(pyn_out_core, "*fasta")
!count_seqs.py -i "$fasta_fps"


171  : /home/ubuntu/qdr-issue14/output/449_out/core-set/rep_set.449_failures.fasta (Sequence lengths (mean +/- std): 166.0292 +/- 46.7550)
10233  : /home/ubuntu/qdr-issue14/output/449_out/core-set/rep_set.449_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
10404  : Total

This tells us that 171 sequences failed to align with PyNAST out of 10404, so slightly less when aligning against the core set.

However, it's important to note that different sequences failed to align with the two data sets (i.e., the sequences that fail to align against the core set are not a subset of the sequences that fail to align against the 85% PyNAST-aligned OTUs). We therefore don't immediately know which is better.


In [5]:
# find the sequences that didn't align against the 85% PyNAST-aligned OTUs
# but did align against the core set
!filter_fasta.py -f $pyn_out_qdr/rep_set.449_failures.fasta -o $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -a $pyn_out_core/rep_set.449_aligned.fasta
!count_seqs.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta


47  : /home/ubuntu/qdr-issue14/output/449_out/qdr012/failures-that-aligned-v-core-set.fasta (Sequence lengths (mean +/- std): 256.3404 +/- 16.3086)
47  : Total

In [6]:
# find the sequences that didn't align against the core set
# but did align against the 85% PyNAST-aligned OTUs
!filter_fasta.py -f $pyn_out_core/rep_set.449_failures.fasta -o $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -a $pyn_out_qdr/rep_set.449_aligned.fasta
!count_seqs.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta


14  : /home/ubuntu/qdr-issue14/output/449_out/core-set/failures-that-aligned-v-pyn-85-otus.fasta (Sequence lengths (mean +/- std): 236.9286 +/- 11.0677)
14  : Total

So what do the BLAST alignments look like if we query these sequences against the Greengenes 13_8 97% OTUs?


In [7]:
blast_out_qdr = join(pyn_out_qdr, "failures-that-aligned-v-core-set-blast-results")

!parallel_blast.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -b $db_name -o $blast_out_qdr -O 8

In [8]:
blast_headers = !head $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt
blast_headers = blast_headers[3].strip('# Fields: ').split(', ')
blast_out = !grep -v '^#' $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt
blast_out = [l.split('\t') for l in blast_out]
blast_results_qdr = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)

In [9]:
blast_results_qdr


Out[9]:
Query id Subject id % identity alignment length mismatches gap openings q. start q. end s. start s. end e-value bit scor
0 New.CleanUp.ReferenceOTU10449 853867 100.00 149 0 0 1 149 317 169 0 295
1 New.CleanUp.ReferenceOTU10796 960131 100.00 131 0 0 1 131 317 187 0 260
2 New.CleanUp.ReferenceOTU11004 960131 99.32 148 1 0 1 148 317 170 0 285
3 New.CleanUp.ReferenceOTU11548 509773 100.00 132 0 0 1 132 336 205 0 262
4 New.CleanUp.ReferenceOTU12392 853867 100.00 156 0 0 1 156 317 162 0 309
5 New.CleanUp.ReferenceOTU12511 853867 98.72 156 1 1 1 155 317 162 0 285
6 New.CleanUp.ReferenceOTU12532 1000986 98.84 173 2 0 1 173 304 132 0 327
7 New.CleanUp.ReferenceOTU13882 129798 97.04 169 1 3 1 169 368 204 0 274
8 New.CleanUp.ReferenceOTU14202 960131 98.18 165 1 1 1 165 317 155 0 297
9 New.CleanUp.ReferenceOTU15614 4366889 96.49 171 1 3 1 171 303 138 0 272
10 New.CleanUp.ReferenceOTU16729 4467774 99.45 181 1 0 1 181 308 128 0 351
11 New.CleanUp.ReferenceOTU17453 870751 100.00 137 0 0 1 137 317 181 0 272
12 New.CleanUp.ReferenceOTU18041 4422519 99.04 209 0 2 1 209 349 143 0 383
13 New.CleanUp.ReferenceOTU18339 4467774 99.43 176 0 1 1 176 308 134 0 333
14 New.CleanUp.ReferenceOTU18724 965048 98.89 180 0 2 1 178 313 134 0 325
15 New.CleanUp.ReferenceOTU1891 4308647 94.62 186 5 3 1 186 346 166 0 270
16 New.CleanUp.ReferenceOTU2098 979261 100.00 157 0 0 1 157 317 161 0 311
17 New.CleanUp.ReferenceOTU21175 4347865 100.00 141 0 0 1 141 334 194 0 280
18 New.CleanUp.ReferenceOTU21766 4467774 100.00 178 0 0 1 178 308 131 0 353
19 New.CleanUp.ReferenceOTU21849 853867 99.36 156 0 1 1 155 317 162 0 293
20 New.CleanUp.ReferenceOTU22455 4417848 97.92 192 1 2 2 193 306 118 0 335
21 New.CleanUp.ReferenceOTU22842 379925 98.63 146 0 2 1 146 298 155 0 258
22 New.CleanUp.ReferenceOTU22953 4368216 100.00 180 0 0 1 180 315 136 0 357
23 New.CleanUp.ReferenceOTU2306 1064036 97.59 166 2 2 1 166 326 163 0 281
24 New.CleanUp.ReferenceOTU23446 979261 100.00 144 0 0 1 144 317 174 0 285
25 New.CleanUp.ReferenceOTU24551 867184 93.95 215 4 7 1 215 329 124 0 272
26 New.CleanUp.ReferenceOTU24622 1061772 100.00 149 0 0 1 149 317 169 0 295
27 New.CleanUp.ReferenceOTU24747 960131 99.32 148 1 0 1 148 317 170 0 285
28 New.CleanUp.ReferenceOTU24753 129798 96.36 165 0 5 1 165 368 210 0 242
29 New.CleanUp.ReferenceOTU303 1136387 92.97 185 13 0 1 185 310 126 0 264
30 New.CleanUp.ReferenceOTU3067 495451 96.94 196 3 3 1 196 311 119 0 317
31 New.CleanUp.ReferenceOTU3375 4477696 100.00 148 0 0 1 148 329 182 0 293
32 New.CleanUp.ReferenceOTU3384 945061 94.59 222 12 0 1 222 315 94 0 345
33 New.CleanUp.ReferenceOTU3739 4467774 100.00 188 0 0 1 188 308 121 0 373
34 New.CleanUp.ReferenceOTU4059 635216 98.58 141 0 2 1 139 300 160 0 248
35 New.CleanUp.ReferenceOTU4528 979261 98.61 144 2 0 1 144 317 174 0 270
36 New.CleanUp.ReferenceOTU5484 934488 100.00 144 0 0 1 144 317 174 0 285
37 New.CleanUp.ReferenceOTU6173 1110261 96.22 185 5 2 1 184 318 135 0 295
38 New.CleanUp.ReferenceOTU6680 979261 99.31 144 1 0 1 144 317 174 0 278
39 New.CleanUp.ReferenceOTU720 932077 100.00 144 0 0 1 144 313 170 0 285
40 New.CleanUp.ReferenceOTU7237 979261 100.00 144 0 0 1 144 317 174 0 285
41 New.CleanUp.ReferenceOTU7455 129798 100.00 139 0 0 1 139 368 230 0 276
42 New.CleanUp.ReferenceOTU7553 335887 99.40 168 0 1 1 168 337 171 0 317
43 New.CleanUp.ReferenceOTU7914 853867 100.00 156 0 0 1 156 317 162 0 309
44 New.CleanUp.ReferenceOTU7942 4442170 100.00 154 0 0 1 154 310 157 0 305
45 New.CleanUp.ReferenceOTU8857 669486 98.26 172 3 0 1 172 330 159 0 317
46 New.CleanUp.ReferenceOTU9075 960131 99.35 153 0 1 1 152 317 165 0 287

In [10]:
blast_results_qdr.mean()


Out[10]:
Subject id          1729626.042553
% identity               98.667660
alignment length        163.468085
mismatches                1.276596
gap openings              0.936170
q. start                  1.021277
q. end                  163.319149
s. start                320.723404
s. end                  159.234043
e-value                   0.000000
bit scor                297.787234
dtype: float64

Here we see that the mean alignment length is $163$, but the mean/standard deviation of the sequences provided as input above was $256 +/- 16$ ($163/256 \approx 63\%$ of sequence length is aligned). It therefore looks like we're getting partial alignments, which may be indicative of chimeras or sequence quality drop-off. This suggests that these sequences may be aligning to sequences that were in the core set, but which were dropped from in Greengenes 13_8 (or some earlier version of the database).


In [11]:
blast_out_core = join(pyn_out_core, "failures-that-aligned-v-pyn-85-otus-blast-results")

!parallel_blast.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -b $db_name -o $blast_out_core -O 8

In [12]:
blast_headers = !head $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt
blast_headers = blast_headers[3].strip('# Fields: ').split(', ')
blast_out = !grep -v '^#' $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt
blast_out = [l.split('\t') for l in blast_out]
blast_results_core = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)

In [13]:
blast_results_core


Out[13]:
Query id Subject id % identity alignment length mismatches gap openings q. start q. end s. start s. end e-value bit scor
0 New.CleanUp.ReferenceOTU11778 2426773 99.17 120 1 0 1 120 486 367 0 230
1 New.CleanUp.ReferenceOTU14235 1100972 98.84 172 2 0 1 172 316 145 0 325
2 New.CleanUp.ReferenceOTU14439 3339747 99.55 223 0 1 1 223 305 84 0 426
3 New.CleanUp.ReferenceOTU15749 937813 95.58 181 7 1 1 181 323 144 0 287
4 New.CleanUp.ReferenceOTU16523 4484060 96.49 228 2 4 1 228 351 130 0 361
5 New.CleanUp.ReferenceOTU20112 930873 99.37 159 0 1 1 159 323 166 0 299
6 New.CleanUp.ReferenceOTU20493 930873 98.78 164 2 0 1 164 323 160 0 309
7 New.CleanUp.ReferenceOTU2210 4466460 97.56 164 4 0 1 164 307 144 0 293
8 New.CleanUp.ReferenceOTU2265 827969 98.36 183 3 0 1 183 319 137 0 339
9 New.CleanUp.ReferenceOTU23577 1035532 100.00 179 0 0 1 179 311 133 0 355
10 New.CleanUp.ReferenceOTU24778 495095 98.35 182 3 0 1 182 323 142 0 337
11 New.CleanUp.ReferenceOTU2799 4302012 91.42 233 16 3 1 230 304 73 0 281
12 New.CleanUp.ReferenceOTU5311 1035532 100.00 179 0 0 1 179 311 133 0 355
13 New.CleanUp.ReferenceOTU8921 1035532 99.44 179 1 0 1 179 311 133 0 347

In [14]:
blast_results_core.mean()


Out[14]:
Subject id          1953517.357143
% identity               98.065000
alignment length        181.857143
mismatches                2.928571
gap openings              0.714286
q. start                  1.000000
q. end                  181.642857
s. start                329.500000
s. end                  149.357143
e-value                   0.000000
bit scor                324.571429
dtype: float64

Here it also looks like these are partial alignments, but average a little longer ($181/236 \approx 77\% $ of sequence length aligned). These are within the threshold for PyNAST to align them (default is 75% of the median input sequence length), which is why we see them align to the 85% OTU alignment.

V4 sequence collection

My next test was performed with the V4 sequences deposited under study id 550 in the QIIME database (the Moving Pictures) dataset. These tests are less extensive than the ones above, just because I've run out of time for now.


In [15]:
pyn_out_qdr = join(rep_set_550_out_dir, "qdr012")
!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_qdr --jobs_to_start 8
fasta_fps = join(pyn_out_qdr, "*fasta")
!count_seqs.py -i "$fasta_fps"


1475  : /home/ubuntu/qdr-issue14/output/550_out/qdr012/rep_set.550.1_failures.fasta (Sequence lengths (mean +/- std): 92.4312 +/- 18.7288)
13594  : /home/ubuntu/qdr-issue14/output/550_out/qdr012/rep_set.550.1_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
15069  : Total

This is 9.8% failing to align against the 85% PyNAST-aligned OTUs.


In [16]:
pyn_out_core = join(rep_set_550_out_dir, "core-set")
!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp
fasta_fps = join(pyn_out_core, "*fasta")
!count_seqs.py -i "$fasta_fps"


1596  : /home/ubuntu/qdr-issue14/output/550_out/core-set/rep_set.550.1_failures.fasta (Sequence lengths (mean +/- std): 95.2481 +/- 20.8748)
13473  : /home/ubuntu/qdr-issue14/output/550_out/core-set/rep_set.550.1_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
15069  : Total

This is 10.6% failing to align against the core set.

So, unlike what we saw with V2 above, we're now having fewer sequences failing to align to the Greengenes 85% OTUs than failed to align to the core set.

The analyses presented here were originally described in biocore/qiime-default-reference#14. These were ported to a separate repository for reproducibility.