ghost-tree workflow to create a fungal 18S and ITS hybrid tree

This workflow creates a hybrid phylogenetic tree and can be run as is, with no modifications. All of the files needed to generate the tree are publicly available from their respective databases. The workflow is specific to the Silva and UNITE databases; however, the commands may be adapted to create hybrid trees from other marker genes.

Dependency Versions:

ghost-tree git@5f5d5b868fa951cecc7731ecc82f8d2798359c82
SUMACLUST 1.0.01
MUSCLE 3.8.31
FastTree 2.1.8
scikit-bio 0.2.3

Download the necessary UNITE and Silva files


In [6]:
#Silva files
!wget 'http://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta.gz'
!wget 'http://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/taxonomy/tax_slv_ssu_nr_119.acc_taxid'
!wget 'http://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/taxonomy/tax_slv_ssu_nr_119.txt'
!gunzip SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta.gz


--13:30:03--  http://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta.gz
           => `SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta.gz'
Resolving www.arb-silva.de... 134.102.40.6
Connecting to www.arb-silva.de[134.102.40.6]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,113,513,256 [text/plain]

100%[==================================>] 1,113,513,256    5.72M/s    ETA 00:00

13:34:03 (4.43 MB/s) - `SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta.gz' saved [1113513256/1113513256]

--13:34:03--  http://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/taxonomy/tax_slv_ssu_nr_119.acc_taxid
           => `tax_slv_ssu_nr_119.acc_taxid'
Resolving www.arb-silva.de... 134.102.40.6
Connecting to www.arb-silva.de[134.102.40.6]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11,303,218 [text/plain]

100%[====================================>] 11,303,218     2.83M/s    ETA 00:00

13:34:09 (1.93 MB/s) - `tax_slv_ssu_nr_119.acc_taxid' saved [11303218/11303218]

--13:34:09--  http://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/taxonomy/tax_slv_ssu_nr_119.txt
           => `tax_slv_ssu_nr_119.txt'
Resolving www.arb-silva.de... 134.102.40.6
Connecting to www.arb-silva.de[134.102.40.6]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,193,845 [text/plain]

100%[====================================>] 1,193,845    557.81K/s             

13:34:12 (556.04 KB/s) - `tax_slv_ssu_nr_119.txt' saved [1193845/1193845]


In [7]:
#UNITE Files
!wget 'https://github.com/qiime/its-reference-otus/raw/master/taxonomy/97_otu_taxonomy.txt.gz'
!wget 'https://github.com/qiime/its-reference-otus/raw/master/rep_set/97_otus.fasta.gz'
!gunzip 97_otu_taxonomy.txt.gz
!gunzip 97_otus.fasta.gz


--13:37:46--  https://github.com/qiime/its-reference-otus/raw/master/taxonomy/97_otu_taxonomy.txt.gz
           => `97_otu_taxonomy.txt.gz'
Resolving github.com... 192.30.252.128
Connecting to github.com[192.30.252.128]:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/qiime/its-reference-otus/master/taxonomy/97_otu_taxonomy.txt.gz [following]
--13:37:49--  https://raw.githubusercontent.com/qiime/its-reference-otus/master/taxonomy/97_otu_taxonomy.txt.gz
           => `97_otu_taxonomy.txt.gz'
Resolving raw.githubusercontent.com... 199.27.74.133
Connecting to raw.githubusercontent.com[199.27.74.133]:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655,626 [application/octet-stream]

100%[====================================>] 655,626        2.11M/s             

13:37:50 (2.11 MB/s) - `97_otu_taxonomy.txt.gz' saved [655626/655626]

--13:37:51--  https://github.com/qiime/its-reference-otus/raw/master/rep_set/97_otus.fasta.gz
           => `97_otus.fasta.gz'
Resolving github.com... 192.30.252.128
Connecting to github.com[192.30.252.128]:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/qiime/its-reference-otus/master/rep_set/97_otus.fasta.gz [following]
--13:37:51--  https://raw.githubusercontent.com/qiime/its-reference-otus/master/rep_set/97_otus.fasta.gz
           => `97_otus.fasta.gz'
Resolving raw.githubusercontent.com... 199.27.74.133
Connecting to raw.githubusercontent.com[199.27.74.133]:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8,075,551 [application/octet-stream]

100%[====================================>] 8,075,551      5.07M/s             

13:37:54 (5.06 MB/s) - `97_otus.fasta.gz' saved [8075551/8075551]

Assign variables to each of the files to be used throughout the workflow


In [1]:
silva_aligned = 'SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta'
silva_accession = 'tax_slv_ssu_nr_119.acc_taxid'
silva_taxonomy = 'tax_slv_ssu_nr_119.txt'
silva_fungi_only = 'silva_fungi_only.txt'
silva_fungi_filtered = 'silva_fungi_only_filtered.txt'
ITS_seqs = '97_otus.fasta'
ITS_otu_map_80 = 'ITS_otu_map_80.txt'
ITS_tax = '97_otu_taxonomy.txt'
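
Because the later steps run for minutes to hours, it can help to confirm that the downloaded inputs are in place before starting. The following is a minimal sketch, not part of ghost-tree: the helper name `missing_files` is hypothetical, and the file list simply repeats the input file names assigned above.

```python
import os


def missing_files(paths):
    """Return the paths that do not exist or are empty files."""
    return [p for p in paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0]


# Input files that must exist before the first ghost-tree command is run.
required_inputs = [
    'SILVA_119_SSURef_Nr99_tax_silva_full_align_trunc.fasta',
    'tax_slv_ssu_nr_119.acc_taxid',
    'tax_slv_ssu_nr_119.txt',
    '97_otus.fasta',
    '97_otu_taxonomy.txt',
]

missing = missing_files(required_inputs)
if missing:
    print('Missing or empty input files: ' + ', '.join(missing))
else:
    print('All input files are present.')
```

The output files (`silva_fungi_only.txt`, `silva_fungi_only_filtered.txt`, `ITS_otu_map_80.txt`) are deliberately left out of the check, since they are created by the steps that follow.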

Remove all non-fungal sequences from the Silva alignment


In [3]:
!time ghost-tree silva extract-fungi $silva_aligned $silva_accession $silva_taxonomy $silva_fungi_only


real	5m46.302s
user	5m26.687s
sys	0m14.242s

Filter positions from the fungal-only Silva alignment by entropy and gap frequency


In [4]:
!time ghost-tree filter-alignment-positions $silva_fungi_only 0.9 0.8 $silva_fungi_filtered


real	50m25.698s
user	50m15.813s
sys	0m6.584s
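
To illustrate what filtering alignment positions by gap frequency and entropy means, here is a toy sketch in plain Python. It is not ghost-tree's implementation. The function names, the use of Shannon entropy in bits over non-gap characters, and the reading of the two thresholds as upper limits are all assumptions made for this example; check the ghost-tree documentation for how the `0.9` and `0.8` arguments are actually interpreted.

```python
import math
from collections import Counter

GAP_CHARS = set('-.')


def gap_frequency(column):
    """Fraction of characters in an alignment column that are gaps."""
    return sum(1 for c in column if c in GAP_CHARS) / float(len(column))


def shannon_entropy(column):
    """Shannon entropy (in bits) of the non-gap characters in a column."""
    residues = [c for c in column if c not in GAP_CHARS]
    if not residues:
        return 0.0
    total = float(len(residues))
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total, 2) for n in counts.values())


def filter_positions(seqs, max_gap_frequency, max_entropy):
    """Keep only columns at or below both thresholds (assumed semantics)."""
    columns = list(zip(*seqs))
    keep = [i for i, col in enumerate(columns)
            if gap_frequency(col) <= max_gap_frequency
            and shannon_entropy(col) <= max_entropy]
    return [''.join(seq[i] for i in keep) for seq in seqs]


# Toy alignment: column 3 is all gaps, column 5 is maximally variable.
toy_alignment = ['AC-GA', 'AC-GC', 'AC-GG', 'AC-GT']
filtered = filter_positions(toy_alignment, 0.9, 0.8)
print(filtered)
```

In this toy alignment the all-gap column and the fully variable last column are dropped, leaving `ACG` for every sequence.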

Group the extension sequences (UNITE sequences clustered at 97%) at 80% identity.

Decreasing the identity threshold creates fewer total clusters, with more sequences per cluster.


In [2]:
!time ghost-tree extensions group-extensions $ITS_seqs 0.8 $ITS_otu_map_80


===========================================================
 SUMACLUST version 1.0.01
 Alignment using SSE2 instructions.
===========================================================
Reading dataset...
55404 sequences
Cleaning dataset... : 79472 nucleotides substituted in 6648 sequences
Indexing dataset... : Done
Sorting sequences by count...
Maximum ratio between the counts of two sequences to connect them: 1.000000
Clustering sequences when similarity >= 0.800000
Aligning and clustering... 
Done : 100 %       18081 clusters created.                        
Printing results in OTU table format...
Done.

real	1012m4.275s
user	216m21.857s
sys	0m24.478s
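
The effect of the identity threshold on cluster count can be seen in a toy example. This sketch is not SUMACLUST's algorithm: it uses a naive position-by-position identity on equal-length sequences and a simple greedy assignment to the first matching cluster seed. It is here only to show why a lower threshold yields fewer, larger clusters.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(len(a))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose seed is similar enough.

    The first member of each cluster acts as its seed. A sequence that
    matches no existing seed starts a new cluster.
    """
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


toy_seqs = ['AAAAAAAAAA', 'AAAAAAAAAT', 'AAAAAAAATT',
            'CCCCCCCCCC', 'CCCCCCCCCA']

for threshold in (0.97, 0.8):
    clusters = greedy_cluster(toy_seqs, threshold)
    print('threshold %.2f: %d clusters' % (threshold, len(clusters)))
```

With these five toy sequences, the 0.97 threshold keeps every sequence in its own cluster, while 0.8 merges them into two.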

Build the hybrid-tree

Steps involved in ghost-tree scaffold hybrid-tree:

  1. Build the foundation tree from 18S sequences
  2. Group ITS extension sequences by cluster
  3. Determine consensus taxonomy for each cluster
  4. Group any clusters with the same consensus taxonomy
  5. Align all sequences in each group
  6. Build a tree for each ITS group
  7. Graft extension trees to foundation tree
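
Steps 3 and 4 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ghost-tree's code: "consensus taxonomy" is taken here to mean the longest run of leading ranks shared by every sequence in a cluster, and the cluster map and taxonomy are small in-memory dictionaries with made-up identifiers rather than the real files.

```python
def consensus_taxonomy(lineages):
    """Return the longest prefix of ranks shared by every lineage."""
    split = [[rank.strip() for rank in lineage.split(';')]
             for lineage in lineages]
    consensus = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return ';'.join(consensus)


def group_clusters_by_consensus(otu_map, taxonomy):
    """Merge clusters that share the same consensus taxonomy string."""
    groups = {}
    for cluster_id, seq_ids in otu_map.items():
        key = consensus_taxonomy([taxonomy[s] for s in seq_ids])
        groups.setdefault(key, []).extend(seq_ids)
    return groups


# Hypothetical cluster map (cluster id -> sequence ids) and taxonomy.
otu_map = {'c1': ['s1', 's2'], 'c2': ['s3'], 'c3': ['s4', 's5']}
taxonomy = {
    's1': 'k__Fungi;p__Ascomycota;c__Sordariomycetes',
    's2': 'k__Fungi;p__Ascomycota;c__Dothideomycetes',
    's3': 'k__Fungi;p__Ascomycota',
    's4': 'k__Fungi;p__Basidiomycota;c__Agaricomycetes',
    's5': 'k__Fungi;p__Basidiomycota;c__Agaricomycetes',
}

groups = group_clusters_by_consensus(otu_map, taxonomy)
for name in sorted(groups):
    print(name, sorted(groups[name]))
```

Here clusters `c1` and `c2` both resolve to `k__Fungi;p__Ascomycota`, so their sequences end up in one group, which would then be aligned and built into a single extension tree.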

The ghost-tree scaffold hybrid-tree command uses FastTree, which ignores non-nucleotide characters. FastTree generates warnings that these characters are being ignored. These warnings do not affect the creation of the tree, but they can slow down the IPython notebook to the point where it crashes. The problem does not occur when the command is run directly from the command line, so we recommend running the following command there rather than in the notebook.
This is tracked as an open issue on GitHub (issue #25).


In [ ]:
#ghost-tree scaffold hybrid-tree ITS_otu_map_80.txt 97_otu_taxonomy.txt 97_otus.fasta silva_fungi_only_filtered.txt ghost-tree-output2.nwk
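
If the command has to be launched from Python rather than a terminal, one possible approach is to send its output to a log file so the warnings never reach the notebook. This is an untested sketch with respect to the issue described above: it assumes the slowdown comes from warning text streamed into the cell output, and the helper name `run_quietly` and the log file name are hypothetical. Running the command from the command line remains the recommended route.

```python
import subprocess


def run_quietly(command, log_path):
    """Run a command with stdout and stderr written to a log file.

    Returns the exit code of the command.
    """
    with open(log_path, 'w') as log:
        return subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)


# Same arguments as the command shown above, as a list for subprocess.
hybrid_tree_command = [
    'ghost-tree', 'scaffold', 'hybrid-tree',
    'ITS_otu_map_80.txt', '97_otu_taxonomy.txt', '97_otus.fasta',
    'silva_fungi_only_filtered.txt', 'ghost-tree-output2.nwk',
]

# Uncomment to run; the call blocks until ghost-tree finishes.
# exit_code = run_quietly(hybrid_tree_command, 'hybrid_tree.log')
```

A non-zero exit code indicates that the command failed, in which case the log file is the place to look for the error message.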