taxonomy-assignment


Taxonomy assignment of simulated communities

This notebook demonstrates how to assign taxonomy to communities simulated from natural compositions. These data are stored in the precomputed-results directory in tax-credit and this notebook does not need to be re-run unless if being used to test additional simulated communities or taxonomy assignment methods.


In [1]:
from os.path import join, expandvars 
from joblib import Parallel, delayed
from glob import glob
from os import system
from tax_credit.simulated_communities import copy_expected_composition
from tax_credit.framework_functions import (parameter_sweep,
                                            generate_per_method_biom_tables,
                                            move_results_to_repository)

First, set the location of the tax-credit repository, the reference databases, and the simulated community directory


In [2]:
# Project directory
project_dir = expandvars("$HOME/Desktop/projects/tax-credit/")
# Directory containing reference sequence databases
reference_database_dir = join(project_dir, 'data', 'ref_dbs')
# simulated communities directory
sim_dir = join(project_dir, "data", "simulated-community")

In the following cell, we define the simulated communities that we want to use for taxonomy assignment. The directory for each dataset is located in sim_dir, and contains the files simulated-seqs.fna that were previously generated in the dataset generation notebook.


In [8]:
dataset_reference_combinations = [
    # (community_name, ref_db)
    ('sake', 'gg_13_8_otus'),
    ('wine', 'unite_20.11.2016')
]

reference_dbs = {'gg_13_8_otus' : (join(reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r_trim250.fasta'), 
                                   join(reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.tsv')),
                 'unite_20.11.2016' : (join(reference_database_dir, 'unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_BITSf-B58S3r_trim250.fasta'), 
                                       join(reference_database_dir, 'unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.tsv'))}

Assign taxonomy to simulated community sequences

First, set the results directory, where we will put temporary results.


In [9]:
results_dir = expandvars("$HOME/Desktop/projects/simulated-community/")

Preparing the method/parameter combinations and generating commands

Now we set the methods and method-specific parameters that we want to sweep. Modify to sweep other methods. Note how method_parameters_combinations feeds method/parameter combinations to parameter_sweep() in the cell below.

Assignment Using QIIME 1 or Command-Line Classifiers

Here we provide an example of taxonomy assignment using legacy QIIME 1 classifiers executed on the command line. To accomplish this, we must first convert commands to a string, which we then pass to bash for execution. As QIIME 1 is written in python-2, we must also activate a separate environment in which QIIME 1 has been installed. If any environmental variables need to be set (in this example, the RDP_JAR_PATH), we must also source the .bashrc file.


In [10]:
method_parameters_combinations = { # probabalistic classifiers
              'rdp': {'confidence': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                                     0.6, 0.7, 0.8, 0.9, 1.0]},
              
              # global alignment classifiers
              'uclust': {'min_consensus_fraction': [0.51, 0.76, 1.0], 
                         'similarity': [0.8, 0.9],
                         'uclust_max_accepts': [1, 3, 5]},
             
              # local alignment classifiers
              'sortmerna': {'sortmerna_e_value': [1.0],
                            'min_consensus_fraction': [0.51, 0.76, 1.0], 
                            'similarity': [0.8, 0.9],
                            'sortmerna_best_N_alignments ': [1, 3, 5],
                            'sortmerna_coverage' : [0.8, 0.9]},
              'blast' : {'blast_e_value' : [0.0000000001, 0.001, 1, 1000]}
             }

Now enter the template of the command to sweep, and generate a list of commands with parameter_sweep().

Fields must adhere to following format:

                  {0} = output directory
                  {1} = input data
                  {2} = output destination
                  {3} = reference taxonomy
                  {4} = method name
                  {5} = other parameters

In [11]:
command_template = "source activate qiime1; source ~/.bashrc; mkdir -p {0} ; assign_taxonomy.py -v -i {1} -o {0} -r {2} -t {3} -m {4} {5} --rdp_max_memory 16000"
        
commands = parameter_sweep(sim_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='simulated-seqs.fna')

A quick sanity check...


In [12]:
print(len(commands))
commands[0]


69
Out[12]:
'source activate qiime1; source ~/.bashrc; mkdir -p /Users/nbokulich/Desktop/projects/simulated-community/wine/unite_20.11.2016/sortmerna/0.51:0.8:1:0.8:1.0 ; assign_taxonomy.py -v -i /Users/nbokulich/Desktop/projects/tax-credit/data/simulated-community/wine/simulated-seqs.fna -o /Users/nbokulich/Desktop/projects/simulated-community/wine/unite_20.11.2016/sortmerna/0.51:0.8:1:0.8:1.0 -r /Users/nbokulich/Desktop/projects/tax-credit/data/ref_dbs/unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_BITSf-B58S3r_trim250.fasta -t /Users/nbokulich/Desktop/projects/tax-credit/data/ref_dbs/unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.tsv -m sortmerna --min_consensus_fraction 0.51 --similarity 0.8 --sortmerna_best_N_alignments  1 --sortmerna_coverage 0.8 --sortmerna_e_value 1.0 --rdp_max_memory 16000'

... and finally we are ready to run.


In [13]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)


Out[13]:
[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

Generate per-method biom tables

Modify the taxonomy_glob below to point to the taxonomy assignments that were generated above. This may be necessary if filepaths were altered in the preceding cells.


In [14]:
taxonomy_glob = join(results_dir, '*', '*', '*', '*', 'simulated-seqs_tax_assignments.txt')
generate_per_method_biom_tables(taxonomy_glob, sim_dir, biom_input_fn='simulated-composition.biom')

Move result files to repository

Add results to the tax-credit directory (e.g., to push these results to the repository or compare with other precomputed results in downstream analysis steps). The precomputed_results_dir path and methods_dirs glob below should not need to be changed unless if substantial changes were made to filepaths in the preceding cells.


In [15]:
precomputed_results_dir = join(project_dir, "data", "precomputed-results", "simulated-community")
method_dirs = glob(join(results_dir, '*', '*', '*', '*'))
move_results_to_repository(method_dirs, precomputed_results_dir)

Add expected composition bioms to repository


In [7]:
copy_expected_composition(sim_dir, dataset_reference_combinations, precomputed_results_dir)

In [ ]: