This notebook demonstrates how to assign taxonomy to communities simulated from natural compositions. These data are stored in the precomputed-results
directory in tax-credit
and this notebook does not need to be re-run unless if being used to test additional simulated communities or taxonomy assignment methods.
In [1]:
from os.path import join, expandvars
from joblib import Parallel, delayed
from glob import glob
from os import system
from tax_credit.simulated_communities import copy_expected_composition
from tax_credit.framework_functions import (parameter_sweep,
generate_per_method_biom_tables,
move_results_to_repository)
First, set the location of the tax-credit repository, the reference databases, and the simulated community directory
In [2]:
# Project directory
project_dir = expandvars("$HOME/Desktop/projects/tax-credit/")
# Directory containing reference sequence databases
reference_database_dir = join(project_dir, 'data', 'ref_dbs')
# simulated communities directory
sim_dir = join(project_dir, "data", "simulated-community")
In the following cell, we define the simulated communities that we want to use for taxonomy assignment. The directory for each dataset is located in sim_dir
, and contains the files simulated-seqs.fna
that were previously generated in the dataset generation notebook.
In [8]:
dataset_reference_combinations = [
# (community_name, ref_db)
('sake', 'gg_13_8_otus'),
('wine', 'unite_20.11.2016')
]
reference_dbs = {'gg_13_8_otus' : (join(reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r_trim250.fasta'),
join(reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.tsv')),
'unite_20.11.2016' : (join(reference_database_dir, 'unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_BITSf-B58S3r_trim250.fasta'),
join(reference_database_dir, 'unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.tsv'))}
In [9]:
results_dir = expandvars("$HOME/Desktop/projects/simulated-community/")
Now we set the methods and method-specific parameters that we want to sweep. Modify to sweep other methods. Note how method_parameters_combinations feeds method/parameter combinations to parameter_sweep() in the cell below.
Here we provide an example of taxonomy assignment using legacy QIIME 1
classifiers executed on the command line. To accomplish this, we must first convert commands
to a string, which we then pass to bash for execution. As QIIME 1
is written in python-2, we must also activate a separate environment in which QIIME 1 has been installed. If any environmental variables need to be set (in this example, the RDP_JAR_PATH), we must also source the .bashrc file.
In [10]:
method_parameters_combinations = { # probabalistic classifiers
'rdp': {'confidence': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.9, 1.0]},
# global alignment classifiers
'uclust': {'min_consensus_fraction': [0.51, 0.76, 1.0],
'similarity': [0.8, 0.9],
'uclust_max_accepts': [1, 3, 5]},
# local alignment classifiers
'sortmerna': {'sortmerna_e_value': [1.0],
'min_consensus_fraction': [0.51, 0.76, 1.0],
'similarity': [0.8, 0.9],
'sortmerna_best_N_alignments ': [1, 3, 5],
'sortmerna_coverage' : [0.8, 0.9]},
'blast' : {'blast_e_value' : [0.0000000001, 0.001, 1, 1000]}
}
Now enter the template of the command to sweep, and generate a list of commands with parameter_sweep()
.
Fields must adhere to following format:
{0} = output directory
{1} = input data
{2} = output destination
{3} = reference taxonomy
{4} = method name
{5} = other parameters
In [11]:
command_template = "source activate qiime1; source ~/.bashrc; mkdir -p {0} ; assign_taxonomy.py -v -i {1} -o {0} -r {2} -t {3} -m {4} {5} --rdp_max_memory 16000"
commands = parameter_sweep(sim_dir, results_dir, reference_dbs,
dataset_reference_combinations,
method_parameters_combinations, command_template,
infile='simulated-seqs.fna')
A quick sanity check...
In [12]:
print(len(commands))
commands[0]
Out[12]:
... and finally we are ready to run.
In [13]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)
Out[13]:
In [14]:
taxonomy_glob = join(results_dir, '*', '*', '*', '*', 'simulated-seqs_tax_assignments.txt')
generate_per_method_biom_tables(taxonomy_glob, sim_dir, biom_input_fn='simulated-composition.biom')
Add results to the tax-credit directory (e.g., to push these results to the repository or compare with other precomputed results in downstream analysis steps). The precomputed_results_dir path and methods_dirs glob below should not need to be changed unless if substantial changes were made to filepaths in the preceding cells.
In [15]:
precomputed_results_dir = join(project_dir, "data", "precomputed-results", "simulated-community")
method_dirs = glob(join(results_dir, '*', '*', '*', '*'))
move_results_to_repository(method_dirs, precomputed_results_dir)
In [7]:
copy_expected_composition(sim_dir, dataset_reference_combinations, precomputed_results_dir)
In [ ]: