In this notebook, we illustrate how to use Python to generate and run a list of commands. In this example we generate a list of QIIME 1.9.0 assign_taxonomy.py
commands, though this workflow is generally useful for performing parameter sweeps (i.e., exploring sets of parameter values and comparing their effects on a given result).
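Before diving into the notebook, here is a minimal sketch of the idea behind a parameter sweep. This is not tax-credit's parameter_sweep implementation; the template, method name, and parameter grid below are illustrative only. The sweep is just the Cartesian product of each parameter's value list, with one command string rendered per combination:

```python
from itertools import product

# Hypothetical illustration: render one command per parameter combination.
template = "assign_taxonomy.py -m {method} {params}"

def sweep(method, param_grid):
    names = sorted(param_grid)  # fix an ordering for reproducible output
    commands = []
    for values in product(*(param_grid[n] for n in names)):
        params = " ".join("--{} {}".format(n, v) for n, v in zip(names, values))
        commands.append(template.format(method=method, params=params))
    return commands

cmds = sweep("uclust", {"similarity": [0.9, 0.97],
                        "min_consensus_fraction": [0.51, 1.0]})
# 2 x 2 parameter values -> 4 commands
```

tax-credit's parameter_sweep (imported below) follows the same pattern, but also expands over datasets and reference databases and builds the output paths.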
In [2]:
from os.path import join, expandvars
from joblib import Parallel, delayed
from glob import glob
from os import system
from tax_credit.framework_functions import (parameter_sweep,
                                            generate_per_method_biom_tables,
                                            move_results_to_repository)
In [3]:
project_dir = expandvars("$HOME/Desktop/projects/tax-credit")
analysis_name = "mock-community"
data_dir = join(project_dir, "data", analysis_name)
reference_database_dir = expandvars("$HOME/Desktop/ref_dbs/")
results_dir = expandvars("$HOME/Desktop/projects/mock-community/")
First, we're going to define the data sets that we'll sweep over. The following cell does not need to be modified unless you wish to change the datasets or reference databases used in the sweep. Here we will use two mock communities (one 16S, one ITS), each paired with several versions of its reference database.
In [27]:
dataset_reference_combinations = [
    ('mock-3', 'silva_123_v4_trim250'),
    ('mock-3', 'silva_123_clean_full16S'),
    ('mock-3', 'silva_123_clean_v4_trim250'),
    ('mock-3', 'gg_13_8_otus_clean_trim150'),
    ('mock-3', 'gg_13_8_otus_clean_full16S'),
    ('mock-9', 'unite_20.11.2016_clean_trim100'),
    ('mock-9', 'unite_20.11.2016_clean_fullITS'),
]
reference_dbs = {
    'gg_13_8_otus_clean_trim150': (join(reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r_trim150.fasta'),
                                   join(reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.tsv')),
    'gg_13_8_otus_clean_full16S': (join(reference_database_dir, 'gg_13_8_otus/99_otus_clean.fasta'),
                                   join(reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.tsv')),
    'unite_20.11.2016_clean_trim100': (join(reference_database_dir, 'unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_ITS1Ff-ITS2r_trim100.fasta'),
                                       join(reference_database_dir, 'unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.tsv')),
    'unite_20.11.2016_clean_fullITS': (join(reference_database_dir, 'unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev_clean.fasta'),
                                       join(reference_database_dir, 'unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.tsv')),
    'silva_123_v4_trim250': (join(reference_database_dir, 'SILVA123_QIIME_release/rep_set/rep_set_16S_only/99/99_otus_16S/dna-sequences.fasta'),
                             join(reference_database_dir, 'SILVA123_QIIME_release/taxonomy/16S_only/99/majority_taxonomy_7_levels.txt')),
    'silva_123_clean_full16S': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_clean.fasta'),
                                join(reference_database_dir, 'SILVA123_QIIME_release/majority_taxonomy_7_levels_clean.tsv')),
    'silva_123_clean_v4_trim250': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_clean/dna-sequences.fasta'),
                                   join(reference_database_dir, 'SILVA123_QIIME_release/majority_taxonomy_7_levels_clean.tsv')),
}
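As a quick illustration of the structure above (paths here are placeholders, not the real files): each key names a reference database, and each value is a two-tuple that unpacks into the reference sequences and reference taxonomy paths.

```python
# Hypothetical entry mirroring the dictionary's shape; paths are illustrative.
reference_dbs_example = {
    'gg_13_8': ('ref_dbs/gg_13_8_otus/99_otus.fasta',
                'ref_dbs/gg_13_8_otus/99_otu_taxonomy.tsv'),
}
ref_seqs, ref_tax = reference_dbs_example['gg_13_8']  # sequences, taxonomy
```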
Here we provide an example of taxonomy assignment using legacy QIIME 1 classifiers executed on the command line. To accomplish this, we must first convert each command to a string, which we then pass to bash for execution. As QIIME 1 is written in Python 2, we must also activate a separate environment in which QIIME 1 has been installed. If any environment variables need to be set (in this example, RDP_JAR_PATH), we must also source the .bashrc file.
In [24]:
method_parameters_combinations = {
    # probabilistic classifiers
    'rdp': {'confidence': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                           0.6, 0.7, 0.8, 0.9, 1.0]},
    # global alignment classifiers
    'uclust': {'min_consensus_fraction': [0.51, 0.76, 1.0],
               'similarity': [0.9, 0.97, 0.99],
               'uclust_max_accepts': [1, 3, 5]},
}
Now enter the template of the command to sweep, and generate a list of commands with parameter_sweep().
Fields must adhere to the following format:
{0} = output directory
{1} = input data
{2} = reference sequences
{3} = reference taxonomy
{4} = method name
{5} = other parameters
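To make the positional fields concrete, here is a hedged example of how str.format fills them; the template is abbreviated and the paths are illustrative, not real files from this sweep.

```python
# Hypothetical, shortened template with the same positional fields as above.
command_template = ("mkdir -p {0} ; "
                    "assign_taxonomy.py -i {1} -o {0} -r {2} -t {3} -m {4} {5}")
cmd = command_template.format(
    "results/mock-3/uclust",      # {0} output directory
    "data/mock-3/rep_seqs.fna",   # {1} input data
    "ref/99_otus.fasta",          # {2} reference sequences
    "ref/99_otu_taxonomy.tsv",    # {3} reference taxonomy
    "uclust",                     # {4} method name
    "--similarity 0.97",          # {5} other parameters
)
```

Note that {0} appears twice in the template, so the output directory is both created and passed to -o.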
In [28]:
command_template = "source activate qiime1; source ~/.bashrc; mkdir -p {0} ; assign_taxonomy.py -v -i {1} -o {0} -r {2} -t {3} -m {4} {5} --rdp_max_memory 7000"
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='rep_seqs.fna')
As a sanity check, we can look at the first command that was generated and the number of commands generated.
In [29]:
print(len(commands))
commands[0]
Out[29]:
Finally, we run our commands.
In [ ]:
Parallel(n_jobs=1)(delayed(system)(command) for command in commands)
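The Parallel(n_jobs=1)(delayed(system)(command) for command in commands) call dispatches each command string to os.system; with n_jobs=1 this is equivalent to the plain loop below (placeholder commands, POSIX shell assumed). os.system returns the shell's exit status, so a list of zeros indicates every command succeeded.

```python
from os import system

# Stdlib sketch of the sequential (n_jobs=1) case; 'true' and 'false' are
# trivial placeholder shell commands, not the sweep's real commands.
commands = ["true", "false"]
statuses = [system(c) for c in commands]  # 0 = success, nonzero = failure
```

Raising n_jobs in the joblib call runs that many commands concurrently, which is how the later QIIME 2 sweeps below use n_jobs=4.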
Now let's do it all over again, but with QIIME 2 classifiers (which require different input files and command templates). Note that the QIIME 2 artifact files required for assignment are not included in tax-credit, but can be generated from any reference dataset using qiime tools import.
In [11]:
new_reference_database_dir = expandvars("$HOME/Desktop/ref_dbs/")
reference_dbs = {
    'gg_13_8_otus_clean_trim150': (join(new_reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r_trim150.qza'),
                                   join(new_reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.qza')),
    'gg_13_8_otus_clean_full16S': (join(new_reference_database_dir, 'gg_13_8_otus/99_otus_clean.qza'),
                                   join(new_reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.qza')),
    'unite_20.11.2016_clean_trim100': (join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_ITS1Ff-ITS2r_trim100.qza'),
                                       join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.qza')),
    'unite_20.11.2016_clean_fullITS': (join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev_clean.qza'),
                                       join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.qza')),
    'silva_123_v4_trim250': (join(reference_database_dir, 'SILVA123_QIIME_release/rep_set/rep_set_16S_only/99/99_otus_16S_515f-806r_trim250.qza'),
                             join(reference_database_dir, 'SILVA123_QIIME_release/taxonomy/16S_only/99/majority_taxonomy_7_levels.qza')),
    'silva_123_clean_full16S': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_clean.qza'),
                                join(reference_database_dir, 'SILVA123_QIIME_release/taxonomy/16S_only/99/majority_taxonomy_7_levels.qza')),
    'silva_123_clean_v4_trim250': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_clean_515f-806r_trim250.qza'),
                                   join(reference_database_dir, 'SILVA123_QIIME_release/taxonomy/16S_only/99/majority_taxonomy_7_levels.qza')),
}
In [13]:
method_parameters_combinations = {  # alignment-based consensus classifier
    'blast+': {'p-evalue': [0.001],
               'p-maxaccepts': [1, 10],
               'p-min-id': [0.80, 0.99],
               'p-min-consensus': [0.51, 0.99]}
}
command_template = "mkdir -p {0}; qiime feature-classifier blast --i-query {1} --o-classification {0}/rep_seqs_tax_assignments.qza --i-reference-reads {2} --i-reference-taxonomy {3} {5}; qiime tools export {0}/rep_seqs_tax_assignments.qza --output-dir {0}"
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='rep_seqs.qza')
In [14]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)
Out[14]:
In [15]:
method_parameters_combinations = {  # alignment-based consensus classifier
    'vsearch': {'p-maxaccepts': [1, 10],
                'p-min-id': [0.97, 0.99],
                'p-min-consensus': [0.51, 0.99]}
}
command_template = "mkdir -p {0}; qiime feature-classifier vsearch --i-query {1} --o-classification {0}/rep_seqs_tax_assignments.qza --i-reference-reads {2} --i-reference-taxonomy {3} {5}; qiime tools export {0}/rep_seqs_tax_assignments.qza --output-dir {0}"
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='rep_seqs.qza')
In [16]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)
Out[16]:
In [18]:
new_reference_database_dir = expandvars("$HOME/Desktop/ref_dbs/")
reference_dbs = {
    'gg_13_8_otus_clean_trim150': (join(new_reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r_trim150-classifier.qza'),
                                   join(new_reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.qza')),
    'gg_13_8_otus_clean_full16S': (join(new_reference_database_dir, 'gg_13_8_otus/99_otus_clean-classifier.qza'),
                                   join(new_reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.qza')),
    'unite_20.11.2016_clean_trim100': (join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_ITS1Ff-ITS2r_trim100-classifier.qza'),
                                       join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.qza')),
    'unite_20.11.2016_clean_fullITS': (join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev_clean-classifier.qza'),
                                       join(new_reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.qza')),
    'silva_123_v4_trim250': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_515f-806r_trim250-classifier.qza'),
                             join(reference_database_dir, 'SILVA123_QIIME_release/taxonomy/16S_only/99/majority_taxonomy_7_levels.txt')),
    'silva_123_clean_full16S': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_clean-classifier.qza'),
                                join(reference_database_dir, 'SILVA123_QIIME_release/majority_taxonomy_7_levels_clean.tsv')),
    'silva_123_clean_v4_trim250': (join(reference_database_dir, 'SILVA123_QIIME_release/99_otus_16S_clean_515f-806r_trim250-classifier.qza'),
                                   join(reference_database_dir, 'SILVA123_QIIME_release/majority_taxonomy_7_levels_clean.tsv')),
}
In [19]:
method_parameters_combinations = {
    'q2-nb': {'p-confidence': [0.0, 0.2, 0.4, 0.6, 0.8]}
}
command_template = "mkdir -p {0}; qiime feature-classifier classify --i-reads {1} --o-classification {0}/rep_seqs_tax_assignments.qza --i-classifier {2} {5}; qiime tools export {0}/rep_seqs_tax_assignments.qza --output-dir {0}"
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='rep_seqs.qza')
In [22]:
Parallel(n_jobs=1)(delayed(system)(command) for command in commands)
Out[22]:
In [5]:
taxonomy_glob = join(results_dir, '*', '*', '*', '*', 'rep_seqs_tax_assignments.txt')
generate_per_method_biom_tables(taxonomy_glob, data_dir)
Add results to the tax-credit directory (e.g., to push these results to the repository or compare with other precomputed results in downstream analysis steps). The precomputed_results_dir path and methods_dirs glob below should not need to be changed unless substantial changes were made to the filepaths in the preceding cells.
In [26]:
precomputed_results_dir = join(project_dir, "data", "precomputed-results", analysis_name)
for community in dataset_reference_combinations:
    method_dirs = glob(join(results_dir, community[0], '*', '*', '*'))
    move_results_to_repository(method_dirs, precomputed_results_dir)
Do not forget to copy the expected taxonomy files for this mock community!
In [27]:
for community in dataset_reference_combinations:
    community_dir = join(precomputed_results_dir, community[0])
    exp_observations = join(community_dir, '*', 'expected')
    new_community_exp_dir = join(community_dir, community[1], 'expected')
    !mkdir {new_community_exp_dir}; cp {exp_observations}/* {new_community_exp_dir}