This notebook demonstrates taxonomy classification using q2-feature-classifier's naive Bayes classifier.
In [14]:
from os.path import join, expandvars
from joblib import Parallel, delayed
from glob import glob
from os import system
from tax_credit.framework_functions import (parameter_sweep,
                                            generate_per_method_biom_tables,
                                            move_results_to_repository)
In [15]:
project_dir = expandvars("$HOME/Desktop/projects/short-read-tax-assignment")
analysis_name = "mock-community"
data_dir = join(project_dir, "data", analysis_name)
reference_database_dir = expandvars("$HOME/Desktop/ref_dbs/")
results_dir = expandvars("$HOME/Desktop/projects/mock-community/")
First, we define the data sets that we'll sweep over. The following cell does not need to be modified unless you wish to change the datasets or reference databases used in the sweep.
In [16]:
dataset_reference_combinations = [
    ('mock-1', 'gg_13_8_otus_clean'), # formerly S16S-1
    ('mock-2', 'gg_13_8_otus_clean'), # formerly S16S-2
    ('mock-3', 'gg_13_8_otus_clean'), # formerly Broad-1
    ('mock-4', 'gg_13_8_otus_clean'), # formerly Broad-2
    ('mock-5', 'gg_13_8_otus_clean'), # formerly Broad-3
    ('mock-6', 'gg_13_8_otus_clean'), # formerly Turnbaugh-1
    ('mock-7', 'gg_13_8_otus_clean'), # formerly Turnbaugh-2
    ('mock-8', 'gg_13_8_otus_clean'), # formerly Turnbaugh-3
    ('mock-9', 'unite_20.11.2016_clean_fullITS'), # formerly ITS1
    ('mock-10', 'unite_20.11.2016_clean_fullITS'), # formerly ITS2-SAG
    ('mock-12', 'gg_13_8_otus_clean'), # Extreme
    ('mock-13', 'gg_13_8_otus_full16S_clean'), # kozich-1
    ('mock-14', 'gg_13_8_otus_full16S_clean'), # kozich-2
    ('mock-15', 'gg_13_8_otus_full16S_clean'), # kozich-3
    ('mock-16', 'gg_13_8_otus_clean'), # schirmer-1
]
# Each entry maps a reference name to (trained classifier, reference taxonomy).
reference_dbs = {
    'gg_13_8_otus_clean': (
        join(reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r-classifier.qza'),
        # 'gg_13_8_otus' : (join(reference_database_dir, 'gg_13_8_otus/gg-13-8-99-515-806-nb-classifier.qza'),
        join(reference_database_dir, 'gg_13_8_otus/taxonomy/99_otu_taxonomy.qza')),
    'gg_13_8_otus_full16S_clean': (
        join(reference_database_dir, 'gg_13_8_otus/99_otus_clean-classifier.qza'),
        # 'gg_13_8_otus_full16S' : (join(reference_database_dir, 'gg_13_8_otus/gg-13-8-99-nb-classifier.qza'),
        join(reference_database_dir, 'gg_13_8_otus/taxonomy/99_otu_taxonomy.qza')),
    'unite_20.11.2016_clean_fullITS': (
        join(reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev_clean-classifier.qza'),
        join(reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.qza')),
    'unite_20.11.2016_clean': (
        join(reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_ITS1Ff-ITS2r-classifier.qza'),
        # 'unite_20.11.2016' : (join(reference_database_dir, 'sh_qiime_release_20.11.2016/developer/99-dev-ITS1Ff-ITS2r-trim250-nb-classifier.qza'),
        join(reference_database_dir, 'sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev.qza')),
}
In [17]:
method_parameters_combinations = {
'q2-nb' : {'p-confidence': [0.0, 0.2, 0.4, 0.6, 0.8]}
}
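To make the shape of this dictionary concrete, the sketch below expands a parameter dictionary into one set of command-line flags per value. It is only an illustration: expand_grid and to_cli_flags are hypothetical helpers, and the '--name value' rendering is an assumption, not necessarily how parameter_sweep() formats parameters internally.

```python
from itertools import product

# Same shape as method_parameters_combinations in the cell above.
method_parameters = {
    'q2-nb': {'p-confidence': [0.0, 0.2, 0.4, 0.6, 0.8]},
}


def expand_grid(parameters):
    """Yield one {name: value} dict per combination of parameter values."""
    names = sorted(parameters)
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


def to_cli_flags(combination):
    """Render a combination as '--name value' pairs (assumed format)."""
    return ' '.join('--{0} {1}'.format(name, value)
                    for name, value in sorted(combination.items()))


grid = list(expand_grid(method_parameters['q2-nb']))
flags = [to_cli_flags(combination) for combination in grid]

for flag in flags:
    print(flag)
```

With a single parameter the grid is just its list of values; adding a second parameter to the inner dictionary would multiply the number of combinations.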
Now enter the template of the command to sweep, and generate a list of commands with parameter_sweep().
Fields must adhere to the following format:
{0} = output directory
{1} = input data
{2} = reference sequences
{3} = reference taxonomy
{4} = method name
{5} = other parameters
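The numbered fields behave like positional arguments to Python's str.format. The sketch below fills a shortened, made-up template to show the mechanics; every path and the some-classifier command name are placeholders, not values from this notebook. Note that str.format silently ignores positional arguments that the template never references, which is why a template can omit {3} and {4}.

```python
# Placeholder values for illustration only; none come from the notebook.
output_dir = "results/mock-1/gg_13_8_otus_clean/q2-nb/0.6"
input_data = "data/mock-1/rep_seqs.qza"
reference_seqs = "ref_dbs/classifier.qza"
reference_tax = "ref_dbs/taxonomy.qza"
method = "q2-nb"
params = "--p-confidence 0.6"

# A shortened template that uses {0}, {1}, {2} and {5} but skips {3} and {4}.
template = ("mkdir -p {0}; some-classifier --i-reads {1} "
            "--i-classifier {2} {5} --o {0}/out.qza")

# All six values are passed in field order; unused ones are simply dropped.
command = template.format(output_dir, input_data, reference_seqs,
                          reference_tax, method, params)
print(command)
```

Because {0} may appear several times, the output directory can be reused for both the mkdir step and the output file path.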
In [18]:
command_template = "mkdir -p {0}; qiime feature-classifier classify --i-reads {1} --o-classification {0}/rep_seqs_tax_assignments.qza --i-classifier {2} {5}; qiime tools export {0}/rep_seqs_tax_assignments.qza --output-dir {0}"
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='rep_seqs.qza', output_name='rep_seqs_tax_assignments.qza')
As a sanity check, we can look at the first command that was generated and the number of commands generated.
In [19]:
print(len(commands))
commands[0]
Out[19]:
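One way to judge whether the printed count is plausible is to compute the number you would expect. The sketch below assumes that the sweep emits one command per dataset/reference pair per parameter combination per method; that is an assumption about parameter_sweep(), and the real count may differ (for example if some inputs are skipped). expected_command_count is a hypothetical helper, and pairs is a stand-in for the 15 entries of dataset_reference_combinations.

```python
from math import prod


def expected_command_count(pairs, method_parameters):
    """Count commands, assuming one per pair per parameter combination."""
    total = 0
    for parameters in method_parameters.values():
        # Number of combinations is the product of the value-list lengths.
        combinations = prod(len(values) for values in parameters.values())
        total += len(pairs) * combinations
    return total


# Stand-in for the 15 (dataset, reference) tuples defined earlier.
pairs = [('dataset-%d' % i, 'reference') for i in range(15)]
method_parameters = {
    'q2-nb': {'p-confidence': [0.0, 0.2, 0.4, 0.6, 0.8]},
}

expected = expected_command_count(pairs, method_parameters)
print(expected)  # 15 pairs x 5 confidence values
```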
Finally, we run our commands.
In [20]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)
Out[20]:
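The list returned by Parallel holds whatever system returned for each command, so a nonzero entry signals a command that failed. A minimal sketch of the same idea using only the standard library follows; it swaps joblib and os.system for concurrent.futures and subprocess.run, and the demo commands are placeholders that simply exit with a chosen status. run and run_all are hypothetical helper names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run(command):
    """Run one shell command and return its exit code."""
    return subprocess.run(command, shell=True).returncode


def run_all(commands, n_jobs=4):
    """Run commands concurrently; return (command, exit_code) for failures."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        codes = list(pool.map(run, commands))
    return [(command, code)
            for command, code in zip(commands, codes) if code != 0]


# Placeholder commands: one succeeds, one exits with status 3.
demo_commands = [
    '"{0}" -c "raise SystemExit(0)"'.format(sys.executable),
    '"{0}" -c "raise SystemExit(3)"'.format(sys.executable),
]

failures = run_all(demo_commands)
for command, code in failures:
    print('exit code {0}: {1}'.format(code, command))
```

Threads are sufficient here because each worker only waits on an external process; the heavy lifting happens outside the Python interpreter.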
In [21]:
taxonomy_glob = join(results_dir, '*', '*', '*', '*', 'taxonomy.tsv')
generate_per_method_biom_tables(taxonomy_glob, data_dir)
Add results to the short-read-tax-assignment directory (e.g., to push these results to the repository or compare with other precomputed results in downstream analysis steps). The precomputed_results_dir path and method_dirs glob below should not need to be changed unless substantial changes were made to filepaths in the preceding cells.
In [22]:
precomputed_results_dir = join(project_dir, "data", "precomputed-results", analysis_name)
method_dirs = glob(join(results_dir, '*', '*', '*', '*'))
move_results_to_repository(method_dirs, precomputed_results_dir)