Data generation: using python to sweep over methods and parameters

In this notebook, we illustrate how to use python to generate and run a list of commands. In this example, we generate a list of QIIME 1.9.0 assign_taxonomy.py commands, though this workflow for command generation is generally very useful for performing parameter sweeps (i.e., exploration of sets of parameters for achieving a specific result for comparative purposes).

Environment preparation


In [1]:
from os import system
from os.path import join, expandvars 
from joblib import Parallel, delayed
from glob import glob
from tax_credit.framework_functions import (recall_novel_taxa_dirs,
                                            parameter_sweep,
                                            move_results_to_repository)

In [2]:
project_dir = expandvars("$HOME/Desktop/projects/short-read-tax-assignment")
analysis_name= "cross-validated"

results_dir = expandvars("$HOME/Desktop/projects/cross-validated/")

Preparing data set sweep

First, we're going to define the data sets that we'll sweep over. The "cross-validated" simulated reads that we use here are subsets of reference sequence databases generated duing the novel-taxa analysis. We re-use these data sets here for the purposes of having cross-validated data subsets. As the cross-validated dataset names depend on how the database generation notebook was executed, we must define the variables used to create these datasets. If you modified any variables in that notebook, set these same variables below. If you did not, then do not modify.

recall_novel_taxa_dirs() generates a list of dataset_reference_combinations and a dictionary of reference_dbs mapped to each dataset, which we feed to parameter_sweep below.


In [3]:
iterations = 3
data_dir = join(project_dir, "data", analysis_name)
# databases is a list of names given as dictionary keys in the second
# cell of the database generation notebook. Just list the names here.
databases = ['B1-REF', 'F1-REF']

# Generate a list of input directories
(dataset_reference_combinations, reference_dbs) = recall_novel_taxa_dirs(\
    data_dir, databases, iterations, ref_seqs='ref_seqs.fasta',
    ref_taxa='ref_taxa.tsv', max_level=6, min_level=5, multilevel=False)

Preparing the method/parameter combinations and generating commands

Now we set the methods and method-specific parameters that we want to sweep. Modify to sweep other methods. Note how method_parameters_combinations feeds method/parameter combinations to parameter_sweep() in the cell below.

Assignment Using QIIME 1 or Command-Line Classifiers

Here we provide an example of taxonomy assignment using legacy QIIME 1 classifiers executed on the command line. To accomplish this, we must first convert commands to a string, which we then pass to bash for execution. As QIIME 1 is written in python-2, we must also activate a separate environment in which QIIME 1 has been installed. If any environmental variables need to be set (in this example, the RDP_JAR_PATH), we must also source the .bashrc file.


In [8]:
method_parameters_combinations = { # probabalistic classifiers
              'rdp': {'confidence': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                                     0.6, 0.7, 0.8, 0.9, 1.0]},
              
              # global alignment classifiers
              'uclust': {'min_consensus_fraction': [0.51, 0.76, 1.0], 
                         'similarity': [0.8, 0.9],
                         'uclust_max_accepts': [1, 3, 5]},
             
              # local alignment classifiers
              'sortmerna': {'sortmerna_e_value': [1.0],
                            'min_consensus_fraction': [0.51, 0.76, 1.0], 
                            'similarity': [0.8, 0.9],
                            'sortmerna_best_N_alignments ': [1, 3, 5],
                            'sortmerna_coverage' : [0.8, 0.9]},
              'blast' : {'blast_e_value' : [0.0000000001, 0.001, 1, 1000]}
             }

Now enter the template of the command to sweep, and generate a list of commands with parameter_sweep().

Fields must adhere to following format:

                  {0} = output directory
                  {1} = input data
                  {2} = output destination
                  {3} = reference taxonomy
                  {4} = method name
                  {5} = other parameters

In [9]:
command_template = "source activate qiime1; source ~/.bashrc; mkdir -p {0} ; assign_taxonomy.py -v -i {1} -o {0} -r {2} -t {3} -m {4} {5} --rdp_max_memory 16000"
        
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='query.fasta')

As a sanity check, we can look at the first command that was generated and the number of commands generated.


In [14]:
print(len(commands))
commands[0]


96
Out[14]:
'mkdir -p /Users/nbokulich/Desktop/projects/cross-validated/B1-REF-iter0/B1-REF-iter0/vsearch/1:0.51:0.8; qiime feature-classifier vsearch --i-query /Users/nbokulich/Desktop/projects/short-read-tax-assignment/data/cross-validated/B1-REF-iter0/query.qza --o-classification /Users/nbokulich/Desktop/projects/cross-validated/B1-REF-iter0/B1-REF-iter0/vsearch/1:0.51:0.8/rep_seqs_tax_assignments.qza --i-reference-reads /Users/nbokulich/Desktop/projects/short-read-tax-assignment/data/cross-validated/B1-REF-iter0/ref_seqs.qza --i-reference-taxonomy /Users/nbokulich/Desktop/projects/short-read-tax-assignment/data/cross-validated/B1-REF-iter0/ref_taxa.qza --p-min-id 0.8 --p-maxaccepts 1 --p-min-consensus 0.51; qiime tools export /Users/nbokulich/Desktop/projects/cross-validated/B1-REF-iter0/B1-REF-iter0/vsearch/1:0.51:0.8/rep_seqs_tax_assignments.qza --output-dir /Users/nbokulich/Desktop/projects/cross-validated/B1-REF-iter0/B1-REF-iter0/vsearch/1:0.51:0.8; mv /Users/nbokulich/Desktop/projects/cross-validated/B1-REF-iter0/B1-REF-iter0/vsearch/1:0.51:0.8/taxonomy.tsv /Users/nbokulich/Desktop/projects/cross-validated/B1-REF-iter0/B1-REF-iter0/vsearch/1:0.51:0.8/query_tax_assignments.txt'

Finally, we run our commands.


In [33]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)


Out[33]:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

BLAST+


In [4]:
method_parameters_combinations = {
              'blast+' : {'p-evalue': [0.001],
                          'p-maxaccepts': [1, 10],
                          'p-min-id': [0.80, 0.97, 0.99],
                          'p-min-consensus': [0.51, 0.75, 0.99]}
             }

command_template = ("mkdir -p {0}; "
                    "qiime feature-classifier blast --i-query {1} --o-classification "
                    "{0}/rep_seqs_tax_assignments.qza --i-reference-reads {2} --i-reference-taxonomy {3} {5}; "
                    "qiime tools export {0}/rep_seqs_tax_assignments.qza --output-dir {0}; "
                    "mv {0}/taxonomy.tsv {0}/query_tax_assignments.txt")
        
(dataset_reference_combinations, reference_dbs) = recall_novel_taxa_dirs(\
    data_dir, databases, iterations, ref_seqs='ref_seqs.qza',
    ref_taxa='ref_taxa.qza', max_level=6, min_level=5, multilevel=False)

commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='query.qza', output_name='rep_seqs_tax_assignments.qza')

In [5]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)


Out[5]:
[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

VSEARCH


In [5]:
method_parameters_combinations = {
              'vsearch' : {'p-maxaccepts': [1, 10],
                           'p-min-id': [0.80, 0.90, 0.97, 0.99],
                           'p-min-consensus': [0.51, 0.99]}
             }

command_template = ("mkdir -p {0}; "
                    "qiime feature-classifier vsearch --i-query {1} --o-classification "
                    "{0}/rep_seqs_tax_assignments.qza --i-reference-reads {2} --i-reference-taxonomy {3} {5}; "
                    "qiime tools export {0}/rep_seqs_tax_assignments.qza --output-dir {0}; "
                    "mv {0}/taxonomy.tsv {0}/query_tax_assignments.txt")
        
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='query.qza', output_name='rep_seqs_tax_assignments.qza')

In [6]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)


Out[6]:
[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

Move result files to repository

Add results to the short-read-taxa-assignment directory (e.g., to push these results to the repository or compare with other precomputed results in downstream analysis steps). The precomputed_results_dir path and methods_dirs glob below should not need to be changed unless if substantial changes were made to filepaths in the preceding cells.


In [7]:
precomputed_results_dir = join(project_dir, "data", "precomputed-results", analysis_name)
method_dirs = glob(join(results_dir, '*', '*', '*', '*'))
move_results_to_repository(method_dirs, precomputed_results_dir)

In [ ]: