Run in a QIIME 2.0.6 conda environment.
This notebook describes how mock community datasets were retrieved and how files were generated for tax-credit comparisons. Only the feature tables, metadata maps, representative sequences, and expected taxonomies are included in tax-credit, but this notebook can regenerate intermediate files, generate these files for new mock communities, or be tweaked to benchmark, e.g., quality control or OTU picking methods.
All mock communities are hosted on mockrobiota, though raw reads are deposited elsewhere. To use these mock communities, clone the mockrobiota repository into the project_dir that contains the tax-credit repository.
In [1]:
from tax_credit.process_mocks import (extract_mockrobiota_dataset_metadata,
                                      extract_mockrobiota_data,
                                      batch_demux,
                                      denoise_to_phylogeny,
                                      transport_to_repo)
from os.path import expandvars, join
Set source/destination filepaths
In [2]:
# base directory containing tax-credit and mockrobiota repositories
project_dir = expandvars("$HOME/Desktop/projects/")
# tax-credit directory
repo_dir = join(project_dir, "tax-credit")
# mockrobiota directory
mockrobiota_dir = join(project_dir, "mockrobiota")
# temp destination for mock community files
mock_data_dir = join(project_dir, "mock-community")
# destination for expected taxonomy assignments
expected_data_dir = join(repo_dir, "data", "precomputed-results", "mock-community")
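To make the layout concrete, here is a minimal sketch of the directory structure these variables assume (the base path is illustrative, not required):

```python
from os.path import join

# Illustrative base directory; substitute your own project_dir
project_dir = "/home/user/Desktop/projects"

layout = {
    "repo_dir": join(project_dir, "tax-credit"),           # tax-credit clone
    "mockrobiota_dir": join(project_dir, "mockrobiota"),   # mockrobiota clone
    "mock_data_dir": join(project_dir, "mock-community"),  # temp mock data
    "expected_data_dir": join(project_dir, "tax-credit", "data",
                              "precomputed-results", "mock-community"),
}
```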
First we will define which mock communities we plan to use, and the necessary parameters.
In [5]:
# We will just use a sequential set of mockrobiota datasets, otherwise list community names manually
communities = ['mock-{0}'.format(n) for n in range(1,27) if n != 11 and n != 17]
#communities = ['mock-{0}'.format(n) for n in range(16,27) if n != 17]
# Create dictionary of mock community dataset metadata
community_metadata = extract_mockrobiota_dataset_metadata(mockrobiota_dir, communities)
# Map marker-gene to reference database names in tax-credit and in mockrobiota
# marker-gene : (tax-credit-dir, mockrobiota-dir, version, OTUs)
reference_dbs = {'16S' : ('gg_13_8_otus', 'greengenes', '13-8', '99-otus'),
                 'ITS' : ('unite_20.11.2016', 'unite', '7-1', '99-otus')
                 }
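As a sanity check, the communities list comprehension above yields 24 names, skipping mock-11 and mock-17:

```python
communities = ['mock-{0}'.format(n) for n in range(1, 27) if n != 11 and n != 17]

assert len(communities) == 24
assert 'mock-11' not in communities and 'mock-17' not in communities
```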
Now we will generate data directories in tax-credit for each community and begin populating these with files from mockrobiota. This may take some time, as it involves downloading raw data fastq files.
In [6]:
extract_mockrobiota_data(communities, community_metadata, reference_dbs,
                         mockrobiota_dir, mock_data_dir,
                         expected_data_dir)
Each dataset may require different parameters. For example, some mock communities used here require different barcode orientations, while others may already be demultiplexed. These parameters may be read in as a dictionary of tuples.
In [5]:
# {community : (demultiplex, rev_comp_barcodes, rev_comp_mapping_barcodes)}
demux_params = {'mock-1' : (True, False, True),
'mock-2' : (True, False, True),
'mock-3' : (True, False, False),
'mock-4' : (True, False, True),
'mock-5' : (True, False, True),
'mock-6' : (True, False, True),
'mock-7' : (True, False, True),
'mock-8' : (True, False, True),
'mock-9' : (True, False, True),
'mock-10' : (True, False, True),
'mock-12' : (False, False, False),
'mock-13' : (False, False, False),
'mock-14' : (False, False, False),
'mock-15' : (False, False, False),
'mock-16' : (False, False, False),
'mock-18' : (False, False, False),
'mock-19' : (False, False, False),
'mock-20' : (False, False, False),
'mock-21' : (False, False, False),
'mock-22' : (False, False, False),
'mock-23' : (False, False, False),
'mock-24' : (False, False, False),
'mock-25' : (False, False, False),
'mock-26' : (True, False, False), # Note we only use samples 1-40 in mock-26
}
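Each tuple unpacks in the order given in the comment above. A minimal sketch of a per-community lookup (needs_demux is a hypothetical helper for illustration, not part of tax-credit):

```python
# Subset of the table above: (demultiplex, rev_comp_barcodes, rev_comp_mapping_barcodes)
demux_params = {'mock-1': (True, False, True),
                'mock-12': (False, False, False)}

def needs_demux(community, params):
    """Return True if the community's reads still need demultiplexing."""
    demultiplex, rev_comp_barcodes, rev_comp_mapping_barcodes = params[community]
    return demultiplex
```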
In [11]:
batch_demux(communities, mock_data_dir, demux_params)
To view the demux_summary.qzv (per-sample demultiplexed sequence counts) and demux_plot_qual.qzv (fastq quality profiles) summaries that you just created, drag and drop the files into q2view.
Use the fastq quality data above to decide how to proceed. As each dataset will have different quality profiles and read lengths, we will enter trimming parameters as a dictionary. We can use this dict to pass other parameters to denoise_to_phylogeny(), including whether we want to build a phylogeny for each community.
In [8]:
# {community : (trim_left, trunc_len, build_phylogeny)}
trim_params = {'mock-1' : (0, 100, True),
'mock-2' : (0, 130, True),
'mock-3' : (0, 150, True),
'mock-4' : (0, 150, True),
'mock-5' : (0, 200, True),
'mock-6' : (0, 50, True),
'mock-7' : (0, 90, True),
'mock-8' : (0, 100, True),
'mock-9' : (0, 100, False),
'mock-10' : (0, 100, False),
'mock-12' : (0, 230, True),
'mock-13' : (0, 250, True),
'mock-14' : (0, 250, True),
'mock-15' : (0, 250, True),
'mock-16' : (19, 231, False),
'mock-18' : (19, 231, False),
'mock-19' : (19, 231, False),
'mock-20' : (0, 250, False),
'mock-21' : (0, 250, False),
'mock-22' : (19, 250, False),
'mock-23' : (19, 250, False),
'mock-24' : (0, 150, False),
'mock-25' : (0, 165, False),
'mock-26' : (0, 290, False),
}
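For example, the build_phylogeny flag in the third position of each tuple can be used to list which communities will get a tree (the values below are a small subset of the table above):

```python
# Subset of the table above: (trim_left, trunc_len, build_phylogeny)
trim_params = {'mock-1': (0, 100, True),
               'mock-9': (0, 100, False),
               'mock-16': (19, 231, False)}

# Communities for which denoise_to_phylogeny will also build a tree
with_phylogeny = sorted(c for c, (trim_left, trunc_len, build) in trim_params.items()
                        if build)
```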
Now we will quality filter with dada2, and use the representative sequences to generate a phylogeny.
In [12]:
denoise_to_phylogeny(communities, mock_data_dir, trim_params)
To view the feature_table_summary.qzv summaries you just created, drag and drop the files into q2view.
In [13]:
transport_to_repo(communities, mock_data_dir, repo_dir)