dataset-generation


Novel-taxa and simulated community generation

This notebook describes the generation of reference databases for both novel-taxa and simulated community analyses. Novel-taxa analysis is a form of cross-validated taxonomic classification, wherein random unique sequences are sampled from the reference database as a test set; all sequences sharing taxonomic affiliation at a given taxonomic level are removed from the reference database (training set); and taxonomy is assigned to the query sequences at the given taxonomic level. Thus, this test interrogates the behavior of a taxonomy classifier when challenged with "novel" sequences that are not represented by close matches within the reference sequence database. Such an analysis is performed to assess the degree to which "overassignment" occurs for sequences that are not represented in a reference database.

Simulated community analysis represents more conventional cross-validated classification, wherein unique sequences are randomly sampled from a reference dataset and used as a test set for taxonomic classification, using a training set that has those sequences removed, but not other sequences that share taxonomic affiliation. Instead, the training set must contain identical taxonomies to those represented by the test sequences.

Novel-taxa reference data set generation

This section describes the preparation of the data sets necessary for "novel taxa" analysis. The goals of this step are:

  1. Create a "clean" reference database that can be used for evaluation of "novel taxa" from phylum to species level.
  2. Generate simulated amplicons and randomly subsample query sequences to use as "novel taxa"
  3. Create modified sequence reference databases for taxonomic classification of "novel taxa" sequences

In this first cell, we describe data set/database characteristics as a dictionary: dataset name is the key, with values reference sequence fasta, taxonomy, database name, forward primer sequence, reverse primer sequence, forward primer name, reverse primer name.

MODIFY these values to generate novel-taxa files on a new reference database


In [1]:
from tax_credit.framework_functions import \
    generate_simulated_datasets, distance_comparison, \
    test_cross_validated_sequences, \
    test_novel_taxa_datasets

from os.path import expandvars, join
import pandas as pd

In [2]:
%matplotlib inline

In [3]:
project_dir = expandvars("../..")
data_dir = join(project_dir, "data")

# List databases as fasta/taxonomy file pairs
databases = {'B1-REF': ['../../data/ref_dbs/gg_13_8_otus/99_otus.fasta', 
             '../../data/ref_dbs/gg_13_8_otus/99_otu_taxonomy.txt',
             "gg_13_8_otus", "GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT", "515f", "806r"],
             'F1-REF': ['../../data/ref_dbs/unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev.fasta', 
             '../../data/ref_dbs/unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev.txt', 
             "unite_20.11.2016", "ACCTGCGGARGGATCA", "GAGATCCRTTGYTRAAAGTT", "BITSf", "B58S3r"]
            }

Now we will import these to a dataframe and view it. You should not need to modify the following cell.


In [4]:
# Arrange data set / database info in data frame
simulated_community_definitions = pd.DataFrame.from_dict(databases, orient="index")
simulated_community_definitions.columns = ["Reference file path", "Reference tax path", "Reference id", 
                                           "Fwd primer", "Rev primer", "Fwd primer id", "Rev primer id"]
simulated_community_definitions


Out[4]:
Reference file path Reference tax path Reference id Fwd primer Rev primer Fwd primer id Rev primer id
B1-REF ../../data/ref_dbs/gg_13_8_otus/99_otus.fasta ../../data/ref_dbs/gg_13_8_otus/99_otu_taxonom... gg_13_8_otus GTGCCAGCMGCCGCGGTAA GGACTACHVGGGTWTCTAAT 515f 806r
F1-REF ../../data/ref_dbs/unite_20.11.2016/sh_refs_qi... ../../data/ref_dbs/unite_20.11.2016/sh_taxonom... unite_20.11.2016 ACCTGCGGARGGATCA GAGATCCRTTGYTRAAAGTT BITSf B58S3r

Generate "clean" reference taxonomy and sequence database by removing taxonomy strings with empty or ambiguous levels'

Set simulated community parameters, including amplicon length and the number of iterations to perform. Iterations will split our query sequence files into N chunks.

This will take a few minutes to run. Get some coffee.


In [5]:
read_length = 250
iterations = 10
min_read_length = 80
generate_simulated_datasets(simulated_community_definitions, data_dir, 
                            read_length, iterations, min_read_length=min_read_length,
                            levelrange=range(6, 1, -1), force=True)


B1-REF Sequence Counts
Raw Fasta:            203452.0
Clean Fasta:          20745.0
Simulated Reads:      20652.0
B1-REF: generating 10 folds on 493 strata
F1-REF Sequence Counts
Raw Fasta:            28856.0
Clean Fasta:          15684.0
Simulated Reads:      9384.0
F1-REF: generating 10 folds on 335 strata

Data Leakage

First check that train/train and train/test distances are similarly distributed for the cross validation data sets.


In [6]:
distance_comparison(simulated_community_definitions, data_dir, 'cross-validated')


Now check that the novel taxa distance distributions are ok.


In [7]:
distance_comparison(simulated_community_definitions, data_dir, 'novel-taxa-simulations', samples=100)


For peace of mind, we can test our novel taxa and simulated community datasets to confirm that:

1) For simulated communities, test (query) taxa IDs are not in training (ref) set, but all taxonomy strings are

2) For novel taxa, test taxa IDs and taxonomies are not in training (ref) set, but sister branch taxa are

If no errors print, all tests pass.


In [8]:
test_cross_validated_sequences(data_dir)

As a sanity check, confirm that novel taxa were generated successfully.


In [9]:
test_novel_taxa_datasets(data_dir)