dataset-generation


Generating simulated communities from natural community observations

This notebook demonstrates how to generate a new simulated community from taxonomic assignments of a natural community. We use the following test datasets, which are pared down from natural communities from the linked studies.

Sake bacterial succession (bacterial 16S rRNA)

Wine grape terroir (fungal ITS)

NOTE: The simulated communities described in this notebook are provided in the tax-credit repository and do not need to be regenerated to test additional taxonomy classifiers. Re-run this notebook only to generate NEW simulated communities. In that case, remove the sake and wine communities from the dataset_reference_combinations list below before running that cell, to avoid overwriting data.


In [1]:
from os.path import join, expandvars 
from tax_credit.simulated_communities import generate_simulated_communities

In [2]:
# Project directory
project_dir = expandvars("$HOME/Desktop/projects/tax-credit/")
# Directory containing reference sequence databases
reference_database_dir = join(project_dir, 'data', 'ref_dbs')
# simulated communities directory
sim_dir = join(project_dir, "data", "simulated-community")

In the following cell, we define the natural datasets that we want to use for simulated community generation. The directory for each dataset is located in sim_dir, and contains the files expected-composition.txt, containing the taxonomic composition of each sample, and map.txt, containing sample metadata.
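Before generating anything, it can help to confirm that each dataset directory has the layout described above. The following is a minimal sketch, not part of tax_credit: the helper name missing_input_files and the REQUIRED_FILES constant are hypothetical, and only the two file names come from the text.

```python
from os.path import isfile, join

# File names taken from the description above; everything else here is illustrative.
REQUIRED_FILES = ("expected-composition.txt", "map.txt")


def missing_input_files(sim_dir, community_names):
    """Return {community_name: [missing file names]} for incomplete dataset directories.

    Hypothetical helper, not part of tax_credit.
    """
    missing = {}
    for name in community_names:
        # Each dataset is expected to live in sim_dir/<community_name>/
        absent = [f for f in REQUIRED_FILES if not isfile(join(sim_dir, name, f))]
        if absent:
            missing[name] = absent
    return missing
```

Calling missing_input_files(sim_dir, ['sake', 'wine']) should return an empty dictionary when both dataset directories are complete.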


In [3]:
dataset_reference_combinations = [
    # (community_name, ref_db)
    ('sake', 'gg_13_8_otus'),
    ('wine', 'unite_20.11.2016')
]

reference_dbs = {'gg_13_8_otus' : (join(reference_database_dir, 'gg_13_8_otus/99_otus_clean_515f-806r_trim250.fasta'), 
                                   join(reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.tsv')),
                 'unite_20.11.2016' : (join(reference_database_dir, 'unite_20.11.2016/sh_refs_qiime_ver7_99_20.11.2016_dev_clean_BITSf-B58S3r_trim250.fasta'), 
                                       join(reference_database_dir, 'unite_20.11.2016/sh_taxonomy_qiime_ver7_99_20.11.2016_dev_clean.tsv'))}

The following cell will generate:

1. Simulated compositions in tsv format.
2. biom tables for simulated-compositions and expected-composition.
3. A fasta file containing "representative sequences" for each "OTU",
    i.e., reference sequences matching the expected taxa, up to strain_max
    sequences per taxonomy.

We set strain_max to the maximum number of reference sequences that will be matched to each taxonomy in expected-composition.txt.
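To make the role of strain_max concrete, here is a minimal sketch of capping the number of reference sequences selected per taxonomy. It is an illustration under stated assumptions, not the tax_credit implementation: the function select_reference_ids is hypothetical, it uses exact string matching only, and it assumes the reference taxonomy has already been loaded into a dictionary of sequence ID to taxonomy string.

```python
from collections import defaultdict


def select_reference_ids(ref_taxonomy, expected_taxa, strain_max=5):
    """Pick up to strain_max reference sequence IDs for each expected taxonomy.

    Hypothetical helper for illustration only.
    ref_taxonomy: dict mapping reference sequence ID -> taxonomy string
    expected_taxa: iterable of taxonomy strings from the expected composition
    """
    # Group reference sequence IDs by their taxonomy string.
    ids_by_taxon = defaultdict(list)
    for seq_id, taxon in ref_taxonomy.items():
        ids_by_taxon[taxon].append(seq_id)
    # Sort IDs so the selection is deterministic, then cap at strain_max.
    return {taxon: sorted(ids_by_taxon.get(taxon, []))[:strain_max]
            for taxon in expected_taxa}
```

A taxonomy with more than strain_max matching references contributes only strain_max sequences, and a taxonomy with no matching reference maps to an empty list.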


In [4]:
generate_simulated_communities(sim_dir, dataset_reference_combinations, reference_dbs, strain_max=5)


sake: 0 matches and 8 near matches.
wine: 2 matches and 5 near matches.

These communities have relatively few matches (exact taxonomy matches to species level) and more near matches (most likely to genus level). The expected-composition.txt files show why: most of the sequences in these natural communities were only assigned to genus level. This is a common situation and informs how we analyze the communities later on. For example, we may want to perform certain comparisons at genus level.
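The distinction between matches and near matches can be sketched as follows. This is an assumption-laden illustration, not the logic used by generate_simulated_communities: the function names are hypothetical, taxonomy strings are assumed to be semicolon-delimited with seven ranks (kingdom through species), and a "near match" is assumed to mean agreement on the first six ranks (through genus).

```python
def split_ranks(taxonomy, sep=";"):
    """Split a delimited taxonomy string into a list of whitespace-stripped ranks."""
    return [rank.strip() for rank in taxonomy.split(sep)]


def classify_match(expected, reference, genus_level=6):
    """Classify a reference taxonomy relative to an expected taxonomy.

    Hypothetical helper: returns 'match' when every rank agrees, 'near match'
    when the first genus_level ranks agree, and 'no match' otherwise.
    """
    exp, ref = split_ranks(expected), split_ranks(reference)
    if exp == ref:
        return "match"
    if exp[:genus_level] == ref[:genus_level]:
        return "near match"
    return "no match"
```

Under these assumptions, an expected taxonomy that ends in an empty species label (such as s__) can only be a near match to a reference sequence that carries a species name, which is consistent with the counts reported above.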

