Filter and pick OTUs

  • License: BSD
  • Copyright: American Gut Project, 2014

Early in the American Gut Project, it was observed that some organisms bloomed, likely as a result of increased shipping time and the delay between when samples were collected and when they were put on ice (more detail can be found here). The purpose of this notebook is to apply the filter that was developed to bioinformatically subtract these observed bloom sequences from fecal samples. It is important to apply this filter when combining other data with the American Gut, so as to remove a potential study-effect bias, since all fecal data in the American Gut has had this filter applied. The specific steps covered are:

  • Filter demultiplexed sequence data to only fecal samples
  • Determine what sequences in the fecal samples recruit to the observed bloom sequences
  • Remove the recruited bloom sequences from the demultiplexed sequence data
  • Trim the filtered demultiplexed data (to reduce study bias when combining with short reads, such as the data in Yatsunenko et al. 2012)
  • Pick OTUs against Greengenes 13_8 for the trimmed and full length filtered demultiplexed data

The notebook is designed to support bloom filtering multiple datasets.

The filtering is only intended to be applied to fecal data. As such, this notebook lets you specify the metadata column and value that identify fecal samples, so that only those samples are carried through the workflow.


In [ ]:
%run cluster_utils.ipy
import os
from functools import partial

# modify the following variable to set a base directory below your home
YOUR_BASE_DIRECTORY = "ICU"
current_home = os.environ['HOME']
basedir = os.path.join(current_home, YOUR_BASE_DIRECTORY)
working_dir = os.path.join(basedir, 'blooming')
prj_name = 'blooming'
trim_length = 100

# define a submit function against qsub
def submit(cmd_prefix, prj_name, queue, extra_args, cmd):
    """Submission wrapper to submit to qsub"""
    cmd = "%s; %s" % (cmd_prefix, cmd)
    return submit_qsub(cmd, job_name=prj_name, queue=queue,
                       extra_args=extra_args)

# define any environment-specific details for the compute jobs.
# these will very likely need to be changed to fit your environment.
cmd_prefix = 'source ~/.bash_profile; workon ag'

# bootstrap the submit method
submit_8gb = partial(submit, cmd_prefix, prj_name, 'short8gb', '-l walltime=23:59:59')
submit_smr = partial(submit, cmd_prefix, prj_name, 'long8gb', '-l nodes=1:ppn=64 -l walltime=72:00:00')

# setup a working directory
if not os.path.exists(working_dir):
    os.makedirs(working_dir)
path_joiner = partial(os.path.join, working_dir)
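
If you are not running against a qsub-based cluster, the submit_qsub and wait_on helpers from cluster_utils.ipy will not apply to your environment. A minimal local fallback might look like the following sketch; it assumes wait_on only needs to block on whatever the submit function returns (submit_local and wait_on_local are hypothetical stand-ins, not part of cluster_utils.ipy):


In [ ]:
# hypothetical local fallback; runs commands in a local shell instead of qsub
import subprocess

def submit_local(cmd_prefix, prj_name, queue, extra_args, cmd):
    """Run the command in a local shell, ignoring the queue settings"""
    return subprocess.Popen("%s; %s" % (cmd_prefix, cmd), shell=True)

def wait_on_local(jobs):
    """Block until every local job has finished"""
    for job in jobs:
        job.wait()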

Here is where we'll set up the paths to our input datasets.


In [ ]:
# setup the paths to your input datasets
paths_to_input_datasets = ["/Users/mcdonadt/scratch/ICU/r1/seqs_r1.fna",
                           "/Users/mcdonadt/scratch/ICU/r2/seqs_r2.fna"]
paths_to_input_metadata = ["/Users/mcdonadt/scratch/ICU/minor_mapping.txt",
                           "/Users/mcdonadt/scratch/ICU/minor_mapping.txt"]
fecal_category_and_value = [('BODY_SITE', 'UBERON:feces'),
                            ('BODY_SITE', 'UBERON:feces')]
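
As an optional sanity check (this cell is an addition, not part of the original workflow), we can confirm that the inputs exist and that each mapping file actually contains the fecal category and value before launching any cluster jobs:


In [ ]:
# optional sanity check: confirm inputs exist and the mapping files contain
# the fecal category and value
for seqs, mapping, (category, value) in zip(paths_to_input_datasets,
                                            paths_to_input_metadata,
                                            fecal_category_and_value):
    assert os.path.exists(seqs), "Missing sequence file: %s" % seqs
    assert os.path.exists(mapping), "Missing mapping file: %s" % mapping
    with open(mapping) as f:
        header = f.readline().rstrip('\n').split('\t')
        assert category in header, "%s lacks a %s column" % (mapping, category)
        idx = header.index(category)
        observed = set(l.rstrip('\n').split('\t')[idx] for l in f if l.strip())
    assert value in observed, "no %s == %s rows in %s" % (category, value, mapping)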

Now, let's set up the paths to our references.


In [ ]:
# path to the bloom sequences (these can be obtained from https://github.com/biocore/American-Gut/tree/master/data/AG)
bloom_seqs = "/Users/mcdonadt/ag/American-Gut/data/AG/BLOOM.fasta"

# path to Greengenes
reference_rep_set = '/scratch/Users/mcdonadt/greengenes/gg_13_8_otus/rep_set/97_otus.fasta'  # path to 97_otus.fasta from Greengenes
reference_taxonomy = '/scratch/Users/mcdonadt/greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt'  # path to 97_otu_taxonomy.txt from Greengenes

On this system, all of the nodes have 64 cores, so let's allow SortMeRNA to use all of them.


In [ ]:
# setup SortMeRNA params file for OTU picking
num_threads_otu_picking = 64
params_file = path_joiner('sortmerna_pick_params.txt')
with open(params_file, 'w') as f:
    f.write("pick_otus:otu_picking_method sortmerna\n")
    f.write("pick_otus:similarity 0.97\n")
    f.write("pick_otus:threads %d\n" % num_threads_otu_picking)

Next, we'll define the scripts we're going to use for filtering and OTU picking.


In [ ]:
scripts = {'Filter Sequences': 'filter_fasta.py -f %(input)s -m %(otus)s -n -o %(output)s',
           'Filter Sequences to Fecal': 'filter_fasta.py -f %(input)s -o %(output)s --mapping_fp %(mapping)s --valid_states "%(states)s"',
           'Pick Closed Reference OTUs': 'pick_closed_reference_otus.py -i %(input)s -o %(output)s -r %(reference)s -t %(taxonomy)s -p ' + params_file}
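
These templates are expanded with Python %-style string formatting. For example (with hypothetical paths), the fecal filter template expands as shown below; note how the ('BODY_SITE', 'UBERON:feces') pair becomes the colon-joined --valid_states argument:


In [ ]:
# illustration only, using hypothetical paths: show the command a template produces
example_args = {'input': '/path/to/seqs.fna',
                'output': '/path/to/seqs-fecal.fna',
                'mapping': '/path/to/mapping.txt',
                'states': ':'.join(('BODY_SITE', 'UBERON:feces'))}
print(scripts['Filter Sequences to Fecal'] % example_args)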

Then, we'll filter the input datasets down to just the sequences associated with fecal samples.


In [ ]:
jobs = []
just_fecal_datasets = []
for d, m, cat in zip(paths_to_input_datasets, paths_to_input_metadata, fecal_category_and_value):
    output = path_joiner(os.path.splitext(os.path.basename(d))[0] + '-fecal.fna')
    just_fecal_datasets.append(output)
    filter_args = {'input': d,
                   'output': output,
                   'mapping': m,
                   'states': ':'.join(cat)}
    
    jobs.append(submit_8gb(scripts['Filter Sequences to Fecal'] % filter_args))
wait_on(jobs)

Once we have the fecal data, we can filter it against the previously observed blooms.


In [ ]:
filtered_outputs = []
invalid_seqs = []
jobs = []

for d in just_fecal_datasets:
    relative_name = os.path.basename(d)
    no_ext = os.path.splitext(relative_name)[0]
    
    output = path_joiner(no_ext + '-bloomed')
    filtered_outputs.append(output)
    
    # the OTU map produced by closed-reference picking against the bloom
    # sequences lists exactly the sequences that recruited to a bloom; it is
    # what we use to subtract those sequences in the next step
    failures = os.path.join(output, 'sortmerna_picked_otus', no_ext + '_otus.txt')
    invalid_seqs.append(failures)
    
    bloom_args = {
        'input': d,
        'output': output,
        'reference': bloom_seqs,
        # the taxonomy file satisfies the script's interface; only the OTU
        # map from this step is used downstream
        'taxonomy': reference_taxonomy,
    }
    
    jobs.append(submit_smr(scripts['Pick Closed Reference OTUs'] % bloom_args))
wait_on(jobs)

Now that we have identified the blooms, we can go back to the full input files and filter out the corresponding sequences.


In [ ]:
jobs = []
bloom_filtered_seqs = []
for full_file, failures in zip(paths_to_input_datasets, invalid_seqs):
    output = os.path.splitext(full_file)[0] + '-bloomed.fna'
    bloom_filtered_seqs.append(output)
    
    filter_args = {
        'input': full_file,
        'output': output,
        'otus': failures}
    jobs.append(submit_8gb(scripts['Filter Sequences'] % filter_args))
wait_on(jobs)
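
As an optional check (again, an addition to the original workflow), we can count how many sequences the bloom filter removed from each dataset:


In [ ]:
# optional QC: report how many sequences the bloom filter removed
def count_fasta_records(path):
    """Count records by counting FASTA header lines"""
    with open(path) as f:
        return sum(1 for line in f if line.startswith('>'))

for before, after in zip(paths_to_input_datasets, bloom_filtered_seqs):
    n_before = count_fasta_records(before)
    n_after = count_fasta_records(after)
    print("%s: removed %d of %d sequences" % (os.path.basename(after),
                                              n_before - n_after, n_before))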

Within the American Gut, we also combine with HiSeq reads that are 100nt, so let's produce two datasets to pick OTUs from: one at full length and one trimmed to 100nt.


In [ ]:
def trimmer(gen):
    """Trim seqs in gen to trim_length"""
    # note: the import location of parse_fasta varies across scikit-bio
    # releases; in older versions it is skbio.parse.sequences.parse_fasta
    from skbio import parse_fasta
    return ((seq_id, seq[:trim_length]) for seq_id, seq in parse_fasta(gen))

for_otu_picking = []
for bfs in bloom_filtered_seqs:
    trimmed_file = bfs + '-%dnt' % trim_length
    for_otu_picking.append(bfs)
    for_otu_picking.append(trimmed_file)
    
    with open(trimmed_file, 'w') as trimmed, open(bfs) as to_trim:
        for payload in trimmer(to_trim):
            trimmed.write(">%s\n%s\n" % payload)

And finally, let's pick the OTUs that will be used for subsequent analysis.


In [ ]:
jobs = []
final_otus = []

for fop in for_otu_picking:
    output = fop + '-otus'
    final_otus.append(output)
    final_args = {
        'input': fop,
        'output': output,
        'reference': reference_rep_set,
        'taxonomy': reference_taxonomy,
    }
    
    jobs.append(submit_smr(scripts['Pick Closed Reference OTUs'] % final_args))
wait_on(jobs)
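
Each directory in final_otus should now contain an otu_table.biom produced by pick_closed_reference_otus.py. As a final sanity check (assuming the biom-format package is installed), we can load each table and report its dimensions:


In [ ]:
# optional: confirm each run produced an OTU table and report its size
from biom import load_table

for otu_dir in final_otus:
    table = load_table(os.path.join(otu_dir, 'otu_table.biom'))
    print("%s: %d OTUs x %d samples" % (otu_dir, table.shape[0], table.shape[1]))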