This tutorial covers a full QIIME workflow using Illumina sequencing data. It is intended to be quick to run, and as such uses only a subset of a full Illumina Genome Analyzer II (GAIIx) run. We'll make use of the 13_8 release of the Greengenes reference OTUs. You can always find a link to the latest version of the reference OTUs on the QIIME resources page.
The data used in this tutorial is derived from the Moving Pictures of the Human Microbiome study, where two human subjects collected daily samples from four body sites: the tongue, the palm of the left hand, the palm of the right hand, and the gut (via fecal samples obtained by swabbing used toilet paper). This data was sequenced using the barcoded amplicon sequencing protocol described in Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. A more recent version of this protocol that can be used with the Illumina HiSeq 2000 and MiSeq can be found here.
This tutorial is presented as an IPython Notebook. For more information on using QIIME with IPython, see our recent paper. You can find more information on the IPython Notebook here.
In [ ]:
from os import chdir, mkdir
from os.path import join, exists, abspath
from IPython.display import FileLinks, FileLink
In [ ]:
!(wget https://s3.amazonaws.com/s3-qiime_tutorial_files/moving_pictures_tutorial-1.9.0.tgz || curl -O https://s3.amazonaws.com/s3-qiime_tutorial_files/moving_pictures_tutorial-1.9.0.tgz)
!tar -xzf moving_pictures_tutorial-1.9.0.tgz
If you don't have (or don't know if you have) the latest Greengenes OTUs installed, execute the next cell.
In [ ]:
!(wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz || curl -O ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz)
!tar -xzf gg_13_8_otus.tar.gz
Throughout this tutorial we make use of a reference sequence collection, tree, and taxonomy derived from the Greengenes database. For convenience, we'll define them as variables here and reference those variables throughout the tutorial wherever they are used.
In [ ]:
# Adjust this if the Greengenes OTUs were extracted somewhere else.
BASE_GREENGENES_PATH = './'

otu_base = abspath(join(BASE_GREENGENES_PATH, "gg_13_8_otus/"))
reference_seqs = abspath(join(otu_base, "rep_set/97_otus.fasta"))
reference_tree = abspath(join(otu_base, "trees/97_otus.tree"))
reference_tax = abspath(join(otu_base, "taxonomy/97_otu_taxonomy.txt"))
aligned_core_set = abspath(join(otu_base, "rep_set_aligned/85_otus.fasta"))

if not exists(otu_base):
    print "otu_base does not appear to exist. Please set BASE_GREENGENES_PATH to the parent directory of the OTUs and re-execute this cell."
else:
    # Append the reference file paths to the parameters file used below.
    !echo assign_taxonomy:reference_seqs_fp $reference_seqs >> moving_pictures_tutorial-1.9.0/uc_fast_params.txt
    !echo assign_taxonomy:id_to_taxonomy_fp $reference_tax >> moving_pictures_tutorial-1.9.0/uc_fast_params.txt
    !echo align_seqs:template_fp $aligned_core_set >> moving_pictures_tutorial-1.9.0/uc_fast_params.txt
Start by seeing what files are in our tutorial directory. We can do this using ls as we would on the command line, but in this case we prefix it with an ! to tell IPython that we're issuing a bash (i.e., command line) command, rather than a Python command.
In [ ]:
!ls moving_pictures_tutorial-1.9.0
You can also use the FileLink and FileLinks features of the IPython Notebook to open or download your data.
In [ ]:
FileLinks('moving_pictures_tutorial-1.9.0')
We'll change to the moving_pictures_tutorial-1.9.0/illumina directory for the remaining steps. We also need to prepare the FileLink and FileLinks functions to work from this new location.
In [ ]:
from functools import partial

# Move into the Illumina data directory for the remaining steps.
chdir('moving_pictures_tutorial-1.9.0/illumina')

# Rebind FileLink/FileLinks so the links they generate resolve relative
# to the notebook's original working directory.
FileLink = partial(FileLink, url_prefix='moving_pictures_tutorial-1.9.0/illumina/')
FileLinks = partial(FileLinks, url_prefix='moving_pictures_tutorial-1.9.0/illumina/')
The QIIME mapping file contains all of the per-sample metadata, including technical information such as the primers and barcodes that were used for each sample, and information about the samples such as what body site they were taken from. In this data set we're looking at human microbiome samples from four sites on the bodies of two individuals at multiple time points. The metadata in this case therefore includes a subject identifier, a timepoint, and a body site for each sample. You can review the map.tsv file at the link in the previous cell to see an example of the data (or view the published Google Spreadsheet version, which is more nicely formatted).
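If you'd like a quick look at the mapping file without leaving the notebook, you can print its first few lines with the standard head shell command (the -n value is arbitrary; adjust it as you like):
In [ ]:
!head -n 3 map.tsv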
In this step, we run validate_mapping_file.py to ensure that our mapping file is compatible with QIIME.
In [ ]:
!validate_mapping_file.py -o vmf-map/ -m map.tsv
In this case there were no errors, but if there were, we would review the resulting HTML summary to find out what errors are present. You could then fix those in a spreadsheet program or text editor and rerun validate_mapping_file.py on the updated mapping file.
For the sake of illustrating what errors in a mapping file might look like, we've created a bad mapping file (map-bad.tsv). We'll next call validate_mapping_file.py on map-bad.tsv. Review the resulting HTML report. What are the issues with this mapping file?
In [ ]:
!validate_mapping_file.py -o vmf-map-bad/ -m map-bad.tsv
In [ ]:
FileLinks('vmf-map-bad/')
We next need to demultiplex and quality-filter our sequences (i.e., assign barcoded reads to the samples they are derived from). In general, you should get separate fastq files for your sequence and barcode reads. Note that we pass these files while still gzipped; split_libraries_fastq.py can handle gzipped or unzipped fastq files. The default strategy in QIIME for quality filtering of Illumina data is described in Bokulich et al. (2013).
In [ ]:
!split_libraries_fastq.py -o slout/ -i forward_reads.fastq.gz -b barcodes.fastq.gz -m map.tsv
We often want to see the results of running a command. Here we can do that by calling FileLinks again, this time passing the output directory from the previous step.
In [ ]:
FileLinks('slout/')
We can see how many sequences we ended up with using count_seqs.py.
In [ ]:
!count_seqs.py -i slout/seqs.fna
Now that we have demultiplexed sequences, we're ready to cluster these sequences into OTUs. There are three high-level ways to do this in QIIME: de novo, closed-reference, or open-reference OTU picking. Open-reference OTU picking is currently our preferred method. A discussion of these methods can be found in Rideout et al. (2014).
Here we apply open-reference OTU picking. Note that this command takes the seqs.fna file that was generated in the previous step, as well as the reference fasta file ($reference_seqs here). We're also specifying some parameters to the pick_otus.py command, which is internal to this workflow. Specifically, we set enable_rev_strand_match to True, which allows sequences to match the reference database if either their forward or reverse orientation matches a reference sequence. (Several other parameters are also specified in this file, but those settings were found to be much faster than, and to yield the same results as, the previously used defaults, and they became the default settings as of QIIME 1.8.0-dev, so passing them here has no effect on the run. We suppress the prefilter by passing --prefilter_percent_id 0.0 for the same reason.) These parameters are specified in the parameters file, which is passed as -p. You can find information on defining parameters files here.
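If you're curious what ended up in the parameters file, including the reference paths we appended earlier, you can print it with the standard cat shell command; the relative path matches the -p argument used below:
In [ ]:
!cat ../uc_fast_params.txt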
The OTU picking step below can take about 10 minutes to complete.
In [ ]:
!pick_open_reference_otus.py -o otus/ -i slout/seqs.fna -r $reference_seqs -p ../uc_fast_params.txt --prefilter_percent_id 0.0
The primary output that we get from this command is the OTU table, or the number of times each operational taxonomic unit (OTU) is observed in each sample. QIIME uses the Genomics Standards Consortium Biological Observation Matrix (BIOM) format for representing OTU tables. You can find additional information on the BIOM format here, and information on converting these files to tab-separated text that can be viewed in spreadsheet programs here. Several OTU tables are generated by this command. The one we typically want to work with is otus/otu_table_mc2_w_tax_no_pynast_failures.biom. This table has singleton OTUs (i.e., OTUs with a total count of 1) removed, as well as OTUs whose representative (i.e., centroid) sequence couldn't be aligned with PyNAST. It also contains taxonomic assignments for each OTU as observation metadata.
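For example, to inspect this OTU table in a spreadsheet program, you can convert it to tab-separated text with biom convert; the output filename here is just a suggestion, and --header-key taxonomy carries the taxonomic assignments along as a column:
In [ ]:
!biom convert -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o otus/otu_table_mc2_w_tax_no_pynast_failures.txt --to-tsv --header-key taxonomy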
The open-reference OTU picking command also produces a phylogenetic tree where the tips are the OTUs. The file containing the tree is otus/rep_set.tre, and this is the file that should be used with otus/otu_table_mc2_w_tax_no_pynast_failures.biom in downstream phylogenetic diversity calculations. The tree is stored in the widely used Newick format.
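If you haven't seen Newick before, it's a plain-text, parenthesized encoding of a tree with branch lengths; a hypothetical three-OTU tree (the OTU names here are made up) would look something like: ((OTU_1:0.12,OTU_2:0.34):0.05,OTU_3:0.28);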
To view the output of the OTU picking command, call FileLinks on the output directory.
In [ ]:
FileLinks('otus/')
To compute some summary statistics of the OTU table we can run the following command.
In [ ]:
!biom summarize-table -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o otus/otu_table_mc2_w_tax_no_pynast_failures_summary.txt
To view the summary, call FileLink on the output file.
In [ ]:
FileLink('otus/otu_table_mc2_w_tax_no_pynast_failures_summary.txt')
The key piece of information you need to pull from this output is the depth of sequencing that should be used in diversity analyses. Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the Counts/sample detail and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away. For some perspective on this, see Kuczynski 2010.
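If you'd rather pull the per-sample counts programmatically than read them out of the summary file, here is a minimal sketch using the BIOM Python API (the biom-format package is installed as a QIIME 1.9 dependency; load_table and Table.sum are part of its 2.x API):
In [ ]:
from biom import load_table

# Load the OTU table and compute the total observation count per sample.
table = load_table('otus/otu_table_mc2_w_tax_no_pynast_failures.biom')
per_sample_counts = table.sum(axis='sample')
print "Minimum counts/sample: %d" % per_sample_counts.min()
print "Maximum counts/sample: %d" % per_sample_counts.max()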
Here we're running the core_diversity_analyses.py script, which applies many of the "first-pass" diversity analyses that users are generally interested in. The main output that users will interact with is the index.html file, which provides links into the different analysis results.
Note that in this step we're passing -e, which is the sampling depth that should be used for diversity analyses. We chose 4370 here based on reviewing the above output from biom summarize-table. This value will be study-specific, so don't just use this value with your own data (though it's fine to use it for this tutorial).
The commands in this section (combined) can take about 15 minutes to complete.
In [ ]:
!core_diversity_analyses.py -o cd4370/ -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -m map.tsv -t otus/rep_set.tre -e 4370
Next open the index.html file in the resulting directory. This will link you into the different results.
In [ ]:
FileLink('cd4370/index.html')
The results above treat all samples independently, but sometimes (for example, in the taxonomic summaries) it's useful to categorize samples by their metadata. We can do this by passing categories (i.e., headers from our mapping file) to core_diversity_analyses.py with the -c parameter. Because core_diversity_analyses.py can take a long time to run, it has a --recover_from_failure option, which in some cases allows it to be rerun from the point where it previously failed (for example, if you accidentally turned your computer off while it was running). This option can also be used to add categorical analyses if you didn't include them in your initial run. Next we'll rerun core_diversity_analyses.py with two sets of categorical analyses: one for the SampleType category, and one for the DaysSinceExperimentStart category. Remember the --recover_from_failure option: it can save you a lot of time.
In [ ]:
!core_diversity_analyses.py -o cd4370/ --recover_from_failure -c "SampleType,DaysSinceExperimentStart" -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -m map.tsv -t otus/rep_set.tre -e 4370
In [ ]:
FileLink('cd4370/index.html')
One thing you may notice in the PCoA plots generated by core_diversity_analyses.py is that the samples don't cluster perfectly by SampleType. This is unexpected, based on what we know about the human microbiome. Since this is a time series, let's explore this in a little more detail by integrating a time axis into our PCoA plots. We can do this by re-running Emperor directly, replacing our previously generated PCoA plots. (Emperor is a tool for visualizing PCoA plots, with many advanced features that you can explore in the Emperor tutorial. If you use Emperor in your research you should be sure to cite it directly, as with the other tools that QIIME wraps, such as uclust and the RDP Classifier.)
After this runs, you can reload the Emperor plots that you accessed from the cd4370/index.html links above. Try making the samples taken during AntibioticUsage invisible.
In [ ]:
!make_emperor.py -i cd4370/bdiv_even4370/weighted_unifrac_pc.txt -o cd4370/bdiv_even4370/weighted_unifrac_emperor_pcoa_plot -m map.tsv --custom_axes DaysSinceExperimentStart
!make_emperor.py -i cd4370/bdiv_even4370/unweighted_unifrac_pc.txt -o cd4370/bdiv_even4370/unweighted_unifrac_emperor_pcoa_plot -m map.tsv --custom_axes DaysSinceExperimentStart
IMPORTANT: Removing points from a PCoA plot, as suggested above for data exploration purposes, is not the same as computing PCoA without those points. If, after running this, you'd like to remove the samples taken during AntibioticUsage from the analysis, you can do so with filter_samples_from_otu_table.py, which is discussed here. As an exercise, try removing the samples taken during AntibioticUsage from the OTU table and re-running core_diversity_analyses.py. You should output the results to a different directory than the one you created above (e.g., cd4370_no_abx).
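A minimal sketch of that exercise might look like the following. Note that the --valid_states value is an assumption: it presumes the mapping file has a ReportedAntibioticUsage column with Yes/No values, so check your map.tsv headers (and the filter_samples_from_otu_table.py documentation) before running it.
In [ ]:
# Keep only samples whose metadata says no antibiotic usage (the column
# name and values are assumptions; verify them against map.tsv first).
!filter_samples_from_otu_table.py -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o otus/otu_table_no_abx.biom -m map.tsv -s 'ReportedAntibioticUsage:No'
# Re-run the diversity analyses on the filtered table, writing to a new directory.
!core_diversity_analyses.py -o cd4370_no_abx/ -i otus/otu_table_no_abx.biom -m map.tsv -t otus/rep_set.tre -e 4370 -c "SampleType,DaysSinceExperimentStart"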
This tutorial illustrated some of the basic features of QIIME, and there are a lot of places to go from here. If you're interested in seeing additional visualizations, you should check out the QIIME overview tutorial. The Procrustes analysis tutorial illustrates a really cool analysis, allowing you to continue with the same data used here, comparing against the samples sequenced on 454 (rather than Illumina, as in this analysis). If you're interested in some possibilities for statistical analyses you can try the supervised learning or distance matrix comparison tutorials, both of which can be adapted to use data generated in this tutorial.
In [ ]:
FileLinks("precomputed-output/")