Notebook preparation

This notebook illustrates the commands in the Illumina Overview Tutorial, as well as some features that are convenient for working with the IPython Notebook and QIIME.

This tutorial makes use of the 12_10 release of the Greengenes reference OTUs. You can always find a link to the latest version of the reference OTUs on the QIIME resources page.

Getting started

We'll begin by downloading data and initializing some variables to configure our IPython computing environment.


In [1]:
# need to reset this if running notebook multiple times
base_dir = 'qiime_test'
!rm -rf $base_dir
!mkdir $base_dir

In [2]:
!wget https://s3.amazonaws.com/qiime-tutorial/moving_pictures_tutorial.tgz
!wget ftp://greengenes.microbio.me/greengenes_release/gg_12_10/gg_12_10_otus.tar.gz
!tar -xzf moving_pictures_tutorial.tgz -C $base_dir
!tar -xzf gg_12_10_otus.tar.gz -C $base_dir
!rm moving_pictures_tutorial.tgz gg_12_10_otus.tar.gz
!wget -O $base_dir/core_set_aligned.fasta.imputed http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed
!wget -O $base_dir/lanemask_in_1s_and_0s http://greengenes.lbl.gov/Download/Sequence_Data/lanemask_in_1s_and_0s


--2013-04-08 19:27:45--  https://s3.amazonaws.com/qiime-tutorial/moving_pictures_tutorial.tgz
Resolving s3.amazonaws.com... 72.21.214.199
Connecting to s3.amazonaws.com|72.21.214.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29124406 (28M) [application/x-compressed]
Saving to: `moving_pictures_tutorial.tgz'

100%[======================================>] 29,124,406  2.66M/s   in 11s     

2013-04-08 19:27:56 (2.60 MB/s) - `moving_pictures_tutorial.tgz' saved [29124406/29124406]

--2013-04-08 19:27:56--  ftp://greengenes.microbio.me/greengenes_release/gg_12_10/gg_12_10_otus.tar.gz
           => `gg_12_10_otus.tar.gz'
Resolving greengenes.microbio.me... 128.138.93.17
Connecting to greengenes.microbio.me|128.138.93.17|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /greengenes_release/gg_12_10 ... done.
==> SIZE gg_12_10_otus.tar.gz ... 135667617
==> PASV ... done.    ==> RETR gg_12_10_otus.tar.gz ... done.
Length: 135667617 (129M) (unauthoritative)

100%[======================================>] 135,667,617 48.9M/s   in 2.6s    

2013-04-08 19:28:00 (48.9 MB/s) - `gg_12_10_otus.tar.gz' saved [135667617]

--2013-04-08 19:28:08--  http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed
Resolving greengenes.lbl.gov... 128.32.248.7
Connecting to greengenes.lbl.gov|128.32.248.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37975992 (36M) [text/plain]
Saving to: `qiime_test/core_set_aligned.fasta.imputed'

100%[======================================>] 37,975,992   634K/s   in 56s     

2013-04-08 19:29:05 (666 KB/s) - `qiime_test/core_set_aligned.fasta.imputed' saved [37975992/37975992]

--2013-04-08 19:29:05--  http://greengenes.lbl.gov/Download/Sequence_Data/lanemask_in_1s_and_0s
Resolving greengenes.lbl.gov... 128.32.248.7
Connecting to greengenes.lbl.gov|128.32.248.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7683 (7.5K) [text/plain]
Saving to: `qiime_test/lanemask_in_1s_and_0s'

100%[======================================>] 7,683       --.-K/s   in 0s      

2013-04-08 19:29:05 (454 MB/s) - `qiime_test/lanemask_in_1s_and_0s' saved [7683/7683]


In [3]:
# get qiime dependcies and set env for running in KBase VM
import os
from os import environ, chdir, mkdir
from os.path import join
nb_dir = os.getcwd()
environ['RDP_JAR_PATH'] = '/kb/runtime/rdp_classifier_2.2/rdp_classifier-2.2.jar'
environ['QIIME_CONFIG_FP'] = join(nb_dir, '.qiime_config')
# create qiime config
qconf = open('.qiime_config', 'w')
qconf.write("%s\t%s\n"%('qiime_scripts_dir', '/usr/local/bin'))
qconf.write("%s\t%s\n"%('temp_dir', join(nb_dir, 'tmp')))
qconf.write("%s\t%s\n"%('pynast_template_alignment_fp', join(nb_dir, base_dir, 'core_set_aligned.fasta.imputed')))
qconf.write("%s\t%s\n"%('template_alignment_lanemask_fp', join(nb_dir, base_dir, 'lanemask_in_1s_and_0s')))
qconf.close()

In [4]:
# these are only available in the current development branch of IPython
from IPython.display import FileLinks, FileLink
otu_base = join(base_dir, "gg_12_10_otus")
reference_seqs = join(otu_base, "rep_set/97_otus.fasta")
reference_tree = join(otu_base, "trees/97_otus.tree")
reference_tax = join(otu_base, "taxonomy/97_otu_taxonomy.txt")

Start by seeing what files are in our tutorial direcotry. We can do this using ls as we would on the command line, but in this case we prefix with an ! to tell IPython that we're issuing a bash (i.e., command line) command, rather than a python command.


In [5]:
data_dir = join(base_dir, 'moving_pictures_tutorial')
!ls $data_dir


454_map.txt		 filtered_mapping_l5.txt      subsampled_454_seqs.fna
filtered_mapping_l1.txt  filtered_mapping_l6.txt      subsampled_fastq
filtered_mapping_l2.txt  procrustes_metadata_map.txt  ucrC_fast_params.txt
filtered_mapping_l3.txt  procrustes_sid_map.txt
filtered_mapping_l4.txt  README.txt

QIIME additionally supports more convenient output formattting for the IPython notebook so you can directly interact with or download your data.


In [6]:
FileLinks(base_dir)


Out[6]:
qiime_test/
  lanemask_in_1s_and_0s
  core_set_aligned.fasta.imputed
qiime_test/gg_12_10_otus/
  note
qiime_test/gg_12_10_otus/otus/
  88_otu_map.txt
  97_clusters.uc
  85_clusters.uc
  67_clusters.uc
  82_otu_map.txt
  91_otu_map.txt
  97_otu_map.txt
  64_otu_map.txt
  99_clusters.uc
  73_clusters.uc
  79_clusters.uc
  88_clusters.uc
  61_otu_map.txt
  76_clusters.uc
  73_otu_map.txt
  94_clusters.uc
  82_clusters.uc
  70_clusters.uc
  70_otu_map.txt
  67_otu_map.txt
  64_clusters.uc
  61_clusters.uc
  76_otu_map.txt
  79_otu_map.txt
  85_otu_map.txt
  99_otu_map.txt
  91_clusters.uc
  94_otu_map.txt
qiime_test/gg_12_10_otus/rep_set/
  91_otus.fasta
  64_otus.fasta
  79_otus.fasta
  76_otus.fasta
  97_otus.fasta
  94_otus.fasta
  73_otus.fasta
  82_otus.fasta
  85_otus.fasta
  67_otus.fasta
  61_otus.fasta
  99_otus.fasta
  88_otus.fasta
  70_otus.fasta
qiime_test/gg_12_10_otus/taxonomy/
  73_otu_taxonomy.txt
  97_otu_taxonomy.txt
  70_otu_taxonomy.txt
  85_otu_taxonomy.txt
  76_otu_taxonomy.txt
  67_otu_taxonomy.txt
  94_otu_taxonomy.txt
  64_otu_taxonomy.txt
  99_otu_taxonomy.txt
  82_otu_taxonomy.txt
  79_otu_taxonomy.txt
  61_otu_taxonomy.txt
  88_otu_taxonomy.txt
  91_otu_taxonomy.txt
qiime_test/gg_12_10_otus/trees/
  88_otus.tree
  82_otus.tree
  73_otus.tree
  76_otus.tree
  67_otus.tree
  97_otus.tree
  79_otus.tree
  70_otus.tree
  85_otus.tree
  94_otus.tree
  61_otus.tree
  99_otus.tree
  64_otus.tree
  91_otus.tree
qiime_test/moving_pictures_tutorial/
  procrustes_sid_map.txt
  filtered_mapping_l4.txt
  filtered_mapping_l5.txt
  filtered_mapping_l6.txt
  filtered_mapping_l2.txt
  454_map.txt
  filtered_mapping_l3.txt
  ucrC_fast_params.txt
  README.txt
  filtered_mapping_l1.txt
  subsampled_454_seqs.fna
  procrustes_metadata_map.txt
qiime_test/moving_pictures_tutorial/subsampled_fastq/
  subsampled_s_6_sequence_barcodes.fastq
  subsampled_s_2_sequence_barcodes.fastq
  subsampled_s_3_sequence.fastq
  subsampled_s_4_sequence.fastq
  subsampled_s_1_sequence_barcodes.fastq
  subsampled_s_6_sequence.fastq
  subsampled_s_5_sequence.fastq
  subsampled_s_4_sequence_barcodes.fastq
  subsampled_s_2_sequence.fastq
  subsampled_s_1_sequence.fastq
  subsampled_s_3_sequence_barcodes.fastq
  subsampled_s_5_sequence_barcodes.fastq

First, let's check our mapping file for errors


In [7]:
!check_id_map.py -o $base_dir/cid/ -m $data_dir/filtered_mapping_l1.txt


No errors or warnings were found in mapping file.

In this case there were no errors, but if there were we would review the resulting html summary to find out what errors are present. You could then fix those in a spreadsheet program or text editor. To view that html file, call FileLinks on the output directory from the previous step and click the link to the html file.


In [8]:
FileLinks(base_dir+'/cid/')




Demultiplexing and quality filtering seqeunces


In [9]:
!split_libraries_fastq.py -o $base_dir/slout/ -i $data_dir/subsampled_fastq/subsampled_s_1_sequence.fastq,$data_dir/subsampled_fastq/subsampled_s_2_sequence.fastq,$data_dir/subsampled_fastq/subsampled_s_3_sequence.fastq,$data_dir/subsampled_fastq/subsampled_s_4_sequence.fastq,$data_dir/subsampled_fastq/subsampled_s_5_sequence.fastq,$data_dir/subsampled_fastq/subsampled_s_6_sequence.fastq -b $data_dir/subsampled_fastq/subsampled_s_1_sequence_barcodes.fastq,$data_dir/subsampled_fastq/subsampled_s_2_sequence_barcodes.fastq,$data_dir/subsampled_fastq/subsampled_s_3_sequence_barcodes.fastq,$data_dir/subsampled_fastq/subsampled_s_4_sequence_barcodes.fastq,$data_dir/subsampled_fastq/subsampled_s_5_sequence_barcodes.fastq,$data_dir/subsampled_fastq/subsampled_s_6_sequence_barcodes.fastq -m $data_dir/filtered_mapping_l1.txt,$data_dir/filtered_mapping_l2.txt,$data_dir/filtered_mapping_l3.txt,$data_dir/filtered_mapping_l4.txt,$data_dir/filtered_mapping_l5.txt,$data_dir/filtered_mapping_l6.txt

We often want to see the results of running a command. Here we can do that by calling our output formatter again, this time passing the output directory from the previous step.


In [10]:
FileLinks(base_dir+'/slout/')


Out[10]:
qiime_test/slout/
  split_library_log.txt
  seqs.fna
  histograms.txt

In [11]:
!count_seqs.py -i $base_dir/slout/seqs.fna


66189  : qiime_test/slout/seqs.fna (Sequence lengths (mean +/- std): 132.1182 +/- 9.6185)
66189  : Total

OTU picking: using an open-reference OTU picking protocol by searching reads agsint the Greengenes database.


In [12]:
!pick_subsampled_reference_otus_through_otu_table.py -o $base_dir/ucrss_fast/ -i $base_dir/slout/seqs.fna -r $reference_seqs -p $data_dir/ucrC_fast_params.txt

In [14]:
!print_biom_table_summary.py -i $base_dir/ucrss_fast/otu_table_mc2_w_tax_no_pynast_failures.biom


Num samples: 61
Num observations: 1845
Total count: 63399.0
Table density (fraction of non-zero values): 0.0860
Table md5 (unzipped): f8acb7249e8f337081d5d77ab4d61cc8

Counts/sample summary:
 Min: 122.0
 Max: 2560.0
 Median: 891.0
 Mean: 1039.32786885
 Std. dev.: 758.897511846
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
 L3S237: 122.0
 L3S235: 173.0
 L3S372: 186.0
 L3S373: 227.0
 L3S368: 238.0
 L3S370: 242.0
 L3S367: 243.0
 L3S236: 257.0
 L3S369: 260.0
 L3S238: 271.0
 L3S9: 271.0
 L3S8: 275.0
 L3S6: 279.0
 L3S371: 283.0
 L3S7: 301.0
 L5S234: 365.0
 L5S231: 381.0
 L5S235: 414.0
 L5S99: 426.0
 L5S98: 433.0
 L5S230: 442.0
 L5S232: 457.0
 L5S233: 459.0
 L5S229: 505.0
 L5S97: 512.0
 L5S100: 522.0
 L2S236: 584.0
 L2S234: 721.0
 L2S231: 769.0
 L2S229: 844.0
 L2S233: 891.0
 L2S99: 899.0
 L2S230: 935.0
 L2S232: 957.0
 L2S100: 1029.0
 L2S97: 1050.0
 L2S235: 1059.0
 L1S4: 1079.0
 L2S98: 1097.0
 L1S3: 1206.0
 L1S1: 1312.0
 L1S131: 1556.0
 L1S134: 1574.0
 L1S132: 1699.0
 L4S146: 1748.0
 L1S129: 1753.0
 L1S130: 1766.0
 L1S135: 1766.0
 L1S133: 1891.0
 L1S2: 1899.0
 L6S100: 1994.0
 L6S101: 2040.0
 L1S136: 2061.0
 L4S144: 2127.0
 L4S145: 2147.0
 L6S102: 2206.0
 L1S287: 2316.0
 L6S103: 2367.0
 L4S147: 2441.0
 L1S290: 2512.0
 L1S289: 2560.0

Create a single mapping file from the per-lane mapping files.


In [15]:
!merge_mapping_files.py -o $base_dir/combined_mapping_file.txt -m $data_dir/filtered_mapping_l1.txt,$data_dir/filtered_mapping_l2.txt,$data_dir/filtered_mapping_l3.txt,$data_dir/filtered_mapping_l4.txt,$data_dir/filtered_mapping_l5.txt,$data_dir/filtered_mapping_l6.txt

In [16]:
!head $base_dir/combined_mapping_file.txt


#SampleID	BarcodeSequence	LinkerPrimerSequence	days_since_epoch	SamplePlate	timestamp	SampleType	month	PrimerPlate	WellID	year	day	subject	Description
L6S100	GTCGCTGTCTTC	GTGCCAGCMGCCGCGGTAA	14355	20	20090421	Tongue	4	5	d1	2009	21	2	RK_Tong_4_21_2009
L2S229	CAACTCATCGTA	GTGCCAGCMGCCGCGGTAA	14355	7	20090421	L_palm	4	2	e5	2009	21	1	AB_L_Palm_4_21_2009
L6S102	GTTGACGACAGC	GTGCCAGCMGCCGCGGTAA	14357	20	20090423	Tongue	4	5	f1	2009	23	2	RK_Tong_4_23_2009
L6S103	TACGCGCTGAGA	GTGCCAGCMGCCGCGGTAA	14358	20	20090424	Tongue	4	5	g1	2009	24	2	RK_Tong_4_24_2009
L3S369	ATACAGAGCTCC	GTGCCAGCMGCCGCGGTAA	14357	12	20090423	R_palm	4	2	a11	2009	23	1	AB_R_Palm_4_23_2009
L3S368	CATCTGTAGCGA	GTGCCAGCMGCCGCGGTAA	14356	12	20090422	R_palm	4	2	h10	2009	22	1	AB_R_Palm_4_22_2009
L2S100	ACGCTCATGGAT	GTGCCAGCMGCCGCGGTAA	14176	6	20081024	L_palm	10	1	d1	2008	24	1	AB_L_Palm_10_24_2008
L3S373	CACATCTAACAC	GTGCCAGCMGCCGCGGTAA	14175	12	20081023	R_palm	10	2	e11	2008	23	2	RK_R_Palm_10_23_2008
L3S370	ATCCTCAGTAGT	GTGCCAGCMGCCGCGGTAA	14358	12	20090424	R_palm	4	2	b11	2009	24	1	AB_R_Palm_4_24_2009

To view a single file (rather than a directory) we use the FileLink function instead of the FileLinks function.


In [17]:
FileLink(base_dir+'/combined_mapping_file.txt')




Run diversity analyses

Here we're running the core_diversity_analyses.py script which applies many of the "first-pass" diversity analyses that users are generally interested in. The main output that users will interact with is the index.html file, which provides links into the different analysis results.


In [18]:
!core_diversity_analyses.py -o $base_dir/cd258/ -i $base_dir/ucrss_fast/otu_table_mc2_w_tax_no_pynast_failures.biom -m $base_dir/combined_mapping_file.txt -t $base_dir/ucrss_fast/rep_set.tre -e 258 -c "SampleType,days_since_epoch"


This code was only tested with Matplotlib-1.1.0

This code was only tested with Matplotlib-1.1.0

This code was only tested with Matplotlib-1.1.0

This code was only tested with Matplotlib-1.1.0

This code was only tested with Matplotlib-1.1.0

This code was only tested with Matplotlib-1.1.0

This code was only tested with Matplotlib-1.1.0

Warning, the following samples were in the category mapping file but not the OTU table and will be ignored: 
L3S368

L3S373

L3S370

L3S367

L3S372

L3S237

L3S236

L3S235


Warning, the following samples were in the category mapping file but not the OTU table and will be ignored: 
L3S368

L3S373

L3S370

L3S367

L3S372

L3S237

L3S236

L3S235



In [19]:
FileLink(base_dir+'/cd258/index.html')





In [19]: