Set up working directory


In [1]:
cd ~/Desktop/SSUsearch/


/home/gjr/Desktop/SSUsearch

In [2]:
mkdir -p ./workdir

In [3]:
#check seqfile files to process in data directory (make sure you still remember the data directory)
!ls ./data/test/data


1c.fa  1d.fa  2c.fa  2d.fa

README

This part of pipeline search for the SSU rRNA gene fragments, classify them, and extract reads aligned specific region. It is also heavy lifting part of the whole pipeline (more cpu will help).

This part works with one seqfile a time. You just need to change the "Seqfile" and maybe other parameters in the two cells bellow.

To run commands, click "Cell" then "Run All". After it finishes, you will see "*** pipeline runs successsfully :)" at bottom of this pape.

If your computer has many processors, there are two ways to make use of the resource:

  1. Set "Cpu" higher number.

  2. make more copies of this notebook (click "File" then "Make a copy" in menu bar), so you can run the step on multiple files at the same time.

(Again we assume the "Seqfile" is quality trimmed.)

Here we will process one file at a time; set the "Seqfile" variable to the seqfile name to be be processed

First part of seqfile basename (separated by ".") will be the label of this sample, so named it properly.

e.g. for "/usr/local/notebooks/data/test/data/1c.fa", "1c" will the label of this sample.


In [4]:
Seqfile='./data/test/data/2d.fa'

Other parameters to set


In [5]:
Cpu='1'   # number of maxixum threads for search and alignment
Hmm='./data/SSUsearch_db/Hmm.ssu.hmm'   # hmm model for ssu
Gene='ssu'
Script_dir='./scripts'
Gene_model_org='./data/SSUsearch_db/Gene_model_org.16s_ecoli_J01695.fasta'
Ali_template='./data/SSUsearch_db/Ali_template.silva_ssu.fasta'

Start='577'  #pick regions for de novo clustering
End='727'
Len_cutoff='100' # min length for reads picked for the region

Gene_tax='./data/SSUsearch_db/Gene_tax.silva_taxa_family.tax' # silva 108 ref
Gene_db='./data/SSUsearch_db/Gene_db.silva_108_rep_set.fasta'

Gene_tax_cc='./data/SSUsearch_db/Gene_tax_cc.greengene_97_otus.tax' # greengene 2012.10 ref for copy correction
Gene_db_cc='./data/SSUsearch_db/Gene_db_cc.greengene_97_otus.fasta'

In [6]:
# first part of file basename will the label of this sample
import os
Filename=os.path.basename(Seqfile)
Tag=Filename.split('.')[0]

In [7]:
import os

New_path = '{}:{}'.format('~/Desktop/SSUsearch/external_tools/bin/', os.environ['PATH'])
Hmm=os.path.abspath(Hmm)
Seqfile=os.path.abspath(Seqfile)
Script_dir=os.path.abspath(Script_dir)
Gene_model_org=os.path.abspath(Gene_model_org)
Ali_template=os.path.abspath(Ali_template)
Gene_tax=os.path.abspath(Gene_tax)
Gene_db=os.path.abspath(Gene_db)
Gene_tax_cc=os.path.abspath(Gene_tax_cc)
Gene_db_cc=os.path.abspath(Gene_db_cc)

os.environ.update(
    {'PATH':New_path,
     'Cpu':Cpu, 
     'Hmm':os.path.abspath(Hmm), 
     'Gene':Gene, 
     'Seqfile':os.path.abspath(Seqfile), 
     'Filename':Filename, 
     'Tag':Tag, 
     'Script_dir':os.path.abspath(Script_dir), 
     'Gene_model_org':os.path.abspath(Gene_model_org), 
     'Ali_template':os.path.abspath(Ali_template), 
     'Start':Start, 
     'End':End,
     'Len_cutoff':Len_cutoff,
     'Gene_tax':os.path.abspath(Gene_tax), 
     'Gene_db':os.path.abspath(Gene_db), 
     'Gene_tax_cc':os.path.abspath(Gene_tax_cc), 
     'Gene_db_cc':os.path.abspath(Gene_db_cc)})

In [8]:
!echo "*** make sure: parameters are right"
!echo "Seqfile: $Seqfile\nCpu: $Cpu\nFilename: $Filename\nTag: $Tag"


*** make sure: parameters are right
Seqfile: /home/gjr/Desktop/SSUsearch/data/test/data/2d.fa
Cpu: 1
Filename: 2d.fa
Tag: 2d

In [9]:
cd workdir


/home/gjr/Desktop/SSUsearch/workdir

In [10]:
mkdir -p $Tag.ssu.out

In [11]:
### start hmmsearch

In [12]:
%%bash
echo "*** hmmsearch starting"
time hmmsearch --incE 10 --incdomE 10 --cpu $Cpu \
  --domtblout $Tag.ssu.out/$Tag.qc.$Gene.hmmdomtblout \
  -o /dev/null -A $Tag.ssu.out/$Tag.qc.$Gene.sto \
  $Hmm $Seqfile
echo "*** hmmsearch finished"


*** hmmsearch starting
*** hmmsearch finished
real	0m3.323s
user	0m3.024s
sys	0m0.052s

In [13]:
!python $Script_dir/get-seq-from-hmmout.py \
    $Tag.ssu.out/$Tag.qc.$Gene.hmmdomtblout \
    $Tag.ssu.out/$Tag.qc.$Gene.sto \
    $Tag.ssu.out/$Tag.qc.$Gene


parsing hmmdotblout done..
50 of 109 seqs are kept after hmm parser

Pass hits to mothur aligner


In [14]:
%%bash
echo "*** Starting mothur align"
cat  $Gene_model_org $Tag.ssu.out/$Tag.qc.$Gene > $Tag.ssu.out/$Tag.qc.$Gene.RFadded

# mothur does not allow tab between its flags, thus no indents here
time mothur "#align.seqs(candidate=$Tag.ssu.out/$Tag.qc.$Gene.RFadded, template=$Ali_template, threshold=0.5, flip=t, processors=$Cpu)"

rm -f mothur.*.logfile


*** Starting mothur align






mothur v.1.22.2
Last updated: 11/7/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > align.seqs(candidate=2d.ssu.out/2d.qc.ssu.RFadded, template=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Ali_template.silva_ssu.fasta, threshold=0.5, flip=t, processors=1)

Reading in the /home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Ali_template.silva_ssu.fasta template sequences...	DONE.
It took 64 to read  18491 sequences.
Aligning sequences from 2d.ssu.out/2d.qc.ssu.RFadded ...
51
It took 1 secs to align 51 sequences.


Output File Names: 
2d.ssu.out/2d.qc.ssu.align
2d.ssu.out/2d.qc.ssu.align.report


mothur > quit()
real	1m11.529s
user	1m6.840s
sys	0m2.948s

Get aligned seqs that have > 50% matched to references


In [15]:
!python $Script_dir/mothur-align-report-parser-cutoff.py \
    $Tag.ssu.out/$Tag.qc.$Gene.align.report \
    $Tag.ssu.out/$Tag.qc.$Gene.align \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter \
    0.5


0 bad seqs out of 51 total are removed from alignment

In [16]:
!python $Script_dir/remove-gap.py $Tag.ssu.out/$Tag.qc.$Gene.align.filter $Tag.ssu.out/$Tag.qc.$Gene.align.filter.fa

Search is done here (the computational intensive part). Hooray!

  • \$Tag.ssu.out/\$Tag.qc.\$Gene.align.filter:
    aligned SSU rRNA gene fragments
  • \$Tag.ssu.out/\$Tag.qc.\$Gene.align.filter.fa:
    unaligned SSU rRNA gene fragments

Extract the reads mapped 150bp region in V4 (577-727 in E.coli SSU rRNA gene position) for unsupervised clustering


In [17]:
!python $Script_dir/region-cut.py $Tag.ssu.out/$Tag.qc.$Gene.align.filter $Start $End $Len_cutoff

!mv $Tag.ssu.out/$Tag.qc.$Gene.align.filter."$Start"to"$End".cut.lenscreen $Tag.ssu.out/$Tag.forclust


26 sequences are matched to 577-727 region

Classify SSU rRNA gene seqs using SILVA


In [18]:
%%bash
rm -f $Tag.ssu.out/$Tag.qc.$Gene.align.filter.silva_taxa_family*.taxonomy
mothur "#classify.seqs(fasta=$Tag.ssu.out/$Tag.qc.$Gene.align.filter.fa, template=$Gene_db, taxonomy=$Gene_tax, cutoff=50, processors=$Cpu)"
mv $Tag.ssu.out/$Tag.qc.$Gene.align.filter.silva_taxa_family*.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.silva.taxonomy







mothur v.1.22.2
Last updated: 11/7/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > classify.seqs(fasta=2d.ssu.out/2d.qc.ssu.align.filter.fa, template=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_db.silva_108_rep_set.fasta, taxonomy=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_tax.silva_taxa_family.tax, cutoff=50, processors=1)
Reading template taxonomy...     DONE.
Reading template probabilities...     DONE.
It took 36 seconds get probabilities. 
Classifying sequences from 2d.ssu.out/2d.qc.ssu.align.filter.fa ...
Processing sequence: 50

It took 2 secs to classify 50 sequences.


It took 0 secs to create the summary file for 50 sequences.


Output File Names: 
2d.ssu.out/2d.qc.ssu.align.filter.silva_taxa_family.taxonomy
2d.ssu.out/2d.qc.ssu.align.filter.silva_taxa_family.tax.summary


mothur > quit()

In [19]:
!python $Script_dir/count-taxon.py \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.silva.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.silva.taxonomy.count
!rm -f mothur.*.logfile

Classify SSU rRNA gene seqs with Greengene for copy correction later


In [20]:
%%bash
rm -f $Tag.ssu.out/$Tag.qc.$Gene.align.filter.greengene_97_otus*.taxonomy
mothur "#classify.seqs(fasta=$Tag.ssu.out/$Tag.qc.$Gene.align.filter.fa, template=$Gene_db_cc, taxonomy=$Gene_tax_cc, cutoff=50, processors=$Cpu)"
mv $Tag.ssu.out/$Tag.qc.$Gene.align.filter.greengene_97_otus*.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.gg.taxonomy







mothur v.1.22.2
Last updated: 11/7/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > classify.seqs(fasta=2d.ssu.out/2d.qc.ssu.align.filter.fa, template=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_db_cc.greengene_97_otus.fasta, taxonomy=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_tax_cc.greengene_97_otus.tax, cutoff=50, processors=1)
Reading template taxonomy...     DONE.
Reading template probabilities...     DONE.
It took 11 seconds get probabilities. 
Classifying sequences from 2d.ssu.out/2d.qc.ssu.align.filter.fa ...
Processing sequence: 50

It took 1 secs to classify 50 sequences.


It took 0 secs to create the summary file for 50 sequences.


Output File Names: 
2d.ssu.out/2d.qc.ssu.align.filter.greengene_97_otus.taxonomy
2d.ssu.out/2d.qc.ssu.align.filter.greengene_97_otus.tax.summary


mothur > quit()

In [21]:
!python $Script_dir/count-taxon.py \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.gg.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.gg.taxonomy.count
!rm -f mothur.*.logfile

In [22]:
# check the output directory
!ls $Tag.ssu.out


2d.forclust
2d.qc.ssu
2d.qc.ssu.align
2d.qc.ssu.align.filter
2d.qc.ssu.align.filter.577to727.cut
2d.qc.ssu.align.filter.577to727.cut.lenscreen.fa
2d.qc.ssu.align.filter.fa
2d.qc.ssu.align.filter.greengene_97_otus.tax.summary
2d.qc.ssu.align.filter.silva_taxa_family.tax.summary
2d.qc.ssu.align.filter.wang.gg.taxonomy
2d.qc.ssu.align.filter.wang.gg.taxonomy.count
2d.qc.ssu.align.filter.wang.silva.taxonomy
2d.qc.ssu.align.filter.wang.silva.taxonomy.count
2d.qc.ssu.align.report
2d.qc.ssu.hmmdomtblout
2d.qc.ssu.hmmdomtblout.parsedToDictWithScore.pickle
2d.qc.ssu.RFadded
2d.qc.ssu.sto

This part of pipeline (working with one sequence file) finishes here. Next we will combine samples for community analysis (see unsupervised analysis).

Following are files useful for community analysis:

  • 1c.577to727: aligned fasta file of seqs mapped to target region for de novo clustering
  • 1c.qc.ssu.align.filter: aligned fasta file of all SSU rRNA gene fragments
  • 1c.qc.ssu.align.filter.wang.gg.taxonomy: Greengene taxonomy (for copy correction)
  • 1c.qc.ssu.align.filter.wang.silva.taxonomy: SILVA taxonomy

In [23]:
!echo "*** pipeline runs successsfully :)"


*** pipeline runs successsfully :)

In [ ]: