Set up working directory


In [1]:
cd ~/Desktop/SSUsearch/


/home/gjr/Desktop/SSUsearch

In [2]:
mkdir -p ./workdir

In [3]:
#check seqfile files to process in data directory (make sure you still remember the data directory)
!ls ./data/test/data


1c.fa  1d.fa  2c.fa  2d.fa

README

This part of pipeline search for the SSU rRNA gene fragments, classify them, and extract reads aligned specific region. It is also heavy lifting part of the whole pipeline (more cpu will help).

This part works with one seqfile a time. You just need to change the "Seqfile" and maybe other parameters in the two cells bellow.

To run commands, click "Cell" then "Run All". After it finishes, you will see "*** pipeline runs successsfully :)" at bottom of this pape.

If your computer has many processors, there are two ways to make use of the resource:

  1. Set "Cpu" higher number.

  2. make more copies of this notebook (click "File" then "Make a copy" in menu bar), so you can run the step on multiple files at the same time.

(Again we assume the "Seqfile" is quality trimmed.)

Here we will process one file at a time; set the "Seqfile" variable to the seqfile name to be be processed

First part of seqfile basename (separated by ".") will be the label of this sample, so named it properly.

e.g. for "/usr/local/notebooks/data/test/data/1c.fa", "1c" will the label of this sample.


In [4]:
Seqfile='./data/test/data/2d.fa'

Other parameters to set


In [5]:
Cpu='2'   # number of maxixum threads for search and alignment
Hmm='./data/SSUsearch_db/Hmm.ssu.hmm'   # hmm model for ssu
Gene='ssu'
Script_dir='./scripts'
Gene_model_org='./data/SSUsearch_db/Gene_model_org.16s_ecoli_J01695.fasta'
Ali_template='./data/SSUsearch_db/Ali_template.silva_ssu.fasta'

Start='577'  #pick regions for de novo clustering
End='727'
Len_cutoff='100' # min length for reads picked for the region

Gene_tax='./data/SSUsearch_db/Gene_tax.silva_taxa_family.tax' # silva 108 ref
Gene_db='./data/SSUsearch_db/Gene_db.silva_108_rep_set.fasta'

Gene_tax_cc='./data/SSUsearch_db/Gene_tax_cc.greengene_97_otus.tax' # greengene 2012.10 ref for copy correction
Gene_db_cc='./data/SSUsearch_db/Gene_db_cc.greengene_97_otus.fasta'

In [6]:
# first part of file basename will the label of this sample
import os
Filename=os.path.basename(Seqfile)
Tag=Filename.split('.')[0]

In [8]:
import os

New_path = '{}:{}'.format('~/Desktop/SSUsearch/external_tools/bin/', os.environ['PATH'])
Hmm=os.path.abspath(Hmm)
Seqfile=os.path.abspath(Seqfile)
Script_dir=os.path.abspath(Script_dir)
Gene_model_org=os.path.abspath(Gene_model_org)
Ali_template=os.path.abspath(Ali_template)
Gene_tax=os.path.abspath(Gene_tax)
Gene_db=os.path.abspath(Gene_db)
Gene_tax_cc=os.path.abspath(Gene_tax_cc)
Gene_db_cc=os.path.abspath(Gene_db_cc)

os.environ.update(
    {'PATH':New_path,
     'Cpu':Cpu, 
     'Hmm':os.path.abspath(Hmm), 
     'Gene':Gene, 
     'Seqfile':os.path.abspath(Seqfile), 
     'Filename':Filename, 
     'Tag':Tag, 
     'Script_dir':os.path.abspath(Script_dir), 
     'Gene_model_org':os.path.abspath(Gene_model_org), 
     'Ali_template':os.path.abspath(Ali_template), 
     'Start':Start, 
     'End':End,
     'Len_cutoff':Len_cutoff,
     'Gene_tax':os.path.abspath(Gene_tax), 
     'Gene_db':os.path.abspath(Gene_db), 
     'Gene_tax_cc':os.path.abspath(Gene_tax_cc), 
     'Gene_db_cc':os.path.abspath(Gene_db_cc)})

In [9]:
!echo "*** make sure: parameters are right"
!echo "Seqfile: $Seqfile\nCpu: $Cpu\nFilename: $Filename\nTag: $Tag"


*** make sure: parameters are right
Seqfile: /home/gjr/Desktop/SSUsearch/data/test/data/1c.fa
Cpu: 2
Filename: 1c.fa
Tag: 1c

In [15]:
cd workdir


[Errno 2] No such file or directory: 'workdir'
/home/gjr/Desktop/SSUsearch/workdir

In [11]:
mkdir -p $Tag.ssu.out

In [12]:
### start hmmsearch

In [24]:
%%bash
echo "*** hmmsearch starting"
time hmmsearch --incE 10 --incdomE 10 --cpu $Cpu \
  --domtblout $Tag.ssu.out/$Tag.qc.$Gene.hmmdomtblout \
  -o /dev/null -A $Tag.ssu.out/$Tag.qc.$Gene.sto \
  $Hmm $Seqfile
echo "*** hmmsearch finished"


*** hmmsearch starting
*** hmmsearch finished
real	0m3.223s
user	0m3.112s
sys	0m0.032s

In [25]:
!python $Script_dir/get-seq-from-hmmout.py \
    $Tag.ssu.out/$Tag.qc.$Gene.hmmdomtblout \
    $Tag.ssu.out/$Tag.qc.$Gene.sto \
    $Tag.ssu.out/$Tag.qc.$Gene


parsing hmmdotblout done..
50 of 115 seqs are kept after hmm parser

Pass hits to mothur aligner


In [26]:
%%bash
echo "*** Starting mothur align"
cat  $Gene_model_org $Tag.ssu.out/$Tag.qc.$Gene > $Tag.ssu.out/$Tag.qc.$Gene.RFadded

# mothur does not allow tab between its flags, thus no indents here
time mothur "#align.seqs(candidate=$Tag.ssu.out/$Tag.qc.$Gene.RFadded, template=$Ali_template, threshold=0.5, flip=t, processors=$Cpu)"

rm -f mothur.*.logfile


*** Starting mothur align






mothur v.1.22.2
Last updated: 11/7/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > align.seqs(candidate=1c.ssu.out/1c.qc.ssu.RFadded, template=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Ali_template.silva_ssu.fasta, threshold=0.5, flip=t, processors=2)

Reading in the /home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Ali_template.silva_ssu.fasta template sequences...	DONE.
It took 65 to read  18491 sequences.
Aligning sequences from 1c.ssu.out/1c.qc.ssu.RFadded ...
23
28
It took 2 secs to align 51 sequences.


Output File Names: 
1c.ssu.out/1c.qc.ssu.align
1c.ssu.out/1c.qc.ssu.align.report


mothur > quit()
87.29user 4.28system 1:36.83elapsed 94%CPU (0avgtext+0avgdata 4870144maxresident)k
1461088inputs+289272outputs (40major+408782minor)pagefaults 0swaps

Get aligned seqs that have > 50% matched to references


In [27]:
!python $Script_dir/mothur-align-report-parser-cutoff.py \
    $Tag.ssu.out/$Tag.qc.$Gene.align.report \
    $Tag.ssu.out/$Tag.qc.$Gene.align \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter \
    0.5


0 bad seqs out of 51 total are removed from alignment

In [28]:
!python $Script_dir/remove-gap.py $Tag.ssu.out/$Tag.qc.$Gene.align.filter $Tag.ssu.out/$Tag.qc.$Gene.align.filter.fa

Search is done here (the computational intensive part). Hooray!

  • \$Tag.ssu.out/\$Tag.qc.\$Gene.align.filter:
    aligned SSU rRNA gene fragments
  • \$Tag.ssu.out/\$Tag.qc.\$Gene.align.filter.fa:
    unaligned SSU rRNA gene fragments

Extract the reads mapped 150bp region in V4 (577-727 in E.coli SSU rRNA gene position) for unsupervised clustering


In [29]:
!python $Script_dir/region-cut.py $Tag.ssu.out/$Tag.qc.$Gene.align.filter $Start $End $Len_cutoff

!mv $Tag.ssu.out/$Tag.qc.$Gene.align.filter."$Start"to"$End".cut.lenscreen $Tag.ssu.out/$Tag.forclust


28 sequences are matched to 577-727 region

Classify SSU rRNA gene seqs using SILVA


In [32]:
%%bash
rm -f $Tag.ssu.out/$Tag.qc.$Gene.align.filter.silva_taxa_family*.taxonomy
mothur "#classify.seqs(fasta=$Tag.ssu.out/$Tag.qc.$Gene.align.filter.fa, template=$Gene_db, taxonomy=$Gene_tax, cutoff=50, processors=$Cpu)"
mv $Tag.ssu.out/$Tag.qc.$Gene.align.filter.silva_taxa_family*.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.silva.taxonomy







mothur v.1.22.2
Last updated: 11/7/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > classify.seqs(fasta=1c.ssu.out/1c.qc.ssu.align.filter.fa, template=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_db.silva_108_rep_set.fasta, taxonomy=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_tax.silva_taxa_family.tax, cutoff=50, processors=2)

Reading in the /home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_tax.silva_taxa_family.tax taxonomy...	DONE.
Generating search database...    DONE.
It took 261 seconds generate search database. 
Calculating template taxonomy tree...     DONE.
Calculating template probabilities...     DONE.
It took 736 seconds get probabilities. 
Classifying sequences from 1c.ssu.out/1c.qc.ssu.align.filter.fa ...
Processing sequence: 26
Processing sequence: 24

It took 1 secs to classify 50 sequences.

Warning: cannot find taxon Soil_crenarchaeotic_group in reference taxonomy tree at level 2 for MISEQ08:58:000000000-A6BH3:1:1101:16172:8454. This may cause totals of daughter levels not to add up in summary file.

It took 0 secs to create the summary file for 50 sequences.


Output File Names: 
1c.ssu.out/1c.qc.ssu.align.filter.silva_taxa_family.taxonomy
1c.ssu.out/1c.qc.ssu.align.filter.silva_taxa_family.tax.summary


mothur > quit()
mv: cannot stat `1c.ssu.out/1c.qc.ssu.align.filter.*.wang.taxonomy': No such file or directory

In [34]:
!python $Script_dir/count-taxon.py \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.silva.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.silva.taxonomy.count
!rm -f mothur.*.logfile

Classify SSU rRNA gene seqs with Greengene for copy correction later


In [38]:
%%bash
rm -f $Tag.ssu.out/$Tag.qc.$Gene.align.filter.greengene_97_otus*.taxonomy
mothur "#classify.seqs(fasta=$Tag.ssu.out/$Tag.qc.$Gene.align.filter.fa, template=$Gene_db_cc, taxonomy=$Gene_tax_cc, cutoff=50, processors=$Cpu)"
mv $Tag.ssu.out/$Tag.qc.$Gene.align.filter.greengene_97_otus*.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.gg.taxonomy







mothur v.1.22.2
Last updated: 11/7/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > classify.seqs(fasta=1c.ssu.out/1c.qc.ssu.align.filter.fa, template=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_db_cc.greengene_97_otus.fasta, taxonomy=/home/gjr/Desktop/SSUsearch/data/SSUsearch_db/Gene_tax_cc.greengene_97_otus.tax, cutoff=50, processors=1)
Reading template taxonomy...     DONE.
Reading template probabilities...     DONE.
It took 11 seconds get probabilities. 
Classifying sequences from 1c.ssu.out/1c.qc.ssu.align.filter.fa ...
Processing sequence: 50

It took 1 secs to classify 50 sequences.


It took 0 secs to create the summary file for 50 sequences.


Output File Names: 
1c.ssu.out/1c.qc.ssu.align.filter.greengene_97_otus.taxonomy
1c.ssu.out/1c.qc.ssu.align.filter.greengene_97_otus.tax.summary


mothur > quit()

In [39]:
!python $Script_dir/count-taxon.py \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.gg.taxonomy \
    $Tag.ssu.out/$Tag.qc.$Gene.align.filter.wang.gg.taxonomy.count
!rm -f mothur.*.logfile

In [40]:
# check the output directory
!ls $Tag.ssu.out


1c.forclust
1c.qc.ssu
1c.qc.ssu.align
1c.qc.ssu.align.filter
1c.qc.ssu.align.filter.577to727.cut
1c.qc.ssu.align.filter.577to727.cut.lenscreen.fa
1c.qc.ssu.align.filter.fa
1c.qc.ssu.align.filter.greengene_97_otus.tax.summary
1c.qc.ssu.align.filter.silva_taxa_family.tax.summary
1c.qc.ssu.align.filter.wang.gg.taxonomy
1c.qc.ssu.align.filter.wang.gg.taxonomy.count
1c.qc.ssu.align.filter.wang.silva.taxonomy
1c.qc.ssu.align.filter.wang.silva.taxonomy.count
1c.qc.ssu.align.report
1c.qc.ssu.hmmdomtblout
1c.qc.ssu.hmmdomtblout.parsedToDictWithScore.pickle
1c.qc.ssu.RFadded
1c.qc.ssu.sto

This part of pipeline (working with one sequence file) finishes here. Next we will combine samples for community analysis (see unsupervised analysis).

Following are files useful for community analysis:

  • 1c.577to727: aligned fasta file of seqs mapped to target region for de novo clustering
  • 1c.qc.ssu.align.filter: aligned fasta file of all SSU rRNA gene fragments
  • 1c.qc.ssu.align.filter.wang.gg.taxonomy: Greengene taxonomy (for copy correction)
  • 1c.qc.ssu.align.filter.wang.silva.taxonomy: SILVA taxonomy

In [41]:
!echo "*** pipeline runs successsfully :)"


*** pipeline runs successsfully :)

In [ ]: