Goal

  • in silico PCR with 515F-806R primers on all NCBI compelete genomes
    • amplicons will be clustered to produce OTUs

Setting variables


In [99]:
import os
workDir = '/var/seq_data/ncbi_db/genome/Jan2016/'
genomeDir = os.path.join(workDir, 'bac_complete_rn')
primerFile = '/home/nick/notebook/SIPSim/dev/515F-806R.fna'

Init


In [100]:
%load_ext rpy2.ipython
%load_ext pushnote


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The pushnote extension is already loaded. To reload it, use:
  %reload_ext pushnote

In [101]:
%%R
library(dplyr)
library(tidyr)
library(ggplot2)

In [102]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)
    
%cd $workDir


/var/seq_data/ncbi_db/genome/Jan2016

concat all genome sequences


In [104]:
#catGenomeFile = 'bac_complete_rn.fna'
#!rm -f $catGenomeFile
#!find $genomeDir -name "*.fna" | xargs -I % bash -c "cat %; echo" >> $catGenomeFile

In [105]:
# fixing issues of no line breaks for new sequence IDs in fastas
#!perl -pi -e 's/>/\n>/ if /.+>/' $catGenomeFile

In [109]:
# checking number of sequences
#!grep -c ">" $catGenomeFile


3779

In [110]:
# checking for duplicate sequence IDs
#ret = !grep ">" $catGenomeFile | sort | uniq -c
#for x in ret:
#    x = x.lstrip().split(' ')
#    x[0] = int(x[0])
#    if x[0] > 1:
#        print x

rnammer to ID 16S genes


In [112]:
rnammerDir = os.path.join(workDir, 'rnammer')
if not os.path.isdir(rnammerDir):
    os.makedirs(rnammerDir)

In [114]:
%%bash -s "$genomeDir" "$rnammerDir"

find $1 -name "*.fna" | \
    perl -pe 's/.+\/|\.fna//g' | \
    xargs -n 1 -I % -P 30 bash -c \
    "rnammer -S bac -m ssu -gff $2/%_rrn.gff -f $2/%_rrn.fna -xml $2/%_rrn.xml < $1/%.fna"

In [115]:
!cd $rnammerDir; \
    egrep -v "^#" *.gff | \
    grep "16s_rRNA" | \
    perl -pe 's/:/\t/' > ssu_summary.txt

In [116]:
%%bash -s "$rnammerDir"
cd $1

printf "ssu gene length distribution:\n"
cut -d $'\t' -f 7 ssu_summary.txt | NY_misc_perl stats_descriptive


ssu gene length distribution:
1	min	175.90
1	Q1	1906.30
1	mean	1925.95
1	median	1950.40
1	Q3	1994.50
1	max	2090.20
1	stdev	128.81

In [121]:
# combining sequences
ssu_all_file = os.path.join(rnammerDir, 'ssu_all.fna')
!cd $rnammerDir; \
    cat *_rrn.fna > $ssu_all_file
    
!printf "Number of rRNA SSU sequences: "
!grep -c ">" $ssu_all_file


Number of rRNA SSU sequences: 15226

in-silco PCR

  • using pcr.seqs command from Mothur

Making an oligos file


In [119]:
seqs = !grep -v ">" $primerFile
oligoFile = os.path.splitext(primerFile)[0] + '.oligo'
with open(oligoFile, 'wb') as outFH:
    for i,x in enumerate(seqs):    
        if i == 0:
            primer = 'forward'
        elif i == 1:
            primer = 'reverse'
        else:
            break
        outFH.write('{} {}\n'.format(primer, x))

# checking output
!head $oligoFile


forward GTGCCAGCMGCCGCGGTAA
reverse GGACTACHVGGGTWTCTAAT

Calling mothur pcr.seqs


In [122]:
cmd = 'mothur "#pcr.seqs(fasta={}, oligos={}, pdiffs=1, processors=24)"'.format(ssu_all_file, oligoFile)
!$cmd







mothur v.1.35.1
Last updated: 03/31/2015

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > pcr.seqs(fasta=/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.fna, oligos=/home/nick/notebook/SIPSim/dev/515F-806R.oligo, pdiffs=1, processors=24)

Using 24 processors.
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 100
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 100
Processing sequence: 300
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 100
Processing sequence: 300
Processing sequence: 300
Processing sequence: 100
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 400
Processing sequence: 200
Processing sequence: 100
Processing sequence: 200
Processing sequence: 300
Processing sequence: 200
Processing sequence: 200
Processing sequence: 500
Processing sequence: 400
Processing sequence: 300
Processing sequence: 300
Processing sequence: 200
Processing sequence: 200
Processing sequence: 200
Processing sequence: 300
Processing sequence: 200
Processing sequence: 500
Processing sequence: 200
Processing sequence: 300
Processing sequence: 300
Processing sequence: 200
Processing sequence: 300
Processing sequence: 300
Processing sequence: 400
Processing sequence: 400
Processing sequence: 300
Processing sequence: 400
Processing sequence: 300
Processing sequence: 300
Processing sequence: 400
Processing sequence: 400
Processing sequence: 200
Processing sequence: 400
Processing sequence: 400
Processing sequence: 400
Processing sequence: 500
Processing sequence: 500
Processing sequence: 500
Processing sequence: 300
Processing sequence: 500
Processing sequence: 100
Processing sequence: 400
Processing sequence: 400
Processing sequence: 400
Processing sequence: 300
Processing sequence: 300
Processing sequence: 300
Processing sequence: 400
Processing sequence: 300
Processing sequence: 600
Processing sequence: 500
Processing sequence: 600
Processing sequence: 600
Processing sequence: 636
Processing sequence: 200
Processing sequence: 500
Processing sequence: 500
Processing sequence: 400
Processing sequence: 400
Processing sequence: 500
Processing sequence: 400
Processing sequence: 500
Processing sequence: 500
Processing sequence: 600
Processing sequence: 635
Processing sequence: 300
Processing sequence: 400
Processing sequence: 400
Processing sequence: 634
Processing sequence: 600
Processing sequence: 632
Processing sequence: 600
Processing sequence: 600
Processing sequence: 500
Processing sequence: 637
Processing sequence: 400
Processing sequence: 633
Processing sequence: 600
Processing sequence: 633
Processing sequence: 600
Processing sequence: 631
Processing sequence: 634
Processing sequence: 500
Processing sequence: 500
Processing sequence: 600
Processing sequence: 633
Processing sequence: 500
Processing sequence: 600
Processing sequence: 500
Processing sequence: 634
Processing sequence: 600
Processing sequence: 500
Processing sequence: 600
Processing sequence: 600
Processing sequence: 625
Processing sequence: 600
Processing sequence: 631
Processing sequence: 634
Processing sequence: 638
Processing sequence: 600
Processing sequence: 500
Processing sequence: 600
Processing sequence: 638
Processing sequence: 500
Processing sequence: 600
Processing sequence: 600
Processing sequence: 628
Processing sequence: 637
Processing sequence: 639
Processing sequence: 600
Processing sequence: 200
Processing sequence: 634
Processing sequence: 300
Processing sequence: 400
Processing sequence: 500
Processing sequence: 600
Processing sequence: 633
Processing sequence: 200
Processing sequence: 300
Processing sequence: 400
Processing sequence: 500
Processing sequence: 600
Processing sequence: 640
Processing sequence: 200
Processing sequence: 300
Processing sequence: 400
Processing sequence: 500
Processing sequence: 600
Processing sequence: 638
Processing sequence: 100
Processing sequence: 200
Processing sequence: 300
Processing sequence: 400
Processing sequence: 500
Processing sequence: 600
Processing sequence: 639

Output File Names: 
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.fna
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.bad.accnos
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.scrap.pcr.fna


It took 5 secs to screen 15226 sequences.

mothur > quit()

In [127]:
ssu_pcr_file = os.path.splitext(ssu_all_file)[0] + '.pcr.fna'
cmd = 'mothur "#summary.seqs(fasta={})"'.format(ssu_pcr_file)
!$cmd







mothur v.1.35.1
Last updated: 03/31/2015

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > summary.seqs(fasta=/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.fna)

Using 1 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	193	193	0	3	1
2.5%-tile:	1	252	252	0	4	378
25%-tile:	1	253	253	0	4	3777
Median: 	1	253	253	0	5	7554
75%-tile:	1	253	253	0	6	11330
97.5%-tile:	1	254	254	0	6	14729
Maximum:	1	542	542	4	9	15106
Mean:	1	253.035	253.035	0.00205216	4.87277
# of Seqs:	15106

Output File Names: 
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.summary

It took 0 secs to summarize 15106 sequences.

mothur > quit()

Quality filtering ssu amplicon sequences


In [137]:
cmd = 'mothur "#screen.seqs(fasta={}, maxlength=300, minlength=252, maxambig=0, maxhomop=8)"'.format(ssu_pcr_file)
!$cmd | tail -n 30


Processing sequence: 13200
Processing sequence: 13300
Processing sequence: 13400
Processing sequence: 13500
Processing sequence: 13600
Processing sequence: 13700
Processing sequence: 13800
Processing sequence: 13900
Processing sequence: 14000
Processing sequence: 14100
Processing sequence: 14200
Processing sequence: 14300
Processing sequence: 14400
Processing sequence: 14500
Processing sequence: 14600
Processing sequence: 14700
Processing sequence: 14800
Processing sequence: 14900
Processing sequence: 15000
Processing sequence: 15100
Processing sequence: 15106

Output File Names: 
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.good.fna
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.bad.accnos


It took 0 secs to screen 15106 sequences.

mothur > quit()

In [138]:
ssu_pcr_filt_file = os.path.splitext(ssu_pcr_file)[0] + '.good.fna'
cmd = 'mothur "#summary.seqs(fasta={})"'.format(ssu_pcr_filt_file)
!$cmd







mothur v.1.35.1
Last updated: 03/31/2015

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > summary.seqs(fasta=/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.good.fna)

Using 1 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	252	252	0	3	1
2.5%-tile:	1	252	252	0	4	377
25%-tile:	1	253	253	0	4	3762
Median: 	1	253	253	0	5	7523
75%-tile:	1	253	253	0	6	11284
97.5%-tile:	1	254	254	0	6	14668
Maximum:	1	262	262	0	8	15044
Mean:	1	252.943	252.943	0	4.8735
# of Seqs:	15044

Output File Names: 
/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.good.summary

It took 1 secs to summarize 15044 sequences.

mothur > quit()

In [139]:
!head -n 4 $ssu_pcr_filt_file


>rRNA_NC_009925_Acaryochloris_marina_MBIC11017__Acaryochloris_marina_MBIC11017_5636180-5637670_DIR+	fpdiffs=0(match) rpdiffs=0(match) 	 /molecule=16s_rRNA /score=1853.9
GACGGAGGAGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGTGGCCTATCAAGTCTGTTGTTAAAGCCCAGGGCTCAACTCTGGATCAGCAATGGAAACTGAAAGGCTAGAGTACGGTAGGGGTAGAGGGAATTCCCAGTGTAGCGGTGAAATGCGTAGATATTGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCGTAACTGACACTCATGGACGAAAGCTAGGGGAGCGAAAGGG
>rRNA_NC_009925_Acaryochloris_marina_MBIC11017__Acaryochloris_marina_MBIC11017_1409155-1410645_DIR-	fpdiffs=0(match) rpdiffs=0(match) 	 /molecule=16s_rRNA /score=1853.9
GACGGAGGAGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGTGGCCTATCAAGTCTGTTGTTAAAGCCCAGGGCTCAACTCTGGATCAGCAATGGAAACTGAAAGGCTAGAGTACGGTAGGGGTAGAGGGAATTCCCAGTGTAGCGGTGAAATGCGTAGATATTGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCGTAACTGACACTCATGGACGAAAGCTAGGGGAGCGAAAGGG

In [140]:
ssu_pcr_filt_file


Out[140]:
'/var/seq_data/ncbi_db/genome/Jan2016/rnammer/ssu_all.pcr.good.fna'

In [ ]: