Goal

in silico PCR with 515F-806R primers on all NCBI compelete genomes
- amplicons will be clustered to produce OTUs

Setting variables



In [99]:

    
import os
workDir = '/var/seq_data/ncbi_db/genome/Jan2016/'
genomeDir = os.path.join(workDir, 'bac_complete_rn')
primerFile = '/home/nick/notebook/SIPSim/dev/515F-806R.fna'

Init



In [100]:

    
%load_ext rpy2.ipython
%load_ext pushnote









    



The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The pushnote extension is already loaded. To reload it, use:
  %reload_ext pushnote



In [101]:

    
%%R
library(dplyr)
library(tidyr)
library(ggplot2)



In [102]:

    
if not os.path.isdir(workDir):
    os.makedirs(workDir)
    
%cd $workDir









    



/var/seq_data/ncbi_db/genome/Jan2016

concat all genome sequences



In [104]:

    
#catGenomeFile = 'bac_complete_rn.fna'
#!rm -f $catGenomeFile
#!find $genomeDir -name "*.fna" | xargs -I % bash -c "cat %; echo" >> $catGenomeFile



In [105]:

    
# fixing issues of no line breaks for new sequence IDs in fastas
#!perl -pi -e 's/>/\n>/ if /.+>/' $catGenomeFile



In [109]:

    
# checking number of sequences
#!grep -c ">" $catGenomeFile



In [110]:

    
# checking for duplicate sequence IDs
#ret = !grep ">" $catGenomeFile | sort | uniq -c
#for x in ret:
#    x = x.lstrip().split(' ')
#    x[0] = int(x[0])
#    if x[0] > 1:
#        print x

rnammer to ID 16S genes



In [112]:

    
rnammerDir = os.path.join(workDir, 'rnammer')
if not os.path.isdir(rnammerDir):
    os.makedirs(rnammerDir)



In [ ]:

    
%%bash -s "$genomeDir" "$rnammerDir"

find $1 -name "*.fna" | \
    perl -pe 's/.+\/|\.fna//g' | \
    xargs -n 1 -I % -P 30 bash -c \
    "rnammer -S bac -m ssu -gff $2/%_rrn.gff -f $2/%_rrn.fna -xml $2/%_rrn.xml < $1/%.fna"



In [ ]:

    
!cd $rnammerDir; \
    egrep -v "^#" *.gff | \
    grep "16s_rRNA" | \
    perl -pe 's/:/\t/' > ssu_summary.txt



In [ ]:

    
%%bash -s "$rnammerDir"
cd $1

printf "ssu gene length distribution:\n"
cut -d $'\t' -f 7 ssu_summary.txt | NY_misc_perl stats_descriptive



In [ ]:

    
# combining sequences
!cd $rnammerDir; \
    cat *_rrn.fna > ssu_all.fna
    
!printf "Number of sequences: "
!cd $rnammerDir; \
    grep -c ">" ssu_all.fna



In [ ]:

    
%pushnote rnammer complete

in-silco PCR

using pcr.seqs command from Mothur

Making an oligos file



In [10]:

    
seqs = !grep -v ">" $primerFile
oligoFile = os.path.splitext(primerFile)[0] + '.oligo'
with open(oligoFile, 'wb') as outFH:
    for i,x in enumerate(seqs):    
        if i == 0:
            primer = 'forward'
        elif i == 1:
            primer = 'reverse'
        else:
            break
        outFH.write('{} {}\n'.format(primer, x))

# checking output
!head $oligoFile









    



forward GTGCCAGCMGCCGCGGTAA
reverse GGACTACHVGGGTWTCTAAT

Calling mothur



In [ ]:

    
cmd = 'mothur "#pcr.seqs(fasta={}, oligos={}, pdiffs=1, processors=24)"'.format(catGenomeFile, oligoFile)
!$cmd









    








mothur v.1.35.1
Last updated: 03/31/2015

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > pcr.seqs(fasta=bac_complete.fna, oligos=/home/nick/notebook/SIPSim/dev/515F-806R.oligo, pdiffs=2, processors=24)

Using 24 processors.
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100
Processing sequence: 100

Filtering out large sequences



In [79]:

    
!seq_tools fasta_info --sn --sl bac_complete.pcr.fna > bac_complete.pcr_len.txt
!head -n 3 bac_complete.pcr_len.txt









    



NC_017540_Klebsiella_pneumoniae_KCTC_2242__Klebsiella_pneumoniae_KCTC_2242	fpdiffs=0(match) rpdiffs=0(match)	4857180
NZ_CP012349_Salmonella_enterica_subsp__enterica_serovar_Sloterdijk_str__ATCC_	fpdiffs=0(match) rpdiffs=0(match)	702714
NC_014622_Paenibacillus_polymyxa_SC2__Paenibacillus_polymyxa_SC2	fpdiffs=0(match) rpdiffs=0(match)	2632047



In [74]:

    
%%R -i workDir -h 300 -w 600
F = file.path(workDir, 'bac_complete.pcr_len.txt')
df = read.delim(F, sep='\t', header=FALSE)

ggplot(df, aes(V3)) +
    geom_histogram() +
    labs(x='amplicon length (bp)')



In [75]:

    
%%R -h 300 -w 600
df.f = df %>%
    filter(V3 < 500) 

ggplot(df.f, aes(V3)) +
    geom_histogram() +
    labs(x='amplicon length (bp)')



In [84]:

    
%%R
outFile = 'bac_complete.pcr.accnos'
write.table(df.f %>% select(V1, V2), outFile, sep='\t', quote=FALSE, row.names=FALSE, col.names=FALSE)



In [85]:

    
cmd = 'mothur "#get.seqs(fasta={}, accnos={})"'.format('bac_complete.pcr.fna', 'bac_complete.pcr.accnos')
!$cmd









    








mothur v.1.35.1
Last updated: 03/31/2015

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > get.seqs(fasta=bac_complete.pcr.fna, accnos=bac_complete.pcr.accnos)
Selected 902 sequences from your fasta file.

Output File Names: 
bac_complete.pcr.pick.fna


mothur > quit()

Checking final, filtered amplicon pool



In [86]:

    
cmd = 'mothur "#summary.seqs(fasta={})"'.format('bac_complete.pcr.pick.fna')
!$cmd









    








mothur v.1.35.1
Last updated: 03/31/2015

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > summary.seqs(fasta=bac_complete.pcr.pick.fna)

Using 1 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	246	246	0	3	1
2.5%-tile:	1	252	252	0	3	23
25%-tile:	1	253	253	0	4	226
Median: 	1	253	253	0	4	452
75%-tile:	1	253	253	0	5	677
97.5%-tile:	1	254	254	0	6	880
Maximum:	1	255	255	0	7	902
Mean:	1	252.919	252.919	0	4.67295
# of Seqs:	902

Output File Names: 
bac_complete.pcr.pick.summary

It took 0 secs to summarize 902 sequences.

mothur > quit()



In [87]:

    
# checking for multiple amplicons from the same genome
ret = !grep ">" bac_complete.pcr.pick.fna
cnt = {}
for x in ret:
    x = x.split('\t')[0]
    try:
        cnt[x] += 1
    except KeyError:
        cnt[x] = 1      
        
for x,y in cnt.items():
    if y > 1:
        print x
        
cnt = None



In [ ]: