Development using test dataset

grinder run

@ system76-server:~/notebook/grinderSIP/t/example

with abundance profile

grinder -reference_file genome3.fna -fr 515Fm-927Rm.fna \
    -rd 215 normal 50 -tr 1000 -length_bias 0 -unidirectional 1 \
    -af abnd_prof.txt

with abundance model

grinder -reference_file genome3.fna -fr 515Fm-927Rm.fna -am exponential \
    -rd 215 normal 50 -tr 1000 -length_bias 0 -unidirectional 1

with multiple samples

grinder -reference_file genome3.fna -fr 515Fm-927Rm.fna -am exponential \
-rd 215 normal 50 -tr 1000 -length_bias 0 -unidirectional 1 -num_libraries 2 \
-shared_perc 100

testing script with 3 genome

@ system76-server:~/notebook/grinderSIP/dev

simulating isotope incorporation (max of 100%)

$ perl ../bin/makeIncorp_phylo.pl -r grinder-ranks.txt -t genome3.nwk > genome3_inc100.txt

simulating isotope incorporation (max of 50%)

$ perl ../bin/makeIncorp_phylo.pl -r grinder-ranks.txt -t genome3.nwk -range 0-50 > genome3_inc50.txt

grinderSIP.pl

CHANGES to grindeSIP.pl: Fragments written to each fraction file, then grinder is run on each

  • fragments made by genome fragmentation:
    • genomes fragmented based on determined or defined relative abundance (RA)
    • fragmentation size defined by size distribution
    • fragment start defined by poisson? or uniform? distribution
    • total number of genome copies (TNGC) defined
    • TNGC x RA(each taxon)
    • fragments distributed into fractions based on fragment GC
    • naming issues?
      • may need to rename by taxon
  • foreach fraction:
    • wrapper to run grinder
    • need option file (options based on grinder)
$ perl ../bin/grinderSIP.pl -g genome3.fna -r grinder-reads.fa -inc genome3_inc100.txt | less



checking that all reads written

$ nseq readsByFrac/isoIncorp/*fasta | perl -ne 'BEGIN{$sum=0} /.+:(\d+)/; $sum+=$1; END{print $sum,"\n"}'
    * yes, all written

Larger dataset of 10 genomes

Making the dataset

@ system76-server:~/notebook/grinderSIP/dev/bac_genome10

Randomly selected 10 genomes

  • Selecting from ../../../deltaGC_notes/data/genome_data/prok-bac-genomes_Ngaps
    Bacillus_cereus_F837_76.fasta
    Cyanobium_gracile_PCC_6307.fasta
    Escherichia_coli_APEC_O78.fasta
    Geobacter_lovleyi_SZ.fasta
    Idiomarina_loihiensis_GSL_199.fasta
    Leuconostoc_citreum_KM20.fasta
    Micrococcus_luteus_NCTC_2665.fasta
    Nitrosomonas_europaea_ATCC_19718.fasta
    Roseobacter_litoralis_Och_149.fasta
    Xanthomonas_axonopodis_Xac29-1.fasta

Getting genomes from collection

$ printf "Bacillus_cereus_F837_76.fasta\nCyanobium_gracile_PCC_6307.fasta\nEscherichia_coli_APEC_O78.fasta\nGeobacter_lovleyi_SZ.fasta\nIdiomarina_loihiensis_GSL_199.fasta\nLeuconostoc_citreum_KM20.fasta\nMicrococcus_luteus_NCTC_2665.fasta\nNitrosomonas_europaea_ATCC_19718.fasta\nRoseobacter_litoralis_Och_149.fasta\nXanthomonas_axonopodis_Xac29-1.fasta" | xargs -n 1 -I % bash -c "grep -c '>' ~/notebook/deltaGC_notes/data/genome_data/prok-bac-genomes_Ngaps/%"
    * all have 1 genome

$ printf "Bacillus_cereus_F837_76.fasta\nCyanobium_gracile_PCC_6307.fasta\nEscherichia_coli_APEC_O78.fasta\nGeobacter_lovleyi_SZ.fasta\nIdiomarina_loihiensis_GSL_199.fasta\nLeuconostoc_citreum_KM20.fasta\nMicrococcus_luteus_NCTC_2665.fasta\nNitrosomonas_europaea_ATCC_19718.fasta\nRoseobacter_litoralis_Och_149.fasta\nXanthomonas_axonopodis_Xac29-1.fasta" | xargs -n 1 -I % bash -c "ln -s  ~/notebook/deltaGC_notes/data/genome_data/prok-bac-genomes_Ngaps/% genomes/"
    * all genomes symlinked

Combining and editing names (using file names)

$ find genomes/ -name "*fasta" | xargs -n 1 -I % bash -c "perl -p -e 'BEGIN{\$in=\$ARGV[0]; \$in=~s/.+\/|\.[^.]+\$//g} s/^>.+/>\$in/' % " > genome10.fna



Grinder simulation (10000 reads)

exponential abundance model

grinder -reference_file genome10.fna -fr ../515Fm-927Rm.fna -am exponential \
    -rd 215 normal 50 -tr 10000 -length_bias 0 -unidirectional 1 -base_name genome10

Making a tree of the genomes

16S rRNA tree

Using RNAmmer to get all 16S rRNA genes from each genome

Getting already created rnammer output files

$ mkdir rnammer

$ find genomes/ -name "*fasta" | perl -pe 's/.+\///; s/\.fasta/_rrn.fna/' | xargs -n 1 -I % bash -c "ln -s /var/seq_data/ncbi_db/genome/rnammer/bacteria_rrn/% rnammer/"

Filtering out poor hits

$ find rnammer/ -name "*_rrn.fna" | perl -pe 's/(.+)(\.fna)/$1$2\n$1\_s2k$2/g' | parallel -q -N 2 bash -c 'perl ~/dev/NY_misc/bin/rnammer_filt.py -s 2000 - <{1} > {2}'
    * score >= 2000

Making consensus 16S sequence for each

$ find rnammer/ -name "*_s2k.fna" | perl -pe 's/(.+)(\.fna)/$1$2\n$1\_con$2/g' | parallel -q -N 2 bash -c "~/dev/NY_misc_perl/bin/consensus_seq.pl -i - -t 30 -n {1} < {1} > {2}"
    * low threshold used to minimize ambiguous

Editing names & combining ('?' -> 'N')

$ find rnammer/ -name "*con.fna" | xargs -n 1 -I % bash -c "perl -pe 's/.+\//>/;s/_rrn_s2k\.fna//; s/\?/N/g' %" > genome10_rrn-con.fna

Aligning with mafft

$ mafft-linsi --thread 20 genome10_rrn-con.fna > genome10_rrn-con_aln.fna

Making ML tree with RAxML

@ system76-server:~/notebook/grinderSIP/dev/bac_genome10/raxml

$ alignIO.py ../genome10_rrn-con_aln.fna genome10_rrn-con_aln.phy

$ raxmlHPC-PTHREADS-AVX -f a -x 0318 -p 0911 -# 100 -m GTRGAMMA -s genome10_rrn-con_aln.phy -n genome10_rrn-con_ML -T 20

$ mv RAxML_bipartitions.genome10_rrn-con_ML genome10_rrn-con_ML.nwk


Making incorp file (min-max incorp: 0-100)

@ system76-server:~/notebook/grinderSIP/dev/bac_genome10

Incorp: min0, max100; weight=0.5

$ perl ../../bin/makeIncorp_phylo.pl -r genome10-ranks.txt -t raxml/genome10_rrn-con_ML.nwk > genome10_inc0-100_w0.5.txt

Incorp: min0, max100; weight=1

$ perl ../../bin/makeIncorp_phylo.pl -r genome10-ranks.txt -t raxml/genome10_rrn-con_ML.nwk -w 1 > genome10_inc0-100_w1.txt

Incorp: min0, max100; weight=0

$ perl ../../bin/makeIncorp_phylo.pl -r genome10-ranks.txt -t raxml/genome10_rrn-con_ML.nwk -w 0 > genome10_inc0-100_w0.txt

running grinderSIP.pl

$ perl ../../bin/grinderSIP.pl -g genome10.fna -r genome10-reads.fa -inc genome10_inc0-100_w1.txt -out readsByFrac_inc0-100_w1 | less

checking that all reads written

$ nseq readsByFrac_inc0-100_w1/isoIncorp/*fasta | perl -ne 'BEGIN{$sum=0} /.+:(\d+)/; $sum+=$1; END{print $sum,"\n"}'
    * yes, all written

In [50]:
# Checking on character trait evolution through ordering by tree
import os
import glob

pwd = '/home/nick/notebook/grinderSIP/dev/bac_genome10/'

## getting tree labels
cmd = 'nw_labels -I ' + pwd + 'raxml/genome10_rrn-con_ML.nwk'
ret = os.popen(cmd, 'r')
labels = [i.strip() for i in ret]

## ordering & printing tables
def order_table(csv_file, order_list):
    fh = open(csv_file, 'r')
    tbl = [i.rstrip().split('\t') for i in fh]
    tbl = dict(tbl)
    for i in labels:
        print '\t'.join([csv_file, i, tbl[i]])
        
# writing dict(tbl) in order of labels
files = glob.glob(pwd + '*_w*.txt')
for f in files:
    order_table(f, labels)


/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Escherichia_coli_APEC_O78	89.57
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Idiomarina_loihiensis_GSL_199	69.70
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Xanthomonas_axonopodis_Xac29-1	57.66
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Nitrosomonas_europaea_ATCC_19718	8.87
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Bacillus_cereus_F837_76	100.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Geobacter_lovleyi_SZ	70.05
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Leuconostoc_citreum_KM20	27.61
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Micrococcus_luteus_NCTC_2665	0.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Cyanobium_gracile_PCC_6307	31.37
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.txt	Roseobacter_litoralis_Och_149	87.14
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Escherichia_coli_APEC_O78	41.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Idiomarina_loihiensis_GSL_199	49.54
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Xanthomonas_axonopodis_Xac29-1	22.18
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Nitrosomonas_europaea_ATCC_19718	26.64
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Bacillus_cereus_F837_76	100.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Geobacter_lovleyi_SZ	62.53
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Leuconostoc_citreum_KM20	53.11
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Micrococcus_luteus_NCTC_2665	54.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Cyanobium_gracile_PCC_6307	24.64
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w1.txt	Roseobacter_litoralis_Och_149	0.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Escherichia_coli_APEC_O78	35.12
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Idiomarina_loihiensis_GSL_199	38.34
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Xanthomonas_axonopodis_Xac29-1	54.47
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Nitrosomonas_europaea_ATCC_19718	81.63
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Bacillus_cereus_F837_76	66.40
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Geobacter_lovleyi_SZ	3.79
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Leuconostoc_citreum_KM20	100.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Micrococcus_luteus_NCTC_2665	0.00
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Cyanobium_gracile_PCC_6307	23.78
/home/nick/notebook/grinderSIP/dev/bac_genome10/genome10_inc0-100_w0.5.txt	Roseobacter_litoralis_Och_149	50.25