grinder run
@ system76-server:~/notebook/grinderSIP/t/example
with abundance profile
grinder -reference_file genome3.fna -fr 515Fm-927Rm.fna \
-rd 215 normal 50 -tr 1000 -length_bias 0 -unidirectional 1 \
-af abnd_prof.txt
with abundance model
grinder -reference_file genome3.fna -fr 515Fm-927Rm.fna -am exponential \
-rd 215 normal 50 -tr 1000 -length_bias 0 -unidirectional 1
with multiple samples
grinder -reference_file genome3.fna -fr 515Fm-927Rm.fna -am exponential \
-rd 215 normal 50 -tr 1000 -length_bias 0 -unidirectional 1 -num_libraries 2 \
-shared_perc 100
@ system76-server:~/notebook/grinderSIP/dev
$ perl ../bin/makeIncorp_phylo.pl -r grinder-ranks.txt -t genome3.nwk > genome3_inc100.txt
$ perl ../bin/makeIncorp_phylo.pl -r grinder-ranks.txt -t genome3.nwk -range 0-50 > genome3_inc50.txt
CHANGES to grindeSIP.pl: Fragments written to each fraction file, then grinder is run on each
$ perl ../bin/grinderSIP.pl -g genome3.fna -r grinder-reads.fa -inc genome3_inc100.txt | less
checking that all reads written
$ nseq readsByFrac/isoIncorp/*fasta | perl -ne 'BEGIN{$sum=0} /.+:(\d+)/; $sum+=$1; END{print $sum,"\n"}'
* yes, all written
@ system76-server:~/notebook/grinderSIP/dev/bac_genome10
Randomly selected 10 genomes
Bacillus_cereus_F837_76.fasta
Cyanobium_gracile_PCC_6307.fasta
Escherichia_coli_APEC_O78.fasta
Geobacter_lovleyi_SZ.fasta
Idiomarina_loihiensis_GSL_199.fasta
Leuconostoc_citreum_KM20.fasta
Micrococcus_luteus_NCTC_2665.fasta
Nitrosomonas_europaea_ATCC_19718.fasta
Roseobacter_litoralis_Och_149.fasta
Xanthomonas_axonopodis_Xac29-1.fasta
Getting genomes from collection
$ printf "Bacillus_cereus_F837_76.fasta\nCyanobium_gracile_PCC_6307.fasta\nEscherichia_coli_APEC_O78.fasta\nGeobacter_lovleyi_SZ.fasta\nIdiomarina_loihiensis_GSL_199.fasta\nLeuconostoc_citreum_KM20.fasta\nMicrococcus_luteus_NCTC_2665.fasta\nNitrosomonas_europaea_ATCC_19718.fasta\nRoseobacter_litoralis_Och_149.fasta\nXanthomonas_axonopodis_Xac29-1.fasta" | xargs -n 1 -I % bash -c "grep -c '>' ~/notebook/deltaGC_notes/data/genome_data/prok-bac-genomes_Ngaps/%"
* all have 1 genome
$ printf "Bacillus_cereus_F837_76.fasta\nCyanobium_gracile_PCC_6307.fasta\nEscherichia_coli_APEC_O78.fasta\nGeobacter_lovleyi_SZ.fasta\nIdiomarina_loihiensis_GSL_199.fasta\nLeuconostoc_citreum_KM20.fasta\nMicrococcus_luteus_NCTC_2665.fasta\nNitrosomonas_europaea_ATCC_19718.fasta\nRoseobacter_litoralis_Och_149.fasta\nXanthomonas_axonopodis_Xac29-1.fasta" | xargs -n 1 -I % bash -c "ln -s ~/notebook/deltaGC_notes/data/genome_data/prok-bac-genomes_Ngaps/% genomes/"
* all genomes symlinked
Combining and editing names (using file names)
$ find genomes/ -name "*fasta" | xargs -n 1 -I % bash -c "perl -p -e 'BEGIN{\$in=\$ARGV[0]; \$in=~s/.+\/|\.[^.]+\$//g} s/^>.+/>\$in/' % " > genome10.fna
exponential abundance model
grinder -reference_file genome10.fna -fr ../515Fm-927Rm.fna -am exponential \
-rd 215 normal 50 -tr 10000 -length_bias 0 -unidirectional 1 -base_name genome10
16S rRNA tree
Getting already created rnammer output files
$ mkdir rnammer
$ find genomes/ -name "*fasta" | perl -pe 's/.+\///; s/\.fasta/_rrn.fna/' | xargs -n 1 -I % bash -c "ln -s /var/seq_data/ncbi_db/genome/rnammer/bacteria_rrn/% rnammer/"
Filtering out poor hits
$ find rnammer/ -name "*_rrn.fna" | perl -pe 's/(.+)(\.fna)/$1$2\n$1\_s2k$2/g' | parallel -q -N 2 bash -c 'perl ~/dev/NY_misc/bin/rnammer_filt.py -s 2000 - <{1} > {2}'
* score >= 2000
$ find rnammer/ -name "*_s2k.fna" | perl -pe 's/(.+)(\.fna)/$1$2\n$1\_con$2/g' | parallel -q -N 2 bash -c "~/dev/NY_misc_perl/bin/consensus_seq.pl -i - -t 30 -n {1} < {1} > {2}"
* low threshold used to minimize ambiguous
Editing names & combining ('?' -> 'N')
$ find rnammer/ -name "*con.fna" | xargs -n 1 -I % bash -c "perl -pe 's/.+\//>/;s/_rrn_s2k\.fna//; s/\?/N/g' %" > genome10_rrn-con.fna
$ mafft-linsi --thread 20 genome10_rrn-con.fna > genome10_rrn-con_aln.fna
@ system76-server:~/notebook/grinderSIP/dev/bac_genome10/raxml
$ alignIO.py ../genome10_rrn-con_aln.fna genome10_rrn-con_aln.phy
$ raxmlHPC-PTHREADS-AVX -f a -x 0318 -p 0911 -# 100 -m GTRGAMMA -s genome10_rrn-con_aln.phy -n genome10_rrn-con_ML -T 20
$ mv RAxML_bipartitions.genome10_rrn-con_ML genome10_rrn-con_ML.nwk
@ system76-server:~/notebook/grinderSIP/dev/bac_genome10
Incorp: min0, max100; weight=0.5
$ perl ../../bin/makeIncorp_phylo.pl -r genome10-ranks.txt -t raxml/genome10_rrn-con_ML.nwk > genome10_inc0-100_w0.5.txt
Incorp: min0, max100; weight=1
$ perl ../../bin/makeIncorp_phylo.pl -r genome10-ranks.txt -t raxml/genome10_rrn-con_ML.nwk -w 1 > genome10_inc0-100_w1.txt
Incorp: min0, max100; weight=0
$ perl ../../bin/makeIncorp_phylo.pl -r genome10-ranks.txt -t raxml/genome10_rrn-con_ML.nwk -w 0 > genome10_inc0-100_w0.txt
$ perl ../../bin/grinderSIP.pl -g genome10.fna -r genome10-reads.fa -inc genome10_inc0-100_w1.txt -out readsByFrac_inc0-100_w1 | less
checking that all reads written
$ nseq readsByFrac_inc0-100_w1/isoIncorp/*fasta | perl -ne 'BEGIN{$sum=0} /.+:(\d+)/; $sum+=$1; END{print $sum,"\n"}'
* yes, all written
In [50]:
# Checking on character trait evolution through ordering by tree
import os
import glob
pwd = '/home/nick/notebook/grinderSIP/dev/bac_genome10/'
## getting tree labels
cmd = 'nw_labels -I ' + pwd + 'raxml/genome10_rrn-con_ML.nwk'
ret = os.popen(cmd, 'r')
labels = [i.strip() for i in ret]
## ordering & printing tables
def order_table(csv_file, order_list):
fh = open(csv_file, 'r')
tbl = [i.rstrip().split('\t') for i in fh]
tbl = dict(tbl)
for i in labels:
print '\t'.join([csv_file, i, tbl[i]])
# writing dict(tbl) in order of labels
files = glob.glob(pwd + '*_w*.txt')
for f in files:
order_table(f, labels)