Description

  • making a sliding window GC distribution around the 16S rRNA gene

Genome dataset:

  • just 1 member of each species
  • using all genomes with amplicons generated from GRINDER run
  • using ssu genes identified by rnammer

Bacteria

  • 1091 of 1162 bacterial genomes

filtering reads

  • just need 1 amplicon from each 16S gene copy in each genome

@ system76-server:~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist

screen -S read_filter -L bash -c "grinder_readFilter.pl -r ../amp454-L215-N1k_reads.fa > amp454-L215-N1k_reads_filt.fa"

    * overlap window = 100bp

Number of reads pre-filtering: 1091000
Number remaining post-filtering: 4105

getting GC distribution (sliding window)

@ system76-server:~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist

screen -S GC_dist -L bash -c "GC_dist.pl -r amp454-L215-N1k_reads_filt.fa -g ../../prokBac-CompRep.fna -a -t 12 > prokBac-CompRep_16S_GC-byWindow.txt"
    * took ~1 min 

Making plots of GC distribution


In [5]:
%load_ext rmagic
%load_ext rpy2.ipython


The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

In [6]:
%%R
require(ggplot2)
require(data.table)
require(gridExtra)

In [7]:
%%R
tbl = fread("~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist/prokBac-CompRep_16S_GC-byWindow.txt", 
            header=T)

## setting position around center
tbl[,pos_local:=(pos_local - 10000)]


                                                                                    genome
     1:                CP001579 Brucella microti CCM 4915 chromosome 2, complete sequence.
     2:                CP001579 Brucella microti CCM 4915 chromosome 2, complete sequence.
     3:                CP001579 Brucella microti CCM 4915 chromosome 2, complete sequence.
     4:                CP001579 Brucella microti CCM 4915 chromosome 2, complete sequence.
     5:                CP001579 Brucella microti CCM 4915 chromosome 2, complete sequence.
    ---                                                                                   
803463: CP002585 Pseudomonas brassicacearum subsp. brassicacearum NFM421, complete genome.
803464: CP002585 Pseudomonas brassicacearum subsp. brassicacearum NFM421, complete genome.
803465: CP002585 Pseudomonas brassicacearum subsp. brassicacearum NFM421, complete genome.
803466: CP002585 Pseudomonas brassicacearum subsp. brassicacearum NFM421, complete genome.
803467: CP002585 Pseudomonas brassicacearum subsp. brassicacearum NFM421, complete genome.
            scaf                  readID read_num pos_local pos_global   GC
     1: CP001579 CP001579__CP001579__569        1     -9750    1112151 58.6
     2: CP001579 CP001579__CP001579__569        1     -9650    1112351 59.4
     3: CP001579 CP001579__CP001579__569        1     -9550    1112551 55.0
     4: CP001579 CP001579__CP001579__569        1     -9450    1112751 53.2
     5: CP001579 CP001579__CP001579__569        1     -9350    1112951 53.8
    ---                                                                    
803463: CP002585 CP002585__CP002585__319        5      9350    5837078 65.0
803464: CP002585 CP002585__CP002585__319        5      9450    5837278 64.6
803465: CP002585 CP002585__CP002585__319        5      9550    5837478 64.2
803466: CP002585 CP002585__CP002585__319        5      9650    5837678 64.8
803467: CP002585 CP002585__CP002585__319        5      9750    5837878 63.8
        buoyant_density
     1:        1.717428
     2:        1.718212
     3:        1.713900
     4:        1.712136
     5:        1.712724
    ---                
803463:        1.723700
803464:        1.723308
803465:        1.722916
803466:        1.723504
803467:        1.722524

In [8]:
%%R
# plotting interquartile ranges of values at each relative position
## median & stdev by pos_local
tbl.median = tbl[, median(GC), by="pos_local"]
setkey(tbl.median, pos_local)
tbl.stdev = tbl[, sd(GC), by="pos_local"]
setkey(tbl.stdev, pos_local)

# merging
tbl.sum = merge(tbl.median, tbl.stdev)
colnames(tbl.sum) = c('pos_local', 'median', 'stdev')

# ymin & ymax
tbl.sum$ymin = tbl.sum$median - tbl.sum$stdev
tbl.sum$ymax = tbl.sum$median + tbl.sum$stdev

In [23]:
%%R -w 12 -h 4 -u in

p = ggplot(tbl.sum, aes(x=pos_local, y=median, ymin=ymin, ymax=ymax))+
        geom_pointrange() +
        labs(y='Median G+C content (%)', x='Nucleotide position relative to the middle of the 16S rRNA gene (bp)') +
        theme( 
            text = element_text(size=16),
            axis.title.x = element_text(vjust=0),               
            axis.title.y = element_text(vjust=1),
            axis.text.x = element_text(color='black'),
            axis.text.y = element_text(color='black')
            )

#ggsave(p, 
#       file="~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist/prokBac-CompRep_16S_GC-byWindow_sd.pdf",
#        width=12, height=4, units='in')
print(p)
#dev.off()
xxx=1


Example taxa

  • Clostridium ljungdahlii DSM 13528, complete genome
  • Escherichia coli APEC O78, complete genome
  • Streptomyces griseus subsp. griseus NBRC 13350 DNA, complete genome
  • Lactobacillus gasseri ATCC 33323, complete genome

In [35]:
%%R
print(apply(tbl, 2, class))


         genome            scaf          readID        read_num       pos_local 
    "character"     "character"     "character"     "character"     "character" 
     pos_global              GC buoyant_density 
    "character"     "character"     "character" 

In [70]:
%%R -w 1000 -h 400
# parsing out select taxa to show GC content
tbl$genome = gsub("\n", " ", tbl$genome)

## unique list of genomes
u.genome = unique(tbl$genome)

## selecting examples
examples = c(
"CP001666 Clostridium ljungdahlii DSM 13528, complete genome.",
"CP004009 Escherichia coli APEC O78, complete genome.",
"AP009493 Streptomyces griseus subsp. griseus NBRC 13350 DNA, complete genome.",  
"CP000413 Lactobacillus gasseri ATCC 33323, complete genome."
)
q = paste(examples, collapse="|");
tbl.p = tbl[grep(q, tbl$genome),]
tbl.p$read_num = as.character(tbl.p$read_num)


## editing names
new.names = c(
"C. ljungdahlii",
"E. coli",
"S. griseus",
"L. gasseri"
)

for(i in 1:length(examples)){
    tbl.p$genome = gsub(examples[i], new.names[i], tbl.p$genome)
    }

## ordering
tbl.p$genome = factor(tbl.p$genome, levels=new.names)

## plotting 
p.ex = ggplot(tbl.p, aes(pos_local, GC, group=readID, color=read_num)) +
        geom_line(alpha=0.5, size=0.7) +
        facet_grid(genome ~ .) +
        scale_y_continuous(limits=c(15,85)) +
        scale_color_discrete(name='Copy\nnumber') +
        labs(y='G+C content (%)', x='Nucleotide position relative to the middle of the 16S rRNA gene (bp)') +
        theme( 
            text = element_text(size=16),
            strip.text = element_text(face="italic"),
            axis.title.x = element_text(vjust=0),   
            axis.title.y = element_text(vjust=1),
            axis.text.x = element_text(color='black'),
            axis.text.y = element_text(color='black')
            )


## adding arrows to L gasseri plot
arrow.tbl = data.frame(genome='L. gasseri', readID="1", read_num="1", x=c(-6300, -5000), xend=c(-6300, -5000), y=82, yend=55)
p.ex = p.ex + geom_segment(data=arrow.tbl,
                           arrow = arrow(length = unit(0.15,"cm")),
                           aes(x=x, xend=xend, y=y, yend=yend, genome=genome)
                            )

#ggsave(p.ex, 
#       file="~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist/prokBac-CompRep_16S_GC-byWindow_Ex.pdf",
#        width=12, height=5, units='in')
print(p.ex)



In [74]:
%%R -w 13 -h 9 -u in
# combining the plots
blankPanel = grid.rect(gp=gpar(col="white"))

## making plot
#tiff(file="~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist/prokBac-CompRep_16S_GC-byWindow_comb.tiff",
#        width=13, height=9, units='in', res=300)
#svg(file="~/notebook/deltaGC_notes/data/amplicon/140331_allBac/amp454-L215-N1k/GC_dist/prokBac-CompRep_16S_GC-byWindow_comb.svg",
#        width=13, height=9)

grid.arrange(arrangeGrob(p, blankPanel, ncol=2, widths=c(8.9,1)), p.ex, ncol=1, heights=c(1.3,2))

#dev.off()

xxx=1



Calculating GC stats for ssu, lsu, tsu, and 16S-23S-ITS region

@ system76-server:~/notebook/deltaGC_notes/data/amplicon/140411_allBac/amp454-L215-N1k/140428_rrn_GCstats

Getting all rnammer ssu hits

egrep -v "^#" /var/seq_data/ncbi_db/genome/rnammer/bacteria_ssu/*gff | perl -pe 's/:/\t/' | grep -f <(grep ">" ../amp454-L215-N1k_reads.fa | perl -pe 's/.+description="|"//g;s/$/_rrn.gff/' | sort -u) > rnammer_bac-amp454.txt

Calculating stats

$ rnammer_rrn_GCstats.pl -g <(perl -pe 's/_rrn.gff//' rnammer_bac-amp454.txt) -f /var/seq_data/ncbi_db/genome/prok-bac-genomes/ --debug | less

gene    min mean    median  max stdev
5s_rRNA 41.667  57.83   57.522  71.552  5.078
23s_rRNA    22.421  52.703  52.508  66.031  3.029
ITS 9.375   44.256  44.136  73.42   8.816
16s_rRNA    35.178  54.704  54.525  65.911  2.863

Archaea

GC_dist on the NCBI archaeal genome dataset

  • using genome & read files from notebook 'deltaGC_arc_amp_sims'

Filtering reads

  • just need 1 amplicon from each 16S gene copy in each genome

@ system76-server:~/notebook/deltaGC_notes/data/amplicon/140403_allArc/amp454-L215-N1k/GC_dist

grinder_readFilter.pl -r ../amp454-L215-N1k_reads.fa > amp454-L215-N1k_reads_filt.fa
    * overlap window = 100bp

Number of reads pre-filtering: 101000
Number remaining post-filtering: 175

Getting GC distribution (sliding window)

@ system76-server:~/notebook/deltaGC_notes/data/amplicon/140403_allArc/amp454-L215-N1k/GC_dist

GC_dist.pl -r amp454-L215-N1k_reads_filt.fa -g ../../prokArc-CompRep.fna -a -t 12 > prokArc-CompRep_16S_GC-byWindow.txt

Making plots of GC distribution


In [1]:
%load_ext rmagic


/opt/anaconda/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module argparse was already imported from /opt/anaconda/lib/python2.7/argparse.pyc, but /opt/anaconda/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream

In [2]:
%%R
require(ggplot2)
require(data.table)


Loading required package: ggplot2
Loading required package: data.table
data.table 1.9.2  For help type: help("data.table")

In [3]:
%%R
tbl = fread("~/notebook/deltaGC_notes/data/amplicon/140403_allArc/amp454-L215-N1k/GC_dist/prokArc-CompRep_16S_GC-byWindow.txt", 
            header=T)

## setting position around center
tbl[,pos_local:=(pos_local - 10000)]


                                                                    genome
    1: CP006885 Haloarcula hispanica N601 chromosome 2, complete sequence.
    2: CP006885 Haloarcula hispanica N601 chromosome 2, complete sequence.
    3: CP006885 Haloarcula hispanica N601 chromosome 2, complete sequence.
    4: CP006885 Haloarcula hispanica N601 chromosome 2, complete sequence.
    5: CP006885 Haloarcula hispanica N601 chromosome 2, complete sequence.
   ---                                                                    
34220:            CP003377 Natronobacterium gregoryi SP2, complete genome.
34221:            CP003377 Natronobacterium gregoryi SP2, complete genome.
34222:            CP003377 Natronobacterium gregoryi SP2, complete genome.
34223:            CP003377 Natronobacterium gregoryi SP2, complete genome.
34224:            CP003377 Natronobacterium gregoryi SP2, complete genome.
           scaf                  readID read_num pos_local pos_global   GC
    1: CP006885 CP006885__CP006885__307        1     -9750        251 47.6
    2: CP006885 CP006885__CP006885__307        1     -9650        451 47.8
    3: CP006885 CP006885__CP006885__307        1     -9550        651 48.4
    4: CP006885 CP006885__CP006885__307        1     -9450        851 49.6
    5: CP006885 CP006885__CP006885__307        1     -9350       1051 51.4
   ---                                                                    
34220: CP003377 CP003377__CP003377__880        3      9350    1362230 57.6
34221: CP003377 CP003377__CP003377__880        3      9450    1362430 57.8
34222: CP003377 CP003377__CP003377__880        3      9550    1362630 58.4
34223: CP003377 CP003377__CP003377__880        3      9650    1362830 59.8
34224: CP003377 CP003377__CP003377__880        3      9750    1363030 59.2
       buoyant_density
    1:        1.706648
    2:        1.706844
    3:        1.707432
    4:        1.708608
    5:        1.710372
   ---                
34220:        1.716448
34221:        1.716644
34222:        1.717232
34223:        1.718604
34224:        1.718016

In [6]:
%%R -w 900 -h 300
# plotting interquartile ranges of values at each relative position
## median & stdev by pos_local
tbl.median = tbl[, median(GC), by="pos_local"]
setkey(tbl.median, pos_local)
tbl.stdev = tbl[, sd(GC), by="pos_local"]
setkey(tbl.stdev, pos_local)

# merging
tbl.sum = merge(tbl.median, tbl.stdev)
colnames(tbl.sum) = c('pos_local', 'median', 'stdev')

# ymin & ymax
tbl.sum$ymin = tbl.sum$median - tbl.sum$stdev
tbl.sum$ymax = tbl.sum$median + tbl.sum$stdev

# plotting
p = ggplot(tbl.sum, aes(x=pos_local, y=median, ymin=ymin, ymax=ymax))+
        geom_pointrange() +
        labs(y='Median GC content (%)', x='Nucleotide position relative to the middle of the 16S rRNA gene') +
        theme( text = element_text(size=18))

ggsave(file="~/notebook/deltaGC_notes/data/amplicon/140403_allArc/amp454-L215-N1k/GC_dist/prokArc-CompRep_16S_GC-byWindow_sd.tiff",
        width=12, height=4, units='in', dpi=300)
print(p)
xxx=1


Calculating GC stats for ssu, lsu, tsu, and 16S-23S-ITS region

Getting all rnammer ssu hits

egrep -v "^#" /var/seq_data/ncbi_db/genome/rnammer/archaea_ssu/*gff | perl -pe 's/:/\t/' | grep -f <(grep ">" ../amp454-L215-N1k_reads.fa | perl -pe 's/.+description="|"//g;s/$/_rrn.gff/' | sort -u) > rnammer_bac-amp454.txt

Calculating GC content stats

@ system76-server:~/notebook/deltaGC_notes/data/amplicon/140403_allArc/amp454-L215-N1k/140428_rrn_GCstats

$ rnammer_rrn_GCstats.pl -g <(perl -pe 's/_rrn.gff//' rnammer_arc-amp454.txt) -f /var/seq_data/ncbi_db/genome/prok-arc-genomes/ --debug | less

In [ ]: