Goal

  • simulating amplicon fragments for genomes in non-singleton OTUs

Setting variables


In [1]:
import os

workDir = '/var/seq_data/ncbi_db/genome/Jan2016/ampFragsGC/'
ampFragFile = '/var/seq_data/ncbi_db/genome/Jan2016/ampFrags_KDE.pkl'
otuFile = '/var/seq_data/ncbi_db/genome/Jan2016/rnammer_aln/otusn_map_nonSingle.txt'

Init


In [2]:
import dill
import numpy as np
import pandas as pd
%load_ext rpy2.ipython
%load_ext pushnote

In [3]:
%%R
library(dplyr)
library(tidyr)
library(ggplot2)


/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘dplyr’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:stats’:

    filter, lag


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


  res = super(Function, self).__call__(*new_args, **new_kwargs)

In [4]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)
    
%cd $workDir


/var/seq_data/ncbi_db/genome/Jan2016/ampFragsGC

gradient params


In [5]:
# max 13C shift
max_13C_shift_in_BD = 0.036
# min BD (that we care about)
min_GC = 13.5
min_BD = min_GC/100.0 * 0.098 + 1.66
# max BD (that we care about)
max_GC = 80
max_BD = max_GC / 100.0 * 0.098 + 1.66    # 80.0% G+C
max_BD = max_BD + max_13C_shift_in_BD
## BD range of values
BD_vals = np.arange(min_BD, max_BD, 0.001)

Get GC distribution info


In [6]:
infoFile = os.path.splitext(ampFragFile)[0] + '_info.txt'
infoFile = os.path.join(workDir, os.path.split(infoFile)[1])
!SIPSim KDE_info -s $ampFragFile > $infoFile

!wc -l $infoFile
!head -n 4 $infoFile


Loading KDEs...
6243 /var/seq_data/ncbi_db/genome/Jan2016/ampFragsGC/ampFrags_KDE_info.txt
taxon_ID	KDE_ID	min	percentile_5	percentile_25	mean	median	percentile_75	percentile_95	max	stdev
Bacillus_anthracis_str_Vollum	1	1.69393602024	1.697711297	1.70058504179	1.70313160459	1.70329429346	1.70540312694	1.70904095416	1.71340671141	0.0033717331554
Bacillus_anthracis_str_Vollum	2	373.0	4120.85	6119.0	7037.1497	7293.0	8200.0	9073.0	10425.0	1550.9421936
Staphylococcus_lugdunensis_HKU09-01	1	1.69322436392	1.69574485401	1.69910162303	1.70196230742	1.70208794754	1.70541285861	1.70705158605	1.71137644342	0.00367910493412

In [7]:
%%R -i infoFile

df.info = read.delim(infoFile, sep='\t') %>%
    mutate(genus_ID = gsub('_.+', '', taxon_ID),
           species_ID = gsub('^([^_]+_[^_]+).+', '\\1', taxon_ID))

df.info %>% head(n=3)


                             taxon_ID KDE_ID        min percentile_5
1       Bacillus_anthracis_str_Vollum      1   1.693936     1.697711
2       Bacillus_anthracis_str_Vollum      2 373.000000  4120.850000
3 Staphylococcus_lugdunensis_HKU09-01      1   1.693224     1.695745
  percentile_25        mean      median percentile_75 percentile_95
1      1.700585    1.703132    1.703294      1.705403      1.709041
2   6119.000000 7037.149700 7293.000000   8200.000000   9073.000000
3      1.699102    1.701962    1.702088      1.705413      1.707052
           max        stdev       genus_ID                 species_ID
1     1.713407 3.371733e-03       Bacillus         Bacillus_anthracis
2 10425.000000 1.550942e+03       Bacillus         Bacillus_anthracis
3     1.711376 3.679105e-03 Staphylococcus Staphylococcus_lugdunensis

Combining info table with OTU tabel


In [13]:
%%R -i otuFile

df.OTU = read.delim(otuFile, sep='\t', header=FALSE) %>%
    mutate(genome_ID = gsub('\\.fna', '', V13)) %>%
    select(genome_ID, V2) %>%
    rename('OTU_ID' = V2)

df.info.j = inner_join(df.info, df.OTU, c('taxon_ID' = 'genome_ID'))
df.OTU = NULL

df.info.j %>% head(n=3)


                       taxon_ID KDE_ID      min percentile_5 percentile_25
1 Bacillus_anthracis_str_Vollum      1 1.693936     1.697711      1.700585
2 Bacillus_anthracis_str_Vollum      1 1.693936     1.697711      1.700585
3 Bacillus_anthracis_str_Vollum      1 1.693936     1.697711      1.700585
      mean   median percentile_75 percentile_95      max       stdev genus_ID
1 1.703132 1.703294      1.705403      1.709041 1.713407 0.003371733 Bacillus
2 1.703132 1.703294      1.705403      1.709041 1.713407 0.003371733 Bacillus
3 1.703132 1.703294      1.705403      1.709041 1.713407 0.003371733 Bacillus
          species_ID  OTU_ID
1 Bacillus_anthracis OTU.159
2 Bacillus_anthracis OTU.159
3 Bacillus_anthracis OTU.159

In [31]:
%%R
df.info.j.f1 = df.info.j %>% 
    filter(KDE_ID == 1) %>%
    distinct(taxon_ID, OTU_ID) %>%
    group_by(OTU_ID) %>%
    mutate(n_taxa = n()) %>%
    ungroup() %>%
    filter(n_taxa > 1)

df.info.j.f1 %>% nrow %>% print
df.info.j.f1 %>% head(n=3) %>% as.data.frame


[1] 2634
                                taxon_ID KDE_ID      min percentile_5
1          Bacillus_anthracis_str_Vollum      1 1.693936     1.697711
2    Staphylococcus_lugdunensis_HKU09-01      1 1.693224     1.695745
3 Escherichia_coli_str_K-12_substr_MDS42      1 1.708155     1.709060
  percentile_25     mean   median percentile_75 percentile_95      max
1      1.700585 1.703132 1.703294      1.705403      1.709041 1.713407
2      1.699102 1.701962 1.702088      1.705413      1.707052 1.711376
3      1.709942 1.710383 1.710368      1.710810      1.711587 1.715662
        stdev       genus_ID                 species_ID  OTU_ID n_taxa
1 0.003371733       Bacillus         Bacillus_anthracis OTU.159     58
2 0.003679105 Staphylococcus Staphylococcus_lugdunensis OTU.458     65
3 0.000740320    Escherichia           Escherichia_coli OTU.465     98

In [44]:
%%R -h 4000 -w 800

df.info.j.f1$taxon_ID = reorder(df.info.j.f1$taxon_ID, df.info.j.f1$genus_ID)
df.info.j.f1$OTU_ID = reorder(df.info.j.f1$OTU_ID, -df.info.j.f1$n_taxa)

ggplot(df.info.j.f1, aes(x=taxon_ID, y=median, 
                       ymin=percentile_25, ymax=percentile_75,
                       color=species_ID)) +
    geom_linerange() +
    geom_point(size=1) +
    facet_wrap(~ OTU_ID, scales='free_x', ncol=8) +
    theme_bw() +
    theme(
        axis.text.x = element_blank(),
        legend.position='none'
    )


Conclusions

The large figure that I’ve attached shows what I’ve been suspecting: we can’t really match up OTUs from our dataset to the NCBI genomes, so an ‘apples to apples’ comparison for the simulations at the OTU-level is not really feasible.

  • Explanation of the figure:
    • Downloaded all complete bacterial genomes from NCBI (> 3500)
    • Used rnammer to ID rRNA SSU gene sequences
    • Simulated 16S amplicons with in silico PCR using the 515F-806R primers
    • Amplicon sequence QC & OTU binning (97% seqID) using our normal pipeline
    • Selected all genomes from all non-singleton OTUs (want to look at amplicon fragment G+C variability at the intra-OTU level)
    • Simulated amplicon fragments using my HR-SIP simulation
    • Assessed G+C variability (units in buoyant density)

The plots are showing the amplicon-fragment G+C (BD) variability of each genome that falls into each non-singleton OTU. The line range is 25th percentile to 75th percentile. The point is the median. Colors indicate genus.

Some of the OTUs are very consistent (eg., OTU.465), but many others (OTU.876) are not. So, the amplicon-fragments of OTUs in our datasets will not necessarily match the reference genomes that I’m using for the simulations.

Also of note: some OTUs aggregate genomes as the genus level, but this would probably happen with most clustering methods unless we used at cutoff of ~100% seqID.


In [ ]: