Goal

  • If the DNA species distribution is truely Gaussian in a buoyant density gradient, then what sigma would be needed to reproduce the detection of all taxa > 0.1% in abundance throughout the entire gradient
  • If 1e10 16S rRNA copies in community, then 0.1% abundant taxon = 1e7
    • If detection limit = 1 molecule, then probability density of normal distribution across the entire gradient that we sequence must be >= 1e-7
      • ie., at least 1 of the 1e7 16S rRNA DNA molecules in every gradient fraction

Method

  • assess PDF across gradient for different levels of sigma

Setting parameters


In [4]:
%load_ext rpy2.ipython


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

In [77]:
workDir = '/home/nick/notebook/SIPSim/dev/fullCyc/frag_norm_9_2.5_n5/default_run/'

In [75]:
%%R

sigmas = seq(1, 50, 1)
means = seq(30, 70, 1)    # mean GC content of 30 to 70%

## max 13C shift
max_13C_shift_in_BD = 0.036
## min BD (that we care about)
min_GC = 13.5
min_BD = min_GC/100.0 * 0.098 + 1.66
## max BD (that we care about)
max_GC = 80
max_BD = max_GC / 100.0 * 0.098 + 1.66    # 80.0% G+C
max_BD = max_BD + max_13C_shift_in_BD

Init


In [52]:
%%R
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)

In [79]:
import numpy as np         
import pandas as pd                                                             
import scipy.stats as stats
import dill

In [54]:
%%R

GC2BD = function(GC) GC / 100.0 * 0.098 + 1.66     
GC2BD(50) %>% print
BD2GC = function(BD) (BD - 1.66) / 0.098 * 100
BD2GC(1.709) %>% print


[1] 1.709
[1] 50

GC min-max


In [55]:
%%R
min_GC = BD2GC(min_BD)
max_GC = BD2GC(max_BD)
cat('Min-max GC:', min_GC, max_GC, '\n')


Min-max GC: 13.5 116.7347 

How big must sigma be to detect throughout the gradient


In [69]:
%%R 

# where is density > 1e-7
detect_thresh = function(mean, sd){
    dens = dnorm(0:117, mean=mean, sd=sd)
    all(dens > 1e-7)
    }

df = expand.grid(means, sigmas)
colnames(df) = c('mean', 'sigma')
df$detect = mapply(detect_thresh, mean=df$mean, sd=df$sigma)
df %>% head(n=4)


  mean sigma detect
1   30     1  FALSE
2   31     1  FALSE
3   32     1  FALSE
4   33     1  FALSE

In [74]:
%%R -w 600
# plotting
ggplot(df, aes(mean, sigma, fill=detect)) +
    geom_tile(color='black') +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )


Notes

  • sigma must be >= 18 to have taxon detected in all gradients
    • assuming mean GC of taxon fragments is 30%
  • How small would the fragments need to be to explain this just from diffusion (Clay et al., 2003)?

How small of fragments would be needed to get the observed detection threshold?

sigma distribution of fragment GC for the reference dataset genomes


In [124]:
# loading fragments
F = os.path.join(workDir, '1', 'fragsParsed.pkl')
with open(F, 'rb') as inFH:
    frags = dill.load(inFH)

In [166]:
stds = []
for x in frags:
    otu = x[0]
    for scaf,arr in x[1].items():
        arr = np.array(arr)
        sd = np.std(arr[:,2])   # fragment GC
        stds.append([otu, scaf, sd])
stds = np.array(stds)

In [116]:
%%R -i stds -w 500 -h 300

stds = stds %>% as.data.frame
colnames(stds) = c('taxon', 'scaffold', 'sigma')
stds = stds %>%
    mutate(sigma = sigma %>% as.character %>% as.numeric)

ggplot(stds, aes(sigma)) +
    geom_histogram() +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )



In [130]:
%%R
# using 10% quantile 
## a relatively small, but not totally outlier of a sigma 
## this will require a lot of diffusion
q10 = quantile(stds$sigma, probs=c(0.1)) %>% as.vector
q10


[1] 1.819238

In [156]:
%%R
# function for sigma diffusion (Clay et al., 2003)
sigma_dif = function(L){
    sqrt(44.5 / L)
}

# function for calculating total sigma (fragment buoyant density) based on mean fragment length
total_sigma = function(L, sigma_start){
    # L = fragment length (kb)
    # start_sigma = genome sigma prior to diffusion
    sigma_D = sigma_dif(L)
    sqrt(sigma_D**2 + sigma_start**2)
}

frag_lens = seq(0.1, 20, 0.1)

total_sd = sapply(frag_lens, total_sigma, sigma_start=q10)
df = data.frame('length__kb' = frag_lens, 'sigma' = total_sd)
df %>% head


  length__kb     sigma
1        0.1 21.173323
2        0.2 15.026963
3        0.3 12.314340
4        0.4 10.703253
5        0.5  9.607790
6        0.6  8.802062

In [157]:
%%R -w 600 -h 350
# plotting
ggplot(df, aes(length__kb, sigma)) +
    geom_point() +
    geom_line() +
    geom_hline(yintercept=18, linetype='dashed', alpha=0.7) +
    labs(x='Fragment length (kb)', y='Standard deviation of fragment BD\n(+diffusion)') +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )


Percent of taxa that would be detected in all fraction depending on the fragment BD stdev with accounting for diffusion


In [158]:
%%R
sigma_thresh = 18

frag_lens = seq(0.1, 20, 0.1)
df = expand.grid(stds$sigma, frag_lens) 
colnames(df) = c('sigma', 'length__kb')

df$total_sd = mapply(total_sigma, df$length__kb, df$sigma)
df$detect = ifelse(df$total_sd >= sigma_thresh, 1, 0) 

df = df %>%
    group_by(length__kb) %>%
    summarize(n = n(),
              detected = sum(detect),
              detect_perc = detected / n * 100)
df %>% head(n=3)


Source: local data frame [3 x 4]

  length__kb     n detected detect_perc
       (dbl) (int)    (dbl)       (dbl)
1        0.1  1954     1954         100
2        0.2  1954        0           0
3        0.3  1954        0           0

In [167]:
%%R -w 600 -h 350
# plotting
ggplot(df, aes(length__kb, detect_perc)) +
    geom_point() +
    geom_line() +
    labs(x='Fragment length (kb)', y='% of taxa detected in all\ngradient fractions') +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )


Notes

  • This analysis indicates that using a gaussian distribution to model the distribution of DNA fragments in a CsCl gradient does not fit the emperical observation that taxa of >0.1% abundance are detected in all gradient fractions.
  • Even when including diffusion (Clay et al., 2003), the mean fragment size would have to be <= 100 bp to have taxa detected in all gradient fractions.

Future work

  • Need to test with out the assumption that genome fragment are homogenous in GC content.
  • Method:
    • SIPSim simulations with differing fragment length distrubutions (differing diffusion):
      • 100 - 1000 bp
      • 1000 - 2000 bp
      • 4000 - 5000 bp
    • Assessing which taxa are present in all 'sequenced' gradient fractions
      • Determine approx abundance threshold for taxa present in fractions
        • Is it near 0.1%?

In [ ]: