Developing simulation of PCR step in the HR-SIP pipeline

  • Main goal is to add in template saturation, which should flatten out the abundance distributions of the abundant taxa

Method

Based on:

Suzuki MT, Giovannoni SJ. (1996). Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol 62:625–630.

  • User-input
    • Molarity of DNA used for PCRs
      • distribution
    • Molarity of primers used for PCRs
      • distribution
    • Number of PCR cycles
      • default: 30
    • Starting efficiency
  • workflow
    • OTU_table --> PCR_sim --> subsample
      • PCR_sim input/output = OTU_table

Setting variables


In [2]:
# dirs
workDir = '/home/nick/notebook/SIPSim/dev/bac_genome3/PCR_sim/'
genomeDir = '/home/nick/notebook/SIPSim/dev/bac_genome3/genomes/'
R_dir = '/home/nick/notebook/SIPSim/lib/R/'

# input files
genomeIndexFile = '/home/nick/notebook/SIPSim/dev/bac_genome3/genomes/genome_index.txt'
fragBD_file = '/home/nick/notebook/SIPSim/dev/bac_genome3/validation/ampFrags_real_kde_dif.pkl'

# params
nprocs = 24

Init


In [3]:
%load_ext rpy2.ipython

In [4]:
%%R
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)


/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘dplyr’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:stats’:

    filter, lag


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Loading required package: grid

  res = super(Function, self).__call__(*new_args, **new_kwargs)

In [7]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)

Making a community file


In [8]:
commFile = 'comm_n1.txt'

!cd $workDir; \
    SIPSim communities \
    $genomeIndexFile \
    > $commFile
    
!cd $workDir; \
    head $commFile


library	taxon_name	rel_abund_perc	rank
1	Escherichia_coli_1303	55.142898529	1
1	Clostridium_ljungdahlii_DSM_13528	41.256294112	2
1	Streptomyces_pratensis_ATCC_33331	3.600807359	3

Making an incorp config file


In [9]:
incorpConfigFile = 'PT0_PI0.config'

!cd $workDir; \
    SIPSim incorpConfigExample \
    --percTaxa 0 \
    --percIncorpUnif 0 \
    > $incorpConfigFile
    
!cd $workDir; \
    head $incorpConfigFile


[1]
    # baseline: no incorporation
    
    [[intraPopDist 1]]
        distribution = uniform
        
        [[[start]]]
            
            [[[[interPopDist 1]]]]

Adding isotope incorporation to BD distribution


In [10]:
wIncorpFile = os.path.splitext(fragBD_file)[0] + '_incorp.pkl'

!cd $workDir; \
    SIPSim isotope_incorp \
    $fragBD_file \
    $incorpConfigFile \
    --comm $commFile \
    --np $nprocs \
    > $wIncorpFile
    
!cd $workDir; \
    ls -thlc $wIncorpFile


Processing: Clostridium_ljungdahlii_DSM_13528
Processing: Streptomyces_pratensis_ATCC_33331
Processing: Escherichia_coli_1303
Processing: Clostridium_ljungdahlii_DSM_13528
Processing: Streptomyces_pratensis_ATCC_33331
Processing: Escherichia_coli_1303
-rw-rw-r-- 1 nick nick 4.6M Sep  7 10:50 /home/nick/notebook/SIPSim/dev/bac_genome3/validation/ampFrags_real_kde_dif_incorp.pkl

Simulating gradient fractions


In [11]:
fracsFile = 'fracs.txt'

!cd $workDir; \
    SIPSim gradient_fractions \
    comm_n1.txt \
    > $fracsFile
    
!cd $workDir; \
    head -n 4 $fracsFile


library	fraction	BD_min	BD_max	fraction_size
1	1	1.66	1.664	0.004
1	2	1.664	1.669	0.005
1	3	1.669	1.672	0.003

Simulating an OTU table


In [125]:
otuTableFile = 'OTU_abs1e9.txt'

!cd $workDir; \
    SIPSim OTU_table \
    $wIncorpFile \
    $commFile \
    $fracsFile \
    --abs 1e9 \
    --np $nprocs \
    > $otuTableFile
    
!cd $workDir; \
    head -n 4 $otuTableFile


Loading files...
Simulating OTUs...
Processing library: "1"
  Processing taxon: "Escherichia_coli_1303"
   taxon abs-abundance:  551428985
  Processing taxon: "Clostridium_ljungdahlii_DSM_13528"
   taxon abs-abundance:  412562941
  Processing taxon: "Streptomyces_pratensis_ATCC_33331"
   taxon abs-abundance:  36008074
library	taxon	fraction	BD_min	BD_mid	BD_max	count	rel_abund
1	Clostridium_ljungdahlii_DSM_13528	1.660-1.664	1.66	1.662	1.664	0	0.0
1	Clostridium_ljungdahlii_DSM_13528	1.664-1.669	1.664	1.667	1.669	8337	0.433428645698
1	Clostridium_ljungdahlii_DSM_13528	1.669-1.672	1.669	1.671	1.672	1650	1.0

In [126]:
%%R -i workDir -i otuTableFile
setwd(workDir) 

tbl.otu = read.delim(otuTableFile, sep='\t')
tbl.otu %>% head(n=3)


  library                             taxon    fraction BD_min BD_mid BD_max
1       1 Clostridium_ljungdahlii_DSM_13528 1.660-1.664  1.660  1.662  1.664
2       1 Clostridium_ljungdahlii_DSM_13528 1.664-1.669  1.664  1.667  1.669
3       1 Clostridium_ljungdahlii_DSM_13528 1.669-1.672  1.669  1.671  1.672
  count rel_abund
1     0 0.0000000
2  8337 0.4334286
3  1650 1.0000000

In [127]:
%%R -w 900 -h 300
ggplot(tbl.otu, aes(BD_mid, rel_abund, fill=taxon)) +
    geom_area(stat='identity') +
    labs(x = 'Buoyant density', y = 'Relative abundance') +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )


Simulating PCR

  • assuming 5 ng of DNA in a 30 ul rxn
    • Avg. MW of dNTPs = 499.5 g/mol
    • molarity of DNA in rxn = 5 * 1e-9 * (1 / 499.5) * (1 / (30 * 1e-6))
  • molarity of primers:
    • 1 uM (each)
  • number of PCR cycles:
    • 30

Model from Suzuki MT & Giovannoni SJ (1996)

  • equation for modeling increase in polymerized DNA conc.
    • $M = M_0 * e^{f*n}$
  • equation for modeling rxn effeciency
    • $f(n) = f_0 * \big(\frac{P(n)}{k * M(n) + P(n)}\big)$

algorithm

  • deepcopy of OTU table object
    • perc_rel_abund = 0 (to be filled in later)
    • post_PCR_conc = 0
  • foreach column (fraction community) in OTU table:
    • select total start DNA from distribution
    • foreach taxon:
      • taxon_DNA_conc = rel_abund_perc / 100 * DNA_molarity
        • (starting DNA conc. for taxon) = M_0
      • foreach PCR cycle:
        • calculate effeciency (f)
        • calculate new concentration (M)
      • add to dataframe: post_PCR_conc
    • Normalize all post_PCR_conc by total sample conc
      • this is the new perc_rel_abund

Plotting efficiency function


In [216]:
%%R

rxn_eff = function(M_n, P_n=1e-6, f_0=1, k=1){
    b = k * M_n + P_n 
    f_n = f_0 * (P_n / b)
    return(f_n)
}

M_n = c(1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3)
f_n = sapply(M_n, rxn_eff)

df = data.frame(M_n = M_n, f_n = f_n)

ggplot(df, aes(M_n, f_n)) +
    geom_point() +
    geom_line() +
    scale_x_log10() +
    theme(
        text = element_text(size=16)
    )


PCR-SIM


In [246]:
otu_PCR_file = 'OTU_abs1e9_PCR.txt'

!cd $workDir; \
    SIPSim OTU_PCR \
    $otuTableFile \
    > $otu_PCR_file
    
!cd $workDir; \
    head -n 4 $otu_PCR_file


library	taxon	fraction	BD_min	BD_mid	BD_max	count	rel_abund
1	Clostridium_ljungdahlii_DSM_13528	1.660-1.664	1.66	1.662	1.664	0	0.0
1	Clostridium_ljungdahlii_DSM_13528	1.664-1.669	1.664	1.667	1.669	476118124	0.476118123851
1	Clostridium_ljungdahlii_DSM_13528	1.669-1.672	1.669	1.671	1.672	1000000000	1.0

In [247]:
%%R -i workDir -i otu_PCR_file -i  otuTableFile
setwd(workDir) 

tbl.otu = read.delim(otuTableFile, sep='\t') 
tbl.otu.g = tbl.otu %>% gather('type', 'value', c(count, rel_abund))

tbl.otu.PCR = read.delim(otu_PCR_file, sep='\t') %>%
    select(-count) 
tbl.otu.PCR.g = tbl.otu.PCR %>% gather('type', 'value', c(rel_abund)) %>%
    mutate(type = 'rel_abund_PCR')
#tbl.otu.PCR.g %>% head(n=3) %>% print

tbl.otu.c = rbind(tbl.otu.g, tbl.otu.PCR.g)
tbl.otu.c %>% head(n=3)


  library                             taxon    fraction BD_min BD_mid BD_max
1       1 Clostridium_ljungdahlii_DSM_13528 1.660-1.664  1.660  1.662  1.664
2       1 Clostridium_ljungdahlii_DSM_13528 1.664-1.669  1.664  1.667  1.669
3       1 Clostridium_ljungdahlii_DSM_13528 1.669-1.672  1.669  1.671  1.672
   type value
1 count     0
2 count  8337
3 count  1650

In [248]:
%%R -w 900 -h 400

ggplot(tbl.otu.c, aes(BD_mid, value, fill=taxon)) +
    geom_area() +
    facet_grid(type ~ ., scales='free_y') +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )



In [249]:
%%R -w 900 -h 400

tbl.otu.c.f = tbl.otu.c %>%
    filter(type != 'count')

tbl.otu.c.f$taxon = gsub('^([a-zA-Z0-9_]{10,15})_', '\\1\n', tbl.otu.c.f$taxon)
tbl.otu.c.f$taxon = gsub('\n([a-zA-Z0-9_]{10,15})_', '\n\\1\n', tbl.otu.c.f$taxon)

ggplot(tbl.otu.c.f, aes(BD_mid, value, fill=type)) +
    geom_area(position='dodge', alpha=0.5) +
    facet_grid(taxon ~ .) +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )


Notes

  • The PCR step has the effect of leveling out abundance distributions (as expected)

Subsampling

w/out PCR


In [250]:
otu_sub_file = 'OTU_abs1e9_sub.txt'

!cd $workDir; \
    SIPSim OTU_subsample \
    $otuTableFile \
    > $otu_sub_file
    
!cd $workDir; \
    head -n 4 $otu_sub_file


library	fraction	taxon	BD_min	BD_mid	BD_max	count	rel_abund
1	1.660-1.664	Clostridium_ljungdahlii_DSM_13528	1.66	1.662	1.664	0	0.0
1	1.664-1.669	Clostridium_ljungdahlii_DSM_13528	1.664	1.667	1.669	4325	0.4325
1	1.669-1.672	Clostridium_ljungdahlii_DSM_13528	1.669	1.671	1.672	10000	1.0

w/ PCR


In [251]:
otu_PCR_sub_file = 'OTU_abs1e9_PCR_sub.txt'

!cd $workDir; \
    SIPSim OTU_subsample \
    $otu_PCR_file \
    > $otu_PCR_sub_file
    
!cd $workDir; \
    head -n 4 $otu_PCR_sub_file


library	fraction	taxon	BD_min	BD_mid	BD_max	count	rel_abund
1	1.660-1.664	Clostridium_ljungdahlii_DSM_13528	1.66	1.662	1.664	0	0.0
1	1.664-1.669	Clostridium_ljungdahlii_DSM_13528	1.664	1.667	1.669	4753	0.4753
1	1.669-1.672	Clostridium_ljungdahlii_DSM_13528	1.669	1.671	1.672	10000	1.0

Plotting


In [252]:
%%R -i workDir -i otu_sub_file -i otu_PCR_sub_file
setwd(workDir)

tbl.otu1 = read.delim(otu_sub_file, sep='\t') %>%
    mutate(file = 'no_PCR')
tbl.otu2 = read.delim(otu_PCR_sub_file, sep='\t') %>%
    mutate(file = 'PCR')
tbl.otu = rbind(tbl.otu1, tbl.otu2)
tbl.otu %>% head(n=3)


  library    fraction                             taxon BD_min BD_mid BD_max
1       1 1.660-1.664 Clostridium_ljungdahlii_DSM_13528  1.660  1.662  1.664
2       1 1.664-1.669 Clostridium_ljungdahlii_DSM_13528  1.664  1.667  1.669
3       1 1.669-1.672 Clostridium_ljungdahlii_DSM_13528  1.669  1.671  1.672
  count rel_abund   file
1     0    0.0000 no_PCR
2  4325    0.4325 no_PCR
3 10000    1.0000 no_PCR

In [253]:
%%R -w 900 -h 400

ggplot(tbl.otu, aes(BD_mid, rel_abund, fill=taxon)) +
    geom_area() +
    facet_grid(file ~ .) +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )


Notes:

  • the PCR simulation does seem to have the desired effect of broadening the taxon abundance distributions

In [ ]: