Developing a simulation methodology for the qSIP method

Method:
- qPCR simulation
  - mean quantifications derived from the absolute count data
  - variance derived from qSIP paper
  - values will be multiplied by taxon relative abundances (post-simulation subsampling) to get taxon-specific weighted averaged density differences (y_ijk)
- Incorporation calculation
  - $\text{Buoyant density} = y_{ijk} = x_{jk}$
  - $\text{Total copies} = y_{ij} = \sum(y_{ijk})$
  - $\text{taxon density} = W_{ij} = \sum_{k=1}^{K} x_{jk} * \big(\frac{y_{ijk}}{y_{ij}}\big)$
  - $\text{density diff due to isotope incorp} = Z_i = W_{LABi} - W_{LIGHTi}$
    - each is mean across all replicates
    - need to simulate 3 replicates for control & treatment
- Bootstrapping
  - sampling with replication of taxon abundances
    - sampling from all replicates in a treatment
  - estimated 90% CI bootstraps
    - value range encompassing 90% of bootstrap values

Setting variables



In [1]:

    
supInfoFile = '/home/nick/notebook/SIPSim/dev/qSIP/PeerJ_qSIP_preprint/PeerJ_Supplemental_Information.pdf'

Init



In [79]:

    
import os,sys
%load_ext rpy2.ipython



In [81]:

    
%%R
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)









    



/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘dplyr’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:stats’:

    filter, lag


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Loading required package: grid

  res = super(Function, self).__call__(*new_args, **new_kwargs)

Deriving qPCR variance from emperical data: qSIP pre-print

pdf2txt method



In [57]:

    
pdfTextFile= os.path.splitext(supInfoFile)[0] + '.txt'

!pdf2txt.py $supInfoFile | perl -pe 's/\r+/ /g; s/[ \t ]+/\t/g' > $pdfTextFile

!head -n 4 $pdfTextFile
!printf '===============\n'
!tail -n 4 $pdfTextFile









    



1	
2	
3	
4	
===============
652432.3	
268143.0



In [48]:

    
!egrep "Table.+S2" $pdfTextFile









    



Table	S2.	qPCR	analysis	of	the	16S	gene	for	each	density	fraction

Making a parser was too painful; I just broke down and used a text editor & excel to parse out the table



In [77]:

    
tableS2File = pdfTextFile= os.path.splitext(supInfoFile)[0] + '_e.txt'
!head -n4 $tableS2File









    



SampleID	Tube	Glucose	Water	density-g-p-mL	16S-qPCR-copynum-p-ul
1.2	1	none	Nat_Abund	1.69581	195379.5
1.21	1	none	Nat_Abund	1.69249	282486
1.22	1	none	Nat_Abund	1.69139	310844.9

Plotting data



In [88]:

    
%%R -i tableS2File -w 900 

tbl.s2 = read.delim(tableS2File, sep='\t') %>%
    mutate(Tube = as.character(Tube))

ggplot(tbl.s2, aes(density.g.p.mL, X16S.qPCR.copynum.p.ul, color=Tube)) +
    geom_point() +
    geom_line() +
    theme_bw() +
    facet_grid(Glucose ~ Water) +
    theme(
        text = element_text(size=16)
    )



In [91]:

    
%%R -w 600 -h 300
ggplot(tbl.s2, aes(density.g.p.mL, X16S.qPCR.copynum.p.ul)) +
    geom_smooth() +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )

Determining variance across a density window



In [127]:

    
%%R -w 600 -h 300
tbl.s2.s = tbl.s2 %>%
    group_by(ntile(density.g.p.mL, 20)) %>%
    summarize(min_density = min(density.g.p.mL),
              mean_density = mean(density.g.p.mL),
              max_density = max(density.g.p.mL),
              density_width = max_density - min_density,
              var_copy = var(X16S.qPCR.copynum.p.ul),
              sd_copy = sd(X16S.qPCR.copynum.p.ul),
              mean_copy = mean(X16S.qPCR.copynum.p.ul))

ggplot(tbl.s2.s, aes(mean_density, sd_copy, color='red')) +
    geom_bar(stat='identity', aes(width=density_width)) +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )



In [130]:

    
%%R -w 600 -h 500
# how does variance relate to mean copy number

p1 = ggplot(tbl.s2.s, aes(mean_copy, var_copy)) +
    geom_point() +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )

p2 = ggplot(tbl.s2.s, aes(mean_copy, sd_copy)) +
    geom_point() +
    geom_smooth(method='lm') +
    theme_bw() +
    theme(
        text = element_text(size=16)
    )

grid.arrange(p1, p2, ncol=1)

Linear regression of mean_copy_number ~ sd



In [129]:

    
%%R
res = lm(sd_copy ~ mean_copy, tbl.s2.s)
summary(res)









    





Call:
lm(formula = sd_copy ~ mean_copy, data = tbl.s2.s)

Residuals:
   Min     1Q Median     3Q    Max 
-56728  -6766  -2119   6101  55684 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.296e+03  8.167e+03   0.893    0.383    
mean_copy   7.269e-01  5.872e-02  12.380 3.05e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23370 on 18 degrees of freedom
Multiple R-squared:  0.8949,	Adjusted R-squared:  0.8891 
F-statistic: 153.3 on 1 and 18 DF,  p-value: 3.048e-10

Notes:

standard deviation varied more or less linearly with mean

Implementation in modeling

The qPCR values will be drawn from normal distributions representing posible value distribution
- normal dist mean (loc) = community total count value --> 'qPCR mean copy number'
- normal dist sd (scale) = community total count value --> 'qPCR mean copy number' --> sd copy number

Bootstrapping

How does the bootstrapping methodology work in the paper?
Possibilities: *

OLD



In [74]:

    
def subInfo_parser(iFH):
    # parsing tableS2 from qSIP paper
    for line in iFH:
        if re.match('.+Table.+S2', line):
            tableS2_parser(iFH)

def tableS2_parser(iFH):
#    header = ('#SampleID', 'Tube', 'glucose', 'p-', '16S-')
#    is_header = lambda x: [x.startswith(y) for y in header]

    
    tableS2 = {}
    for line in iFH:
        if any(is_header(line)):
            parse
            
def parse_column(iFH):
    
    
    
            
            
with open(pdfTextFile, 'rb') as iFH:
    tableS2_parser(iFH)









    



#SampleID	

Tube	

glucose	

p-	�‐mL	

16S-	�‐qPCR-	�‐

p-	�‐uL



In [ ]: