General info on the fullCyc dataset (as it pertains to SIPSim validation)

Simulating 12C gradients
Determining if simulated taxon abundance distributions resemble the true distributions
Simulation parameters to infer from dataset:
- Infer total richness of bulk soil community
  - richness of starting community
- Infer abundance distribution of bulk soil community
  - NO: distribution fit
  - INSTEAD: using relative abundances of bulk soil community
- Get distribution of total OTU abundances per fraction
  - Number of sequences per sample

User variables



In [1]:

    
%load_ext rpy2.ipython



In [2]:

    
%%R

workDir = '/home/nick/notebook/SIPSim/dev/fullCyc/'
physeqDir = '/home/nick/notebook/fullCyc/data/MiSeq_16S/515f-806r/V4_Lib1-7/phyloseq/'
physeqBulkCore = 'bulk-core'
physeqSIP = 'SIP-core_unk'

Init



In [3]:

    
%%R
library(ggplot2)
library(dplyr)
library(tidyr)
library(phyloseq)
library(fitdistrplus)
library(sads)









    



/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘dplyr’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:stats’:

    filter, lag


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Loading required package: MASS

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘MASS’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following object is masked from ‘package:dplyr’:

    select


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Loading required package: bbmle

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Loading required package: stats4

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘bbmle’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following object is masked from ‘package:dplyr’:

    slice


  res = super(Function, self).__call__(*new_args, **new_kwargs)



In [4]:

    
%%R

dir.create(workDir, showWarnings=FALSE)

Loading phyloseq datasets



In [5]:

    
%%R
# bulk core samples
F = file.path(physeqDir, physeqBulkCore)
physeq.bulk = readRDS(F)
physeq.bulk.m = physeq.bulk %>% sample_data
physeq.bulk









    



Error in gzfile(file, "rb") : cannot open the connection






    



/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Error in gzfile(file, "rb") : cannot open the connection

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: In addition: 
  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Warning messages:

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 1: replacing previous import by ‘scales::alpha’ when loading ‘phyloseq’ 

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 2: replacing previous import by ‘ggplot2::Position’ when loading ‘DESeq2’ 

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 3: 
  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: In gzfile(file, "rb") :
  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
 
  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning:  cannot open compressed file '/home/nick/notebook/fullCyc/data/MiSeq_16S/515f-806r/V4_Lib1-7/phyloseq//bulk-core', probable reason 'No such file or directory'

  res = super(Function, self).__call__(*new_args, **new_kwargs)



In [6]:

    
%%R
# SIP core samples
F = file.path(physeqDir, physeqSIP)
physeq.SIP = readRDS(F)
physeq.SIP.m = physeq.SIP %>% sample_data
physeq.SIP









    



Error in gzfile(file, "rb") : cannot open the connection






    



/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Warning message:

  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning:  cannot open compressed file '/home/nick/notebook/fullCyc/data/MiSeq_16S/515f-806r/V4_Lib1-7/phyloseq//SIP-core_unk', probable reason 'No such file or directory'

  res = super(Function, self).__call__(*new_args, **new_kwargs)

Selecting gradients to simulate



In [303]:

    
%%R
physeq.bulk.12C = prune_samples(physeq.bulk.m$Substrate == '12C-Con', physeq.bulk) %>%
    filter_taxa(function(x) sum(x) > 0, TRUE)
physeq.bulk.12C.m = physeq.bulk.12C %>% sample_data
physeq.bulk.12C









    





phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4370 taxa and 6 samples ]
sample_data() Sample Data:       [ 6 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4370 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4370 tips and 4369 internal nodes ]



In [304]:

    
%%R
physeq.SIP.12C = prune_samples(physeq.SIP.m$Substrate == '12C-Con', physeq.SIP) %>%
    transform_sample_counts(function(x) x/sum(x))
physeq.SIP.12C









    





phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 12358 taxa and 142 samples ]
sample_data() Sample Data:       [ 142 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 12358 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 12358 tips and 12357 internal nodes ]



In [305]:

    
%%R
# number of fractions per 12C day
as.Num = function(x) as.numeric(as.character(x))

physeq.SIP.12C %>% sample_data %>% 
    as.matrix %>% as.data.frame %>% 
    group_by(Day) %>%
    summarize(n = n()) %>%
    mutate(Day = as.Num(Day)) %>%
    arrange(Day)









    





Source: local data frame [6 x 2]

    Day     n
  (dbl) (int)
1     1    25
2     3    23
3     6    23
4    14    23
5    30    24
6    48    24

Infer total richness of bulk soil community



In [323]:

    
%%R -w 700

p = plot_richness(physeq.bulk.12C, measures=c('Observed', 'Chao1'))
p = p + theme(text = element_text(size=18))
p



In [339]:

    
%%R
ggplot_build(p)$data[[2]] %>% arrange(PANEL, x)









    





   x        y     ymin     ymax PANEL group xmin xmax
1  1 2443.000       NA       NA     1     1 0.95 1.05
2  2 2008.000       NA       NA     1     3 1.95 2.05
3  3 1621.000       NA       NA     1     5 2.95 3.05
4  4 1440.000       NA       NA     1     7 3.95 4.05
5  5 2191.000       NA       NA     1     9 4.95 5.05
6  6 1718.000       NA       NA     1    11 5.95 6.05
7  1 2929.862 2881.305 2978.418     2     2 0.95 1.05
8  2 2503.459 2451.968 2554.950     2     4 1.95 2.05
9  3 2213.610 2151.041 2276.178     2     6 2.95 3.05
10 4 1784.926 1743.502 1826.351     2     8 3.95 4.05
11 5 2594.394 2551.777 2637.010     2    10 4.95 5.05
12 6 2124.291 2079.485 2169.098     2    12 5.95 6.05

Infer abundance distribution of bulk soil community

distribution fit



In [280]:

    
%%R
df.OTU = physeq.bulk.12C %>%
    transform_sample_counts(function(x) x/sum(x)) %>%
    otu_table %>%
    as.matrix %>% as.data.frame 

df.OTU$OTU = rownames(df.OTU)
df.OTU = df.OTU %>% 
    gather('sample', 'abundance', 1:(ncol(df.OTU)-1))

df.OTU %>% head(n=3)









    





       OTU             sample    abundance
1 OTU.1101 12C-Con.D3.R3_bulk 0.000000e+00
2 OTU.1130 12C-Con.D3.R3_bulk 9.601536e-05
3 OTU.9833 12C-Con.D3.R3_bulk 9.601536e-05



In [281]:

    
%%R -w 700 -h 350


days = physeq.bulk.12C.m$Day %>% unique %>% as.character
days = reorder(days, days %>% as.Num)

df.OTU.l = list()
for(day in days %>% sort){
    day %>% print
    rx = paste(c('\\.D', day, '\\.'), collapse='')
    df.OTU.l[[day]] = df.OTU %>% 
                filter(grepl(rx, sample), abundance > 0)
    plotdist(df.OTU.l[[day]]$abundance)
}









    





[1] "1"
[1] "3"
[1] "6"
[1] "14"
[1] "30"
[1] "48"



In [282]:

    
%%R -w 450 -h 400

lapply(df.OTU.l, function(x) descdist(x$abundance, boot=1000))









    





$`1`
summary statistics
------
min:  6.171316e-05   max:  0.05918292 
median:  0.0001234263 
mean:  0.000498008 
estimated sd:  0.002527486 
estimated skewness:  17.3276 
estimated kurtosis:  353.0851 

$`3`
summary statistics
------
min:  9.601536e-05   max:  0.09371099 
median:  0.0001920307 
mean:  0.0006944444 
estimated sd:  0.003522151 
estimated skewness:  18.90537 
estimated kurtosis:  433.01 

$`6`
summary statistics
------
min:  8.966197e-05   max:  0.06984668 
median:  0.0001793239 
mean:  0.0005820722 
estimated sd:  0.002405602 
estimated skewness:  19.2723 
estimated kurtosis:  481.0681 

$`14`
summary statistics
------
min:  5.145621e-05   max:  0.03571061 
median:  0.0001543686 
mean:  0.0004093328 
estimated sd:  0.001370901 
estimated skewness:  15.40056 
estimated kurtosis:  319.4511 

$`30`
summary statistics
------
min:  0.0001367241   max:  0.03199344 
median:  0.0002734482 
mean:  0.0006169031 
estimated sd:  0.001407484 
estimated skewness:  11.48746 
estimated kurtosis:  198.9396 

$`48`
summary statistics
------
min:  6.797172e-05   max:  0.03045133 
median:  0.0002039152 
mean:  0.0004564126 
estimated sd:  0.001250518 
estimated skewness:  13.0449 
estimated kurtosis:  241.5398



In [294]:

    
%%R
tmp1 = df.OTU.l[['1']]$abundance

f.norm = fitdist(tmp1, 'norm')
f.exp = fitdist(tmp1, 'exp')
f.logn = fitdist(tmp1, 'lnorm')
f.gamma = fitdist(tmp1, 'gamma')
f.beta = fitdist(tmp1, 'beta')

f.list = list(f.norm, f.exp, f.logn, f.gamma, f.beta)

plot.legend = c('normal', 'exponential', 'lognormal', 'gamma', 'beta')

par(mfrow = c(2,1))
denscomp(f.list, legendtext=plot.legend)
qqcomp(f.list, legendtext=plot.legend)



In [295]:

    
%%R
gofstat(f.list, fitnames=plot.legend)









    





Goodness-of-fit statistics
                                  normal exponential  lognormal     gamma
Kolmogorov-Smirnov statistic   0.4314581    0.333942  0.1583872  0.237266
Cramer-von Mises statistic   118.9343834   74.597984 10.7887734 37.706637
Anderson-Darling statistic           Inf         Inf 70.9860349       Inf
                                   beta
Kolmogorov-Smirnov statistic  0.2378843
Cramer-von Mises statistic   37.9793220
Anderson-Darling statistic          Inf

Goodness-of-fit criteria
                                  normal exponential lognormal     gamma
Aikake's Information Criterion -18316.35   -26523.26 -28624.47 -26925.83
Bayesian Information Criterion -18305.14   -26517.65 -28613.26 -26914.62
                                    beta
Aikake's Information Criterion -26910.79
Bayesian Information Criterion -26899.58



In [296]:

    
%%R
summary(f.beta)









    





Fitting of the distribution ' beta ' by maximum likelihood 
Parameters : 
           estimate  Std. Error
shape1    0.6052314  0.01604951
shape2 1206.9991938 47.11957568
Loglikelihood:  13457.39   AIC:  -26910.79   BIC:  -26899.58 
Correlation matrix:
          shape1    shape2
shape1 1.0000000 0.6786354
shape2 0.6786354 1.0000000



In [299]:

    
%%R
tmp1 %>% summary %>% print
rbeta(10000, 0.605, 1206.999) %>% summary









    





     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
6.171e-05 6.171e-05 1.234e-04 4.980e-04 3.086e-04 5.918e-02 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 7.532e-05 2.663e-04 4.969e-04 6.692e-04 6.318e-03

Notes

Instead of trying to simulate OTU abundances based on a distribution, I will use the observed abundances of each OTU

Get distribution of total OTU abundances per fraction

This provides the parameters for simulating OTU table subsampling (ie., sequencing simulation)



In [307]:

    
%%R
physeq.SIP.12C = prune_samples(physeq.SIP.m$Substrate == '12C-Con', physeq.SIP)
physeq.SIP.12C









    





phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 12358 taxa and 142 samples ]
sample_data() Sample Data:       [ 142 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 12358 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 12358 tips and 12357 internal nodes ]



In [308]:

    
%%R
df.OTU = physeq.SIP.12C %>% 
    otu_table %>%
    as.matrix %>% as.data.frame 

df.OTU$OTU = rownames(df.OTU)
df.OTU = df.OTU %>% 
    gather('sample', 'abundance', 1:(ncol(df.OTU)-1))

df.OTU %>% head(n=3)









    





       OTU            sample abundance
1 OTU.1101 12C-Con.D1.R2_F23         2
2 OTU.1130 12C-Con.D1.R2_F23         0
3 OTU.9833 12C-Con.D1.R2_F23         0



In [309]:

    
%%R -h 300

df.OTU.s = df.OTU %>%
    group_by(sample) %>%
    summarize(total_abundance = sum(abundance))

cat('Number of samples (fractions): ', nrow(df.OTU.s), '\n')

p = ggplot(df.OTU.s, aes(total_abundance)) +
    geom_histogram(binwidth=100, aes(y = ..density..)) +
    geom_density() + 
    theme_bw() +
    theme(
        text = element_text(size=16)
        )
p









    





Number of samples (fractions):  142

Notes

Total abundances of each sample



In [310]:

    
%%R -w 700 -h 350

plotdist(df.OTU.s$total_abundance)



In [311]:

    
%%R -w 450 -h 400
descdist(df.OTU.s$total_abundance, boot=1000)









    





summary statistics
------
min:  2428   max:  29480 
median:  14326.5 
mean:  14138.24 
estimated sd:  6503.14 
estimated skewness:  0.3214032 
estimated kurtosis:  2.445484



In [312]:

    
%%R
f.n = fitdist(df.OTU.s$total_abundance, 'norm')
f.ln = fitdist(df.OTU.s$total_abundance, 'lnorm')
f.ll = fitdist(df.OTU.s$total_abundance, 'logis')
f.list = list(f.n, f.ln, f.ll)

plot.legend = c('normal', 'log-normal', 'logistic')

par(mfrow = c(2,1))
denscomp(f.list, legendtext=plot.legend)
qqcomp(f.list, legendtext=plot.legend)



In [313]:

    
%%R

gofstat(list(f.n, f.ln, f.ll), fitnames=plot.legend)









    





Goodness-of-fit statistics
                                 normal log-normal   logistic
Kolmogorov-Smirnov statistic 0.07478719  0.1203950 0.07448995
Cramer-von Mises statistic   0.11357745  0.3740005 0.13550439
Anderson-Darling statistic   0.85383850  2.0350254 0.99553341

Goodness-of-fit criteria
                                 normal log-normal logistic
Aikake's Information Criterion 2899.506   2906.347 2906.681
Bayesian Information Criterion 2905.418   2912.259 2912.592



In [314]:

    
%%R
summary(f.ln)









    





Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
         estimate Std. Error
meanlog 9.4317868 0.04464078
sdlog   0.5319562 0.03156529
Loglikelihood:  -1451.173   AIC:  2906.347   BIC:  2912.259 
Correlation matrix:
        meanlog sdlog
meanlog       1     0
sdlog         0     1



In [315]:

    
%%R -h 300
rlnorm(1000, 9.432, 0.5) %>% hist(breaks=30)



In [ ]:



In [ ]: