Getting info on Priming experiment dataset that's needed for modeling

Info:

  • Which gradient(s) to simulate?
  • For each gradient to simulate:
    • Infer total richness of starting community
    • Get distribution of total OTU abundances per fraction
      • Number of sequences per sample
    • Infer total abundance of each target taxon

User variables


In [2]:
baseDir = '/home/nick/notebook/SIPSim/dev/priming_exp/'
workDir = os.path.join(baseDir, 'exp_info')

otuTableFile = '/var/seq_data/priming_exp/data/otu_table.txt'
otuTableSumFile = '/var/seq_data/priming_exp/data/otu_table_summary.txt'

metaDataFile = '/var/seq_data/priming_exp/data/allsample_metadata_nomock.txt'

#otuRepFile = '/var/seq_data/priming_exp/otusn.pick.fasta'
#otuTaxFile = '/var/seq_data/priming_exp/otusn_tax/otusn_tax_assignments.txt'
#genomeDir = '/home/nick/notebook/SIPSim/dev/bac_genome1210/genomes/'

Init


In [3]:
import glob

In [4]:
%load_ext rpy2.ipython

In [5]:
%%R
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(fitdistrplus)


/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: 
Attaching package: ‘dplyr’


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:stats’:

    filter, lag


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


  res = super(Function, self).__call__(*new_args, **new_kwargs)
/opt/anaconda/lib/python2.7/site-packages/rpy2/robjects/functions.py:106: UserWarning: Loading required package: grid

  res = super(Function, self).__call__(*new_args, **new_kwargs)

In [6]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)

Loading OTU table (filter to just bulk samples)


In [46]:
%%R -i otuTableFile

tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(ends_with('.NA'))

tbl %>% ncol %>% print
tbl[1:4,1:4]


[1] 70
  X00C.000.14.09.NA X13C.100.14.08.NA X12C.700.28.04.NA X13C.000.45.04.NA
1                60                88                54                69
2                 1                 0                 0                 0
3                 0                 0                 0                 0
4                 0                 0                 0                 0

In [47]:
%%R

tbl.h = tbl %>% 
    gather('sample', 'count', 1:ncol(tbl)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)
tbl.h %>% head


             sample isotope treatment day rep fraction count
1 X00C.000.14.09.NA    X00C       000  14  09       NA    60
2 X00C.000.14.09.NA    X00C       000  14  09       NA     1
3 X00C.000.14.09.NA    X00C       000  14  09       NA     0
4 X00C.000.14.09.NA    X00C       000  14  09       NA     0
5 X00C.000.14.09.NA    X00C       000  14  09       NA     1
6 X00C.000.14.09.NA    X00C       000  14  09       NA     0

Which gradient(s) to simulate?


In [51]:
%%R -w 900 -h 400

tbl.h.s = tbl.h %>%
    group_by(sample) %>%
    summarize(total_count = sum(count)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)

ggplot(tbl.h.s, aes(day, total_count, color=rep %>% as.character)) +
    geom_point() +
    facet_grid(isotope ~ treatment) +
    theme(
        text = element_text(size=16)
        )



In [69]:
%%R
tbl.h.s$sample[grepl('700', tbl.h.s$sample)] %>% as.vector %>% sort


 [1] "X12C.700.00.09.NA" "X12C.700.00.10.NA" "X12C.700.07.07.NA"
 [4] "X12C.700.07.08.NA" "X12C.700.14.05.NA" "X12C.700.14.06.NA"
 [7] "X12C.700.28.03.NA" "X12C.700.28.04.NA" "X12C.700.45.01.NA"
[10] "X12C.700.45.02.NA" "X13C.700.14.08.NA" "X13C.700.14.09.NA"
[13] "X13C.700.28.06.NA" "X13C.700.28.07.NA" "X13C.700.45.01.NA"
[16] "X13C.700.45.02.NA" "X13C.700.45.03.NA" "X13C.700.45.04.NA"
[19] "X13C.700.45.05.NA"

Notes

Samples to simulate

  • Isotope:
    • 12C vs 13C
  • Treatment:
    • 700
  • Days:
    • 14
    • 28
    • 45

In [10]:
%%R
# bulk soil samples for gradients to simulate

samples.to.use = c(
"X12C.700.14.05.NA",
"X12C.700.28.03.NA",
"X12C.700.45.01.NA",
"X13C.700.14.08.NA",
"X13C.700.28.06.NA",
"X13C.700.45.01.NA"
)

Total richness of starting (bulk-soil) community

Method:

  • Total number of OTUs in OTU table (i.e., gamma richness)
  • Just looking at bulk soil samples

Loading just bulk soil


In [29]:
%%R -i otuTableFile

tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(ends_with('.NA'))
tbl$OTUId = rownames(tbl)

tbl %>% ncol %>% print
tbl[1:4,1:4]


[1] 71
  X00C.000.14.09.NA X13C.100.14.08.NA X12C.700.28.04.NA X13C.000.45.04.NA
1                60                88                54                69
2                 1                 0                 0                 0
3                 0                 0                 0                 0
4                 0                 0                 0                 0

In [30]:
%%R

tbl.h = tbl %>% 
    gather('sample', 'count', 1:(ncol(tbl)-1)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)
tbl.h %>% head


  OTUId            sample isotope treatment day rep fraction count
1     1 X00C.000.14.09.NA    X00C       000  14  09       NA    60
2     2 X00C.000.14.09.NA    X00C       000  14  09       NA     1
3     3 X00C.000.14.09.NA    X00C       000  14  09       NA     0
4     4 X00C.000.14.09.NA    X00C       000  14  09       NA     0
5     5 X00C.000.14.09.NA    X00C       000  14  09       NA     1
6     6 X00C.000.14.09.NA    X00C       000  14  09       NA     0

In [31]:
%%R -w 800
tbl.s = tbl.h %>% 
    filter(count > 0) %>%
    group_by(sample, isotope, treatment, day, rep, fraction) %>%
    summarize(n_taxa = n())

ggplot(tbl.s, aes(day, n_taxa, color=rep %>% as.character)) +
    geom_point() +
    facet_grid(isotope ~ treatment) +
    theme_bw() +
    theme(
        text = element_text(size=16),
        axis.text.x = element_blank()
        )



In [33]:
%%R -w 800 -h 350
# filter to just target samples

tbl.s.f = tbl.s %>% filter(sample %in% samples.to.use)

ggplot(tbl.s.f, aes(day, n_taxa, fill=rep %>% as.character)) +
    geom_bar(stat='identity') +
    facet_grid(. ~ isotope) +
    labs(y = 'Number of taxa') +
    theme_bw() +
    theme(
        text = element_text(size=16),
        axis.text.x = element_blank()
        )



In [82]:
%%R
message('Bulk soil total observed richness: ')

tbl.s.f %>% select(-fraction) %>% as.data.frame %>% print


Bulk soil total observed richness: 
             sample isotope treatment day rep n_taxa
1 X13C.700.28.06.NA    X13C       700  28  06   1917
2 X13C.700.14.08.NA    X13C       700  14  08   2677
3 X12C.700.28.03.NA    X12C       700  28  03   2194
4 X12C.700.45.01.NA    X12C       700  45  01   2239
5 X12C.700.14.05.NA    X12C       700  14  05   2170
6 X13C.700.45.01.NA    X13C       700  45  01   1858

Number of taxa in all fractions corresponding to each bulk soil sample

  • Trying to see the difference between richness of bulk vs gradients (veil line effect)

In [49]:
%%R -i otuTableFile
# loading OTU table
tbl = read.delim(otuTableFile, sep='\t') %>%
    select(-ends_with('.NA'))

tbl.h = tbl %>% 
    gather('sample', 'count', 2:ncol(tbl)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)
tbl.h %>% head


      OTUId            sample isotope treatment day rep fraction count
1  OTU.4776 X12C.700.45.01.24    X12C       700  45  01       24   157
2  OTU.2864 X12C.700.45.01.24    X12C       700  45  01       24     0
3  OTU.8170 X12C.700.45.01.24    X12C       700  45  01       24     0
4  OTU.8762 X12C.700.45.01.24    X12C       700  45  01       24     0
5  OTU.5223 X12C.700.45.01.24    X12C       700  45  01       24     0
6 OTU.13091 X12C.700.45.01.24    X12C       700  45  01       24     0

In [56]:
%%R
# basename of fractions
samples.to.use.base = gsub('\\.[0-9]+\\.NA', '', samples.to.use)

samps = tbl.h$sample %>% unique

fracs = sapply(samples.to.use.base, function(x) grep(x, samps, value=TRUE))

for (n in names(fracs)){
    n.frac = length(fracs[[n]])
    cat(n, '-->', 'Number of fraction samples: ', n.frac, '\n')
}


X12C.700.14 --> Number of fraction samples:  20 
X12C.700.28 --> Number of fraction samples:  20 
X12C.700.45 --> Number of fraction samples:  20 
X13C.700.14 --> Number of fraction samples:  20 
X13C.700.28 --> Number of fraction samples:  22 
X13C.700.45 --> Number of fraction samples:  20 

In [79]:
%%R
# function for getting all OTUs in a sample
n.OTUs = function(samples, otu.long){
    otu.long.f = otu.long %>%
        filter(sample %in% samples,
               count > 0) 
    n.OTUs = otu.long.f$OTUId %>% unique %>% length
    return(n.OTUs)
}


num.OTUs = lapply(fracs, n.OTUs, otu.long=tbl.h)

num.OTUs = do.call(rbind, num.OTUs) %>% as.data.frame
colnames(num.OTUs) = c('n_taxa')
num.OTUs$sample = rownames(num.OTUs)
num.OTUs


            n_taxa      sample
X12C.700.14   6606 X12C.700.14
X12C.700.28   6896 X12C.700.28
X12C.700.45   6901 X12C.700.45
X13C.700.14   6005 X13C.700.14
X13C.700.28   6238 X13C.700.28
X13C.700.45   6325 X13C.700.45

In [75]:
%%R
tbl.s.f %>% as.data.frame


             sample isotope treatment day rep fraction n_taxa
1 X13C.700.28.06.NA    X13C       700  28  06       NA   1917
2 X13C.700.14.08.NA    X13C       700  14  08       NA   2677
3 X12C.700.28.03.NA    X12C       700  28  03       NA   2194
4 X12C.700.45.01.NA    X12C       700  45  01       NA   2239
5 X12C.700.14.05.NA    X12C       700  14  05       NA   2170
6 X13C.700.45.01.NA    X13C       700  45  01       NA   1858

In [91]:
%%R   
# joining with bulk soil sample summary table
num.OTUs$data = 'fractions'
tbl.s.f$data = 'bulk_soil'
tbl.j = rbind(num.OTUs,
              tbl.s.f %>% ungroup %>% select(sample, n_taxa, data)) %>%
    mutate(isotope = gsub('X|\\..+', '', sample),
           sample = gsub('\\.[0-9]+\\.NA', '', sample))
tbl.j


   n_taxa      sample      data isotope
1    6606 X12C.700.14 fractions     12C
2    6896 X12C.700.28 fractions     12C
3    6901 X12C.700.45 fractions     12C
4    6005 X13C.700.14 fractions     13C
5    6238 X13C.700.28 fractions     13C
6    6325 X13C.700.45 fractions     13C
7    1917 X13C.700.28 bulk_soil     13C
8    2677 X13C.700.14 bulk_soil     13C
9    2194 X12C.700.28 bulk_soil     12C
10   2239 X12C.700.45 bulk_soil     12C
11   2170 X12C.700.14 bulk_soil     12C
12   1858 X13C.700.45 bulk_soil     13C

In [95]:
%%R -h 300 -w 800

ggplot(tbl.j, aes(sample, n_taxa, fill=data)) +
    geom_bar(stat='identity', position='dodge') +
    facet_grid(. ~ isotope, scales='free_x') +
    labs(y = 'Number of OTUs') +
    theme(
        text = element_text(size=16)
     #   axis.text.x = element_text(angle=90)
    )


Distribution of total sequences per fraction

  • Number of sequences per sample
  • Using all samples to assess this one
  • Just fraction samples

Method:

  • Total number of sequences (total abundance) per sample

Loading OTU table


In [145]:
%%R -i otuTableFile

tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(-ends_with('.NA'))

tbl %>% ncol %>% print
tbl[1:4,1:4]


[1] 398
     OTUId X12C.700.45.01.24 X12C.700.14.06.14 X12C.100.14.05.18
1 OTU.4776               157                34                13
2 OTU.2864                 0                 1                 2
3 OTU.8170                 0                 0                 0
4 OTU.8762                 0                 0                 0

In [146]:
%%R

tbl.h = tbl %>% 
    gather('sample', 'count', 2:ncol(tbl)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)
tbl.h %>% head


      OTUId            sample isotope treatment day rep fraction count
1  OTU.4776 X12C.700.45.01.24    X12C       700  45  01       24   157
2  OTU.2864 X12C.700.45.01.24    X12C       700  45  01       24     0
3  OTU.8170 X12C.700.45.01.24    X12C       700  45  01       24     0
4  OTU.8762 X12C.700.45.01.24    X12C       700  45  01       24     0
5  OTU.5223 X12C.700.45.01.24    X12C       700  45  01       24     0
6 OTU.13091 X12C.700.45.01.24    X12C       700  45  01       24     0

In [147]:
%%R -h 400

tbl.h.s = tbl.h %>%
    group_by(sample) %>%
    summarize(total_seqs = sum(count))


p = ggplot(tbl.h.s, aes(total_seqs)) +
    theme_bw() +
    theme(
        text = element_text(size=16)
        )
p1 = p + geom_histogram(binwidth=200) 
p2 = p + geom_density()

grid.arrange(p1,p2,ncol=1)


Distribution fitting


In [152]:
%%R -w 700 -h 350
plotdist(tbl.h.s$total_seqs)



In [153]:
%%R -w 450 -h 400
descdist(tbl.h.s$total_seqs, boot=1000)


summary statistics
------
min:  1   max:  180942 
median:  28263 
mean:  33211.14 
estimated sd:  21541.77 
estimated skewness:  1.627343 
estimated kurtosis:  8.871846 

In [154]:
%%R
f.n = fitdist(tbl.h.s$total_seqs, 'norm')
f.ln = fitdist(tbl.h.s$total_seqs, 'lnorm')
f.ll = fitdist(tbl.h.s$total_seqs, 'logis')
#f.c = fitdist(tbl.s$count, 'cauchy')
f.list = list(f.n, f.ln, f.ll)

plot.legend = c('normal', 'log-normal', 'logistic')

par(mfrow = c(2,1))
denscomp(f.list, legendtext=plot.legend)
qqcomp(f.list, legendtext=plot.legend)



In [155]:
%%R

gofstat(list(f.n, f.ln, f.ll), fitnames=plot.legend)


Goodness-of-fit statistics
                                normal log-normal   logistic
Kolmogorov-Smirnov statistic 0.1010961  0.1971488 0.06645885
Cramer-von Mises statistic   1.2726290  4.4212031 0.58739094
Anderson-Darling statistic   7.5953687 26.8795505 4.84350229

Goodness-of-fit criteria
                                 normal log-normal logistic
Aikake's Information Criterion 9051.969   9300.198 9013.198
Bayesian Information Criterion 9059.937   9308.166 9021.166

In [156]:
%%R
summary(f.ln)


Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
         estimate Std. Error
meanlog 10.113351 0.05983293
sdlog    1.192163 0.04230814
Loglikelihood:  -4648.099   AIC:  9300.198   BIC:  9308.166 
Correlation matrix:
        meanlog sdlog
meanlog       1     0
sdlog         0     1

Notes:

  • best fit:
    • lognormal
      • mean = 10.113
      • sd = 1.192

Does sample size correlate to buoyant density?

Loading OTU table


In [201]:
%%R -i otuTableFile

tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(-ends_with('.NA')) %>%
    select(-starts_with('X0MC'))
tbl = tbl %>%
    gather('sample', 'count', 2:ncol(tbl)) %>%
    mutate(sample = gsub('^X', '', sample))

tbl %>% head


      OTUId           sample count
1  OTU.4776 12C.700.45.01.24   157
2  OTU.2864 12C.700.45.01.24     0
3  OTU.8170 12C.700.45.01.24     0
4  OTU.8762 12C.700.45.01.24     0
5  OTU.5223 12C.700.45.01.24     0
6 OTU.13091 12C.700.45.01.24     0

In [202]:
%%R
# summarize

tbl.s = tbl %>%
    group_by(sample) %>%
    summarize(total_count = sum(count))
tbl.s %>% head(n=3)


Source: local data frame [3 x 2]

            sample total_count
1 12C.000.14.05.01       21963
2 12C.000.14.05.02       94820
3 12C.000.14.05.03       25674

Loading metadata


In [203]:
%%R -i metaDataFile

tbl.meta = read.delim(metaDataFile, sep='\t')
tbl.meta %>% head(n=3)


            Sample FractionNum Bulk Control CC X100 X700 H2O Day Density rep
1 12C.000.28.03.07           7    0       1  1    0    0   0  28  1.7646    
2 12C.000.28.03.08           8    0       1  1    0    0   0  28  1.7614    
3 12C.000.28.03.09           9    0       1  1    0    0   0  28  1.7537    
  contolVlabel Treatment
1      control    12C000
2      control    12C000
3      control    12C000

Determining association


In [186]:
%%R -w 700

tbl.j = inner_join(tbl.s, tbl.meta, c('sample' = 'Sample')) 

ggplot(tbl.j, aes(Density, total_count, color=rep)) +
    geom_point() +
    facet_grid(Treatment ~ Day)



In [184]:
%%R -w 600 -h 350

ggplot(tbl.j, aes(Density, total_count)) +
    geom_point(aes(color=Treatment)) +
    geom_smooth(method='lm') +
    labs(x='Buoyant density', y='Total sequences') +
    theme_bw() +
    theme(
        text = element_text(size=16)
        )


Number of taxa along the gradient


In [208]:
%%R 

tbl.s = tbl %>%
    filter(count > 0) %>%
    group_by(sample) %>%
    summarize(n_taxa = sum(count > 0))

tbl.j = inner_join(tbl.s, tbl.meta, c('sample' = 'Sample')) 
tbl.j %>% head(n=3)


Source: local data frame [3 x 14]

            sample n_taxa FractionNum Bulk Control CC X100 X700 H2O Day Density
1 12C.000.14.05.01    526           1    0       1  1    0    0   0  14  1.7668
2 12C.000.14.05.02   3458           2    0       1  1    0    0   0  14  1.7614
3 12C.000.14.05.03   2493           3    0       1  1    0    0   0  14  1.7570
Variables not shown: rep (fctr), contolVlabel (fctr), Treatment (fctr)

In [217]:
%%R -w 900 -h 600

ggplot(tbl.j, aes(Density, n_taxa, fill=rep, color=rep)) +
    #geom_area(stat='identity', alpha=0.5, position='dodge') +
    geom_point() +
    geom_line() +
    labs(x='Buoyant density', y='Number of taxa') +
    facet_grid(Treatment ~ Day) +
    theme_bw() +
    theme( 
        text = element_text(size=16),
        legend.position = 'none'
    )


Notes:

  • Many taxa out to the tails of the gradient.
  • It seems that the DNA fragments were quite diffuse in the gradients.

Total abundance of each target taxon: bulk soil approach

  • Getting relative abundances from bulk soil samples
    • This has the caveat of likely undersampling richness vs using all gradient fraction samples.
      • i.e., veil line effect

In [134]:
%%R -i otuTableFile
# loading OTU table
tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(matches('OTUId'), ends_with('.NA'))

tbl %>% ncol %>% print
tbl[1:4,1:4]


[1] 71
     OTUId X00C.000.14.09.NA X13C.100.14.08.NA X12C.700.28.04.NA
1 OTU.4776                60                88                54
2 OTU.2864                 1                 0                 0
3 OTU.8170                 0                 0                 0
4 OTU.8762                 0                 0                 0

In [135]:
%%R
# long table format w/ selecting samples of interest
tbl.h = tbl %>% 
    gather('sample', 'count', 2:ncol(tbl)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F) %>%
    filter(sample %in% samples.to.use,
           count > 0) 
tbl.h %>% head


      OTUId            sample isotope treatment day rep fraction count
1  OTU.4776 X13C.700.28.06.NA    X13C       700  28  06       NA    26
2   OTU.611 X13C.700.28.06.NA    X13C       700  28  06       NA     2
3   OTU.961 X13C.700.28.06.NA    X13C       700  28  06       NA     1
4   OTU.527 X13C.700.28.06.NA    X13C       700  28  06       NA     1
5 OTU.12349 X13C.700.28.06.NA    X13C       700  28  06       NA     1
6   OTU.262 X13C.700.28.06.NA    X13C       700  28  06       NA     9

In [136]:
%%R
message('Number of samples: ', tbl.h$sample %>% unique %>% length)
message('Number of OTUs: ', tbl.h$OTUId %>% unique %>% length)


Number of samples: 6
Number of OTUs: 4252

In [137]:
%%R

tbl.hs = tbl.h %>%
    group_by(OTUId) %>%
    summarize(
        total_count = sum(count),
        mean_count = mean(count),
        median_count = median(count),
        sd_count = sd(count)
    ) %>%
    filter(total_count > 0)

tbl.hs %>% head


Source: local data frame [6 x 5]

     OTUId total_count mean_count median_count    sd_count
1    OTU.1        1590  265.00000          187 173.4785289
2   OTU.10          80   13.33333           12   6.0553007
3  OTU.100         148   24.66667           22  15.7437818
4 OTU.1000           2    1.00000            1   0.0000000
5 OTU.1001           7    1.40000            1   0.5477226
6 OTU.1002          10    2.00000            2   0.7071068

For each sample, writing a table of OTU_ID and count


In [140]:
%%R -i workDir
setwd(workDir)

samps = tbl.h$sample %>% unique %>% as.vector

for(samp in samps){
    outFile = paste(c(samp, 'OTU.txt'), collapse='_')
    
    tbl.p = tbl.h %>% 
        filter(sample == samp, count > 0)
    
    write.table(tbl.p, outFile, sep='\t', quote=F, row.names=F)
    
    message('Table written: ', outFile)
    message('  Number of OTUs: ', tbl.p %>% nrow)
    }


Table written: X13C.700.28.06.NA_OTU.txt
  Number of OTUs: 1917
Table written: X13C.700.14.08.NA_OTU.txt
  Number of OTUs: 2677
Table written: X12C.700.28.03.NA_OTU.txt
  Number of OTUs: 2194
Table written: X12C.700.45.01.NA_OTU.txt
  Number of OTUs: 2239
Table written: X12C.700.14.05.NA_OTU.txt
  Number of OTUs: 2170
Table written: X13C.700.45.01.NA_OTU.txt
  Number of OTUs: 1858

Making directories for simulations


In [133]:
p = os.path.join(workDir, '*_OTU.txt')
files = glob.glob(p)

baseDir = os.path.split(workDir)[0]
newDirs = [os.path.split(x)[1].rstrip('.NA_OTU.txt') for x in files]
newDirs = [os.path.join(baseDir, x) for x in newDirs]

for newDir,f in zip(newDirs, files):
    if not os.path.isdir(newDir):
        print 'Making new directory: {}'.format(newDir)
        os.makedirs(newDir)
    else:
        print 'Directory exists: {}'.format(newDir)
    # symlinking file
    linkPath = os.path.join(newDir, os.path.split(f)[1])
    if not os.path.islink(linkPath):
        os.symlink(f, linkPath)


Directory exists: /home/nick/notebook/SIPSim/dev/priming_exp/X13C.700.28.06
Directory exists: /home/nick/notebook/SIPSim/dev/priming_exp/X12C.700.28.03
Directory exists: /home/nick/notebook/SIPSim/dev/priming_exp/X13C.700.14.08
Directory exists: /home/nick/notebook/SIPSim/dev/priming_exp/X13C.700.45.01
Directory exists: /home/nick/notebook/SIPSim/dev/priming_exp/X12C.700.45.01
Directory exists: /home/nick/notebook/SIPSim/dev/priming_exp/X12C.700.14.05

Rank-abundance distribution for each sample


In [7]:
%%R -i otuTableFile

tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(matches('OTUId'), ends_with('.NA'))

tbl %>% ncol %>% print
tbl[1:4,1:4]


[1] 71
     OTUId X00C.000.14.09.NA X13C.100.14.08.NA X12C.700.28.04.NA
1 OTU.4776                60                88                54
2 OTU.2864                 1                 0                 0
3 OTU.8170                 0                 0                 0
4 OTU.8762                 0                 0                 0

In [9]:
%%R
# long table format w/ selecting samples of interest
tbl.h = tbl %>% 
    gather('sample', 'count', 2:ncol(tbl)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F) %>%
    filter(sample %in% samples.to.use,
           count > 0) 
tbl.h %>% head


      OTUId            sample isotope treatment day rep fraction count
1  OTU.4776 X13C.700.28.06.NA    X13C       700  28  06       NA    26
2   OTU.611 X13C.700.28.06.NA    X13C       700  28  06       NA     2
3   OTU.961 X13C.700.28.06.NA    X13C       700  28  06       NA     1
4   OTU.527 X13C.700.28.06.NA    X13C       700  28  06       NA     1
5 OTU.12349 X13C.700.28.06.NA    X13C       700  28  06       NA     1
6   OTU.262 X13C.700.28.06.NA    X13C       700  28  06       NA     9

In [25]:
%%R
# ranks of relative abundances

tbl.r = tbl.h %>%
    group_by(sample) %>%
    mutate(perc_rel_abund = count / sum(count) * 100,
           rank = row_number(-perc_rel_abund)) %>%
    unite(day_rep, day, rep, sep='-')
tbl.r %>% as.data.frame %>% head(n=3)


     OTUId            sample isotope treatment day_rep fraction count
1 OTU.4776 X13C.700.28.06.NA    X13C       700   28-06       NA    26
2  OTU.611 X13C.700.28.06.NA    X13C       700   28-06       NA     2
3  OTU.961 X13C.700.28.06.NA    X13C       700   28-06       NA     1
  perc_rel_abund rank
1    0.140479793  143
2    0.010806138  927
3    0.005403069 1231

In [27]:
%%R -w 900 -h 350


ggplot(tbl.r, aes(rank, perc_rel_abund)) +
    geom_point() +
#    labs(x='Buoyant density', y='Number of taxa') +
    facet_wrap(~ day_rep) +
    theme_bw() +
    theme( 
        text = element_text(size=16),
        legend.position = 'none'
    )


Taxon abundance range for each sample-fraction


In [22]:
%%R -i otuTableFile

tbl = read.delim(otuTableFile, sep='\t')

# filter
tbl = tbl %>%
    select(-ends_with('.NA')) %>%
    select(-starts_with('X0MC'))
tbl = tbl %>%
    gather('sample', 'count', 2:ncol(tbl)) %>%
    mutate(sample = gsub('^X', '', sample))

tbl %>% head


      OTUId           sample count
1  OTU.4776 12C.700.45.01.24   157
2  OTU.2864 12C.700.45.01.24     0
3  OTU.8170 12C.700.45.01.24     0
4  OTU.8762 12C.700.45.01.24     0
5  OTU.5223 12C.700.45.01.24     0
6 OTU.13091 12C.700.45.01.24     0

In [43]:
%%R

tbl.ar = tbl %>%
    #mutate(fraction = gsub('.+\\.', '', sample) %>% as.numeric) %>%
    #mutate(treatment = gsub('(.+)\\..+', '\\1', sample)) %>%
    group_by(sample) %>%
    mutate(rel_abund = count / sum(count)) %>% 
    summarize(abund_range = max(rel_abund) - min(rel_abund)) %>%
    ungroup() %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)

tbl.ar %>% head(n=3)


Source: local data frame [3 x 7]

            sample isotope treatment day rep fraction abund_range
1 12C.000.14.05.01     12C       000  14  05       01  0.14542640
2 12C.000.14.05.02     12C       000  14  05       02  0.01763341
3 12C.000.14.05.03     12C       000  14  05       03  0.01686531

In [47]:
%%R -w 800

tbl.ar = tbl.ar %>%
    mutate(fraction = as.numeric(fraction))

ggplot(tbl.ar, aes(fraction, abund_range, fill=rep, color=rep)) +
    geom_point() +
    geom_line() +
    labs(x='Buoyant density', y='relative abundanc range') +
    facet_grid(treatment ~ day) +
    theme_bw() +
    theme( 
        text = element_text(size=16),
        legend.position = 'none'
    )


Total abundance of each target taxon: all fraction samples approach

  • Getting relative abundances from all fraction samples for the gradient
    • I will need to calculate (mean|max?) relative abundances for each taxon and then re-scale so that cumsum = 1

In [96]:
%%R -i otuTableFile
# loading OTU table
tbl = read.delim(otuTableFile, sep='\t') %>%
    select(-ends_with('.NA'))

tbl.h = tbl %>% 
    gather('sample', 'count', 2:ncol(tbl)) %>%
    separate(sample, c('isotope','treatment','day','rep','fraction'), sep='\\.', remove=F)
tbl.h %>% head


      OTUId            sample isotope treatment day rep fraction count
1  OTU.4776 X12C.700.45.01.24    X12C       700  45  01       24   157
2  OTU.2864 X12C.700.45.01.24    X12C       700  45  01       24     0
3  OTU.8170 X12C.700.45.01.24    X12C       700  45  01       24     0
4  OTU.8762 X12C.700.45.01.24    X12C       700  45  01       24     0
5  OTU.5223 X12C.700.45.01.24    X12C       700  45  01       24     0
6 OTU.13091 X12C.700.45.01.24    X12C       700  45  01       24     0

In [100]:
%%R
# basename of fractions
samples.to.use.base = gsub('\\.[0-9]+\\.NA', '', samples.to.use)

samps = tbl.h$sample %>% unique

fracs = sapply(samples.to.use.base, function(x) grep(x, samps, value=TRUE))

for (n in names(fracs)){
    n.frac = length(fracs[[n]])
    cat(n, '-->', 'Number of fraction samples: ', n.frac, '\n')
}


X12C.700.14 --> Number of fraction samples:  20 
X12C.700.28 --> Number of fraction samples:  20 
X12C.700.45 --> Number of fraction samples:  20 
X13C.700.14 --> Number of fraction samples:  20 
X13C.700.28 --> Number of fraction samples:  22 
X13C.700.45 --> Number of fraction samples:  20 

In [103]:
%%R
# function for getting mean OTU abundance from all fractions 
OTU.abund = function(samples, otu.long){
    otu.rel.abund = otu.long %>%
        filter(sample %in% samples,
               count > 0) %>%
        ungroup() %>%
        group_by(sample) %>%
        mutate(total_count = sum(count)) %>%
        ungroup() %>%
        mutate(perc_abund = count / total_count * 100) %>%
        group_by(OTUId) %>%
        summarize(mean_perc_abund = mean(perc_abund),
                  median_perc_abund = median(perc_abund),
                  max_perc_abund = max(perc_abund)) 
    
    return(otu.rel.abund)
}

## calling function
otu.rel.abund = lapply(fracs, OTU.abund, otu.long=tbl.h)
otu.rel.abund = do.call(rbind, otu.rel.abund) %>% as.data.frame
otu.rel.abund$sample = gsub('\\.[0-9]+$', '', rownames(otu.rel.abund))
otu.rel.abund %>% head


                  OTUId mean_perc_abund median_perc_abund max_perc_abund
X12C.700.14.1     OTU.1     3.827532325       3.926317300    5.603168720
X12C.700.14.2    OTU.10     0.048631000       0.048182090    0.085456173
X12C.700.14.3   OTU.100     0.308850910       0.312166472    0.508460092
X12C.700.14.4  OTU.1000     0.013548909       0.013883827    0.024440914
X12C.700.14.5 OTU.10000     0.001464408       0.001464408    0.001464408
X12C.700.14.6  OTU.1001     0.005668025       0.005251645    0.016415420
                   sample
X12C.700.14.1 X12C.700.14
X12C.700.14.2 X12C.700.14
X12C.700.14.3 X12C.700.14
X12C.700.14.4 X12C.700.14
X12C.700.14.5 X12C.700.14
X12C.700.14.6 X12C.700.14

In [118]:
%%R -h 600 -w 900
# plotting
otu.rel.abund.l = otu.rel.abund %>%
    gather('abund_stat', 'value', mean_perc_abund, median_perc_abund, max_perc_abund)

otu.rel.abund.l$OTUId = reorder(otu.rel.abund.l$OTUId, -otu.rel.abund.l$value)

ggplot(otu.rel.abund.l, aes(OTUId, value, color=abund_stat)) +
    geom_point(shape='O', alpha=0.7) +
    scale_y_log10() +
    facet_grid(abund_stat ~ sample) +
    theme_bw() +
    theme(
        text = element_text(size=16),
        axis.text.x = element_blank(),
        legend.position = 'none'
        )


For each sample, writing a table of OTU_ID and count


In [123]:
%%R -i workDir
setwd(workDir)



# each sample is a file
samps = otu.rel.abund.l$sample %>% unique %>% as.vector

for(samp in samps){
    outFile = paste(c(samp, 'frac_OTU.txt'), collapse='_')
    
    tbl.p = otu.rel.abund %>% 
        filter(sample == samp, mean_perc_abund > 0)
    
    write.table(tbl.p, outFile, sep='\t', quote=F, row.names=F)
    
    cat('Table written: ', outFile, '\n')
    cat('  Number of OTUs: ', tbl.p %>% nrow, '\n')
    }


Table written:  X12C.700.14_frac_OTU.txt 
  Number of OTUs:  6606 
Table written:  X12C.700.28_frac_OTU.txt 
  Number of OTUs:  6896 
Table written:  X12C.700.45_frac_OTU.txt 
  Number of OTUs:  6901 
Table written:  X13C.700.14_frac_OTU.txt 
  Number of OTUs:  6005 
Table written:  X13C.700.28_frac_OTU.txt 
  Number of OTUs:  6238 
Table written:  X13C.700.45_frac_OTU.txt 
  Number of OTUs:  6325 

A broader taxon distribution with increased abundance?

  • Overloading molecules at one spot in the gradient, leading to more diffusion?

In [ ]: