Major questions:

How specific & sensitive is our current HR-SIP analytical method?
What are some major factors affecting accuracy of our current HR-SIP analytical method?
How can we increase the accuracy of HR-SIP?

Figure rundown

Figure 1
- taxon abundance example
  - control vs treatment facets (100% atom 13C; 10% incorporators)
  - 3 plots: absolute counts, absolute subsampled counts, relative abundance of subsampled counts
Figure 2
- Bray-Curtis NMDS plots
  - 100% atom 13C
  - % incorporators: 0, 10, 25, 50
  - showing variance between replicates (ellipses?, error bars?, Procrustes?)
    - Procrustes: dendrogram of plots (reps & treatments)
      - all replicates should be very similar vs treatments
Figure 3
- accuracy ~ atom % 13C + % incorporators
- boxplots
Figure 4
- accuracy ~ atom % 13C + evenness
Figure 5
- BD-window
Figure 6
- comparing methods
  - HR-SIP
  - qSIP
  - old-SIP
- x-axis: atom % 13C
- y-axis: values
- color/fill: method
- different plots: % incorporators

Figure S1:
- simulation scheme diagram
Figure SX:
- validation figs
Figure S2:
- accuracy ~ abundance correlation (Figure 2)
Figure S3:
- accuracy ~ abundance correlation (Figure 3)

Simulation methodology validation

Diagram of simulation methodology

Figure 1

Supplemental example plots of key aspects in simulation

Genomic fragment simulation

Sup Figure:
- 3 taxa of differing GC
- Fragment length distributions
  - 'realistic' = skewed normal
  - 'large' = ~50 kb
  - 'small' = ~1-2 kb
- faceted heatmap:
  - taxa ~ (fragment length distribution)

Diffusion calculations

Sup Figure:
- 3 taxa of differing GC (same as last)
- histograms:
  - BD distribution for pre & post diffusion
- Shows effect of diffusion & how it relates to fragment length
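
A minimal sketch of the pre- vs post-diffusion comparison, assuming the standard linear GC-to-buoyant-density relation (BD = 1.66 + 0.098 * GC fraction) and assuming diffusion adds Gaussian spread whose standard deviation scales with the inverse square root of fragment length. The constant `const_kb=44.5` and all function names are assumptions for illustration and should be checked against the simulation's actual diffusion model.

```python
import math
import random

def gc_to_bd(gc_percent):
    """Buoyant density (g/ml) of unlabeled DNA from % GC (linear relation)."""
    return 1.66 + 0.098 * gc_percent / 100.0

def diffusion_sd_bd(frag_len_bp, const_kb=44.5):
    """Assumed diffusion SD in BD units; shrinks as fragments get longer."""
    sd_gc_percent = math.sqrt(const_kb * 1000.0 / frag_len_bp)  # SD in % GC units
    return 0.098 * sd_gc_percent / 100.0

def simulate_fragment_bd(gc_percent, frag_len_bp, rng):
    """Return (pre-diffusion BD, post-diffusion BD) for one fragment."""
    pre = gc_to_bd(gc_percent)
    post = rng.gauss(pre, diffusion_sd_bd(frag_len_bp))
    return pre, post

rng = random.Random(1)
# Hypothetical taxa (low, mid, high GC) and fragment sizes ('small' vs 'large')
for gc in (30, 50, 70):
    for length in (1500, 50000):
        pre, post = simulate_fragment_bd(gc, length, rng)
        print(f"GC={gc}% len={length} bp pre={pre:.4f} post={post:.4f}")
```

Under these assumptions, 'small' (~1-2 kb) fragments spread several-fold more widely in BD than 'large' (~50 kb) fragments, which is the effect the histograms are meant to show.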

Real HR-SIP dataset validation

Dataset: Ashley's priming experiment
Select OTUs with close genome representative
- BLASTn of OTU 16S vs genome 16S blast db
Simulation of just genomes of interest
Simulation of all genomes
Plot abundance distributions: real vs simulated
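
The OTU selection step could be sketched as below, assuming BLASTn was run with tabular output (`-outfmt 6`, default columns) and that a "close" genome representative means passing a percent-identity and alignment-length cutoff. The 97% and 250 bp thresholds, the function name, and the example IDs are hypothetical placeholders.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6)
BLAST6_COLS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_genome_hits(blast6_handle, min_pident=97.0, min_length=250):
    """Map each OTU to its highest-bitscore genome hit that passes the cutoffs."""
    best = {}
    for row in csv.reader(blast6_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        hit = dict(zip(BLAST6_COLS, row))
        pident = float(hit["pident"])
        length = int(hit["length"])
        bits = float(hit["bitscore"])
        if pident < min_pident or length < min_length:
            continue
        prev = best.get(hit["qseqid"])
        if prev is None or bits > prev[2]:
            best[hit["qseqid"]] = (hit["sseqid"], pident, bits)
    return {otu: (genome, pident) for otu, (genome, pident, _) in best.items()}

# Made-up example rows: two hits for OTU.1, one weak hit for OTU.2
example = (
    "OTU.1\tgenomeA\t99.2\t253\t2\t0\t1\t253\t10\t262\t1e-120\t450\n"
    "OTU.1\tgenomeB\t97.5\t253\t6\t0\t1\t253\t10\t262\t1e-110\t420\n"
    "OTU.2\tgenomeC\t91.0\t253\t22\t0\t1\t253\t10\t262\t1e-80\t300\n"
)
hits = best_genome_hits(io.StringIO(example))
print(hits)
```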

Mock community validation?

Goal: validate the simulation model vs a ground-truth real dataset
Is this necessary???
10-20 isolates
'Control' community:
- All DNA is 12C
'Treatment' community:
- 2 to 4 isolates grown with 13C (~100% 13C incorporation)
2 gradients (full HR-SIP pipeline)
- add to MiSeq run
Marquessa project???
Downsides:
- Costly in time and money
- Would take months to complete
(Sup) Figure:
- Taxon abundances of real & simulated communities
Table:
- Accuracy of detecting incorporators (simulation vs real dataset)

Example figure: absolute & relative taxon abundances

All taxa simulated (n=1210); just top 100 plotted
Figure:
- 1 community
- faceted:
  - (absolute abundance or relative abundance) ~ (total community or subsampled community)
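
A minimal sketch of the four facet combinations, assuming subsampling is a random draw of reads weighted by absolute abundance (sampling with replacement). The taxon names, abundances, and the 10,000-read depth are made up for illustration.

```python
import random
from collections import Counter

def relative_abundance(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def subsample(counts, depth, rng):
    """Draw `depth` reads with replacement, weighted by absolute abundance."""
    taxa = list(counts)
    weights = [counts[t] for t in taxa]
    drawn = Counter(rng.choices(taxa, weights=weights, k=depth))
    return {t: drawn.get(t, 0) for t in taxa}

rng = random.Random(42)
community = {"taxon_A": 6_000_000, "taxon_B": 3_000_000, "taxon_C": 1_000_000}
sub_counts = subsample(community, depth=10_000, rng=rng)

# One entry per facet: (abundance type) ~ (community type)
panels = {
    ("absolute", "total"): community,
    ("relative", "total"): relative_abundance(community),
    ("absolute", "subsampled"): sub_counts,
    ("relative", "subsampled"): relative_abundance(sub_counts),
}
for key, values in panels.items():
    print(key, values)
```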

1) How specific & sensitive is our current HR-SIP analytical method?

2) What are some major factors affecting the accuracy of identifying incorporators?

Accuracy as a function of isotope incorporation

Variable parameters:

% isotope incorporation
- 0, 25, 50, 100
% taxa that incorporate
- 1, 5, 10, 25, 50
n-reps (stochastic: taxon abundances & which incorporate)
- 10 (20?)
total simulations
- 4 * 3 * 10 = 120
- time = 120 * 1.6 [hr] = 192 [hr] = 8 [days]
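
The run count and wall-time estimate can be recomputed from the parameter grid, which helps when levels are added or dropped. This is a small sketch; the function name is hypothetical and the 1.6 hr per run is taken from the estimate above.

```python
from itertools import product

def simulation_budget(levels_per_parameter, n_reps, hours_per_run):
    """Return (number of runs, total hours, total days) for a full factorial grid."""
    n_combos = len(list(product(*levels_per_parameter)))
    n_runs = n_combos * n_reps
    total_hours = n_runs * hours_per_run
    return n_runs, total_hours, total_hours / 24

pct_incorp = [0, 25, 50, 100]     # % isotope incorporation
pct_taxa = [1, 5, 10, 25, 50]     # % taxa that incorporate

n_runs, hours, days = simulation_budget([pct_incorp, pct_taxa],
                                        n_reps=10, hours_per_run=1.6)
print(n_runs, hours, round(days, 1))
```

With all five "% taxa that incorporate" levels listed above, the full grid comes to 200 runs (320 hr); the 120-run estimate corresponds to using three of those levels.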

Set parameters:

Community rank-abundance:
- lognormal
Total community abundance:
- 1e9
'heavy' BD range:
- 1.71 - 1.75
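
One stochastic replicate under these set parameters might look like the sketch below: lognormal abundances rescaled to a total of 1e9, with incorporators chosen at random. The lognormal `mu`/`sigma` values and the function name are assumptions; the taxon count of 1210 is borrowed from the example figure above.

```python
import random

def simulate_community(n_taxa, total_abundance, pct_incorporators, rng,
                       mu=0.0, sigma=2.0):
    """Return (rank-ordered abundances, set of incorporator ranks) for one replicate."""
    raw = [rng.lognormvariate(mu, sigma) for _ in range(n_taxa)]
    scale = total_abundance / sum(raw)
    abundances = sorted((x * scale for x in raw), reverse=True)
    # Incorporators are picked uniformly at random, independent of abundance
    n_incorp = round(n_taxa * pct_incorporators / 100)
    incorporators = set(rng.sample(range(n_taxa), n_incorp))
    return abundances, incorporators

rng = random.Random(7)
abundances, incorporators = simulate_community(
    n_taxa=1210, total_abundance=1e9, pct_incorporators=10, rng=rng)
print(len(abundances), len(incorporators), abundances[0] / 1e9)
```

Each replicate would use a different seed so that both the abundances and the identity of the incorporators vary between reps.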

Analysis:

Figure:
- faceted boxplot
  - x-axis = percent isotope incorporation
  - color/group = percent of taxa to incorporate
  - facet = sensitivity/specificity (and balanced accuracy?)
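
For reference, the facet metrics can be computed from the known incorporators in the simulation and the taxa called by the analysis. This is a sketch with hypothetical names; balanced accuracy is taken as the mean of sensitivity and specificity.

```python
import math

def accuracy_metrics(true_incorporators, called_incorporators, all_taxa):
    """Sensitivity, specificity, and balanced accuracy for one simulation run."""
    true_inc = set(true_incorporators)
    called = set(called_incorporators)
    taxa = set(all_taxa)
    tp = len(true_inc & called)            # incorporators correctly called
    fn = len(true_inc - called)            # incorporators missed
    fp = len((called - true_inc) & taxa)   # non-incorporators wrongly called
    tn = len(taxa - true_inc - called)     # non-incorporators correctly not called
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

taxa = [f"OTU.{i}" for i in range(1, 11)]
truth = {"OTU.1", "OTU.2", "OTU.3", "OTU.4"}
called = {"OTU.1", "OTU.2", "OTU.3", "OTU.5"}
metrics = accuracy_metrics(truth, called, taxa)
print(metrics)
```

Note that at 0% isotope incorporation there are no true incorporators, so sensitivity is undefined (returned as NaN here) and only specificity is informative.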

Possible extensions:

% isotope incorporation
- normal distribution, varying standard deviation
  - mean = 50
  - sd = (1, 10, 25)
- uniform distribution
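
These extensions could be sketched as per-taxon draws of % incorporation, converted to a BD shift. Two assumptions are made here: draws from the normal distribution are clipped to 0-100%, and a fully 13C-labeled genome shifts by about 0.036 g/ml (`max_shift`), scaling linearly with % incorporation. Function names are hypothetical.

```python
import random

def sample_pct_incorp(n, rng, dist="normal", mean=50.0, sd=10.0):
    """Draw n per-taxon % incorporation values from the chosen distribution."""
    values = []
    for _ in range(n):
        if dist == "normal":
            x = rng.gauss(mean, sd)
            x = min(max(x, 0.0), 100.0)   # assumption: clip to the valid 0-100 range
        elif dist == "uniform":
            x = rng.uniform(0.0, 100.0)
        else:
            raise ValueError(f"unknown distribution: {dist}")
        values.append(x)
    return values

def bd_shift(pct_incorp, max_shift=0.036):
    """Assumed linear BD shift (g/ml) for a given % 13C incorporation."""
    return max_shift * pct_incorp / 100.0

rng = random.Random(3)
for sd in (1, 10, 25):
    draws = sample_pct_incorp(1000, rng, dist="normal", mean=50, sd=sd)
    print(sd, round(min(draws), 1), round(max(draws), 1))
uniform_draws = sample_pct_incorp(1000, rng, dist="uniform")
```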

Accuracy as a function of community evenness and isotope incorporation

Variable parameters:

Community rank-abundance:
- lognormal abundance distribution
  - very uneven
  - moderately uneven
  - slightly uneven
% incorp
- 0:100:5
XCommunity richness:
- X100 taxa
- X300 taxa
- X900 taxa
n-reps (stochastic: taxon abundances & which incorporate)
- 20
total simulations

Set parameters:

% taxa that incorporate
- 10%?
Total community abundance:
- 1e9
'heavy' BD range:
- 1.71 - 1.75
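
To check that the three lognormal settings really differ in evenness, one option is to compute Pielou's evenness (Shannon entropy divided by ln of richness) for each. The sigma values below are hypothetical placeholders for "very", "moderately", and "slightly" uneven, and the 1000-taxon richness is arbitrary.

```python
import math
import random

def pielou_evenness(abundances):
    """Pielou's J = H / ln(S), using taxa with non-zero abundance."""
    total = sum(abundances)
    props = [a / total for a in abundances if a > 0]
    if len(props) < 2:
        return math.nan   # evenness is undefined for a single taxon
    shannon = -sum(p * math.log(p) for p in props)
    return shannon / math.log(len(props))

# Assumed mapping of evenness level to lognormal sigma (larger = more uneven)
sigma_by_level = {"very uneven": 3.0, "moderately uneven": 2.0, "slightly uneven": 1.0}

rng = random.Random(11)
evenness = {}
for level, sigma in sigma_by_level.items():
    community = [rng.lognormvariate(0.0, sigma) for _ in range(1000)]
    evenness[level] = pielou_evenness(community)
    print(level, round(evenness[level], 3))
```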

Analysis:

Figure:
- pointrange (percent_incorp ~ accuracy_value; color by community evenness)

Accuracy as a function of taxon abundance

Variable parameters:

Partition communities into 'dominant' and 'rare' subsets
- dominant/rare cutoff: 1% relative abundance?
Assess sensitivity & specificity for each subset
Note:
- Much less time to perform if using data from past simulation run datasets
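
The partitioning step can be applied to existing simulation output without rerunning anything. This is a sketch assuming the tentative 1% relative-abundance cutoff; the names and the toy community are made up.

```python
def split_by_abundance(rel_abund, cutoff=0.01):
    """Split taxa into 'dominant' (>= cutoff) and 'rare' (< cutoff) sets."""
    dominant = {t for t, r in rel_abund.items() if r >= cutoff}
    rare = set(rel_abund) - dominant
    return dominant, rare

def sens_spec(true_incorporators, called_incorporators, taxa):
    """Sensitivity and specificity restricted to the given subset of taxa."""
    truth = set(true_incorporators) & taxa
    called = set(called_incorporators) & taxa
    tp, fn = len(truth & called), len(truth - called)
    fp, tn = len(called - truth), len(taxa - truth - called)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

rel_abund = {"A": 0.5, "B": 0.3, "C": 0.185, "D": 0.008, "E": 0.005, "F": 0.002}
truth = {"A", "D", "E"}
called = {"A", "B", "D"}

dominant, rare = split_by_abundance(rel_abund, cutoff=0.01)
dominant_result = sens_spec(truth, called, dominant)
rare_result = sens_spec(truth, called, rare)
print("dominant:", dominant_result, "rare:", rare_result)
```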

Analysis:

Sup Figures:
- Same as above simulation runs, but split into 'dominant' vs 'rare'

Accuracy as a function of BD window

Variable parameters:

Note:
- Much less time to perform if using data from past simulation run datasets
'Heavy' BD range:
- BD-min
  - range = [1.67:1.77:0.02]
- BD-max
  - BD-min + 0.04
- product: BD-min ~ BD-max
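
The set of 'heavy' windows to test could be generated as follows, reading `[1.67:1.77:0.02]` as start:stop:step with the stop value included (an assumption) and a fixed window width of 0.04. The function name is hypothetical.

```python
def heavy_bd_windows(bd_min_start=1.67, bd_min_stop=1.77, step=0.02, width=0.04):
    """Return a list of (BD-min, BD-max) windows, with BD-max = BD-min + width."""
    n_steps = round((bd_min_stop - bd_min_start) / step)
    windows = []
    for i in range(n_steps + 1):
        # Index-based stepping plus rounding avoids accumulated float drift
        bd_min = round(bd_min_start + i * step, 4)
        windows.append((bd_min, round(bd_min + width, 4)))
    return windows

windows = heavy_bd_windows()
for bd_min, bd_max in windows:
    print(f"{bd_min:.2f} - {bd_max:.2f}")
```

Under this reading there are six windows, and the third one (1.71 - 1.75) matches the fixed 'heavy' BD range used in the earlier simulations.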

Analysis:

Figure:
- Plot showing sensitivities/specificities for each 'heavy' BD range
Figure:
- Taxon abundances of true positives & false negatives
- Why: displays the issue with selecting one 'heavy' BD window

How can we increase the accuracy of our method?

Multiple 'heavy' fraction windows

General sliding window approach
Binning OTUs by log2fc ~ BD
Extrapolating GC for each OTU, with 2-3 different 'heavy' BD windows

Time series analysis

MetagenomeSeq method


