Major questions:

  1. How specific & sensitive is our current HR-SIP analytical method?

  2. What are some major factors affecting accuracy of our current HR-SIP analytical method?

  3. How can we increase the accuracy of HR-SIP?


Figure rundown

  • Figure 1
    • taxon abundance example
      • control vs treatment facets (100% atom 13C; 10% incorporators)
      • 3 plots: absolute counts, absolute subsampled counts, relative abundance subsampled counts
  • Figure 2
    • Bray-Curtis NMDS plots
      • 100% atom 13C
      • % incorporators: 0, 10, 25, 50
      • showing variance between replicates (ellipses?, error bars?, Procrustes?)
        • Procrustes: dendrogram of plots (reps & treatments)
          • all replicates should be very similar vs treatments
  • Figure 3
    • accuracy ~ atom % 13C + % incorporators
    • boxplots
  • Figure 4
    • accuracy ~ atom % 13C + evenness
  • Figure 5
    • BD-window
  • Figure 6
    • comparing methods
      • HR-SIP
      • qSIP
      • old-SIP
    • x-axis: atom % 13C
    • y-axis: values
    • color/fill: method
    • different plots: % incorporators
  • Figure S1:
    • simulation scheme diagram
  • Figure SX:
    • validation figs
  • Figure S2:
    • accuracy ~ abundance correlation (figure2)
  • Figure S3:
    • accuracy ~ abundance correlation (figure3)
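
The NMDS plots planned for Figure 2 are built on pairwise Bray-Curtis dissimilarities between gradient fractions or communities. A minimal sketch of that dissimilarity, assuming each sample is a list of counts over the same taxa in the same order (the example counts are made up for illustration):

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples of taxon counts."""
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same taxa in the same order")
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    absolute_difference = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    return absolute_difference / total


# Made-up counts for one control and one treatment sample
control = [10, 20, 30, 40]
treatment = [10, 25, 30, 35]
print(bray_curtis(control, treatment))
```

A value of 0 means identical composition and 1 means no shared taxa, so replicates of the same treatment should give values close to 0.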

Simulation methodology validation

Diagram of simulation methodology

  • Figure 1

Supplemental example plots of key aspects in simulation

Genomic fragment simulation

  • Sup Figure:
    • 3 taxa of differing GC
    • Fragment length distributions
      • 'realistic' = skewed normal
      • 'large' = ~50 kb
      • 'small' = ~1-2 kb
    • faceted heatmap:
      • taxa ~ (fragment length distribution)
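
The three fragment length distributions above could be sampled roughly as follows. This is only a sketch: the skew-normal parameters, the spread of the 'large' profile, the uniform shape of the 'small' profile, and the 100 bp minimum are all placeholder assumptions, since the notes only name the profiles.

```python
import math
import random

# Placeholder parameters (assumptions, not values from the notes)
REALISTIC = {"location": 9000.0, "scale": 2500.0, "shape": -5.0}
MIN_LENGTH = 100  # assumed minimum fragment length in bp


def skew_normal(rng, location, scale, shape):
    """Draw one skew-normal value from two standard normal draws."""
    delta = shape / math.sqrt(1.0 + shape ** 2)
    z0 = abs(rng.gauss(0.0, 1.0))
    z1 = rng.gauss(0.0, 1.0)
    return location + scale * (delta * z0 + math.sqrt(1.0 - delta ** 2) * z1)


def sample_fragment_length(rng, profile):
    """Draw one fragment length (bp), redrawing values below MIN_LENGTH."""
    while True:
        if profile == "realistic":
            length = skew_normal(rng, **REALISTIC)
        elif profile == "large":
            length = rng.gauss(50000.0, 2000.0)  # ~50 kb
        elif profile == "small":
            length = rng.uniform(1000.0, 2000.0)  # ~1-2 kb
        else:
            raise ValueError(f"unknown profile: {profile}")
        if length >= MIN_LENGTH:
            return int(length)


def sample_fragment_lengths(profile, n, seed=0):
    """Draw n fragment lengths for one profile with a fixed seed."""
    rng = random.Random(seed)
    return [sample_fragment_length(rng, profile) for _ in range(n)]


for name in ("realistic", "large", "small"):
    lengths = sample_fragment_lengths(name, 1000)
    print(name, min(lengths), max(lengths))
```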

Diffusion calculations

  • Sup Figure:
    • 3 taxa of differing GC (same as last)
    • histograms:
      • BD distribution for pre & post diffusion
    • Shows effect of diffusion & how it relates to fragment length
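
A rough sketch of the pre- vs post-diffusion comparison follows. It uses the standard empirical relation between GC content and buoyant density in CsCl (BD = 0.098 * GC + 1.660, with GC as a fraction) and assumes that the diffusion-driven band standard deviation scales with 1/sqrt(fragment length). `DIFFUSION_SCALE` is a hypothetical placeholder constant rather than a calibrated value, and giving every fragment of a taxon the same pre-diffusion BD is a simplification.

```python
import math
import random
import statistics

DIFFUSION_SCALE = 0.3  # hypothetical constant; not calibrated


def gc_to_bd(gc_fraction):
    """Buoyant density (g/ml) of unlabeled DNA from its GC fraction (0-1)."""
    return 0.098 * gc_fraction + 1.660


def diffusion_sd(fragment_length_bp):
    """Assumed band standard deviation: shorter fragments diffuse more."""
    return DIFFUSION_SCALE / math.sqrt(fragment_length_bp)


def simulate_bd(gc_fraction, fragment_length_bp, n, seed=0):
    """Return (pre, post) diffusion BD values for n fragments of one taxon."""
    rng = random.Random(seed)
    center = gc_to_bd(gc_fraction)
    sd = diffusion_sd(fragment_length_bp)
    pre = [center] * n
    post = [rng.gauss(center, sd) for _ in range(n)]
    return pre, post


for length in (1500, 50000):
    _, post = simulate_bd(0.5, length, 2000)
    print(length, round(statistics.stdev(post), 5))
```

Under these assumptions the 'small' fragments spread over a much wider BD range than the 'large' ones, which is the effect the histograms are meant to show.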

Real HR-SIP dataset validation

  • Dataset: Ashley's priming experiment
  • Select OTUs with close genome representative
    • BLASTn of OTU 16S vs genome 16S blast db
  • Simulation of just genomes of interest
  • Simulation of all genomes
  • Plot abundance distributions: real vs simulated

Mock community validation?

  • Goal: validate the simulation model vs a ground-truth real dataset
  • Is this necessary???
  • 10-20 isolates
  • 'Control' community:
    • All DNA is 12C
  • 'Treatment' community:
    • 2 to 4 isolates grown with 13C (~100% 13C incorporation)
  • 2 gradients (full HR-SIP pipeline)
    • add to MiSeq run
  • Marquessa project???
  • Downsides:

    • Costly in time and money
    • Would take months to complete
  • (Sup) Figure:

    • Taxon abundances of real & simulated communities
  • Table:
    • Accuracy of detecting incorporators (simulation vs real dataset)

Example figure: absolute & relative taxon abundances

  • All taxa simulated (n=1210); just top 100 plotted

  • Figure:

    • 1 community
    • faceted:
      • (absolute abundance or relative abundance) ~ (total community or subsampled community)
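
The facets above contrast absolute and relative abundances for the total and the subsampled community. A minimal sketch of the two transformations, assuming subsampling can be approximated by multinomial draws weighted by absolute abundance (reasonable when sequencing depth is far below total abundance); the taxon names and counts are made up:

```python
import random
from collections import Counter


def subsample_counts(absolute_counts, depth, seed=0):
    """Draw `depth` individuals, weighted by absolute abundance."""
    rng = random.Random(seed)
    taxa = list(absolute_counts)
    weights = [absolute_counts[taxon] for taxon in taxa]
    draws = Counter(rng.choices(taxa, weights=weights, k=depth))
    # Keep taxa that received zero draws so both tables share the same keys
    return {taxon: draws.get(taxon, 0) for taxon in taxa}


def relative_abundance(counts):
    """Convert counts to fractions of the sample total."""
    total = sum(counts.values())
    return {taxon: (count / total if total else 0.0) for taxon, count in counts.items()}


# Made-up community for illustration
community = {"taxon_A": 6_000_000, "taxon_B": 3_000_000, "taxon_C": 1_000_000}
subsampled = subsample_counts(community, depth=10_000)
print(relative_abundance(community))
print(relative_abundance(subsampled))
```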

1) How specific & sensitive is our current HR-SIP analytical method?

2) What are some major factors affecting accuracy of identifying incorporators?

Accuracy as a function of isotope incorporation

Variable parameters:

  • % isotope incorporation
    • 0, 25, 50, 100
  • % taxa that incorporate
    • 1, 5, 10, 25, 50
  • n-reps (stochastic: taxon abundances & which incorporate)
    • 10 (20?)
  • total simulations
    • 4 * 5 * 10 = 200
    • time = 200 * 1.6 [hr] = 320 [hr] ≈ 13.3 [days]
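
The run list and time estimate can be generated rather than computed by hand. The sketch below assumes a full factorial design over the levels listed above and the 1.6 hr per simulation figure; the constant and function names are illustrative only.

```python
from itertools import product

PERCENT_ISOTOPE_INCORP = [0, 25, 50, 100]
PERCENT_TAXA_INCORP = [1, 5, 10, 25, 50]
N_REPS = 10
HOURS_PER_SIMULATION = 1.6


def build_runs(isotope_levels, taxa_levels, n_reps):
    """One dict per simulation: every level combination, replicated n_reps times."""
    return [
        {"percent_incorp": incorp, "percent_taxa": taxa, "rep": rep}
        for incorp, taxa, rep in product(isotope_levels, taxa_levels, range(1, n_reps + 1))
    ]


runs = build_runs(PERCENT_ISOTOPE_INCORP, PERCENT_TAXA_INCORP, N_REPS)
total_hours = len(runs) * HOURS_PER_SIMULATION
print(len(runs), "simulations,", total_hours, "hr,", round(total_hours / 24, 1), "days")
```

Changing `N_REPS` to 20 or trimming a level list updates the total run time automatically.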

Set parameters:

  • Community rank-abundance:
    • lognormal
  • Total community abundance:
    • 1e9
  • 'heavy' BD range:
    • 1.71 - 1.75
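
A lognormal rank-abundance community with a fixed total abundance could be drawn as sketched here. The lognormal `mu` and `sigma` values are placeholder assumptions; the total abundance, the 'heavy' BD range, and the taxon count (n=1210, from the example figure) come from the notes.

```python
import random

TOTAL_ABUNDANCE = 10 ** 9
HEAVY_BD_RANGE = (1.71, 1.75)  # fixed 'heavy' window (g/ml)


def lognormal_community(n_taxa, total_abundance=TOTAL_ABUNDANCE, mu=0.0, sigma=1.0, seed=0):
    """Return integer abundances, ranked high to low, that sum to total_abundance."""
    rng = random.Random(seed)
    raw = sorted((rng.lognormvariate(mu, sigma) for _ in range(n_taxa)), reverse=True)
    scale = total_abundance / sum(raw)
    counts = [int(value * scale) for value in raw]
    # Assign the rounding remainder to the most abundant taxon
    counts[0] += int(total_abundance) - sum(counts)
    return counts


community = lognormal_community(1210)
print(sum(community), community[:5], community[-5:])
```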

Analysis:

  • Figure:
    • faceted boxplot
      • x-axis = percent isotope incorporation
      • color/group = percent of taxa to incorporate
      • facet = sensitivity/specificity (and balanced accuracy?)
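
The three metrics in the facets can be computed per simulation from the set of true incorporators and the set of taxa called incorporators. A minimal sketch, with hypothetical function and taxon names:

```python
def accuracy_metrics(true_incorporators, called_incorporators, all_taxa):
    """Sensitivity, specificity, and balanced accuracy for one simulation."""
    true_set = set(true_incorporators)
    called_set = set(called_incorporators)
    tp = fp = tn = fn = 0
    for taxon in all_taxa:
        if taxon in true_set:
            if taxon in called_set:
                tp += 1
            else:
                fn += 1
        elif taxon in called_set:
            fp += 1
        else:
            tn += 1
    # NaN when a class is empty (e.g. no true incorporators at 0% incorporators)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


taxa = [f"t{i}" for i in range(1, 11)]
print(accuracy_metrics({"t1", "t2"}, {"t1", "t3"}, taxa))
```

Balanced accuracy is the mean of sensitivity and specificity, which keeps the score informative when only a small percentage of taxa incorporate.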

Possible extensions:

  • % isotope incorporation
    • normal distribution, varying standard deviation
      • mean = 50
      • sd = (1, 10, 25)
    • uniform distribution
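
For these extensions, each incorporator would get its own % isotope incorporation drawn from a distribution instead of a fixed value. A sketch under the assumption that normal draws are clipped to the 0-100 range (truncation by redrawing would be an alternative):

```python
import random


def sample_percent_incorp(rng, distribution, mean=50.0, sd=10.0, low=0.0, high=100.0):
    """Draw one % isotope incorporation value for a single taxon."""
    if distribution == "normal":
        value = rng.gauss(mean, sd)
        return min(high, max(low, value))  # assumed: clip to the valid range
    if distribution == "uniform":
        return rng.uniform(low, high)
    raise ValueError(f"unknown distribution: {distribution}")


rng = random.Random(0)
for sd in (1, 10, 25):
    values = [sample_percent_incorp(rng, "normal", mean=50.0, sd=sd) for _ in range(1000)]
    print("sd", sd, "min", round(min(values), 1), "max", round(max(values), 1))
```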

Accuracy as a function of community evenness and isotope incorporation

Variable parameters:

  • Community rank-abundance:
    • lognormal abundance distribution
      • very uneven
      • moderately uneven
      • slightly uneven
  • % incorp
    • 0:100:5
  • XCommunity richness:
    • X100 taxa
    • X300 taxa
    • X900 taxa
  • n-reps (stochastic: taxon abundances & which incorporate)
    • 20
  • total simulations
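
One way to make the three evenness levels concrete is to vary the lognormal sigma and report Pielou's evenness (Shannon entropy divided by the log of richness) for each community. The sigma values below are placeholder assumptions, and reading `0:100:5` as 0 to 100 inclusive in steps of 5 is also an assumption.

```python
import math
import random

PERCENT_INCORP_LEVELS = list(range(0, 101, 5))  # assumed reading of 0:100:5

# Placeholder sigma values; larger sigma gives a less even community
SIGMA_BY_EVENNESS = {
    "slightly uneven": 0.5,
    "moderately uneven": 1.5,
    "very uneven": 3.0,
}


def pielou_evenness(abundances):
    """Pielou's J: Shannon entropy / ln(richness), over taxa with abundance > 0."""
    positive = [a for a in abundances if a > 0]
    if len(positive) < 2:
        return float("nan")
    total = sum(positive)
    shannon = -sum((a / total) * math.log(a / total) for a in positive)
    return shannon / math.log(len(positive))


def simulate_abundances(n_taxa, sigma, seed=0):
    """Draw lognormal abundances for n_taxa with the given sigma."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.0, sigma) for _ in range(n_taxa)]


for label, sigma in SIGMA_BY_EVENNESS.items():
    evenness = pielou_evenness(simulate_abundances(1000, sigma))
    print(label, round(evenness, 3))
```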

Set parameters:

  • % taxa that incorporate
    • 10%?
  • Total community abundance:
    • 1e9
  • 'heavy' BD range:
    • 1.71 - 1.75

Analysis:

  • Figure:
    • pointrange (percent_incorp ~ accuracy_value; color by community evenness)

Accuracy as a function of taxon abundance

Variable parameters:

  • Partition communities into 'dominant' and 'rare' subsets
    • dominant/rare cutoff: 1% relative abundance?
  • Assess sensitivity & specificity for each subset
  • Note:
    • Much less time to perform if using data from past simulation run datasets
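
Because this analysis only re-partitions existing results, it can be a small post-processing step. A sketch assuming the tentative 1% relative abundance cutoff and per-taxon abundances from the pre-fractionation community; all names and example values are hypothetical:

```python
def split_by_abundance(abundances, cutoff=0.01):
    """Split taxa into (dominant, rare) sets by relative abundance."""
    total = sum(abundances.values())
    dominant = {taxon for taxon, count in abundances.items() if count / total >= cutoff}
    rare = set(abundances) - dominant
    return dominant, rare


def sens_spec(true_incorporators, called_incorporators, taxa):
    """Sensitivity and specificity restricted to the given subset of taxa."""
    tp = sum(1 for t in taxa if t in true_incorporators and t in called_incorporators)
    fn = sum(1 for t in taxa if t in true_incorporators and t not in called_incorporators)
    fp = sum(1 for t in taxa if t not in true_incorporators and t in called_incorporators)
    tn = sum(1 for t in taxa if t not in true_incorporators and t not in called_incorporators)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up example: 'd' is a rare true incorporator that was missed
abundances = {"a": 500, "b": 300, "c": 195, "d": 3, "e": 2}
true_inc, called_inc = {"a", "d"}, {"a", "b"}
dominant, rare = split_by_abundance(abundances)
print("dominant", sens_spec(true_inc, called_inc, dominant))
print("rare", sens_spec(true_inc, called_inc, rare))
```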

Analysis:

  • Sup Figures:
    • Same as above simulation runs, but split into 'dominant' vs 'rare'

Accuracy as a function of BD window

Variable parameters:

  • Note:
    • Much less time to perform if using data from past simulation run datasets
  • 'Heavy' BD range:
    • BD-min
      • range = [1.67:1.77:0.02]
    • BD-max
      • BD-min + 0.04
    • product: BD-min ~ BD-max
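
The set of 'heavy' windows to evaluate can be enumerated as below. The sketch assumes `[1.67:1.77:0.02]` means start:stop:step with the stop value included, and pairs each BD-min with BD-max = BD-min + 0.04.

```python
def heavy_bd_windows(bd_min_start=1.67, bd_min_stop=1.77, step=0.02, width=0.04):
    """List (BD-min, BD-max) windows; the stop value is assumed to be inclusive."""
    n_steps = int(round((bd_min_stop - bd_min_start) / step))
    windows = []
    for i in range(n_steps + 1):
        bd_min = round(bd_min_start + i * step, 2)  # round away float drift
        windows.append((bd_min, round(bd_min + width, 2)))
    return windows


for bd_min, bd_max in heavy_bd_windows():
    print(f"{bd_min:.2f} - {bd_max:.2f}")
```

The window used in the earlier simulations (1.71 - 1.75) is one of the generated windows, so those results can serve as a reference point.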

Analysis:

  • Figure:
    • Plot showing sensitivities/specificities for each 'heavy' BD range
  • Figure:
    • Taxon abundances of true positives & false negatives
    • Why: displays the issue with selecting one 'heavy' BD window

3) How can we increase the accuracy of our method?

Multiple 'heavy' fraction windows

  • General sliding window approach

  • Binning OTUs by log2fc ~ BD

  • Extrapolating GC for each OTU, with 2-3 different 'heavy' BD windows
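
One possible way to combine several 'heavy' windows, sketched here purely as an assumption and not as a settled method: test each OTU in each window, pool all per-window p-values, apply a single Benjamini-Hochberg correction across the pool, and call an OTU an incorporator if it passes in at least one window. The p-values, OTU names, and `alpha` are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def incorporators_across_windows(pvalues_by_window, alpha=0.1):
    """OTUs significant in at least one window after a global BH correction."""
    keys = [(window, otu) for window, per_otu in pvalues_by_window.items() for otu in per_otu]
    raw = [pvalues_by_window[window][otu] for window, otu in keys]
    adjusted = benjamini_hochberg(raw)
    return {otu for (_, otu), p in zip(keys, adjusted) if p < alpha}


# Made-up p-values for two overlapping windows
example = {
    (1.71, 1.75): {"otu1": 0.001, "otu2": 0.8},
    (1.73, 1.77): {"otu1": 0.5, "otu2": 0.9, "otu3": 0.004},
}
print(sorted(incorporators_across_windows(example)))
```

Correcting across the pooled tests keeps the false discovery rate from inflating as more windows are added, at the cost of some sensitivity per window.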

Time series analysis

  • metagenomeSeq method
