Description:

Notebooks in this directory describe the isopycnicGenomes workflow, in which probability density functions are fit to the fragment BD distribution simulated for each genome.

  • PDF = probability density function
  • KDE = kernel density estimation
  • Implementing some of the workflow in Python (except simulating the initial communities, which is done with Grinder).

Experiments for testing incorporation using the modeling tool

atom % 13C

  • Simulates isotope dilution or short incubations
    • Method
      • incorporation % treatments: 10, 20, 40, 60, 80, 100%
      • Total treatments: 6
      • Title: perc_incorp

Variable incorporation

  • Simulate variable percentages of incorporation and see which taxa are ID'ed
  • Is there a fuzzy cutoff of percent incorporation that is detectable?
    • Method
      • 25% of taxa incorporate
      • incorporation % distribution:
        • uniform: 0-100, 50-100, 0-50
        • normal: mean=50, sd=c(0,10,20,30)

Few incorporators vs many (e.g., ~10% vs ~50%)

  • Simulates communities of specialists vs generalists
    • Or a more recalcitrant vs labile substrate
    • Method
      • factorial
      • % incorporator treatments: 5, 10, 20, 40, 60%
      • % incorporation: 20, 50, or 100%
      • Total treatments: 5 x 3 = 15
      • Title: perc_tax-incorp

Community evenness

  • differing levels of evenness
    • Method
      • evenness levels: uniform, log (differing params)

Community richness

  • soil vs simpler habitat
    • Method
      • N-taxa treatments: 100, 500, 1000 taxa

Relative abundance of incorporators

  • dominant vs rare responders
    • Could just do a post-hoc analysis on which taxa were detected by DESeq2

Split populations: active vs dead/dormant

  • only some individuals incorporate isotope
    • Method
      • 25% of taxa incorporate
      • factorial
      • % split populations: 10, 50, 100%
      • incorporation % treatments: 10, 50, 100%

Split populations: full incorp vs partial

  • only some individuals incorporate isotope
    • Method
      • 25% of taxa incorporate
      • factorial
      • % split populations: 100%
      • all in 1 sub-population incorporate 100%
      • partial incorporation % treatments: 10, 20, 40, 60, 80, 100%

Differences in rank-abundance between control and treatment

  • differing numbers of taxa with varying rank-abundances in the community
    • Method
      • Simulate communities with different levels of shared rank-abundances
      • SIPSim gradientComms --perm_perc

Number of biological replicates

  • N-replicates to test: 1, 3, 5
  • Create 5 replicate control and treatment communities
    • Test for incorp ID accuracy with differing number of replicates
    • How to combine replicates?

Cross feeding

  • Incorporators at nearly 100% incorporation, with some partial secondary feeders
    • Method
      • Incorporators: normal dist, mean=90, sd=5
      • Secondary feeders: normal dist, mean=c(10,20,30,40,50), sd=10

Phylogenetic signal vs random trait distribution

  • Random trait distributions coupled with high levels of microdiversity could cause 'split populations' and dilute the signal of incorporation
  • Cluster phylotypes (genomes) to produce coarse-grained phylotypes and see if the incorporation signal is lost
    • Method
      • 25% of taxa incorporate
      • taxonomic clustering: genus, family, order, class, phylum
      • % incorporation: 10, 20, 40, 60, 80, 100

General Workflow

Fragment GC distributions

fragSim.py

  • input:
    • genomes
    • number of fragments per genome
    • [primers]
  • workflow:
    • Foreach genome:
      • select amplicons (if needed)
      • simulate fragments
      • calculate GC content
  • output: table of fragment GC contents
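
A minimal sketch of the per-genome step above (random shotgun fragments plus GC calculation); the function names and the uniform fragment-length range are assumptions, not the fragSim.py interface:

import random

def gc_content(seq):
    # fraction of G+C bases; ambiguous bases are ignored
    seq = seq.upper()
    gc = seq.count('G') + seq.count('C')
    atgc = gc + seq.count('A') + seq.count('T')
    return float(gc) / atgc if atgc else 0.0

def sim_fragments(genome_seq, n_frags, frag_len_range=(4000, 8000)):
    # simulate n_frags random fragments from one genome; return their GC values
    gc_values = []
    for _ in range(n_frags):
        frag_len = random.randint(*frag_len_range)
        start = random.randint(0, max(0, len(genome_seq) - frag_len))
        gc_values.append(gc_content(genome_seq[start:start + frag_len]))
    return gc_values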

Community abundance and incorporation

Simulate communities for each gradient fraction

gradientComms.py

  • Goal:
    • simulate abundance of taxa in >=1 community
  • Input:
    • Genome list file
    • Grinder options (profile file)
  • workflow:
    • call modified Grinder with options
  • output:
    • table of taxon abundances for each sample
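
A minimal sketch of the "call modified Grinder" step; -reference_file, -profile_file, and -output_dir are standard Grinder options, but the executable name and file names here are assumptions and the modified Grinder may differ:

import subprocess

def run_grinder(reference_fasta, profile_file, out_dir, grinder_exe='grinder'):
    # shell out to Grinder with a reference FASTA of genomes and a profile file of options
    cmd = [grinder_exe,
           '-reference_file', reference_fasta,
           '-profile_file', profile_file,
           '-output_dir', out_dir]
    subprocess.check_call(cmd)

# example (hypothetical file names):
# run_grinder('genomes.fna', 'grinder_profile.txt', 'comm_out')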

Simulate isotope incorp

isoIncorp.py

  • Goal:

    • for each taxon in the community, how is isotope incorp distributed across individuals?
      • example: taxonX incorp is normally distributed with mean of X and s.d. of Y
      • user defines distribution and params
      • can these distributions be 'evolved' across the tree (if provided)?
        • brownian evolution of each param?
      • 'Fragments' assumed to be pulled randomly from taxon population
  • Input:

    • community file
    • [isotope]
    • [percent taxa with any incorp]
    • [intra-taxon incorp: specify distribution and params]
      • special: GMM: [GMM, weights, Norm1_params, Norm2_params]
    • inter-taxon incorp:
      • specify distributions of how intra-taxon params vary
      • OR phylogeny: distribution params 'evolved' across tree
  • Workflow:
    • Load community file
    • if phylogeny:
      • call R script to get which taxa incorporate based on tree (brownian motion)
      • % of taxa defined by user
    • else:
      • random selection of taxa
      • % of taxa defined by user
    • For incorporators (randomly ordered!):
      • if phylogeny:
        • brownian motion evo. of intra-taxon incorp distribution params
      • else:
        • select intra-taxon incorp distribution params from inter-taxon param distribution
    • For non-incorps:
      • uniform distribution with min = 0 & max = 0
  • Output (incorp file):
    • tab-delim table:
      • sample, taxon, intra-taxon_incorp_distribution_type, distribution_params...
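
A minimal sketch of the non-phylogeny branch: randomly select incorporators, give each a normal intra-taxon incorporation distribution whose mean is drawn from an inter-taxon distribution, and give non-incorporators uniform(0, 0). The function, column, and default parameter names are assumptions:

import csv
import random

def assign_incorp(taxa, perc_taxa=25.0, inter_mean=50.0, inter_sd=10.0,
                  intra_sd=5.0, out_file='incorp.txt', sample=1):
    # taxa: list of taxon names for one community (sample)
    n_incorp = int(round(len(taxa) * perc_taxa / 100.0))
    incorporators = set(random.sample(taxa, n_incorp))
    with open(out_file, 'w', newline='') as out:
        w = csv.writer(out, delimiter='\t')
        w.writerow(['sample', 'taxon', 'distribution', 'param1', 'param2'])
        for taxon in taxa:
            if taxon in incorporators:
                # intra-taxon mean drawn from the inter-taxon distribution, bounded to 0-100%
                mu = min(100.0, max(0.0, random.gauss(inter_mean, inter_sd)))
                w.writerow([sample, taxon, 'normal', round(mu, 2), intra_sd])
            else:
                # non-incorporators: uniform distribution with min = max = 0
                w.writerow([sample, taxon, 'uniform', 0, 0])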

Creating OTU tables

make_OTU_table.py

  • input:

    • frag GC file
    • community file
    • incorp file
    • [BD distribution [default: cauchy, params...]]
    • [weight (multiplier) for incorp by abundance]
  • class creation:

    • fragGC
      • KDE fit to fragment G+C values for each taxon
      • class:
        • library : taxon : KDE fit
      • simulate gradients (BD fractions)
        • min-max based on theoretical min-max BD
        • class:
          • library : bin : otu : count
          • function: place_in_bin(self, library, otu, BD)
    • comm
      • subclassed pandas dataframe?
    • incorp file
      • parse into distributions for each lib-taxon
      • dict-like:
        • library : taxon : pymix_distribution
    • otu_table
      • dict-like -- library : fraction : OTU : OTU_count
  • workflow:

    • load community file, incorp file, fragGC file
    • Foreach sample:
      • Foreach taxon (from comm class):
        • sample GC value from KDE
          • option: Log GC values from KDE
        • select perc. incorp from intra-taxon incorp-distribution
          • option: Log perc. incorp
        • calculate BD (GC + perc. incorp)
        • simulate gradient noise:
          • select value (new BD) from a (Cauchy) distribution centered on the calculated BD
        • bin BD value in corresponding fraction
          • save: sample => fraction => OTU
            • save as pandas dataframe?
    • write out OTU table (frag counts for taxon in fraction = OTU/taxon abundance)
  • output:
    • OTU table:
      • rows=OTU, columns=library-fraction
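
A minimal sketch of the per-taxon BD step above. The GC-to-BD conversion (BD ≈ 0.098 × GC fraction + 1.660 g/ml) and the ~0.036 g/ml maximum shift for full 13C labeling are commonly used values from the SIP literature; the function names, Cauchy noise scale, and fraction breaks are assumptions:

import numpy as np
from scipy import stats

MAX_13C_SHIFT = 0.036  # approx. g/ml BD increase at 100% 13C incorporation

def gc_to_bd(gc_fraction):
    # buoyant density of DNA from GC content (Schildkraut relationship)
    return 0.098 * gc_fraction + 1.660

def taxon_fraction_counts(gc_values, perc_incorp, bd_breaks, n_frags=10000,
                          noise_scale=0.0005):
    # KDE-fit the taxon's fragment GC values, sample fragments, convert to BD
    # (isotope shift + Cauchy gradient noise), and bin into gradient fractions
    kde = stats.gaussian_kde(gc_values)
    gc_sample = kde.resample(n_frags)[0]
    bd = gc_to_bd(gc_sample) + MAX_13C_SHIFT * (perc_incorp / 100.0)
    bd = bd + stats.cauchy.rvs(loc=0, scale=noise_scale, size=n_frags)
    counts, _ = np.histogram(bd, bins=bd_breaks)
    return counts  # fragment counts per fraction (length = len(bd_breaks) - 1)

# example (hypothetical breaks): fractions spanning BD 1.66-1.78 in 0.004 g/ml steps
# counts = taxon_fraction_counts(gc_values, perc_incorp=50, bd_breaks=np.arange(1.66, 1.784, 0.004))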

Validation

Fragment simulation

  • E. coli
  • Plotting fragment length distributions under different conditions

KDE simulation

  • E. coli

Bandwidth

  • Testing the effect of the bandwidth param
  • Plotting/comparing distributions made with differing params
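
A minimal sketch of the bandwidth comparison, assuming scipy's gaussian_kde and matplotlib; the bandwidth values are arbitrary examples:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def compare_bandwidths(gc_values, bandwidths=(0.05, 0.2, 'scott', 'silverman')):
    # refit the same fragment GC values with different bw_method settings and overlay the densities
    x = np.linspace(min(gc_values), max(gc_values), 500)
    for bw in bandwidths:
        kde = stats.gaussian_kde(gc_values, bw_method=bw)
        plt.plot(x, kde(x), label=str(bw))
    plt.xlabel('fragment GC (fraction)')
    plt.ylabel('density')
    plt.legend(title='bw_method')
    plt.show()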

Monte Carlo estimation

  • Testing effect of Monte Carlo estimation of distributions
  • Plotting/comparing distributions made with differing params

Community

  • Plotting community rank-abundances with different params
  • Calculating beta diversity with different params

Incorporation

  • E. coli
  • Plotting BD distribution with differing incorporation settings
  • Settings:
    • Uniform incorporation
    • Normal incorporation
    • Split populations

Gradient fractions

  • Plotting gradient fractions with differing params

~OLD~

Gaussian mixture models with pymix

import mixture  # PyMix; some installs expose this as 'from pymix import mixture'

# two-component Gaussian mixture: equal weights, N(-2, 0.4) and N(2, 0.6)
n1 = mixture.NormalDistribution(-2, 0.4)
n2 = mixture.NormalDistribution(2, 0.6)
m = mixture.MixtureModel(2, [0.5, 0.5], [n1, n2])
print(m)
m.sample()  # draw a single value from the mixture

Fragment GC distributions

fragSim.py

  • input:
    • genomes
    • number of fragments per genome
    • [primers]
  • workflow:
    • loading genomes as flat-file db
    • loading primers as biopython seq records
    • Foreach genome:
      • find amplicons (if needed)
        • load genome seq as Dseqrecord
        • in-silico PCR with pydna
      • simulate fragments
        • get position from amplicon, simulate fragment around amplicon, pull out sequence from genome
      • calculate GC content
      • apply KDE to GC values (non-parametric approach)
  • output:
    • pickle: genome => KDE_object

TODO:


~OLD~


Specific Workflow

Simulate isotope incorporation

  • makeIncorp_phylo.pl
  • already complete

Define pre-isopycnic community

  • preIsopycnicComm.pl
  • really, just utilizing Grinder
  • complete?

Simulate genome fragments and calculate BD

  • simFrags.py
  • simulation of a certain number of fragments from each genome
    • simulation of amplicon or metagenome fragments
    • foreach fragment:
      • calculate GC
      • calculate BD based on isotope incorp of genome
  • Output:
    • csv: sample, genome, GC, BD

Create gradient communities

  • simGradientComms.py
  • Simulate communities for each gradient fraction
    • simGradientComms.py
    • simulate fractions
      • different fractions for each gradient (each sample)
    • Foreach sample: Foreach genome:
      • apply KDE to BD values (non-parametric approach)
      • simulate number of 'fragments' needed to meet abs abund as defined in pre-isopycnic community
      • bin sim-frags by fraction
    • Output:
      • csv: sample, genome, abs_abundance, rel_abundance

~OLD~


questions:

  • Can biopython simulate PCR (as done with bioperl)?

TODO:

  • simFractions.py
    • simulate gradient fractions
    • input: simulated community
    • output: table
      • sample, fraction_num, BD_start, BD_end
    • params:
      • isotope(s)
        • determines possible max BD
      • gradient min-max
      • fraction size dist: mean, stdev
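
A minimal sketch of the fraction-size simulation described above (normal draws for fraction width between the gradient min and max BD); the default values and the minimum-width floor are assumptions:

import random

def sim_fractions(bd_min=1.660, bd_max=1.780, size_mean=0.004, size_sd=0.0015,
                  min_size=0.001):
    # return a list of (fraction_num, BD_start, BD_end) spanning the gradient,
    # with fraction widths drawn from a normal(size_mean, size_sd) distribution
    fractions = []
    start, frac_num = bd_min, 1
    while start < bd_max:
        size = max(min_size, random.gauss(size_mean, size_sd))
        end = min(start + size, bd_max)
        fractions.append((frac_num, round(start, 4), round(end, 4)))
        start, frac_num = end, frac_num + 1
    return fractions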

isopycnic.py scheme

  • input:

    • genome fasta
    • sim community file
    • isotope incorp file
    • fraction file
  • main:

    • foreach genome (workflow independent by genome; parallel)
      • bio.sequence instance for genome
      • isopycnic instance (incorp, simComm, script_args)
        • both tables loaded as pandas DataFrames
        • incorp table melted; column of sample index
      • creating read index (genome location where reads originated)
        • artificial PCR to select regions if amplicon fragment (primers provided)
        • random fragments if shotgun
        • generate by calling grinder?
      • simulating fragment start-end
        • from read start-end
      • fragmenting genome
        • load genome
        • select sequence for each frag start-end
        • calculate GC
        • calculate BD
      • array of BD values:
        • write out [if wanted]
        • fit to distributions

Specific Workflow - take 2

Simulate isotope incorporation

  • makeIncorp_phylo.pl
  • already complete

Define pre-isopycnic community

  • preIsopycnicComm.pl
  • really, just utilizing Grinder
  • complete?

Simulate genome fragments and calculate BD

  • isopycnic.pl
    • alter:
      • write out the BD of each simulated fragment
    • output: sample, genome, frag_scaf, frag_start, frag_end, GC, BD

Fitting BD values to a PDF

  • fitBD.py
    • load table as dict {genome: [BD_values,]}
    • foreach genome (parallel):
      • fit data to distributions
      • output table:
        • genome, AIC, BIC, distribution, l-moment(s)
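
A minimal sketch of the fitting step using scipy's maximum-likelihood fit, with AIC/BIC computed from the fitted log-likelihood; the candidate distribution list is an assumption, and l-moment fitting would need a separate package:

import numpy as np
from scipy import stats

def fit_bd(bd_values, candidates=('norm', 'cauchy', 'gamma')):
    # fit BD values to each candidate distribution; return (name, AIC, BIC, params), best AIC first
    bd_values = np.asarray(bd_values)
    n = len(bd_values)
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(bd_values)
        loglik = np.sum(dist.logpdf(bd_values, *params))
        k = len(params)
        results.append((name, 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik, params))
    return sorted(results, key=lambda r: r[1])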

Create gradient communities

  • makeGradientComms.py
  • input:
    • fitBD.py output
    • community abundance file
  • Foreach genome: random draws from PDFs

    • N-draws based on total community abundance & relative abundances of taxa
    • using scipy for drawing from distribution
  • output table of OTU abs abundance

    • each genome = OTU
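
A minimal sketch of the random-draw step: the number of draws per genome is proportional to its relative abundance, and the draws come from the fitted scipy distribution; the dict layout is an assumption:

from scipy import stats

def draw_bd(fits, rel_abund, total_frags=100000):
    # fits: {genome: (scipy distribution name, fitted params)}
    # rel_abund: {genome: relative abundance (summing to 1)}
    draws = {}
    for genome, (name, params) in fits.items():
        n = int(round(total_frags * rel_abund.get(genome, 0.0)))
        dist = getattr(stats, name)
        draws[genome] = dist.rvs(*params, size=n)
    return draws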

Specific Workflow - take 3

Simulate isotope incorporation

  • makeIncorp_phylo.pl
  • already complete

Define pre-isopycnic community

  • preIsopycnicComm.pl
  • really, just utilizing Grinder
  • complete?

simulate reads

  • grinderSE.pl
    • altered Grinder that just outputs the start-end of reads (not the read sequences themselves)
    • parallel:
      • for each sample-genome
    • Output: sample, genome, scaffold, read_start, read_end

simulate fragments

  • fragSim.py
    • parallel:
      • for each sample-genome
      • determine frag start-end from genome
      • get fragment, calculate G+C and BD

Simulate genome fragments and calculate BD

  • isopycnic.py
    • output: sample, genome, frag_scaf, frag_start, frag_end, GC, BD

Fitting BD values to a PDF

  • fitBD.py
    • load table as dict {genome: [BD_values,]}
    • foreach genome (parallel):
      • fit data to distributions
      • output table:
        • genome, AIC, BIC, distribution, l-moment(s)

Create gradient communities

  • makeGradientComms.py
  • input:
    • fitBD.py output
    • community abundance file
  • Foreach genome: random draws from PDFs

    • N-draws based on total community abundance & relative abundances of taxa
    • using scipy for drawing from distribution
  • output table of OTU abs abundance

    • each genome = OTU
