Supplementary Information: Holmes et al. 2019

2. Full model fitting

This notebook describes fitting of a Bayesian hierarchical model of the effects of control (growth) and treatment (passage) on individual genes from E. coli DH10B (carrier) and Sakai (BAC load), to data obtained using a multi-E. coli microarray.

Much of the code for the visualisation, analysis and data manipulation of the fitting results is found in the associated Python module tools.py, which should also be present in this directory.

The model fit can be downloaded directly from the Zenodo repository, for use in this notebook:

A code cell in the notebook below will attempt to make this download for you if the file does not already exist.

Experiment summary and interpretation
Building the model
Inspecting the fit
1. Median parameter estimates
Identifying locus tags that confer an advantage under treatment
1. Plotting distribution of effects
2. Identifying candidates
Manuscript Figure 1

Experiment summary and interpretation

The experiment involves measuring changes in microarray probe intensity before and after a pool of bacteria is subjected to one of two processes:

a sample from the pool is grown in media to a defined OD, then subsampled. This growth/subsample process is repeated n times. [control]
a sample from the pool is applied to plant leaves, subsampled, and that subsample grown in media to a defined OD, then subsampled. This passage/subsample/growth/subsample process is repeated n times. [treatment]

In a single replicate, the microarray is exposed to genomic DNA extracted from the pool (i) before the experiment begins, and (ii) after the experiment concludes. Three replicates are performed.

All genes in all samples go through the *growth and subsampling* part of the experiment, and we wish to estimate the effect of *passage and subsampling* on individual genes.

The pool of bacteria comprises E. coli DH10B as a carrier organism. The pool is heterogeneous, in that individual cells also contain BACs encoding random stretches of the E. coli Sakai chromosome. We therefore expect carrier organism genes to be unaffected by passage (treatment), and for any effects to be detectable only for genes that originate from E. coli Sakai.

We expect that genes conferring a phenotypic/selective advantage only for association with the plant should be enriched at the end of the treatment experiment, but not at the end of the control experiment. Sakai genes that are enriched in both treatment and control experiments may be generally advantageous for growth, but those giving a selective advantage on passage through the plant could be specifically adaptive in an environmental context.

As the BACs describe contiguous regions of the *E. coli* Sakai genome, there is the possibility that linkage disequilibrium could result in some genes that do not confer an advantage by themselves apparently displaying enrichment after treatment.

If the biological function conferring an advantage during passage is encoded by a suite of coregulated genes in an operon, we might expect all members of this suite to show evidence of enrichment after passage. It is likely that clusters of enrichment for operons or regulons post-passage will be seen in the results. Although we are not accounting for this clustering or association by operon directly in this model, it is a possible additional hierarchical term in future iterations of the model.

We should expect there to be a selective burden to the carriage of additional non-functional gDNA as BACs, so we might also anticipate a slightly negative effect on recovery under control conditions.

Python imports



In [1]:

    
%pylab inline

import os
import pickle
import warnings; warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import pystan
import scipy
import seaborn as sns; sns.set_context('notebook')

from Bio import SeqIO

import tools









    



Populating the interactive namespace from numpy and matplotlib

Building the model

We assume that each array probe $i$ (array probes take a unique values of $i$ in the context of the entire experiment; that is, $i$ is unique for probe X replicate X treatment) measures hybridisation of genomic DNA (gDNA) in the sample that corresponds to a single gene $j[i]$, and that the measured intensity of probe $i$ relates directly to the corresponding amount of gDNA in the sample. There may be multiple probes relating to a single gene, so it is possible that $j[p] = j[q], p \neq q$.

This establishes a basis for pooling probe-level effects as samples of the gene-level effect.

We define the (input) measurement of a probe before an experiment as $x_i$, and the (output) measurement of that probe after the experiment as $y_i$. We assume that the measurement of each probe is subject to random experimental/measurement error that is normally-distributed with mean zero and variance $\sigma_y^2$. The actual quantity of DNA measured after the experiment can then be represented as $\hat{y}$, and the irreducible error in this experiment as $\epsilon$ ($\epsilon_i$ serves to include the irreducible errors in measuring both $x_i$ and $y_i$; all errors are assumed to be Normal, so their linear combinations are also Normal).

$$y_i = \hat{y_i} + \epsilon_i$$$$\epsilon_i \sim N(0, \sigma_y^2) \implies y_i \sim N(\hat{y_i}, \sigma_y^2)$$

The relationship between the input and output DNA quantities measured by a single probe can be represented as $\hat{y_i} = f(x_i)$. That is to say, that the measured input DNA quantity $x_i$ is a predictor of the output quantity. This relationship will be modelled as the sum of two linear effects:

$$\textrm{control effect} = \alpha + \beta x$$$$\textrm{treatment effect} = \gamma + \delta x$$$$\hat{y_i} = \textrm{control effect}(x_i) + \textrm{treatment effect}(x_i) = \alpha + \beta x_i + \gamma + \delta x_i$$

As these are linear effects, we have intercept/offset parameters ($\alpha$, $\gamma$) and gradient/slope parameters ($\beta$, $\delta$).

Where $\beta$ or $\delta$ are large, they would indicate large $x_i$-dependent effects of the control (growth) and treatment (passage) parts of the experiment respectively.

As formulated above, the four parameters would be identical for all probes, but we are interested in estimating the control and treatment effects for individual genes, so we require a set of parameters for each gene (as it corresponds to probe $i$): $j[i]$. This is appropriate for the effects of growth/treatment that are specific to the levels of a single gene: $\beta$ and $\delta$.

The remaining parameters $\alpha$ and $\gamma$, the offsets from zero for each probe, could be considered to be constant across each replicate of both control and treatment experiments. They are possibly more realistically considered to be different for each array (i.e. each combination of replicate and treatment).

The offset for any particular array can be hierarchically modelled as being drawn from a distribution representing all arrays, and we require one parameter for each of the arrays, so that for probe $i$ the corresponding array for that experiment is $k[i]$.

As a result, we estimate $\alpha_{k[i]}$, $\beta_{j[i]}$, $\gamma_{k[i]}$, $\delta_{j[i]}$, and the relationship for each probe is modelled as:

$$\hat{y_i} = \textrm{control effect}_{j[i]}(x_i) + \textrm{treatment effect}_{j[i]}(x_i) = \alpha_{k[i]} + \beta_{j[i]} x_i + \gamma_{k[i]} + \delta_{j[i]} x_i$$

The parameters $\alpha_{k[i]}$, $\beta_{j[i]}$, $\gamma_{k[i]}$, $\delta_{j[i]}$ (and $\epsilon_i$) are to be estimated by the model fit.

We assume that the values of each parameter, e.g. $\alpha_{k[i]}$, are drawn from a single *pooled distribution* for that parameter, $\alpha \sim \textrm{some distribution}$.

This pooling ensures that our fits are not completely pooled as a single estimate $\alpha_{k[i]} = \alpha$, which would imply that all parameter estimates are constant for all genes/arrays, a situation that would be completely uninformative for our goal to identify gene-level effects, and which would underfit our model. It also means that our estimates are not completely unpooled, which would allow all parameter estimates to vary independently. That situation would be equivalent to simultaneously fitting independent linear relationships to each gene, and so risk overfitting our model to the measured data.

NOTE: By using a *pooled distribution*, we allow a parameter estimate for each gene to influence the estimates of that parameter for all other genes in the experiment, constrained by an expected distribution of that parameter's values. To do this, we define a *prior distribution* for each parameter, but we do not specify its mean or variance, allowing the parameters of these *pooled distributions* also to be estimated when fitting our model.

For each parameter's prior we choose a Cauchy distribution, because it has fat tails and infinite variance. This does not constrain outlying and extreme values (those we are interested in) so much as other distributions (e.g. Normal or Student's t):

$$\alpha_{k[i]} \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha}^2)$$$$\beta_{j[i]} \sim Cauchy(\mu_{\beta}, \sigma_{\beta}^2)$$$$\gamma_{k[i]} \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma}^2)$$$$\delta_{j[i]} \sim Cauchy(\mu_{\delta}, \sigma_{\delta}^2)$$

Each parameter's prior distribution requires a fit of both its mean and variance, and these also become parameters in our model. The means are free to vary, but we assume that the variance of each parameter's prior can be drawn from a Uniform distribution on the range (0, 100):

$$\sigma_{\alpha} \sim U(0, 100)$$$$\sigma_{\beta} \sim U(0, 100)$$$$\sigma_{\gamma} \sim U(0, 100)$$$$\sigma_{\delta} \sim U(0, 100)$$

We therefore construct the following model of the experiment: $$\hat{y_i} = \alpha_{k[i]} + \beta_{j[i]} x_i + \gamma_{k[i]} t_i + \delta_{j[i]} t_i x_i$$ $$y_i \sim N(\hat{y_i}, \sigma_y^2)$$ $$\alpha_{k[i]} \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha}^2)$$ $$\beta_{j[i]} \sim Cauchy(\mu_{\beta}, \sigma_{\beta}^2)$$ $$\gamma_{k[i]} \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma}^2)$$ $$\delta_{j[i]} \sim Cauchy(\mu_{\delta}, \sigma_{\delta}^2)$$ $$\sigma_{\alpha} \sim U(0, 100)$$ $$\sigma_{\beta} \sim U(0, 100)$$ $$\sigma_{\gamma} \sim U(0, 100)$$ $$\sigma_{\delta} \sim U(0, 100)$$ $$\sigma_y \sim U(0, \infty)$$

$y_i$: measured intensity output on the array for probe $i$ (specific to each replicate)
$\hat{y_i}$: actual probe intensity for probe $i$ (specific to each replicate)
$x_i$: measured intensity input on the array for probe $i$ (specific to each replicate)
$t_i$: 0/1 pseudovariable indicating whether the probe $i$ was measured in a control (0) or treatment (1) experiment
$\alpha_{k[i]}$: control effect offset for treatment X replicate $k$ (used for probe $i$)
$\mu_{\alpha}$: mean control effect offset for all arrays
$\sigma_{\alpha}$: control effect offset variance for all arrays
$\beta_{j[i]}$: control effect slope for gene $[j[i]$
$\mu_{\beta}$: mean control effect slope for all genes
$\sigma_{\beta}$: control effect slope variance for all genes
$\gamma_{k[i]}$: treatment effect offset for treatment X replicate $k$ (used for probe $i$)
$\mu_{\gamma}$: mean treatment effect offset for all arrays
$\sigma_{\gamma}$: treatment effect offset variance for all arrays
$\delta_{j[i]}$: treatment effect slope for gene $j[i]$
$\mu_{\delta}$: mean treatment effect slope for all genes
$\sigma_{\delta}$: treatment effect slope variance for all genes
$\sigma_y$: variance in measurement due to irreducible error

Load input data for fit

In the cells below we load in the data to be fit, and define useful variables for inspecting/analysing the data later:

locus_tags: the unique locus tags represented in the dataset
ntags: the number of unique locus tags
arrays: the arrays (combinations of replicate X treatment) used in the experiment
narrays: the number of arrays used
outdir: path to the directory in which to place model fit output
outfile: path to the model fit output file (pickled dataframe)



In [2]:

    
# load clean, normalised, indexed data
data = pd.read_csv(os.path.join("datasets", "normalised_array_data.tab"), sep="\t")  # full dataset
#data = pd.read_csv("datasets/reduced_locus_data.tab", sep="\t")  # reduced dataset
#data = data[:100]  # uncomment this for debugging

# useful values
locus_tags = data['locus_tag'].unique()
ntags = len(locus_tags)
arrays = data['repXtrt'].unique()
narrays = len(arrays)



In [3]:

    
# Create output directory and filename to hold the fitted model
outdir = "model_fits"
os.makedirs(outdir, exist_ok=True)
outfile = os.path.join(outdir, 'full_model_fit.pkl')

Stan model construction

We need to define data, parameters and our model for Stan.

In the `data` block, we have:

`N`: `int`, the number of data points
`J`: `int`, the number of unique locus tags (`J` < `N`)
`K`: `int`, the number of unique treatment X replicate combinations (arrays)
`array`: `int[N]`, an index list of arrays
`tag`: `int[N]`, an index list of locus tags
`t`: `vector[N]`, 0/1 control/treatment values for each probe
`x`: `vector[N]`, the input log(intensity) values
`y`: `vector[N]`, the output log(intensity) values

In the `parameter` block, we have:

`a`: `real vector[K]`, estimated offset effect on log(intensity) of the *control* for each array
`mu_a`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the *control* effect offset, for all arrays
`sigma_a`: `real`, standard deviation of the Cauchy distribution for the *control* effect offset, for all arrays
`b`: `real vector[J]`, estimated slope effect on log(intensity) of the *control* for each locus tag/gene
`mu_b`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the *control* effect slope, for all locus tags
`sigma_b`: `real`, standard deviation of the Cauchy distribution for the *control* effect slope, for all locus tags
`g`: `real vector[K]`, estimate of the influence of treatment on the output measured intensity (offset) for array
`mu_g`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the offset for all arrays due to *treatment*
`sigma_g`: `real`, standard deviation of the Cauchy distribution for the offset for all arrays due to *treatment*
`d`: `real vector[J]`, estimate of the influence of treatment on the output measured intensity (slope) for each locus tag/gene
`mu_d`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the slope for all locus tags due to *treatment*
`sigma_d`: `real`, standard deviation of the Cauchy distribution for the slope for all locus tags due to *treatment*
`sigma`: `real`, the irreducible error in the experiment/model

We also define a `transformed parameter`:

`y_hat[i] <- b[tag[i]] * x[i] + a[array[i]] + g[tag[i]] * t[i] + d[array[i]] * t[i] * x[i]`: the linear relationship describing $\hat{y}$, our estimate of experimental output intensity, which is subject to variance `sigma`.

Define and fit the Stan model

In the cells below we define the model to be fit, in the Stan language, conduct the fit, and save the fit out to a pickled dataframe (or load it in from one, depending on which code is commented out).



In [4]:

    
# define unpooled stan model
treatment_model = """
data {
  int<lower=0> N;
  int<lower=0> J;
  int<lower=0> K;  
  int<lower=1, upper=J> tag[N];
  int<lower=1, upper=K> array[N];  
  vector[N] t;
  vector[N] x;
  vector[N] y;
}
parameters {
  vector[K] a;
  vector[J] b;
  vector[K] g;
  vector[J] d;
  real mu_a;
  real mu_b;
  real mu_g;
  real mu_d;
  real<lower=0> sigma;
  real<lower=0,upper=100> sigma_a;
  real<lower=0,upper=100> sigma_b;
  real<lower=0,upper=100> sigma_g;
  real<lower=0,upper=100> sigma_d;
}
transformed parameters{
  vector[N] y_hat;

  for (i in 1:N)
    y_hat[i] = a[array[i]] + b[tag[i]] * x[i] + g[array[i]] * t[i] + d[tag[i]] * t[i] * x[i];
}
model {
  sigma_a ~ uniform(0, 100);
  a ~ cauchy(mu_a, sigma_a);

  sigma_b ~ uniform(0, 100);
  b ~ cauchy(mu_b, sigma_b);

  sigma_g ~ uniform(0, 100);
  g ~ cauchy(mu_g, sigma_g);

  sigma_d ~ uniform(0, 100);
  d ~ cauchy(mu_d, sigma_d);

  y ~ normal(y_hat, sigma);
}
"""



In [5]:

    
# relate python variables to stan variables
treatment_data_dict = {'N': len(data),
                       'J': ntags,
                       'K': narrays,
                       'tag': data['locus_tag_index'] + 1,
                       'array': data['repXtrt_index'] + 1,
                       't': data['treatment'],
                       'x': data['log_input'],
                       'y': data['log_output']}

At this point, you have two options to obtain the model fit data

Run the model fit 'live' in the notebook. This may take several hours. USE CELL (1)
1. (optionally) save the newly-generated model fit to a local file. USE CELL (2)
Load the model fit from a local file. USE CELL (4)
1. If you have not generated the data locally, then you can download it from Zenodo USE CELL (3) FIRST.

It may be quicker to download the data from Zenodo using the button below, than to use cell (3), but be sure to place the downloaded file in the correct location as specified in the variable outfile.



In [6]:

    
# (1) USE THIS CELL TO RUN THE STAN FIT - takes a few hours on my laptop
#treatment_fit = pystan.stan(model_code=treatment_model,
#                            data=treatment_data_dict,
#                            iter=1000, chains=2,
#                            seed=tools.SEED)



In [7]:

    
# (2) USE THIS CELL TO SAVE THE STAN FIT TO A PICKLE FILE
#unpermutedChains = treatment_fit.extract()
#unpermutedChains_df = pd.DataFrame([dict(unpermutedChains)])
#pickle.dump(unpermutedChains_df, open(outfile, 'wb'))



In [8]:

    
# (3) USE THIS CELL TO DOWNLOAD THE STAN FIT FROM ZENODO: DOI:10.5281/zenodo.269638
# The file will not be downloaded if it already exists locally.
# The file is 0.5GB in size, so may take some time to download
import urllib.request
if not os.path.isfile(outfile):
    zenodo_url = "https://zenodo.org/record/269638/files/full_model_fit.pkl"
    response = urllib.request.urlretrieve(zenodo_url, outfile)



In [9]:

    
# (4) USE THIS CELL TO LOAD THE STAN FIT FROM A PICKLE FILE
# Import the previously-fit model
treatment_fit = pd.read_pickle(open(outfile, 'rb'))

Extract the fit

In the cells below we load in the contents of the pickled output (if the fit has already been run), and then extract useful summary information about mean, median, variance, and credibility intervals for the parameter estimates.

parameters $\alpha$, $\beta$, $\gamma$ and $\delta$ are represented by their Roman letter equivalents `a`, `b`, `g` and `d`.
`*_mean` and `*_median` are the mean and median estimates for the parameter over the ensemble
`*_sem` is the standard deviation for the parameter estimate over the ensemble
`*_Npc` is the *N*th percentile for the parameter estimate, over the ensemble. These can be combined to obtain credibility intervals (e.g. the range `a_25pc`..`a_75pc` constitutes the 50% CI for $\alpha_{j[i]}$.



In [10]:

    
# Get summary data for parameter estimates
# use 'fit' for the model fit directly, and 'df'for loaded pickled data
(estimates_by_probe, estimates) = tools.extract_variable_summaries(treatment_fit, 'df',
                                                                  ['a', 'b', 'g', 'd'],
                                                                  [arrays, locus_tags, arrays, locus_tags],
                                                                  data)



In [11]:

    
# Inspect the data, one row per experiment probe
estimates_by_probe.head()









    Out[11]:







  
    
      
      probe
      replicate
      treatment
      repXtrt
      input
      output
      log_input
      log_output
      match
      locus_tag
      ...
      g_mean
      g_median
      g_sem
      d_2.5pc
      d_25pc
      d_75pc
      d_97.5pc
      d_mean
      d_median
      d_sem
    
  
  
    
      0
      A_07_P052986
      1
      0
      rep1trt0
      13.354233
      10.507812
      2.591833
      2.352119
      lcl|NC_002695.1_cds_NP_309203.1_1134
      ECs1176
      ...
      0.200417
      0.262675
      0.335062
      -0.510635
      -0.283252
      -0.05752
      -0.024356
      -0.167816
      -0.077139
      0.151117
    
    
      1
      A_07_P045401
      1
      0
      rep1trt0
      8.442597
      9.590875
      2.133290
      2.260812
      lcl|NC_002695.1_cds_NP_309203.1_1134
      ECs1176
      ...
      0.200417
      0.262675
      0.335062
      -0.510635
      -0.283252
      -0.05752
      -0.024356
      -0.167816
      -0.077139
      0.151117
    
    
      2
      A_07_P062008
      1
      0
      rep1trt0
      5.553664
      5.964808
      1.714458
      1.785877
      lcl|NC_002695.1_cds_NP_309203.1_1134
      ECs1176
      ...
      0.200417
      0.262675
      0.335062
      -0.510635
      -0.283252
      -0.05752
      -0.024356
      -0.167816
      -0.077139
      0.151117
    
    
      3
      A_07_P052986
      2
      0
      rep2trt0
      6.815142
      11.374824
      1.919147
      2.431403
      lcl|NC_002695.1_cds_NP_309203.1_1134
      ECs1176
      ...
      0.395308
      0.393646
      0.215424
      -0.510635
      -0.283252
      -0.05752
      -0.024356
      -0.167816
      -0.077139
      0.151117
    
    
      4
      A_07_P045401
      2
      0
      rep2trt0
      5.823013
      13.017137
      1.761818
      2.566267
      lcl|NC_002695.1_cds_NP_309203.1_1134
      ECs1176
      ...
      0.395308
      0.393646
      0.215424
      -0.510635
      -0.283252
      -0.05752
      -0.024356
      -0.167816
      -0.077139
      0.151117
    
  

5 rows × 41 columns



In [12]:

    
# Inspect the data, one row per locus tag
estimates.head()









    Out[12]:







  
    
      
      locus_tag
      b_2.5pc
      b_25pc
      b_75pc
      b_97.5pc
      b_mean
      b_median
      b_sem
      d_2.5pc
      d_25pc
      d_75pc
      d_97.5pc
      d_mean
      d_median
      d_sem
    
  
  
    
      5515
      ECDH10B_RS00010
      0.907909
      0.932580
      0.950421
      0.964608
      0.941324
      0.944069
      0.013947
      -0.106789
      -0.065958
      -0.048565
      -0.023847
      -0.058070
      -0.057542
      0.019050
    
    
      4705
      ECDH10B_RS00020
      0.914396
      0.934645
      0.951195
      0.961351
      0.942357
      0.944690
      0.012564
      -0.085109
      -0.064664
      -0.046774
      -0.025243
      -0.055638
      -0.057188
      0.015163
    
    
      5475
      ECDH10B_RS00060
      0.914911
      0.935482
      0.951967
      0.968178
      0.943733
      0.945555
      0.013453
      -0.105381
      -0.067194
      -0.045122
      -0.014049
      -0.057540
      -0.057505
      0.021189
    
    
      5474
      ECDH10B_RS00065
      0.912512
      0.935084
      0.951980
      0.965746
      0.943209
      0.945886
      0.013542
      -0.092378
      -0.063993
      -0.044987
      -0.020414
      -0.055270
      -0.056006
      0.017903
    
    
      5469
      ECDH10B_RS00070
      0.901806
      0.934273
      0.951271
      0.963839
      0.941399
      0.944425
      0.015426
      -0.098637
      -0.066353
      -0.045173
      -0.022492
      -0.057138
      -0.057634
      0.018181



In [13]:

    
# Separate estimates for Sakai and DH10B into two different dataframes
sakai_estimates = tools.split_estimates(estimates, 'sakai')
dh10b_estimates = tools.split_estimates(estimates, 'dh10b')

Inspecting the fit

In the cells below, we visualise the fitted estimates for each of the parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ as:

box plots of median estimates for each locus tag
relationship between control and treatment effects in Sakai
plots of 50% credibility interval range and median estimate for each locus tag to identify locus tags with a possible selective advantage

Median parameter estimates

We first inspect the range of fitted estimates to get an overview of the relationships for the data as a whole, and then examine whether this relationship varies by E. coli isolate.

Making boxplots for the full set of fitted parameter estimates, for both isolates:



In [14]:

    
# Visualise median values for parameter estimates of alpha and gamma
tools.boxplot_medians(estimates_by_probe, ['a', 'g'])



In [15]:

    
# Visualise median values for parameter estimates of beta and delta
tools.boxplot_medians(estimates, ['b', 'd'])

For this fit we can see that the estimates are all, in the main, tightly-distributed. Most estimated (median) values of $\alpha$ (control intercept), $\gamma$ (treatment intercept), and $\delta$ (treatment slope) are close to zero. Most estimated values of $\beta$ are close to (but slightly less than) unity. This implies that:

The linear relationship between input and output intensity due to the control effects (growth only) is, for most genes in the experiment, a slight reduction of output intensity with respect to input intensity value, and on the whole the effect of the control/growth is neutral [median $\alpha$ ≈ 0, median $\beta$ ≈ 1]
For most genes in the experiment there is no treatment effect due to exposure to the plant [median $\gamma$ ≈ 0, median $\delta$ ≈ 0]

There are, however, a considerable number of outlying median values for each parameter, which suggests that a number of genes have associated parameter values that are affected by either control (growth) or treatment (passage).

DH10B

Considering boxplots of estimated $\beta_{j[i]}$ and $\delta_{j[i]}$ for the DH10B (carrier) isolate only:



In [16]:

    
# Visualise median values for Sakai parameter estimates
tools.boxplot_medians(dh10b_estimates, ['b', 'd'])

it is clear that the median parameter estimates for DH10B are extremely restricted in their range:

$0.93 < \beta < 0.98$
$-0.065 < \delta < 0.045$

The control effect appears to be essentially *neutral*, in that the output intensity is almost a 1:1 linear relationship with the input intensity, but it is striking that the median estimates of $\gamma$ and $\delta$ are very close to zero, suggesting that passage (treatment) has almost no effect on this relationship, for any DH10B locus tag. This is exactly what would be expected for DH10B as the carrier isolate.

Sakai

Considering the Sakai isolate parameter estimates for $\beta_{j[i]}$ and $\gamma_{j[i]}$ only:



In [17]:

    
# Visualise median values for Sakai parameter estimates
tools.boxplot_medians(sakai_estimates, ['b', 'd'])

By contrast to the results for DH10B, the median parameter estimates for Sakai have many large value outliers, though the bulk of estimates are close to the values seen for DH10B:

$0.2 < \beta < 1.4$
$-1.5 < \delta < 0.5$

This indicates that we see the expected result, that strong variability of control and treatment effects are effectively confined to the Sakai BAC fragments. It is expected that some genes/operons may be relatively advantageous in either growth (control) or passage (treatment) conditions, or both.

We can visualise the relationships between parameter estimates for control and treatment effects in a scatterplot of control effect ($\beta$) against treatment effect ($\delta) for each locus tag. This plot can be considered in four quadrants, which are delineated by the bulk of the data which describes orthogonal effects of locus tags on growth and treatment:

(i.e. for most locus tags, there is *either* an effect on treatment or control, but *not both*)

(upper left) positive effect on growth, negative effect for treatment: may be related to ability to use growth medium more efficiently
(upper right) positive effect on both growth and treatment: no locus tags display this characteristic
(lower right) positive effect on treatment, negative effect for control: may be related to ability to use/exploit the plant, that is suppressive in the medium
(lower left) negative effect on both growth and treatment: most locus tags that display an interaction lie in this group



In [18]:

    
# Plot estimated parameters for treatment effects against control effects for Sakai
fig, ax = plt.subplots(1, 1, figsize=(6,6))
ax.scatter(sakai_estimates['d_median'], sakai_estimates['b_median'], alpha=0.2)
ax.set_xlabel('delta (median)')
ax.set_ylabel('beta (median)');

The strong cross-like distribution indicates that most parameter estimates of $\beta$ or $\delta$ that vary from those of the bulk do so orthogonally in either *treatment* or *control* conditions, but not both.

Where Sakai genes have an estimated effect under both conditions, this is typically negative for both treatment and control (lower left quadrant).

Identifying locus tags that confer an advantage under treatment and/or control

We use a 50% credibility interval to determine whether the effect of a gene on passage is likely to be positive. Under this assumption, we identify locus tags for which the median estimate of $\delta$ is positive, and the central 50% of the parameter estimates for $\delta$ (the 50% credibility interval) does not include zero. We label these locus tags as trt_pos in the dataframe.

These locus tags correspond to the genes that we should believe confer a selective advantage in passage/*treatment* (i.e. we require our estimate to be credibly positive).

Likewise, we use a 50% credibility interval to determine whether the effect of a gene on surviving growth (control) is positive. If the 50% CI for $\beta$ does not include the 97.5 percentile for all estimates of $\beta$ (as an upper estimate of overall dataset centrality for this dataset), and the median value of $\beta$ is greater than this value, we consider that the effect of the gene on surviving growth conditions is positive. We label these locus tags as ctl_pos in the dataframe.



In [19]:

    
# Label locus tags with positive effects for control and treatment
sakai_estimates = tools.label_positive_effects(sakai_estimates)

We can count the number of locus_tags in each of the groups:



In [20]:

    
# Count locus tags in each of the positive groups
counts = [sum(sakai_estimates[col]) for col in ('trt_pos', 'ctl_pos', 'combined')]
print("treatment positive: {0}\ncontrol positive: {1}\nboth: {2}".format(*counts))









    



treatment positive: 115
control positive: 65
both: 0

which indicates, with these assumptions, that:

115 genes have a credible positive effect on passage (treatment)
65 genes have a credible positive effect in the growth (control) step
no genes have a credible positive effect for both growth and treatment.

(this confirms our observation in the earlier scatterplot)

Plotting distribution of effects on the Sakai genome

We can show the estimated effects, and our confidence in those estimates, on a rough representation of the genome by plotting those values for each locus tag, sorted in order on the genome.

In the plots that follow, parameter estimates for each locus tag are rendered as points (the median estimate), with the 50% credibility interval for the estimate indicated as a vertical line. If the 50% CI includes a threshold value - the median value for the bulk parameter estimate of $\beta$ or $\delta$ - then we consider that there is not strong evidence of an effect on survival due to that gene (compared to the bulk), and the interval is coloured blue.

If the interval does not include the corresponding threshold value, then it is coloured either green for a positive effect, or magenta for a negative effect.

Sakai

We split the Sakai estimates into groups: one for the chromosome, and one for each plasmid pOSAK and pO157, on the basis of the locus tag prefixes, annotating them with their start position on the parent molecule.



In [21]:

    
sakai_chromosome = sakai_estimates.loc[sakai_estimates['locus_tag'].str.startswith('ECs')]
sakai_pOSAK = sakai_estimates.loc[sakai_estimates['locus_tag'].str.startswith('pOSAK1')]
sakai_pO157 = sakai_estimates.loc[(sakai_estimates['locus_tag'].str.startswith('pO157')) |
                                  (sakai_estimates['locus_tag'].str.startswith('ECp'))]



In [22]:

    
# Sakai chromosome
sakai_chromosome_annotated = tools.annotate_locus_tags(sakai_chromosome,
                                                       os.path.join('..', 'data', 'Sakai',
                                                                    'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_chromosome_annotated.sort_values('startpos', inplace=True)
#sakai_chromosome_annotated.head(15)



In [23]:

    
# pOSAK1
sakai_pOSAK_annotated = tools.annotate_locus_tags(sakai_pOSAK,
                                                  os.path.join('..', 'data', 'Sakai',
                                                               'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_pOSAK_annotated.sort_values('startpos', inplace=True)
#sakai_pOSAK_annotated.head(15)



In [24]:

    
# pECp
sakai_pO157_annotated = tools.annotate_locus_tags(sakai_pO157,
                                                 os.path.join('..', 'data', 'Sakai',
                                                              'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_pO157_annotated.sort_values('startpos', inplace=True)
#sakai_pO157_annotated.head(15)



In [25]:

    
# Regions of interest
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1.3),
           ('SpLE1', 'ECs1299', 'ECs1410', 1.5),
           ('S-loop 225', 'ECs4325', 'ECs4341', 1.5),
           ('S-loop 231', 'ECs4379', 'ECs4387', 1.3)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
                  tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
               k, v0, v1, v2 in regions}



In [26]:

    
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_chromosome_annotated['b_median'])  

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai chromosome'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

# Plot on the figure axes
tools.plot_parameter(sakai_chromosome_annotated, ax, 'b', beta_thresh, annotations=annotations);



In [27]:

    
# Regions of interest
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1),
           ('SpLE1', 'ECs1299', 'ECs1410', 1.8),
           ('S-loop 225', 'ECs4325', 'ECs4341', 1.8),
           ('S-loop 231', 'ECs4379', 'ECs4387', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
                  tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
               k, v0, v1, v2 in regions}



In [28]:

    
# Plot genome-wide estimates of delta for Sakai and mark values that don't include zero in 50%CI
delta_thresh = np.median(sakai_chromosome_annotated['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai chromosome'
plt.title("{0} [threshold: {1:.2f}]".format(title, delta_thresh))

tools.plot_parameter(sakai_chromosome_annotated, ax, 'd', delta_thresh, annotations=annotations)



In [29]:

    
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_pOSAK_annotated['b_median']) 

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai plasmid pOSAK'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pOSAK_annotated, ax, 'b', beta_thresh)



In [30]:

    
# Plot genome-wide estimates of delta for Sakai and mark values that don't include zero in 50% CI
delta_thresh = np.median(sakai_pOSAK_annotated['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai plasmid pOSAK'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pOSAK_annotated, ax, 'd', delta_thresh)



In [31]:

    
# Regions of interest
regions = [('StcE', 'pO157p01', 'pO157p01', 0.98),
           ('etp T2SS', 'pO157p02', 'pO157p14', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
                  tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
               k, v0, v1, v2 in regions}



In [32]:

    
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_pO157_annotated['b_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai plasmid p0157'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pO157_annotated, ax, 'b', beta_thresh, annotations=annotations)



In [33]:

    
# Regions of interest
regions = [('StcE', 'pO157p01', 'pO157p01', 0.13),
           ('etp T2SS', 'pO157p02', 'pO157p14', 0.19)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
                  tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
               k, v0, v1, v2 in regions}



In [34]:

    
# Plot genome-wide estimates of delta for Sakai and mark values that don't include zero in 50% CI
delta_thresh = np.median(sakai_pO157_annotated['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai plasmid pO157'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pO157_annotated, ax, 'd', delta_thresh, annotations=annotations)

These plots indicate that most Sakai genes do not produce parameter estimates that are indicative of credible effects in the control or treatment, in either direction.

Where effects are seen they tend to cluster on the genome, which is as would be expected if operons or gene clusters with common function were responsible for producing an effect. This is suggestive that we are measuring a biological effect, rather than noise.

In general, several clusters of both positive and negative effects appear in the chromosome and pO157 plots for effects due to control ($\beta$) and treatment ($\delta$).

DH10B

We plot similar representations for the DH10B isolate as a control, and see that all parameter estimates for this isolate's locus tags are very similar.

There is a weak sinusoidal pattern of fitted estimates. As no gene ordering information is available to the model fit, and there is an apparent symmetry to this pattern, it may reflect a real underlying biological process or structure.



In [35]:

    
# Annotate the DH10B results
dh10b_annotated = tools.annotate_locus_tags(dh10b_estimates,
                                            os.path.join('..', 'data', 'DH10B',
                                                         'GCF_000019425.1_ASM1942v1_genomic.gbff'))
dh10b_annotated.sort_values('startpos', inplace=True)



In [36]:

    
# Plot genome-wide estimates of beta for DH10B
beta_thresh = np.median(dh10b_estimates['b_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for DH10B', 
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(dh10b_estimates, ax, 'b', beta_thresh)



In [37]:

    
# Plot genome-wide estimates of delta for DH10B
delta_thresh = np.median(dh10b_estimates['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for DH10B'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(dh10b_estimates, ax, 'd', delta_thresh)

Identifying Sakai candidates

From the information above, we can list the 180 Sakai genes/locus tags that appear to impart a positive selective effect on treatment/passage (the green points/bars in the plots immediately above).



In [38]:

    
# Generate list of candidates with a positive effect under control or treatment.
candidates = sakai_estimates[sakai_estimates['ctl_pos'] | sakai_estimates['trt_pos']]
candidates = candidates[['locus_tag',
                         'b_median', 'ctl_pos',
                         'd_median', 'trt_pos']].sort_values(['ctl_pos', 'trt_pos', 'locus_tag'])
candidates.shape









    Out[38]:





(180, 5)



In [39]:

    
# Inspect the data
candidates.head()









    Out[39]:







  
    
      
      locus_tag
      b_median
      ctl_pos
      d_median
      trt_pos
    
  
  
    
      3523
      ECs0415
      0.943381
      False
      0.235389
      True
    
    
      3522
      ECs0416
      0.935728
      False
      0.174705
      True
    
    
      40
      ECs1121
      0.948441
      False
      0.057269
      True
    
    
      3754
      ECs1161
      0.947319
      False
      0.237648
      True
    
    
      3005
      ECs1262
      0.938215
      False
      0.197020
      True

We restrict this set to those genes that only have a credible effect on treatment/passage, identifying 115 genes with positive $\delta$ where the 50% CI does not include zero:



In [40]:

    
# Restrict candidates only to those with an effect on treatment/passage.
trt_only_positive = candidates.loc[candidates['trt_pos'] & ~candidates['ctl_pos']]
trt_only_positive.shape









    Out[40]:





(115, 5)

We add a column with the functional annotation of each of the candidates that appear to have a positive selective effect under treatment conditions:



In [41]:

    
# Annotated locus tags with functions from NCBI GenBank files
annotated = tools.annotate_locus_tags(trt_only_positive,
                                      os.path.join('..', 'data', 'Sakai',
                                                   'GCF_000008865.1_ASM886v1_genomic.gbff'))
pd.options.display.max_rows = 115 # force to show all rows
annotated









    Out[41]:







  
    
      
      locus_tag
      b_median
      ctl_pos
      d_median
      trt_pos
      annotation
      startpos
    
  
  
    
      3523
      ECs0415
      0.943381
      False
      0.235389
      True
      periplasmic-iron-binding protein
      441343
    
    
      3522
      ECs0416
      0.935728
      False
      0.174705
      True
      regulatory protein
      442371
    
    
      40
      ECs1121
      0.948441
      False
      0.057269
      True
      host specificity protein
      1203336
    
    
      3754
      ECs1161
      0.947319
      False
      0.237648
      True
      excisionase
      1247402
    
    
      3005
      ECs1262
      0.938215
      False
      0.197020
      True
      hypothetical protein
      1321781
    
    
      2989
      ECs1272
      0.943690
      False
      0.130858
      True
      rtn-like protein
      1337360
    
    
      2988
      ECs1273
      0.945370
      False
      0.169504
      True
      FidL-like protein
      1338994
    
    
      2786
      ECs1274
      0.943394
      False
      0.134231
      True
      transcriptional regulator
      1339490
    
    
      2783
      ECs1275
      0.944433
      False
      0.101744
      True
      oxidoreductase
      1340990
    
    
      2777
      ECs1276
      0.943277
      False
      0.135400
      True
      chaperone protein
      1341918
    
    
      2775
      ECs1277
      0.941800
      False
      0.164063
      True
      outer membrane protein
      1342604
    
    
      2769
      ECs1278
      0.932893
      False
      0.205353
      True
      outer membrane usher protein
      1343954
    
    
      2764
      ECs1279
      0.935759
      False
      0.186083
      True
      chaperone protein
      1346529
    
    
      2763
      ECs1280
      0.935879
      False
      0.230102
      True
      major pilin protein
      1347316
    
    
      2761
      ECs1281
      0.944470
      False
      0.172548
      True
      hypothetical protein
      1347919
    
    
      2756
      ECs1282
      0.947543
      False
      0.124275
      True
      hemagglutinin/hemolysin-like protein
      1349939
    
    
      2754
      ECs1283
      0.946325
      False
      0.103695
      True
      hemolysin activator-like protein
      1353834
    
    
      2750
      ECs1284
      0.947314
      False
      0.131823
      True
      holo-ACP synthase
      1355469
    
    
      2749
      ECs1285
      0.947547
      False
      0.155132
      True
      3-oxoacyl-ACP reductase
      1356043
    
    
      2744
      ECs1286
      0.946540
      False
      0.172656
      True
      (3R)-hydroxymyristoyl-ACP dehydratase
      1356616
    
    
      2742
      ECs1287
      0.945181
      False
      0.158969
      True
      acyl carrier protein
      1357195
    
    
      2739
      ECs1288
      0.946302
      False
      0.172153
      True
      glycine cleavage system protein T
      1357503
    
    
      4542
      ECs1289
      0.944467
      False
      0.144179
      True
      3-oxoacyl-ACP synthase
      1358737
    
    
      4541
      ECs1290
      0.944382
      False
      0.173207
      True
      hypothetical protein
      1361543
    
    
      4530
      ECs1291
      0.944559
      False
      0.139463
      True
      hypothetical protein
      1362622
    
    
      4524
      ECs1292
      0.947028
      False
      0.135113
      True
      ABC transporter ATP-binding protein
      1363484
    
    
      4518
      ECs1293
      0.943448
      False
      0.139623
      True
      hypothetical protein
      1364175
    
    
      4514
      ECs1294
      0.944918
      False
      0.141482
      True
      hypothetical protein
      1364832
    
    
      4507
      ECs1295
      0.943648
      False
      0.188630
      True
      hypothetical protein
      1366151
    
    
      4503
      ECs1296
      0.945348
      False
      0.160612
      True
      hypothetical protein
      1366936
    
    
      4351
      ECs1299
      0.945724
      False
      0.129586
      True
      integrase
      1370629
    
    
      4350
      ECs1300
      0.955336
      False
      0.130850
      True
      hypothetical protein
      1372018
    
    
      4343
      ECs1301
      0.950305
      False
      0.162544
      True
      transposase
      1374331
    
    
      4342
      ECs1302
      0.949132
      False
      0.143249
      True
      regulatory protein
      1374947
    
    
      4335
      ECs1303
      0.948038
      False
      0.130195
      True
      hypothetical protein
      1375886
    
    
      4330
      ECs1304
      0.946557
      False
      0.143737
      True
      hypothetical protein
      1378092
    
    
      4329
      ECs1305
      0.945482
      False
      0.091902
      True
      hypothetical protein
      1378810
    
    
      4317
      ECs1312
      0.949912
      False
      0.130438
      True
      complement resistance protein
      1385490
    
    
      4305
      ECs1315
      0.949239
      False
      0.095599
      True
      hypothetical protein
      1386925
    
    
      4300
      ECs1316
      0.954738
      False
      0.073505
      True
      diacylglycerol kinase
      1387535
    
    
      4292
      ECs1317
      0.948272
      False
      0.104187
      True
      outer membrane protein
      1387987
    
    
      4289
      ECs1319
      0.946332
      False
      0.141756
      True
      hypothetical protein
      1389831
    
    
      4281
      ECs1320
      0.947656
      False
      0.097397
      True
      hypothetical protein
      1390158
    
    
      4279
      ECs1321
      0.944511
      False
      0.167959
      True
      hypothetical protein
      1390442
    
    
      4128
      ECs1322
      0.944768
      False
      0.153130
      True
      urease subunit gamma
      1391276
    
    
      4121
      ECs1323
      0.944515
      False
      0.170197
      True
      urease subunit beta
      1391587
    
    
      4120
      ECs1324
      0.946103
      False
      0.134457
      True
      urease subunit alpha
      1391897
    
    
      4119
      ECs1325
      0.943279
      False
      0.130909
      True
      urease accessory protein UreE
      1393613
    
    
      4113
      ECs1326
      0.944795
      False
      0.178645
      True
      UreF
      1394078
    
    
      4112
      ECs1327
      0.945132
      False
      0.164409
      True
      protein UreG
      1394764
    
    
      4102
      ECs1330
      0.946507
      False
      0.145510
      True
      50S ribosomal protein L31
      1396593
    
    
      4082
      ECs1335
      0.940679
      False
      0.135344
      True
      hypothetical protein
      1399916
    
    
      4065
      ECs1337
      0.933473
      False
      0.244882
      True
      hypothetical protein
      1401179
    
    
      4062
      ECs1338
      0.948663
      False
      0.162611
      True
      hypothetical protein
      1401601
    
    
      4061
      ECs1339
      0.949873
      False
      0.158986
      True
      hypothetical protein
      1401982
    
    
      4055
      ECs1340
      0.937697
      False
      0.242988
      True
      hypothetical protein
      1403599
    
    
      3863
      ECs1343
      0.942417
      False
      0.139782
      True
      TerW protein
      1405456
    
    
      3856
      ECs1344
      0.947931
      False
      0.155960
      True
      hypothetical protein
      1405941
    
    
      3854
      ECs1345
      0.942798
      False
      0.151593
      True
      hypothetical protein
      1407160
    
    
      3846
      ECs1347
      0.949253
      False
      0.169834
      True
      hypothetical protein
      1408852
    
    
      3838
      ECs1348
      0.943916
      False
      0.161774
      True
      hypothetical protein
      1409198
    
    
      501
      ECs1397
      0.942103
      False
      0.143552
      True
      hypothetical protein
      1447109
    
    
      485
      ECs1401
      0.894047
      False
      0.112519
      True
      hypothetical protein
      1451802
    
    
      482
      ECs1402
      0.934423
      False
      0.071359
      True
      hypothetical protein
      1452711
    
    
      453
      ECs1408
      0.951253
      False
      0.063618
      True
      hypothetical protein
      1455430
    
    
      451
      ECs1409
      0.942792
      False
      0.160255
      True
      hypothetical protein
      1455712
    
    
      432
      ECs1780
      0.950131
      False
      0.162195
      True
      DNA methylase
      1772842
    
    
      1969
      ECs1800
      0.948325
      False
      0.071955
      True
      major tail subunit
      1789012
    
    
      3649
      ECs1812
      0.950230
      False
      0.295711
      True
      hypothetical protein
      1801540
    
    
      3650
      ECs1813
      0.948075
      False
      0.303623
      True
      integrase
      1803526
    
    
      3659
      ECs1814
      0.946699
      False
      0.355551
      True
      hypothetical protein
      1804692
    
    
      3463
      ECs1821
      0.968390
      False
      0.253145
      True
      hypothetical protein
      1809856
    
    
      3660
      ECs1824
      0.964454
      False
      0.229530
      True
      hypothetical protein
      1810845
    
    
      3661
      ECs1825
      0.949948
      False
      0.288338
      True
      bfpT-regulated chaperone-like protein
      1811676
    
    
      3975
      ECs1975
      0.946915
      False
      0.076027
      True
      hypothetical protein
      1953689
    
    
      4223
      ECs2192
      0.961011
      False
      0.091518
      True
      hypothetical protein
      2190644
    
    
      1982
      ECs2240
      0.947708
      False
      0.061602
      True
      tail length tape measure protein
      2219960
    
    
      1961
      ECs2246
      0.949162
      False
      0.037638
      True
      head-tail adaptor
      2225510
    
    
      172
      ECs2248
      0.948823
      False
      0.128293
      True
      portal protein
      2226199
    
    
      1950
      ECs2254
      0.947297
      False
      0.075911
      True
      Dnase
      2233502
    
    
      3102
      ECs2797
      0.889599
      False
      0.046828
      True
      hypothetical protein
      2743949
    
    
      4012
      ECs2967
      0.947021
      False
      0.042129
      True
      antirepressor protein
      2919758
    
    
      3223
      ECs3498
      0.938757
      False
      0.135753
      True
      hypothetical protein
      3484251
    
    
      549
      ECs3868
      0.967686
      False
      0.127219
      True
      hypothetical protein
      3872936
    
    
      364
      ECs3869
      0.956561
      False
      0.092450
      True
      hypothetical protein
      3873499
    
    
      780
      ECs4325
      0.873029
      False
      0.103697
      True
      O-methyltransferase
      4332998
    
    
      752
      ECs4329
      0.809716
      False
      0.083031
      True
      acyl carrier protein
      4335884
    
    
      734
      ECs4333
      0.890242
      False
      0.458790
      True
      hypothetical protein
      4338410
    
    
      718
      ECs4335
      0.772257
      False
      0.122669
      True
      hypothetical protein
      4340509
    
    
      512
      ECs4339
      0.794904
      False
      0.209705
      True
      beta-hydroxydecanoyl-ACP dehydrase
      4345150
    
    
      507
      ECs4341
      0.791081
      False
      0.132900
      True
      3-oxoacyl-ACP synthase
      4346342
    
    
      1887
      ECs4353
      0.811411
      False
      0.055275
      True
      sugar kinase
      4356742
    
    
      1885
      ECs4354
      0.811870
      False
      0.164720
      True
      phosphotransferase system HPr enzyme
      4358240
    
    
      1856
      ECs4379
      0.943504
      False
      0.091106
      True
      hypothetical protein
      4388386
    
    
      1851
      ECs4380
      0.945445
      False
      0.061092
      True
      heme utilization/transport protein
      4389463
    
    
      1842
      ECs4381
      0.947578
      False
      0.054208
      True
      hypothetical protein
      4391721
    
    
      1839
      ECs4382
      0.942031
      False
      0.112038
      True
      hemin binding protein
      4392050
    
    
      1837
      ECs4383
      0.945615
      False
      0.069680
      True
      coproporphyrinogen III oxidase
      4393062
    
    
      1831
      ECs4384
      0.946534
      False
      0.098545
      True
      ShuX-like protein
      4394412
    
    
      1828
      ECs4385
      0.944502
      False
      0.097967
      True
      ShuY-like protein
      4394906
    
    
      1822
      ECs4386
      0.943409
      False
      0.108441
      True
      hemin permease
      4395614
    
    
      1818
      ECs4387
      0.948050
      False
      0.050882
      True
      hemin importer ATP-binding protein
      4396567
    
    
      5690
      ECs4971
      0.784740
      False
      0.432552
      True
      hypothetical protein
      5057187
    
    
      3068
      ECs5307
      0.942778
      False
      0.052771
      True
      type I restriction modification enzyme M subunit
      5439606
    
    
      3077
      ECs5593
      0.936959
      False
      0.065415
      True
      endoribonuclease SymE
      5437290
    
    
      3128
      pO157p01
      0.944843
      False
      0.031526
      True
      ToxR-regulated lipoprotein
      0
    
    
      3117
      pO157p03
      0.943886
      False
      0.042784
      True
      EtpD
      3674
    
    
      2885
      pO157p05
      0.941998
      False
      0.047712
      True
      EtpF
      6938
    
    
      2881
      pO157p06
      0.942176
      False
      0.039165
      True
      EtpG
      8192
    
    
      2874
      pO157p08
      0.940078
      False
      0.052343
      True
      EtpI
      9192
    
    
      3165
      pO157p78
      0.942091
      False
      0.103422
      True
      EspP
      80756
    
    
      3151
      pO157p79
      0.948124
      False
      0.085414
      True
      hypothetical protein
      87221
    
    
      3144
      pO157p80
      0.942737
      False
      0.090473
      True
      RfbU-like protein
      87658
    
    
      3143
      pO157p81
      0.945955
      False
      0.055917
      True
      hypothetical protein
      88854
    
    
      3138
      pO157p82
      0.943998
      False
      0.103180
      True
      lipid A biosynthesis (KDO)2-(lauroyl)-lipid IV...
      90649

Finally, we write this data out in tab-separated format



In [42]:

    
# Write data to file in tab-separated format
outfile_annotated = os.path.join('datasets', 'trt_positive.tab')
annotated.to_csv(outfile_annotated, sep="\t")

Manuscript Figure 1

The code in the cell below will reproduce figure 1 from the manuscript.



In [43]:

    
# Create figure with no title or xticks to hold the plotted axes
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(20, 26))  

# Add subplot for each result

# 1) Sakai chromosome
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1),
           ('SpLE1', 'ECs1299', 'ECs1410', 1.8),
           ('S-loop 225', 'ECs4325', 'ECs4341', 1.8),
           ('S-loop 231', 'ECs4379', 'ECs4387', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
                  tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
               k, v0, v1, v2 in regions}
delta_thresh = np.median(sakai_chromosome_annotated['d_median'])
tools.plot_parameter(sakai_chromosome_annotated, ax1, 'd', delta_thresh, annotations=annotations,
                     label="a) Sakai chromosome")

# 2) pO157 plasmid
regions = [('StcE', 'pO157p01', 'pO157p01', 0.13),
           ('etp T2SS', 'pO157p02', 'pO157p14', 0.19)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
                  tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
               k, v0, v1, v2 in regions}
delta_thresh = np.median(sakai_pO157_annotated['d_median'])
tools.plot_parameter(sakai_pO157_annotated, ax2, 'd', delta_thresh, annotations=annotations,
                     label="b) Sakai pO157")

# 3) DH10B chromosome
delta_thresh = np.median(dh10b_estimates['d_median'])
tools.plot_parameter(dh10b_estimates, ax3, 'd', delta_thresh, label="c) DH10B chromosome")

# Save figure as pdf
plt.savefig("figure_1.pdf");

	probe	replicate	repXtrt	input	output	log_input	log_output	match	locus_tag	...	g_mean	g_median	g_sem	d_2.5pc	d_25pc	d_75pc	d_97.5pc	d_mean	d_median	d_sem
0	A_07_P052986	1	rep1trt0	13.354233	10.507812	2.591833	2.352119	lcl\|NC_002695.1_cds_NP_309203.1_1134	ECs1176	...	0.200417	0.262675	0.335062	-0.510635	-0.283252	-0.05752	-0.024356	-0.167816	-0.077139	0.151117
1	A_07_P045401	1	rep1trt0	8.442597	9.590875	2.133290	2.260812	lcl\|NC_002695.1_cds_NP_309203.1_1134	ECs1176	...	0.200417	0.262675	0.335062	-0.510635	-0.283252	-0.05752	-0.024356	-0.167816	-0.077139	0.151117
2	A_07_P062008	1	rep1trt0	5.553664	5.964808	1.714458	1.785877	lcl\|NC_002695.1_cds_NP_309203.1_1134	ECs1176	...	0.200417	0.262675	0.335062	-0.510635	-0.283252	-0.05752	-0.024356	-0.167816	-0.077139	0.151117
3	A_07_P052986	2	rep2trt0	6.815142	11.374824	1.919147	2.431403	lcl\|NC_002695.1_cds_NP_309203.1_1134	ECs1176	...	0.395308	0.393646	0.215424	-0.510635	-0.283252	-0.05752	-0.024356	-0.167816	-0.077139	0.151117
4	A_07_P045401	2	rep2trt0	5.823013	13.017137	1.761818	2.566267	lcl\|NC_002695.1_cds_NP_309203.1_1134	ECs1176	...	0.395308	0.393646	0.215424	-0.510635	-0.283252	-0.05752	-0.024356	-0.167816	-0.077139	0.151117

	locus_tag	b_2.5pc	b_25pc	b_75pc	b_97.5pc	b_mean	b_median	b_sem	d_2.5pc	d_25pc	d_75pc	d_97.5pc	d_mean	d_median	d_sem
5515	ECDH10B_RS00010	0.907909	0.932580	0.950421	0.964608	0.941324	0.944069	0.013947	-0.106789	-0.065958	-0.048565	-0.023847	-0.058070	-0.057542	0.019050
4705	ECDH10B_RS00020	0.914396	0.934645	0.951195	0.961351	0.942357	0.944690	0.012564	-0.085109	-0.064664	-0.046774	-0.025243	-0.055638	-0.057188	0.015163
5475	ECDH10B_RS00060	0.914911	0.935482	0.951967	0.968178	0.943733	0.945555	0.013453	-0.105381	-0.067194	-0.045122	-0.014049	-0.057540	-0.057505	0.021189
5474	ECDH10B_RS00065	0.912512	0.935084	0.951980	0.965746	0.943209	0.945886	0.013542	-0.092378	-0.063993	-0.044987	-0.020414	-0.055270	-0.056006	0.017903
5469	ECDH10B_RS00070	0.901806	0.934273	0.951271	0.963839	0.941399	0.944425	0.015426	-0.098637	-0.066353	-0.045173	-0.022492	-0.057138	-0.057634	0.018181

	locus_tag	b_median	ctl_pos	d_median	trt_pos
3523	ECs0415	0.943381	False	0.235389	True
3522	ECs0416	0.935728	False	0.174705	True
40	ECs1121	0.948441	False	0.057269	True
3754	ECs1161	0.947319	False	0.237648	True
3005	ECs1262	0.938215	False	0.197020	True

Supplementary Information: Holmes et al. 2019

2. Full model fitting

Table of Contents

Experiment summary and interpretation

Python imports

Building the model

Load input data for fit

Stan model construction

Define and fit the Stan model

Extract the fit

Inspecting the fit

Median parameter estimates

DH10B

Sakai

Identifying locus tags that confer an advantage under treatment and/or control

Plotting distribution of effects on the Sakai genome

Sakai

DH10B

Identifying Sakai candidates

Manuscript Figure 1