Supplementary Information: Holmes et al. 2019

2. Full model fitting

This notebook describes fitting of a Bayesian hierarchical model of the effects of control (growth) and treatment (passage) on individual genes from E. coli DH10B (carrier) and Sakai (BAC load), to data obtained using a multi-E. coli microarray.

Much of the code for the visualisation, analysis and data manipulation of the fitting results is found in the associated Python module tools.py, which should also be present in this directory.

The model fit can be downloaded directly from the Zenodo repository, for use in this notebook:

A code cell in the notebook below will attempt to make this download for you if the file does not already exist.

Table of Contents

  1. Experiment summary and interpretation
  2. Building the model
    1. Stan model construction
    2. Define and fit the Stan model
    3. Extract the fit
  3. Inspecting the fit
    1. Median parameter estimates
  4. Identifying locus tags that confer an advantage under treatment
    1. Plotting distribution of effects
    2. Identifying candidates
  5. Manuscript Figure 1

Experiment summary and interpretation

The experiment involves measuring changes in microarray probe intensity before and after a pool of bacteria is subjected to one of two processes:

  1. a sample from the pool is grown in media to a defined OD, then subsampled. This growth/subsample process is repeated n times. [control]
  2. a sample from the pool is applied to plant leaves, subsampled, and that subsample grown in media to a defined OD, then subsampled. This passage/subsample/growth/subsample process is repeated n times. [treatment]

In a single replicate, the microarray is exposed to genomic DNA extracted from the pool (i) before the experiment begins, and (ii) after the experiment concludes. Three replicates are performed.


All genes in all samples go through the *growth and subsampling* part of the experiment, and we wish to estimate the effect of *passage and subsampling* on individual genes.

The pool of bacteria comprises E. coli DH10B as a carrier organism. The pool is heterogeneous, in that individual cells also contain BACs encoding random stretches of the E. coli Sakai chromosome. We therefore expect carrier organism genes to be unaffected by passage (treatment), and for any effects to be detectable only for genes that originate from E. coli Sakai.


We expect that genes conferring a phenotypic/selective advantage only for association with the plant should be enriched at the end of the treatment experiment, but not at the end of the control experiment. Sakai genes that are enriched in both treatment and control experiments may be generally advantageous for growth, but those giving a selective advantage on passage through the plant could be specifically adaptive in an environmental context.


As the BACs describe contiguous regions of the *E. coli* Sakai genome, there is the possibility that linkage disequilibrium could result in some genes that do not confer an advantage by themselves apparently displaying enrichment after treatment.

If the biological function conferring an advantage during passage is encoded by a suite of coregulated genes in an operon, we might expect all members of this suite to show evidence of enrichment after passage. It is likely that clusters of enrichment for operons or regulons post-passage will be seen in the results. Although we are not accounting for this clustering or association by operon directly in this model, it is a possible additional hierarchical term in future iterations of the model.

We should expect there to be a selective burden to the carriage of additional non-functional gDNA as BACs, so we might also anticipate a slightly negative effect on recovery under control conditions.

Python imports


In [1]:
%pylab inline

import os
import pickle
import warnings; warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import pystan
import scipy
import seaborn as sns; sns.set_context('notebook')

from Bio import SeqIO

import tools


Populating the interactive namespace from numpy and matplotlib

Building the model

We assume that each array probe $i$ (array probes take a unique values of $i$ in the context of the entire experiment; that is, $i$ is unique for probe X replicate X treatment) measures hybridisation of genomic DNA (gDNA) in the sample that corresponds to a single gene $j[i]$, and that the measured intensity of probe $i$ relates directly to the corresponding amount of gDNA in the sample. There may be multiple probes relating to a single gene, so it is possible that $j[p] = j[q], p \neq q$.

This establishes a basis for pooling probe-level effects as samples of the gene-level effect.

We define the (input) measurement of a probe before an experiment as $x_i$, and the (output) measurement of that probe after the experiment as $y_i$. We assume that the measurement of each probe is subject to random experimental/measurement error that is normally-distributed with mean zero and variance $\sigma_y^2$. The actual quantity of DNA measured after the experiment can then be represented as $\hat{y}$, and the irreducible error in this experiment as $\epsilon$ ($\epsilon_i$ serves to include the irreducible errors in measuring both $x_i$ and $y_i$; all errors are assumed to be Normal, so their linear combinations are also Normal).

$$y_i = \hat{y_i} + \epsilon_i$$$$\epsilon_i \sim N(0, \sigma_y^2) \implies y_i \sim N(\hat{y_i}, \sigma_y^2)$$

The relationship between the input and output DNA quantities measured by a single probe can be represented as $\hat{y_i} = f(x_i)$. That is to say, that the measured input DNA quantity $x_i$ is a predictor of the output quantity. This relationship will be modelled as the sum of two linear effects:

$$\textrm{control effect} = \alpha + \beta x$$$$\textrm{treatment effect} = \gamma + \delta x$$$$\hat{y_i} = \textrm{control effect}(x_i) + \textrm{treatment effect}(x_i) = \alpha + \beta x_i + \gamma + \delta x_i$$

As these are linear effects, we have intercept/offset parameters ($\alpha$, $\gamma$) and gradient/slope parameters ($\beta$, $\delta$).

Where $\beta$ or $\delta$ are large, they would indicate large $x_i$-dependent effects of the control (growth) and treatment (passage) parts of the experiment respectively.

As formulated above, the four parameters would be identical for all probes, but we are interested in estimating the control and treatment effects for individual genes, so we require a set of parameters for each gene (as it corresponds to probe $i$): $j[i]$. This is appropriate for the effects of growth/treatment that are specific to the levels of a single gene: $\beta$ and $\delta$.

The remaining parameters $\alpha$ and $\gamma$, the offsets from zero for each probe, could be considered to be constant across each replicate of both control and treatment experiments. They are possibly more realistically considered to be different for each array (i.e. each combination of replicate and treatment).

The offset for any particular array can be hierarchically modelled as being drawn from a distribution representing all arrays, and we require one parameter for each of the arrays, so that for probe $i$ the corresponding array for that experiment is $k[i]$.

As a result, we estimate $\alpha_{k[i]}$, $\beta_{j[i]}$, $\gamma_{k[i]}$, $\delta_{j[i]}$, and the relationship for each probe is modelled as:

$$\hat{y_i} = \textrm{control effect}_{j[i]}(x_i) + \textrm{treatment effect}_{j[i]}(x_i) = \alpha_{k[i]} + \beta_{j[i]} x_i + \gamma_{k[i]} + \delta_{j[i]} x_i$$

The parameters $\alpha_{k[i]}$, $\beta_{j[i]}$, $\gamma_{k[i]}$, $\delta_{j[i]}$ (and $\epsilon_i$) are to be estimated by the model fit.


We assume that the values of each parameter, e.g. $\alpha_{k[i]}$, are drawn from a single *pooled distribution* for that parameter, $\alpha \sim \textrm{some distribution}$.

This pooling ensures that our fits are not completely pooled as a single estimate $\alpha_{k[i]} = \alpha$, which would imply that all parameter estimates are constant for all genes/arrays, a situation that would be completely uninformative for our goal to identify gene-level effects, and which would underfit our model. It also means that our estimates are not completely unpooled, which would allow all parameter estimates to vary independently. That situation would be equivalent to simultaneously fitting independent linear relationships to each gene, and so risk overfitting our model to the measured data.


NOTE: By using a *pooled distribution*, we allow a parameter estimate for each gene to influence the estimates of that parameter for all other genes in the experiment, constrained by an expected distribution of that parameter's values. To do this, we define a *prior distribution* for each parameter, but we do not specify its mean or variance, allowing the parameters of these *pooled distributions* also to be estimated when fitting our model.

For each parameter's prior we choose a Cauchy distribution, because it has fat tails and infinite variance. This does not constrain outlying and extreme values (those we are interested in) so much as other distributions (e.g. Normal or Student's t):

$$\alpha_{k[i]} \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha}^2)$$$$\beta_{j[i]} \sim Cauchy(\mu_{\beta}, \sigma_{\beta}^2)$$$$\gamma_{k[i]} \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma}^2)$$$$\delta_{j[i]} \sim Cauchy(\mu_{\delta}, \sigma_{\delta}^2)$$

Each parameter's prior distribution requires a fit of both its mean and variance, and these also become parameters in our model. The means are free to vary, but we assume that the variance of each parameter's prior can be drawn from a Uniform distribution on the range (0, 100):

$$\sigma_{\alpha} \sim U(0, 100)$$$$\sigma_{\beta} \sim U(0, 100)$$$$\sigma_{\gamma} \sim U(0, 100)$$$$\sigma_{\delta} \sim U(0, 100)$$
We therefore construct the following model of the experiment: $$\hat{y_i} = \alpha_{k[i]} + \beta_{j[i]} x_i + \gamma_{k[i]} t_i + \delta_{j[i]} t_i x_i$$ $$y_i \sim N(\hat{y_i}, \sigma_y^2)$$ $$\alpha_{k[i]} \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha}^2)$$ $$\beta_{j[i]} \sim Cauchy(\mu_{\beta}, \sigma_{\beta}^2)$$ $$\gamma_{k[i]} \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma}^2)$$ $$\delta_{j[i]} \sim Cauchy(\mu_{\delta}, \sigma_{\delta}^2)$$ $$\sigma_{\alpha} \sim U(0, 100)$$ $$\sigma_{\beta} \sim U(0, 100)$$ $$\sigma_{\gamma} \sim U(0, 100)$$ $$\sigma_{\delta} \sim U(0, 100)$$ $$\sigma_y \sim U(0, \infty)$$
  • $y_i$: measured intensity output on the array for probe $i$ (specific to each replicate)
  • $\hat{y_i}$: actual probe intensity for probe $i$ (specific to each replicate)
  • $x_i$: measured intensity input on the array for probe $i$ (specific to each replicate)
  • $t_i$: 0/1 pseudovariable indicating whether the probe $i$ was measured in a control (0) or treatment (1) experiment
  • $\alpha_{k[i]}$: control effect offset for treatment X replicate $k$ (used for probe $i$)
  • $\mu_{\alpha}$: mean control effect offset for all arrays
  • $\sigma_{\alpha}$: control effect offset variance for all arrays
  • $\beta_{j[i]}$: control effect slope for gene $[j[i]$
  • $\mu_{\beta}$: mean control effect slope for all genes
  • $\sigma_{\beta}$: control effect slope variance for all genes
  • $\gamma_{k[i]}$: treatment effect offset for treatment X replicate $k$ (used for probe $i$)
  • $\mu_{\gamma}$: mean treatment effect offset for all arrays
  • $\sigma_{\gamma}$: treatment effect offset variance for all arrays
  • $\delta_{j[i]}$: treatment effect slope for gene $j[i]$
  • $\mu_{\delta}$: mean treatment effect slope for all genes
  • $\sigma_{\delta}$: treatment effect slope variance for all genes
  • $\sigma_y$: variance in measurement due to irreducible error

Load input data for fit

In the cells below we load in the data to be fit, and define useful variables for inspecting/analysing the data later:

  • locus_tags: the unique locus tags represented in the dataset
  • ntags: the number of unique locus tags
  • arrays: the arrays (combinations of replicate X treatment) used in the experiment
  • narrays: the number of arrays used
  • outdir: path to the directory in which to place model fit output
  • outfile: path to the model fit output file (pickled dataframe)

In [2]:
# load clean, normalised, indexed data
data = pd.read_csv(os.path.join("datasets", "normalised_array_data.tab"), sep="\t")  # full dataset
#data = pd.read_csv("datasets/reduced_locus_data.tab", sep="\t")  # reduced dataset
#data = data[:100]  # uncomment this for debugging

# useful values
locus_tags = data['locus_tag'].unique()
ntags = len(locus_tags)
arrays = data['repXtrt'].unique()
narrays = len(arrays)

In [3]:
# Create output directory and filename to hold the fitted model
outdir = "model_fits"
os.makedirs(outdir, exist_ok=True)
outfile = os.path.join(outdir, 'full_model_fit.pkl')

Stan model construction

We need to define data, parameters and our model for Stan.

In the `data` block, we have:
  • `N`: `int`, the number of data points
  • `J`: `int`, the number of unique locus tags (`J` < `N`)
  • `K`: `int`, the number of unique treatment X replicate combinations (arrays)
  • `array`: `int[N]`, an index list of arrays
  • `tag`: `int[N]`, an index list of locus tags
  • `t`: `vector[N]`, 0/1 control/treatment values for each probe
  • `x`: `vector[N]`, the input log(intensity) values
  • `y`: `vector[N]`, the output log(intensity) values
In the `parameter` block, we have:
  • `a`: `real vector[K]`, estimated offset effect on log(intensity) of the *control* for each array
  • `mu_a`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the *control* effect offset, for all arrays
  • `sigma_a`: `real`, standard deviation of the Cauchy distribution for the *control* effect offset, for all arrays
  • `b`: `real vector[J]`, estimated slope effect on log(intensity) of the *control* for each locus tag/gene
  • `mu_b`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the *control* effect slope, for all locus tags
  • `sigma_b`: `real`, standard deviation of the Cauchy distribution for the *control* effect slope, for all locus tags
  • `g`: `real vector[K]`, estimate of the influence of treatment on the output measured intensity (offset) for array
  • `mu_g`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the offset for all arrays due to *treatment*
  • `sigma_g`: `real`, standard deviation of the Cauchy distribution for the offset for all arrays due to *treatment*
  • `d`: `real vector[J]`, estimate of the influence of treatment on the output measured intensity (slope) for each locus tag/gene
  • `mu_d`: `real`, an unconstrained value to be fit that represents the mean of the Cauchy distribution for the slope for all locus tags due to *treatment*
  • `sigma_d`: `real`, standard deviation of the Cauchy distribution for the slope for all locus tags due to *treatment*
  • `sigma`: `real`, the irreducible error in the experiment/model
We also define a `transformed parameter`:
  • `y_hat[i] <- b[tag[i]] * x[i] + a[array[i]] + g[tag[i]] * t[i] + d[array[i]] * t[i] * x[i]`: the linear relationship describing $\hat{y}$, our estimate of experimental output intensity, which is subject to variance `sigma`.

Define and fit the Stan model

In the cells below we define the model to be fit, in the Stan language, conduct the fit, and save the fit out to a pickled dataframe (or load it in from one, depending on which code is commented out).


In [4]:
# define unpooled stan model
treatment_model = """
data {
  int<lower=0> N;
  int<lower=0> J;
  int<lower=0> K;  
  int<lower=1, upper=J> tag[N];
  int<lower=1, upper=K> array[N];  
  vector[N] t;
  vector[N] x;
  vector[N] y;
}
parameters {
  vector[K] a;
  vector[J] b;
  vector[K] g;
  vector[J] d;
  real mu_a;
  real mu_b;
  real mu_g;
  real mu_d;
  real<lower=0> sigma;
  real<lower=0,upper=100> sigma_a;
  real<lower=0,upper=100> sigma_b;
  real<lower=0,upper=100> sigma_g;
  real<lower=0,upper=100> sigma_d;
}
transformed parameters{
  vector[N] y_hat;

  for (i in 1:N)
    y_hat[i] = a[array[i]] + b[tag[i]] * x[i] + g[array[i]] * t[i] + d[tag[i]] * t[i] * x[i];
}
model {
  sigma_a ~ uniform(0, 100);
  a ~ cauchy(mu_a, sigma_a);

  sigma_b ~ uniform(0, 100);
  b ~ cauchy(mu_b, sigma_b);

  sigma_g ~ uniform(0, 100);
  g ~ cauchy(mu_g, sigma_g);

  sigma_d ~ uniform(0, 100);
  d ~ cauchy(mu_d, sigma_d);

  y ~ normal(y_hat, sigma);
}
"""

In [5]:
# relate python variables to stan variables
treatment_data_dict = {'N': len(data),
                       'J': ntags,
                       'K': narrays,
                       'tag': data['locus_tag_index'] + 1,
                       'array': data['repXtrt_index'] + 1,
                       't': data['treatment'],
                       'x': data['log_input'],
                       'y': data['log_output']}
At this point, you have two options to obtain the model fit data
  1. Run the model fit 'live' in the notebook. This may take several hours. USE CELL (1)
    1. (optionally) save the newly-generated model fit to a local file. USE CELL (2)
  2. Load the model fit from a local file. USE CELL (4)
    1. If you have not generated the data locally, then you can download it from Zenodo USE CELL (3) FIRST.

It may be quicker to download the data from Zenodo using the button below, than to use cell (3), but be sure to place the downloaded file in the correct location as specified in the variable outfile.


In [6]:
# (1) USE THIS CELL TO RUN THE STAN FIT - takes a few hours on my laptop
#treatment_fit = pystan.stan(model_code=treatment_model,
#                            data=treatment_data_dict,
#                            iter=1000, chains=2,
#                            seed=tools.SEED)

In [7]:
# (2) USE THIS CELL TO SAVE THE STAN FIT TO A PICKLE FILE
#unpermutedChains = treatment_fit.extract()
#unpermutedChains_df = pd.DataFrame([dict(unpermutedChains)])
#pickle.dump(unpermutedChains_df, open(outfile, 'wb'))

In [8]:
# (3) USE THIS CELL TO DOWNLOAD THE STAN FIT FROM ZENODO: DOI:10.5281/zenodo.269638
# The file will not be downloaded if it already exists locally.
# The file is 0.5GB in size, so may take some time to download
import urllib.request
if not os.path.isfile(outfile):
    zenodo_url = "https://zenodo.org/record/269638/files/full_model_fit.pkl"
    response = urllib.request.urlretrieve(zenodo_url, outfile)

In [9]:
# (4) USE THIS CELL TO LOAD THE STAN FIT FROM A PICKLE FILE
# Import the previously-fit model
treatment_fit = pd.read_pickle(open(outfile, 'rb'))

Extract the fit


In the cells below we load in the contents of the pickled output (if the fit has already been run), and then extract useful summary information about mean, median, variance, and credibility intervals for the parameter estimates.

  • parameters $\alpha$, $\beta$, $\gamma$ and $\delta$ are represented by their Roman letter equivalents `a`, `b`, `g` and `d`.
  • `*_mean` and `*_median` are the mean and median estimates for the parameter over the ensemble
  • `*_sem` is the standard deviation for the parameter estimate over the ensemble
  • `*_Npc` is the *N*th percentile for the parameter estimate, over the ensemble. These can be combined to obtain credibility intervals (e.g. the range `a_25pc`..`a_75pc` constitutes the 50% CI for $\alpha_{j[i]}$.

In [10]:
# Get summary data for parameter estimates
# use 'fit' for the model fit directly, and 'df'for loaded pickled data
(estimates_by_probe, estimates) = tools.extract_variable_summaries(treatment_fit, 'df',
                                                                  ['a', 'b', 'g', 'd'],
                                                                  [arrays, locus_tags, arrays, locus_tags],
                                                                  data)

In [11]:
# Inspect the data, one row per experiment probe
estimates_by_probe.head()


Out[11]:
probe replicate treatment repXtrt input output log_input log_output match locus_tag ... g_mean g_median g_sem d_2.5pc d_25pc d_75pc d_97.5pc d_mean d_median d_sem
0 A_07_P052986 1 0 rep1trt0 13.354233 10.507812 2.591833 2.352119 lcl|NC_002695.1_cds_NP_309203.1_1134 ECs1176 ... 0.200417 0.262675 0.335062 -0.510635 -0.283252 -0.05752 -0.024356 -0.167816 -0.077139 0.151117
1 A_07_P045401 1 0 rep1trt0 8.442597 9.590875 2.133290 2.260812 lcl|NC_002695.1_cds_NP_309203.1_1134 ECs1176 ... 0.200417 0.262675 0.335062 -0.510635 -0.283252 -0.05752 -0.024356 -0.167816 -0.077139 0.151117
2 A_07_P062008 1 0 rep1trt0 5.553664 5.964808 1.714458 1.785877 lcl|NC_002695.1_cds_NP_309203.1_1134 ECs1176 ... 0.200417 0.262675 0.335062 -0.510635 -0.283252 -0.05752 -0.024356 -0.167816 -0.077139 0.151117
3 A_07_P052986 2 0 rep2trt0 6.815142 11.374824 1.919147 2.431403 lcl|NC_002695.1_cds_NP_309203.1_1134 ECs1176 ... 0.395308 0.393646 0.215424 -0.510635 -0.283252 -0.05752 -0.024356 -0.167816 -0.077139 0.151117
4 A_07_P045401 2 0 rep2trt0 5.823013 13.017137 1.761818 2.566267 lcl|NC_002695.1_cds_NP_309203.1_1134 ECs1176 ... 0.395308 0.393646 0.215424 -0.510635 -0.283252 -0.05752 -0.024356 -0.167816 -0.077139 0.151117

5 rows × 41 columns


In [12]:
# Inspect the data, one row per locus tag
estimates.head()


Out[12]:
locus_tag b_2.5pc b_25pc b_75pc b_97.5pc b_mean b_median b_sem d_2.5pc d_25pc d_75pc d_97.5pc d_mean d_median d_sem
5515 ECDH10B_RS00010 0.907909 0.932580 0.950421 0.964608 0.941324 0.944069 0.013947 -0.106789 -0.065958 -0.048565 -0.023847 -0.058070 -0.057542 0.019050
4705 ECDH10B_RS00020 0.914396 0.934645 0.951195 0.961351 0.942357 0.944690 0.012564 -0.085109 -0.064664 -0.046774 -0.025243 -0.055638 -0.057188 0.015163
5475 ECDH10B_RS00060 0.914911 0.935482 0.951967 0.968178 0.943733 0.945555 0.013453 -0.105381 -0.067194 -0.045122 -0.014049 -0.057540 -0.057505 0.021189
5474 ECDH10B_RS00065 0.912512 0.935084 0.951980 0.965746 0.943209 0.945886 0.013542 -0.092378 -0.063993 -0.044987 -0.020414 -0.055270 -0.056006 0.017903
5469 ECDH10B_RS00070 0.901806 0.934273 0.951271 0.963839 0.941399 0.944425 0.015426 -0.098637 -0.066353 -0.045173 -0.022492 -0.057138 -0.057634 0.018181

In [13]:
# Separate estimates for Sakai and DH10B into two different dataframes
sakai_estimates = tools.split_estimates(estimates, 'sakai')
dh10b_estimates = tools.split_estimates(estimates, 'dh10b')

Inspecting the fit

In the cells below, we visualise the fitted estimates for each of the parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ as:

  • box plots of median estimates for each locus tag
  • relationship between control and treatment effects in Sakai
  • plots of 50% credibility interval range and median estimate for each locus tag to identify locus tags with a possible selective advantage

Median parameter estimates

We first inspect the range of fitted estimates to get an overview of the relationships for the data as a whole, and then examine whether this relationship varies by E. coli isolate.

Making boxplots for the full set of fitted parameter estimates, for both isolates:


In [14]:
# Visualise median values for parameter estimates of alpha and gamma
tools.boxplot_medians(estimates_by_probe, ['a', 'g'])



In [15]:
# Visualise median values for parameter estimates of beta and delta
tools.boxplot_medians(estimates, ['b', 'd'])


For this fit we can see that the estimates are all, in the main, tightly-distributed. Most estimated (median) values of $\alpha$ (control intercept), $\gamma$ (treatment intercept), and $\delta$ (treatment slope) are close to zero. Most estimated values of $\beta$ are close to (but slightly less than) unity. This implies that:
  • The linear relationship between input and output intensity due to the control effects (growth only) is, for most genes in the experiment, a slight reduction of output intensity with respect to input intensity value, and on the whole the effect of the control/growth is neutral [median $\alpha$ ≈ 0, median $\beta$ ≈ 1]
  • For most genes in the experiment there is no treatment effect due to exposure to the plant [median $\gamma$ ≈ 0, median $\delta$ ≈ 0]


There are, however, a considerable number of outlying median values for each parameter, which suggests that a number of genes have associated parameter values that are affected by either control (growth) or treatment (passage).

DH10B

Considering boxplots of estimated $\beta_{j[i]}$ and $\delta_{j[i]}$ for the DH10B (carrier) isolate only:


In [16]:
# Visualise median values for Sakai parameter estimates
tools.boxplot_medians(dh10b_estimates, ['b', 'd'])


it is clear that the median parameter estimates for DH10B are extremely restricted in their range:

  • $0.93 < \beta < 0.98$
  • $-0.065 < \delta < 0.045$
The control effect appears to be essentially *neutral*, in that the output intensity is almost a 1:1 linear relationship with the input intensity, but it is striking that the median estimates of $\gamma$ and $\delta$ are very close to zero, suggesting that passage (treatment) has almost no effect on this relationship, for any DH10B locus tag. This is exactly what would be expected for DH10B as the carrier isolate.

Sakai

Considering the Sakai isolate parameter estimates for $\beta_{j[i]}$ and $\gamma_{j[i]}$ only:


In [17]:
# Visualise median values for Sakai parameter estimates
tools.boxplot_medians(sakai_estimates, ['b', 'd'])


By contrast to the results for DH10B, the median parameter estimates for Sakai have many large value outliers, though the bulk of estimates are close to the values seen for DH10B:

  • $0.2 < \beta < 1.4$
  • $-1.5 < \delta < 0.5$
This indicates that we see the expected result, that strong variability of control and treatment effects are effectively confined to the Sakai BAC fragments. It is expected that some genes/operons may be relatively advantageous in either growth (control) or passage (treatment) conditions, or both.

We can visualise the relationships between parameter estimates for control and treatment effects in a scatterplot of control effect ($\beta$) against treatment effect ($\delta) for each locus tag. This plot can be considered in four quadrants, which are delineated by the bulk of the data which describes orthogonal effects of locus tags on growth and treatment:


(i.e. for most locus tags, there is *either* an effect on treatment or control, but *not both*)

  • (upper left) positive effect on growth, negative effect for treatment: may be related to ability to use growth medium more efficiently
  • (upper right) positive effect on both growth and treatment: no locus tags display this characteristic
  • (lower right) positive effect on treatment, negative effect for control: may be related to ability to use/exploit the plant, that is suppressive in the medium
  • (lower left) negative effect on both growth and treatment: most locus tags that display an interaction lie in this group

In [18]:
# Plot estimated parameters for treatment effects against control effects for Sakai
fig, ax = plt.subplots(1, 1, figsize=(6,6))
ax.scatter(sakai_estimates['d_median'], sakai_estimates['b_median'], alpha=0.2)
ax.set_xlabel('delta (median)')
ax.set_ylabel('beta (median)');



The strong cross-like distribution indicates that most parameter estimates of $\beta$ or $\delta$ that vary from those of the bulk do so orthogonally in either *treatment* or *control* conditions, but not both.

Where Sakai genes have an estimated effect under both conditions, this is typically negative for both treatment and control (lower left quadrant).

Identifying locus tags that confer an advantage under treatment and/or control

We use a 50% credibility interval to determine whether the effect of a gene on passage is likely to be positive. Under this assumption, we identify locus tags for which the median estimate of $\delta$ is positive, and the central 50% of the parameter estimates for $\delta$ (the 50% credibility interval) does not include zero. We label these locus tags as trt_pos in the dataframe.


These locus tags correspond to the genes that we should believe confer a selective advantage in passage/*treatment* (i.e. we require our estimate to be credibly positive).

Likewise, we use a 50% credibility interval to determine whether the effect of a gene on surviving growth (control) is positive. If the 50% CI for $\beta$ does not include the 97.5 percentile for all estimates of $\beta$ (as an upper estimate of overall dataset centrality for this dataset), and the median value of $\beta$ is greater than this value, we consider that the effect of the gene on surviving growth conditions is positive. We label these locus tags as ctl_pos in the dataframe.


In [19]:
# Label locus tags with positive effects for control and treatment
sakai_estimates = tools.label_positive_effects(sakai_estimates)

We can count the number of locus_tags in each of the groups:


In [20]:
# Count locus tags in each of the positive groups
counts = [sum(sakai_estimates[col]) for col in ('trt_pos', 'ctl_pos', 'combined')]
print("treatment positive: {0}\ncontrol positive: {1}\nboth: {2}".format(*counts))


treatment positive: 115
control positive: 65
both: 0

which indicates, with these assumptions, that:

  • 115 genes have a credible positive effect on passage (treatment)
  • 65 genes have a credible positive effect in the growth (control) step
  • no genes have a credible positive effect for both growth and treatment.

(this confirms our observation in the earlier scatterplot)

Plotting distribution of effects on the Sakai genome

We can show the estimated effects, and our confidence in those estimates, on a rough representation of the genome by plotting those values for each locus tag, sorted in order on the genome.

In the plots that follow, parameter estimates for each locus tag are rendered as points (the median estimate), with the 50% credibility interval for the estimate indicated as a vertical line. If the 50% CI includes a threshold value - the median value for the bulk parameter estimate of $\beta$ or $\delta$ - then we consider that there is not strong evidence of an effect on survival due to that gene (compared to the bulk), and the interval is coloured blue.

If the interval does not include the corresponding threshold value, then it is coloured either green for a positive effect, or magenta for a negative effect.

Sakai

We split the Sakai estimates into groups: one for the chromosome, and one for each plasmid pOSAK and pO157, on the basis of the locus tag prefixes, annotating them with their start position on the parent molecule.


In [21]:
sakai_chromosome = sakai_estimates.loc[sakai_estimates['locus_tag'].str.startswith('ECs')]
sakai_pOSAK = sakai_estimates.loc[sakai_estimates['locus_tag'].str.startswith('pOSAK1')]
sakai_pO157 = sakai_estimates.loc[(sakai_estimates['locus_tag'].str.startswith('pO157')) |
                                  (sakai_estimates['locus_tag'].str.startswith('ECp'))]

In [22]:
# Sakai chromosome
sakai_chromosome_annotated = tools.annotate_locus_tags(sakai_chromosome,
                                                       os.path.join('..', 'data', 'Sakai',
                                                                    'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_chromosome_annotated.sort_values('startpos', inplace=True)
#sakai_chromosome_annotated.head(15)

In [23]:
# pOSAK1
sakai_pOSAK_annotated = tools.annotate_locus_tags(sakai_pOSAK,
                                                  os.path.join('..', 'data', 'Sakai',
                                                               'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_pOSAK_annotated.sort_values('startpos', inplace=True)
#sakai_pOSAK_annotated.head(15)

In [24]:
# pECp
sakai_pO157_annotated = tools.annotate_locus_tags(sakai_pO157,
                                                 os.path.join('..', 'data', 'Sakai',
                                                              'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_pO157_annotated.sort_values('startpos', inplace=True)
#sakai_pO157_annotated.head(15)

In [25]:
# Regions of interest
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1.3),
           ('SpLE1', 'ECs1299', 'ECs1410', 1.5),
           ('S-loop 225', 'ECs4325', 'ECs4341', 1.5),
           ('S-loop 231', 'ECs4379', 'ECs4387', 1.3)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
                  tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
               k, v0, v1, v2 in regions}

In [26]:
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_chromosome_annotated['b_median'])  

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai chromosome'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

# Plot on the figure axes
tools.plot_parameter(sakai_chromosome_annotated, ax, 'b', beta_thresh, annotations=annotations);



In [27]:
# Regions of interest
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1),
           ('SpLE1', 'ECs1299', 'ECs1410', 1.8),
           ('S-loop 225', 'ECs4325', 'ECs4341', 1.8),
           ('S-loop 231', 'ECs4379', 'ECs4387', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
                  tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
               k, v0, v1, v2 in regions}

In [28]:
# Plot genome-wide estimates of delta for Sakai and mark values that don't include zero in 50%CI
delta_thresh = np.median(sakai_chromosome_annotated['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai chromosome'
plt.title("{0} [threshold: {1:.2f}]".format(title, delta_thresh))

tools.plot_parameter(sakai_chromosome_annotated, ax, 'd', delta_thresh, annotations=annotations)



In [29]:
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_pOSAK_annotated['b_median']) 

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai plasmid pOSAK'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pOSAK_annotated, ax, 'b', beta_thresh)



In [30]:
# Plot genome-wide estimates of delta for Sakai and mark values that don't include zero in 50% CI
delta_thresh = np.median(sakai_pOSAK_annotated['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai plasmid pOSAK'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pOSAK_annotated, ax, 'd', delta_thresh)



In [31]:
# Regions of interest
regions = [('StcE', 'pO157p01', 'pO157p01', 0.98),
           ('etp T2SS', 'pO157p02', 'pO157p14', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
                  tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
               k, v0, v1, v2 in regions}

In [32]:
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_pO157_annotated['b_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai plasmid p0157'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pO157_annotated, ax, 'b', beta_thresh, annotations=annotations)



In [33]:
# Regions of interest
regions = [('StcE', 'pO157p01', 'pO157p01', 0.13),
           ('etp T2SS', 'pO157p02', 'pO157p14', 0.19)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
                  tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
               k, v0, v1, v2 in regions}

In [34]:
# Plot genome-wide estimates of delta for Sakai and mark values that don't include zero in 50% CI
delta_thresh = np.median(sakai_pO157_annotated['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai plasmid pO157'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(sakai_pO157_annotated, ax, 'd', delta_thresh, annotations=annotations)


These plots indicate that most Sakai genes do not produce parameter estimates that are indicative of credible effects in the control or treatment, in either direction.

Where effects are seen they tend to cluster on the genome, which is as would be expected if operons or gene clusters with common function were responsible for producing an effect. This is suggestive that we are measuring a biological effect, rather than noise.

In general, several clusters of both positive and negative effects appear in the chromosome and pO157 plots for effects due to control ($\beta$) and treatment ($\delta$).

DH10B

We plot similar representations for the DH10B isolate as a control, and see that all parameter estimates for this isolate's locus tags are very similar.


There is a weak sinusoidal pattern of fitted estimates. As no gene ordering information is available to the model fit, and there is an apparent symmetry to this pattern, it may reflect a real underlying biological process or structure.


In [35]:
# Annotate the DH10B results
dh10b_annotated = tools.annotate_locus_tags(dh10b_estimates,
                                            os.path.join('..', 'data', 'DH10B',
                                                         'GCF_000019425.1_ASM1942v1_genomic.gbff'))
dh10b_annotated.sort_values('startpos', inplace=True)

In [36]:
# Plot genome-wide estimates of beta for DH10B
beta_thresh = np.median(dh10b_estimates['b_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for DH10B', 
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(dh10b_estimates, ax, 'b', beta_thresh)



In [37]:
# Plot genome-wide estimates of delta for DH10B
delta_thresh = np.median(dh10b_estimates['d_median'])

# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))  
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for DH10B'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))

tools.plot_parameter(dh10b_estimates, ax, 'd', delta_thresh)


Identifying Sakai candidates

From the information above, we can list the 180 Sakai genes/locus tags that appear to impart a positive selective effect on treatment/passage (the green points/bars in the plots immediately above).


In [38]:
# Generate list of candidates with a positive effect under control or treatment.
candidates = sakai_estimates[sakai_estimates['ctl_pos'] | sakai_estimates['trt_pos']]
candidates = candidates[['locus_tag',
                         'b_median', 'ctl_pos',
                         'd_median', 'trt_pos']].sort_values(['ctl_pos', 'trt_pos', 'locus_tag'])
candidates.shape


Out[38]:
(180, 5)

In [39]:
# Inspect the data
candidates.head()


Out[39]:
locus_tag b_median ctl_pos d_median trt_pos
3523 ECs0415 0.943381 False 0.235389 True
3522 ECs0416 0.935728 False 0.174705 True
40 ECs1121 0.948441 False 0.057269 True
3754 ECs1161 0.947319 False 0.237648 True
3005 ECs1262 0.938215 False 0.197020 True

We restrict this set to those genes that only have a credible effect on treatment/passage, identifying 115 genes with positive $\delta$ where the 50% CI does not include zero:


In [40]:
# Restrict candidates only to those with an effect on treatment/passage.
trt_only_positive = candidates.loc[candidates['trt_pos'] & ~candidates['ctl_pos']]
trt_only_positive.shape


Out[40]:
(115, 5)

We add a column with the functional annotation of each of the candidates that appear to have a positive selective effect under treatment conditions:


In [41]:
# Annotated locus tags with functions from NCBI GenBank files
annotated = tools.annotate_locus_tags(trt_only_positive,
                                      os.path.join('..', 'data', 'Sakai',
                                                   'GCF_000008865.1_ASM886v1_genomic.gbff'))
pd.options.display.max_rows = 115 # force to show all rows
annotated


Out[41]:
locus_tag b_median ctl_pos d_median trt_pos annotation startpos
3523 ECs0415 0.943381 False 0.235389 True periplasmic-iron-binding protein 441343
3522 ECs0416 0.935728 False 0.174705 True regulatory protein 442371
40 ECs1121 0.948441 False 0.057269 True host specificity protein 1203336
3754 ECs1161 0.947319 False 0.237648 True excisionase 1247402
3005 ECs1262 0.938215 False 0.197020 True hypothetical protein 1321781
2989 ECs1272 0.943690 False 0.130858 True rtn-like protein 1337360
2988 ECs1273 0.945370 False 0.169504 True FidL-like protein 1338994
2786 ECs1274 0.943394 False 0.134231 True transcriptional regulator 1339490
2783 ECs1275 0.944433 False 0.101744 True oxidoreductase 1340990
2777 ECs1276 0.943277 False 0.135400 True chaperone protein 1341918
2775 ECs1277 0.941800 False 0.164063 True outer membrane protein 1342604
2769 ECs1278 0.932893 False 0.205353 True outer membrane usher protein 1343954
2764 ECs1279 0.935759 False 0.186083 True chaperone protein 1346529
2763 ECs1280 0.935879 False 0.230102 True major pilin protein 1347316
2761 ECs1281 0.944470 False 0.172548 True hypothetical protein 1347919
2756 ECs1282 0.947543 False 0.124275 True hemagglutinin/hemolysin-like protein 1349939
2754 ECs1283 0.946325 False 0.103695 True hemolysin activator-like protein 1353834
2750 ECs1284 0.947314 False 0.131823 True holo-ACP synthase 1355469
2749 ECs1285 0.947547 False 0.155132 True 3-oxoacyl-ACP reductase 1356043
2744 ECs1286 0.946540 False 0.172656 True (3R)-hydroxymyristoyl-ACP dehydratase 1356616
2742 ECs1287 0.945181 False 0.158969 True acyl carrier protein 1357195
2739 ECs1288 0.946302 False 0.172153 True glycine cleavage system protein T 1357503
4542 ECs1289 0.944467 False 0.144179 True 3-oxoacyl-ACP synthase 1358737
4541 ECs1290 0.944382 False 0.173207 True hypothetical protein 1361543
4530 ECs1291 0.944559 False 0.139463 True hypothetical protein 1362622
4524 ECs1292 0.947028 False 0.135113 True ABC transporter ATP-binding protein 1363484
4518 ECs1293 0.943448 False 0.139623 True hypothetical protein 1364175
4514 ECs1294 0.944918 False 0.141482 True hypothetical protein 1364832
4507 ECs1295 0.943648 False 0.188630 True hypothetical protein 1366151
4503 ECs1296 0.945348 False 0.160612 True hypothetical protein 1366936
4351 ECs1299 0.945724 False 0.129586 True integrase 1370629
4350 ECs1300 0.955336 False 0.130850 True hypothetical protein 1372018
4343 ECs1301 0.950305 False 0.162544 True transposase 1374331
4342 ECs1302 0.949132 False 0.143249 True regulatory protein 1374947
4335 ECs1303 0.948038 False 0.130195 True hypothetical protein 1375886
4330 ECs1304 0.946557 False 0.143737 True hypothetical protein 1378092
4329 ECs1305 0.945482 False 0.091902 True hypothetical protein 1378810
4317 ECs1312 0.949912 False 0.130438 True complement resistance protein 1385490
4305 ECs1315 0.949239 False 0.095599 True hypothetical protein 1386925
4300 ECs1316 0.954738 False 0.073505 True diacylglycerol kinase 1387535
4292 ECs1317 0.948272 False 0.104187 True outer membrane protein 1387987
4289 ECs1319 0.946332 False 0.141756 True hypothetical protein 1389831
4281 ECs1320 0.947656 False 0.097397 True hypothetical protein 1390158
4279 ECs1321 0.944511 False 0.167959 True hypothetical protein 1390442
4128 ECs1322 0.944768 False 0.153130 True urease subunit gamma 1391276
4121 ECs1323 0.944515 False 0.170197 True urease subunit beta 1391587
4120 ECs1324 0.946103 False 0.134457 True urease subunit alpha 1391897
4119 ECs1325 0.943279 False 0.130909 True urease accessory protein UreE 1393613
4113 ECs1326 0.944795 False 0.178645 True UreF 1394078
4112 ECs1327 0.945132 False 0.164409 True protein UreG 1394764
4102 ECs1330 0.946507 False 0.145510 True 50S ribosomal protein L31 1396593
4082 ECs1335 0.940679 False 0.135344 True hypothetical protein 1399916
4065 ECs1337 0.933473 False 0.244882 True hypothetical protein 1401179
4062 ECs1338 0.948663 False 0.162611 True hypothetical protein 1401601
4061 ECs1339 0.949873 False 0.158986 True hypothetical protein 1401982
4055 ECs1340 0.937697 False 0.242988 True hypothetical protein 1403599
3863 ECs1343 0.942417 False 0.139782 True TerW protein 1405456
3856 ECs1344 0.947931 False 0.155960 True hypothetical protein 1405941
3854 ECs1345 0.942798 False 0.151593 True hypothetical protein 1407160
3846 ECs1347 0.949253 False 0.169834 True hypothetical protein 1408852
3838 ECs1348 0.943916 False 0.161774 True hypothetical protein 1409198
501 ECs1397 0.942103 False 0.143552 True hypothetical protein 1447109
485 ECs1401 0.894047 False 0.112519 True hypothetical protein 1451802
482 ECs1402 0.934423 False 0.071359 True hypothetical protein 1452711
453 ECs1408 0.951253 False 0.063618 True hypothetical protein 1455430
451 ECs1409 0.942792 False 0.160255 True hypothetical protein 1455712
432 ECs1780 0.950131 False 0.162195 True DNA methylase 1772842
1969 ECs1800 0.948325 False 0.071955 True major tail subunit 1789012
3649 ECs1812 0.950230 False 0.295711 True hypothetical protein 1801540
3650 ECs1813 0.948075 False 0.303623 True integrase 1803526
3659 ECs1814 0.946699 False 0.355551 True hypothetical protein 1804692
3463 ECs1821 0.968390 False 0.253145 True hypothetical protein 1809856
3660 ECs1824 0.964454 False 0.229530 True hypothetical protein 1810845
3661 ECs1825 0.949948 False 0.288338 True bfpT-regulated chaperone-like protein 1811676
3975 ECs1975 0.946915 False 0.076027 True hypothetical protein 1953689
4223 ECs2192 0.961011 False 0.091518 True hypothetical protein 2190644
1982 ECs2240 0.947708 False 0.061602 True tail length tape measure protein 2219960
1961 ECs2246 0.949162 False 0.037638 True head-tail adaptor 2225510
172 ECs2248 0.948823 False 0.128293 True portal protein 2226199
1950 ECs2254 0.947297 False 0.075911 True Dnase 2233502
3102 ECs2797 0.889599 False 0.046828 True hypothetical protein 2743949
4012 ECs2967 0.947021 False 0.042129 True antirepressor protein 2919758
3223 ECs3498 0.938757 False 0.135753 True hypothetical protein 3484251
549 ECs3868 0.967686 False 0.127219 True hypothetical protein 3872936
364 ECs3869 0.956561 False 0.092450 True hypothetical protein 3873499
780 ECs4325 0.873029 False 0.103697 True O-methyltransferase 4332998
752 ECs4329 0.809716 False 0.083031 True acyl carrier protein 4335884
734 ECs4333 0.890242 False 0.458790 True hypothetical protein 4338410
718 ECs4335 0.772257 False 0.122669 True hypothetical protein 4340509
512 ECs4339 0.794904 False 0.209705 True beta-hydroxydecanoyl-ACP dehydrase 4345150
507 ECs4341 0.791081 False 0.132900 True 3-oxoacyl-ACP synthase 4346342
1887 ECs4353 0.811411 False 0.055275 True sugar kinase 4356742
1885 ECs4354 0.811870 False 0.164720 True phosphotransferase system HPr enzyme 4358240
1856 ECs4379 0.943504 False 0.091106 True hypothetical protein 4388386
1851 ECs4380 0.945445 False 0.061092 True heme utilization/transport protein 4389463
1842 ECs4381 0.947578 False 0.054208 True hypothetical protein 4391721
1839 ECs4382 0.942031 False 0.112038 True hemin binding protein 4392050
1837 ECs4383 0.945615 False 0.069680 True coproporphyrinogen III oxidase 4393062
1831 ECs4384 0.946534 False 0.098545 True ShuX-like protein 4394412
1828 ECs4385 0.944502 False 0.097967 True ShuY-like protein 4394906
1822 ECs4386 0.943409 False 0.108441 True hemin permease 4395614
1818 ECs4387 0.948050 False 0.050882 True hemin importer ATP-binding protein 4396567
5690 ECs4971 0.784740 False 0.432552 True hypothetical protein 5057187
3068 ECs5307 0.942778 False 0.052771 True type I restriction modification enzyme M subunit 5439606
3077 ECs5593 0.936959 False 0.065415 True endoribonuclease SymE 5437290
3128 pO157p01 0.944843 False 0.031526 True ToxR-regulated lipoprotein 0
3117 pO157p03 0.943886 False 0.042784 True EtpD 3674
2885 pO157p05 0.941998 False 0.047712 True EtpF 6938
2881 pO157p06 0.942176 False 0.039165 True EtpG 8192
2874 pO157p08 0.940078 False 0.052343 True EtpI 9192
3165 pO157p78 0.942091 False 0.103422 True EspP 80756
3151 pO157p79 0.948124 False 0.085414 True hypothetical protein 87221
3144 pO157p80 0.942737 False 0.090473 True RfbU-like protein 87658
3143 pO157p81 0.945955 False 0.055917 True hypothetical protein 88854
3138 pO157p82 0.943998 False 0.103180 True lipid A biosynthesis (KDO)2-(lauroyl)-lipid IV... 90649

Finally, we write this data out in tab-separated format


In [42]:
# Write data to file in tab-separated format
outfile_annotated = os.path.join('datasets', 'trt_positive.tab')
annotated.to_csv(outfile_annotated, sep="\t")

Manuscript Figure 1

The code in the cell below will reproduce figure 1 from the manuscript.


In [43]:
# Create figure with no title or xticks to hold the plotted axes
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(20, 26))  

# Add subplot for each result

# 1) Sakai chromosome
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1),
           ('SpLE1', 'ECs1299', 'ECs1410', 1.8),
           ('S-loop 225', 'ECs4325', 'ECs4341', 1.8),
           ('S-loop 231', 'ECs4379', 'ECs4387', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
                  tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
               k, v0, v1, v2 in regions}
delta_thresh = np.median(sakai_chromosome_annotated['d_median'])
tools.plot_parameter(sakai_chromosome_annotated, ax1, 'd', delta_thresh, annotations=annotations,
                     label="a) Sakai chromosome")

# 2) pO157 plasmid
regions = [('StcE', 'pO157p01', 'pO157p01', 0.13),
           ('etp T2SS', 'pO157p02', 'pO157p14', 0.19)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
                  tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
               k, v0, v1, v2 in regions}
delta_thresh = np.median(sakai_pO157_annotated['d_median'])
tools.plot_parameter(sakai_pO157_annotated, ax2, 'd', delta_thresh, annotations=annotations,
                     label="b) Sakai pO157")

# 3) DH10B chromosome
delta_thresh = np.median(dh10b_estimates['d_median'])
tools.plot_parameter(dh10b_estimates, ax3, 'd', delta_thresh, label="c) DH10B chromosome")

# Save figure as pdf
plt.savefig("figure_1.pdf");