Supplementary Information: Holmes et al. 2019

4. etpD knockouts and complementation


This notebook describes raw data import, cleaning and QA, then modelling of the *etpD* knockout and complementation experiments.

The goal is to clean the data, account for non-biological effects, and to infer the relative contributions to recovery of E. coli Sakai from spinach material of

  1. knocking out the etpD gene
  2. restoring the etpD on a pSE380 plasmid, in a $\Delta etpD$ Sakai background.

Table of Contents

  1. Experiment design
  2. Data import
    1. Process data
    2. Inspect raw data
  3. Model definition
    1. Measured values of logCFU
    2. Linear model parameters
    3. Choice of priors
    4. Stan model construction
  4. Model fit
    1. Leaf tissue
    2. Root tissue
  5. Conclusions

Python imports

The dependencies below are required to run the code in this notebook, and we also set visualisation defaults for graphs.


In [1]:
%matplotlib inline

import os
import warnings; warnings.filterwarnings('ignore')

import arviz as az
import matplotlib.pyplot as plt; plt.rcParams['figure.figsize'] = (12, 8)
import numpy as np
import pandas as pd
import pystan
import seaborn as sns

import tools

1. Experiment design


The experiments comprise four variants of E. coli Sakai:
  1. Wild Type (`WT`)
  2. etpD knockouts: ΔetpD (`KO`)
  3. empty plasmid pSE380: (`empty`)
  4. plasmid pSE380 carrying etpD complement: (`complement`)

The variants can be considered to be nested in that they are stepwise changes to the composition of E. coli Sakai. The KO variant can be considered to modify baseline WT adherence (by deletion of the etpD gene). The empty variant can be considered to modify the adherence properties of the KO variant (by introducing an empty pSE380 plasmid to the knockout background). The complement variant can be considered to modify the adherence properties of the empty variant (by introducing the etpD gene into the pSE380 plasmid, in the knockout background).


The four bacterium variants are each and separately exposed to two spinach plant tissues: leaf and root, and logCFU of recovered bacteria measured as a proxy for adherence, as described in the manuscript.

The questions at hand are:
  1. Is there a difference in logCFU for `KO` with respect to `WT`? If so, what is the direction and magnitude of change?
  2. Is there a difference in logCFU for `complement` with respect to `empty`? If so, what is the direction and magnitude of this change?

Measurements were made in batches of five. That is, on a particular half-day a batch of ten measurements were made: five `KO` and five `WT`; or five `empty` and five `complement`. There is therefore no natural pairing of `WT`/`KO` to `empty`/`complement` measurements as these were carried out in different batches at different times. We may assume that each batch is subject to specific effects that may bias the observed logCFU with respect to other batches, and that this effect is consistent for all treatments applied to that batch.

For each tissue, experiments were conducted such that each logCFU measurement was acquired on a single, independent root or leaf. This means that there is no natural pairing of specific `WT` and `KO`, or `empty` and `complement` measurements. The natural comparison is therefore within-batch, and pooled.

2. Data import

Raw data was previously converted to plain text comma-separated variable format from the `Excel` file `../data/etpD/etpD_raw_data.xlsx`:
  • The file ../data/etpD/leaves.csv contains data from experiments on spinach leaves
  • The file ../data/etpD/roots.csv contains data from experiments on spinach roots

We import the data using pandas, and show the first few lines:


In [2]:
# Raw data locations
leafdatafile = os.path.join("..", "data", "etpD", "leaves.csv")
rootdatafile = os.path.join("..", "data", "etpD", "roots.csv")

# Parse data
leafdata = pd.read_csv(leafdatafile)
rootdata = pd.read_csv(rootdatafile)

In [3]:
# Show leaf data
leafdata.head()


Out[3]:
Sakai delta EtpD pSE380 pSE_EtpD
0 1.496078e+06 424092.4092 7.414966e+05 3288288.288
1 9.719888e+05 122362.8692 2.788462e+06 4272300.469
2 1.614790e+06 451790.6336 4.020619e+05 2897435.897
3 6.333333e+05 894444.4444 4.341564e+05 2933333.333
4 1.228175e+06 676767.6768 7.233010e+05 1450396.825

In [4]:
# Show root data
rootdata.head()


Out[4]:
Sakai delta EtpD pSE380 pSE_EtpD
0 4750000.000 852564.1026 439814.8148 662162.1622
1 2000000.000 666666.6667 588652.4823 569230.7692
2 2166666.667 757894.7368 699530.5164 154166.6667
3 4927536.232 618618.6186 682051.2821 565040.6504
4 5714285.714 973856.2092 628415.3005 307692.3077

Process data


The raw data are, as described above, divisible into batches of five as separate (`Sakai`/`WT`; `delta EtpD`/`KO`) and (`pSE380`/`empty`; `pSE_EtpD`/`complement`) results. We convert the data from wide to long form, creating a categorical factors to describe the four treatments.

We also introduce three further factors: KO, empty, and complement, so that these describe the cumulative modifications to Sakai as a series of 0/1 values. This will define the treatment categories in the model we develop, below.

  • Sakai: KO:0, empty:0, complement:0
  • delta EtpD: KO:1, empty:0, complement:0
  • pSE380: KO:1, empty:1, complement:0
  • pSE_EtpD: KO:1, empty:1, complement:1

and introduce a further batch identifier, uniquely numbering batches from $1 \ldots n$ in blocks of five rows.


In [5]:
def wide_to_long(df):
    """Convert wide form array to long form array.
    
    Assumes the columns are in order: Sakai, delta EtpD, pSE380,
    pSE_EtpD.
    
    Splits the data into two tables: Sakai and delta Etpd;
    pSE380 and pSE_EtpD. Adds a column `batch` and fills this
    incrementally in groups of five, *not* repeating batch
    number across the two tables.
    
    Then the tables are melted with all data values in a single
    column headed CFU, and the batch column retained. The tables
    are joined row-wide (Sakai/EtpD table on top).
    
    New columns are added: KO, pSE380, etpD. These columns are
    filled as 1/0 depending on the originating column for the
    value:
    
    - first 25% of rows are 0, 0, 0
    - next 25% of rows are 1, 0, 0
    - next 25% of rows are 1, 1, 0
    - next 25% of rows are 1, 1, 1
    """
    # Split data
    df1 = df[['Sakai', 'delta EtpD']]
    df2 = df[['pSE380', 'pSE_EtpD']]
    
    # Assign batches
    for idx, dfn in enumerate([df1, df2]):
        dfn['batch'] = (np.floor(np.array(dfn.index)/5) + 1).astype(int)
        dfn['batch'] = (dfn['batch'] + (len(dfn)/5 * idx)).astype(int)

    # wide to long
    df1 = pd.melt(df1, id_vars=['batch'], value_vars=['Sakai', 'delta EtpD'])
    df2 = pd.melt(df2, id_vars=['batch'], value_vars=['pSE380', 'pSE_EtpD'])
 
    # Concatenate dataframes lengthwise and rename headers
    df = pd.concat([df1, df2])
    df.columns = ['batch', 'label', 'CFU']
    
    # Add factors
    df['KO'] = [0] * int(0.25 * len(df)) + [1] * int(0.75 * len(df))
    df['empty'] = [0] * int(0.5 * len(df)) + [1] * int(0.5 * len(df))    
    df['complement'] = [0] * int(0.75 * len(df)) + [1] * int(0.25 * len(df))
    
    # Convert values to log_10 values
    df['logCFU'] = np.log10(df['CFU'])
    
    # Make factors categorical
    for col in ['batch', 'KO', 'empty', 'complement', 'label']:
        df[col] = df[col].astype('category')
    
    return df

In [6]:
# Convert data from wide to long
leafdata = wide_to_long(leafdata)
rootdata = wide_to_long(rootdata)

In [7]:
# Show leaf data
leafdata.head()


Out[7]:
batch label CFU KO empty complement logCFU
0 1 Sakai 1.496078e+06 0 0 0 6.174954
1 1 Sakai 9.719888e+05 0 0 0 5.987661
2 1 Sakai 1.614790e+06 0 0 0 6.208116
3 1 Sakai 6.333333e+05 0 0 0 5.801632
4 1 Sakai 1.228175e+06 0 0 0 6.089260

In [8]:
# Show root data
rootdata.head()


Out[8]:
batch label CFU KO empty complement logCFU
0 1 Sakai 4750000.000 0 0 0 6.676694
1 1 Sakai 2000000.000 0 0 0 6.301030
2 1 Sakai 2166666.667 0 0 0 6.335792
3 1 Sakai 4927536.232 0 0 0 6.692630
4 1 Sakai 5714285.714 0 0 0 6.756962

Inspect raw data

We can visualise the data for root and leaf applications to see trends, such as whether there are batch effects that need to be considered and accounted for in the statistical model.


Plotting the leaf data below, we can see that there is a clear difference between batches 1 and 2 for any condition, but little difference between batches 3 and 4. This suggests that we must account for batch effects for this dataset.

Unfortunately, as only two batch-level replicates for each condition are available, we cannot determine from inspection alone whether batch 1 measurements are lower with respect to an accurate batch 2, or whether batch 2 measurements are higher with respect to an accurate batch 1. However, by constructing an appropriate model of the system and taking into account that the conditions represent successive modifications to the baseline recovery of Sakai we can 'borrow' information about baseline recovery from all measurements in this dataset.

Further, noting that 6.5 logCFU is a typical recovery of bacteria from leaf material, we can apply suitable prior assumptions in a Bayesian statistical model. This will allow us to infer values for the batch effects alongside the biological effects we seek to answer questions (1) and (2) above.


Batch effects notwithstanding, it appears initially that knocking out etpD reduces the amount of bacteria recovered from leaf tissue, while introducing the etpD complement in the knockout background increases the level of recovery with respect to the empty vector.

In [9]:
# Plot leaf data by treatment, split by batch
ax = sns.stripplot(x='label', y='logCFU', hue='batch',
                   data=leafdata,
                   jitter=True, dodge=True)



Plotting the root data, similar batch-level effects are again observed (measurements in batches 1, 2 and 5 appear to be suppressed in relation to the other batches).

As before, using a suitable model of the system that includes batch effects, and applying reasonable priors, it shall be possible to infer both batch effect and biological effect sizes simultaneously from this dataset.


Initially it appears that knockout of etpD results in reduced recovery from roots, but reintroduction of etpD in a knockout background does not increase recovery levels with respect to the empty vector.

In [10]:
# Plot root data by treatment, split by batch
ax = sns.stripplot(x='label', y='logCFU', hue='batch',
                   data=rootdata,
                   jitter=True, dodge=True)


3. Model definition


In this section, we will define the statistical model that will be used to infer four key biologically-relevant values for each of the two tissues:
  1. The expected level of recovery (logCFU) of wild-type Sakai
  2. The expected change in recovery (logCFU) of Sakai with respect to wild-type if the etpD gene is knocked out
  3. The expected change in recovery (logCFU) when the pSE380 empty vector is introduced, with respect to the $\Delta etpD$ knockout
  4. The expected change in recovery (logCFU) with respect to the knockout carrying empty vector, if the pSE380 vector carries the etpD gene

The model will also infer a value for each of the distinct batches of measurements in the dataset:

  • The expected difference in recovery (logCFU) due to the sample being measured in that batch

Measured values of logCFU


With the logCFU measurements presented in long form, as above, each measurement is assigned a unique identifying number $i \in 1, 2 \ldots k$, where $k$ is the total number of measurements.

We assume that each measurement with index i measures the logCFU (proxy for extent of adherence/attachment) of a particular Sakai variant when recovered from plant tissue. We define the measured value as the *output* $y_{i}$, and assume that it represents a true value of logCFU (which we call $\hat{y_{i}}$) plus some irreducible error in the measurement ($\epsilon$). The error in measurement is assumed to be the same for all measurements, and we assume it is normally-distributed with mean 0 and variance $\sigma^{2}_{y}$.

$$y_i = \hat{y_i} + \epsilon_i$$ $$\epsilon_i \sim N(0, \sigma_y^2) \implies y_i \sim N(\hat{y_i}, \sigma_y^2)$$

Linear model parameters


We assume that the 'true' recovered value $\hat{y_{i}}$ is determined by the following factors:
  1. The baseline tendency of the `WT` wild-type variant to adhere to the tissue (shared by all datapoints)
  2. Modification of (1) by the specific loss of etpD
  3. Modification of (2) by the presence of plasmid pSE380
  4. Modification of (3) by the presence of etpD on plasmid pSE380
  5. Effects specific to the batch being run.

We assume that the effects of these factors are additive only, do not interact in complex ways, and can be combined into a linear model.

The baseline factor (1) is represented as the parameter $\alpha$. This influences all measurements $i$. The other factors (2)-(4) represent modifications to the baseline, each with its own parameter ($\beta$, $\gamma$, $\delta$). Any one of these influences only a subset of measurements, and whether a measurement is influenced by one of these factors can be coded as 1/0 integer values. In the long-form table above these are found in columns KO, empty and complement. For the purposes of formally defining our model, we refer to these instead as $t$, $u$ and $v$. The measurement with index $i$ therefore has corresponding values $t_i$, $u_i$, and $v_i$.


Splitting the factors in this way enables 'borrowing' of data from the plasmid-bearing variants for the estimate of the effect due to loss of *etpD*. Similarly, it enables the direct estimation of the effect of reintroducing etpD as a complement. With this interpretation, the parameters $\beta$, $\gamma$ and $\delta$ have the meanings:
  • $\beta$ - the change in logCFU due to loss of *etpD* with respect to the wild type
  • $\gamma$ - the change in logCFU due to incorporation of the pSE380 plasmid. Note that as there is no experiment in which the wild-type Sakai carries this plasmid, this parameter only estimates the effect in a $\Delta etpD$ background.
  • $\delta$ - the change in logCFU due to incorporation of *etpD* on the pSE380 plasmid. Note again that this parameter only estimates the effect with respect to a $\Delta etpD$ Sakai background carrying an empty pSE380 plasmid, due to the experiment structure.

The collection of batch effect factors (5) can be represented as an array of parameters, $\phi_{j_{i}}$, where $j \in {1, 2, \ldots, n}$ and $n$ is the number of batches. Each value of $\phi_{j_{i}}$ represents the common effect on all measurements in batch $j_{i}$ - the batch to which measurement $i$ belongs - due to being in that batch. It is a non-biological effect.


We choose to model each of the parameters as additive effects, cumulatively acting upon the baseline adherence $\alpha$. Effects that diminish recovery should have negative values for their parameters; effects that enhance recovery should have positive values for their parameters. Effects for which their parameter's credibility interval spans zero will not be considered to have been shown convincingly to have any effect.

Choice of priors


For each biological effect, we choose the parameter's *prior distribution* to be a Normal distribution, as we expect conservative behaviour and no extreme outliers. Our goal here is to reasonably constrain the estimated parameters to biologically plausible values, without being overly prescriptive.

Our model will estimate the mean value and dispersion of each parameter's posterior distribution:

$$\alpha \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$$$\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2)$$$$\gamma \sim N(\mu_{\gamma}, \sigma_{\gamma}^2)$$$$\delta \sim N(\mu_{\delta}, \sigma_{\delta}^2)$$

We have a strong prior belief about the expected distribution of $\alpha$. From the obtained raw data (and additional experiments) we expect the mean baseline of `logCFU` for wild-type Sakai to be somewhere around 6.5. This is represented in the prior distribution we choose for $\mu_{\alpha}$.

The distribution is centred on 6.5 logCFU, and we assume one log unit variance in this:

$$\mu_{\alpha} \sim \textrm{N}(6.5, 1)$$

The variance of each parameter's prior distribution will be drawn from a weak prior, as we do not have strong beliefs about its value.

We choose a half-Normal distribution with mean zero, and standard deviation of 1 log unit:

$$\sigma_\alpha \sim \textrm{N}(0, 1)$$$$\sigma_\beta \sim \textrm{N}(0, 1)$$$$\sigma_\gamma \sim \textrm{N}(0, 1)$$$$\sigma_\delta \sim \textrm{N}(0, 1)$$

For batch effects $j_i$ (the batch from which measurement $i$ is drawn) we expect that the batch effect itself will be conservative and well-behaved, so represent this with a Normal distribution. Most batches will be fairly similar to the 'usual' measured value, so the effect of a batch (the mean of that distribution) is likely to be close to zero. However, we expect that occasionally a batch effect will be quite large, so we choose a Cauchy distribution to describe our prior for the mean, as it is Normal-like but has fat tails, so can take more extreme values more frequently. We expect the usual mean of this effect to be close to zero, so use $\textrm{Cauchy}(0, 0.2)$ as quite a strong prior against large batch effects.

We select a weaker half-Normal prior for the variance of the batch effect itself $\sigma_{\phi_{j_{i}}}$:

$$j_i \in {1, 2, \ldots, 8}$$$$\phi_{j_{i}} \sim N(\mu_{\phi_{j_{i}}}, \sigma_{\phi_{j_{i}}}^2)$$$$\mu_{\phi_{j_{i}}} \sim \textrm{Cauchy}(0, 0.2)$$$$\sigma_{\phi_{j_{i}}} \sim \textrm{Normal}(0, 1)$$

We need to take some care with the estimates of $\phi_{j_{i}}$ as it is a linear combination with $\alpha$, and these values can blow up in a complementary manner (very large values of one permit very large values of the other). Specifically, leaving $\mu_{\phi_{j_{i}}}$ weakly constrained can result in wildly unrealistic estimates of $\phi_{j_{i}}$ and $\alpha$ being possible. Conversely, we *do* expect at least one outlier, as biasing effects due to the batches are clearly visible. As a result, we constrained our priors so that $\mu_{\phi_{j_{i}}} \sim \textrm{Cauchy}(0, 0.2)$ to reflect our understanding that batch effects are typically close to zero, but can occasionally take large values.
We therefore construct the following model of the experiment, for a given tissue: $$y_i = \hat{y_i} + \epsilon_i$$ $$\hat{y_i} = \alpha + \beta t_i + \gamma u_i + \delta v_i + \phi_{j_{i}}$$ $$j_i \in {1, 2, \ldots, 8}$$ $$\alpha \sim \textrm{N}(\mu_{\alpha}, \sigma_{\alpha}^2)$$ $$\beta \sim \textrm{N}(\mu_{\beta}, \sigma_{\beta}^2)$$ $$\gamma \sim \textrm{N}(\mu_{\gamma}, \sigma_{\gamma}^2)$$ $$\delta \sim \textrm{N}(\mu_{\delta}, \sigma_{\delta}^2)$$ $$\phi_{j_{i}} \sim \textrm{N}(\mu_{\phi_{j_{i}}}, \sigma_{\phi_{j_{i}}}^2)$$ $$\mu_{\alpha} \sim \textrm{N}(6.5, 1)$$ $$\mu_{\phi_{j_{i}}} \sim \textrm{Cauchy}(0, 0.2)$$ $$\sigma_\alpha \sim \textrm{N}(0, 1)$$ $$\sigma_\beta \sim \textrm{N}(0, 1)$$ $$\sigma_\gamma \sim \textrm{N}(0, 1)$$ $$\sigma_\delta \sim \textrm{N}(0, 1)$$ $$\sigma_{\phi_{j_{i}}} \sim \textrm{N}(0, 1)$$
  • $y_i$: measured logCFU for a single application $i$ of bacteria to a plant
  • $\hat{y_i}$: 'true' logCFU for a single application $i$ of bacteria to a plant
  • $\epsilon_i$: measurement error in logCFU for a single application of $i$ of bacteria to a plant
  • $\alpha$: the expected logCFU for wild-type (`WT`) Sakai to a plant
  • $\mu_\alpha$: mean logCFU for `WT` Sakai on the plant
  • $\sigma_\alpha$: variance logCFU for `WT` Sakai on the plant
  • $\beta$: the expected modification of logCFU w.r.t. `WT` as the result of deletion of *etpD*
  • $\mu_\beta$: mean change in logCFU for `KO` w.r.t. `WT` Sakai on the plant
  • $\sigma_\alpha$: variance for change in logCFU for `KO` w.r.t. `WT` Sakai on the plant
  • $t_i$: 0/1 pseudovariable indicating whether the strain used for $i$ is $\Delta etpD$ (is `KO`)
  • $\gamma$: the expected modification of logCFU w.r.t. `KO` as the result of incorporation of plasmid pSE380
  • $\mu_\gamma$: mean change in logCFU for `empty` w.r.t. `KO` Sakai on the plant
  • $\sigma_\gamma$: variance for change in logCFU for `empty` w.r.t. `KO` Sakai on the plant
  • $u_i$: 0/1 pseudovariable indicating whether the strain used for $i$ is $\Delta etpD$ and carrying pSE380 (is `empty`)
  • $\delta$: the expected modification of logCFU w.r.t. `empty` as the result of complementation with *etpD*
  • $\mu_\delta$: mean change in logCFU for `complement` w.r.t. `empty` Sakai on the plant
  • $\sigma_\delta$: variance for change in logCFU for `complement` w.r.t. `empty` Sakai on the plant
  • $v_i$: 0/1 pseudovariable indicating whether the strain used for $i$ is $\Delta etpD$, carrying pSE380, and includes the complemented *etpD* (is `complement`)
  • $j_i$: the batch from which $i$ is drawn, $j \in {1, 2, \ldots, 8}$
  • $\phi_{j_{i}}$: the expected modification of logCFU for batch $j_i$ w.r.t. `WT`
  • $\mu_{\phi_{j_{i}}}$: mean change in logCFU for batch $j_i$ w.r.t. `WT`
  • $\sigma_{\phi_{j_{i}}}$: variance for change in logCFU for batch $j_i$ w.r.t. `WT`

Stan model construction

We need to define data, parameters and our model for Stan

data block

  • N: int - number of datapoints
  • K: int - number of batches
  • t: vector[N] - 0/1 values for KO
  • u: vector[N] - 0/1 values for empty
  • v: vector[N] - 0/1 values for complement
  • batch: int[N] - batch (1:K) from which each measurement is drawn
  • y: vector[N] - the output logCFU values

parameter block

  • a: real - estimated logCFU for WT
  • mu_a: real - unconstrained value representing mean of distribution underlying a
  • sigma_a: real<lower=0> - standard deviation of distribution underlying a
  • b: real - estimated change in logCFU for KO
  • mu_b: real - unconstrained value representing mean of distribution underlying b
  • sigma_b: real<lower=0> - standard deviation of distribution underlying b
  • c: real - estimated change in logCFU for empty
  • mu_c: real - unconstrained value representing mean of distribution underlying c
  • sigma_c: real<lower=0> - standard deviation of distribution underlying c
  • d: real - estimated change in logCFU for complement
  • mu_d: real - unconstrained value representing mean of distribution underlying d
  • sigma_d: real<lower=0> - standard deviation of distribution underlying d
  • f: real vector[K] - estimated changes in logCFU due to each batch effect
  • mu_f: real - estimated mean change due to batch effects underlying f
  • sigma_f: real<lower=0> - standard deviation of distribution underlying f
  • sigma: real<lower=0> - irreducible error in experiment/model

transformed parameter block

  • y_hat: vector[N] - linear relationship describing $\hat{y}$, our estimate of logCFU, subject to variance sigma:

y_hat[i] <- a + b * t[i] + c * u[i] + d * v[i] + f[batch[i]]

Define Stan model

This section describes the Python code implementing the model.


In [11]:
# define Stan model
treatment_model = """
data {
  int<lower=0> N;
  int<lower=0> K;
  vector[N] t;
  vector[N] u;
  vector[N] v;
  vector[N] y;
  int<lower=1, upper=K> batch[N];
}
parameters {
  real a;
  real b;
  real g;
  real d;
  vector[K] f;
  real mu_a;
  real mu_b;
  real mu_g;
  real mu_d;
  real mu_f;
  real<lower=0> sigma_a;
  real<lower=0> sigma_b;
  real<lower=0> sigma_g;
  real<lower=0> sigma_d;  
  real<lower=0> sigma_f;
  real<lower=0> sigma;
}
transformed parameters{
  vector[N] y_hat;
  
  for (i in 1:N)
    y_hat[i] = a + b * t[i] + g * u[i] + d * v[i] + f[batch[i]];
}
model {
  sigma_a ~ normal(0, 1);
  mu_a ~ normal(6.5, 1);
  a ~ normal(mu_a, sigma_a);
  
  sigma_b ~ normal(0, 1);
  b ~ normal(mu_b, sigma_b);

  sigma_g ~ normal(0, 1);
  g ~ normal(mu_g, sigma_g);

  sigma_d ~ normal(0, 1);
  d ~ normal(mu_d, sigma_d);

  sigma_f ~ normal(0, 1);
  mu_f ~ cauchy(0, 0.2);
  f ~ normal(mu_f, sigma_f);

  y ~ normal(y_hat, sigma);
}
"""

In [12]:
# function to run pyStan fit on dataset
def run_fit(df, model, seed=987654321):
    """Run pyStan fit on passed dataset.
    
    Returns pyStan fit of the passed model on the passed
    dataframe. The seed argument ensures repeatability
    for the purpose of demonstration - change to explore
    stability of the result.
    """
    batches = df['batch'].unique()
    nbatches = len(batches)

    # Set variables into a dictionary for passing to Stan
    data_dict = {'N': len(df),
                 'K': nbatches,
                 't': df['KO'],
                 'u': df['empty'],
                 'v': df['complement'],
                 'y': df['logCFU'],
                 'batch': df['batch']
                }
    model = pystan.StanModel(model_code=model)  # Compile
    fit = model.sampling(data=data_dict,        # fit
                         iter=5000, seed=seed)
    return fit

4. Model fit


We apply the same model to both the leaf and root datasets to estimate the parameters $\alpha$, $\beta$, $\gamma$, $\delta$ and $\phi_{j_{i}}$, separately for each tissue.

 Leaf tissue

Firstly, we fit the model to leaf tissue


In [13]:
# fit the model to leaf tissue
leaf_fit = run_fit(leafdata, treatment_model)


INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_e9dc8683e85b3f71a1137d158b6fda8d NOW.
WARNING:pystan:417 of 10000 iterations ended with a divergence (4.17 %).
WARNING:pystan:Try running with adapt_delta larger than 0.8 to remove the divergences.

In [14]:
# plot parameter estimates
leafdata = az.from_pystan(posterior=leaf_fit)
coords = {"f_dim_0": [0, 1]}
az.plot_trace(leafdata,
              var_names=['a', 'b', 'g', 'd', 'f'],
              coords=coords,
              combined=True,
              figsize=(10,12));


There are several encouraging features of the fitted parameter estimates:

  • The curves are smooth and (mainly) unimodal, suggesting convergence and a fairly well-behaved posterior distribution.
  • The sample values are also well-behaved and do not 'flip' between extreme values

With a forest plot, we can see whether the credibility intervals of the sampled variables include zero (thick line represents 50% CI; thin line represents 94% CI):


In [15]:
az.plot_forest(leafdata,
               var_names=['a', 'b', 'g', 'd', 'f'],
               combined=True,
               figsize=(10,6),
               rope=(0,0));


The estimated CIs for $\beta$ are entirely negative and do not include zero. We can conclude that the effect of deleting etpD is therefore to reduce logCFU (median value ≈-0.32) with respect to WT.

All estimated CIs for $\alpha$ and $\delta$ are entirely positive, and do not include zero. We can conclude that the "baseline" WT logCFU is therefore positive (median value ≈6.37), and the effect on logCFU of complementing etpD with respect to the empty vector in a knockout background is also positive (median value ≈0.4).

By contrast, both CIs for $\gamma$ include zero, and we can conclude that the effect of introducing the empty plasmid into a knockout background is that logCFU is essentially unaffected (median value ≈-0.16 implies a small negative effect but not "significant" departure from zero)


In [16]:
# summarise results
leaf_fit


Out[16]:
Inference for Stan model: anon_model_e9dc8683e85b3f71a1137d158b6fda8d.
4 chains, each with iter=5000; warmup=2500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

            mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a           6.39    0.02   0.59   5.22   6.05   6.37   6.69   7.63   1330    1.0
b          -0.32  1.3e-3   0.12  -0.56   -0.4  -0.32  -0.25  -0.09   7943    1.0
g          -0.15    0.01   0.63  -1.45  -0.48  -0.16   0.17   1.12   2033    1.0
d            0.4  1.4e-3   0.12   0.16   0.32    0.4   0.48   0.63   7428    1.0
f[1]       -0.37    0.02   0.59   -1.6  -0.67  -0.34  -0.03    0.8   1348    1.0
f[2]        0.28    0.02   0.59  -0.95  -0.03   0.29   0.61   1.43   1339    1.0
f[3]        0.06    0.02   0.65  -1.41  -0.25   0.07   0.41   1.31   1596    1.0
f[4]       -0.13    0.02   0.65  -1.62  -0.44  -0.11   0.22    1.1   1573    1.0
mu_a        6.42    0.01   0.69   5.06   5.99    6.4   6.84   7.87   2976    1.0
mu_b       -0.33    0.02    1.0   -2.5  -0.71  -0.33   0.04   1.89   4015    1.0
mu_g       -0.16    0.02   1.18   -2.6  -0.75  -0.15   0.41   2.37   2915    1.0
mu_d         0.4    0.02   1.07  -1.96-9.0e-3    0.4   0.84   2.69   3711    1.0
mu_f       -0.03    0.01   0.46  -1.09  -0.18-5.5e-3   0.16   0.88   1433    1.0
sigma_a      0.7  7.7e-3   0.54   0.05   0.27   0.57    1.0   1.99   4839    1.0
sigma_b      0.8    0.01    0.6   0.04   0.31   0.68   1.16   2.26   2659    1.0
sigma_g     0.79    0.01   0.59   0.06   0.32   0.66   1.11   2.25   3119    1.0
sigma_d     0.84    0.01    0.6   0.07   0.36   0.72   1.21   2.28   3508    1.0
sigma_f     0.57  6.1e-3   0.34   0.18   0.33   0.48   0.71   1.45   3001    1.0
sigma       0.26  4.8e-4   0.03   0.21   0.24   0.26   0.28   0.34   5083    1.0
y_hat[1]    6.02  1.1e-3    0.1   5.82   5.95   6.02   6.09   6.22   8423    1.0
y_hat[2]    6.02  1.1e-3    0.1   5.82   5.95   6.02   6.09   6.22   8423    1.0
y_hat[3]    6.02  1.1e-3    0.1   5.82   5.95   6.02   6.09   6.22   8423    1.0
y_hat[4]    6.02  1.1e-3    0.1   5.82   5.95   6.02   6.09   6.22   8423    1.0
y_hat[5]    6.02  1.1e-3    0.1   5.82   5.95   6.02   6.09   6.22   8423    1.0
y_hat[6]    6.66  1.1e-3    0.1   6.45   6.59   6.66   6.73   6.86   8275    1.0
y_hat[7]    6.66  1.1e-3    0.1   6.45   6.59   6.66   6.73   6.86   8275    1.0
y_hat[8]    6.66  1.1e-3    0.1   6.45   6.59   6.66   6.73   6.86   8275    1.0
y_hat[9]    6.66  1.1e-3    0.1   6.45   6.59   6.66   6.73   6.86   8275    1.0
y_hat[10]   6.66  1.1e-3    0.1   6.45   6.59   6.66   6.73   6.86   8275    1.0
y_hat[11]    5.7  1.2e-3    0.1    5.5   5.63    5.7   5.77    5.9   7579    1.0
y_hat[12]    5.7  1.2e-3    0.1    5.5   5.63    5.7   5.77    5.9   7579    1.0
y_hat[13]    5.7  1.2e-3    0.1    5.5   5.63    5.7   5.77    5.9   7579    1.0
y_hat[14]    5.7  1.2e-3    0.1    5.5   5.63    5.7   5.77    5.9   7579    1.0
y_hat[15]    5.7  1.2e-3    0.1    5.5   5.63    5.7   5.77    5.9   7579    1.0
y_hat[16]   6.34  1.1e-3    0.1   6.14   6.27   6.34    6.4   6.54   8279    1.0
y_hat[17]   6.34  1.1e-3    0.1   6.14   6.27   6.34    6.4   6.54   8279    1.0
y_hat[18]   6.34  1.1e-3    0.1   6.14   6.27   6.34    6.4   6.54   8279    1.0
y_hat[19]   6.34  1.1e-3    0.1   6.14   6.27   6.34    6.4   6.54   8279    1.0
y_hat[20]   6.34  1.1e-3    0.1   6.14   6.27   6.34    6.4   6.54   8279    1.0
y_hat[21]   5.96  1.1e-3    0.1   5.76    5.9   5.96   6.03   6.16   8490    1.0
y_hat[22]   5.96  1.1e-3    0.1   5.76    5.9   5.96   6.03   6.16   8490    1.0
y_hat[23]   5.96  1.1e-3    0.1   5.76    5.9   5.96   6.03   6.16   8490    1.0
y_hat[24]   5.96  1.1e-3    0.1   5.76    5.9   5.96   6.03   6.16   8490    1.0
y_hat[25]   5.96  1.1e-3    0.1   5.76    5.9   5.96   6.03   6.16   8490    1.0
y_hat[26]   5.78  1.1e-3    0.1   5.58   5.71   5.78   5.84   5.98   8494    1.0
y_hat[27]   5.78  1.1e-3    0.1   5.58   5.71   5.78   5.84   5.98   8494    1.0
y_hat[28]   5.78  1.1e-3    0.1   5.58   5.71   5.78   5.84   5.98   8494    1.0
y_hat[29]   5.78  1.1e-3    0.1   5.58   5.71   5.78   5.84   5.98   8494    1.0
y_hat[30]   5.78  1.1e-3    0.1   5.58   5.71   5.78   5.84   5.98   8494    1.0
y_hat[31]   6.36  1.2e-3    0.1   6.16   6.29   6.36   6.43   6.56   7646    1.0
y_hat[32]   6.36  1.2e-3    0.1   6.16   6.29   6.36   6.43   6.56   7646    1.0
y_hat[33]   6.36  1.2e-3    0.1   6.16   6.29   6.36   6.43   6.56   7646    1.0
y_hat[34]   6.36  1.2e-3    0.1   6.16   6.29   6.36   6.43   6.56   7646    1.0
y_hat[35]   6.36  1.2e-3    0.1   6.16   6.29   6.36   6.43   6.56   7646    1.0
y_hat[36]   6.18  1.1e-3    0.1   5.97   6.11   6.18   6.24   6.38   8575    1.0
y_hat[37]   6.18  1.1e-3    0.1   5.97   6.11   6.18   6.24   6.38   8575    1.0
y_hat[38]   6.18  1.1e-3    0.1   5.97   6.11   6.18   6.24   6.38   8575    1.0
y_hat[39]   6.18  1.1e-3    0.1   5.97   6.11   6.18   6.24   6.38   8575    1.0
y_hat[40]   6.18  1.1e-3    0.1   5.97   6.11   6.18   6.24   6.38   8575    1.0
lp__       28.15    0.07   3.48  20.27  26.07  28.53  30.65  33.88   2208    1.0

Samples were drawn using NUTS at Tue Jan 29 11:10:07 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The model fit appears to have converged (Rhat=1.0 for all estimates), and the estimated standard error of the mean for any estimated parameter is much smaller than the estimated mean. We can be confident that the model fit is stable.


The model fit above is consistent with the following conclusions:

  • The baseline recovery ($\alpha$) has median value 6.37 logCFU (50% credibility interval; 6.05:6.69)
  • The effect of knocking out etpD with respect to wild-type Sakai ($\beta$) is to reduce recovery (all values in 95% CI are negative). The median reduction is 0.32 logCFU (50% CI; -0.4:-0.25).
  • There is no strong effect on recovery due to introduction of empty vector pSE380, with respect to the $\Delta etpD$ knockout ($\gamma$, 50% CI includes zero)
  • The reintroduction of etpD ($\delta$) increases recovery (all values in 95% CI are positive). The median increase is 0.4 logCFU (50% CI; 0.32:0.48)
  • The mean bias in measurement due to a single batch ($\mu_f$) is zero (50% CI includes zero)
  • The bias introduced by batch 1 ($\phi_1$/`f[0]`) acts to suppress bacterial recovery by ≈0.34 logCFU (50% CI; -0.67:-0.03)
  • No other batch has a notable effect on recovery (50% CI includes zero for `f[1]`, `f[2]`, `f[3]`)

Root tissue

We next fit the model to root tissue


In [17]:
# fit model to root tissue
root_fit = run_fit(rootdata, treatment_model)


INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_e9dc8683e85b3f71a1137d158b6fda8d NOW.
WARNING:pystan:480 of 10000 iterations ended with a divergence (4.8 %).
WARNING:pystan:Try running with adapt_delta larger than 0.8 to remove the divergences.

In [18]:
# plot parameter estimates
rootdata = az.from_pystan(posterior=root_fit)
coords = {"f_dim_0": [3, 4, 5]}
az.plot_trace(rootdata,
              var_names=['a', 'b', 'g', 'd', 'f'],
              coords=coords,
              combined=True,
              figsize=(10,14));


As with the fit to leaf data, there is good reason to believe that the parameter estimates are reliable.

  • The curves are smooth and unimodal, suggesting convergence and a well-behaved posterior distribution.
  • The sample values are also well-behaved and do not 'flip' between extreme values

With a forest plot, we can again see whether the credibility intervals of the sampled variables include zero (as before: thick line represents 50% CI; thin line represents 94% CI):


In [19]:
az.plot_forest(rootdata,
               var_names=['a', 'b', 'g', 'd', 'f'],
               combined=True,
               figsize=(10,6),
               rope=(0,0));


The estimated CIs for $\beta$ are again entirely negative and do not include zero. We can conclude that the effect of deleting etpD is therefore to reduce logCFU (median value ≈-0.31) with respect to WT. This is consistent with the estimated values seen in the leaf data.

Both estimated CIs for $\alpha$ are entirely positive, and do not include zero. We can conclude that the "baseline" WT logCFU is therefore positive (median value ≈6.56), again consistent with the estimated values seen in the leaf data.

Both CIs for $\gamma$ include zero, so we can again conclude that the effect of introducing the empty plasmid into a knockout background is that logCFU is essentially unaffected (median value ≈-0.07 implies a small negative effect but not "significant" departure from zero), as with the leaf data.

However, the effect on logCFU of complementing etpD with respect to the empty vector in a knockout background is not the same as for the leaf data. The 50% CI is entirely negative, but the 94% CI includes zero. There is stochasticity in the Bayesian model fitting, and the 50% CI is seen in other runs also to include zero. As a result we cannot confidently say that our estimate for $\delta$ is consistently negative. Moreover, the median value of $\delta$ is ≈-0.06, of the same order as that for $\gamma$, which we consider to have no significant effect on logCFU. We may consistently, therefore, consider the fitted value of $\gamma$ to be either not different from zero (at 94% CI), or slightly negative with a non-significant effect size.


In [20]:
# summarise results
root_fit


WARNING:pystan:Truncated summary with the 'fit.__repr__' method. For the full summary use 'print(fit)'
Out[20]:
Warning: Shown data is truncated to 100 parameters
For the full summary use 'print(fit)'

Inference for Stan model: anon_model_e9dc8683e85b3f71a1137d158b6fda8d.
4 chains, each with iter=5000; warmup=2500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

            mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a           6.54    0.01   0.44   5.57   6.34   6.56   6.75   7.42   1528    1.0
b          -0.31  7.2e-4   0.06  -0.43  -0.35  -0.31  -0.27  -0.19   7379    1.0
g          -0.06  4.2e-3   0.26  -0.56  -0.22  -0.07    0.1   0.46   3752    1.0
d          -0.06  7.5e-4   0.06  -0.19   -0.1  -0.06  -0.02   0.06   6989    1.0
f[1]       -0.16    0.01   0.44  -1.05  -0.37  -0.17   0.04   0.81   1526    1.0
f[2]       -0.21    0.01   0.44   -1.1  -0.42  -0.22-5.8e-3   0.75   1544    1.0
f[3]        0.16    0.01   0.44  -0.73  -0.05   0.14   0.36   1.12   1535    1.0
f[4]        0.23    0.01   0.44  -0.66   0.02   0.22   0.43   1.19   1532    1.0
f[5]       -0.43    0.01   0.44  -1.35  -0.65  -0.43  -0.23   0.52   1481   1.01
f[6]        0.26    0.01   0.44  -0.65   0.04   0.26   0.47   1.22   1478   1.01
f[7]        0.07    0.01   0.44  -0.83  -0.15   0.08   0.28   1.04   1483   1.01
f[8]         0.1    0.01   0.44  -0.82  -0.12    0.1    0.3   1.06   1469   1.01
mu_a        6.53    0.01   0.62   5.25   6.17   6.55    6.9   7.82   3402    1.0
mu_b       -0.33    0.02   1.01  -2.53  -0.67  -0.31   0.04   1.79   4310    1.0
mu_g       -0.09    0.02   1.03  -2.32  -0.49  -0.08   0.36   1.98   4295    1.0
mu_d       -0.06    0.02   0.96  -2.15  -0.39  -0.06   0.27   2.05   3966    1.0
mu_f      3.8e-3    0.01   0.41  -0.84  -0.16-7.3e-4   0.15   0.93   1403   1.01
sigma_a     0.68  8.0e-3   0.53   0.05   0.26   0.55   0.96   1.99   4399    1.0
sigma_b     0.79    0.01    0.6   0.05   0.33   0.64   1.13   2.27   3164    1.0
sigma_g     0.78    0.02    0.6   0.02   0.32   0.66   1.13   2.21    961    1.0
sigma_d     0.79    0.01   0.59   0.05   0.33   0.66   1.12   2.23   3484    1.0
sigma_f     0.34  2.5e-3   0.13   0.18   0.25   0.31   0.39   0.67   2642    1.0
sigma        0.2  2.2e-4   0.02   0.17   0.19    0.2   0.21   0.24   6157    1.0
y_hat[1]    6.38  7.1e-4   0.07   6.24   6.33   6.38   6.43   6.52   9589    1.0
y_hat[2]    6.38  7.1e-4   0.07   6.24   6.33   6.38   6.43   6.52   9589    1.0
y_hat[3]    6.38  7.1e-4   0.07   6.24   6.33   6.38   6.43   6.52   9589    1.0
y_hat[4]    6.38  7.1e-4   0.07   6.24   6.33   6.38   6.43   6.52   9589    1.0
y_hat[5]    6.38  7.1e-4   0.07   6.24   6.33   6.38   6.43   6.52   9589    1.0
y_hat[6]    6.33  1.1e-3   0.07   6.19   6.28   6.33   6.37   6.47   4092    1.0
y_hat[7]    6.33  1.1e-3   0.07   6.19   6.28   6.33   6.37   6.47   4092    1.0
y_hat[8]    6.33  1.1e-3   0.07   6.19   6.28   6.33   6.37   6.47   4092    1.0
y_hat[9]    6.33  1.1e-3   0.07   6.19   6.28   6.33   6.37   6.47   4092    1.0
y_hat[10]   6.33  1.1e-3   0.07   6.19   6.28   6.33   6.37   6.47   4092    1.0
y_hat[11]    6.7  7.4e-4   0.07   6.56   6.65    6.7   6.74   6.83   8607    1.0
y_hat[12]    6.7  7.4e-4   0.07   6.56   6.65    6.7   6.74   6.83   8607    1.0
y_hat[13]    6.7  7.4e-4   0.07   6.56   6.65    6.7   6.74   6.83   8607    1.0
y_hat[14]    6.7  7.4e-4   0.07   6.56   6.65    6.7   6.74   6.83   8607    1.0
y_hat[15]    6.7  7.4e-4   0.07   6.56   6.65    6.7   6.74   6.83   8607    1.0
y_hat[16]   6.77  7.6e-4   0.07   6.63   6.72   6.77   6.82   6.91   8267    1.0
y_hat[17]   6.77  7.6e-4   0.07   6.63   6.72   6.77   6.82   6.91   8267    1.0
y_hat[18]   6.77  7.6e-4   0.07   6.63   6.72   6.77   6.82   6.91   8267    1.0
y_hat[19]   6.77  7.6e-4   0.07   6.63   6.72   6.77   6.82   6.91   8267    1.0
y_hat[20]   6.77  7.6e-4   0.07   6.63   6.72   6.77   6.82   6.91   8267    1.0
y_hat[21]   6.07  6.9e-4   0.07   5.94   6.03   6.07   6.12   6.21   9864    1.0
y_hat[22]   6.07  6.9e-4   0.07   5.94   6.03   6.07   6.12   6.21   9864    1.0
y_hat[23]   6.07  6.9e-4   0.07   5.94   6.03   6.07   6.12   6.21   9864    1.0
y_hat[24]   6.07  6.9e-4   0.07   5.94   6.03   6.07   6.12   6.21   9864    1.0
y_hat[25]   6.07  6.9e-4   0.07   5.94   6.03   6.07   6.12   6.21   9864    1.0
y_hat[26]   6.02  8.1e-4   0.07   5.89   5.98   6.02   6.07   6.15   7113    1.0
y_hat[27]   6.02  8.1e-4   0.07   5.89   5.98   6.02   6.07   6.15   7113    1.0
y_hat[28]   6.02  8.1e-4   0.07   5.89   5.98   6.02   6.07   6.15   7113    1.0
y_hat[29]   6.02  8.1e-4   0.07   5.89   5.98   6.02   6.07   6.15   7113    1.0
y_hat[30]   6.02  8.1e-4   0.07   5.89   5.98   6.02   6.07   6.15   7113    1.0
y_hat[31]   6.39  7.2e-4   0.07   6.25   6.34   6.39   6.44   6.52   9457    1.0
y_hat[32]   6.39  7.2e-4   0.07   6.25   6.34   6.39   6.44   6.52   9457    1.0
y_hat[33]   6.39  7.2e-4   0.07   6.25   6.34   6.39   6.44   6.52   9457    1.0
y_hat[34]   6.39  7.2e-4   0.07   6.25   6.34   6.39   6.44   6.52   9457    1.0
y_hat[35]   6.39  7.2e-4   0.07   6.25   6.34   6.39   6.44   6.52   9457    1.0
y_hat[36]   6.46  6.9e-4   0.07   6.32   6.42   6.46   6.51    6.6   9805    1.0
y_hat[37]   6.46  6.9e-4   0.07   6.32   6.42   6.46   6.51    6.6   9805    1.0
y_hat[38]   6.46  6.9e-4   0.07   6.32   6.42   6.46   6.51    6.6   9805    1.0
y_hat[39]   6.46  6.9e-4   0.07   6.32   6.42   6.46   6.51    6.6   9805    1.0
y_hat[40]   6.46  6.9e-4   0.07   6.32   6.42   6.46   6.51    6.6   9805    1.0
y_hat[41]   5.74  7.8e-4   0.07   5.61   5.69   5.74   5.79   5.88   7959    1.0
y_hat[42]   5.74  7.8e-4   0.07   5.61   5.69   5.74   5.79   5.88   7959    1.0
y_hat[43]   5.74  7.8e-4   0.07   5.61   5.69   5.74   5.79   5.88   7959    1.0
y_hat[44]   5.74  7.8e-4   0.07   5.61   5.69   5.74   5.79   5.88   7959    1.0
y_hat[45]   5.74  7.8e-4   0.07   5.61   5.69   5.74   5.79   5.88   7959    1.0
y_hat[46]   6.43  1.2e-3   0.07    6.3   6.39   6.43   6.48   6.57   3588    1.0
y_hat[47]   6.43  1.2e-3   0.07    6.3   6.39   6.43   6.48   6.57   3588    1.0
y_hat[48]   6.43  1.2e-3   0.07    6.3   6.39   6.43   6.48   6.57   3588    1.0
y_hat[49]   6.43  1.2e-3   0.07    6.3   6.39   6.43   6.48   6.57   3588    1.0
y_hat[50]   6.43  1.2e-3   0.07    6.3   6.39   6.43   6.48   6.57   3588    1.0
y_hat[51]   6.25  7.6e-4   0.07   6.11    6.2   6.24   6.29   6.38   8115    1.0
y_hat[52]   6.25  7.6e-4   0.07   6.11    6.2   6.24   6.29   6.38   8115    1.0
y_hat[53]   6.25  7.6e-4   0.07   6.11    6.2   6.24   6.29   6.38   8115    1.0
y_hat[54]   6.25  7.6e-4   0.07   6.11    6.2   6.24   6.29   6.38   8115    1.0
y_hat[55]   6.25  7.6e-4   0.07   6.11    6.2   6.24   6.29   6.38   8115    1.0
y_hat[56]   6.27  7.4e-4   0.07   6.14   6.22   6.27   6.32   6.41   8984    1.0
y_hat[57]   6.27  7.4e-4   0.07   6.14   6.22   6.27   6.32   6.41   8984    1.0
y_hat[58]   6.27  7.4e-4   0.07   6.14   6.22   6.27   6.32   6.41   8984    1.0
y_hat[59]   6.27  7.4e-4   0.07   6.14   6.22   6.27   6.32   6.41   8984    1.0
y_hat[60]   6.27  7.4e-4   0.07   6.14   6.22   6.27   6.32   6.41   8984    1.0
y_hat[61]   5.68  8.7e-4   0.07   5.55   5.63   5.68   5.73   5.82   6334    1.0
y_hat[62]   5.68  8.7e-4   0.07   5.55   5.63   5.68   5.73   5.82   6334    1.0
y_hat[63]   5.68  8.7e-4   0.07   5.55   5.63   5.68   5.73   5.82   6334    1.0
y_hat[64]   5.68  8.7e-4   0.07   5.55   5.63   5.68   5.73   5.82   6334    1.0
y_hat[65]   5.68  8.7e-4   0.07   5.55   5.63   5.68   5.73   5.82   6334    1.0
y_hat[66]   6.38  8.9e-4   0.07   6.24   6.33   6.37   6.42   6.51   6071    1.0
y_hat[67]   6.38  8.9e-4   0.07   6.24   6.33   6.37   6.42   6.51   6071    1.0
y_hat[68]   6.38  8.9e-4   0.07   6.24   6.33   6.37   6.42   6.51   6071    1.0
y_hat[69]   6.38  8.9e-4   0.07   6.24   6.33   6.37   6.42   6.51   6071    1.0
y_hat[70]   6.38  8.9e-4   0.07   6.24   6.33   6.37   6.42   6.51   6071    1.0
y_hat[71]   6.19  8.3e-4   0.07   6.05   6.14   6.19   6.23   6.32   7010    1.0
y_hat[72]   6.19  8.3e-4   0.07   6.05   6.14   6.19   6.23   6.32   7010    1.0
y_hat[73]   6.19  8.3e-4   0.07   6.05   6.14   6.19   6.23   6.32   7010    1.0
y_hat[74]   6.19  8.3e-4   0.07   6.05   6.14   6.19   6.23   6.32   7010    1.0
y_hat[75]   6.19  8.3e-4   0.07   6.05   6.14   6.19   6.23   6.32   7010    1.0
y_hat[76]   6.21  7.6e-4   0.07   6.08   6.17   6.21   6.26   6.35   8203    1.0
lp__       88.49    0.17   3.86  80.15  86.07  88.77  91.21  95.31    490   1.01

Samples were drawn using NUTS at Tue Jan 29 11:11:00 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The model fit has again converged (Rhat=1.0 for all estimates), and once more the estimated standard error of the mean for any estimated parameter is much smaller than the estimated mean (except for $\mu_{\phi}$/mu_f). We can be confident that the model fit is stable.


The model fit above is consistent with the following conclusions:

  • The baseline recovery ($\alpha$) has median value 6.56 logCFU (50% credibility interval; 6.34:6.72)
  • The effect of knocking out etpD with respect to wild-type Sakai ($\beta$) is to reduce recovery (all values in 95% CI are negative). The median reduction is 0.31 logCFU (50% CI; -0.35:-0.27).
  • There is no notable effect on recovery due to introduction of empty vector pSE380, with respect to the $\Delta etpD$ knockout ($\gamma$, 50% CI includes zero)
  • The reintroduction of etpD ($\delta$) does not increase recovery (50% CI is negative; -0.1:-0.02). The median loss in recovery is small: -0.06 logCFU.
  • The mean bias in measurement due to a single batch ($\mu_f$) is zero (50% CI includes zero), although individual batches appear to act to either suppress (2, 5) or enhance (4, 6) logCFU at the 50% CI.

5. Conclusions


From the model fits above, we can conclude the following:

  • Knockout of etpD reduces the recovery of E. coli Sakai by ≈0.3 logCFU with respect to the wild-type Sakai in both root and leaf tissue.
  • In leaf tissue, complementation of the $\Delta etpD$ mutant with etpD on pSE380 increases recovery by ≈0.4 logCFU with respect to the empty pSE380 plasmid in the same background.
  • In root tissue, complementation of the $\Delta etpD$ mutant with etpD on pSE380 does not increase recovery with respect to the empty pSE380 plasmid in the same background.

6. Manuscript figure

The code below should reproduce figure 4 from the manuscript, reporting the estimated model parameters in leaf and root tissue.


In [21]:
sns.set_style("darkgrid")
sns.set_context('notebook')
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['ps.useafm'] = True
plt.rcParams['pdf.fonttype'] = 42

In [22]:
def extract_fit_vars(fit):
    """Return a pandas dataframe summarising estimated parameters."""
    dfs = []
    for param in ['b', 'g', 'd']:
        df = tools.extract_fit_variable_summary(fit, param)
        df.columns=["mean", "sem", "median", "2.5pc", "97.5pc", "25pc", "75pc"]
        df.index=[param]
        dfs.append(df)
    return pd.concat(dfs)

def plot_params(df, ax):
    """Plot the parameter estimates as ball and stick"""
    vals = df['median']  # marker
    cilo = df['25pc']    # lower end of 50% CI
    cihi = df['75pc']    # upper end of 50% CI
        
    # Add markers to plot
    ax.scatter(range(len(df)), vals, marker='_', s=100,
               c='#000000', lw=1)
    for idx, val, vlo, vhi in zip(range(len(df)), vals, cilo, cihi):
        ax.plot([idx, idx], [vlo, vhi], color='k', lw=2)
    
    # Format plot
    sns.despine(bottom=True)
    ax.set_xticks([0, 1, 2, 3])
    ax.axhline(color='k', lw=1)
    ax.set_ylabel('logCFU')
    ax.set_xticklabels(['Sakai\n$\Delta etpD$', '$\Delta etpD$\npSE380', '$\Delta etpD$\npSE_EtpD'])

    
# Set default seaborn style (plain, black and white)
sns.set_style("white")

# Construct the axes for drawing
fig = plt.figure(figsize=(4, 12))
bac_ax = fig.add_subplot(3, 1, 1)
leaf_ax = fig.add_subplot(3, 1, 2)
root_ax = fig.add_subplot(3, 1, 3)


# Render ball-and-stick plots for each fit
plot_params(extract_fit_vars(root_fit), root_ax)
plot_params(extract_fit_vars(leaf_fit), leaf_ax)

# Plot part 3A: pV41 and BAC2B5 results
bacdata = pd.read_csv(os.path.join('..', 'data', 'etpD', 'pV41_BAC2B5.csv'))            # load data
bacdata = pd.melt(bacdata, value_vars=['pV41', 'BAC2B5 (etp)'])  # wide to long
bacdata = bacdata.dropna(how='any')                              # drop NaN values
bacdata['logCFU'] = np.log10(bacdata['value'])                   # take log transform
sns.boxplot(x='variable', y='logCFU', data=bacdata,              # plot boxplot
            color="#ffffff", ax=bac_ax)
bac_ax.plot([0, 1], [6.9, 6.9], linewidth=2, color='#000000')    # plot asterisk and bar
bac_ax.plot([0, 0], [6.8, 7.0], linewidth=2, color='#000000')
bac_ax.plot([1, 1], [6.8, 7.0], linewidth=2, color='#000000')
bac_ax.text(0.5, 6.9, '*', fontsize=20)
bac_ax.set_xlabel('')                                            # remove x-axis label
for idx, artist in enumerate(bac_ax.artists):                    # make boxplot lines black
    artist.set_edgecolor('#000000')
    for jdx in range(idx * 6, idx * 6 + 6):
        line = bac_ax.lines[jdx]
        line.set_color('#000000')
        line.set_mfc('#000000')
        line.set_mec('#000000')
        
# Add subplot labels
bac_ax.text(-1.25, 7, 'A', fontsize=20)
leaf_ax.text(-1.25, 0.45, 'B', fontsize=20)
root_ax.text(-1.25, 0.1, 'C', fontsize=20)

# Relabel y-axes and rotate x-ticks
for ax in (bac_ax, leaf_ax, root_ax):
    ax.set_ylabel("$\log_{10}$ (cfu/ml/g)")
    for lbl in ax.get_xticklabels():
        lbl.set_rotation(45)
        
# Modify margins
plt.subplots_adjust(left=0, right=0.9, top=0.8, bottom=0.2)
plt.tight_layout()

# Save figure in useful formats (.tif is not available)
plt.savefig('figure4.png', dpi=600)
plt.savefig('figure4.pdf');



In [ ]: