The goal is to clean the data, account for non-biological effects, and infer the relative contributions to recovery of E. coli Sakai from spinach material of the wild-type baseline, each successive genetic modification (knockout, empty vector, and complementation), and non-biological batch effects.
In [1]:
%matplotlib inline
import os
import warnings; warnings.filterwarnings('ignore')
import arviz as az
import matplotlib.pyplot as plt; plt.rcParams['figure.figsize'] = (12, 8)
import numpy as np
import pandas as pd
import pystan
import seaborn as sns
import tools
The variants can be considered nested, in that they are stepwise changes to the composition of E. coli Sakai. The `KO` variant modifies baseline `WT` adherence (by deletion of the etpD gene). The `empty` variant modifies the adherence properties of the `KO` variant (by introducing an empty pSE380 plasmid to the knockout background). The `complement` variant modifies the adherence properties of the `empty` variant (by introducing the etpD gene into the pSE380 plasmid, in the knockout background).
- `../data/etpD/leaves.csv` contains data from experiments on spinach leaves
- `../data/etpD/roots.csv` contains data from experiments on spinach roots

We import the data using `pandas`, and show the first few lines:
In [2]:
# Raw data locations
leafdatafile = os.path.join("..", "data", "etpD", "leaves.csv")
rootdatafile = os.path.join("..", "data", "etpD", "roots.csv")
# Parse data
leafdata = pd.read_csv(leafdatafile)
rootdata = pd.read_csv(rootdatafile)
In [3]:
# Show leaf data
leafdata.head()
Out[3]:
In [4]:
# Show root data
rootdata.head()
Out[4]:
We also introduce three further factors: `KO`, `empty`, and `complement`, so that these describe the cumulative modifications to Sakai as a series of 0/1 values. These will define the treatment categories in the model we develop below:

- Sakai: `KO`:0, `empty`:0, `complement`:0
- delta EtpD: `KO`:1, `empty`:0, `complement`:0
- pSE380: `KO`:1, `empty`:1, `complement`:0
- pSE_EtpD: `KO`:1, `empty`:1, `complement`:1

We also introduce a further `batch` identifier, uniquely numbering batches from $1 \ldots n$ in blocks of five rows. A small sketch of the resulting treatment coding follows.
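As a quick illustration (not part of the analysis pipeline), the treatment coding can be written out as a small dataframe; the column names match those added by `wide_to_long()` below:

```python
# Illustrative design matrix for the cumulative 0/1 treatment coding;
# uses the pandas import from the first cell.
coding = pd.DataFrame({
    "label": ["Sakai", "delta EtpD", "pSE380", "pSE_EtpD"],
    "KO": [0, 1, 1, 1],
    "empty": [0, 0, 1, 1],
    "complement": [0, 0, 0, 1],
})
print(coding)
```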
In [5]:
def wide_to_long(df):
    """Convert wide form array to long form array.

    Assumes the columns are in order: Sakai, delta EtpD, pSE380,
    pSE_EtpD.

    Splits the data into two tables: Sakai and delta EtpD;
    pSE380 and pSE_EtpD. Adds a column `batch` and fills this
    incrementally in groups of five, *not* repeating batch
    numbers across the two tables.

    The tables are then melted with all data values in a single
    column headed CFU, and the batch column retained. The tables
    are joined row-wise (Sakai/delta EtpD table on top).

    New columns are added: KO, empty, complement. These columns are
    filled as 1/0 depending on the originating column for the
    value:

    - first 25% of rows are 0, 0, 0
    - next 25% of rows are 1, 0, 0
    - next 25% of rows are 1, 1, 0
    - next 25% of rows are 1, 1, 1
    """
    # Split data (.copy() so the batch assignments below do not
    # trigger chained-assignment warnings)
    df1 = df[['Sakai', 'delta EtpD']].copy()
    df2 = df[['pSE380', 'pSE_EtpD']].copy()
    # Assign batches in blocks of five rows, continuing the numbering
    # from the first table into the second
    for idx, dfn in enumerate([df1, df2]):
        dfn['batch'] = (np.floor(np.array(dfn.index)/5) + 1).astype(int)
        dfn['batch'] = (dfn['batch'] + (len(dfn)/5 * idx)).astype(int)
    # Wide to long
    df1 = pd.melt(df1, id_vars=['batch'], value_vars=['Sakai', 'delta EtpD'])
    df2 = pd.melt(df2, id_vars=['batch'], value_vars=['pSE380', 'pSE_EtpD'])
    # Concatenate dataframes lengthwise and rename headers
    df = pd.concat([df1, df2])
    df.columns = ['batch', 'label', 'CFU']
    # Add 0/1 treatment factors (rows arrive in label order:
    # Sakai, delta EtpD, pSE380, pSE_EtpD)
    df['KO'] = [0] * int(0.25 * len(df)) + [1] * int(0.75 * len(df))
    df['empty'] = [0] * int(0.5 * len(df)) + [1] * int(0.5 * len(df))
    df['complement'] = [0] * int(0.75 * len(df)) + [1] * int(0.25 * len(df))
    # Convert values to log_10 values
    df['logCFU'] = np.log10(df['CFU'])
    # Make factors categorical
    for col in ['batch', 'KO', 'empty', 'complement', 'label']:
        df[col] = df[col].astype('category')
    return df
In [6]:
# Convert data from wide to long
leafdata = wide_to_long(leafdata)
rootdata = wide_to_long(rootdata)
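As an optional sanity check (not in the original workflow), we can confirm that the long-form tables have the expected structure:

```python
# Each treatment label should account for a quarter of the rows.
print(leafdata['label'].value_counts())
# Cross-tabulate batch against treatment: each batch should pair five
# measurements of each of its two treatments.
print(pd.crosstab(leafdata['batch'], leafdata['label']))
```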
In [7]:
# Show leaf data
leafdata.head()
Out[7]:
In [8]:
# Show root data
rootdata.head()
Out[8]:
We can visualise the data for root and leaf applications to see trends, such as whether there are batch effects that need to be accounted for in the statistical model.

Unfortunately, as only two batch-level replicates are available for each condition, we cannot determine from inspection alone whether batch 1 measurements are low with respect to an accurate batch 2, or batch 2 measurements are high with respect to an accurate batch 1. However, by constructing an appropriate model of the system, and taking into account that the conditions represent successive modifications to the baseline recovery of Sakai, we can 'borrow' information about baseline recovery from all measurements in this dataset.

Further, noting that 6.5 logCFU is a typical recovery of bacteria from leaf material, we can apply suitable prior assumptions in a Bayesian statistical model. This will allow us to infer values for the batch effects alongside the biological effects we seek, to answer questions (1) and (2) above.
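As an aside, the prior this implies for the baseline mean can be visualised directly; this sketch (not part of the original analysis) draws the Normal(6.5, 1) density that will be placed on `mu_a` in the Stan model below. It assumes scipy is available alongside the imports above:

```python
# Sketch of the Normal(6.5, 1) prior on the baseline mean recovery.
from scipy import stats

x = np.linspace(3, 10, 200)
plt.plot(x, stats.norm.pdf(x, loc=6.5, scale=1), color='k')
plt.axvline(6.5, color='gray', linestyle='--')
plt.xlabel('logCFU')
plt.ylabel('prior density');
```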
In [9]:
# Plot leaf data by treatment, split by batch
ax = sns.stripplot(x='label', y='logCFU', hue='batch',
                   data=leafdata,
                   jitter=True, dodge=True)
As before, using a suitable model of the system that includes batch effects, and applying reasonable priors, it is possible to infer both batch effect and biological effect sizes simultaneously from this dataset.
In [10]:
# Plot root data by treatment, split by batch
ax = sns.stripplot(x='label', y='logCFU', hue='batch',
                   data=rootdata,
                   jitter=True, dodge=True)
The model will infer a value for each of these factors, including one for each distinct batch of measurements in the dataset:

- The baseline factor (1) is represented as the parameter $\alpha$. This influences all measurements $i$.
- The other factors (2)-(4) represent modifications to the baseline, each with its own parameter ($\beta$, $\gamma$, $\delta$). Any one of these influences only a subset of measurements, and whether a measurement is influenced by one of these factors can be coded as 0/1 integer values. In the long-form table above these are found in columns `KO`, `empty` and `complement`. For the purposes of formally defining our model, we refer to these instead as $t$, $u$ and $v$. The measurement with index $i$ therefore has corresponding values $t_i$, $u_i$, and $v_i$.
- The collection of batch effect factors (5) can be represented as an array of parameters, $\phi_{j_{i}}$, where $j \in \{1, 2, \ldots, n\}$ and $n$ is the number of batches. Each value of $\phi_{j_{i}}$ represents the common effect on all measurements in batch $j_{i}$ (the batch to which measurement $i$ belongs) due to being in that batch. It is a non-biological effect. A small sketch of this indexing follows below.
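To make the batch indexing concrete, here is a minimal sketch with invented values (three batches, five measurements); the same lookup appears as `f[batch[i]]` in the Stan model defined below:

```python
# Hypothetical illustration of phi_{j_i} indexing: each measurement i
# picks up the effect of the batch j_i to which it belongs.
f = np.array([0.10, -0.20, 0.05])   # one effect per batch (n = 3 here)
batch = np.array([1, 1, 2, 3, 3])   # batch j_i for each measurement i
print(f[batch - 1])                 # -> [ 0.1   0.1  -0.2   0.05  0.05]
```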
Our model will estimate the mean value and dispersion of each parameter's posterior distribution:
$$\alpha \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$
$$\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2)$$
$$\gamma \sim N(\mu_{\gamma}, \sigma_{\gamma}^2)$$
$$\delta \sim N(\mu_{\delta}, \sigma_{\delta}^2)$$

The prior distribution for $\mu_{\alpha}$ is centred on 6.5 logCFU, and we assume one log unit variance in this:

$$\mu_{\alpha} \sim \textrm{N}(6.5, 1)$$

For each standard deviation we choose a half-Normal distribution, centred at zero with a standard deviation of 1 log unit:

$$\sigma_\alpha \sim \textrm{N}(0, 1)$$
$$\sigma_\beta \sim \textrm{N}(0, 1)$$
$$\sigma_\gamma \sim \textrm{N}(0, 1)$$
$$\sigma_\delta \sim \textrm{N}(0, 1)$$

For the batch effects we place a half-Normal prior on the standard deviation $\sigma_{\phi_{j_{i}}}$, and a weakly-informative Cauchy prior, concentrated near zero, on the mean $\mu_{\phi_{j_{i}}}$:

$$j_i \in \{1, 2, \ldots, 8\}$$
$$\phi_{j_{i}} \sim N(\mu_{\phi_{j_{i}}}, \sigma_{\phi_{j_{i}}}^2)$$
$$\mu_{\phi_{j_{i}}} \sim \textrm{Cauchy}(0, 0.2)$$
$$\sigma_{\phi_{j_{i}}} \sim \textrm{Normal}(0, 1)$$
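Combining the baseline, treatment and batch terms gives the expected value $\hat{y}_i$ for each measurement, with the observed logCFU modelled as Normal about it (this restates, in notation, the `y_hat` and likelihood lines of the Stan model below):

$$\hat{y}_i = \alpha + \beta t_i + \gamma u_i + \delta v_i + \phi_{j_i}$$
$$y_i \sim N(\hat{y}_i, \sigma^2)$$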
We need to define `data`, `parameters` and `model` blocks for Stan.

`data` block:

- `N`: `int` - number of datapoints
- `K`: `int` - number of batches
- `t`: `vector[N]` - 0/1 values for `KO`
- `u`: `vector[N]` - 0/1 values for `empty`
- `v`: `vector[N]` - 0/1 values for `complement`
- `batch`: `int[N]` - batch (1:`K`) from which each measurement is drawn
- `y`: `vector[N]` - the output logCFU values

`parameters` block:

- `a`: `real` - estimated logCFU for `WT`
- `mu_a`: `real` - unconstrained value representing mean of distribution underlying `a`
- `sigma_a`: `real<lower=0>` - standard deviation of distribution underlying `a`
- `b`: `real` - estimated change in logCFU for `KO`
- `mu_b`: `real` - unconstrained value representing mean of distribution underlying `b`
- `sigma_b`: `real<lower=0>` - standard deviation of distribution underlying `b`
- `g`: `real` - estimated change in logCFU for `empty`
- `mu_g`: `real` - unconstrained value representing mean of distribution underlying `g`
- `sigma_g`: `real<lower=0>` - standard deviation of distribution underlying `g`
- `d`: `real` - estimated change in logCFU for `complement`
- `mu_d`: `real` - unconstrained value representing mean of distribution underlying `d`
- `sigma_d`: `real<lower=0>` - standard deviation of distribution underlying `d`
- `f`: `vector[K]` - estimated changes in logCFU due to each batch effect
- `mu_f`: `real` - estimated mean change due to batch effects underlying `f`
- `sigma_f`: `real<lower=0>` - standard deviation of distribution underlying `f`
- `sigma`: `real<lower=0>` - irreducible error in experiment/model

`transformed parameters` block:

- `y_hat`: `vector[N]` - linear relationship describing $\hat{y}$, our estimate of logCFU, subject to variance `sigma`:

`y_hat[i] = a + b * t[i] + g * u[i] + d * v[i] + f[batch[i]]`
In [11]:
# define Stan model
treatment_model = """
data {
  int<lower=0> N;                  // number of datapoints
  int<lower=0> K;                  // number of batches
  vector[N] t;                     // 0/1: KO
  vector[N] u;                     // 0/1: empty
  vector[N] v;                     // 0/1: complement
  vector[N] y;                     // logCFU measurements
  int<lower=1, upper=K> batch[N];  // batch of each measurement
}
parameters {
  real a;                          // baseline WT logCFU
  real b;                          // change due to KO
  real g;                          // change due to empty
  real d;                          // change due to complement
  vector[K] f;                     // batch effects
  real mu_a;
  real mu_b;
  real mu_g;
  real mu_d;
  real mu_f;
  real<lower=0> sigma_a;
  real<lower=0> sigma_b;
  real<lower=0> sigma_g;
  real<lower=0> sigma_d;
  real<lower=0> sigma_f;
  real<lower=0> sigma;             // irreducible error
}
transformed parameters {
  vector[N] y_hat;
  for (i in 1:N)
    y_hat[i] = a + b * t[i] + g * u[i] + d * v[i] + f[batch[i]];
}
model {
  sigma_a ~ normal(0, 1);
  mu_a ~ normal(6.5, 1);
  a ~ normal(mu_a, sigma_a);
  sigma_b ~ normal(0, 1);
  b ~ normal(mu_b, sigma_b);
  sigma_g ~ normal(0, 1);
  g ~ normal(mu_g, sigma_g);
  sigma_d ~ normal(0, 1);
  d ~ normal(mu_d, sigma_d);
  sigma_f ~ normal(0, 1);
  mu_f ~ cauchy(0, 0.2);
  f ~ normal(mu_f, sigma_f);
  y ~ normal(y_hat, sigma);        // likelihood
}
"""
In [12]:
# function to run pyStan fit on dataset
def run_fit(df, model, seed=987654321):
    """Run pyStan fit on passed dataset.

    Returns pyStan fit of the passed model on the passed
    dataframe. The seed argument ensures repeatability
    for the purpose of demonstration - change to explore
    stability of the result.
    """
    batches = df['batch'].unique()
    nbatches = len(batches)
    # Set variables into a dictionary for passing to Stan
    data_dict = {'N': len(df),
                 'K': nbatches,
                 't': df['KO'],
                 'u': df['empty'],
                 'v': df['complement'],
                 'y': df['logCFU'],
                 'batch': df['batch']}
    model = pystan.StanModel(model_code=model)  # compile
    fit = model.sampling(data=data_dict,        # fit
                         iter=5000, seed=seed)
    return fit
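One design note: `run_fit` recompiles the Stan model on every call, and compilation is the slow step. A minimal variant (a sketch, not the code used here) compiles once and reuses the `StanModel` object for both tissues:

```python
# Compile once, sample many times; sampling reuses the compiled model.
compiled_model = pystan.StanModel(model_code=treatment_model)
# leaf_fit = compiled_model.sampling(data=leaf_data_dict, iter=5000,
#                                    seed=987654321)
```

Here `leaf_data_dict` stands for a data dictionary built as in `run_fit` above.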
In [13]:
# fit the model to leaf tissue
leaf_fit = run_fit(leafdata, treatment_model)
In [14]:
# plot parameter estimates
leafdata = az.from_pystan(posterior=leaf_fit)
coords = {"f_dim_0": [0, 1]}
az.plot_trace(leafdata,
              var_names=['a', 'b', 'g', 'd', 'f'],
              coords=coords,
              combined=True,
              figsize=(10, 12));
There are several encouraging features of the fitted parameter estimates.

With a forest plot, we can see whether the credible intervals of the sampled variables include zero (the thick line represents the 50% CI; the thin line the 94% CI):
In [15]:
az.plot_forest(leafdata,
               var_names=['a', 'b', 'g', 'd', 'f'],
               combined=True,
               figsize=(10, 6),
               rope=(0, 0));
The estimated CIs for $\beta$ are entirely negative and do not include zero. We can conclude that the effect of deleting etpD is to reduce logCFU (median value ≈-0.32) with respect to WT.

All estimated CIs for $\alpha$ and $\delta$ are entirely positive and do not include zero. We can conclude that the "baseline" WT logCFU is positive (median value ≈6.37), and that the effect on logCFU of complementing etpD, with respect to the empty vector in a knockout background, is also positive (median value ≈0.4).

By contrast, both CIs for $\gamma$ include zero, and we can conclude that introducing the empty plasmid into a knockout background leaves logCFU essentially unaffected (the median value ≈-0.16 implies a small negative effect, but not a "significant" departure from zero).
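The forest plot can also be backed with numbers; this sketch (assuming a recent ArviZ, where `summary` accepts `hdi_prob`; 0.94 matches the thin lines above) extracts the interval estimates programmatically:

```python
# Numeric posterior summaries for the headline parameters; the
# hdi_3%/hdi_97% columns correspond to the 94% intervals plotted above.
az.summary(leafdata, var_names=['a', 'b', 'g', 'd'], hdi_prob=0.94)
```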
In [16]:
# summarise results
leaf_fit
Out[16]:
The model fit appears to have converged (`Rhat=1.0` for all estimates), and the estimated standard error of the mean for any estimated parameter is much smaller than the estimated mean. We can be confident that the model fit is stable.
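Rather than reading Rhat off the printed summary, the diagnostic can also be computed directly; a minimal sketch with ArviZ:

```python
# Per-parameter Rhat; values close to 1.0 indicate the chains have
# converged to a common distribution.
az.rhat(leafdata)
```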
In [17]:
# fit model to root tissue
root_fit = run_fit(rootdata, treatment_model)
In [18]:
# plot parameter estimates
rootdata = az.from_pystan(posterior=root_fit)
coords = {"f_dim_0": [3, 4, 5]}
az.plot_trace(rootdata,
              var_names=['a', 'b', 'g', 'd', 'f'],
              coords=coords,
              combined=True,
              figsize=(10, 14));
As with the fit to leaf data, there is good reason to believe that the parameter estimates are reliable.

With a forest plot, we can again see whether the credible intervals of the sampled variables include zero (as before, the thick line represents the 50% CI; the thin line the 94% CI):
In [19]:
az.plot_forest(rootdata,
               var_names=['a', 'b', 'g', 'd', 'f'],
               combined=True,
               figsize=(10, 6),
               rope=(0, 0));
The estimated CIs for $\beta$ are again entirely negative and do not include zero. We can conclude that the effect of deleting etpD is to reduce logCFU (median value ≈-0.31) with respect to WT. This is consistent with the estimated values seen in the leaf data.

Both estimated CIs for $\alpha$ are entirely positive and do not include zero. We can conclude that the "baseline" WT logCFU is positive (median value ≈6.56), again consistent with the estimated values seen in the leaf data.

Both CIs for $\gamma$ include zero, so we can again conclude that introducing the empty plasmid into a knockout background leaves logCFU essentially unaffected (the median value ≈-0.07 implies a small negative effect, but not a "significant" departure from zero), as with the leaf data.

However, the effect on logCFU of complementing etpD, with respect to the empty vector in a knockout background, is not the same as for the leaf data. The 50% CI is entirely negative, but the 94% CI includes zero. There is stochasticity in the Bayesian model fitting, and in other runs the 50% CI is also seen to include zero. As a result, we cannot confidently say that our estimate for $\delta$ is consistently negative. Moreover, the median value of $\delta$ is ≈-0.06, of the same order as that for $\gamma$, which we consider to have no significant effect on logCFU. We may therefore consider the fitted value of $\delta$ to be either not different from zero (at the 94% CI), or slightly negative with a non-significant effect size.
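One way to quantify this directly (a sketch, not part of the original analysis) is to compute the posterior probability that $\delta$ is negative from the MCMC samples themselves:

```python
# Posterior probability that d (the complementation effect) is
# negative, from the pystan 2 fit object's extracted samples.
d_samples = root_fit.extract()['d']
print("P(delta < 0) =", (d_samples < 0).mean())
```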
In [20]:
# summarise results
root_fit
Out[20]:
The model fit has again converged (`Rhat=1.0` for all estimates), and once more the estimated standard error of the mean for any estimated parameter is much smaller than the estimated mean (except for $\mu_{\phi}$/`mu_f`). We can be confident that the model fit is stable.
In [21]:
sns.set_style("darkgrid")
sns.set_context('notebook')
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['ps.useafm'] = True
plt.rcParams['pdf.fonttype'] = 42
In [22]:
def extract_fit_vars(fit):
    """Return a pandas dataframe summarising estimated parameters."""
    dfs = []
    for param in ['b', 'g', 'd']:
        df = tools.extract_fit_variable_summary(fit, param)
        df.columns = ["mean", "sem", "median", "2.5pc", "97.5pc", "25pc", "75pc"]
        df.index = [param]
        dfs.append(df)
    return pd.concat(dfs)


def plot_params(df, ax):
    """Plot the parameter estimates as ball and stick."""
    vals = df['median']  # marker position
    cilo = df['25pc']    # lower end of 50% CI
    cihi = df['75pc']    # upper end of 50% CI
    # Add markers to plot
    ax.scatter(range(len(df)), vals, marker='_', s=100,
               c='#000000', lw=1)
    for idx, val, vlo, vhi in zip(range(len(df)), vals, cilo, cihi):
        ax.plot([idx, idx], [vlo, vhi], color='k', lw=2)
    # Format plot (three ticks: one per parameter b, g, d)
    sns.despine(bottom=True)
    ax.set_xticks([0, 1, 2])
    ax.axhline(color='k', lw=1)
    ax.set_ylabel('logCFU')
    ax.set_xticklabels(['Sakai\n$\\Delta etpD$', '$\\Delta etpD$\npSE380',
                        '$\\Delta etpD$\npSE_EtpD'])


# Set default seaborn style (plain, black and white)
sns.set_style("white")
# Construct the axes for drawing
fig = plt.figure(figsize=(4, 12))
bac_ax = fig.add_subplot(3, 1, 1)
leaf_ax = fig.add_subplot(3, 1, 2)
root_ax = fig.add_subplot(3, 1, 3)
# Render ball-and-stick plots for each fit
plot_params(extract_fit_vars(root_fit), root_ax)
plot_params(extract_fit_vars(leaf_fit), leaf_ax)
# Plot part 3A: pV41 and BAC2B5 results
bacdata = pd.read_csv(os.path.join('..', 'data', 'etpD', 'pV41_BAC2B5.csv'))  # load data
bacdata = pd.melt(bacdata, value_vars=['pV41', 'BAC2B5 (etp)'])  # wide to long
bacdata = bacdata.dropna(how='any')  # drop NaN values
bacdata['logCFU'] = np.log10(bacdata['value'])  # take log transform
sns.boxplot(x='variable', y='logCFU', data=bacdata,  # plot boxplot
            color="#ffffff", ax=bac_ax)
bac_ax.plot([0, 1], [6.9, 6.9], linewidth=2, color='#000000')  # plot asterisk and bar
bac_ax.plot([0, 0], [6.8, 7.0], linewidth=2, color='#000000')
bac_ax.plot([1, 1], [6.8, 7.0], linewidth=2, color='#000000')
bac_ax.text(0.5, 6.9, '*', fontsize=20)
bac_ax.set_xlabel('')  # remove x-axis label
for idx, artist in enumerate(bac_ax.artists):  # make boxplot lines black
    artist.set_edgecolor('#000000')
    for jdx in range(idx * 6, idx * 6 + 6):
        line = bac_ax.lines[jdx]
        line.set_color('#000000')
        line.set_mfc('#000000')
        line.set_mec('#000000')
# Add subplot labels
bac_ax.text(-1.25, 7, 'A', fontsize=20)
leaf_ax.text(-1.25, 0.45, 'B', fontsize=20)
root_ax.text(-1.25, 0.1, 'C', fontsize=20)
# Relabel y-axes and rotate x-ticks
for ax in (bac_ax, leaf_ax, root_ax):
    ax.set_ylabel("$\\log_{10}$ (cfu/ml/g)")
    for lbl in ax.get_xticklabels():
        lbl.set_rotation(45)
# Modify margins
plt.subplots_adjust(left=0, right=0.9, top=0.8, bottom=0.2)
plt.tight_layout()
# Save figure in useful formats (.tif is not available)
plt.savefig('figure4.png', dpi=600)
plt.savefig('figure4.pdf');