This notebook describes fitting of a Bayesian hierarchical model of the effects of control (growth) and treatment (passage) on individual genes from E. coli DH10B (carrier) and Sakai (BAC load), to data obtained using a multi-E. coli microarray.
Much of the code for the visualisation, analysis and data manipulation of the fitting results is found in the associated Python module tools.py, which should also be present in this directory.
The model fit can be downloaded directly from the Zenodo repository, for use in this notebook:
A code cell in the notebook below will attempt to make this download for you if the file does not already exist.
The experiment involves measuring changes in microarray probe intensity before and after a pool of bacteria is subjected to one of two processes:
In a single replicate, the microarray is exposed to genomic DNA extracted from the pool (i) before the experiment begins, and (ii) after the experiment concludes. Three replicates are performed.
The pool of bacteria comprises E. coli DH10B as a carrier organism. The pool is heterogeneous, in that individual cells also contain BACs encoding random stretches of the E. coli Sakai chromosome. We therefore expect carrier organism genes to be unaffected by passage (treatment), and for any effects to be detectable only for genes that originate from E. coli Sakai.
If the biological function conferring an advantage during passage is encoded by a suite of coregulated genes in an operon, we might expect all members of this suite to show evidence of enrichment after passage. It is likely that clusters of enrichment for operons or regulons post-passage will be seen in the results. Although we are not accounting for this clustering or association by operon directly in this model, it is a possible additional hierarchical term in future iterations of the model.
We should expect there to be a selective burden to the carriage of additional non-functional gDNA as BACs, so we might also anticipate a slightly negative effect on recovery under control conditions.
In [1]:
%pylab inline
import os
import pickle
import warnings; warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import pystan
import scipy
import seaborn as sns; sns.set_context('notebook')
from Bio import SeqIO
import tools
We assume that each array probe $i$ (array probes take unique values of $i$ in the context of the entire experiment; that is, $i$ is unique for probe X replicate X treatment) measures hybridisation of genomic DNA (gDNA) in the sample that corresponds to a single gene $j[i]$, and that the measured intensity of probe $i$ relates directly to the corresponding amount of gDNA in the sample. There may be multiple probes relating to a single gene, so it is possible that $j[p] = j[q], p \neq q$.
We define the (input) measurement of a probe before an experiment as $x_i$, and the (output) measurement of that probe after the experiment as $y_i$. We assume that the measurement of each probe is subject to random experimental/measurement error that is normally-distributed with mean zero and variance $\sigma_y^2$. The actual quantity of DNA measured after the experiment can then be represented as $\hat{y}$, and the irreducible error in this experiment as $\epsilon$ ($\epsilon_i$ serves to include the irreducible errors in measuring both $x_i$ and $y_i$; all errors are assumed to be Normal, so their linear combinations are also Normal).
$$y_i = \hat{y_i} + \epsilon_i$$
$$\epsilon_i \sim N(0, \sigma_y^2) \implies y_i \sim N(\hat{y_i}, \sigma_y^2)$$
The relationship between the input and output DNA quantities measured by a single probe can be represented as $\hat{y_i} = f(x_i)$. That is to say, the measured input DNA quantity $x_i$ is a predictor of the output quantity. This relationship will be modelled as the sum of two linear effects:
$$\textrm{control effect} = \alpha + \beta x$$
$$\textrm{treatment effect} = \gamma + \delta x$$
$$\hat{y_i} = \textrm{control effect}(x_i) + \textrm{treatment effect}(x_i) = \alpha + \beta x_i + \gamma + \delta x_i$$
As these are linear effects, we have intercept/offset parameters ($\alpha$, $\gamma$) and gradient/slope parameters ($\beta$, $\delta$).
As formulated above, the four parameters would be identical for all probes, but we are interested in estimating the control and treatment effects for individual genes, so we require a set of parameters for each gene (as it corresponds to probe $i$): $j[i]$. This is appropriate for the effects of growth/treatment that are specific to the levels of a single gene: $\beta$ and $\delta$.
The remaining parameters $\alpha$ and $\gamma$, the offsets from zero for each probe, could be considered to be constant across each replicate of both control and treatment experiments. They are possibly more realistically considered to be different for each array (i.e. each combination of replicate and treatment).
As a result, we estimate $\alpha_{k[i]}$, $\beta_{j[i]}$, $\gamma_{k[i]}$, $\delta_{j[i]}$, and the relationship for each probe is modelled as:
$$\hat{y_i} = \textrm{control effect}_{j[i]}(x_i) + \textrm{treatment effect}_{j[i]}(x_i) = \alpha_{k[i]} + \beta_{j[i]} x_i + \gamma_{k[i]} + \delta_{j[i]} x_i$$
The parameters $\alpha_{k[i]}$, $\beta_{j[i]}$, $\gamma_{k[i]}$, $\delta_{j[i]}$ (and $\epsilon_i$) are to be estimated by the model fit.
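As a minimal sketch of this relationship, the NumPy snippet below evaluates $\hat{y_i}$ for a handful of made-up probes. It follows the Stan model defined later in this notebook in multiplying the treatment terms by a 0/1 indicator $t_i$ (0 for control, 1 for treatment). The function name `linear_predictor` and all numerical values are assumptions for illustration only.

```python
import numpy as np


def linear_predictor(x, t, tag, array, a, b, g, d):
    """Return y_hat per probe, using zero-based gene (tag) and array indices."""
    return a[array] + b[tag] * x + t * (g[array] + d[tag] * x)


# Toy values (hypothetical): 4 probes, 2 genes, 2 arrays
x = np.array([1.0, 2.0, 1.0, 2.0])   # input (log) intensity
t = np.array([0, 0, 1, 1])           # 0 = control, 1 = treatment
tag = np.array([0, 1, 0, 1])         # gene index j[i]
array = np.array([0, 0, 1, 1])       # array index k[i]
a = np.array([0.5, 0.4])             # per-array control offsets
b = np.array([1.0, 0.9])             # per-gene control slopes
g = np.array([0.1, 0.2])             # per-array treatment offsets
d = np.array([0.0, 0.5])             # per-gene treatment slopes

y_hat = linear_predictor(x, t, tag, array, a, b, g, d)
print(y_hat)
```

Probes on control arrays receive only the $\alpha + \beta x$ terms, while probes on treatment arrays also receive $\gamma + \delta x$.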
This pooling ensures that our fits are not completely pooled as a single estimate $\alpha_{k[i]} = \alpha$, which would imply that all parameter estimates are constant for all genes/arrays, a situation that would be completely uninformative for our goal to identify gene-level effects, and which would underfit our model. It also means that our estimates are not completely unpooled, which would allow all parameter estimates to vary independently. That situation would be equivalent to simultaneously fitting independent linear relationships to each gene, and so risk overfitting our model to the measured data.
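The difference between complete pooling, no pooling and partial pooling can be illustrated with a toy example. The sketch below is not the model fitted in this notebook: it swaps in a simple Normal shrinkage formula with assumed, fixed variances (`sigma2_within` and `sigma2_between`, both hypothetical) purely to show how partially-pooled estimates sit between the two extremes.

```python
from statistics import mean

# Hypothetical slope measurements for three genes (unequal numbers of probes)
groups = {
    "geneA": [0.9, 1.1, 1.0, 1.0],
    "geneB": [1.6],
    "geneC": [0.8, 1.0],
}

all_values = [v for values in groups.values() for v in values]
pooled = mean(all_values)                                        # complete pooling
unpooled = {name: mean(values) for name, values in groups.items()}  # no pooling

# Assumed (not estimated) variances, for illustration only
sigma2_within, sigma2_between = 0.25, 0.05

partial = {}
for name, values in groups.items():
    n = len(values)
    # Precision-weighted average of the gene mean and the overall mean
    w = (n / sigma2_within) / (n / sigma2_within + 1 / sigma2_between)
    partial[name] = w * unpooled[name] + (1 - w) * pooled

for name in groups:
    print(f"{name}: unpooled={unpooled[name]:.3f} "
          f"partial={partial[name]:.3f} pooled={pooled:.3f}")
```

Genes with fewer probes (here `geneB`) are pulled more strongly towards the overall mean, which is the behaviour that guards against overfitting sparse data.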
For each parameter's prior we choose a Cauchy distribution, because it has fat tails and infinite variance. This does not constrain outlying and extreme values (those we are interested in) so much as other distributions (e.g. Normal or Student's t):
$$\alpha_{k[i]} \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha})$$
$$\beta_{j[i]} \sim Cauchy(\mu_{\beta}, \sigma_{\beta})$$
$$\gamma_{k[i]} \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma})$$
$$\delta_{j[i]} \sim Cauchy(\mu_{\delta}, \sigma_{\delta})$$
Each parameter's prior distribution requires a fit of both its location ($\mu$) and scale ($\sigma$), and these also become parameters in our model. The locations are free to vary, but we assume that the scale of each parameter's prior can be drawn from a Uniform distribution on the range (0, 100):
$$\sigma_{\alpha} \sim U(0, 100)$$
$$\sigma_{\beta} \sim U(0, 100)$$
$$\sigma_{\gamma} \sim U(0, 100)$$
$$\sigma_{\delta} \sim U(0, 100)$$
In the cells below we load in the data to be fit, and define useful variables for inspecting/analysing the data later:
locus_tags: the unique locus tags represented in the dataset
ntags: the number of unique locus tags
arrays: the arrays (combinations of replicate X treatment) used in the experiment
narrays: the number of arrays used
outdir: path to the directory in which to place model fit output
outfile: path to the model fit output file (pickled dataframe)
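Returning briefly to the choice of Cauchy priors described earlier: the sketch below compares upper-tail probabilities of a standard Normal and a standard Cauchy distribution, using only closed-form expressions from the Python standard library. The cut-off values are arbitrary choices for illustration, and the helper names are hypothetical.

```python
import math


def normal_sf(x):
    """Upper-tail probability P(X > x) for a standard Normal."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def cauchy_sf(x):
    """Upper-tail probability P(X > x) for a standard Cauchy."""
    return 0.5 - math.atan(x) / math.pi


for cutoff in (2, 5, 10):
    print(f"cutoff={cutoff}: Normal {normal_sf(cutoff):.2e}, "
          f"Cauchy {cauchy_sf(cutoff):.2e}")
```

The Cauchy tail probability falls off far more slowly than the Normal one, so extreme parameter values are penalised much less by the prior.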
In [2]:
# load clean, normalised, indexed data
data = pd.read_csv(os.path.join("datasets", "normalised_array_data.tab"), sep="\t") # full dataset
#data = pd.read_csv("datasets/reduced_locus_data.tab", sep="\t") # reduced dataset
#data = data[:100] # uncomment this for debugging
# useful values
locus_tags = data['locus_tag'].unique()
ntags = len(locus_tags)
arrays = data['repXtrt'].unique()
narrays = len(arrays)
In [3]:
# Create output directory and filename to hold the fitted model
outdir = "model_fits"
os.makedirs(outdir, exist_ok=True)
outfile = os.path.join(outdir, 'full_model_fit.pkl')
We need to define data, parameters and our model for Stan.
In [4]:
# define partially-pooled (hierarchical) stan model
treatment_model = """
data {
  int<lower=0> N;
  int<lower=0> J;
  int<lower=0> K;
  int<lower=1, upper=J> tag[N];
  int<lower=1, upper=K> array[N];
  vector[N] t;
  vector[N] x;
  vector[N] y;
}
parameters {
  vector[K] a;
  vector[J] b;
  vector[K] g;
  vector[J] d;
  real mu_a;
  real mu_b;
  real mu_g;
  real mu_d;
  real<lower=0> sigma;
  real<lower=0,upper=100> sigma_a;
  real<lower=0,upper=100> sigma_b;
  real<lower=0,upper=100> sigma_g;
  real<lower=0,upper=100> sigma_d;
}
transformed parameters{
  vector[N] y_hat;
  for (i in 1:N)
    y_hat[i] = a[array[i]] + b[tag[i]] * x[i] + g[array[i]] * t[i] + d[tag[i]] * t[i] * x[i];
}
model {
  sigma_a ~ uniform(0, 100);
  a ~ cauchy(mu_a, sigma_a);
  sigma_b ~ uniform(0, 100);
  b ~ cauchy(mu_b, sigma_b);
  sigma_g ~ uniform(0, 100);
  g ~ cauchy(mu_g, sigma_g);
  sigma_d ~ uniform(0, 100);
  d ~ cauchy(mu_d, sigma_d);
  y ~ normal(y_hat, sigma);
}
"""
In [5]:
# relate python variables to stan variables
treatment_data_dict = {'N': len(data),
'J': ntags,
'K': narrays,
'tag': data['locus_tag_index'] + 1,
'array': data['repXtrt_index'] + 1,
't': data['treatment'],
'x': data['log_input'],
'y': data['log_output']}
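Stan indexes vectors from 1, which is why 1 is added to `locus_tag_index` and `repXtrt_index` above (those columns are assumed here to be zero-based). The sketch below shows one way such zero-based index columns could be derived with `pandas.factorize`. The toy dataframe and its locus tags are hypothetical; the real index columns were prepared before the data was loaded into this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {"locus_tag": ["ECs0001", "ECs0002", "ECs0001", "ECDH10B_0001"]}
)

# factorize() returns zero-based integer codes and the unique labels,
# numbered in order of first appearance
codes, uniques = pd.factorize(toy["locus_tag"])
toy["locus_tag_index"] = codes

# Stan expects indices in the range 1..J
stan_tag = toy["locus_tag_index"] + 1
print(stan_tag.tolist())   # [1, 2, 1, 3]
```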
It may be quicker to download the data from Zenodo using the button below, than to use cell (3), but be sure to place the downloaded file in the correct location as specified in the variable outfile.
In [6]:
# (1) USE THIS CELL TO RUN THE STAN FIT - takes a few hours on my laptop
#treatment_fit = pystan.stan(model_code=treatment_model,
# data=treatment_data_dict,
# iter=1000, chains=2,
# seed=tools.SEED)
In [7]:
# (2) USE THIS CELL TO SAVE THE STAN FIT TO A PICKLE FILE
#unpermutedChains = treatment_fit.extract()
#unpermutedChains_df = pd.DataFrame([dict(unpermutedChains)])
#pickle.dump(unpermutedChains_df, open(outfile, 'wb'))
In [8]:
# (3) USE THIS CELL TO DOWNLOAD THE STAN FIT FROM ZENODO: DOI:10.5281/zenodo.269638
# The file will not be downloaded if it already exists locally.
# The file is 0.5GB in size, so may take some time to download
import urllib.request
if not os.path.isfile(outfile):
    zenodo_url = "https://zenodo.org/record/269638/files/full_model_fit.pkl"
    response = urllib.request.urlretrieve(zenodo_url, outfile)
In [9]:
# (4) USE THIS CELL TO LOAD THE STAN FIT FROM A PICKLE FILE
# Import the previously-fit model
treatment_fit = pd.read_pickle(outfile)
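Because of the way it is built in cell (2), the pickled object is a one-row dataframe in which each column holds the full array of posterior draws for one Stan variable. The sketch below mimics that structure with random numbers in place of real draws, round-trips it through pickle in memory, and shows how a per-gene median could be recovered. The shapes, values and variable names are made up for illustration.

```python
import io
import pickle

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fake_chains = {
    "b": rng.normal(1.0, 0.1, size=(200, 3)),   # 200 draws x 3 genes
    "sigma": rng.normal(0.5, 0.01, size=200),   # 200 draws of a scalar
}

# Same construction as cell (2): a single row, one column per variable
fake_df = pd.DataFrame([dict(fake_chains)])

# Round-trip through pickle (in memory here, rather than via a file)
buffer = io.BytesIO()
pickle.dump(fake_df, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

b_draws = loaded["b"][0]                 # array of draws for variable b
b_median = np.median(b_draws, axis=0)    # one median per gene
print(b_draws.shape, b_median.shape)
```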
In [10]:
# Get summary data for parameter estimates
# use 'fit' for the model fit directly, and 'df' for loaded pickled data
(estimates_by_probe, estimates) = tools.extract_variable_summaries(treatment_fit, 'df',
['a', 'b', 'g', 'd'],
[arrays, locus_tags, arrays, locus_tags],
data)
In [11]:
# Inspect the data, one row per experiment probe
estimates_by_probe.head()
Out[11]:
In [12]:
# Inspect the data, one row per locus tag
estimates.head()
Out[12]:
In [13]:
# Separate estimates for Sakai and DH10B into two different dataframes
sakai_estimates = tools.split_estimates(estimates, 'sakai')
dh10b_estimates = tools.split_estimates(estimates, 'dh10b')
In the cells below, we visualise the median fitted estimates for each of the parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ as boxplots:
In [14]:
# Visualise median values for parameter estimates of alpha and gamma
tools.boxplot_medians(estimates_by_probe, ['a', 'g'])
In [15]:
# Visualise median values for parameter estimates of beta and delta
tools.boxplot_medians(estimates, ['b', 'd'])
In [16]:
# Visualise median values for DH10B parameter estimates
tools.boxplot_medians(dh10b_estimates, ['b', 'd'])
It is clear that the median parameter estimates for DH10B are extremely restricted in their range:
In [17]:
# Visualise median values for Sakai parameter estimates
tools.boxplot_medians(sakai_estimates, ['b', 'd'])
By contrast to the results for DH10B, the median parameter estimates for Sakai have many large value outliers, though the bulk of estimates are close to the values seen for DH10B:
We can visualise the relationships between parameter estimates for control and treatment effects in a scatterplot of control effect ($\beta$) against treatment effect ($\delta$) for each locus tag. This plot can be considered in four quadrants, which are delineated by the bulk of the data which describes orthogonal effects of locus tags on growth and treatment:
In [18]:
# Plot estimated parameters for treatment effects against control effects for Sakai
fig, ax = plt.subplots(1, 1, figsize=(6,6))
ax.scatter(sakai_estimates['d_median'], sakai_estimates['b_median'], alpha=0.2)
ax.set_xlabel('delta (median)')
ax.set_ylabel('beta (median)');
We use a 50% credibility interval to determine whether the effect of a gene on passage is likely to be positive. Under this assumption, we identify locus tags for which the median estimate of $\delta$ is positive, and the central 50% of the parameter estimates for $\delta$ (the 50% credibility interval) does not include zero. We label these locus tags as trt_pos in the dataframe.
Likewise, we use a 50% credibility interval to determine whether the effect of a gene on surviving growth (control) is positive. If the 50% CI for $\beta$ does not include the 97.5 percentile for all estimates of $\beta$ (as an upper estimate of overall dataset centrality for this dataset), and the median value of $\beta$ is greater than this value, we consider that the effect of the gene on surviving growth conditions is positive. We label these locus tags as ctl_pos in the dataframe.
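The labelling itself is performed by `tools.label_positive_effects`, whose implementation is not shown in this notebook. As a rough illustration of the treatment criterion only, the sketch below applies the stated rule (median above zero, and central 50% interval excluding zero) to simulated posterior draws. The function name `is_trt_pos` and the simulated values are assumptions for this example.

```python
import numpy as np


def is_trt_pos(draws):
    """True if the median is positive and the 25th-75th percentile
    range of the draws does not include zero."""
    lower, median, upper = np.percentile(draws, [25, 50, 75])
    return bool(median > 0 and not (lower <= 0 <= upper))


rng = np.random.default_rng(1)
clear_effect = rng.normal(0.5, 0.1, size=1000)   # draws well above zero
no_effect = rng.normal(0.0, 0.1, size=1000)      # draws centred on zero

print(is_trt_pos(clear_effect), is_trt_pos(no_effect))
```

The control criterion has the same shape, with the 97.5 percentile of all $\beta$ estimates taking the place of zero.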
In [19]:
# Label locus tags with positive effects for control and treatment
sakai_estimates = tools.label_positive_effects(sakai_estimates)
We can count the number of locus_tags in each of the groups:
In [20]:
# Count locus tags in each of the positive groups
counts = [sum(sakai_estimates[col]) for col in ('trt_pos', 'ctl_pos', 'combined')]
print("treatment positive: {0}\ncontrol positive: {1}\nboth: {2}".format(*counts))
which indicates, with these assumptions, that:
(this confirms our observation in the earlier scatterplot)
We can show the estimated effects, and our confidence in those estimates, on a rough representation of the genome by plotting those values for each locus tag, sorted in order on the genome.
In the plots that follow, parameter estimates for each locus tag are rendered as points (the median estimate), with the 50% credibility interval for the estimate indicated as a vertical line. If the 50% CI includes a threshold value - the median value for the bulk parameter estimate of $\beta$ or $\delta$ - then we consider that there is not strong evidence of an effect on survival due to that gene (compared to the bulk), and the interval is coloured blue.
If the interval does not include the corresponding threshold value, then it is coloured either green for a positive effect, or magenta for a negative effect.
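The colouring rule described above can be written as a small function. This is a sketch of the rule as stated in the text, not the code used by `tools.plot_parameter`; the function name and arguments are hypothetical.

```python
def interval_colour(lower, upper, threshold):
    """Colour for a 50% CI (lower, upper) relative to a threshold value."""
    if lower <= threshold <= upper:
        return "blue"       # interval includes the threshold: no strong evidence
    if lower > threshold:
        return "green"      # whole interval above the threshold: positive effect
    return "magenta"        # whole interval below the threshold: negative effect


print(interval_colour(0.9, 1.1, 1.0))   # blue
print(interval_colour(1.2, 1.4, 1.0))   # green
print(interval_colour(0.5, 0.8, 1.0))   # magenta
```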
We split the Sakai estimates into groups: one for the chromosome, and one for each plasmid pOSAK and pO157, on the basis of the locus tag prefixes, annotating them with their start position on the parent molecule.
In [21]:
sakai_chromosome = sakai_estimates.loc[sakai_estimates['locus_tag'].str.startswith('ECs')]
sakai_pOSAK = sakai_estimates.loc[sakai_estimates['locus_tag'].str.startswith('pOSAK1')]
sakai_pO157 = sakai_estimates.loc[(sakai_estimates['locus_tag'].str.startswith('pO157')) |
(sakai_estimates['locus_tag'].str.startswith('ECp'))]
In [22]:
# Sakai chromosome
sakai_chromosome_annotated = tools.annotate_locus_tags(sakai_chromosome,
os.path.join('..', 'data', 'Sakai',
'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_chromosome_annotated.sort_values('startpos', inplace=True)
#sakai_chromosome_annotated.head(15)
In [23]:
# pOSAK1
sakai_pOSAK_annotated = tools.annotate_locus_tags(sakai_pOSAK,
os.path.join('..', 'data', 'Sakai',
'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_pOSAK_annotated.sort_values('startpos', inplace=True)
#sakai_pOSAK_annotated.head(15)
In [24]:
# pO157
sakai_pO157_annotated = tools.annotate_locus_tags(sakai_pO157,
os.path.join('..', 'data', 'Sakai',
'GCF_000008865.1_ASM886v1_genomic.gbff'))
sakai_pO157_annotated.sort_values('startpos', inplace=True)
#sakai_pO157_annotated.head(15)
In [25]:
# Regions of interest
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1.3),
('SpLE1', 'ECs1299', 'ECs1410', 1.5),
('S-loop 225', 'ECs4325', 'ECs4341', 1.5),
('S-loop 231', 'ECs4379', 'ECs4387', 1.3)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
k, v0, v1, v2 in regions}
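The same pattern for building `annotations` recurs in several later cells with different regions and label heights. A small helper could reduce the repetition. The sketch below is one possible form: `build_annotations` is a hypothetical name, and `index_of` stands in for a lookup such as `tools.get_lt_index`, whose exact behaviour is not shown in this notebook.

```python
def build_annotations(regions, index_of):
    """Map each region name to (start index, end index, label height)."""
    return {name: (index_of(start), index_of(end), height)
            for name, start, end, height in regions}


# Toy lookup standing in for tools.get_lt_index (illustrative only)
order = ["ECs1276", "ECs1277", "ECs1288", "ECs1299"]
toy_regions = [("S-loop 71", "ECs1276", "ECs1288", 1.3)]

toy_annotations = build_annotations(toy_regions, order.index)
print(toy_annotations)   # {'S-loop 71': (0, 2, 1.3)}
```

With the notebook's own function, this might be called as `build_annotations(regions, lambda lt: tools.get_lt_index(lt, sakai_chromosome_annotated))`.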
In [26]:
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_chromosome_annotated['b_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai chromosome'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))
# Plot on the figure axes
tools.plot_parameter(sakai_chromosome_annotated, ax, 'b', beta_thresh, annotations=annotations);
In [27]:
# Regions of interest
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1),
('SpLE1', 'ECs1299', 'ECs1410', 1.8),
('S-loop 225', 'ECs4325', 'ECs4341', 1.8),
('S-loop 231', 'ECs4379', 'ECs4387', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
k, v0, v1, v2 in regions}
In [28]:
# Plot genome-wide estimates of delta for Sakai and mark values that don't include the median delta in 50% CI
delta_thresh = np.median(sakai_chromosome_annotated['d_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai chromosome'
plt.title("{0} [threshold: {1:.2f}]".format(title, delta_thresh))
tools.plot_parameter(sakai_chromosome_annotated, ax, 'd', delta_thresh, annotations=annotations)
In [29]:
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_pOSAK_annotated['b_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai plasmid pOSAK'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))
tools.plot_parameter(sakai_pOSAK_annotated, ax, 'b', beta_thresh)
In [30]:
# Plot genome-wide estimates of delta for Sakai and mark values that don't include the median delta in 50% CI
delta_thresh = np.median(sakai_pOSAK_annotated['d_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai plasmid pOSAK'
plt.title("{0} [threshold: {1:.2f}]".format(title, delta_thresh))
tools.plot_parameter(sakai_pOSAK_annotated, ax, 'd', delta_thresh)
In [31]:
# Regions of interest
regions = [('StcE', 'pO157p01', 'pO157p01', 0.98),
('etp T2SS', 'pO157p02', 'pO157p14', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
k, v0, v1, v2 in regions}
In [32]:
# Plot genome-wide estimates of beta for Sakai and mark values that don't include the median beta in 50% CI
beta_thresh = np.median(sakai_pO157_annotated['b_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for Sakai plasmid pO157'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))
tools.plot_parameter(sakai_pO157_annotated, ax, 'b', beta_thresh, annotations=annotations)
In [33]:
# Regions of interest
regions = [('StcE', 'pO157p01', 'pO157p01', 0.13),
('etp T2SS', 'pO157p02', 'pO157p14', 0.19)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
k, v0, v1, v2 in regions}
In [34]:
# Plot genome-wide estimates of delta for Sakai and mark values that don't include the median delta in 50% CI
delta_thresh = np.median(sakai_pO157_annotated['d_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for Sakai plasmid pO157'
plt.title("{0} [threshold: {1:.2f}]".format(title, delta_thresh))
tools.plot_parameter(sakai_pO157_annotated, ax, 'd', delta_thresh, annotations=annotations)
We plot similar representations for the DH10B isolate as a control, and see that all parameter estimates for this isolate's locus tags are very similar.
In [35]:
# Annotate the DH10B results
dh10b_annotated = tools.annotate_locus_tags(dh10b_estimates,
os.path.join('..', 'data', 'DH10B',
'GCF_000019425.1_ASM1942v1_genomic.gbff'))
dh10b_annotated.sort_values('startpos', inplace=True)
In [36]:
# Plot genome-wide estimates of beta for DH10B
beta_thresh = np.median(dh10b_estimates['b_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of beta for DH10B'
plt.title("{0} [threshold: {1:.2f}]".format(title, beta_thresh))
tools.plot_parameter(dh10b_estimates, ax, 'b', beta_thresh)
In [37]:
# Plot genome-wide estimates of delta for DH10B
delta_thresh = np.median(dh10b_estimates['d_median'])
# Create figure with title to hold the plotted axis
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 1, 1)
title = 'Estimates of delta for DH10B'
plt.title("{0} [threshold: {1:.2f}]".format(title, delta_thresh))
tools.plot_parameter(dh10b_estimates, ax, 'd', delta_thresh)
In [38]:
# Generate list of candidates with a positive effect under control or treatment.
candidates = sakai_estimates[sakai_estimates['ctl_pos'] | sakai_estimates['trt_pos']]
candidates = candidates[['locus_tag',
'b_median', 'ctl_pos',
'd_median', 'trt_pos']].sort_values(['ctl_pos', 'trt_pos', 'locus_tag'])
candidates.shape
Out[38]:
In [39]:
# Inspect the data
candidates.head()
Out[39]:
We restrict this set to those genes that only have a credible effect on treatment/passage, identifying 115 genes with positive $\delta$ where the 50% CI does not include zero:
In [40]:
# Restrict candidates only to those with an effect on treatment/passage.
trt_only_positive = candidates.loc[candidates['trt_pos'] & ~candidates['ctl_pos']]
trt_only_positive.shape
Out[40]:
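The selection above combines two boolean columns with `&` and `~`. A toy example, with hypothetical locus tags and flags, shows the effect:

```python
import pandas as pd

toy = pd.DataFrame({
    "locus_tag": ["lt1", "lt2", "lt3", "lt4"],
    "trt_pos": [True, True, False, False],
    "ctl_pos": [False, True, True, False],
})

# Keep rows that are positive under treatment but not under control
trt_only = toy.loc[toy["trt_pos"] & ~toy["ctl_pos"]]
print(trt_only["locus_tag"].tolist())   # ['lt1']
```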
We add a column with the functional annotation of each of the candidates that appear to have a positive selective effect under treatment conditions:
In [41]:
# Annotate locus tags with functions from NCBI GenBank files
annotated = tools.annotate_locus_tags(trt_only_positive,
os.path.join('..', 'data', 'Sakai',
'GCF_000008865.1_ASM886v1_genomic.gbff'))
pd.options.display.max_rows = 115 # force to show all rows
annotated
Out[41]:
Finally, we write this data out in tab-separated format:
In [42]:
# Write data to file in tab-separated format
outfile_annotated = os.path.join('datasets', 'trt_positive.tab')
annotated.to_csv(outfile_annotated, sep="\t")
In [43]:
# Create figure with no title or xticks to hold the plotted axes
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(20, 26))
# Add subplot for each result
# 1) Sakai chromosome
regions = [('S-loop 71', 'ECs1276', 'ECs1288', 1),
('SpLE1', 'ECs1299', 'ECs1410', 1.8),
('S-loop 225', 'ECs4325', 'ECs4341', 1.8),
('S-loop 231', 'ECs4379', 'ECs4387', 1)]
annotations = {k:(tools.get_lt_index(v0, sakai_chromosome_annotated),
tools.get_lt_index(v1, sakai_chromosome_annotated), v2) for
k, v0, v1, v2 in regions}
delta_thresh = np.median(sakai_chromosome_annotated['d_median'])
tools.plot_parameter(sakai_chromosome_annotated, ax1, 'd', delta_thresh, annotations=annotations,
label="a) Sakai chromosome")
# 2) pO157 plasmid
regions = [('StcE', 'pO157p01', 'pO157p01', 0.13),
('etp T2SS', 'pO157p02', 'pO157p14', 0.19)]
annotations = {k:(tools.get_lt_index(v0, sakai_pO157_annotated),
tools.get_lt_index(v1, sakai_pO157_annotated), v2) for
k, v0, v1, v2 in regions}
delta_thresh = np.median(sakai_pO157_annotated['d_median'])
tools.plot_parameter(sakai_pO157_annotated, ax2, 'd', delta_thresh, annotations=annotations,
label="b) Sakai pO157")
# 3) DH10B chromosome
delta_thresh = np.median(dh10b_estimates['d_median'])
tools.plot_parameter(dh10b_estimates, ax3, 'd', delta_thresh, label="c) DH10B chromosome")
# Save figure as pdf
plt.savefig("figure_1.pdf");