This notebook describes analysis of 10-fold crossvalidation data of the Bayesian hierarchical model described in notebook 02-full_model_fit.ipynb
. This model estimates the selective effects on growth (control) and leaf passage (treatment) due to individual genes from E. coli DH10B (carrier) and Sakai (BAC load). The estimates are made on the basis of data obtained using a multi-E. coli microarray.
Much of the code for the visualisation, analysis and data manipulation of the fitting results is found in the associated Python module tools.py
, which should also be present in this directory.
We will load predicted and measured output probe intensity data from a 10-fold crossvalidation, performed as described in the file README.md
(the data provided in this repository was obtaind from a run on the JHI cluster).
We can assess the applicability and success of the Stan model by investigating how well it can predict the measured output intensity of a probe, given its measured input intensity. We are conducting crossvalidation on the original dataset, and we consider a prediction to be "correct" if the measured output intensity lies within the 90% credibility interval of the predicted output intensity.
In [1]:
%pylab inline
import os
import pickle
import warnings; warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import scipy
import seaborn as sns; sns.set_context('notebook')
import tools
We load data from the multiplexed run that was performed on the JHI cluster, as described in README.md
.
The column containing the experimentally-measured predictor (input intensity) is headed log_input
, and the column containing the measured output probe intensity is headed log_output
.
The predicted output mean and median values for 1000 iterations of two chains are found in the columns y_pred_mean
and y_pred_median
. 5%, 25%, 75% and 95% percentiles for predicted output are given in the columns y_pred_5pc
, y_pred_25pc
, y_pred_75pc
and y_pred_95pc
. These can be combined to provide 50% and 90% credibility intervals for each predicted output intensity.
In [2]:
# Crossvalidation results file
resultsfile = os.path.join("datasets", "10-fold_CV.tab")
# Load predictions from file
results = pd.read_csv(resultsfile, sep='\t', index_col=0)
print(results.shape)
print(results.columns)
In [3]:
# Inspect crossvalidation results
results.head()
Out[3]:
We want to investigate how predictive performance compares with our estimate of parameters from a fit on full data, to see - for example - whether the genes for which we have confidence in a treatment effect are associated with good predictions of output probe intensity.
We load the model fit results, generated by notebook 02-full_model_fit.ipynb
, from the associated output pickle file.
In [4]:
# File containing pickled fit from notebook 02
fitfile = os.path.join("model_fits", "full_model_fit.pkl")
# Load array measurements and get locus tags/arrays as indices
datafile = os.path.join("datasets", "normalised_array_data.tab")
indata = pd.read_csv(datafile, sep="\t")
locus_tags = indata['locus_tag'].unique()
arrays = indata['repXtrt'].unique()
# Load the pickled fit into `estimates`
fit = pd.read_pickle(open(fitfile, 'rb'))
(estimates_by_probe, estimates) = tools.extract_variable_summaries(fit, 'df',
['a', 'b', 'g', 'd'],
[arrays, locus_tags, arrays, locus_tags],
indata)
estimates.head() # inspect the data
Out[4]:
The estimates
dataframe contains the notebook 02-full_model_fit.ipynb
estimates of parameters for each gene, with percentiles enabling a 95% or 50% credibility interval to be estimated.
We join the crossvalidation data with the fits for the corresponding genes, on the basis of locus_tag
, to make plotting and analysis easier. We also reduce the columns in the dataset to a subset describing the locus tag, probe ID, the treatment and replicate factor, the measured log-transformed input and output intensities, and the median and 90% CI for the predicted output intensity. We also keep the estimated median $\delta$ and associated 95% CI so we can investigate the relationship with treatment effect.
In [5]:
# Columns to keep from the merged data
resultscols = ['locus_tag', 'probe', 'replicate', 'treatment',
'log_input', 'log_output', 'y_pred_5pc', 'y_pred_median', 'y_pred_95pc',
'd_2.5pc', 'd_25pc', 'd_median', 'd_75pc', 'd_97.5pc']
# Merge fit estimates with observed data
results_merged = pd.merge(results, estimates,
how='outer',
left_on='locus_tag', right_on='locus_tag').loc[:, resultscols]
results_merged.head()
Out[5]:
We create new columns to represent the error in crossvalidation prediction of output intensity:
y_pred_abs_error
: the absolute error (y_pred_median
- log_output
)y_pred_rel_error
: the relative error (y_pred_abs_error
/log_output
)and to represent the absolute difference between the measured input and: (i) the measured output; (ii) the predicted output
y_diff_pred
: absolute difference between input and prediction (y_pred_median
- log_input
)y_diff_obs
: absolute difference between input and measured output (log_output
- log_input
)
In [6]:
# Calculate absolute and relative prediction error
results_merged['y_pred_abs_error'] = results_merged['y_pred_median'] - results_merged['log_output']
results_merged['y_pred_rel_error'] = results_merged['y_pred_abs_error']/results_merged['log_output']
# Calculate observed and predicted differences
results_merged['y_diff_pred'] = results_merged['y_pred_median'] - results_merged['log_input']
results_merged['y_diff_obs'] = results_merged['log_output'] - results_merged['log_input']
In [7]:
# Plot prediction errors boxplots
tools.plot_errors(results_merged)
By plotting the absolute and relative error in output intensity prediction against measured values, we can get an idea of whether the errors are uniformly distributed, or likely to be associated primarily with weak measured intensities.
We might expect an overabundance of large relative error for low measured intensity values, as the relative effect of a constant absolute error will be greater for these smaller values.
In [8]:
# Plot relative and absolute error wrt input and output
tools.plot_error_vs_column(results_merged, "log_input")
tools.plot_error_vs_column(results_merged, "log_output")
By relating the absolute and relative error to the estimated treatment effect, we can interpret whether we should continue to be confident in the results of notebook 02-full_model_fit.ipynb
. If the 25% percentile for our estimate of $\delta$ is greater than zero, then the 50% CI for that probe's gene does not include zero, and we interpret this as a positive effect due to treatment.
In [9]:
# Plot errors relative to 25% percentile for estimate of treatment effect
tools.plot_error_vs_column(results_merged, "d_25pc")
tools.plot_error_vs_column(results_merged, "d_median")
We subset the data to investigate this more closely, in the dataset trt_pos
representing the 115 locus tags with an estimated positive treatment effect.
In [10]:
# Subset data to positive estimates of delta only
trt_pos = results_merged[results_merged['d_25pc'] > 0]
# Inspect results
trt_pos.head()
Out[10]:
In [11]:
# Plot errors relative to 25% percentile for estimate of treatment effect
tools.plot_error_vs_column(trt_pos, "d_25pc")
tools.plot_error_vs_column(trt_pos, "d_median")
By visual inspection, most probe prediction errors for the "treatment positive" probes are small in relative terms, but appear to be Normally distributed in absolute terms with respect to the value of d_median
. We can examine the error distributions for these probes with respect to log input and output intensities, as above.
In [12]:
# Plot errors for positive treatment effects wrt measured intensities
tools.plot_error_vs_column(trt_pos, "log_input")
tools.plot_error_vs_column(trt_pos, "log_output")
As an estimate of prediction accuracy, we can calculate the number of observed output intensities that lie outwith the 90% credibility interval of the prediction. We create the column pred_success
, which contains True
where the predicted output value lies in the 90% CI for the crossvalidation predictions.
In [13]:
# Add a column for probe predictions that lie outwith the 90% credibility interval of prediction
results_merged['pred_success'] = (results_merged['log_output'] > results_merged['y_pred_5pc']) & \
(results_merged['log_output'] < results_merged['y_pred_95pc'])
# Inspect data
results_merged.head()
Out[13]:
In [14]:
# Make a dataframe of missed predictions
errors = results_merged[results_merged['pred_success'] == False]
print(errors.shape, results_merged.shape, len(errors['locus_tag'].unique()))
We can gain an insight into the number of probes that are likely to be in error for any particular locus tag, by plotting the distribution of their counts:
In [15]:
# Distribution of probes in error, by locus tag
error_probe_counts = errors['locus_tag'].groupby(errors['locus_tag']).agg(['count'])
error_probe_counts.columns=['probes_in_error']
ax = sns.distplot(error_probe_counts['probes_in_error'], bins=max(error_probe_counts['probes_in_error']))
ax.set_title("probes in error, by locus tag")
ax.set_xlim(0, max(error_probe_counts['probes_in_error']));
Most of the locus tags with prediction errors have errors in 4 probes or fewer (out of an average of six or so), so in general we might expect most probes for most locus tags to be relatively well-predicted by our model.
But is this the case for the locus tags with a predicted positive treatment effect?
We can examine the distribution of crossvalidation prediction errors for the locus tags with positive estimated treatment effect, and compare this to the distribution for the dataset as a whole. If these are similar, then there may be no systematic misprediction for the positive estimated treatment values.
In [16]:
# Subset data to positive estimates of delta only
trt_pos = results_merged[results_merged['d_25pc'] > 0]
trt_errors = trt_pos[trt_pos['pred_success'] == False]
print(trt_errors.shape, trt_pos.shape, len(trt_errors['locus_tag'].unique()))
Plotting the distribution of the number of probes in error for each locus tag also shows a noisier distribution to that for the main dataset, where some predictions have a much larger number of probe predictions that appear to be in error:
In [17]:
trt_error_probe_counts = trt_errors['locus_tag'].groupby(trt_errors['locus_tag']).agg(['count'])
trt_error_probe_counts.columns=['probes_in_error']
ax = sns.distplot(trt_error_probe_counts['probes_in_error'], bins=max(trt_error_probe_counts['probes_in_error']))
ax.set_title("probes in error, by locus tag (positive trt effect)")
ax.set_xlim(0, max(trt_error_probe_counts['probes_in_error']));
The modal number of probes in error is one, but the distribution has a different shape to that for the complete dataset, suggesting a systematic failure of the model to predict precisely the output intensity for some probes.
We break down the prediction errors for our positive treatment effect candidates into two types (after Gelman & Carlin (2014) DOI: 10.1177/1745691614551642).
We create two new columns in the trt_pos
dataframe that contains estimates for those locus tags with an estimated positive value for $\delta$, the effect on treatment/passage:
type_s
: there is a prediction error, and the sign of the predicted and observed differences are the same value - trt_pos['y_diff_pred']/trt_pos['y_diff_obs'] > 0
type_m
: there is a prediction error, but it's not Type S - trt_pos['type_s'] is False and trt_pos['pred_success'] is False
In [18]:
# Create columns for error types
trt_pos['type_s'] = trt_pos['pred_success'] is False and (trt_pos['y_diff_pred']/trt_pos['y_diff_obs'] < 0)
trt_pos['type_m'] = (trt_pos['type_s'] == False) & (trt_pos['pred_success'] == False)
# Inspect data
trt_pos.head()
Out[18]:
In [19]:
# How many errors of each type?
print("Type S errors: %d" % sum(trt_pos['type_s']))
print("Type M errors: %d" % sum(trt_pos['type_m']))
All errors in the predicted output intensity are Type M errors: errors in the predicted effect magnitude, where the effect is observed to be in the correct direction, with respect to the input intensity.
The question arises whether the errors are typically underestimates or overestimates, for the probes that are in error. If we are systematically overestimating output probe intensity, then we might consider our estimates of $\delta$ to be generally too large. Alternatively, if we are systematically underestimating output probe intensity, we might thing that our estimates are too small.
In [20]:
# Violinplot of prediction error for all probes
sns.violinplot(trt_pos['log_output'] - trt_pos['y_pred_median']);
In general, most probe predictions are within one log unit of the observed value for the probes corresponding to positive treatment effects. This is encouraging and suggests that, in general, the parameter estimates and model are appropriate.
In [21]:
# Violinplot of prediction error for probes where prediction fails
trt_errors = trt_pos[trt_pos['pred_success'] == False]
print("%d positive errors" % sum(trt_errors['y_pred_abs_error'] > 0))
print("%d negative errors" % sum(trt_errors['y_pred_abs_error'] < 0))
sns.violinplot(trt_errors['y_pred_abs_error']);
Although our results seem to be on the whole quite sound, there are some locus tags for which the results may be more questionable. We can identify the total number of probes for each locus tag, and the number of probe output predictions that are in error, and determine which locus tags appear to have an unusually high proportion of errors as having more questionable estimates.
We generate a dataframe, indexed by locus tag, with a column fail_prop
that describes the proportion of probe predictions that are in error.
In [22]:
# Count number of probes and failed predictions, by locus tag
collist = ['d_2.5pc', 'd_25pc', 'd_median', 'd_75pc', 'd_97.5pc',
'pred_success', 'type_m']
lt_pos = trt_pos.groupby('locus_tag').agg(sum)[collist]
lt_pos['fail_prop'] = lt_pos['type_m']/(lt_pos['type_m'] + lt_pos['pred_success'])
lt_pos.reset_index(inplace=True)
# Inspect output
lt_pos.head()
Out[22]:
In [23]:
# Distribution of proportion of failed probes
sns.distplot(lt_pos['fail_prop'], bins=10);
From this data, we see that most locus tags in this dataset have no probe errors, and a small proportion have more than 50% probe errors. We can inspect the predictions for those locus tags directly.
In [24]:
# Make list of locus tags with more than 50% probe errors
q_tags = lt_pos.loc[lt_pos['fail_prop'] > 0.5]
print(q_tags.shape, sum(q_tags['type_m']))
We use the function plot_locustag_predictions()
to visualise the prediction results for these genes directly. In these plots, values 0-2 indicate control replicates, and values 3-5 treatment replicates. The grey points show measured input intensity for the probe, and the black points show measured output intensity. The coloured points show predicted median values, and the bars indicate the 90% credibility interval for the prediction.
Yellow and blue bars indicate that the observed output intensity lies within the 90% CI; green and red bars indicate that the observed output value lies outwith the 90% CI.
In [25]:
# Plot predictions for each questionable locus tag
print(q_tags['locus_tag'])
for lt in q_tags['locus_tag']:
tools.plot_locustag_predictions(trt_pos, lt)
These ten locus tags explain nearly 1/3 of all Type M errors in the positive treatment dataset, and are closely collocated on the genome, from ECs4325
to ECs4387
, and so may be subject to the same effects of BAC over- or under-representation. Where this occurs, a single pool may give an erroneous result due to the aberrational effets of over- or under-representation, and this may skew our crossvalidation predictions.
Visual inspection of the plots above indicates that the crossvalidation parameter estimates for $\delta$ were typically positive, and those for $\beta$ negative. They also indicate that the observed output intensities for the treatment and control values were not consistent.
For instance, for ECs4387
control samples 1 and 2 showed an increase in intensity post-experiment, but sample 0 shows quite a sharp decrease in intensity. Similarly, treatment samples 3 and 4 show an increase in post-experiment intensity, but sample 5 shows a decrease.
Similar inconsistencies are observed for all ten locus tags in this region, and BAC over or under-representation in the pools may be a possible physical cause of this.