Read:
Old (todo: clear up):
How does limma's eBayes work?
the tutorials here:
bayesian notebook:
Couple with Nikolay's presentation where possible: http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/
More Bayesian learning?
Probabilistic learning (for variational inference).
Convolutional variational autoencoder via Keras: https://docs.pymc.io/notebooks/convolutional_vae_keras_advi.html
Is the code from the GTB meetup also using PyMC3 for the VAE?
There is no 'true' probability, because the probability of an event is fixed. Or is there?
Let us build a Bayesian probabilistic model for gene expression. The number of reads mapping to a gene can be approximated with a Poisson distribution. Say we have a set of technical replicates giving various coverage numbers for a certain gene. Let $E = \{E_i, e_i\}$ be the set of coverage numbers and associated errors drawn from the technical replicates. The goal is to find the best estimate of the true expression $E_{true}$.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
np.random.seed(1) # for repeatability
E_true = 1000 # true expression
N = 50 # samples
E = stats.poisson(E_true).rvs(N) # N measurements of the expression
err = np.sqrt(E) # errors on Poisson counts estimated via square root
fig, ax = plt.subplots()
ax.errorbar(E, np.arange(N), xerr=err, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([E_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("sequence counts")
ax.set_ylabel("samples")
Given these measurements and their estimated errors, what is our best estimate of the true expression?
The frequentist approach is based on maximum likelihood estimation: compute the probability of each observation given a fixed true expression value, and take the product of these probabilities over all data points:
$$ L(E \mid E_{true}) = \prod_{i=1}^{N} P(E_i \mid E_{true}) $$
We want the value of $E_{true}$ that maximizes the log-likelihood. In general this has to be approximated numerically by optimization, but in this case it can be solved analytically:
$$ E_{est} = \operatorname{argmax}_{E_{true}} \log L(E \mid E_{true}) = \operatorname{argmin}_{E_{true}} \left\{ -\sum_{i=1}^{N} \log P(E_i \mid E_{true}) \right\} \approx \frac{\sum_i w_i E_i}{\sum_i w_i}, \qquad w_i = 1/e_i^2 $$
In this case we can also estimate the error of the estimate from a Gaussian approximation of the likelihood function at its maximum.
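To make the last step concrete, each Poisson term can be approximated by a Gaussian of width $e_i$ around $E_i$; setting the derivative of the negative log-likelihood to zero then gives the weighted mean and its uncertainty (a sketch of the standard derivation, using the weights $w_i$ defined above):
$$ \frac{\partial}{\partial E_{true}} \sum_{i=1}^{N} \frac{(E_i - E_{true})^2}{2 e_i^2} = -\sum_{i=1}^{N} w_i (E_i - E_{true}) = 0 \;\Rightarrow\; E_{est} = \frac{\sum_i w_i E_i}{\sum_i w_i}, \qquad \sigma_{E_{est}} = \Big(\sum_i w_i\Big)^{-1/2} $$
The next cell computes exactly these two quantities.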
In [6]:
w = 1. / err ** 2
print("""
E_true = {0}
E_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(E_true, (w * E).sum() / w.sum(), w.sum() ** -0.5, N))
In the Bayesian approach we instead estimate the probability of the model parameters given the data, rather than a single point estimate; this is called the posterior probability. It is computed from the likelihood and the prior, which encodes our expectations about the parameters before seeing the data. The data probability (the evidence) describes how likely the observed data is overall, and is usually treated as a normalization term. The formula below is Bayes' theorem, here used with a Bayesian interpretation of probability.
$$ P(E_{true} \mid E) = \frac{P(E \mid E_{true})\,P(E_{true})}{P(E)} $$
$$ \mathrm{posterior} = \frac{\mathrm{likelihood} \cdot \mathrm{prior}}{\mathrm{data\ probability}} $$
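Before handing this to an MCMC sampler, the role of the normalization term can be made concrete with a brute-force grid approximation: evaluate likelihood times prior over a grid of candidate $E_{true}$ values and divide by the sum. A minimal sketch, assuming the simulated `E` from above, a flat prior, and an arbitrary grid of 900-1100:

# Grid approximation of P(E_true | E) with a flat prior
E_grid = np.arange(900, 1101)
log_like = np.array([stats.poisson(e).logpmf(E).sum() for e in E_grid])
posterior = np.exp(log_like - log_like.max())  # shift for numerical stability
posterior /= posterior.sum()                   # normalization plays the role of the data probability
print("posterior mean: {:.1f}".format((E_grid * posterior).sum()))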
In [18]:
import pymc3 as pm

with pm.Model():
    # Prior on the true expression (centered at 900, fairly narrow)
    mu = pm.Normal('mu', 900, 1.)
    # Fixed observation noise
    sigma = 1.
    # Likelihood of the observed counts
    E_obs = pm.Normal('E_obs', mu=mu, sd=sigma, observed=E)
    step = pm.Metropolis()
    trace = pm.sample(15000, step)

#sns.distplot(trace[2000:]['mu'], label='PyMC3 sampler');
#sns.distplot(posterior[500:], label='Hand-written sampler');
pm.traceplot(trace)
plt.show()
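The sampled trace can also be summarized numerically; a minimal sketch (the 2000-sample burn-in is an arbitrary choice):

# Discard burn-in and summarize the posterior for mu
pm.summary(trace[2000:])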
Task:
In [8]:
def log_prior(E_true):
    return 1  # flat prior

def log_likelihood(E_true, E, e):
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (E - E_true) ** 2 / e ** 2)

def log_posterior(E_true, E, e):
    return log_prior(E_true) + log_likelihood(E_true, E, e)
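# A quick sanity check (not part of the original task): evaluate the
# hand-written log-posterior over a grid of candidate E_true values and
# report the maximum; assumes the simulated E and err arrays from above.
E_grid = np.linspace(950, 1050, 1000)
logp = np.array([log_posterior(e, E, err) for e in E_grid])
print("grid MAP estimate: {:.1f}".format(E_grid[np.argmax(logp)]))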
import pymc3 as pm

basic_model = pm.Model()
with basic_model:
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)
    # Expected value of outcome
    mu = alpha + beta * E
    # Likelihood (sampling distribution) of observations
    E_obs = pm.Normal('E_obs', mu=mu, sd=sigma, observed=E)
    start = pm.find_MAP(model=basic_model)
    step = pm.Metropolis()
    # draw 20000 posterior samples
    trace = pm.sample(20000, step=step, start=start)

_ = pm.traceplot(trace)
plt.show()
In [3]:
import pymc3 as pm
help(pm.Normal)
In [1]:
import pymc3 as pm

# phen is assumed to be a pandas DataFrame of phenotypes, loaded elsewhere,
# with an age column 'age_scr' and a 'BMI' column
basic_model = pm.Model()
with basic_model:
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)
    # Expected value of outcome: linear model of BMI on age
    mu = alpha + beta * phen['age_scr']
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=phen['BMI'])
    start = pm.find_MAP(model=basic_model)
    step = pm.Metropolis()
    # draw 20000 posterior samples
    trace = pm.sample(20000, step=step, start=start)

_ = pm.traceplot(trace)
plt.show()