The difference between statistical learning and classical machine learning lies in the way they handle data. Rather than working with the data directly, statistical learning works with models of the data (statistical models formulated in terms of probability distributions). The ultimate goals also differ slightly: while ML strives towards narrower, quantifiable goals such as regression or classification error, the main concern in statistical learning circles is model inference. This creates a cultural rift as well: ML researchers tend to publish their work at conferences and are more industry focused, whereas statistical learning researchers tend to publish in scientific journals and have a more academic focus.
Further reading:
There is no 'true' probability, because the probability of an event is fixed. Or is it?
Let us build a Bayesian probabilistic model for gene expression. The number of reads mapping to a gene can be modelled as Poisson distributed. Say we have a set of technical replicates giving various coverage numbers for a certain gene. Let $E = \{E_i, e_i\}$ be the set of coverage numbers and associated errors drawn from the technical replicates. The question is to find the best estimate of the true expression $E_{true}$.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
np.random.seed(1) # for repeatability
E_true = 1000 # true expression
N = 50 # samples
E = stats.poisson(E_true).rvs(N) # N measurements of the expression
err = np.sqrt(E) # errors on Poisson counts estimated via square root
fig, ax = plt.subplots()
ax.errorbar(E, np.arange(N), xerr=err, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([E_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("sequence counts")
ax.set_ylabel("samples")
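For Poisson counts the variance equals the mean, so the square root of each count is a reasonable estimate of its measurement error. A minimal sanity check of this, reusing the E array defined above:

In [ ]:
# For a Poisson(lambda) variable the variance equals lambda, so
# sqrt(count) approximates the standard deviation of a single measurement.
print("sample mean:         {:.1f}".format(E.mean()))
print("sample std:          {:.1f}".format(E.std(ddof=1)))
print("sqrt(mean) estimate: {:.1f}".format(np.sqrt(E.mean())))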
Now, given our measurements and their estimated errors, what is our best estimate of the true expression?
The frequentist approach is based on a maximum likelihood estimate: compute the probability of each observation given a fixed true expression value, and take the product of these probabilities over all data points:
$$ L(E \mid E_{true}) = \prod_{i=1}^{N} P(E_i \mid E_{true}) $$
We want the value of $E_{true}$ that maximizes this likelihood, or equivalently minimizes the negative log likelihood. In general this is approximated numerically via optimization, but in this case it can be solved analytically, giving an inverse-variance weighted mean:
$$ E_{est} = \underset{E_{true}}{\operatorname{arg\,max}}\, L(E \mid E_{true}) = \underset{E_{true}}{\operatorname{arg\,min}} \left\{ -\sum_{i=1}^{N} \log P(E_i \mid E_{true}) \right\} \approx \frac{\sum_i w_i E_i}{\sum_i w_i}, \qquad w_i = 1/e_i^2 $$
In this case we can also estimate the error of the estimate by using a Gaussian approximation of the likelihood function at its maximum:
In [6]:
w = 1. / err ** 2  # inverse-variance weights
print("""
E_true = {0}
E_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(E_true, (w * E).sum() / w.sum(), w.sum() ** -0.5, N))
With the Bayesian approach we estimate the probability of the model parameters given the data, rather than a single point estimate. This is called the posterior probability. We compute it from the likelihood and the model prior, which encodes our expectation about the parameters before seeing the data. The data probability (evidence) encodes how likely the data is overall, and in practice it is usually treated as a normalization term. The formula is Bayes' theorem, here used with a Bayesian interpretation of probability.
$$ P(E_{true} \mid E) = \frac{P(E \mid E_{true})\, P(E_{true})}{P(E)} $$
$$ \mathrm{posterior} = \frac{\mathrm{likelihood} \cdot \mathrm{prior}}{\mathrm{data\ probability}} $$
In [18]:
import pymc3 as pm

with pm.Model():
    # prior on the mean expression
    mu = pm.Normal('mu', mu=900, sd=1.)
    # fixed observation noise
    sigma = 1.
    # likelihood of the observed counts
    E_obs = pm.Normal('E_obs', mu=mu, sd=sigma, observed=E)
    step = pm.Metropolis()
    trace = pm.sample(15000, step=step)

pm.traceplot(trace)
plt.show()
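As a cross-check on the sampler, the original Poisson model also admits an analytic posterior: the Poisson likelihood is conjugate to the Gamma distribution, and with a flat (improper) prior the posterior for $E_{true}$ is $\mathrm{Gamma}(1 + \sum_i E_i,\ N)$ in the shape/rate parametrization. The sketch below assumes that flat prior and the Poisson likelihood, not the Normal model coded above:

In [ ]:
from scipy import stats
# Conjugacy: Poisson likelihood + flat (improper) prior gives a
# Gamma(1 + sum(E_i), N) posterior for E_true (shape/rate parametrization).
shape = 1 + E.sum()
rate = N
posterior = stats.gamma(a=shape, scale=1.0 / rate)
print("analytic posterior mean: {:.1f}".format(posterior.mean()))
print("analytic posterior std:  {:.1f}".format(posterior.std()))
print("95% credible interval:   {:.1f} - {:.1f}".format(*posterior.interval(0.95)))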
Task:
In [8]:
def log_prior(E_true):
    return 1  # flat prior (a constant, so it does not change the posterior shape)

def log_likelihood(E_true, E, e):
    # Gaussian log likelihood with per-point errors e
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (E - E_true) ** 2 / e ** 2)

def log_posterior(E_true, E, e):
    return log_prior(E_true) + log_likelihood(E_true, E, e)

import pymc3 as pm

basic_model = pm.Model()

with basic_model:
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta * E

    # Likelihood (sampling distribution) of observations
    E_obs = pm.Normal('E_obs', mu=mu, sd=sigma, observed=E)

    # start the chain at the maximum a posteriori estimate
    start = pm.find_MAP()
    step = pm.Metropolis()

    # draw 20000 posterior samples
    trace = pm.sample(20000, step=step, start=start)

_ = pm.traceplot(trace)
plt.show()
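The hand-written log_posterior defined above can also be explored with a simple random-walk Metropolis sampler, without relying on PyMC3. The sketch below is one possible implementation; the proposal step size of 5 and the chain length are arbitrary illustrative choices:

In [ ]:
# A minimal random-walk Metropolis sampler over E_true, using the
# hand-written log_posterior above.
def metropolis(log_post, start, n_samples, step_size, *args):
    samples = np.empty(n_samples)
    current = start
    current_logp = log_post(current, *args)
    for i in range(n_samples):
        proposal = current + np.random.normal(0, step_size)
        proposal_logp = log_post(proposal, *args)
        # accept with probability min(1, exp(proposal_logp - current_logp))
        if np.log(np.random.rand()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp
        samples[i] = current
    return samples

posterior_samples = metropolis(log_posterior, E.mean(), 20000, 5.0, E, err)
burned = posterior_samples[2000:]  # discard burn-in
print("E_est = {:.0f} +/- {:.0f}".format(burned.mean(), burned.std()))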
In [3]:
import pymc3 as pm
help(pm.Normal)