Disclaimer: I don't really like how the book explains Bayesian methods, so I'm pulling from different sources: Gelman's Bayesian Data Anlysis (3rd Edition) (http://www.stat.columbia.edu/~gelman/book/) and Probabilistic Programming and Bayesian Methods for Hackers (http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Chapter1_Introduction.ipynb)
The fundamental difference between Bayesian and Frequentist methods can be understood in terms of an example.
This map shows the highest and lowest kidney cancer rate per county in the US in the 80's (1980-1989) Both the highest rates and the lowest rates seem to be in the same area.
For the sake of completeness, let's reintroduce Baye's rule:
$P(\theta|\textbf{D}) = P(\theta ) \frac{P(\textbf{D} |\theta)}{P(\textbf{D})}$
where
$P(\theta|\textbf{D})$ is the posterior distribution over your parameters $\theta$
$P(\theta)$ is called the prior
$P(\textbf{D} |\theta)$ is called the likelihood, for some esoteric reasons, this looks like, behaves like, but isn't a probability distribution (more than happy to extend on that)
$P(\textbf{D})$ is called the marginal distribution over the data $D$. (And as such, is only marginally important)
Once you have your posterior distribution, you can take as many samples from it as you want, and obtain your ranking statistics (mode, percentiles, etc)
Coin flips just happen to be fun, and essentially shows the idea that as we have data, we have new estimates of our posterior probability.
In a coin flip, the coin is governed by a probability $p$ of falling tails, which in this case will be the parameter that we are interested in.
In [10]:
# The code below can be passed over, as it is currently not important, plus it
# uses advanced topics we have not covered yet. LOOK AT PICTURE, MICHAEL!
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
figsize(11, 9)
import scipy.stats as stats
dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])
x = np.linspace(0, 1, 100)
# For the already prepared, I'm using Binomial's conj. prior.
for k, N in enumerate(n_trials):
sx = plt.subplot(len(n_trials) / 2, 2, k + 1)
plt.xlabel("$p$, probability of heads") \
if k in [0, len(n_trials) - 1] else None
plt.setp(sx.get_yticklabels(), visible=False)
heads = data[:N].sum()
y = dist.pdf(x, 1 + heads, 1 + N - heads)
plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)
leg = plt.legend()
leg.get_frame().set_alpha(0.4)
plt.autoscale(tight=True)
plt.suptitle("Bayesian updating of posterior probabilities",
y=1.02,
fontsize=14)
plt.tight_layout()
In this example, the prior is a beta distribution distribution, and the model is a bernoulli distribution, and since we have the magic of the conjugate distributions (http://en.wikipedia.org/wiki/Conjugate_prior) the posterior is going to be another Beta distributions, as we gather more and more observations, we have a better certainty of the parameters of the parameter $p$ that controls the coin toss.
AT&T asks us if their change in price scheme actually affected the texting habits of a user (trivialy extended to multiple users)
We have data for a single user (all of the data is in the Python Notebook)
In [11]:
figsize(12.5, 3.5)
count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color="#348ABD")
plt.xlabel("Time (days)")
plt.ylabel("count of text-msgs received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data);
We learned in the first chapters, that count data can be modeled using a Poisson distribution:
$ C_i \sim \text{Poisson}(\lambda)$
The parameter $\lambda$ is the rate, and by looking at the data, it seems like there was a change in the user behavior.
So we can define two lambdas, one before time $\tau$ when the behavior changed, and one after.
$ \lambda = \begin{cases} \lambda_1 & \text{if } t \lt \tau \cr \lambda_2 & \text{if } t \ge \tau \end{cases}, $
If there is no change, $\lambda_1$ and $\lambda_2$ shuld be the same.
We are interested in the rate $\lambda$ and the moment the change happens $\tau$ and as such, are the parameters of interest in the Bayes theorem. This means we need to define a prior $p(\lambda)$ and $p(\tau)$ over the parameter.
To find the prior for $\lambda$ as we also saw in the first chapters, and we can check in the wikipedia link, a good conjugate prior distribution for a Poisson random variable, is a Gamma distribution, but a particular, simpler case of the Gamma is an exponential distribution (The Chi squared distribution also falls here).
The combination Poisson-Exponential create what is called a Poisson Process, and among other things is used to models counts that happen ovr time.
So formally:
$ \begin{align} \lambda_1 \sim \text{Exp}( \alpha )\\ \lambda_2 \sim \text{Exp}( \alpha ) \end{align} $
In this expression, $\alpha$ is called a hyperparameter or parent variable, both rates depend on a single alpha. There is nothing preventing us from using two $\alpha$ but the sake of simplicity.
Some questions:
To model the $\alpha$ and make our prior at least a bit informative, a good rule of thumb is to choose the inverse of the average of the count data:
$\frac{1}{N}\sum_{i=0}^N \;C_i \approx E[\; \lambda \; |\; \alpha ] = \frac{1}{\alpha}$
This prior is also closely related to the expected value of the parameter lambda given an exponential distribution. Which is also related to the minimum entropy principle.
In the definition of the lambdas, we saw that there is another random variable called $\tau$ that is also a variable of interest because it indicates the breakpoint of the trend change.
Since it is also a random variable, it also needs a prior distribution. We don't really have much information on where $\tau$ could be initially located, so we used a noninformative prior in the form of a unifrom distribution over the time span (0,70):
$ \tau \sim DiscreteUniform(1,70)\\ P(\tau) = 1/70 $
We see that is a nice scalar value that does not really messes up calculations
Remember that our main objective is to find if/when there is a change in trend, that would mean lambdas are different and $\tau$ has a value different than 0 or 70.
Using Bayes theorem:
$ P(\lambda_1, \lambda_2, \tau |Data (counts)) \propto P(Data (counts) |\lambda_1) P(Data(counts) |\lambda_2)P(\lambda_1|\alpha)P(\lambda_2|\alpha)P(\tau) $
Some coments:
-Independence: Before we learned that $P(a,b) = P(a)P(b)$ if they are independent, so we have applied that same principle here, in strict theory they are not independent, but that is represented in the hyper parameters, they are thus conditionally independent
A more diligent presenter would probably go ahead and expand each expression and would show that the joint posterior distribution is a nice exponential over the parameters $\lambda$.
We can see that the final math expression will be an exponential distribution scaled by some factors (conjugacy). And using that final expression, we can sample values of the final joint distribution.
You can sample a point from the prior, then sample from the likelihood using that sample, normalize and that way you can generate samples from the posterior, with this you can build histograms to work with.
If you are interested in point values, you can optimize the whole aposteriori in a MAP approach (maximum aposteriori).
One problem though, is that we migh like to do inference in particular variables, like $\tau$ given the rest of the variables, and that gets complicated, since it involves the integral of all the posterior over the other variables.
$ \int P(\tau, \lambda_1, \lambda_2|Data) d\lambda_1\lambda_2 $
There are two ways to deal with this, actual math, or sampling approaches.
In the book they discuss a bit on sampling, which are Monte Carlo techniques to get samples from complicated probability distributions. A way to deal with it, also in the book is the convenient probabilistic programming approach.
Many people can see the inherent complication of wanting to do good, nice models and how the math can actually prevent people from actually trying this approaches.
Many intelligent people got together, and created software that requires you to only specify your likelihood, priors and it will pull the hard work for you giving nice plots and results.
In R, they have Stan, and in Python we use PyMC.
In [12]:
import pymc as pm
alpha = 1.0 / count_data.mean() # Recall count_data is the
# variable that holds our txt counts
#We define our priors here
lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)
tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)
#The assignment function where we say that there are two domains for lambda given tau
@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
out = np.zeros(n_count_data)
out[:tau] = lambda_1 # lambda before tau is lambda1
out[tau:] = lambda_2 # lambda after (and including) tau is lambda2
return out
#The likelihood, where we indicate which values are observed
observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)
lambda_1_samples = mcmc.trace('lambda_1')[:]
lambda_2_samples = mcmc.trace('lambda_2')[:]
tau_samples = mcmc.trace('tau')[:]
figsize(12.5, 10)
# histogram of the samples:
ax = plt.subplot(311)
ax.set_autoscaley_on(False)
plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_1$", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
$\lambda_1,\;\lambda_2,\;\tau$""")
plt.xlim([15, 30])
plt.xlabel("$\lambda_1$ value")
ax = plt.subplot(312)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_2$", color="#7A68A6", normed=True)
plt.legend(loc="upper left")
plt.xlim([15, 30])
plt.xlabel("$\lambda_2$ value")
plt.subplot(313)
w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_count_data, alpha=1,
label=r"posterior of $\tau$",
color="#467821", weights=w, rwidth=2.)
plt.xticks(np.arange(n_count_data))
plt.legend(loc="upper left")
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$ (in days)")
plt.ylabel("probability");
On first inspection, we can see that the posteriors for both lambda are different, which is the rate of texts per day, we can see clearly that there was an increase in the use of the messaging service, and the posterior on $\tau$ tells us that the change happens approximately on the 45th day.
To do this in a more formal way, we can calculate the expected number of text messages each day given that we know all of these variables.
The variables tau samples have the sample output of a 3000 step Markoc Chain Monte Carlo Method, that is, 3000 samples for the expected value of $\tau$. Then for each day, we need to see which samples are lower than $\tau$
With those indexes, we can calculate the expected value of texts per day.
To obtain
In [13]:
figsize(12.5, 5)
# tau_samples, lambda_1_samples, lambda_2_samples contain
# N samples from the corresponding posterior distribution
N = tau_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
# ix is a bool index of all tau samples corresponding to
# the switchpoint occurring prior to value of 'day'
ix = day < tau_samples
# Each posterior sample corresponds to a value for tau.
# for each day, that value of tau indicates whether we're "before"
# (in the lambda1 "regime") or
# "after" (in the lambda2 "regime") the switchpoint.
# by taking the posterior sample of lambda1/2 accordingly, we can average
# over all samples to get an expected value for lambda on that day.
# As explained, the "message count" random variable is Poisson distributed,
# and therefore lambda (the poisson parameter) is the expected value of
# "message count".
expected_texts_per_day[day] = (lambda_1_samples[ix].sum()
+ lambda_2_samples[~ix].sum()) / N
plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color="#E24A33",
label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Expected # text-messages")
plt.title("Expected number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
label="observed texts per day")
plt.legend(loc="upper left");
Probabilistic Programming techniques allow us to use more complicated distributions, you can always derive the expressions yourself, but that is ill advised if you are not in for a Stats PhD.
This is a principle used to find good nice informative priors for your distributions. If you do not want to be bothered with conjugacy. Is not a ver common method, and in most books is a sidenote to calculate prior distributions.
For the sake of completeness I'll put it here.
When we have some low order statistics, like for example the Estimated Value, we can use it to approximate a prior distribution via the entropy:
$ S=-\sum^N_{i=1} p_i ln(p_i) $
If we want to find our prior $p_i$, we might have access to a vague prior $m_i$, if we extend the entropy to compare two probabilities:
$ S=-\sum^N_{i=1} p_i ln(\frac{p_i}{m_i}) $
For example, in a six sided die, a vague prior would be the the uniform over faces: $m_i = frac{1}{6}$
Once we have defined the entropy, we maximize it constrained to the fact that $p_i$ has to sum up to one, and that the expected value for many distributions is well known before hand.
This is just the theoretical justification to the choice I did in the example above.
If we do not have access to a score metric, is difficult to asses whether one model will be better or worse. To do this we can take a ratio of the model given the data.
$ O_{21} = \frac{p(M_2|D)}{p(M_1|D)} $
To obtain P(M|D), you have to integrate over all the possible parameters, and you will get the probability of your model given your data.
In [ ]: