Bayesian Statistical Inference

Introduction

Disclaimer: I don't really like how the book explains Bayesian methods, so I'm pulling from different sources: Gelman's Bayesian Data Analysis (3rd Edition) (http://www.stat.columbia.edu/~gelman/book/) and Probabilistic Programming and Bayesian Methods for Hackers (http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Chapter1_Introduction.ipynb)

The fundamental difference between Bayesian and Frequentist methods can be understood in terms of an example.

The map shows the counties with the highest and the lowest kidney cancer rates in the US during the 1980s (1980-1989). Strikingly, both the highest and the lowest rates seem to appear in the same areas.

  • There might be all sorts of dietary conditions that make people in those counties prone to kidney cancer, or it might be something else entirely.
  • The actual reason is that all of those counties also happen to be among the least populated.
  • With so few people, a single case in ten years shoots the rate way up.
  • Likewise, because there are so few people, even a single case is unlikely, so many of these counties record zero cases and land among the lowest rates.
  • Bayesian approaches, via the prior, allow us to adjust for population size and obtain more sensible estimates (see the sketch below).
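
A tiny simulation sketch of that shrinkage idea (all the numbers here are made up for illustration, and the Gamma prior is an arbitrary choice): the smallest "counties" produce wildly extreme raw rates, while the posterior mean under a shared Gamma prior pulls them back toward the underlying rate.

In [ ]:
# Toy illustration (made-up numbers): raw per-county rates vs. Gamma-Poisson posterior means.
import numpy as np

np.random.seed(0)
populations = np.array([300, 1000, 50000, 2000000])  # person-years per county
true_rate = 1e-4                                     # same underlying cancer rate everywhere
cases = np.random.poisson(true_rate * populations)

raw_rate = cases / populations.astype(float)         # the "frequentist" per-county estimate

# Gamma(a, b) prior on the rate (shape, rate), arbitrarily centered on the overall rate
a, b = 2.0, 2.0 / true_rate                          # prior mean a / b = 1e-4
posterior_mean = (a + cases) / (b + populations)     # conjugate Gamma-Poisson update

for pop, c, r, post in zip(populations, cases, raw_rate, posterior_mean):
    print("pop=%8d  cases=%3d  raw rate=%.2e  posterior mean=%.2e" % (pop, c, r, post))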

Essence of the Bayesian idea

For the sake of completeness, let's reintroduce Bayes' rule:

$P(\theta|\textbf{D}) = P(\theta ) \frac{P(\textbf{D} |\theta)}{P(\textbf{D})}$

where

$P(\theta|\textbf{D})$ is the posterior distribution over your parameters $\theta$

$P(\theta)$ is called the prior

$P(\textbf{D} |\theta)$ is called the likelihood; for somewhat esoteric reasons, as a function of $\theta$ it looks like and behaves like a probability distribution, but it isn't one (more than happy to expand on that)

$P(\textbf{D})$ is called the marginal distribution of the data $D$ (and, as such, is only marginally important)

Once you have your posterior distribution, you can take as many samples from it as you want and obtain your summary statistics (mean, mode, percentiles, etc.)
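
For instance (a small sketch, assuming a Beta posterior like the one in the coin example below, say after observing 7 heads in 10 tosses with a uniform prior):

In [ ]:
# Sketch: summarize a posterior by sampling from it (an assumed Beta posterior for illustration).
import numpy as np
import scipy.stats as stats

posterior = stats.beta(1 + 7, 1 + 3)   # assumed posterior: 7 heads in 10 tosses, uniform prior
samples = posterior.rvs(size=10000)

print("posterior mean: %.3f" % samples.mean())
print("95%% credible interval: [%.3f, %.3f]" % tuple(np.percentile(samples, [2.5, 97.5])))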

Mandatory coin flip example

Coin flips just happen to be fun, and they neatly show the idea that as we gather data, we update our estimate of the posterior distribution.

In a coin flip, the coin is governed by a probability $p$ of landing heads, which in this case is the parameter that we are interested in.


In [10]:
# The code below can be passed over, as it is currently not important, plus it
# uses advanced topics we have not covered yet. LOOK AT PICTURE, MICHAEL!
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
figsize(11, 9)

import scipy.stats as stats

dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])
x = np.linspace(0, 1, 100)

# For the already prepared, I'm using Binomial's conj. prior.
for k, N in enumerate(n_trials):
    sx = plt.subplot(len(n_trials) // 2, 2, k + 1)  # integer division keeps the subplot count an int
    plt.xlabel("$p$, probability of heads") \
        if k in [0, len(n_trials) - 1] else None
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)

    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)


plt.suptitle("Bayesian updating of posterior probabilities",
             y=1.02,
             fontsize=14)

plt.tight_layout()


In this example, the prior is a Beta distribution and the model (likelihood) is a Bernoulli distribution. Thanks to the magic of conjugate distributions (http://en.wikipedia.org/wiki/Conjugate_prior), the posterior is another Beta distribution, and as we gather more and more observations we become more certain about the parameter $p$ that controls the coin toss.
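
Spelling out that conjugate update (with $h$ heads in $N$ tosses and a $\text{Beta}(a,b)$ prior; the code above uses the uniform prior $a=b=1$):

$ P(p \mid \text{data}) \;\propto\; p^{h}(1-p)^{N-h} \cdot p^{a-1}(1-p)^{b-1} \;\propto\; \text{Beta}(a+h,\; b+N-h) $

which is exactly the $\text{Beta}(1+\text{heads},\, 1+N-\text{heads})$ density plotted in each panel.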

Not so mandatory Text Message Example

AT&T asks us whether their change in pricing scheme actually affected the texting habits of a user (this extends trivially to multiple users).

We have data for a single user (all of the data is in the Python Notebook)


In [11]:
figsize(12.5, 3.5)
count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color="#348ABD")
plt.xlabel("Time (days)")
plt.ylabel("count of text-msgs received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data);


We learned in the first chapters that count data can be modeled using a Poisson distribution:

$ C_i \sim \text{Poisson}(\lambda)$

The parameter $\lambda$ is the rate, and by looking at the data, it seems like there was a change in the user behavior.

So we can define two lambdas, one before time $\tau$ when the behavior changed, and one after.

$ \lambda = \begin{cases} \lambda_1 & \text{if } t \lt \tau \cr \lambda_2 & \text{if } t \ge \tau \end{cases}, $

Hyperparameters

If there is no change, $\lambda_1$ and $\lambda_2$ should be the same.

We are interested in the rate $\lambda$ and in the moment $\tau$ at which the change happens; these are the parameters of interest in Bayes' theorem. This means we need to define priors $p(\lambda)$ and $p(\tau)$ over them.

To find a prior for $\lambda$: as we saw in the first chapters, and as we can check in the Wikipedia link above, a good conjugate prior for a Poisson random variable is the Gamma distribution. A particularly simple special case of the Gamma is the exponential distribution (the Chi-squared distribution is another special case).
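
Concretely, in the shape-rate parameterization:

$ \text{Exp}(\alpha) = \text{Gamma}(1, \alpha), \qquad \chi^2_k = \text{Gamma}\!\left(\tfrac{k}{2}, \tfrac{1}{2}\right) $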

The Poisson-exponential combination creates what is called a Poisson process, which, among other things, is used to model counts of events that happen over time.

So formally:

$ \begin{align} \lambda_1 \sim \text{Exp}( \alpha )\\ \lambda_2 \sim \text{Exp}( \alpha ) \end{align} $

In this expression, $\alpha$ is called a hyperparameter or parent variable; both rates depend on a single $\alpha$. There is nothing preventing us from using two different $\alpha$'s, but we use one for the sake of simplicity.

Some questions:

  • How did I know the exponential was the best choice as a prior? Experience. Also, if you are diligent, you can go ahead and prove that the Gamma is a conjugate prior of the Poisson (a short sketch of that check follows this list); furthermore, you can derive the right form for most priors.
  • What do I do with the hyperparameters? Am I stuck dealing with a new parameter again? Yes, but as we saw in the coin example, as you gather more data your choice of $\alpha$ becomes less relevant.
  • Wait, can I choose a hyper-hyperparameter, i.e. a prior over the prior? Yes; as a matter of fact, if your data is not strong enough, this practice is often encouraged.
  • Well, how far can I go with the hyperparameterization? As you go further back, it becomes tricky to find nice conjugates so that things work out, so you are either stuck with simple noninformative priors (more on that in a minute) or you use reasonable priors and spend some time with the math.
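
As promised, here is a short sketch of that conjugacy check (a standard result, in the shape-rate parameterization): if $x_1, \dots, x_n \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}(a, b)$, then

$ P(\lambda \mid x_1, \dots, x_n) \;\propto\; \underbrace{\lambda^{\sum_i x_i} e^{-n\lambda}}_{\text{likelihood}} \cdot \underbrace{\lambda^{a-1} e^{-b\lambda}}_{\text{prior}} \;\propto\; \text{Gamma}\!\left(a + \textstyle\sum_i x_i,\; b + n\right) $

so the posterior is again a Gamma: the total count is added to the shape and the number of observations to the rate.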

Informative and Non Informative priors

  • Noninformative prior: as its name implies, this is a prior with no real strength; it is the closest we get to a frequentist approach. It usually has a simple form, like the uniform distribution, which just assigns the same probability to every value, although it can get more elaborate if we wish (standard normal, Chi-squared, etc.). It is usually quite weak, and after a couple of iterations its influence just disappears.
  • Informative priors: these use some degree of knowledge about our data to set the hyperparameters, so that when we do the inference we are not blindly looking in the dark.

Back to our example

To choose $\alpha$ and make our prior at least a bit informative, a good rule of thumb is to set it to the inverse of the average of the count data:

$\frac{1}{N}\sum_{i=1}^{N} \;C_i \approx E[\; \lambda \; |\; \alpha ] = \frac{1}{\alpha}$

This choice matches the prior expected value of $\lambda$ under an exponential distribution, $E[\lambda \mid \alpha] = 1/\alpha$, to the empirical mean of the data. It is also related to the maximum entropy principle (more on that later).

What about $\tau$ ?

In the definition of the lambdas, we saw that there is another random variable called $\tau$ that is also a variable of interest because it indicates the breakpoint of the trend change.

Since it is also a random variable, it also needs a prior distribution. We don't really have much information on where $\tau$ could be located, so we use a noninformative prior in the form of a discrete uniform distribution over the observed time span:

$ \tau \sim DiscreteUniform(1,70)\\ P(\tau) = 1/70 $

We see that this is a nice constant value that does not really mess up the calculations.

What do we do now that we have all of this?

Remember that our main objective is to find whether and when there is a change in trend; that would mean the two lambdas are different and $\tau$ takes a value away from the endpoints (0 or 70).

Using Bayes theorem:

$ P(\lambda_1, \lambda_2, \tau \mid \text{Data}) \;\propto\; P(\text{Data} \mid \lambda_1)\, P(\text{Data} \mid \lambda_2)\, P(\lambda_1 \mid \alpha)\, P(\lambda_2 \mid \alpha)\, P(\tau) $

Some comments:

  • Independence: earlier we learned that $P(a,b) = P(a)P(b)$ when $a$ and $b$ are independent, and we have applied that same principle here. Strictly speaking the parameters are not independent, but their dependence is captured through the hyperparameter, so they are conditionally independent.

  • Proportional to: we have not used an equals sign because, as mentioned before, the likelihood is not a probability distribution, so we would need a normalization term to make the right-hand side one. In this case that term is the marginal $P(\text{Data})$, which is obtained by integrating over the lambdas and tau.
  • Switching time $\tau$: we did not spell out the fact that $\tau$ controls which $\lambda$ regime we are working in; that is an extra modeling step (we could arguably build two separate models).

A more diligent presenter would go ahead and expand each expression and show that the joint posterior distribution is a nice exponential-family expression in the $\lambda$'s; a sketch of that expansion follows.
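
Sketching that expansion (using the proper switchpoint likelihood from the definition of $\lambda$ above, writing $S_1(\tau) = \sum_{i<\tau} C_i$, $S_2(\tau) = \sum_{i\ge\tau} C_i$, $N$ for the number of days, and dropping factors that do not depend on the parameters):

$ P(\lambda_1, \lambda_2, \tau \mid \text{Data}) \;\propto\; \lambda_1^{S_1(\tau)} e^{-\tau\lambda_1}\; \lambda_2^{S_2(\tau)} e^{-(N-\tau)\lambda_2}\; e^{-\alpha(\lambda_1 + \lambda_2)} $

So, conditional on $\tau$, each rate is again Gamma distributed: $\lambda_1 \mid \tau, \text{Data} \sim \text{Gamma}(1 + S_1(\tau),\, \alpha + \tau)$ and $\lambda_2 \mid \tau, \text{Data} \sim \text{Gamma}(1 + S_2(\tau),\, \alpha + N - \tau)$.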

Ok, so I got the model, what now?

We can see that, thanks to conjugacy, the final expression is an exponential-family density scaled by some factors, and using that final expression we can sample values from the joint posterior distribution.

You can sample a point from the prior, then sample from the likelihood using that point, normalize, and in that way generate samples from the posterior; with these you can build histograms to work with.

  • Sample $\tau$
  • Given $\tau$ and the current day, sample either $\lambda_1$ or $\lambda_2$
  • Calculate the probability for each draw using the joint posterior that we have available.

If you are interested in point estimates, you can optimize the whole posterior in a MAP approach (maximum a posteriori).

One problem, though, is that we might like to do inference on particular variables, such as the marginal of $\tau$, and that gets complicated, since it involves integrating the posterior over the other variables:

$ P(\tau \mid \text{Data}) = \int\!\!\int P(\tau, \lambda_1, \lambda_2 \mid \text{Data})\, d\lambda_1\, d\lambda_2 $

There are two ways to deal with this: actual math, or sampling approaches.

The book discusses sampling a bit: Monte Carlo techniques for drawing samples from complicated probability distributions. Another way to deal with it, also covered in the book, is the convenient probabilistic programming approach.
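
To make the sampling route concrete before handing the work to a library, here is a minimal hand-rolled Gibbs sampler sketch for this model, using the conditional Gamma distributions derived above. This is only an illustration (it is not how PyMC works internally); it assumes count_data and n_count_data are already defined as in the cell above, and it restricts the switchpoint so that each regime contains at least one day.

In [ ]:
# Rough Gibbs sampler sketch for the switchpoint model (illustration only).
np.random.seed(1)
C = count_data                      # the daily counts loaded earlier
N = n_count_data
alpha = 1.0 / C.mean()

tau, lam1, lam2 = N // 2, C.mean(), C.mean()   # crude starting values
lam1_s, lam2_s, tau_s = [], [], []
for _ in range(5000):
    # lambda_1, lambda_2 | tau, data: Gamma conditionals (numpy's gamma takes scale = 1/rate)
    lam1 = np.random.gamma(1 + C[:tau].sum(), 1.0 / (alpha + tau))
    lam2 = np.random.gamma(1 + C[tau:].sum(), 1.0 / (alpha + N - tau))
    # tau | lambda_1, lambda_2, data: a discrete distribution over candidate switchpoints
    t = np.arange(1, N)             # keep at least one day on each side of the switch
    S1 = np.cumsum(C)[:-1]          # total counts before each candidate switchpoint
    logp = (S1 * np.log(lam1) - t * lam1
            + (C.sum() - S1) * np.log(lam2) - (N - t) * lam2)
    p = np.exp(logp - logp.max())
    tau = np.random.choice(t, p=p / p.sum())
    lam1_s.append(lam1)
    lam2_s.append(lam2)
    tau_s.append(tau)

Histograms of lam1_s, lam2_s and tau_s should come out looking very similar to the posteriors that PyMC produces below.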

Probabilistic Programming

Many people can see the inherent complication of building good, nice models, and how the math can actually prevent people from trying these approaches.

Many intelligent people got together and created software that only requires you to specify your likelihood and priors; it will then do the hard work for you, giving nice plots and results.

In R, people typically use Stan (which also has interfaces for other languages), and in Python we use PyMC.

How do we work out our example


In [12]:
import pymc as pm

alpha = 1.0 / count_data.mean()  # Recall count_data is the
                               # variable that holds our txt counts
#We define our priors here
lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)

tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)
#The assignment function where we say that there are two domains for lambda given tau

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # lambda before tau is lambda1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda2
    return out
#The likelihood, where we indicate which values are observed
observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)

model = pm.Model([observation, lambda_1, lambda_2, tau])

mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)

lambda_1_samples = mcmc.trace('lambda_1')[:]
lambda_2_samples = mcmc.trace('lambda_2')[:]
tau_samples = mcmc.trace('tau')[:]


figsize(12.5, 10)
# histogram of the samples:

ax = plt.subplot(311)
ax.set_autoscaley_on(False)

plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_1$", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
    $\lambda_1,\;\lambda_2,\;\tau$""")
plt.xlim([15, 30])
plt.xlabel("$\lambda_1$ value")

ax = plt.subplot(312)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_2$", color="#7A68A6", normed=True)
plt.legend(loc="upper left")
plt.xlim([15, 30])
plt.xlabel("$\lambda_2$ value")

plt.subplot(313)
w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_count_data, alpha=1,
         label=r"posterior of $\tau$",
         color="#467821", weights=w, rwidth=2.)
plt.xticks(np.arange(n_count_data))

plt.legend(loc="upper left")
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$ (in days)")
plt.ylabel("probability");


 [-----------------100%-----------------] 40000 of 40000 complete in 6.8 sec

On first inspection, we can see that the posteriors of the two lambdas (the rates of texts per day) are clearly different: there was an increase in the use of the messaging service. The posterior on $\tau$ tells us that the change happened at approximately day 45.

To do this in a more formal way, we can calculate the expected number of text messages each day given that we know all of these variables.

The variable tau_samples holds the output of the Markov chain Monte Carlo run above: 30,000 retained samples (40,000 iterations minus 10,000 burn-in) from the posterior of $\tau$. Then, for each day, we check which $\tau$ samples are larger than that day (i.e., which samples place that day before the switchpoint).

With those indexes, we can calculate the expected number of texts per day, to obtain:


In [13]:
figsize(12.5, 5)
# tau_samples, lambda_1_samples, lambda_2_samples contain
# N samples from the corresponding posterior distribution
N = tau_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    # ix is a bool index of all tau samples corresponding to
    # the switchpoint occurring prior to value of 'day'
    ix = day < tau_samples
    # Each posterior sample corresponds to a value for tau.
    # for each day, that value of tau indicates whether we're "before"
    # (in the lambda1 "regime") or
    #  "after" (in the lambda2 "regime") the switchpoint.
    # by taking the posterior sample of lambda1/2 accordingly, we can average
    # over all samples to get an expected value for lambda on that day.
    # As explained, the "message count" random variable is Poisson distributed,
    # and therefore lambda (the poisson parameter) is the expected value of
    # "message count".
    expected_texts_per_day[day] = (lambda_1_samples[ix].sum()
                                   + lambda_2_samples[~ix].sum()) / N


plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color="#E24A33",
         label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Expected # text-messages")
plt.title("Expected number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
        label="observed texts per day")

plt.legend(loc="upper left");


So what if I do not have nice probability distributions?

Probabilistic programming techniques allow us to use more complicated distributions. You can always derive the expressions yourself, but that is ill-advised if you are not in for a stats PhD.

Maximum Entropy

This is a principle used to find good, somewhat informative priors for your distributions if you do not want to be bothered with conjugacy. It is not a very common method, and in most books it is a side note on how to construct prior distributions.

For the sake of completeness I'll put it here.

When we have some low-order statistics, for example the expected value, we can use them to pick a prior distribution via the entropy:

$ S = -\sum^N_{i=1} p_i \ln(p_i) $

If we want to find our prior $p_i$, we might also have access to a vague prior $m_i$; we can then extend the entropy to compare the two distributions (the relative entropy):

$ S = -\sum^N_{i=1} p_i \ln\!\left(\frac{p_i}{m_i}\right) $

For example, for a six-sided die, a vague prior would be the uniform distribution over the faces: $m_i = \frac{1}{6}$.

Once we have defined the entropy, we maximize it subject to the constraints that the $p_i$ sum to one and that the expected value matches whatever is well known beforehand.

This is just the theoretical justification for the choice I made in the example above; a small worked example follows.
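
As a small illustration (a sketch with a made-up constraint, in the spirit of the classic loaded-die exercise): suppose all we know about a die is that its expected face value is 4.5. Maximizing the relative entropy against the vague prior $m_i = 1/6$, subject to the normalization and expected-value constraints, gives the least committal distribution consistent with that knowledge.

In [ ]:
# Sketch: maximum-entropy prior for a die with a known mean (made-up constraint of 4.5).
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)
m = np.ones(6) / 6.0                  # vague prior: uniform over the faces

def neg_entropy(p):
    # negative of S = -sum p_i ln(p_i / m_i); minimizing this maximizes S
    return np.sum(p * np.log(p / m))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},           # probabilities sum to one
    {"type": "eq", "fun": lambda p: np.dot(p, faces) - 4.5},  # known expected value
]
result = minimize(neg_entropy, x0=m, bounds=[(1e-9, 1.0)] * 6,
                  constraints=constraints)
print(result.x)                       # the weights tilt toward the higher faces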

Bayesian Model Selection

If we do not have access to a score metric, it is difficult to assess whether one model is better or worse than another. To do this we can take the ratio of the posterior probabilities of the models given the data (the posterior odds):

$ O_{21} = \frac{p(M_2|D)}{p(M_1|D)} $

To obtain $P(M|D)$, you have to integrate over all the possible parameter values, and you get the probability of your model given your data.
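
Written out, the probability of the data under each model integrates out that model's parameters $\theta_k$, and the posterior odds factor into the ratio of those integrals (the Bayes factor) times the prior odds:

$ P(D \mid M_k) = \int P(D \mid \theta_k, M_k)\, P(\theta_k \mid M_k)\, d\theta_k, \qquad \frac{P(M_2 \mid D)}{P(M_1 \mid D)} = \frac{P(D \mid M_2)}{P(D \mid M_1)} \cdot \frac{P(M_2)}{P(M_1)} $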

