Here are some examples showing how to use Sampyl (this is for version 0.2.1). Sampyl is a Python package for sampling from probability distributions using Markov chain Monte Carlo (MCMC). This is most useful when sampling from the posterior distribution of a Bayesian model. Let's import it and get started.
Every sampler provided by Sampyl works the same way. Define $ \log P(\theta) $ as a function, then pass it to the sampler class; instantiating the class returns a sampler object, which you then use to draw samples from $P(\theta)$. For samplers that use the gradient, $\nabla_{\theta} \log P(\theta)$, Sampyl uses autograd to calculate the gradients automatically. However, you can also pass in your own $\nabla_{\theta} \log P(\theta)$ functions.
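For example, here is a minimal sketch of that workflow with a hand-written gradient. The grad_logp keyword is an assumption about the sampler signature and may differ between Sampyl versions, so treat this as an illustration rather than a verbatim recipe.
import sampyl as smp
from sampyl import np
# log P(x) for a standard normal, up to an additive constant
def logp(x):
    return -0.5*x**2
# Hand-written gradient of logp; without it, autograd computes this for us
def grad_logp(x):
    return -x
start = smp.find_MAP(logp, {'x': 0.})
# Passing the gradient via grad_logp= is an assumed keyword; check the docs for your version
nuts = smp.NUTS(logp, start, grad_logp=grad_logp)
chain = nuts.sample(1000, burn=100)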
Starting out simple, let's sample from a normal distribution.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import sampyl as smp
from sampyl import np
A normal distribution with mean $\mu$ and variance $\sigma^2$ is defined as:
$$ P(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \; \mathrm{Exp}\left( \frac{-(x - \mu)^2}{2\sigma^2} \right) $$
For numerical stability, it is typically better to deal with log probabilities, $\log{P(\theta)}$. Then for the normal distribution with known mean and variance,
$$ \log{P(x \mid \mu, \sigma)} = -\log{\sigma} - \frac{(x - \mu)^2}{2\sigma^2} $$
where we can drop constant terms since the MCMC samplers only require something proportional to $\log{P(\theta)}$. We can easily write this as a Python function.
In [2]:
mu, sig = 3, 2
def logp(x):
    return -np.log(sig) - (x - mu)**2/(2*sig**2)
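As a quick sanity check, the log probability peaks at $x = \mu$ and falls off symmetrically around it:
print(logp(mu))                      # -log(2), about -0.693
print(logp(mu - 1), logp(mu + 1))    # both about -0.818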
First we'll use a Metropolis-Hastings sampler. Each sampler requires a $\log{P(\theta)}$ function and a starting state. We have included a function to calculate the maximum a posteriori (MAP) estimate, which finds the peak of the distribution for use as the starting state. Then you call the sample method and a chain of samples is returned.
In [6]:
start = smp.find_MAP(logp, {'x':1.})
metro = smp.Metropolis(logp, start)
chain = metro.sample(10000, burn=2000, thin=4)
We can access the samples in the chain through attributes named after the parameter(s) of logp.
In [7]:
plt.plot(chain.x)
Out[7]:
In [8]:
_ = plt.hist(chain.x, bins=30)
_ = plt.vlines(mu, 0, 250, linestyles='--')
Here we have sampled from a normal distribution with a mean of 3, indicated with the dashed vertical line.
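We can also summarize the chain numerically; the sample mean and standard deviation should be close to the true values $\mu = 3$ and $\sigma = 2$ (exact numbers vary from run to run):
# Posterior summaries from the chain
print(chain.x.mean(), chain.x.std())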
There is also a No-U-Turn Sampler (NUTS), which avoids the random-walk nature of Metropolis samplers. NUTS uses the gradient of $\log{P(\theta)}$ to make intelligent state proposals. You'll notice here that we don't pass in any information about the gradient. Instead, it is calculated automatically with autograd.
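To see what autograd is doing under the hood, we can build the gradient of our logp function ourselves. For this logp the derivative is $-(x - \mu)/\sigma^2$, so at $x = 1$ it should be $0.5$. This snippet only illustrates the machinery NUTS uses internally (autograd is a dependency of Sampyl):
from autograd import grad
dlogp = grad(logp)   # autograd constructs the derivative of logp automatically
print(dlogp(1.))     # -(1 - 3)/2**2 = 0.5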
In [9]:
nuts = smp.NUTS(logp, start)
chain = nuts.sample(2100, burn=100)
In [10]:
plt.plot(chain.x)
Out[10]:
In [11]:
_ = plt.hist(chain.x, bins=30)
_ = plt.vlines(mu, 0, 250, linestyles='--')
Let's try something a little more complicated. Say you run a business and you put an advertisement in the paper. To judge the effectiveness of the ad, you want to compare the number of incoming phone calls per hour before and after its placement. We can build a Bayesian model using a Poisson likelihood for the counts, with exponential priors for the rates $\lambda_1$ and $\lambda_2$.
\begin{align} P(\lambda_1, \lambda_2 \mid D) &\propto P( D \mid \lambda_1, \lambda_2)\, P(\lambda_1)\, P(\lambda_2) \\ P( D \mid \lambda_1, \lambda_2) &\sim \mathrm{Poisson}(D\mid\lambda_1)\,\mathrm{Poisson}(D\mid\lambda_2) \\ P(\lambda_1) &\sim \mathrm{Exp}(1) \\ P(\lambda_2) &\sim \mathrm{Exp}(1) \end{align}
This analysis method is known as Bayesian inference or Bayesian estimation. We want to know likely values for $\lambda_1$ and $\lambda_2$. This information is contained in the posterior distribution $P(\lambda_1, \lambda_2 \mid D)$. To infer values for $\lambda_1$ and $\lambda_2$, we can sample from the posterior using our MCMC samplers.
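Writing out the Poisson log-likelihood makes the translation to code straightforward. For counts $k_i$ observed over $n$ hours with rate $\lambda$,
$$ \log P(D \mid \lambda) = \sum_i \left( k_i \log{\lambda} - \lambda - \log{k_i!} \right) = \log{\lambda} \sum_i k_i - n\lambda + \mathrm{const}, $$
where the $\log{k_i!}$ terms can be dropped since they do not depend on $\lambda$. Likewise, the $\mathrm{Exp}(1)$ priors contribute $\log P(\lambda) = -\lambda$ up to a constant.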
In [12]:
# Fake data for the day before and after placing the ad.
# We'll make the calls increase by 2 an hour. Record data for each
# hour over two work days.
before = np.random.poisson(7, size=16)
after = np.random.poisson(9, size=16)
# Define the log-P function here
def logp(lam1, lam2):
    # Rates for Poisson must be > 0
    if lam1 <= 0 or lam2 <= 0:
        return -np.inf
    else:
        # Log-likelihoods for the before and after counts
        llh1 = np.sum(before*np.log(lam1)) - before.size*lam1
        llh2 = np.sum(after*np.log(lam2)) - after.size*lam2
        # Exponential(1) priors for the rates
        lam1_prior = -lam1
        lam2_prior = -lam2
        return llh1 + llh2 + lam1_prior + lam2_prior
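As a quick check, the log posterior is finite for positive rates and $-\infty$ otherwise:
print(logp(7., 9.))    # a finite value, depends on the simulated data
print(logp(-1., 9.))   # -inf, since the rates must be positive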
In [13]:
start = smp.find_MAP(logp, {'lam1':1., 'lam2':1.})
metro = smp.Metropolis(logp, start)
trace = metro.sample(10000, burn=2000, thin=4)
Sampling returns a numpy record array which you can use to access samples by name. Variable names are taken directly from the argument list of logp.
In [14]:
print(metro.var_names)
In [15]:
plt.plot(trace.lam1)
plt.plot(trace.lam2)
Out[15]:
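The posterior means give point estimates of the call rates before and after the ad (exact values depend on the randomly generated data):
print(trace.lam1.mean(), trace.lam2.mean())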
Now let's see if there is a significant difference between the two days. We can compute the difference $\delta = \lambda_2 - \lambda_1$, then find the probability that $\delta > 0$.
In [16]:
delta = trace.lam2 - trace.lam1
_ = plt.hist(delta, bins=30)
_ = plt.vlines(2, 0, 250, linestyle='--')
In [17]:
p = np.mean(delta > 0)
effect = np.mean(delta)
CR = np.percentile(delta, (2.5, 97.5))
print("{:.3f} probability the rate of phone calls increased".format(p))
print("delta = {:.3f}, 95% CR = {{{:.3f} {:.3f}}}".format(effect, *CR))
The true difference in rates is two calls per hour, marked with the dashed line. Our posterior shows an effect, but our best estimate is that the rate increased by only one call per hour. The 95% credible region is {-0.865 2.982}, which indicates that there is a 95% probability that the true effect lies within the region, as it indeed does.
We can also use NUTS to sample from the posterior.
In [18]:
nuts = smp.NUTS(logp, start)
trace = nuts.sample(2100, burn=100)
_ = plt.plot(trace.lam1)
_ = plt.plot(trace.lam2)
In [20]:
delta = trace.lam2 - trace.lam1
_ = plt.hist(delta, bins=30)
_ = plt.vlines(2, 0, 250, linestyle='--')
p = np.mean(delta > 0)
effect = np.mean(delta)
CR = np.percentile(delta, (2.5, 97.5))
print("{:.3f} probability the rate of phone calls increased".format(p))
print("delta = {:.3f}, 95% CR = {{{:.3f} {:.3f}}}".format(effect, *CR))
When you build larger models, it would be cumbersome to have to include every parameter as an argument in the logp function. To avoid this, you can declare the size of variables when passing in the starting state.
For instance, with a linear model it would be great to pass the coefficients as one parameter. First, we'll make some fake data, then infer the coefficients.
In [29]:
N = 200
# True coefficients
true_B = np.array([1,2,3,4,5])
# Features, including a constant
X = np.ones((N,5))
X[:,:-1] = np.random.rand(N, 4)
# Outcomes
y = np.dot(X, true_B) + np.random.randn(len(X))
In [30]:
from sampyl.priors import bound, prior_map, uniform
# Here, b is a length 5 array of coefficients
def logp(b, sig):
    if sig <= 0:
        return -np.inf
    n = len(y)
    y_hat = np.dot(X, b)
    # Gaussian log-likelihood for the outcomes, up to a constant
    likelihood = -n*np.log(sig) - np.sum((y - y_hat)**2)/(2*sig**2)
    # Jeffreys prior for sigma
    prior_sig = -np.log(sig)
    # Uniform priors on the coefficients
    prior_b = prior_map(uniform, b, lower=-5, upper=10).sum()
    return likelihood + prior_sig + prior_b
In [31]:
start = smp.find_MAP(logp, {'b': np.ones(5), 'sig': 1.},
bounds={'b':(-5, 10), 'sig':(0.01, None)})
metro = smp.Metropolis(logp, start)
chain = metro.sample(20000, burn=5000, thin=4)
In [32]:
_ = plt.plot(chain.b)
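Averaging over the chain recovers the coefficients, which should be close to true_B = [1, 2, 3, 4, 5] up to sampling noise:
# Posterior mean of each coefficient; compare with true_B
print(chain.b.mean(axis=0))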
And using NUTS too.
In [36]:
start = smp.find_MAP(logp, {'b': np.ones(5), 'sig': 1.},
bounds={'b':(-5, 10), 'sig':(0.01, None)})
nuts = smp.NUTS(logp, start)
chain = nuts.sample(2100, burn=100)
In [37]:
_ = plt.plot(chain.b)
In [38]:
_ = plt.plot(chain.sig)
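Since the fake data were generated with unit-variance Gaussian noise (np.random.randn), the posterior for sig should concentrate near 1:
print(chain.sig.mean())   # should be close to 1, the true noise scale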