Here I will have some examples showing how to use Sampyl. This is for version 0.2.1. Let's import it and get started. Sampyl is a Python package used to sample from probability distributions using Markov Chain Monte Carlo (MCMC). This is most useful when sampling from the posterior distribution of a Bayesian model.

Every sampler provided by Sampyl works the same way. Define $ \log P(\theta) $ as a function, then pass it to the sampler class. The class returns a sampler object, which you can then use to sample from $P(\theta)$. For samplers which use the gradient, $\nabla_{\theta} \log P(\theta)$, Sampyl uses autograd to automatically calculate the gradients. However, you can pass in your own $\nabla_{\theta} \log P(\theta)$ functions.

Starting out simple, let's sample from a normal distribution.



In [1]:

    
%matplotlib inline

import matplotlib.pyplot as plt
import sampyl as smp
from sampyl import np

A normal distribution with mean $\mu$ and variance $\sigma^2$ is defined as:

$$ P(x,\mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \; \mathrm{Exp}\left( \frac{-(x - \mu)^2}{2\sigma^2} \right) $$

For numerical stability, it is typically better to deal with log probabilities, $\log{P(\theta)}$. Then for the normal distribution with known mean and variance,

$$ \log{P(x \mid \mu, \sigma)} = -\log{\sigma} - \frac{(x - \mu)^2}{2\sigma^2} $$

where we can drop constant terms since the MCMC samplers only require something proportional to $\log{P(\theta)}$. We can easily write this as a Python function.



In [2]:

    
mu, sig = 3, 2
def logp(x):
    return  -np.log(sig) - (x - mu)**2/(2*sig**2)

First we'll use a Metropolis-Hastings sampler. Each sampler requires a $\log{P(\theta)}$ function and a starting state. We have included a function to calculate the maximum a posteriori (MAP) to find the peak of the distribution for use as the starting state. Then you call the sample method and a chain of samples is returned.



In [6]:

    
start = smp.find_MAP(logp, {'x':1.})
metro = smp.Metropolis(logp, start)
chain = metro.sample(10000, burn=2000, thin=4)









    



Progress: [##############################] 10000 of 10000 samples

We can retrieve the chain by accessing the attributes defined by the parameter name(s) of logp.



In [7]:

    
plt.plot(chain.x)









    Out[7]:





[<matplotlib.lines.Line2D at 0x10780c5d0>]



In [8]:

    
_ = plt.hist(chain.x, bins=30)
_ = plt.vlines(mu, 0, 250, linestyles='--')

Here we have sampled from a normal distribution with a mean of 3, indicated with the dashed vertical line.

There is also a No-U-Turn Sampler (NUTS), which avoids the random-walk nature of Metropolis samplers. NUTS uses the gradient of $\log{P(\theta)}$ to make intelligent state proposals. You'll notice here that we don't pass in any information about the gradient. Instead, it is calculated automatically with autograd.



In [9]:

    
nuts = smp.NUTS(logp, start)
chain = nuts.sample(2100, burn=100)









    



Progress: [##############################] 2100 of 2100 samples



In [10]:

    
plt.plot(chain)









    Out[10]:





[<matplotlib.lines.Line2D at 0x107bc2890>]



In [11]:

    
_ = plt.hist(chain.x, bins=30)
_ = plt.vlines(mu, 0, 250, linestyles='--')

Let's try something a little more complicated. Let's say you run a business and you put an advertisement in the paper. Then, to judge the effectiveness of the ad, you want to compare the number of incoming phone calls per hour before and after the placement of the add. Then we can build a Bayesian model using a Poisson likelihood with exponential priors for $\lambda_1$ and $\lambda_2$.

\begin{align} P(\lambda_1, \lambda_2 \mid D) &\propto P( D \mid \lambda_1, \lambda_2)\, P(\lambda_1)\, P(\lambda_2) \\ P( D \mid \lambda_1, \lambda_2) &\sim \mathrm{Poisson}(D\mid\lambda_1)\,\mathrm{Poisson}(D\mid\lambda_2) \\ P(\lambda_1) &\sim \mathrm{Exp}(1) \\ P(\lambda_2) &\sim \mathrm{Exp}(1) \end{align}

This analysis method is known as Bayesian inference or Bayesian estimation. We want to know likely values for $\lambda_1$ and $\lambda_2$. This information is contained in the posterior distribution $P(\lambda_1, \lambda_2 \mid D)$. To infer values for $\lambda_1$ and $\lambda_2$, we can sample from the posterior using our MCMC samplers.



In [12]:

    
# Fake data for the day before and after placing the ad.
# We'll make the calls increase by 2 an hour. Record data for each
# hour over two work days.
before = np.random.poisson(7, size=16)
after = np.random.poisson(9, size=16)

# Define the log-P function here
def logp(lam1, lam2):
    # Rates for Poisson must be > 0
    if lam1 <= 0 or lam2 <=0:
        return -np.inf
    else:
        # logps for likelihoods
        llh1 = np.sum(before*np.log(lam1)) - before.size*lam1
        llh2 = np.sum(after*np.log(lam2)) - after.size*lam2
        
        # Exponential priors for lams
        lam1_prior = -lam1
        lam2_prior = -lam2
        return llh1 + llh2 + lam1_prior + lam2_prior



In [13]:

    
start = smp.find_MAP(logp, {'lam1':1., 'lam2':1.})
metro = smp.Metropolis(logp, start)
trace = metro.sample(10000, burn=2000, thin=4)









    



Progress: [##############################] 10000 of 10000 samples

Sampling returns a numpy record array which you can use to access samples by name. Variable names are taken directly from the argument list of logp.



In [14]:

    
print(metro.var_names)









    



('lam1', 'lam2')



In [15]:

    
plt.plot(trace.lam1)
plt.plot(trace.lam2)









    Out[15]:





[<matplotlib.lines.Line2D at 0x107ad5d90>]

Now to see if there is a significant difference between the two days. We can find the difference $\delta = \lambda_2 - \lambda_1$, then find the probability that $\delta > 0$.



In [16]:

    
delta = trace.lam2 - trace.lam1
_ = plt.hist(delta, bins=30)
_ = plt.vlines(2, 0, 250, linestyle='--')



In [17]:

    
p = np.mean(delta > 0)
effect = np.mean(delta)
CR = np.percentile(delta, (2.5, 97.5))
print("{:.3f} probability the rate of phone calls increased".format(p))
print("delta = {:.3f}, 95% CR = {{{:.3f} {:.3f}}}".format(effect, *CR))









    



0.921 probability the rate of phone calls increased
delta = 1.352, 95% CR = {-0.452 3.253}

There true difference in rates is two per hour, marked with the dashed line. Our posterior is showing an effect, but our best estimate is that the rate increased by only one call per hour. The 95% credible region is {-0.865 2.982} which idicates that there is a 95% probability that the true effect lies with the region, as it indeed does.

We can alo use NUTS to sample from the posterior.



In [18]:

    
nuts = smp.NUTS(logp, start)
trace = nuts.sample(2100, burn=100)
_ = plt.plot(trace.lam1)
_ = plt.plot(trace.lam2)









    



Progress: [##############################] 2100 of 2100 samples






    



/Users/mat/anaconda3/envs/py2/lib/python2.7/site-packages/autograd/core.py:39: UserWarning: Output seems independent of input. Returning zero gradient.
  warnings.warn("Output seems independent of input. Returning zero gradient.")



In [20]:

    
delta = trace.lam2 - trace.lam1
_ = plt.hist(delta, bins=30)
_ = plt.vlines(2, 0, 250, linestyle='--')
p = np.mean(delta > 0)
effect = np.mean(delta)
CR = np.percentile(delta, (2.5, 97.5))
print("{:.3f} probability the rate of phone calls increased".format(p))
print("delta = {:.3f}, 95% CR = {{{:.3f} {:.3f}}}".format(effect, *CR))









    



0.916 probability the rate of phone calls increased
delta = 1.288, 95% CR = {-0.602 3.173}

When you build larger models, it would be cumbersome to have to include every parameter as an argument in the logp function. To avoid this, you can declare the size of variables when passing in the starting state.

For instance, with a linear model it would be great to pass the coefficients as one parameter. First, we'll make some fake data, then infer the coefficients.



In [29]:

    
N = 200
# True coefficients
true_B = np.array([1,2,3,4,5])

# Features, including a constant
X = np.ones((N,5))
X[:,:-1] = np.random.rand(N, 4)

# Outcomes
y = np.dot(X, true_B) + np.random.randn(len(X))



In [30]:

    
from sampyl.priors import bound, prior_map, uniform
# Here, b is a length 5 array of coefficients
def logp(b, sig):
    if sig <=0:
        return -np.inf
    n = len(y)
    y_hat = np.dot(X, b)
    likelihood = -n*np.log(sig) - \
                  np.sum((y - y_hat)**2)/(2*sig**2)
    
    # Jeffrey's prior for sigma
    prior_sig = -np.log(sig)
    
    # Uniform priors on coefficients
    prior_b = prior_map(uniform, b, lower=-5, upper=10).sum()
    
    return likelihood + prior_sig + prior_b



In [31]:

    
start = smp.find_MAP(logp, {'b': np.ones(5), 'sig': 1.}, 
                     bounds={'b':(-5, 10), 'sig':(0.01, None)})
metro = smp.Metropolis(logp, start)
chain = metro.sample(20000, burn=5000, thin=4)









    



Progress: [##############################] 20000 of 20000 samples



In [32]:

    
_ = plt.plot(chain.b)

And using NUTS too.



In [36]:

    
start = smp.find_MAP(logp, {'b': np.ones(5), 'sig': 1.}, 
                     bounds={'b':(-5, 10), 'sig':(0.01, None)})
nuts = smp.NUTS(logp, start)
chain = nuts.sample(2100, burn=100)









    



Progress: [##############################] 2100 of 2100 samples



In [37]:

    
_ = plt.plot(chain.b)



In [38]:

    
_ = plt.plot(chain.sig)

The future

Write a module that makes it easier to build logp functions from distributions
Add various functions such as autocorrelation, HPD, etc.
Plots!