Bayesian Statistics (Unfinished)

In this notebook, we discuss Bayesian statistics from a computational/practical point of view. We borrow some code examples from chapter 2 of Bayesian Analysis with Python (2nd edn) by Osvaldo Martin.


In [1]:
%matplotlib inline
import sklearn
import scipy.stats as stats
import scipy.optimize
import matplotlib.pyplot as plt
import seaborn as sns
import time
import numpy as np
import os
import pandas as pd



In [2]:
# We install various packages for approximate Bayesian inference
# To avoid installing packages from the internet every time you open a colab,
# you can use this trick:
# https://stackoverflow.com/questions/55253498/how-do-i-install-a-library-permanently-in-colab

# The PyMC3 package (https://docs.pymc.io) supports HMC and variational inference
# https://docs.pymc.io/notebooks/api_quickstart.html
!pip install pymc3==3.8
import pymc3 as pm
pm.__version__

# The arviz package (https://github.com/arviz-devs/arviz) can be used to make various plots
# of posterior samples generated by any algorithm. 
!pip install arviz
import arviz as az


Successfully installed arviz-0.7.0 cftime-1.1.3 netcdf4-1.5.3 pymc3-3.8

Beta-Binomial model

Exact inference
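
For a Beta prior and a Bernoulli/Binomial likelihood, the posterior is available in closed form: if $\theta \sim \text{Beta}(\alpha, \beta)$ and we observe $y$ successes in $N$ trials, then

$p(\theta | y, N) = \text{Beta}(\theta | \alpha + y, \beta + N - y)$

The cells below plot the Binomial likelihood, the Beta prior, and the resulting posterior for various settings.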


In [0]:
# Plot the Binomial likelihood

n_params = [1, 2, 4]  # Number of trials
p_params = [0.25, 0.5, 0.75]  # Probability of success

x = np.arange(0, max(n_params)+1)
f,ax = plt.subplots(len(n_params), len(p_params), sharex=True, sharey=True,
                    figsize=(8, 7), constrained_layout=True)

for i in range(len(n_params)):
    for j in range(len(p_params)):
        n = n_params[i]
        p = p_params[j]

        y = stats.binom(n=n, p=p).pmf(x)

        ax[i,j].vlines(x, 0, y, colors='C0', lw=5)
        ax[i,j].set_ylim(0, 1)
        ax[i,j].plot(0, 0, label="N = {:3.2f}\nθ = {:3.2f}".format(n,p), alpha=0)
        ax[i,j].legend()

        ax[2,1].set_xlabel('y')
        ax[1,0].set_ylabel('p(y | θ, N)')
        ax[0,0].set_xticks(x)



In [0]:
# Plot the beta prior

params = [0.5, 1, 2, 3]
x = np.linspace(0, 1, 100)
f, ax = plt.subplots(len(params), len(params), sharex=True, sharey=True,
                     figsize=(8, 7), constrained_layout=True)
for i in range(4):
    for j in range(4):
        a = params[i]
        b = params[j]
        y = stats.beta(a, b).pdf(x)
        ax[i,j].plot(x, y)
        ax[i,j].plot(0, 0, label="α = {:2.1f}\nβ = {:2.1f}".format(a, b), alpha=0)
        ax[i,j].legend()
ax[1,0].set_yticks([])
ax[1,0].set_xticks([0, 0.5, 1])
f.text(0.5, 0.05, 'θ', ha='center')
f.text(0.07, 0.5, 'p(θ)', va='center', rotation=0)



In [0]:
# Compute and plot posterior (black vertical line = true parameter value)

plt.figure(figsize=(10, 8))

n_trials = [0, 1, 2, 3, 4, 8, 16, 32, 50, 150]
data = [0, 1, 1, 1, 1, 4, 6, 9, 13, 48]
theta_real = 0.35

beta_params = [(1, 1), (20, 20), (1, 4)]
dist = stats.beta
x = np.linspace(0, 1, 200)

for idx, N in enumerate(n_trials):
    if idx == 0:
        plt.subplot(4, 3, 2)
        plt.xlabel('θ')
    else:
        plt.subplot(4, 3, idx+3)
        plt.xticks([])
    y = data[idx]
    for (a_prior, b_prior) in beta_params:
        p_theta_given_y = dist.pdf(x, a_prior + y, b_prior + N - y)
        plt.fill_between(x, 0, p_theta_given_y, alpha=0.7)

    plt.axvline(theta_real, ymax=0.3, color='k')
    plt.plot(0, 0, label=f'{N:4d} trials\n{y:4d} heads', alpha=0)
    plt.xlim(0, 1)
    plt.ylim(0, 12)
    plt.legend()
    plt.yticks([])
plt.tight_layout()


Credible intervals


In [3]:
# We illustrate how to compute a 95% posterior credible interval for a random variable
# with a beta distribution.

from scipy.stats import beta

np.random.seed(42)
theta_real = 0.35
ntrials = 100
data = stats.bernoulli.rvs(p=theta_real, size=ntrials)

N = ntrials; N1 = sum(data); N0 = N-N1; # Sufficient statistics
aprior = 1; bprior = 1; # prior
apost = aprior + N1; bpost = bprior + N0 # posterior

# Interval function
alpha = 0.05
CI1 = beta.interval(1-alpha, apost, bpost)
print('{:0.2f}--{:0.2f}'.format(CI1[0], CI1[1])) # 0.24--0.42

# Use the inverse CDF (percent point function)
l  = beta.ppf(alpha/2, apost, bpost)
u  = beta.ppf(1-alpha/2, apost, bpost)
CI2 = (l,u)
print('{:0.2f}--{:0.2f}'.format(CI2[0], CI2[1])) # 0.24--0.42

# Use Monte Carlo sampling
samples = beta.rvs(apost, bpost, size=10000)
samples = np.sort(samples)
CI3 = np.percentile(samples, 100*np.array([alpha/2, 1-alpha/2])) 
print('{:0.2f}--{:0.2f}'.format(CI3[0], CI3[1])) # 0.24--0.42 (Monte Carlo approximation)
print(np.mean(samples))


0.24--0.42
0.24--0.42
0.24--0.42
0.3230834644012448

In [4]:
# Use arviz to plot the posterior.
# By default, it shows the 94% interval, but we change it to 95%.

az.plot_posterior({'θ':samples}, credible_interval=0.95);



In [5]:
# See if the parameter is inside the region of practical equivalence centered at 0.5
az.plot_posterior(samples, rope=[0.45, .55]);



In [6]:
# From the above plot, we see that the HPD does not overlap the ROPE,
# so we can confidently say the parameter is "significantly different" from 0.5

# We can also verify this by checking if 0.5 is in the HPD
az.plot_posterior(samples,  credible_interval=0.95, ref_val=0.5);



In [19]:
# Summarize posterior samples
az.summary(samples)
# We can ignore the warning about not having enough 'chains',
# since we are drawing exact iid samples from the posterior.


arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 10000), minimum_shape: (chains=2, draws=4)
Out[19]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
x 0.323 0.046 0.239 0.41 0.041 0.04 1.0 1.0 1.0 10.0 NaN

Point estimates

We minimize the posterior expected loss, using L2 loss (estimator is posterior mean) or L1 loss (estimator is posterior median).


In [0]:
grid = np.linspace(0, 1, 200)
θ_pos = samples #trace['θ']
lossf_a = [np.mean(abs(i - θ_pos)) for i in grid]
lossf_b = [np.mean((i - θ_pos)**2) for i in grid]

for lossf, c in zip([lossf_a, lossf_b], ['C0', 'C1']):
    mini = np.argmin(lossf)
    plt.plot(grid, lossf, c)
    plt.plot(grid[mini], lossf[mini], 'o', color=c)
    plt.annotate('{:.3f}'.format(grid[mini]),
                 (grid[mini], lossf[mini] + 0.03), color=c)
    plt.yticks([])
    plt.xlabel(r'$\hat \theta$')
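
As a quick sanity check (a minimal sketch, assuming the `samples` array from the credible-interval cell above), the two grid minimizers should roughly match the posterior median and mean computed directly from the samples:

In [0]:
# The posterior median minimizes the expected L1 loss,
# and the posterior mean minimizes the expected L2 loss.
print('posterior median (minimizes L1 loss): {:.3f}'.format(np.median(samples)))
print('posterior mean   (minimizes L2 loss): {:.3f}'.format(np.mean(samples)))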


MCMC inference

We will use pymc3 to approximate the posterior of this simple model. Code is based on this notebook.


In [7]:
data # same as above


Out[7]:
array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0])

In [9]:
with pm.Model() as model:
    # a priori
    θ = pm.Beta('θ', alpha=aprior, beta=bprior)
    # likelihood
    y = pm.Bernoulli('y', p=θ, observed=data)

pm.model_to_graphviz(model) # show the DAG


Out[9]:
[Graphviz DAG: θ ~ Beta → y ~ Bernoulli, plate over the 100 observations]

In [10]:
# run MCMC (defaults to 2 chains)
with model:
    trace = pm.sample(1000, random_seed=123)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [θ]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 1155.94it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 1500/1500 [00:00<00:00, 1877.55it/s]

In [13]:
# Standard MCMC diagnostics
pm.traceplot(trace);
Rhat = pm.rhat(trace);
print(Rhat) # should be close to 1.0


<xarray.Dataset>
Dimensions:  ()
Data variables:
    θ        float64 1.006

In [14]:
pm.plot_posterior(trace);
# The samples from MCMC (called "trace") should be similar to the exact
# iid samples from the posterior, plotted above
# Under the hood, pm.plot_posterior(trace) calls az.plot_posterior(trace)



In [21]:
# Summarize posterior samples
pm.summary(trace)


Out[21]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
θ 0.324 0.045 0.237 0.407 0.002 0.001 867.0 867.0 851.0 1292.0 1.01

In [20]:
# Convert the posterior samples into an Empirical approximation object
trace_approx = pm.Empirical(trace, model=model)
# Now plot samples from this distribution
az.plot_posterior(trace_approx.sample(1000));


Variational inference

We use automatic differentiation variational inference (ADVI). Details can be found at https://docs.pymc.io/notebooks/variational_api_quickstart.html


In [25]:
niter = 10000
with model:
    post = pm.fit(niter, method='advi'); # mean field approximation


Average Loss = 64.881: 100%|██████████| 10000/10000 [00:03<00:00, 2560.06it/s]
Finished [100%]: Average Loss = 64.882

In [26]:
# Plot negative ELBO vs iteration to assess convergence
plt.plot(post.hist);



In [27]:
pm.plot_posterior(post.sample(1000));
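
PyMC3's `pm.fit` also supports other approximating families. As a minimal sketch (not part of the original notebook; for this one-parameter model it gives essentially the same answer as mean-field ADVI), a full-rank Gaussian approximation can be fit as follows:

In [0]:
# Full-rank ADVI fits a Gaussian with a full covariance matrix,
# instead of the factorized (mean-field) Gaussian used by method='advi'.
with model:
    post_fr = pm.fit(niter, method='fullrank_advi')
pm.plot_posterior(post_fr.sample(1000));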


1d Gaussian

Exact inference

For simplicity, we assume the variance is known, and we just want to infer the mean.
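
With a Gaussian prior $\mu \sim \mathcal{N}(m_0, \tau_0^2)$ and likelihood $x_i \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, Bayes' rule for Gaussians gives the closed-form posterior

$p(\mu|D) = \mathcal{N}(\mu | m_N, \tau_N^2), \quad \frac{1}{\tau_N^2} = \frac{1}{\tau_0^2} + \frac{N}{\sigma^2}, \quad m_N = \tau_N^2 \left(\frac{m_0}{\tau_0^2} + \frac{N \bar{x}}{\sigma^2}\right)$

which is exactly what the code below computes.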


In [36]:
np.random.seed(0)
N = 100
x = np.random.randn(N)

# Parameters of prior
mu_prior = 1.1
sigma_prior = 1.2 #std
Sigma_prior = sigma_prior**2 #var

# Parameters of likelihood
sigma_x = 1.3
Sigma_x = sigma_x**2

# Bayes rule for Gaussians 
Sigma_post = 1/( 1/Sigma_prior + N/Sigma_x )
xbar = np.mean(x)
mu_post = Sigma_post * (1/Sigma_x * N * xbar + 1/Sigma_prior * mu_prior);
print('p(mu|D)=N(mu|{:.3f}, {:.3f})'.format(mu_post, Sigma_post))


p(mu|D)=N(mu|0.072, 0.017)

MCMC inference

Initially we assume the variance is known, so we can compare the results to the exact inference above.


In [31]:
with pm.Model() as model:
    mu = pm.Normal('mu', mu=mu_prior, sd=sigma_prior)
    obs = pm.Normal('obs', mu=mu, sd=sigma_x, observed=x)
    mcmc_samples = pm.sample(1000, tune=500) # mcmc
    
pm.plot_posterior(mcmc_samples);


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [mu]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:00<00:00, 1881.55it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 1500/1500 [00:00<00:00, 1957.38it/s]
The acceptance probability does not match the target. It is 0.8864843167489614, but should be close to 0.8. Try to increase the number of tuning steps.

In [37]:
vals = mcmc_samples.get_values('mu')
mu_post_mcmc = np.mean(vals)
Sigma_post_mcmc = np.var(vals)
print('pMCMC(mu|D)=N(mu|{:.3f}, {:.3f})'.format(mu_post_mcmc, Sigma_post_mcmc))
assert np.isclose(mu_post, mu_post_mcmc, atol=1e-1)
assert np.isclose(Sigma_post, Sigma_post_mcmc, atol=1e-1)


pMCMC(mu|D)=N(mu|0.069, 0.016)

In [0]:
# We can also evaluate the log joint at any given point in parameter space.
# The 'obs' variable is already observed (value=x)
# so the only unknown is mu. Let's clamp it to some value
# and compute log p(mu, D)
mu_clamped = -0.5    
logp = model.logp({'mu': mu_clamped})

# Compute the log joint "by hand"
log_prior = scipy.stats.norm(mu_prior, sigma_prior).logpdf(mu_clamped)
log_lik  = np.sum(scipy.stats.norm(mu_clamped, sigma_x).logpdf(x))
log_joint = log_prior + log_lik

assert np.isclose(logp, log_joint)

Now we consider the case where the mean and variance are both unknown. We also switch to a "real world" dataset, of "chemical shifts", that has a couple of "outliers".


In [39]:
#url = 'https://github.com/aloctavodia/BAP/blob/master/code/data/chemical_shifts.csv'
url = 'https://raw.githubusercontent.com/aloctavodia/BAP/master/code/data/chemical_shifts.csv'

df = pd.read_csv(url)
# b=df.iloc[:,1:].values
#data = df.to_numpy() 
data = df.iloc[:,0].values
print(data.shape)
print(data)


(47,)
[55.12 53.73 50.24 52.05 56.4  48.45 52.34 55.65 51.49 51.86 63.43 53.
 56.09 51.93 52.31 52.33 57.48 57.44 55.14 53.93 54.62 56.09 68.58 51.36
 55.47 50.73 51.94 54.95 50.39 52.91 51.5  52.68 47.72 49.73 51.82 54.99
 52.84 53.19 54.52 51.46 53.73 51.61 49.81 52.42 54.3  53.84 53.16]

In [40]:
az.plot_kde(data, rug=True)
plt.yticks([0], alpha=0)



In [41]:
# We will infer a posterior for the mean and standard deviation.
# We use a uniform prior for the mean, with support slightly wider than the range of the data.
# We use a half-normal prior for the standard deviation; its scale of 10 comfortably covers 3*std of the data.
r = np.max(data)-np.min(data)
min_mu = np.min(data)-0.1*r
max_mu = np.max(data)+0.1*r
prior_std = 3*np.std(data)
print([min_mu, max_mu, prior_std])

with pm.Model() as model_g:
    μ = pm.Uniform('μ', lower=min_mu, upper=max_mu)
    σ = pm.HalfNormal('σ', sd=10)
    y = pm.Normal('y', mu=μ, sd=σ, observed=data)

pm.model_to_graphviz(model_g) # show the DAG


[45.634, 70.666, 10.312458871064944]
Out[41]:
[Graphviz DAG: μ ~ Uniform, σ ~ HalfNormal → y ~ Normal, plate over the 47 observations]

In [42]:
with model_g:
    trace_g = pm.sample(1000, random_seed=123)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [σ, μ]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 887.75it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 1474.21it/s]

In [43]:
az.plot_trace(trace_g);



In [44]:
az.summary(trace_g)


Out[44]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
μ 53.544 0.526 52.549 54.486 0.012 0.008 1929.0 1929.0 1929.0 1423.0 1.0
σ 3.566 0.387 2.835 4.266 0.010 0.008 1366.0 1297.0 1438.0 1157.0 1.0

In [46]:
az.plot_joint(trace_g, kind='kde', fill_last=False);


Posterior predictive checks


In [48]:
# We check how well the Gaussian assumption fits our data
# by sampling from the fitted model, and plotting the samples
# and the original data.
# For details, see https://docs.pymc.io/notebooks/posterior_predictive.html

# For the Gaussian model, the posterior predictive samples have a mean pulled slightly
# to the right of the bulk of the data and a larger spread, because the outliers inflate
# the estimates; this indicates a poor fit.
y_pred_g = pm.sample_posterior_predictive(trace_g, 100, model_g)

print(y_pred_g.keys())
v=y_pred_g['y']
print(type(v))
print(v.shape) # 100 x 47


/usr/local/lib/python3.6/dist-packages/pymc3/sampling.py:1247: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  "samples parameter is smaller than nchains times ndraws, some draws "
100%|██████████| 100/100 [00:00<00:00, 1094.21it/s]
dict_keys(['y'])
<class 'numpy.ndarray'>
(100, 47)
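
As a quick numeric check on the comment above (a minimal sketch, assuming `data` and `y_pred_g` from the cells above):

In [0]:
# Compare the mean/std of the observed data with the mean/std of the
# posterior predictive samples, pooled over all draws.
print('data:                 mean {:.2f}, std {:.2f}'.format(np.mean(data), np.std(data)))
print('posterior predictive: mean {:.2f}, std {:.2f}'.format(
    np.mean(y_pred_g['y']), np.std(y_pred_g['y'])))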


In [49]:
data_ppc = az.from_pymc3(trace=trace_g, posterior_predictive=y_pred_g)
ax = az.plot_ppc(data_ppc, figsize=(12, 6), mean=False);
ax[0].legend(fontsize=15)


arviz.data.io_pymc3 - WARNING - posterior predictive variable y's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented.

Robust likelihood (1d Student distribution)


In [35]:
# We replace the above Gaussian likelihood with a Student t distribution.
# The degrees-of-freedom parameter \nu > 0 (also called the "normality parameter")
# determines how close to Normal the distribution is:
# nu=1 corresponds to a Cauchy, and nu >> 10 is close to a Gaussian.
# We put an Exponential prior on nu, with a mean of 30.

with pm.Model() as model_t:
    μ = pm.Uniform('μ', 40, 75)
    σ = pm.HalfNormal('σ', sd=10)
    ν = pm.Exponential('ν', 1/30) # PyMC3 uses inverse of the mean
    y = pm.StudentT('y', mu=μ, sd=σ, nu=ν, observed=data)

pm.model_to_graphviz(model_t) # show the DAG

with model_t:
    trace_t = pm.sample(1000)

az.plot_trace(trace_t)

az.summary(trace_t)
# We see that E[nu]=4.5, which is fairly far from Gaussian.
# We see E[sigma]=2.2, which is less than the 3.6 estimate from the Gaussian likelihood.


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [ν, σ, μ]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 850.29it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 857.23it/s]
Out[35]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
μ 53.077 0.388 52.340 53.774 0.010 0.007 1365.0 1365.0 1366.0 867.0 1.0
σ 2.188 0.403 1.507 2.977 0.012 0.009 1043.0 1043.0 1035.0 1275.0 1.0
ν 4.500 3.510 1.270 9.556 0.115 0.081 933.0 933.0 1095.0 1125.0 1.0

In [36]:
# posterior predictive check

y_ppc_t = pm.sample_posterior_predictive(
    trace_t, 100, model_t, random_seed=123)
y_pred_t = az.from_pymc3(trace=trace_t, posterior_predictive=y_ppc_t)
ax = az.plot_ppc(y_pred_t, figsize=(12, 6), mean=False)
ax[0].legend(fontsize=15)
plt.xlim(40, 70)


/usr/local/lib/python3.6/dist-packages/pymc3/sampling.py:1247: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  "samples parameter is smaller than nchains times ndraws, some draws "
100%|██████████| 100/100 [00:00<00:00, 1055.67it/s]
arviz.data.io_pymc3 - WARNING - posterior predictive variable y's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented.

In [38]:
# Remove outliers from data and compare empirical mean and variance of cleaned data
# to posterior mean and posterior scale of a Student likelihood

# https://gist.github.com/vishalkuo/f4aec300cf6252ed28d3
def removeOutliers(x, outlierConstant=1.5):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    result = a[np.where((a >= quartileSet[0]) & (a <= quartileSet[1]))]
    return result.tolist()

data_clean = removeOutliers(data)
mu_mcmc = np.mean(trace_t.get_values('μ'))
sigma_mcmc = np.mean(trace_t.get_values('σ'))

print([np.mean(data), np.mean(data_clean), mu_mcmc])
print([np.std(data), np.std(data_clean), sigma_mcmc])


[53.548297872340434, 52.994666666666674]
[3.4374862903549817, 2.200877198856048]

Comparing means of different datasets

We often want to know whether one dataset, $D_i$, has a "statistically significant" difference in one of its parameters, such as its mean $\mu_i$, compared to some other dataset, $D_j$. We can answer this in a Bayesian way by computing $p(\delta_{ij}|D_i,D_j)$, where

$\delta_{ij}=\mu_i-\mu_j$.

To do this, we draw samples from $p(\mu_i|D_i)$ and $p(\mu_j|D_j)$.

Since the magnitude of $\delta_{ij}$ can be hard to interpret, it is common to divide it by the pooled standard deviation, to get a metric known as Cohen's d (see this website for details):

$d_{ij} = \frac{\mu_i - \mu_j}{\sqrt{\frac{\sigma_i^2 + \sigma_j^2}{2}}}$

We can compute $p(d_{ij}|D_i,D_j)$ using posterior samples of $\mu_i,\mu_j,\sigma_i,\sigma_j$.
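
As a small self-contained sketch of this computation (using hypothetical posterior samples, purely to show the mechanics; the real comparison on the tips data follows below):

In [0]:
# Hypothetical posterior samples for the means and standard deviations of two groups.
mu_i = np.random.normal(3.2, 0.15, size=5000)
mu_j = np.random.normal(2.8, 0.15, size=5000)
sigma_i = np.abs(np.random.normal(1.2, 0.1, size=5000))
sigma_j = np.abs(np.random.normal(1.3, 0.1, size=5000))

delta = mu_i - mu_j                                 # posterior samples of the difference in means
d = delta / np.sqrt((sigma_i**2 + sigma_j**2) / 2)  # posterior samples of Cohen's d
az.plot_posterior(d, ref_val=0);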


In [48]:
# We illustrate this below using the same dataset as used in chap 2 of
# ["Bayesian Analysis with Python (2nd edn)"](https://github.com/aloctavodia/BAP), which records how much waiters made in tips.

url = 'https://raw.githubusercontent.com/aloctavodia/BAP/master/code/data/tips.csv'
df = pd.read_csv(url)
df.tail()


Out[48]:
total_bill tip sex smoker day time size
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

In [46]:
# We look at the effect of day on tip amount.
# We ignore other covariates, such as gender.

sns.violinplot(x='day', y='tip', data=df)



In [51]:
# The data contains 4 groups, corresponding to the days Thur, Fri, Sat and Sun.
x=df['day'].values
print(type(x))
print(x)
print(np.unique(x))


<class 'numpy.ndarray'>
['Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun'
 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun'
 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur'
 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Fri' 'Fri' 'Fri' 'Fri'
 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Thur'
 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur'
 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur'
 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur'
 'Thur' 'Thur' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun'
 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun'
 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Sun' 'Thur' 'Thur' 'Thur' 'Thur'
 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur' 'Thur'
 'Thur' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Fri' 'Sat' 'Sat'
 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat' 'Sat'
 'Sat' 'Sat' 'Thur']
['Fri' 'Sat' 'Sun' 'Thur']

In [67]:
tip_amount = df['tip'].values
days = ['Thur', 'Fri', 'Sat', 'Sun']
idx = pd.Categorical(df['day'], categories=days).codes
ngroups = len(np.unique(idx))
print(idx)


[3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
 2 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0]

In [69]:
with pm.Model() as model_cg:
    μ = pm.Normal('μ', mu=0, sd=10, shape=ngroups)
    σ = pm.HalfNormal('σ', sd=10, shape=ngroups)
    y = pm.Normal('y', mu=μ[idx], sd=σ[idx], observed=tip_amount)

print(model_cg.basic_RVs)

with model_cg:
    trace_cg = pm.sample(5000)

az.plot_trace(trace_cg)
az.summary(trace_cg)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
[μ, σ_log__, y]
Sequential sampling (2 chains in 1 job)
NUTS: [σ, μ]
Sampling chain 0, 0 divergences: 100%|██████████| 5500/5500 [00:05<00:00, 938.62it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 5500/5500 [00:05<00:00, 919.67it/s]
Out[69]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
μ[0] 2.770 0.162 2.469 3.078 0.001 0.001 12816.0 12699.0 12846.0 7972.0 1.0
μ[1] 2.732 0.250 2.268 3.208 0.003 0.002 8791.0 8791.0 8927.0 6985.0 1.0
μ[2] 2.992 0.178 2.653 3.320 0.002 0.001 12015.0 12015.0 12037.0 7042.0 1.0
μ[3] 3.255 0.144 2.985 3.523 0.001 0.001 12028.0 11919.0 12093.0 7467.0 1.0
σ[0] 1.264 0.116 1.064 1.494 0.001 0.001 11665.0 11170.0 12329.0 7697.0 1.0
σ[1] 1.094 0.198 0.768 1.473 0.002 0.002 8779.0 7947.0 9898.0 6861.0 1.0
σ[2] 1.654 0.126 1.433 1.901 0.001 0.001 11351.0 11129.0 11410.0 7870.0 1.0
σ[3] 1.255 0.105 1.071 1.460 0.001 0.001 10241.0 9897.0 10768.0 7375.0 1.0

In [70]:
# Looking at the posterior mean for the mu_i,
# we see Thur ~ Fri < Sat < Sun.
# But to see if these differences are significant, we should take
# into account the variability. We illustrate this below.
# We see that Thursday and Friday both earn significantly less than Sunday.
# (The threshold of 0 is outside the 95% HPD).
# Other differences are less significant.

fig, ax = plt.subplots(3, 2, figsize=(14, 8), constrained_layout=True)

comparisons = [(i, j) for i in range(ngroups) for j in range(i+1, ngroups)]
pos = [(k, l) for k in range(ngroups-1) for l in (0, 1)] # position of plot

for (i, j), (k, l) in zip(comparisons, pos):
    means_diff = trace_cg['μ'][:, i] - trace_cg['μ'][:, j]
    d_cohen = (means_diff / np.sqrt((trace_cg['σ'][:, i]**2 + trace_cg['σ'][:, j]**2) / 2)).mean()
    az.plot_posterior(means_diff, ref_val=0, ax=ax[k, l],  credible_interval=0.95)
    name_i = days[i]
    name_j = days[j]
    title = 'mean {} - mean {}'.format(name_i, name_j)
    ax[k, l].set_title(title)
    ax[k, l].plot(0, label=f"Cohen's d = {d_cohen:.2f}")
    ax[k, l].legend()



In [0]:

Hierarchical Bayes

Binomial rates (cancer data)

We use the "cancer rates" example from the MLAPP2.0 book. For PyMC3 code for a similar rats" example from BDA3, see here.


In [13]:
#https://github.com/probml/pmtk3/blob/master/demos/cancerRatesEb.m

data_y = np.array([0, 0, 2, 0, 1, 1, 0, 2, 1, 3, 0, 1, 1, 1, 54, 0, 0, 1, 3, 0]);
data_n = np.array([1083, 855, 3461, 657, 1208, 1025, 527, 1668, 583, 582, 917, 857,
    680, 917, 53637, 874, 395, 581, 588, 383]);
N = len(data_n)

# We put a prior on the mean (mu) and precision (kappa) of the Beta distribution,
# instead of directly on the alpha and beta parameters.
with pm.Model() as model_h:
    mu = pm.Beta('mu', 1., 1.)
    kappa = pm.HalfNormal('kappa', 500)
    alpha = pm.Deterministic('alpha', mu*kappa)
    beta = pm.Deterministic('beta', (1.0-mu)*kappa)
    theta = pm.Beta('theta', alpha=alpha, beta=beta, shape=N)
    y = pm.Binomial('y', p=theta, observed=data_y, n=data_n)

np.random.seed(0)
with model_h:
  trace_h = pm.sample(1000, chains=4)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (4 chains in 1 job)
NUTS: [theta, kappa, mu]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:03<00:00, 414.72it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 1500/1500 [00:03<00:00, 427.66it/s]
Sampling chain 2, 0 divergences: 100%|██████████| 1500/1500 [00:03<00:00, 388.88it/s]
Sampling chain 3, 0 divergences: 100%|██████████| 1500/1500 [00:04<00:00, 366.06it/s]
The acceptance probability does not match the target. It is 0.8860999355152308, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.8922335926247497, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9040355530698296, but should be close to 0.8. Try to increase the number of tuning steps.

In [14]:
az.summary(trace_h)


Out[14]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
mu 0.001 0.000 0.001 0.002 0.000 0.000 2985.0 2948.0 3017.0 3066.0 1.0
kappa 768.204 306.928 249.011 1356.562 7.212 5.100 1811.0 1811.0 1692.0 2178.0 1.0
alpha 1.055 0.419 0.400 1.866 0.013 0.009 1096.0 1096.0 970.0 1622.0 1.0
beta 767.149 306.613 248.607 1354.849 7.201 5.093 1813.0 1813.0 1693.0 2165.0 1.0
theta[0] 0.001 0.001 0.000 0.002 0.000 0.000 4150.0 3848.0 2472.0 1526.0 1.0
theta[1] 0.001 0.001 0.000 0.002 0.000 0.000 4051.0 4006.0 1938.0 1308.0 1.0
theta[2] 0.001 0.000 0.000 0.001 0.000 0.000 4410.0 3838.0 3884.0 2253.0 1.0
theta[3] 0.001 0.001 0.000 0.002 0.000 0.000 4041.0 3797.0 2305.0 1357.0 1.0
theta[4] 0.001 0.001 0.000 0.002 0.000 0.000 4349.0 3764.0 3489.0 2266.0 1.0
theta[5] 0.001 0.001 0.000 0.003 0.000 0.000 4776.0 3974.0 3840.0 1975.0 1.0
theta[6] 0.001 0.001 0.000 0.002 0.000 0.000 4311.0 3737.0 1983.0 1097.0 1.0
theta[7] 0.001 0.001 0.000 0.003 0.000 0.000 5951.0 5036.0 4966.0 2804.0 1.0
theta[8] 0.002 0.001 0.000 0.004 0.000 0.000 5728.0 4474.0 3853.0 2085.0 1.0
theta[9] 0.003 0.002 0.001 0.006 0.000 0.000 4680.0 4357.0 4085.0 2378.0 1.0
theta[10] 0.001 0.001 0.000 0.002 0.000 0.000 5483.0 5037.0 2839.0 1820.0 1.0
theta[11] 0.001 0.001 0.000 0.003 0.000 0.000 5324.0 4342.0 3747.0 1825.0 1.0
theta[12] 0.001 0.001 0.000 0.003 0.000 0.000 5051.0 4094.0 4020.0 2187.0 1.0
theta[13] 0.001 0.001 0.000 0.003 0.000 0.000 5220.0 4370.0 4287.0 2282.0 1.0
theta[14] 0.001 0.000 0.001 0.001 0.000 0.000 5364.0 5178.0 5386.0 3053.0 1.0
theta[15] 0.001 0.001 0.000 0.002 0.000 0.000 4457.0 3987.0 2666.0 1360.0 1.0
theta[16] 0.001 0.001 0.000 0.003 0.000 0.000 4824.0 4265.0 2440.0 1511.0 1.0
theta[17] 0.002 0.001 0.000 0.004 0.000 0.000 4558.0 3619.0 3650.0 1944.0 1.0
theta[18] 0.003 0.002 0.001 0.006 0.000 0.000 4109.0 3643.0 3975.0 2559.0 1.0
theta[19] 0.001 0.001 0.000 0.003 0.000 0.000 4775.0 4391.0 2542.0 1862.0 1.0

In [15]:
az.plot_forest(trace_h, var_names=["theta"], credible_interval=0.95);



In [16]:
az.plot_forest(trace_h, var_names=["theta"], combined=True, credible_interval=0.95);



In [17]:
az.plot_forest(trace_h, var_names=["theta"], combined=True, kind='ridgeplot');


Gaussian means (8 schools data)

This example is from "Bayesian Data Analysis".
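
In the centered parameterization used first, the model is

$\mu_\alpha \sim \mathcal{N}(0, 5^2), \quad \sigma_\alpha \sim \text{HalfCauchy}(5), \quad \alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2), \quad y_j \sim \mathcal{N}(\alpha_j, \sigma_j^2), \quad j=1,\dots,8$

where the observation noise levels $\sigma_j$ are treated as known.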


In [71]:
# https://github.com/probml/pyprobml/blob/master/scripts/schools8_pymc3.py

# Data of the Eight Schools Model
J = 8
y = np.array([28.,  8., -3.,  7., -1.,  1., 18., 12.])
sigma = np.array([15., 10., 16., 11.,  9., 11., 10., 18.])

names = ['theta {}'.format(t) for t in range(8)]
print(names)

# Plot raw data
fig, ax = plt.subplots()
y_pos = np.arange(8)
ax.errorbar(y,y_pos, xerr=sigma, fmt='o')
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
plt.show()


['theta 0', 'theta 1', 'theta 2', 'theta 3', 'theta 4', 'theta 5', 'theta 6', 'theta 7']

In [72]:
# Centered model
with pm.Model() as Centered_eight:
    mu_alpha = pm.Normal('mu_alpha', mu=0, sigma=5)
    sigma_alpha = pm.HalfCauchy('sigma_alpha', beta=5)
    alpha = pm.Normal('alpha', mu=mu_alpha, sigma=sigma_alpha, shape=J)
    obs = pm.Normal('obs', mu=alpha, sigma=sigma, observed=y)
    
np.random.seed(0)
with Centered_eight:
    trace_centered = pm.sample(1000, chains=4)
    
pm.summary(trace_centered).round(2)
# PyMC3 gives multiple warnings about divergences.
# Also, see r_hat ~ 1.01, ESS << nchains*1000, especially for sigma_alpha
# We can solve these problems below by using a non-centered parameterization.
# In practice, for this model, the results are very similar.


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (4 chains in 1 job)
NUTS: [alpha, sigma_alpha, mu_alpha]
Sampling chain 0, 6 divergences: 100%|██████████| 1500/1500 [00:03<00:00, 427.03it/s]
Sampling chain 1, 16 divergences: 100%|██████████| 1500/1500 [00:02<00:00, 508.83it/s]
Sampling chain 2, 13 divergences: 100%|██████████| 1500/1500 [00:03<00:00, 461.16it/s]
Sampling chain 3, 19 divergences: 100%|██████████| 1500/1500 [00:03<00:00, 470.60it/s]
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
There were 22 divergences after tuning. Increase `target_accept` or reparameterize.
There were 35 divergences after tuning. Increase `target_accept` or reparameterize.
There were 54 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
Out[72]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
mu_alpha 4.34 3.27 -2.01 10.25 0.13 0.10 647.0 497.0 644.0 647.0 1.01
alpha[0] 6.62 5.95 -3.81 17.70 0.19 0.14 997.0 966.0 980.0 1534.0 1.01
alpha[1] 5.08 4.95 -4.06 14.67 0.14 0.10 1235.0 1235.0 1143.0 1860.0 1.01
alpha[2] 3.84 5.61 -6.89 14.05 0.16 0.12 1271.0 1200.0 1106.0 1401.0 1.01
alpha[3] 4.75 4.96 -5.07 13.70 0.14 0.10 1334.0 1263.0 1171.0 1929.0 1.01
alpha[4] 3.46 4.78 -5.76 12.42 0.16 0.12 927.0 809.0 848.0 1538.0 1.01
alpha[5] 4.02 4.82 -4.97 13.47 0.14 0.10 1114.0 1080.0 1006.0 1738.0 1.01
alpha[6] 6.63 5.38 -2.60 17.34 0.17 0.12 965.0 965.0 894.0 1541.0 1.01
alpha[7] 4.86 5.64 -5.88 15.72 0.15 0.11 1389.0 1242.0 1200.0 1734.0 1.01
sigma_alpha 4.13 3.20 0.80 9.72 0.16 0.11 400.0 400.0 235.0 147.0 1.02

In [73]:
# Display the total number and percentage of divergent transitions
diverging = trace_centered['diverging']
print('Number of Divergent Transitions: {}'.format(diverging.nonzero()[0].size))
diverging_pct = diverging.nonzero()[0].size / len(trace_centered) * 100
print('Percentage of Divergent Transitions: {:.1f}'.format(diverging_pct))


Number of Divergent Transitions: 54
Percentage of Divergent Transitions: 5.4

In [27]:
# We can see somewhat high autocorrelation of the samples
az.plot_autocorr(trace_centered, var_names=['mu_alpha', 'sigma_alpha']);



In [38]:
az.plot_forest(trace_centered, var_names="alpha", 
               credible_interval=0.95, combined=True);
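
The non-centered parameterization below replaces $\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2)$ with $\alpha_j = \mu_\alpha + \sigma_\alpha \tilde{\alpha}_j$, where $\tilde{\alpha}_j \sim \mathcal{N}(0, 1)$. The two models define the same joint distribution, but the latter decouples $\sigma_\alpha$ from the $\tilde{\alpha}_j$ in the prior, which removes the "funnel" geometry that causes the divergences seen in the centered model.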



In [29]:
# Non-centered parameterization

with pm.Model() as NonCentered_eight:
    mu_alpha = pm.Normal('mu_alpha', mu=0, sigma=5)
    sigma_alpha = pm.HalfCauchy('sigma_alpha', beta=5)
    alpha_offset = pm.Normal('alpha_offset', mu=0, sigma=1, shape=J)
    alpha = pm.Deterministic('alpha', mu_alpha + sigma_alpha * alpha_offset)
    #alpha = pm.Normal('alpha', mu=mu_alpha, sigma=sigma_alpha, shape=J)
    obs = pm.Normal('obs', mu=alpha, sigma=sigma, observed=y)
    
np.random.seed(0)
with NonCentered_eight:
    trace_noncentered = pm.sample(1000)
    
pm.summary(trace_noncentered).round(2)
# Samples look good: r_hat = 1, ESS ~= nchains*1000


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [alpha_offset, sigma_alpha, mu_alpha]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 769.54it/s]
Sampling chain 1, 6 divergences: 100%|██████████| 1500/1500 [00:01<00:00, 790.51it/s]
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
Out[29]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
mu_alpha 4.35 3.33 -1.60 10.80 0.07 0.06 2109.0 1785.0 2128.0 1133.0 1.0
alpha_offset[0] 0.32 0.97 -1.56 2.02 0.02 0.02 2201.0 823.0 2178.0 1134.0 1.0
alpha_offset[1] 0.10 0.95 -1.80 1.82 0.02 0.02 2368.0 879.0 2360.0 1510.0 1.0
alpha_offset[2] -0.08 1.00 -1.82 1.95 0.02 0.03 2302.0 740.0 2299.0 1198.0 1.0
alpha_offset[3] 0.04 0.95 -1.72 1.86 0.02 0.02 2624.0 850.0 2616.0 1380.0 1.0
alpha_offset[4] -0.17 0.94 -2.01 1.50 0.02 0.02 2348.0 1021.0 2342.0 1520.0 1.0
alpha_offset[5] -0.08 0.99 -2.00 1.68 0.02 0.02 2581.0 856.0 2665.0 1197.0 1.0
alpha_offset[6] 0.38 0.98 -1.50 2.09 0.02 0.02 2256.0 1117.0 2249.0 1480.0 1.0
alpha_offset[7] 0.12 0.97 -1.72 1.86 0.02 0.02 2216.0 836.0 2229.0 1501.0 1.0
sigma_alpha 3.60 3.17 0.02 9.39 0.09 0.06 1280.0 1280.0 1161.0 1066.0 1.0
alpha[0] 6.11 5.56 -3.29 17.28 0.13 0.10 1841.0 1409.0 2077.0 1472.0 1.0
alpha[1] 4.93 4.54 -3.68 13.11 0.10 0.09 1920.0 1381.0 2086.0 1462.0 1.0
alpha[2] 4.03 5.26 -6.54 13.50 0.12 0.10 1899.0 1484.0 2045.0 1266.0 1.0
alpha[3] 4.56 4.84 -5.06 12.98 0.11 0.08 2087.0 1676.0 2273.0 1268.0 1.0
alpha[4] 3.44 4.67 -5.05 11.93 0.10 0.08 2011.0 1719.0 2138.0 1605.0 1.0
alpha[5] 3.96 4.98 -5.31 12.80 0.10 0.08 2280.0 1728.0 2398.0 1630.0 1.0
alpha[6] 6.50 5.29 -2.97 17.01 0.12 0.09 2032.0 1622.0 2177.0 1495.0 1.0
alpha[7] 4.97 5.40 -5.63 14.53 0.13 0.10 1623.0 1463.0 1759.0 1191.0 1.0

In [31]:
az.plot_autocorr(trace_noncentered, var_names=['mu_alpha', 'sigma_alpha']);



In [51]:
az.plot_forest(trace_noncentered, var_names="theta",
               combined=True, credible_interval=0.95);



In [0]:
# Plot the "funnel of hell"
# Based on
# https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/GLM_hierarchical_non_centered.ipynb

group = 0
x = pd.Series(trace_centered['alpha'][:, group], name=f'alpha {group}')
y = pd.Series(trace_centered['sigma_alpha'], name='sigma_alpha')
sns.jointplot(x, y);
plt.suptitle('centered')

In [0]:
group = 0
x = pd.Series(trace_noncentered['alpha'][:, group], name=f'alpha {group}')
y = pd.Series(trace_noncentered['sigma_alpha'], name='sigma_alpha')
sns.jointplot(x, y);
plt.suptitle('noncentered')

In [40]:
for group in range(J):
  fig, axs = plt.subplots(ncols=2, sharex=True, sharey=True)
  x = pd.Series(trace_centered['alpha'][:, group], name=f'alpha {group}')
  y = pd.Series(trace_centered['sigma_alpha'], name='sigma_alpha')
  axs[0].plot(x, y, '.')
  axs[0].set(title='Centered', ylabel='sigma_alpha', xlabel=f'alpha {group}')
  x = pd.Series(trace_noncentered['alpha'][:, group], name=f'alpha {group}')
  y = pd.Series(trace_noncentered['sigma_alpha'], name='sigma_alpha')
  axs[1].plot(x, y, '.')
  axs[1].set(title='Non-centered', ylabel='sigma_alpha', xlabel=f'alpha {group}')


Linear regression (synthetic 1d data)

This section is based on Bayesian Analysis with Python, ch 3 and this blog post from Thomas Wiecki.
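
All of the models below assume $y_{ij} = \alpha_j + \beta_j x_{ij} + \text{noise}$ for observation $i$ in group $j$, where the noise is Gaussian with scale $\epsilon \sim \text{HalfCauchy}(5)$. In the unpooled model each group gets independent $\mathcal{N}(0, 10^2)$ priors on $\alpha_j$ and $\beta_j$; in the hierarchical (partially pooled) models the group parameters share hyper-priors, $\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2)$ and $\beta_j \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)$, fit below in both centered and non-centered form.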


In [57]:
N = 10 #20
M = 8 # num groups
idx = np.repeat(range(M-1), N) # N samples for groups 0-6
idx = np.append(idx, 7) # 1 sample for the 7th group
np.random.seed(123)

#alpha_real = np.random.normal(2.5, 0.5, size=M)
#beta_real = np.random.beta(6, 1, size=M)
#eps_real = np.random.normal(0, 0.5, size=len(idx))

alpha_real = np.random.normal(2.5, 0.5, size=M)
beta_real = np.random.beta(1, 1, size=M) # slope is closer to 0
eps_real = np.random.normal(0, 0.5, size=len(idx))

print(beta_real)

y_m = np.zeros(len(idx))
x_m = np.random.normal(10, 1, len(idx))
y_m = alpha_real[idx] + beta_real[idx] * x_m + eps_real
x_centered = x_m - x_m.mean()

_, ax = plt.subplots(2, 4, figsize=(10, 5), sharex=True, sharey=True)
ax = np.ravel(ax)
j, k = 0, N
for i in range(M):
    ax[i].scatter(x_m[j:k], y_m[j:k])
    ax[i].set_xlabel(f'x_{i}')
    ax[i].set_ylabel(f'y_{i}', rotation=0, labelpad=15)
    ax[i].set_xlim(6, 15)
    ax[i].set_ylim(1, 18)
    j += N
    k += N
plt.tight_layout()


[0.88022503 0.50983392 0.61314719 0.31763509 0.17516902 0.46602537
 0.57693427 0.84366837]

In [58]:
# Fit a separate model per group

with pm.Model() as unpooled_model:
    α = pm.Normal('α', mu=0, sd=10, shape=M)
    β = pm.Normal('β', mu=0, sd=10, shape=M)
    ϵ = pm.HalfCauchy('ϵ', 5)

    y_pred = pm.Normal('y_pred', mu=α[idx] + β[idx] * x_m,
                         sd=ϵ,  observed=y_m)
    trace_up = pm.sample(1000)

az.summary(trace_up)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [ϵ, β, α]
Sampling chain 0, 0 divergences: 100%|██████████| 1500/1500 [00:24<00:00, 60.08it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 1500/1500 [00:23<00:00, 63.04it/s]
Out[58]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
α[0] 2.749 1.645 -0.314 5.723 0.036 0.026 2125.0 1993.0 2132.0 1486.0 1.00
α[1] 3.116 2.507 -1.687 7.561 0.051 0.047 2455.0 1395.0 2506.0 1305.0 1.00
α[2] 2.900 2.017 -0.841 6.609 0.052 0.040 1493.0 1289.0 1502.0 1282.0 1.00
α[3] 0.630 3.235 -5.386 6.556 0.065 0.071 2494.0 1052.0 2484.0 1361.0 1.00
α[4] 4.467 1.946 1.228 8.495 0.047 0.034 1686.0 1686.0 1692.0 1369.0 1.00
α[5] 0.315 2.317 -3.947 4.733 0.048 0.054 2315.0 930.0 2340.0 1412.0 1.00
α[6] -2.924 3.008 -7.909 3.450 0.062 0.052 2371.0 1648.0 2380.0 1305.0 1.01
α[7] -0.019 10.236 -18.613 19.346 0.199 0.247 2653.0 859.0 2646.0 1285.0 1.00
β[0] 0.822 0.163 0.524 1.124 0.004 0.003 2131.0 2038.0 2143.0 1416.0 1.00
β[1] 0.523 0.245 0.080 0.990 0.005 0.004 2427.0 2277.0 2478.0 1287.0 1.00
β[2] 0.588 0.199 0.214 0.950 0.005 0.004 1566.0 1478.0 1576.0 1335.0 1.00
β[3] 0.404 0.310 -0.201 0.941 0.006 0.005 2500.0 1876.0 2492.0 1360.0 1.00
β[4] -0.046 0.198 -0.429 0.304 0.005 0.004 1695.0 1224.0 1701.0 1406.0 1.00
β[5] 0.765 0.232 0.321 1.183 0.005 0.004 2331.0 2098.0 2352.0 1458.0 1.00
β[6] 0.991 0.298 0.381 1.498 0.006 0.004 2361.0 2340.0 2379.0 1297.0 1.01
β[7] 1.025 0.896 -0.683 2.646 0.017 0.016 2649.0 1568.0 2645.0 1303.0 1.00
ϵ 0.596 0.059 0.492 0.706 0.001 0.001 2028.0 1988.0 2056.0 1358.0 1.00

In [0]:
def plot_post_pred_samples(trace, nsamples=20):
    _, ax = plt.subplots(2, 4, figsize=(10, 5), sharex=True, sharey=True,
                       constrained_layout=True)
    ax = np.ravel(ax)
    j, k = 0, N
    x_range = np.linspace(x_m.min(), x_m.max(), 10)
    X =  x_range[:, np.newaxis]
    
    for i in range(M):
        ax[i].scatter(x_m[j:k], y_m[j:k])
        ax[i].set_xlabel(f'x_{i}')
        ax[i].set_ylabel(f'y_{i}', labelpad=17, rotation=0)
        alpha_m = trace['α'][:, i].mean()
        beta_m = trace['β'][:, i].mean()
        ax[i].plot(x_range, alpha_m + beta_m * x_range, c='r', lw=3,
                  label=f'y = {alpha_m:.2f} + {beta_m:.2f} * x')
        plt.xlim(x_m.min()-1, x_m.max()+1)
        plt.ylim(y_m.min()-1, y_m.max()+1)
        alpha_samples = trace['α'][:,i]
        beta_samples = trace['β'][:,i]
        ndx = np.random.choice(np.arange(len(alpha_samples)), nsamples)
        alpha_samples_thinned = alpha_samples[ndx]
        beta_samples_thinned = beta_samples[ndx]
        ax[i].plot(x_range, alpha_samples_thinned + beta_samples_thinned * X,
            c='gray', alpha=0.5)
        
        j += N
        k += N

In [60]:
plot_post_pred_samples(trace_up)



In [61]:
# Fit the centered model to the raw data
with pm.Model() as model_centered:
    # hyper-priors
    μ_α = pm.Normal('μ_α', mu=0, sd=10)
    σ_α = pm.HalfNormal('σ_α', 10)
    μ_β = pm.Normal('μ_β', mu=0, sd=10)
    σ_β = pm.HalfNormal('σ_β', sd=10)

    # priors
    α = pm.Normal('α', mu=μ_α, sd=σ_α, shape=M)
    β = pm.Normal('β', mu=μ_β, sd=σ_β, shape=M)
    ϵ = pm.HalfCauchy('ϵ', 5)

    y_pred = pm.Normal('y_pred', mu=α[idx] + β[idx] * x_m,
                         sd=ϵ, observed=y_m)

    trace_centered = pm.sample(1000)

az.summary(trace_centered)
# Lots of warnings about divergence. We will fix this below
# when we switch to non-centered parameterization.


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [ν, ϵ, β, α, σ_β, μ_β, σ_α, μ_α]
Sampling chain 0, 10 divergences: 100%|██████████| 1500/1500 [00:19<00:00, 76.42it/s]
Sampling chain 1, 3 divergences: 100%|██████████| 1500/1500 [00:18<00:00, 83.23it/s]
There were 10 divergences after tuning. Increase `target_accept` or reparameterize.
There were 13 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.
Out[61]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
μ_α 2.264 1.044 0.219 4.124 0.050 0.036 442.0 411.0 434.0 700.0 1.0
μ_β 0.557 0.141 0.298 0.813 0.006 0.004 518.0 518.0 516.0 944.0 1.0
α[0] 2.878 1.190 0.882 5.400 0.059 0.042 411.0 411.0 415.0 652.0 1.0
α[1] 2.503 1.342 -0.248 4.953 0.054 0.039 621.0 605.0 599.0 968.0 1.0
α[2] 2.642 1.224 0.401 5.009 0.054 0.039 518.0 495.0 519.0 1105.0 1.0
α[3] 1.539 1.618 -1.732 4.426 0.074 0.052 480.0 480.0 486.0 888.0 1.0
α[4] 2.255 1.356 -0.747 4.543 0.054 0.038 642.0 642.0 646.0 526.0 1.0
α[5] 1.914 1.253 -0.645 4.034 0.057 0.041 482.0 474.0 476.0 871.0 1.0
α[6] 1.541 1.536 -1.291 4.413 0.067 0.047 525.0 525.0 544.0 817.0 1.0
α[7] 2.842 1.757 -0.274 6.420 0.075 0.059 546.0 441.0 565.0 660.0 1.0
β[0] 0.807 0.118 0.580 1.021 0.006 0.004 425.0 425.0 427.0 661.0 1.0
β[1] 0.583 0.132 0.327 0.839 0.005 0.004 638.0 638.0 613.0 955.0 1.0
β[2] 0.613 0.120 0.393 0.839 0.005 0.004 525.0 525.0 527.0 1145.0 1.0
β[3] 0.318 0.155 0.028 0.617 0.007 0.005 483.0 449.0 488.0 921.0 1.0
β[4] 0.180 0.139 -0.057 0.481 0.005 0.004 660.0 581.0 661.0 542.0 1.0
β[5] 0.606 0.126 0.382 0.850 0.006 0.004 487.0 487.0 481.0 1032.0 1.0
β[6] 0.549 0.152 0.293 0.859 0.007 0.005 542.0 520.0 557.0 820.0 1.0
β[7] 0.768 0.157 0.438 1.023 0.007 0.005 537.0 537.0 558.0 729.0 1.0
σ_α 1.334 0.987 0.133 3.057 0.064 0.045 238.0 238.0 170.0 254.0 1.0
σ_β 0.288 0.123 0.095 0.512 0.004 0.003 827.0 727.0 831.0 540.0 1.0
ϵ 0.579 0.057 0.480 0.688 0.002 0.001 1300.0 1300.0 1262.0 1048.0 1.0
ν 44.776 33.242 3.721 104.553 0.862 0.610 1488.0 1483.0 1322.0 1146.0 1.0

In [62]:
az.plot_autocorr(trace_centered, var_names=['μ_α', 'σ_α', 'μ_β', 'σ_β']); #sigma_alpha chain is highly correlated



In [63]:
plot_post_pred_samples(trace_centered)



In [65]:
# Fit the non-centered model to the raw data
with pm.Model() as model_noncentered:
    # hyper-priors
    μ_α = pm.Normal('μ_α', mu=0, sd=10)
    σ_α = pm.HalfNormal('σ_α', 10)
    μ_β = pm.Normal('μ_β', mu=0, sd=10)
    σ_β = pm.HalfNormal('σ_β', sd=10)

    # priors
    α_offset = pm.Normal('α_offset', mu=0, sd=1, shape=M)
    α = pm.Deterministic('α', μ_α + σ_α * α_offset) 
    β_offset = pm.Normal('β_offset', mu=0, sd=1, shape=M)
    β = pm.Deterministic('β', μ_β + σ_β * β_offset) 

    ϵ = pm.HalfCauchy('ϵ', 5)

    y_pred = pm.Normal('y_pred', mu=α[idx] + β[idx] * x_m,
                         sd=ϵ, observed=y_m)

    trace_noncentered = pm.sample(1000)

az.summary(trace_noncentered)
# Ideally this parameterization yields fewer divergences, although some remain in this run.


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [ϵ, β_offset, α_offset, σ_β, μ_β, σ_α, μ_α]
Sampling chain 0, 13 divergences: 100%|██████████| 1500/1500 [00:31<00:00, 48.14it/s]
Sampling chain 1, 51 divergences: 100%|██████████| 1500/1500 [00:32<00:00, 46.79it/s]
There were 13 divergences after tuning. Increase `target_accept` or reparameterize.
There were 64 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.
Out[65]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
μ_α 2.271 0.973 0.513 4.191 0.031 0.023 978.0 932.0 1002.0 748.0 1.00
μ_β 0.557 0.146 0.278 0.824 0.005 0.004 749.0 700.0 754.0 670.0 1.01
α_offset[0] 0.356 0.852 -1.308 1.935 0.026 0.019 1093.0 970.0 1042.0 1308.0 1.00
α_offset[1] 0.168 0.830 -1.492 1.618 0.023 0.020 1338.0 868.0 1321.0 1340.0 1.00
α_offset[2] 0.218 0.815 -1.296 1.742 0.020 0.019 1611.0 957.0 1590.0 1254.0 1.00
α_offset[3] -0.399 0.918 -2.167 1.275 0.021 0.021 1889.0 966.0 1839.0 1442.0 1.00
α_offset[4] 0.054 0.924 -1.667 1.807 0.025 0.024 1374.0 736.0 1370.0 1289.0 1.01
α_offset[5] -0.207 0.845 -1.703 1.531 0.022 0.022 1533.0 723.0 1517.0 1160.0 1.01
α_offset[6] -0.479 0.928 -2.187 1.303 0.030 0.025 973.0 708.0 858.0 739.0 1.01
α_offset[7] 0.290 0.946 -1.571 2.045 0.024 0.027 1573.0 610.0 1599.0 653.0 1.00
β_offset[0] 0.977 0.577 -0.171 1.954 0.032 0.024 329.0 292.0 346.0 722.0 1.00
β_offset[1] 0.037 0.576 -1.463 0.872 0.052 0.050 121.0 66.0 193.0 33.0 1.01
β_offset[2] 0.177 0.517 -0.914 1.046 0.036 0.026 201.0 201.0 229.0 66.0 1.01
β_offset[3] -0.875 0.585 -2.035 0.096 0.027 0.019 483.0 483.0 473.0 1104.0 1.00
β_offset[4] -1.365 0.605 -2.438 -0.218 0.022 0.015 785.0 776.0 790.0 912.0 1.00
β_offset[5] 0.146 0.484 -0.835 0.958 0.021 0.015 527.0 527.0 521.0 876.0 1.00
β_offset[6] 0.034 0.675 -0.830 1.918 0.067 0.059 102.0 67.0 173.0 38.0 1.01
β_offset[7] 0.757 0.573 -0.237 1.820 0.026 0.019 469.0 469.0 470.0 800.0 1.00
σ_α 1.304 1.155 0.001 3.663 0.131 0.117 78.0 49.0 150.0 40.0 1.01
σ_β 0.307 0.127 0.123 0.526 0.005 0.004 595.0 580.0 654.0 892.0 1.00
α[0] 2.707 1.118 0.578 4.754 0.051 0.036 490.0 490.0 456.0 589.0 1.00
α[1] 2.628 1.464 0.419 6.469 0.133 0.126 121.0 68.0 191.0 45.0 1.01
α[2] 2.678 1.269 0.669 5.541 0.101 0.087 159.0 107.0 200.0 78.0 1.01
α[3] 1.602 1.505 -1.274 4.153 0.057 0.040 692.0 692.0 781.0 983.0 1.00
α[4] 2.291 1.264 -0.187 4.521 0.048 0.034 694.0 694.0 634.0 1401.0 1.00
α[5] 1.991 1.207 -0.511 4.076 0.035 0.025 1183.0 1137.0 1192.0 1043.0 1.00
α[6] 1.430 1.734 -2.838 3.927 0.145 0.103 142.0 142.0 242.0 61.0 1.01
α[7] 2.794 1.643 0.057 6.148 0.102 0.078 258.0 220.0 292.0 312.0 1.01
β[0] 0.824 0.112 0.593 1.009 0.005 0.004 530.0 482.0 493.0 580.0 1.00
β[1] 0.569 0.145 0.189 0.782 0.013 0.009 119.0 119.0 186.0 40.0 1.01
β[2] 0.609 0.125 0.328 0.800 0.010 0.007 163.0 163.0 207.0 57.0 1.01
β[3] 0.312 0.144 0.059 0.586 0.005 0.004 734.0 647.0 892.0 985.0 1.00
β[4] 0.176 0.130 -0.079 0.410 0.005 0.003 799.0 799.0 736.0 1343.0 1.00
β[5] 0.597 0.121 0.380 0.840 0.003 0.003 1204.0 1158.0 1216.0 1099.0 1.00
β[6] 0.561 0.172 0.314 0.983 0.014 0.012 144.0 108.0 243.0 58.0 1.01
β[7] 0.771 0.149 0.502 1.026 0.008 0.006 335.0 335.0 362.0 788.0 1.00
ϵ 0.596 0.056 0.486 0.689 0.001 0.001 1829.0 1772.0 1892.0 1188.0 1.00

In [66]:
az.plot_autocorr(trace_noncentered, var_names=['μ_α', 'σ_α', 'μ_β', 'σ_β']);



In [67]:
plot_post_pred_samples(trace_noncentered)



In [0]: