Bayesian Modeling for the Busy and the Confused - Part II

Markov Chain Monte-Carlo

Currently, the capacity to gather data is far ahead of the ability to generate meaningful insight using conventional approaches. Hopes of alleviating this bottleneck have come through the application of machine learning tools. Among these tools, one that is increasingly gaining traction is probabilistic programming, particularly Bayesian modeling. In this paradigm, the variables that define a model carry probability distributions rather than scalar values. "Fitting" a model to data can then, simplistically, be construed as finding the appropriate parameterization of these distributions, given the model structure and the data. This offers a number of advantages over other methods, not the least of which is the estimation of uncertainty around model results. This in turn can better inform subsequent processes, such as decision-making and scientific discovery.



This is the second of a two-notebook series providing a brief, basic, but hands-on programmatic introduction to Bayesian modeling. The first notebook reviewed a few key probability principles relevant to Bayesian inference and illustrated how to put them into practice with one of the more intuitive approaches to Bayesian computation: Grid Approximation (GA). With that framework I showed how to create simple models that can be used to interpret and predict real-world data.
GA is computationally intensive, however, and quickly runs into problems when the data set is large and/or the model increases in complexity. One of the more popular solutions to this problem is the Markov Chain Monte-Carlo (MCMC) algorithm. The implementation of MCMC in Bayesian models is the subject of this second notebook.
As of this writing, the most popular programming language in machine learning is Python. Python is an easy language to pick up: pedagogical resources abound. It is free and open source, and the large number of very useful libraries written over the years has propelled it to its current place of prominence in a number of fields, machine learning among them.

I use Python (3.6+) code to illustrate the mechanics of Bayesian inference in lieu of lengthy explanations, along with a number of dedicated Python libraries that shorten the code considerably. A solid understanding of Bayesian modeling cannot be spoon-fed; it only comes from getting one's hands dirty. Emphasis is therefore on readable, reproducible code, which should ease the work the interested reader has to do to re-run the notebook and experiment with the coding and Bayesian modeling patterns presented. Some know-how is required regarding installing and running a Python distribution, the required libraries, and Jupyter notebooks; this is easily gleaned from the internet. A popular option in the machine learning community is Anaconda.


In [2]:
import pickle
import warnings
import sys
from IPython.display import Image, HTML

import pandas as pd
import numpy as np
from scipy.stats import norm, uniform  # norm is used below as norm.rvs / norm.logpdf
import pymc3 as pm
from theano import shared

import seaborn as sb
import matplotlib.pyplot as pl
from matplotlib import rcParams
from matplotlib import ticker as mtick
import arviz as ar

In [ ]:
print('Versions:')
print('---------')
print(f'python:  {sys.version.split("|")[0]}')
print(f'numpy:   {np.__version__}')
print(f'pandas:  {pd.__version__}')
print(f'seaborn: {sb.__version__}')
print(f'pymc3:   {pm.__version__}')
print(f'arviz:   {ar.__version__}')

In [ ]:
%matplotlib inline
warnings.filterwarnings('ignore',  category=FutureWarning)

Under the hood: Inferring chlorophyll distribution

  • Grid approximation: computing probability everywhere
  • Magical MCMC: Dealing with computational complexity
  • Probabilistic Programming with PyMC3: Industrial grade MCMC

Magical MCMC: Dealing with computational complexity

  • Grid approximation:
    • useful for understanding mechanics of Bayesian computation
    • computationally intensive
    • impractical and often intractable for large data sets or high-dimension models
  • MCMC allows sampling where it **probabilistically matters**:
    • compute current probability given location in parameter space
    • propose jump to new location in parameter space
    • compute new probability at proposed location
    • jump to the new location if $\frac{new\ probability}{current\ probability}>1$
    • otherwise draw $\gamma\sim Uniform(0, 1)$ and jump if $\frac{new\ probability}{current\ probability}>\gamma$
    • otherwise stay at the current location
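To make the accept/reject rule concrete, here is a toy numeric sketch of a single Metropolis decision (the log-probabilities below are made-up numbers for illustration only, not values from the chlorophyll data):

In [ ]:
# Toy illustration of one Metropolis accept/reject decision (hypothetical numbers).
log_prob_current = -4.2   # log probability at the current location
log_prob_proposed = -4.9  # log probability at the proposed location
ratio = np.exp(log_prob_proposed - log_prob_current)  # ≈ 0.50
γ = uniform.rvs(loc=0, scale=1)                       # γ ~ Uniform(0, 1)
accept = (ratio > 1) or (ratio > γ)                   # accepted roughly half the time here
print(f'ratio = {ratio:.2f}, γ = {γ:.2f}, accept proposal: {accept}')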

In [ ]:
def mcmc(data, μ_0=0.5, n_samples=1000,):
    print(f'{data.size} data points')
    data = data.reshape(1, -1)
    σ = 0.75  # keep σ fixed for simplicity; only μ is sampled
    trace_μ = np.nan * np.ones(n_samples)  # trace: where the sampler has been
    trace_μ[0] = μ_0  # start with a first guess
    for i in range(1, n_samples):
        # propose a jump centered on the current location
        proposed_μ = norm.rvs(loc=trace_μ[i-1], scale=0.1, size=1)
        prop_par_dict = dict(μ=proposed_μ, σ=σ)
        curr_par_dict = dict(μ=trace_μ[i-1], σ=σ)
        # unnormalized log posterior = log likelihood + log prior
        log_prob_prop = get_log_lik(data, prop_par_dict
                                   ) + get_log_prior(prop_par_dict)
        log_prob_curr = get_log_lik(data, curr_par_dict
                                   ) + get_log_prior(curr_par_dict)
        ratio = np.exp(log_prob_prop - log_prob_curr)
        if ratio > 1:
            # proposal is more probable: accept it
            trace_μ[i] = proposed_μ
        else:
            # otherwise accept the proposal with probability equal to the ratio
            if uniform.rvs(size=1, loc=0, scale=1) > ratio:
                # reject proposal, stay put
                trace_μ[i] = trace_μ[i-1]
            else:
                # accept proposal
                trace_μ[i] = proposed_μ
    return trace_μ

In [ ]:
def get_log_lik(data, param_dict):
    # log likelihood of the data under a Gaussian with the candidate parameters
    return np.sum(norm.logpdf(data, loc=param_dict['μ'],
                              scale=param_dict['σ']
                             ),
                  axis=1)

def get_log_prior(par_dict, loc=1, scale=1):
    # Gaussian log prior on μ
    return norm.logpdf(par_dict['μ'], loc=loc, scale=scale)

Timing MCMC


In [ ]:
%%time
mcmc_n_samples = 2000
trace1 = mcmc(data=df_data_s.chl_l.values, n_samples=mcmc_n_samples)

In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
ax[0].plot(np.arange(mcmc_n_samples), trace1, marker='.',
           ls=':', color='k')
ax[0].set_title(f'trace of μ, {df_data_s.chl_l.size} data points')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(trace1, ax=ax[1], label='mcmc',
                 color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
df_μ = df_grid_3.groupby(['μ']).sum().drop('σ',
                                     axis=1)[['post_prob']
                                            ].reset_index()
ax2 = ax[1].twinx()
df_μ.plot(x='μ', y='post_prob', ax=ax2, color='k',
         label='grid',)
ax2.set_ylim(bottom=0);
ax2.legend(loc='upper right')
f.tight_layout()

f.savefig('./figJar/Presentation/mcmc_1.svg')


In [ ]:
%%time
samples = 2000
trace2 = mcmc(data=df_data.chl_l.values, n_samples=samples)

In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
ax[0].plot(np.arange(samples), trace2, marker='.',
           ls=':', color='k')
ax[0].set_title(f'trace of μ, {df_data.chl_l.size} data points')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(trace2, ax=ax[1], label='mcmc',
                 color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
f.tight_layout()
f.savefig('./figJar/Presentation/mcmc_2.svg')


In [ ]:
f, ax = pl.subplots(ncols=2, figsize=(12, 5))
ax[0].stem(pm.autocorr(trace1[1500:]))
ax[1].stem(pm.autocorr(trace2[1500:]))
ax[0].set_title(f'{df_data_s.chl_l.size} data points')
ax[1].set_title(f'{df_data.chl_l.size} data points')
f.suptitle('trace autocorrelation', fontsize=19)
f.savefig('./figJar/Presentation/grid8.svg')

In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
thinned_trace = np.random.choice(trace2[100:], size=200, replace=False)
ax[0].plot(np.arange(200), thinned_trace, marker='.',
           ls=':', color='k')
ax[0].set_title('thinned trace of μ')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(thinned_trace, ax=ax[1], label='mcmc',
                 color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
f.tight_layout()
f.savefig('./figJar/Presentation/grid9.svg')

In [ ]:
f, ax = pl.subplots()
ax.stem(pm.autocorr(thinned_trace[:20]));
f.savefig('./figJar/Presentation/stem2.svg', dpi=300, format='svg');

What's going on?

Highly autocorrelated trace:
$\rightarrow$ inadequate parameter space exploration
$\rightarrow$ poor convergence (see the effective-sample-size sketch after the list below)

Metropolis MCMC:
$\rightarrow$ easy to implement and memory efficient
$\rightarrow$ but inefficient at exploring parameter space
$\rightarrow$ is there a better MCMC sampler?

  • Hamiltonian Monte Carlo (HMC)
  • Greatly improved convergence
  • Well-mixed traces are its signature and an easy diagnostic
  • HMC does, however, require a lot of tuning
  • Not practical for the inexperienced applied statistician or scientist
  • No-U-Turn Sampler (NUTS): HMC that automates most tuning steps
  • NUTS scales well to complex problems with many parameters (1000's)
  • Implemented in popular libraries
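The autocorrelation seen above can be summarized as an effective sample size. Below is a quick sketch of that check, assuming an arviz version that provides ar.ess and accepts a plain 1-D array (treated as a single chain); it is not part of the original analysis.

In [ ]:
# Rough convergence check: effective sample size of the raw vs. thinned traces.
# A heavily autocorrelated Metropolis trace yields far fewer effective samples
# than its nominal number of draws.
ess_raw = float(ar.ess(trace2[100:])['x'])
ess_thinned = float(ar.ess(thinned_trace)['x'])
print(f'raw trace:     {trace2[100:].size} draws, ~{ess_raw:.0f} effective samples')
print(f'thinned trace: {thinned_trace.size} draws, ~{ess_thinned:.0f} effective samples')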
Probabilistic modeling for the beginner
  • Under the hood: Inferring chlorophyll distribution
    • Grid approximation: computing probability everywhere
    • MCMC: how it works
    • Probabilistic Programming with PyMC3: Industrial grade MCMC

Probabilistic Programming with PyMC3

  • relatively simple syntax
  • easily used in conjunction with mainstream Python scientific data structures:
    $\rightarrow$ numpy arrays
    $\rightarrow$ pandas dataframes
  • models of reasonable complexity span ~10-20 lines

In [ ]:
with pm.Model() as m1:
    μ_ = pm.Normal('μ', mu=1, sd=1)
    σ = pm.Uniform('σ', lower=0, upper=2)
    lkl = pm.Normal('likelihood', mu=μ_, sd=σ,
                    observed=df_data.chl_l.dropna())

In [ ]:
graph_m1 = pm.model_to_graphviz(m1)
graph_m1.format = 'svg'
graph_m1.render('./figJar/Presentation/graph_m1');

In [ ]:
with m1:
    trace_m1 = pm.sample(2000, tune=1000, chains=4)

In [ ]:
pm.traceplot(trace_m1);

In [ ]:
ar.plot_posterior(trace_m1, kind='hist', round_to=2);
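As an additional check (a sketch, not part of the original notebook), PyMC3's summary table reports posterior means, credible intervals, and convergence diagnostics; the exact columns vary with the PyMC3/arviz version, but an R-hat close to 1 indicates the four chains agree.

In [ ]:
# Tabular posterior summary with convergence diagnostics
# (R-hat ~ 1.0 signals well-mixed chains).
pm.summary(trace_m1)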

Tutorial Overview:

  • Probabilistic modeling for the beginner
    $\rightarrow$The basics
    $\rightarrow$Starting easy: inferring chlorophyll
    $\rightarrow$Regression: adding a predictor to estimate chlorophyll

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

Linear regression takes the form

$$ y = \alpha + \beta x $$

where $y = log_{10}(chl)$ and $x = log_{10}\left(\frac{Gr}{MxBl}\right)$.


In [ ]:
df_data.head().T

In [ ]:
df_data['Gr-MxBl'] = -1 * df_data['MxBl-Gr']

Regression coefficients are easier to interpret with a centered predictor:

$$x_c = x - \bar{x}$$


In [ ]:
df_data['Gr-MxBl_c'] = df_data['Gr-MxBl'] - df_data['Gr-MxBl'].mean()

In [ ]:
df_data[['Gr-MxBl_c', 'chl_l']].info()

In [ ]:
x_c = df_data.dropna()['Gr-MxBl_c'].values
y = df_data.dropna().chl_l.values

$$ y = \alpha + \beta x_c$$


$\rightarrow \alpha=$ expected $y$ when $x=\bar{x}$ (i.e. $x_c=0$)
$\rightarrow \beta=\Delta y$ when $x$ increases by one unit


In [ ]:
g3 = sb.PairGrid(df_data.loc[:, ['Gr-MxBl_c', 'chl_l']], height=3,
                         diag_sharey=False,)
g3.map_diag(sb.kdeplot, color='k')
g3.map_offdiag(sb.scatterplot, color='k');
make_lower_triangle(g3)
f = pl.gcf()
axs = f.get_axes()
xlabel = r'$log_{10}\left(\frac{Rrs_{green}}{max(Rrs_{blue})}\right), centered$'
ylabel = r'$log_{10}(chl)$'
axs[0].set_xlabel(xlabel)
axs[2].set_xlabel(xlabel)
axs[2].set_ylabel(ylabel)
axs[3].set_xlabel(ylabel)
f.tight_layout()
f.savefig('./figJar/Presentation/pairwise_1.png')

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

In [ ]:
with pm.Model() as m_vague_prior:
    # priors
    σ = pm.Uniform('σ', lower=0, upper=2)
    α = pm.Normal('α', mu=0, sd=1)
    β = pm.Normal('β', mu=0, sd=1)
    # deterministic model
    μ = α + β * x_c
    # likelihood
    chl_i = pm.Normal('chl_i', mu=μ, sd=σ, observed=y)

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

In [ ]:
vague_priors = pm.sample_prior_predictive(samples=500, model=m_vague_prior, vars=['α', 'β',])

In [ ]:
x_dummy = np.linspace(-1.5, 1.5, num=50).reshape(-1, 1)

In [ ]:
α_prior_vague = vague_priors['α'].reshape(1, -1)
β_prior_vague = vague_priors['β'].reshape(1, -1)
chl_l_prior_μ_vague = α_prior_vague + β_prior_vague * x_dummy

In [ ]:
f, ax = pl.subplots( figsize=(6, 5))
ax.plot(x_dummy, chl_l_prior_μ_vague, color='k', alpha=0.1,);
ax.set_xlabel(r'$log_{10}\left(\frac{green}{max(blue)}\right)$, centered')
ax.set_ylabel('$log_{10}(chl)$')
ax.set_title('Vague priors')
ax.set_ylim(-3.5, 3.5)
f.tight_layout(pad=1)
f.savefig('./figJar/Presentation/prior_checks_1.png')



In [ ]:
with pm.Model() as m_informative_prior:
    α = pm.Normal('α', mu=0, sd=0.2)
    β = pm.Normal('β', mu=0, sd=0.5)
    σ = pm.Uniform('σ', lower=0, upper=2)
    μ = α + β * x_c
    chl_i = pm.Normal('chl_i', mu=μ, sd=σ, observed=y)

In [ ]:
prior_info = pm.sample_prior_predictive(model=m_informative_prior, vars=['α', 'β'])

In [ ]:
α_prior_info = prior_info['α'].reshape(1, -1)
β_prior_info = prior_info['β'].reshape(1, -1)
chl_l_prior_info = α_prior_info + β_prior_info * x_dummy

In [ ]:
f, ax = pl.subplots( figsize=(6, 5))
ax.plot(x_dummy, chl_l_prior_info, color='k', alpha=0.1,);
ax.set_xlabel(r'$log_{10}\left(\frac{green}{max(blue)}\right)$, centered')
ax.set_ylabel('$log_{10}(chl)$')
ax.set_title('Weakly informative priors')
ax.set_ylim(-3.5, 3.5)
f.tight_layout(pad=1)
f.savefig('./figJar/Presentation/prior_checks_2.png')

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

In [ ]:
with m_vague_prior:
    trace_vague = pm.sample(2000, tune=1000, chains=4)

with m_informative_prior:
    trace_inf = pm.sample(2000, tune=1000, chains=4)

In [ ]:
f, axs = pl.subplots(ncols=2, nrows=2, figsize=(12, 7))
ar.plot_posterior(trace_vague, var_names=['α', 'β'], round_to=2, ax=axs[0,:], kind='hist');
ar.plot_posterior(trace_inf, var_names=['α', 'β'], round_to=2, ax=axs[1, :], kind='hist',
                                        color='brown');
axs[0,0].tick_params(rotation=20)
axs[0,0].text(-0.137, 430, 'vague priors',
              fontdict={'fontsize': 15})
axs[1,0].tick_params(rotation=20)
axs[1,0].text(-0.137, 430, 'informative priors',
              fontdict={'fontsize': 15})
f.tight_layout()
f.savefig('./figJar/Presentation/reg_posteriors.svg')

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

Two types of uncertainties:

  1. model uncertainty
  2. prediction uncertainty

In [ ]:
α_posterior = trace_inf.get_values('α').reshape(1, -1)
β_posterior = trace_inf.get_values('β').reshape(1, -1)
σ_posterior = trace_inf.get_values('σ').reshape(1, -1)

  1. model uncertainty: uncertainty around the model mean

In [ ]:
μ_posterior = α_posterior + β_posterior * x_dummy

In [ ]:
pl.plot(x_dummy, μ_posterior[:, ::16], color='k', alpha=0.1);
pl.plot(x_dummy, μ_posterior.mean(axis=1), color='k', label='model mean')

pl.scatter(x_c, y, color='orange', edgecolor='k', alpha=0.5, label='obs'); pl.legend();
pl.ylim(-2.5, 2.5); pl.xlim(-1, 1);
pl.xlabel(r'$log_{10}\left(\frac{Gr}{max(Blue)}\right)$')
pl.ylabel(r'$log_{10}(chlorophyll)$')
f = pl.gcf()
f.savefig('./figJar/Presentation/mu_posterior.svg')

  2. prediction uncertainty: posterior predictive checks

In [ ]:
ppc = norm.rvs(loc=μ_posterior, scale=σ_posterior);
ci_94_perc = pm.hpd(ppc.T, alpha=0.06);

In [ ]:
pl.scatter(x_c, y, color='orange', edgecolor='k', alpha=0.5, label='obs'); pl.legend();
pl.plot(x_dummy, ppc.mean(axis=1), color='k', label='mean prediction');
pl.fill_between(x_dummy.flatten(), ci_94_perc[:, 0], ci_94_perc[:, 1], alpha=0.5, color='k',
               label='94% credibility interval:\n94% chance that prediction\nwill be in here!');
pl.xlim(-1, 1); pl.ylim(-2.5, 2.5)
pl.legend(fontsize=12, loc='upper left')
f = pl.gcf()
f.savefig('./figJar/Presentation/ppc.svg')
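For reference, PyMC3 can also draw posterior predictive samples directly. The sketch below assumes PyMC3 >= 3.7, where the function is named sample_posterior_predictive; unlike the manual construction above, it predicts at the observed x_c values rather than on the x_dummy grid.

In [ ]:
# Built-in posterior predictive sampling at the observed predictor values.
# Each row of ppc_m['chl_i'] is one simulated data set drawn from the posterior.
with m_informative_prior:
    ppc_m = pm.sample_posterior_predictive(trace_inf)
print(ppc_m['chl_i'].shape)  # (number of posterior draws, number of observations)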

In conclusion, Probabilistic Programming provides:

  • Transparent modeling:
    • Explicit assumptions
    • Easy to debate/criticize
    • Easy to communicate/reproduce/improve upon
  • Posterior distributions, a much richer construct than point estimates
  • Principled estimation of model and prediction uncertainty
  • Accessibility
    • Constantly improving algorithms
    • Easy-to-use software
    • Flexible framework, largely problem-agnostic