Bayesian Modeling for the Busy and the Confused - Part II

Markov Chain Monte-Carlo

Currently, the capacity to gather data is far ahead of the ability to generate meaningful insight using conventional approaches. Hopes of alleviating this bottleneck have come through the application of machine learning tools. Among these tools, one that is increasingly gaining traction is probabilistic programming, particularly Bayesian modeling. In this paradigm, the variables that define a model carry probability distributions rather than scalar values. "Fitting" a model to data can then, simplistically, be construed as finding the appropriate parameterization of these distributions, given the model structure and the data. This offers a number of advantages over other methods, not the least of which is the estimation of uncertainty around model results. This in turn can better inform subsequent processes, such as decision-making and scientific discovery.



This is the second of a two-notebook series providing a brief, basic, but hands-on programmatic introduction to Bayesian modeling. The first notebook reviewed a few key probability principles relevant to Bayesian inference and illustrated how to put them into practice with one of the more intuitive approaches to Bayesian computation: Grid Approximation (GA). With that framework I showed how to create simple models that can be used to interpret and predict real-world data.
GA is computationally intensive, however, and quickly runs into problems when the data set is large and/or the model increases in complexity. One of the more popular solutions to this problem is the Markov Chain Monte-Carlo (MCMC) algorithm. The implementation of MCMC in Bayesian models is the subject of this second notebook.
As of this writing, the most popular programming language in machine learning is Python. Python is an easy language to pick up: pedagogical resources abound. It is free and open source, and the large number of very useful libraries written over the years has propelled it to its current place of prominence in a number of fields, machine learning among them.

I use Python (3.6+) code to illustrate the mechanics of Bayesian inference in lieu of lengthy explanations, along with a number of dedicated Python libraries that shorten the code considerably. A solid understanding of Bayesian modeling cannot be spoon-fed; it only comes from getting one's hands dirty. Emphasis is therefore on readable, reproducible code, which should ease the work the interested reader has to do to re-run the notebook and experiment with the coding and Bayesian modeling patterns presented. Some know-how is required regarding installing and running a Python distribution, the required libraries, and Jupyter notebooks; this is easily gleaned from the internet. A popular option in the machine learning community is Anaconda.


In [2]:
import pickle
import warnings
import sys
from IPython.display import Image, HTML

import pandas as pd
import numpy as np
from scipy.stats import norm, uniform  # norm is used below as norm.rvs / norm.logpdf
import pymc3 as pm
from theano import shared

import seaborn as sb
import matplotlib.pyplot as pl
from matplotlib import rcParams
from matplotlib import ticker as mtick
import arviz as ar

In [ ]:
print('Versions:')
print('---------')
print(f'python:  {sys.version.split("|")[0]}')
print(f'numpy:   {np.__version__}')
print(f'pandas:  {pd.__version__}')
print(f'seaborn: {sb.__version__}')
print(f'pymc3:   {pm.__version__}')
print(f'arviz:   {ar.__version__}')

In [ ]:
%matplotlib inline
warnings.filterwarnings('ignore',  category=FutureWarning)

Under the hood: Inferring chlorophyll distribution

  • Grid approximation: computing probability everywhere
  • Magical MCMC: Dealing with computational complexity
  • Probabilistic Programming with PyMC3: Industrial grade MCMC

Magical MCMC: Dealing with computational complexity

  • Grid approximation:
    • useful for understanding mechanics of Bayesian computation
    • computationally intensive
    • impractical and often intractable for large data sets or high-dimension models
  • MCMC allows sampling where it **probabilistically matters**:
    • compute current probability given location in parameter space
    • propose jump to new location in parameter space
    • compute new probability at proposed location
    • jump to the new location if $\frac{new\ probability}{current\ probability}>1$
    • otherwise draw $\gamma\sim Uniform(0, 1)$ and jump if $\frac{new\ probability}{current\ probability}>\gamma$
    • otherwise stay at the current location
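To make the accept/reject rule concrete, here is a toy numeric sketch of a single Metropolis decision (the log-probabilities below are made-up numbers for illustration only, not values from the chlorophyll data):

In [ ]:
# Toy illustration of one Metropolis accept/reject decision (hypothetical numbers).
log_prob_current = -4.2   # log probability at the current location
log_prob_proposed = -4.9  # log probability at the proposed location
ratio = np.exp(log_prob_proposed - log_prob_current)  # ≈ 0.50
γ = uniform.rvs(loc=0, scale=1)                       # γ ~ Uniform(0, 1)
accept = (ratio > 1) or (ratio > γ)                   # accepted roughly half the time here
print(f'ratio = {ratio:.2f}, γ = {γ:.2f}, accept proposal: {accept}')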

In [ ]:
def mcmc(data, μ_0=0.5, n_samples=1000,):
    print(f'{data.size} data points')
    data = data.reshape(1, -1)
    σ = 0.75  # keep σ fixed for simplicity; only μ is sampled
    trace_μ = np.nan * np.ones(n_samples)  # trace: where the sampler has been
    trace_μ[0] = μ_0  # start with a first guess
    for i in range(1, n_samples):
        # propose a jump centered on the current location
        proposed_μ = norm.rvs(loc=trace_μ[i-1], scale=0.1, size=1)
        prop_par_dict = dict(μ=proposed_μ, σ=σ)
        curr_par_dict = dict(μ=trace_μ[i-1], σ=σ)
        # unnormalized log posterior = log likelihood + log prior
        log_prob_prop = get_log_lik(data, prop_par_dict
                                   ) + get_log_prior(prop_par_dict)
        log_prob_curr = get_log_lik(data, curr_par_dict
                                   ) + get_log_prior(curr_par_dict)
        ratio = np.exp(log_prob_prop - log_prob_curr)
        if ratio > 1:
            # proposal is more probable: accept it
            trace_μ[i] = proposed_μ
        else:
            # otherwise accept the proposal with probability equal to the ratio
            if uniform.rvs(size=1, loc=0, scale=1) > ratio:
                # reject proposal, stay put
                trace_μ[i] = trace_μ[i-1]
            else:
                # accept proposal
                trace_μ[i] = proposed_μ
    return trace_μ

In [ ]:
def get_log_lik(data, param_dict):
    # log likelihood of the data under a Gaussian with the candidate parameters
    return np.sum(norm.logpdf(data, loc=param_dict['μ'],
                              scale=param_dict['σ']
                             ),
                  axis=1)

def get_log_prior(par_dict, loc=1, scale=1):
    # Gaussian log prior on μ
    return norm.logpdf(par_dict['μ'], loc=loc, scale=scale)

Timing MCMC


In [ ]:
%%time
mcmc_n_samples = 2000
trace1 = mcmc(data=df_data_s.chl_l.values, n_samples=mcmc_n_samples)

In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
ax[0].plot(np.arange(mcmc_n_samples), trace1, marker='.',
           ls=':', color='k')
ax[0].set_title(f'trace of μ, {df_data_s.chl_l.size} data points')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(trace1, ax=ax[1], label='mcmc',
                 color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
df_μ = df_grid_3.groupby(['μ']).sum().drop('σ',
                                     axis=1)[['post_prob']
                                            ].reset_index()
ax2 = ax[1].twinx()
df_μ.plot(x='μ', y='post_prob', ax=ax2, color='k',
         label='grid',)
ax2.set_ylim(bottom=0);
ax2.legend(loc='upper right')
f.tight_layout()

f.savefig('./figJar/Presentation/mcmc_1.svg')


In [ ]:
%%time
samples = 2000
trace2 = mcmc(data=df_data.chl_l.values, n_samples=samples)

In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
ax[0].plot(np.arange(samples), trace2, marker='.',
           ls=':', color='k')
ax[0].set_title(f'trace of μ, {df_data.chl_l.size} data points')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(trace2, ax=ax[1], label='mcmc',
                 color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
f.tight_layout()
f.savefig('./figJar/Presentation/mcmc_2.svg')


In [ ]:
f, ax = pl.subplots(ncols=2, figsize=(12, 5))
ax[0].stem(pm.autocorr(trace1[1500:]))
ax[1].stem(pm.autocorr(trace2[1500:]))
ax[0].set_title(f'{df_data_s.chl_l.size} data points')
ax[1].set_title(f'{df_data.chl_l.size} data points')
f.suptitle('trace autocorrelation', fontsize=19)
f.savefig('./figJar/Presentation/grid8.svg')

In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
thinned_trace = np.random.choice(trace2[100:], size=200, replace=False)
ax[0].plot(np.arange(200), thinned_trace, marker='.',
           ls=':', color='k')
ax[0].set_title('thinned trace of μ')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(thinned_trace, ax=ax[1], label='mcmc',
                 color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
f.tight_layout()
f.savefig('./figJar/Presentation/grid9.svg')

In [ ]:
f, ax = pl.subplots()
ax.stem(pm.autocorr(thinned_trace[:20]));
f.savefig('./figJar/Presentation/stem2.svg', dpi=300, format='svg');

What's going on?

Highly autocorrelated trace:
$\rightarrow$ inadequate parameter space exploration
$\rightarrow$ poor convergence (see the effective-sample-size sketch after the list below)

Metropolis MCMC:
$\rightarrow$ easy to implement and memory efficient
$\rightarrow$ but inefficient at exploring parameter space
$\rightarrow$ is there a better MCMC sampler?

  • Hamiltonian Monte Carlo (HMC)
  • Greatly improved convergence
  • Well-mixed traces are its signature and an easy diagnostic
  • HMC does, however, require a lot of tuning
  • Not practical for the inexperienced applied statistician or scientist
  • No-U-Turn Sampler (NUTS): HMC that automates most tuning steps
  • NUTS scales well to complex problems with many parameters (1000's)
  • Implemented in popular libraries
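The autocorrelation seen above can be summarized as an effective sample size. Below is a quick sketch of that check, assuming an arviz version that provides ar.ess and accepts a plain 1-D array (treated as a single chain); it is not part of the original analysis.

In [ ]:
# Rough convergence check: effective sample size of the raw vs. thinned traces.
# A heavily autocorrelated Metropolis trace yields far fewer effective samples
# than its nominal number of draws.
ess_raw = float(ar.ess(trace2[100:])['x'])
ess_thinned = float(ar.ess(thinned_trace)['x'])
print(f'raw trace:     {trace2[100:].size} draws, ~{ess_raw:.0f} effective samples')
print(f'thinned trace: {thinned_trace.size} draws, ~{ess_thinned:.0f} effective samples')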
Probabilistic modeling for the beginner
  • Under the hood: Inferring chlorophyll distribution
    • Grid approximation: computing probability everywhere
    • MCMC: how it works
    • Probabilistic Programming with PyMC3: Industrial grade MCMC

Probabilistic Programming with PyMC3

  • relatively simple syntax
  • easily used in conjunction with mainstream Python scientific data structures:
    $\rightarrow$ numpy arrays
    $\rightarrow$ pandas dataframes
  • models of reasonable complexity span ~10-20 lines

In [ ]:
with pm.Model() as m1:
    μ_ = pm.Normal('μ', mu=1, sd=1)
    σ = pm.Uniform('σ', lower=0, upper=2)
    lkl = pm.Normal('likelihood', mu=μ_, sd=σ,
                    observed=df_data.chl_l.dropna())

In [ ]:
graph_m1 = pm.model_to_graphviz(m1)
graph_m1.format = 'svg'
graph_m1.render('./figJar/Presentation/graph_m1');

In [ ]:
with m1:
    trace_m1 = pm.sample(2000, tune=1000, chains=4)

In [ ]:
pm.traceplot(trace_m1);

In [ ]:
ar.plot_posterior(trace_m1, kind='hist', round_to=2);
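As an additional check (a sketch, not part of the original notebook), PyMC3's summary table reports posterior means, credible intervals, and convergence diagnostics; the exact columns vary with the PyMC3/arviz version, but an R-hat close to 1 indicates the four chains agree.

In [ ]:
# Tabular posterior summary with convergence diagnostics
# (R-hat ~ 1.0 signals well-mixed chains).
pm.summary(trace_m1)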

Tutorial Overview:

  • Probabilistic modeling for the beginner
    $\rightarrow$The basics
    $\rightarrow$Starting easy: inferring chlorophyll
    $\rightarrow$Regression: adding a predictor to estimate chlorophyll

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

Linear regression takes the form

$$ y = \alpha + \beta x $$

where $y = log_{10}(chl)$ and $x = log_{10}\left(\frac{Gr}{MxBl}\right)$.


In [ ]:
df_data.head().T

In [ ]:
df_data['Gr-MxBl'] = -1 * df_data['MxBl-Gr']

Regression coefficients are easier to interpret with a centered predictor:

$$x_c = x - \bar{x}$$


In [ ]:
df_data['Gr-MxBl_c'] = df_data['Gr-MxBl'] - df_data['Gr-MxBl'].mean()

In [ ]:
df_data[['Gr-MxBl_c', 'chl_l']].info()

In [ ]:
x_c = df_data.dropna()['Gr-MxBl_c'].values
y = df_data.dropna().chl_l.values

$$ y = \alpha + \beta x_c$$


$\rightarrow \alpha=$ expected $y$ when $x=\bar{x}$ (i.e. $x_c=0$)
$\rightarrow \beta=\Delta y$ when $x$ increases by one unit


In [ ]:
g3 = sb.PairGrid(df_data.loc[:, ['Gr-MxBl_c', 'chl_l']], height=3,
                         diag_sharey=False,)
g3.map_diag(sb.kdeplot, color='k')
g3.map_offdiag(sb.scatterplot, color='k');
make_lower_triangle(g3)
f = pl.gcf()
axs = f.get_axes()
xlabel = r'$log_{10}\left(\frac{Rrs_{green}}{max(Rrs_{blue})}\right), centered$'
ylabel = r'$log_{10}(chl)$'
axs[0].set_xlabel(xlabel)
axs[2].set_xlabel(xlabel)
axs[2].set_ylabel(ylabel)
axs[3].set_xlabel(ylabel)
f.tight_layout()
f.savefig('./figJar/Presentation/pairwise_1.png')

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

In [ ]:
with pm.Model() as m_vague_prior:
    # priors
    σ = pm.Uniform('σ', lower=0, upper=2)
    α = pm.Normal('α', mu=0, sd=1)
    β = pm.Normal('β', mu=0, sd=1)
    # deterministic model
    μ = α + β * x_c
    # likelihood
    chl_i = pm.Normal('chl_i', mu=μ, sd=σ, observed=y)

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

In [ ]:
vague_priors = pm.sample_prior_predictive(samples=500, model=m_vague_prior, vars=['α', 'β',])

In [ ]:
x_dummy = np.linspace(-1.5, 1.5, num=50).reshape(-1, 1)

In [ ]:
α_prior_vague = vague_priors['α'].reshape(1, -1)
β_prior_vague = vague_priors['β'].reshape(1, -1)
chl_l_prior_μ_vague = α_prior_vague + β_prior_vague * x_dummy

In [ ]:
f, ax = pl.subplots( figsize=(6, 5))
ax.plot(x_dummy, chl_l_prior_μ_vague, color='k', alpha=0.1,);
ax.set_xlabel(r'$log_{10}\left(\frac{green}{max(blue)}\right)$, centered')
ax.set_ylabel('$log_{10}(chl)$')
ax.set_title('Vague priors')
ax.set_ylim(-3.5, 3.5)
f.tight_layout(pad=1)
f.savefig('./figJar/Presentation/prior_checks_1.png')



In [ ]:
with pm.Model() as m_informative_prior:
    α = pm.Normal('α', mu=0, sd=0.2)
    β = pm.Normal('β', mu=0, sd=0.5)
    σ = pm.Uniform('σ', lower=0, upper=2)
    μ = α + β * x_c
    chl_i = pm.Normal('chl_i', mu=μ, sd=σ, observed=y)

In [ ]:
prior_info = pm.sample_prior_predictive(model=m_informative_prior, vars=['α', 'β'])

In [ ]:
α_prior_info = prior_info['α'].reshape(1, -1)
β_prior_info = prior_info['β'].reshape(1, -1)
chl_l_prior_info = α_prior_info + β_prior_info * x_dummy

In [ ]:
f, ax = pl.subplots( figsize=(6, 5))
ax.plot(x_dummy, chl_l_prior_info, color='k', alpha=0.1,);
ax.set_xlabel(r'$log_{10}\left(\frac{green}{max(blue)}\right)$, centered')
ax.set_ylabel('$log_{10}(chl)$')
ax.set_title('Weakly informative priors')
ax.set_ylim(-3.5, 3.5)
f.tight_layout(pad=1)
f.savefig('./figJar/Presentation/prior_checks_2.png')

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

In [ ]:
with m_vague_prior:
    trace_vague = pm.sample(2000, tune=1000, chains=4)

with m_informative_prior:
    trace_inf = pm.sample(2000, tune=1000, chains=4)

In [ ]:
f, axs = pl.subplots(ncols=2, nrows=2, figsize=(12, 7))
ar.plot_posterior(trace_vague, var_names=['α', 'β'], round_to=2, ax=axs[0,:], kind='hist');
ar.plot_posterior(trace_inf, var_names=['α', 'β'], round_to=2, ax=axs[1, :], kind='hist',
                                        color='brown');
axs[0,0].tick_params(rotation=20)
axs[0,0].text(-0.137, 430, 'vague priors',
              fontdict={'fontsize': 15})
axs[1,0].tick_params(rotation=20)
axs[1,0].text(-0.137, 430, 'informative priors',
              fontdict={'fontsize': 15})
f.tight_layout()
f.savefig('./figJar/Presentation/reg_posteriors.svg')

Regression: Adding a predictor to estimate chlorophyll

  • Data preparation
  • Writing a regression model in PyMC3
  • Are my priors making sense?
  • Model fitting
  • Flavors of uncertainty

Two types of uncertainties:

  1. model uncertainty
  2. prediction uncertainty

In [ ]:
α_posterior = trace_inf.get_values('α').reshape(1, -1)
β_posterior = trace_inf.get_values('β').reshape(1, -1)
σ_posterior = trace_inf.get_values('σ').reshape(1, -1)

  1. model uncertainty: uncertainty around the model mean

In [ ]:
μ_posterior = α_posterior + β_posterior * x_dummy

In [ ]:
pl.plot(x_dummy, μ_posterior[:, ::16], color='k', alpha=0.1);
pl.plot(x_dummy, μ_posterior.mean(axis=1), color='k', label='model mean')

pl.scatter(x_c, y, color='orange', edgecolor='k', alpha=0.5, label='obs'); pl.legend();
pl.ylim(-2.5, 2.5); pl.xlim(-1, 1);
pl.xlabel(r'$log_{10}\left(\frac{Gr}{max(Blue)}\right)$')
pl.ylabel(r'$log_{10}(chlorophyll)$')
f = pl.gcf()
f.savefig('./figJar/Presentation/mu_posterior.svg')

  2. prediction uncertainty: posterior predictive checks

In [ ]:
ppc = norm.rvs(loc=μ_posterior, scale=σ_posterior);
ci_94_perc = pm.hpd(ppc.T, alpha=0.06);

In [ ]:
pl.scatter(x_c, y, color='orange', edgecolor='k', alpha=0.5, label='obs'); pl.legend();
pl.plot(x_dummy, ppc.mean(axis=1), color='k', label='mean prediction');
pl.fill_between(x_dummy.flatten(), ci_94_perc[:, 0], ci_94_perc[:, 1], alpha=0.5, color='k',
               label='94% credibility interval:\n94% chance that prediction\nwill be in here!');
pl.xlim(-1, 1); pl.ylim(-2.5, 2.5)
pl.legend(fontsize=12, loc='upper left')
f = pl.gcf()
f.savefig('./figJar/Presentation/ppc.svg')
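For reference, PyMC3 can also draw posterior predictive samples directly. The sketch below assumes PyMC3 >= 3.7, where the function is named sample_posterior_predictive; unlike the manual construction above, it predicts at the observed x_c values rather than on the x_dummy grid.

In [ ]:
# Built-in posterior predictive sampling at the observed predictor values.
# Each row of ppc_m['chl_i'] is one simulated data set drawn from the posterior.
with m_informative_prior:
    ppc_m = pm.sample_posterior_predictive(trace_inf)
print(ppc_m['chl_i'].shape)  # (number of posterior draws, number of observations)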

In conclusion, Probabilistic Programming provides:

  • Transparent modeling:
    • Explicit assumptions
    • Easy to debate/criticize
    • Easy to communicate/reproduce/improve upon
  • Posterior distributions, a much richer construct than point estimates
  • Principled estimation of model and prediction uncertainty
  • Accessibility
    • Constantly improving algorithms
    • Easy-to-use software
    • Flexible framework, largely problem-agnostic