Currently, the capacity to gather data far outstrips the ability to generate meaningful insight from it using conventional approaches. Hope of alleviating this bottleneck has come from the application of machine learning tools. Among these tools, one that is increasingly gaining traction is probabilistic programming, particularly Bayesian modeling. In this paradigm, the variables that define a model carry a probability distribution rather than a scalar value. "Fitting" a model to data can then, simplistically, be construed as finding the appropriate parameterization of these distributions, given the model structure and the data. This offers a number of advantages over other methods, not the least of which is the estimation of uncertainty around model results, which in turn can better inform subsequent processes such as decision-making and scientific discovery.
This is the first notebook in a two-part series offering a brief, basic, but hands-on programmatic introduction to Bayesian modeling. It opens with an overview of a few key probability principles relevant to Bayesian inference, followed by an illustration of how to put them into practice. In particular, I explain one of the more intuitive approaches to Bayesian computation, Grid Approximation (GA), and use it to build simple models that can interpret and predict real-world data.
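The thread running through all of this is Bayes' theorem, which expresses the posterior distribution of the model parameters $\theta$ given the data $D$ in terms of the likelihood and the prior:
$$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} $$
Every computational approach presented in these notebooks, GA included, is at bottom a strategy for evaluating or approximating this posterior.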
GA is computationally intensive and quickly runs into problems as the data set grows and/or the model increases in complexity. One of the more popular solutions to this problem is the Markov Chain Monte Carlo (MCMC) algorithm. The implementation of MCMC in Bayesian models will be the subject of the second notebook of this series.
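To get a sense of the scale of the problem GA faces: a grid with, say, 100 points per parameter must evaluate the unnormalized posterior at $100^k$ points for a model with $k$ parameters, i.e. $10^4$ evaluations for two parameters but already $10^{10}$ for five (the 100-point resolution is an arbitrary figure chosen here for illustration).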
As of this writing, the most popular programming language in machine learning is Python. Python is an easy language to pick up, and pedagogical resources abound. It is free and open source, and the large number of very useful libraries written over the years has propelled it to its current place of prominence in a number of fields, machine learning included.
I use Python (3.6+) code to illustrate the mechanics of Bayesian inference in lieu of lengthy explanations, along with a number of dedicated Python libraries that shorten the code considerably. A solid understanding of Bayesian modeling cannot be spoon-fed; it can only come from getting one's hands dirty. The emphasis is therefore on readable, reproducible code, which should ease the work the interested reader has to do to re-run the notebook and experiment with the coding and Bayesian modeling patterns presented here. Some know-how is required for installing and running a Python distribution, the required libraries, and Jupyter notebooks; this is easily gleaned from the internet. A popular option in the machine learning community is Anaconda.
In [2]:
import pickle
import warnings
import sys
from IPython.display import Image, HTML
import pandas as pd
import numpy as np
from scipy.stats import norm, uniform
import pymc3 as pm
from theano import shared
import seaborn as sb
import matplotlib.pyplot as pl
from matplotlib import rcParams
from matplotlib import ticker as mtick
import arviz as ar
In [ ]:
print('Versions:')
print('---------')
print(f'python: {sys.version.split("|")[0]}')
print(f'numpy: {np.__version__}')
print(f'pandas: {pd.__version__}')
print(f'seaborn: {sb.__version__}')
print(f'pymc3: {pm.__version__}')
print(f'arviz: {ar.__version__}')
In [ ]:
%matplotlib inline
warnings.filterwarnings('ignore', category=FutureWarning)
In [ ]:
def mcmc(data, μ_0=0.5, n_samples=1000):
    print(f'{data.size} data points')
    data = data.reshape(1, -1)
    # set priors
    σ = 0.75  # keep σ fixed for simplicity
    trace_μ = np.nan * np.ones(n_samples)  # trace: where the sampler has been
    trace_μ[0] = μ_0  # start with a first guess
    for i in range(1, n_samples):
        proposed_μ = norm.rvs(loc=trace_μ[i-1], scale=0.1, size=1)
        prop_par_dict = dict(μ=proposed_μ, σ=σ)
        curr_par_dict = dict(μ=trace_μ[i-1], σ=σ)
        log_prob_prop = get_log_lik(data, prop_par_dict) + get_log_prior(prop_par_dict)
        log_prob_curr = get_log_lik(data, curr_par_dict) + get_log_prior(curr_par_dict)
        ratio = np.exp(log_prob_prop - log_prob_curr)
        if ratio > 1:
            # accept proposal
            trace_μ[i] = proposed_μ
        else:
            # evaluate low probability proposal
            if uniform.rvs(size=1, loc=0, scale=1) > ratio:
                # reject proposal
                trace_μ[i] = trace_μ[i-1]
            else:
                # accept proposal
                trace_μ[i] = proposed_μ
    return trace_μ
In [ ]:
def get_log_lik(data, param_dict):
    return np.sum(norm.logpdf(data, loc=param_dict['μ'], scale=param_dict['σ']),
                  axis=1)

def get_log_prior(par_dict, loc=1, scale=1):
    return norm.logpdf(par_dict['μ'], loc=loc, scale=scale)
In [ ]:
%%time
mcmc_n_samples = 2000
trace1 = mcmc(data=df_data_s.chl_l.values, n_samples=mcmc_n_samples)
In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
ax[0].plot(np.arange(mcmc_n_samples), trace1, marker='.',
ls=':', color='k')
ax[0].set_title('trace of μ, 500 data points')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(trace1, ax=ax[1], label='mcmc',
color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
df_μ = df_grid_3.groupby(['μ']).sum().drop('σ', axis=1)[['post_prob']].reset_index()
ax2 = ax[1].twinx()
df_μ.plot(x='μ', y='post_prob', ax=ax2, color='k',
label='grid',)
ax2.set_ylim(bottom=0);
ax2.legend(loc='upper right')
f.tight_layout()
f.savefig('./figJar/Presentation/mcmc_1.svg')
In [ ]:
%%time
samples = 2000
trace2 = mcmc(data=df_data.chl_l.values, n_samples=samples)
In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
ax[0].plot(np.arange(samples), trace2, marker='.',
ls=':', color='k')
ax[0].set_title(f'trace of μ, {df_data.chl_l.size} data points')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(trace2, ax=ax[1], label='mcmc',
color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
f.tight_layout()
f.savefig('./figJar/Presentation/mcmc_2.svg')
In [ ]:
f, ax = pl.subplots(ncols=2, figsize=(12, 5))
ax[0].stem(pm.autocorr(trace1[1500:]))
ax[1].stem(pm.autocorr(trace2[1500:]))
ax[0].set_title(f'{df_data_s.chl_l.size} data points')
ax[1].set_title(f'{df_data.chl_l.size} data points')
f.suptitle('trace autocorrelation', fontsize=19)
f.savefig('./figJar/Presentation/grid8.svg')
In [ ]:
f, ax = pl.subplots(nrows=2, figsize=(8, 8))
thinned_trace = np.random.choice(trace2[100:], size=200, replace=False)
ax[0].plot(np.arange(200), thinned_trace, marker='.',
ls=':', color='k')
ax[0].set_title('thinned trace of μ')
ax[1].set_title('μ marginal posterior')
pm.plots.kdeplot(thinned_trace, ax=ax[1], label='mcmc',
color='orange', lw=2, zorder=1)
ax[1].legend(loc='upper left')
ax[1].set_ylim(bottom=0)
f.tight_layout()
f.savefig('./figJar/Presentation/grid9.svg')
In [ ]:
f, ax = pl.subplots()
ax.stem(pm.autocorr(thinned_trace[:20]));
f.savefig('./figJar/Presentation/stem2.svg', dpi=300, format='svg');
What's going on?
Highly autocorrelated trace:
$\rightarrow$ inadequate parameter space exploration
$\rightarrow$ poor convergence...
Metropolis MCMC
$\rightarrow$ easy to implement + memory efficient
$\rightarrow$ inefficient parameter space exploration
$\rightarrow$ better MCMC sampler?
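Before reaching for a better sampler, it helps to put a number on what the autocorrelation costs. A common summary is the effective sample size (ESS): roughly, the number of independent draws an autocorrelated chain is worth, $N / (1 + 2\sum_t \rho_t)$, where $\rho_t$ is the lag-$t$ autocorrelation. The helper below is a rough sketch of my own (it is not used elsewhere in this notebook); the arviz package imported above also ships an ess diagnostic that does this more carefully.
In [ ]:
def effective_sample_size(trace, max_lag=200):
    """Rough ESS estimate: N / (1 + 2 * sum of initial positive autocorrelations)."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    n = x.size
    # empirical autocorrelation, normalized so the zero-lag value is 1
    acf = np.correlate(x, x, mode='full')[n - 1:] / (x.var() * n)
    rho = acf[1:max_lag]
    # keep lags up to the first negative autocorrelation
    if np.any(rho < 0):
        rho = rho[:np.argmax(rho < 0)]
    return n / (1 + 2 * rho.sum())

# e.g. effective_sample_size(trace2[1500:]) -- for a highly autocorrelated
# trace this should come out well below the 500 retained draws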
In [ ]:
with pm.Model() as m1:
    μ_ = pm.Normal('μ', mu=1, sd=1)
    σ = pm.Uniform('σ', lower=0, upper=2)
    lkl = pm.Normal('likelihood', mu=μ_, sd=σ,
                    observed=df_data.chl_l.dropna())
In [ ]:
graph_m1 = pm.model_to_graphviz(m1)
graph_m1.format = 'svg'
graph_m1.render('./figJar/Presentation/graph_m1');
In [ ]:
with m1:
    trace_m1 = pm.sample(2000, tune=1000, chains=4)
In [ ]:
pm.traceplot(trace_m1);
In [ ]:
ar.plot_posterior(trace_m1, kind='hist', round_to=2);
Linear regression takes the form
$$ y = \alpha + \beta x $$
where
$$ y = log_{10}(chl) $$
and
$$ x = log_{10}\left(\frac{Gr}{MxBl}\right) $$
In [ ]:
df_data.head().T
In [ ]:
df_data['Gr-MxBl'] = -1 * df_data['MxBl-Gr']
Regression coefficients easier to interpret with centered predictor:
$$x_c = x - \bar{x}$$
In [ ]:
df_data['Gr-MxBl_c'] = df_data['Gr-MxBl'] - df_data['Gr-MxBl'].mean()
In [ ]:
df_data[['Gr-MxBl_c', 'chl_l']].info()
In [ ]:
x_c = df_data.dropna()['Gr-MxBl_c'].values
y = df_data.dropna().chl_l.values
$\rightarrow \alpha=y$ when $x=\bar{x}$
$\rightarrow \beta=\Delta y$ when $x$ increases by one unit
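Spelling the model out in terms of the centered predictor makes these two interpretations explicit:
$$ y = \alpha + \beta x_c = \alpha + \beta (x - \bar{x}) $$
so at $x = \bar{x}$ the slope term vanishes and $y = \alpha$, while a one-unit increase in $x$ changes $y$ by $\beta$.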
In [ ]:
g3 = sb.PairGrid(df_data.loc[:, ['Gr-MxBl_c', 'chl_l']], height=3,
diag_sharey=False,)
g3.map_diag(sb.kdeplot, color='k')
g3.map_offdiag(sb.scatterplot, color='k');
make_lower_triangle(g3)
f = pl.gcf()
axs = f.get_axes()
xlabel = r'$log_{10}\left(\frac{Rrs_{green}}{max(Rrs_{blue})}\right), centered$'
ylabel = r'$log_{10}(chl)$'
axs[0].set_xlabel(xlabel)
axs[2].set_xlabel(xlabel)
axs[2].set_ylabel(ylabel)
axs[3].set_xlabel(ylabel)
f.tight_layout()
f.savefig('./figJar/Presentation/pairwise_1.png')
In [ ]:
with pm.Model() as m_vague_prior:
    # priors
    σ = pm.Uniform('σ', lower=0, upper=2)
    α = pm.Normal('α', mu=0, sd=1)
    β = pm.Normal('β', mu=0, sd=1)
    # deterministic model
    μ = α + β * x_c
    # likelihood
    chl_i = pm.Normal('chl_i', mu=μ, sd=σ, observed=y)
In [ ]:
vague_priors = pm.sample_prior_predictive(samples=500, model=m_vague_prior, vars=['α', 'β',])
In [ ]:
x_dummy = np.linspace(-1.5, 1.5, num=50).reshape(-1, 1)
In [ ]:
α_prior_vague = vague_priors['α'].reshape(1, -1)
β_prior_vague = vague_priors['β'].reshape(1, -1)
chl_l_prior_μ_vague = α_prior_vague + β_prior_vague * x_dummy
In [ ]:
f, ax = pl.subplots( figsize=(6, 5))
ax.plot(x_dummy, chl_l_prior_μ_vague, color='k', alpha=0.1,);
ax.set_xlabel(r'$log_{10}\left(\frac{green}{max(blue)}\right)$, centered')
ax.set_ylabel('$log_{10}(chl)$')
ax.set_title('Vague priors')
ax.set_ylim(-3.5, 3.5)
f.tight_layout(pad=1)
f.savefig('./figJar/Presentation/prior_checks_1.png')
In [ ]:
with pm.Model() as m_informative_prior:
    α = pm.Normal('α', mu=0, sd=0.2)
    β = pm.Normal('β', mu=0, sd=0.5)
    σ = pm.Uniform('σ', lower=0, upper=2)
    μ = α + β * x_c
    chl_i = pm.Normal('chl_i', mu=μ, sd=σ, observed=y)
In [ ]:
prior_info = pm.sample_prior_predictive(model=m_informative_prior, vars=['α', 'β'])
In [ ]:
α_prior_info = prior_info['α'].reshape(1, -1)
β_prior_info = prior_info['β'].reshape(1, -1)
chl_l_prior_info = α_prior_info + β_prior_info * x_dummy
In [ ]:
f, ax = pl.subplots( figsize=(6, 5))
ax.plot(x_dummy, chl_l_prior_info, color='k', alpha=0.1,);
ax.set_xlabel(r'$log_{10}\left(\frac{green}{max(blue)}\right)$, centered')
ax.set_ylabel('$log_{10}(chl)$')
ax.set_title('Weakly informative priors')
ax.set_ylim(-3.5, 3.5)
f.tight_layout(pad=1)
f.savefig('./figJar/Presentation/prior_checks_2.png')
In [ ]:
with m_vague_prior:
    trace_vague = pm.sample(2000, tune=1000, chains=4)
with m_informative_prior:
    trace_inf = pm.sample(2000, tune=1000, chains=4)
In [ ]:
f, axs = pl.subplots(ncols=2, nrows=2, figsize=(12, 7))
ar.plot_posterior(trace_vague, var_names=['α', 'β'], round_to=2, ax=axs[0,:], kind='hist');
ar.plot_posterior(trace_inf, var_names=['α', 'β'], round_to=2, ax=axs[1, :], kind='hist',
color='brown');
axs[0,0].tick_params(rotation=20)
axs[0,0].text(-0.137, 430, 'vague priors',
fontdict={'fontsize': 15})
axs[1,0].tick_params(rotation=20)
axs[1,0].text(-0.137, 430, 'informative priors',
fontdict={'fontsize': 15})
f.tight_layout()
f.savefig('./figJar/Presentation/reg_posteriors.svg')
Two types of uncertainty come into play here: uncertainty about the model mean $\mu = \alpha + \beta x$, which reflects the posterior spread of $\alpha$ and $\beta$, and uncertainty about individual observations, which additionally involves $\sigma$.
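In equation form, the distribution of a predicted observation $\tilde{y}$ integrates the likelihood over the posterior of all three parameters:
$$ p(\tilde{y} \mid D) = \int p(\tilde{y} \mid \alpha, \beta, \sigma)\, p(\alpha, \beta, \sigma \mid D)\, d\alpha\, d\beta\, d\sigma $$
The cells below approximate this integral by first computing $\mu = \alpha + \beta x$ from the posterior samples (uncertainty in the mean), then drawing from a normal with that mean and the sampled $\sigma$ (uncertainty in individual predictions).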
In [ ]:
α_posterior = trace_inf.get_values('α').reshape(1, -1)
β_posterior = trace_inf.get_values('β').reshape(1, -1)
σ_posterior = trace_inf.get_values('σ').reshape(1, -1)
In [ ]:
μ_posterior = α_posterior + β_posterior * x_dummy
In [ ]:
pl.plot(x_dummy, μ_posterior[:, ::16], color='k', alpha=0.1);
pl.plot(x_dummy, μ_posterior.mean(axis=1), color='k', label='model mean')
pl.scatter(x_c, y, color='orange', edgecolor='k', alpha=0.5, label='obs'); pl.legend();
pl.ylim(-2.5, 2.5); pl.xlim(-1, 1);
pl.xlabel(r'$log_{10}\left(\frac{Gr}{max(Blue)}\right)$')
pl.ylabel(r'$log_{10}(chlorophyll)$')
f = pl.gcf()
f.savefig('./figJar/Presentation/mu_posterior.svg')
In [ ]:
ppc = norm.rvs(loc=μ_posterior, scale=σ_posterior);
ci_94_perc = pm.hpd(ppc.T, alpha=0.06);
In [ ]:
pl.scatter(x_c, y, color='orange', edgecolor='k', alpha=0.5, label='obs'); pl.legend();
pl.plot(x_dummy, ppc.mean(axis=1), color='k', label='mean prediction');
pl.fill_between(x_dummy.flatten(), ci_94_perc[:, 0], ci_94_perc[:, 1], alpha=0.5, color='k',
label='94% credibility interval:\n94% chance that prediction\nwill be in here!');
pl.xlim(-1, 1); pl.ylim(-2.5, 2.5)
pl.legend(fontsize=12, loc='upper left')
f = pl.gcf()
f.savefig('./figJar/Presentation/ppc.svg')