This notebook compares various techniques for initializing NUTS and finds that ADVI works best. Since then, we have made ADVI initialization the default when you don't specify a step method. Thus, if you run pymc3.sample(500) on a continuous model, it will already be initialized correctly.
I caught up with Thomas Wiecki after his talk at ODSC London and he mentioned a potential speed increase for NUTS sampling by using ADVI outputs to set the covariance scaling matrix.
This seems like a great idea, and there's already a good example in the docs, but I wanted to try it myself and get a feel for the speed increase.
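As a rough sketch of the idea in plain NumPy (the numbers below are made up for illustration, and this is not the PyMC3 API): ADVI fits an independent Gaussian to each free parameter, so its per-parameter standard deviations square into a variance vector that can serve as the diagonal of a covariance scaling matrix for NUTS:

```python
import numpy as np

# hypothetical ADVI posterior standard deviations, one per free parameter
advi_stds = np.array([4.2, 0.35, 0.004, 0.02])

# square elementwise to get variances: the diagonal of the covariance matrix
scaling = np.power(advi_stds, 2)

# the full (diagonal) covariance scaling matrix NUTS would use
cov = np.diag(scaling)
```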
In this Notebook I generate a small, noisy dataset according to a simple linear model, and attempt to recover the parameters, comparing 3 techniques:

1. NUTS with the default initialization
2. NUTS started at the ADVI means
3. NUTS started at the ADVI means and scaled using the ADVI standard deviations

(A Metropolis run is also included as a baseline.)
$> less conda_env_pymc3_examples.yml
name: pymc3_examples
channels:
- defaults
dependencies:
- python=3.5
- jupyter
- ipywidgets
- numpy
- scipy
- matplotlib
- pandas
- pip
- pip:
- watermark
- pymc3
$> conda env create --file conda_env_pymc3_examples.yml
$> source activate pymc3_examples
In [1]:
%matplotlib inline
In [2]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from scipy import optimize
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = 12, 4
rndst = np.random.RandomState(0)
%load_ext watermark
%watermark -dmvgp numpy,pandas,matplotlib,pymc3,theano,joblib
In [3]:
def generate_data(n=20, p=0, a=1, b=1, c=0, latent_sigma_y=20):
    '''
    Create a toy dataset based on a very simple model that we might
    imagine is a noisy physical process:
        1. random x values within a range
        2. latent error aka inherent noise in y
        3. optionally create labelled outliers with larger noise

    Model form: y ~ a + bx + cx^2 + e

    NOTE: latent_sigma_y is used to create a normally distributed,
    'latent error' aka 'inherent noise' in the 'physical process'
    generating these values, rather than experimental measurement error.
    Please don't use the returned `latent_error` values in inferential
    models; they're returned in the dataframe for interest only.
    '''
    rndst = np.random.RandomState(0)
    df = pd.DataFrame({'x': rndst.choice(np.arange(max(100, 2*n)), n, replace=False)})

    ## create linear or quadratic model
    df['y'] = a + b*(df['x']) + c*(df['x'])**2

    ## create latent noise and marked outliers
    df['latent_error'] = rndst.normal(0, latent_sigma_y, n)
    df['outlier_error'] = rndst.normal(0, latent_sigma_y*10, n)
    df['outlier'] = rndst.binomial(1, p, n)

    ## add noise, with extreme noise for marked outliers
    df['y'] += ((1-df['outlier']) * df['latent_error'])
    df['y'] += (df['outlier'] * df['outlier_error'])

    ## round
    for col in ['y', 'latent_error', 'outlier_error', 'x']:
        df[col] = np.round(df[col], 3)

    return df
NOTE: I'll use a dataset of n=1000 observations, which is small but still just about large enough to warrant a fast technique. Feel free to experiment with different values of n, and with different model parameters p, a, b, c, latent_sigma_y.
In [4]:
n = 1000
df = generate_data(n=n, p=0.01, a=-30, b=9, c=2, latent_sigma_y=40)
df.head()
Out[4]:
In [5]:
with pm.Model() as mdl:

    ## define Normal priors to give Ridge regression
    b0 = pm.Normal('b0', mu=0, sd=100)
    b1 = pm.Normal('b1', mu=0, sd=100)
    b2 = pm.Normal('b2', mu=0, sd=100)

    ## define Linear model
    yest = b0 + b1 * df['x'] + b2 * np.power(df['x'], 2)

    ## define Normal likelihood with HalfCauchy noise (fat tails, equiv to HalfT 1DoF)
    sigma_y = pm.HalfCauchy('sigma_y', beta=10)
    likelihood = pm.Normal('likelihood', mu=yest, sd=sigma_y, observed=df['y'])
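An aside on the comment above: the HalfCauchy really is the half-Student-t with 1 degree of freedom, which is easy to check numerically with SciPy (this check is mine, not part of the original notebook):

```python
import numpy as np
from scipy import stats

beta = 10.0  # same scale as the sigma_y prior above
x = np.linspace(0.1, 50.0, 200)

# HalfCauchy pdf vs a Student-t (df=1) folded onto the positive half-line
halfcauchy_pdf = stats.halfcauchy.pdf(x, scale=beta)
folded_t_pdf = 2 * stats.t.pdf(x, df=1, scale=beta)

print(np.allclose(halfcauchy_pdf, folded_t_pdf))  # → True
```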
In [12]:
with mdl:
    trc_met = pm.sample(10000, njobs=3, step=pm.Metropolis())
In [13]:
ax = pm.traceplot(trc_met[:], figsize=(12,4*1.5), combined=False)
Observe:
In [15]:
gs = pm.forestplot(trc_met[-1000:], varnames=['b0', 'b1', 'b2'])
Observe:
In [6]:
with mdl:
    v_params = pm.variational.advi(n=100000)

_ = plt.plot(-np.log10(-v_params.elbo_vals))
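The transform in the plot above is just a readability trick: the ELBO values are negative and span orders of magnitude, so negating, taking log10, and negating again gives a curve that still increases as the ELBO improves but fits on one axis. A quick illustration with made-up ELBO values:

```python
import numpy as np

# made-up, improving (less negative) ELBO trace
elbo_vals = np.array([-1e6, -1e5, -2e4, -1.5e4])

# monotone transform: higher ELBO -> higher transformed value
transformed = -np.log10(-elbo_vals)
```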
Observe:
In [7]:
df_v_means = pd.DataFrame(v_params.means, index=['mean'])
df_v_stds = pd.DataFrame(v_params.stds, index=['std'])
df_v_params = pd.merge(df_v_means.T, df_v_stds.T, left_index=True, right_index=True)
df_v_params
Out[7]:
Observe:
In [11]:
with mdl:
    trc_nuts = pm.sample(300, njobs=3, step=pm.NUTS())
In [12]:
ax = pm.traceplot(trc_nuts[:], figsize=(12,4*1.5), combined=False)
In [13]:
gs = pm.forestplot(trc_nuts[-10:], varnames=['b0', 'b1', 'b2'])
Observe:
In [18]:
with mdl:
    trc_nuts_map = pm.sample(draws=300, step=pm.NUTS(scaling=v_params.means),
                             start=v_params.means)
NUTS actually gets stuck because the ADVI mean estimate is far off from the true mean.
In [20]:
ax = pm.traceplot(trc_nuts_map[:], figsize=(12,4*1.5), combined=False)
In [21]:
gs = pm.forestplot(trc_nuts_map[-100:], varnames=['b0', 'b1', 'b2'])
Observe:
In [23]:
with mdl:
    step = pm.NUTS(scaling=np.power(mdl.dict_to_array(v_params.stds), 2), is_cov=True)
    trc_nuts_scale = pm.sample(draws=300, njobs=3, step=step, start=v_params.means)
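For intuition, mdl.dict_to_array flattens the per-variable stds dict into a single vector in the model's internal parameter ordering; squaring it elementwise gives the variances passed as a diagonal covariance (hence is_cov=True). A minimal NumPy analogue, with illustrative names and an assumed ordering:

```python
import numpy as np

# hypothetical ADVI stds keyed by free-parameter name
stds = {'b0': 4.2, 'b1': 0.35, 'b2': 0.004, 'sigma_y_log_': 0.02}
order = ['b0', 'b1', 'b2', 'sigma_y_log_']  # assumed internal ordering

flat_stds = np.array([stds[name] for name in order])  # dict -> flat vector
scaling = np.power(flat_stds, 2)  # elementwise variances, used with is_cov=True
```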
Note that this initialization is the default for continuous models when you don't specify a step method. Thus, the code above is equivalent to:
In [13]:
with mdl:
    trc_nuts_scale = pm.sample(draws=300, njobs=3)
And in this case, finding the MAP first and using it to initialize ADVI actually lets ADVI converge to the correct solution, giving even better results. This can be achieved by setting init to 'advi_map'. The number of ADVI iterations can be adjusted via n_init.
In [9]:
with mdl:
    trc_nuts_scale = pm.sample(draws=300, init='advi_map', n_init=150000, njobs=3)
In [10]:
def plot_traces(traces, varnames, flatten_chains=False):
    """ Conv fn: plot traces with overlaid means and values """
    ax = pm.traceplot(traces, varnames=varnames, figsize=(12, len(varnames)*1.5),
                      lines={k: v['mean'] for k, v in pm.df_summary(
                          traces, varnames=varnames).iterrows()},
                      combined=flatten_chains)
    for i, mn in enumerate(pm.df_summary(traces, varnames=varnames)['mean']):
        ax[i, 0].annotate('{:.2f}'.format(mn), xy=(mn, 0), xycoords='data',
                          xytext=(5, 10), textcoords='offset points', rotation=90,
                          va='bottom', fontsize='large', color='#AA0022')
In [11]:
ax = pm.traceplot(trc_nuts_scale, figsize=(12, 4*1.5), combined=False)
In [12]:
gs = pm.forestplot(trc_nuts_scale[-100:], varnames=['b0', 'b1', 'b2'])
Observe:
The final estimates of the model parameter coeffs look quite similar, and close to the correct values. Each NUTS sampler was run for the same number of draws, yet they took quite different times to sample, since the quality of exploration differed:
The scaling really seems to help speed up the NUTS sampling: the traces appear to converge much more quickly and seem more settled. The parameter estimates are good too.
In general, it seems worth trying to set the NUTS scaling using ADVI stdevs.
Example originally contributed by Jonathan Sedar 2016-10-16
github.com/jonsedar