NUTS scaling using ADVI

Note

This notebook compares various techniques for initializing NUTS and finds that ADVI works best. Since then, ADVI initialization has been made the default when you don't specify a step method. Thus, if you run pymc3.sample(500) on a continuous model, it will already initialize correctly.

A minimal reproducible example of using the stdevs of ADVI to set the scaling matrix of the NUTS sampler.

I caught up with Thomas Wiecki after his talk at ODSC London and he mentioned a potential speed increase for NUTS sampling by using ADVI outputs to set the covariance scaling matrix.

This seems like a great idea and there's already a good example in the docs but I wanted to try it myself, and get a feel for the speed increase.

Overview

In this Notebook I generate a small, noisy dataset according to a simple linear model, and attempt to recover the parameters comparing 3 techniques:

  1. NUTS, initialised at model test point (zero, the basic choice)
  2. NUTS, initialised at mean ADVI (my default choice to date)
  3. NUTS, initialised at mean ADVI, and scaling the covariance with the ADVI stdev (hopefully a speed increase)

Results Summary

The final estimates of the model coefficients look quite similar, and close to the true values. Each NUTS sampler was run for the same number of samples, yet the runs took quite different times, since the quality of exploration differed:

  1. NUTS, initialised at test point zero: 9 sec
  2. NUTS, initialised at mean ADVI: 163 sec
  3. NUTS, initialised at mean ADVI, and scaled using ADVI stdevs: 3 sec

The scaling really seems to help speed up the NUTS sampling: the traces appear to converge much more quickly and seem more settled. The parameter estimates are good too.

In general, it seems worth trying to set the NUTS scaling using ADVI stdevs.

Contents

Package Requirements (shown as a conda-env YAML):

$> less conda_env_pymc3_examples.yml

name: pymc3_examples
channels:
  - defaults
dependencies:
    - python=3.5
    - jupyter
    - ipywidgets
    - numpy
    - scipy
    - matplotlib
    - pandas
    - pip
    - pip:
        - watermark
        - pymc3        

$> conda env create --file conda_env_pymc3_examples.yml
$> source activate pymc3_examples

Setup


In [1]:
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from scipy import optimize
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm

pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = 12, 4
rndst = np.random.RandomState(0)

%load_ext watermark
%watermark -dmvgp numpy,pandas,matplotlib,pymc3,theano,joblib


17/11/2016 

CPython 3.5.2
IPython 5.1.0

numpy 1.11.2
pandas 0.19.0
matplotlib 1.5.1
pymc3 3.0rc1
theano 0.7.0
joblib 0.9.4

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 4.4.0-42-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : 85e0f9db25938508934c920ad962a64ad308fce1

Local Functions


In [3]:
def generate_data(n=20, p=0, a=1, b=1, c=0, latent_sigma_y=20):
    ''' 
    Create a toy dataset based on a very simple model that we might
    imagine is a noisy physical process:
        1. random x values within a range
        2. latent error aka inherent noise in y
        3. optionally create labelled outliers with larger noise

    Model form: y ~ a + bx + cx^2 + e
    
    NOTE: latent_sigma_y is used to create normally distributed
    'latent error' aka 'inherent noise' in the 'physical process'
    generating these values, rather than experimental measurement error.
    Please don't use the returned `latent_error` values in inferential
    models; they're returned in the dataframe for interest only.
    '''
    rndst = np.random.RandomState(0)
    
    df = pd.DataFrame({'x':rndst.choice(np.arange(max(100, 2*n)), n, replace=False)})
                
    ## create linear or quadratic model
    df['y'] = a + b*(df['x']) + c*(df['x'])**2 

    ## create latent noise and marked outliers
    df['latent_error'] = rndst.normal(0, latent_sigma_y, n)
    df['outlier_error'] = rndst.normal(0, latent_sigma_y*10, n)
    df['outlier'] = rndst.binomial(1, p, n)
    
    ## add noise, with extreme noise for marked outliers
    df['y'] += ((1-df['outlier']) * df['latent_error'])
    df['y'] += (df['outlier'] * df['outlier_error'])
   
    ## round
    for col in ['y','latent_error','outlier_error','x']:
        df[col] = np.round(df[col],3)
       
    return df

Generate Data

NOTE:

  • Dataset is 1000 rows for publishing (n=1000), which is small but still just about large enough to warrant a fast technique.
  • For your own usage, please feel free to increase n. Also try different model parameters p, a, b, c, latent_sigma_y.

In [4]:
n = 1000
df = generate_data(n=n, p=0.01, a=-30, b=9, c=2, latent_sigma_y=40)
df.head()


Out[4]:
      x            y  latent_error  outlier_error  outlier
0   405   331682.422        17.422       -190.070        0
1  1190  2842856.031       -23.969        509.181        0
2  1132  2573007.324         1.324       -678.453        0
3   731  1075236.834       -34.166        292.073        0
4  1754  6168759.202       -28.798       -742.993        0

Create and Run Linear Model


In [5]:
with pm.Model() as mdl:
        
    ## define Normal priors to give Ridge regression
    b0 = pm.Normal('b0', mu=0, sd=100)
    b1 = pm.Normal('b1', mu=0, sd=100)
    b2 = pm.Normal('b2', mu=0, sd=100)
 
    ## define Linear model
    yest = b0 + b1 * df['x'] + b2 * np.power(df['x'], 2)

    ## define Normal likelihood with HalfCauchy noise (fat tails, equiv to HalfT 1DoF)
    sigma_y = pm.HalfCauchy('sigma_y', beta=10)
    likelihood = pm.Normal('likelihood', mu=yest, sd=sigma_y, observed=df['y'])

Metropolis Sampling


In [12]:
with mdl:
    trc_met = pm.sample(10000, njobs=3, step=pm.Metropolis())


100%|██████████| 10000/10000 [01:02<00:00, 160.60it/s]
View traces

In [13]:
ax = pm.traceplot(trc_met[:], figsize=(12,4*1.5), combined=False)


Observe:

  • Great, the model seems reasonably well specified
  • Metropolis, as ever, takes a while to converge
View Model Coeffs

In [15]:
gs = pm.forestplot(trc_met[-1000:], varnames=['b0', 'b1', 'b2'])


Observe:

  • This simple model took only 10 sec to sample using Metropolis.
  • Parameter estimates for b1 and b2 seem good; b0 seems a little off, but overall the parameters are reasonable.

ADVI Estimation


In [6]:
with mdl:
    v_params = pm.variational.advi(n=100000) 

_ = plt.plot(-np.log10(-v_params.elbo_vals))


Average ELBO = -7,476.35: 100%|██████████| 100000/100000 [00:29<00:00, 3433.70it/s]
Finished [100%]: Average ELBO = -7,420.23

Observe:

  • ADVI takes many iterations to converge for this model, but it gets there in the end
  • NOTE: I've plotted the ELBO on a log scale since the values swept through more than 10 orders of magnitude, and on a linear scale it becomes very hard to see convergence
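The log-scale plot above can be sketched in isolation. This is an illustrative example (the ELBO values are made up, not from this run): since the ELBO is negative here, -log10(-ELBO) is monotone increasing in the ELBO, so convergence still reads left-to-right, but the y-axis spans a handful of units rather than many orders of magnitude.

```python
import numpy as np

# Illustrative (made-up) ELBO values sweeping many orders of magnitude,
# as in the optimisation trace above; all negative.
elbo_vals = np.array([-1e12, -1e9, -1e6, -7.5e3])

# The plotted transform: -log10(-ELBO). Monotone increasing in the ELBO,
# so rising-towards-zero convergence is preserved, but the range is now
# readable on a linear axis.
transformed = -np.log10(-elbo_vals)
print(transformed)  # roughly [-12, -9, -6, -3.88]
```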

In [7]:
df_v_means = pd.DataFrame(v_params.means, index=['mean'])
df_v_stds = pd.DataFrame(v_params.stds, index=['std'])
df_v_params = pd.merge(df_v_means.T, df_v_stds.T, left_index=True, right_index=True)
df_v_params


Out[7]:
                  mean        std
b0           14.465333  17.356051
b1            8.713063   0.020899
b2            2.000169   0.000013
sigma_y_log_  5.869877   0.027989

Observe:

  • The fitted parameter values don't look too bad, but there seems to be an issue with b0 in particular.
  • However, I don't really want to report these values anyway; instead, I'll use them to parameterise the NUTS sampler.
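The conversion used further below (mdl.dict_to_array on the stdevs, then squaring, passed to NUTS with is_cov=True) amounts to the following sketch. The dict values are taken from the table above and the parameter ordering is an assumption for illustration; in practice the model defines the ordering:

```python
import numpy as np

# ADVI posterior stdevs keyed by parameter name, mirroring v_params.stds
# above (values copied from the fitted table, for illustration only).
advi_stds = {'b0': 17.356051, 'b1': 0.020899, 'b2': 1.3e-5,
             'sigma_y_log_': 0.027989}

# dict_to_array flattens the dict into one vector in the model's
# parameter ordering (assumed here); squaring the stdevs then gives the
# diagonal of the covariance matrix NUTS uses as its scaling.
order = ['b0', 'b1', 'b2', 'sigma_y_log_']   # assumed ordering
std_vec = np.array([advi_stds[k] for k in order])
cov_diag = np.power(std_vec, 2)
```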

Test NUTS Sampling

1. NUTS initialised at model test point


In [11]:
with mdl:   
    trc_nuts = pm.sample(300, njobs=3, step=pm.NUTS())


100%|██████████| 300/300 [00:00<00:00, 1552.91it/s]
View Traces

In [12]:
ax = pm.traceplot(trc_nuts[:], figsize=(12,4*1.5), combined=False)


View Model Coeffs

In [13]:
gs = pm.forestplot(trc_nuts[-10:], varnames=['b0', 'b1', 'b2'])


Observe:

  • The model took 9.2 sec to sample
  • We didn't hit convergence until after 100 samples, with some slight movement remaining.
  • The estimated model coeffs look pretty good.

2. NUTS initialised using ADVI mean


In [18]:
with mdl:
    trc_nuts_map = pm.sample(draws=300, step=pm.NUTS(scaling=v_params.means), 
                             start=v_params.means)


INFO (theano.gof.compilelock): Refreshing lock /home/wiecki/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.5.2-64/lock_dir/lock
 24%|██▍       | 73/300 [01:30<06:53,  1.82s/it]

NUTS actually gets stuck because the ADVI mean estimate is far off from the true mean.

View Traces

In [20]:
ax = pm.traceplot(trc_nuts_map[:], figsize=(12,4*1.5), combined=False)


View Model Coeffs

In [21]:
gs = pm.forestplot(trc_nuts_map[-100:], varnames=['b0', 'b1', 'b2'])


Observe:

  • The model took 163 sec to sample
  • We hit a sort of convergence quickly after ~50 samples, but the traces afterwards appear to vary a lot, never truly settling.
  • The estimated model coeffs for b1 and b2 look fine, but b0 has a very wide variance.
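A rough intuition for why the unscaled runs struggle: the ADVI stdevs above span roughly six orders of magnitude across parameters, and with a unit (unscaled) mass matrix the integrator's step size is constrained by the narrowest posterior dimension while trajectories must still traverse the widest one. This is only a heuristic sketch, not the exact NUTS cost model:

```python
import numpy as np

# Approximate posterior stdevs from the ADVI fit above.
stds = np.array([17.36, 0.0209, 1.3e-5, 0.028])

# Ratio of widest to narrowest posterior scale: a rough proxy for how
# badly a unit mass matrix mismatches this posterior's geometry.
condition = stds.max() / stds.min()
print('{:.2g}'.format(condition))  # prints 1.3e+06
```

Rescaling by the ADVI variances makes all dimensions comparable, which is why run #3 below samples so much faster.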

3. NUTS initialised using ADVI mean and scaled using ADVI stdevs


In [23]:
with mdl:
    step = pm.NUTS(scaling=np.power(mdl.dict_to_array(v_params.stds), 2), is_cov=True)

    trc_nuts_scale = pm.sample(draws=300, njobs=3, step=step, start=v_params.means)


100%|██████████| 300/300 [00:12<00:00, 24.95it/s]

Note that this initialization is the default for continuous models when you don't specify a step method. Thus, the line above is equivalent to:


In [13]:
with mdl:
    trc_nuts_scale = pm.sample(draws=300, njobs=3)


Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -7,456.53: 100%|██████████| 500000/500000 [00:56<00:00, 8904.41it/s]
Finished [100%]: Average ELBO = -7,460.26
100%|██████████| 1/1 [00:00<00:00, 4215.38it/s]
100%|██████████| 300/300 [00:02<00:00, 108.76it/s]

And in this case, finding the MAP and using it to start ADVI actually lets ADVI converge to the correct solution, yielding even better results. This can be achieved by setting init to 'advi_map'. We can also reduce the number of ADVI iterations.


In [9]:
with mdl:
    trc_nuts_scale = pm.sample(draws=300, init='advi_map', n_init=150000, njobs=3)


Auto-assigning NUTS sampler...
Initializing NUTS using advi_map...
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 5375.931832
         Iterations: 79
         Function evaluations: 150
         Gradient evaluations: 139
Average ELBO = -7,448.56: 100%|██████████| 150000/150000 [01:01<00:00, 2420.36it/s]
Finished [100%]: Average ELBO = -7,456.37
100%|██████████| 300/300 [00:09<00:00, 30.63it/s]
View Traces

In [10]:
def plot_traces(traces, varnames, flatten_chains=False):
    """ Convenience function: plot traces with overlaid means and values """
    
    ax = pm.traceplot(traces, varnames=varnames, figsize=(12,len(varnames)*1.5),
            lines={k: v['mean'] for k, v in pm.df_summary(
                traces, varnames=varnames).iterrows()}, combined=flatten_chains)

    for i, mn in enumerate(pm.df_summary(traces, varnames=varnames)['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5, 10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')

In [11]:
ax = pm.traceplot(trc_nuts_scale, figsize=(12, 4*1.5), combined=False)


View Model Coeffs

In [12]:
gs = pm.forestplot(trc_nuts_scale[-100:], varnames=['b0', 'b1', 'b2'])


Observe:

  • The model samples far quicker than #2 without the scaling, and quicker than #1 initialised at the test point.
  • The traces appear converged right from the start and remain consistent throughout: they look very settled.
  • The estimated model coeffs look good.

In Summary

The final estimates of the model coefficients look quite similar, and close to the true values. Each NUTS sampler was run for the same number of samples, yet the runs took quite different times, since the quality of exploration differed:

  1. NUTS, initialised at test point zero: 9 sec
  2. NUTS, initialised at mean ADVI: 163 sec
  3. NUTS, initialised at mean ADVI, and scaled using ADVI stdevs: 3 sec

The scaling really seems to help speed up the NUTS sampling: the traces appear to converge much more quickly and seem more settled. The parameter estimates are good too.

In general, it seems worth trying to set the NUTS scaling using ADVI stdevs.


Example originally contributed by Jonathan Sedar 2016-10-16
github.com/jonsedar