NUTS scaling using ADVI

Note

This notebook compares various techniques for initializing NUTS and finds that ADVI works best. Since then, ADVI initialization has been made the default when you don't specify a step method. Thus, if you run pymc3.sample(500) on a continuous model, it will already initialize correctly.

A minimal reproducible example of using the stdevs of ADVI to set the scaling matrix of the NUTS sampler.

I caught up with Thomas Wiecki after his talk at ODSC London and he mentioned a potential speed increase for NUTS sampling by using ADVI outputs to set the covariance scaling matrix.

This seems like a great idea and there's already a good example in the docs but I wanted to try it myself, and get a feel for the speed increase.

Overview

In this Notebook I generate a small, noisy dataset according to a simple linear model, and attempt to recover the parameters comparing 3 techniques:

  1. NUTS, initialised at model test point (zero, the basic choice)
  2. NUTS, initialised at mean ADVI (my default choice to date)
  3. NUTS, initialised at mean ADVI, and scaling the covariance with the ADVI stdev (hopefully a speed increase)

Results Summary

The final estimates of the model coefficients look quite similar, and close to the true values. Each NUTS sampler was run for the same number of samples, yet the runs took quite different times, since the quality of exploration differed:

  1. NUTS, initialised at test point zero: 9 sec
  2. NUTS, initialised at mean ADVI: 163 sec
  3. NUTS, initialised at mean ADVI, and scaled using ADVI stdevs: 3 sec

The scaling really seems to help speed up the NUTS sampling: the traces appear to converge much more quickly and seem more settled. The parameter estimates are good too.

In general, it seems worth trying to set the NUTS scaling using ADVI stdevs.

Contents

Package Requirements (shown as a conda-env YAML):

$> less conda_env_pymc3_examples.yml

name: pymc3_examples
channels:
  - defaults
dependencies:
    - python=3.5
    - jupyter
    - ipywidgets
    - numpy
    - scipy
    - matplotlib
    - pandas
    - pip
    - pip:
        - watermark
        - pymc3        

$> conda env create --file conda_env_pymc3_examples.yml
$> source activate pymc3_examples

Setup


In [1]:
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from scipy import optimize
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm

pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = 12, 4
rndst = np.random.RandomState(0)

%load_ext watermark
%watermark -dmvgp numpy,pandas,matplotlib,pymc3,theano,joblib


17/11/2016 

CPython 3.5.2
IPython 5.1.0

numpy 1.11.2
pandas 0.19.0
matplotlib 1.5.1
pymc3 3.0rc1
theano 0.7.0
joblib 0.9.4

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 4.4.0-42-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : 85e0f9db25938508934c920ad962a64ad308fce1

Local Functions


In [3]:
def generate_data(n=20, p=0, a=1, b=1, c=0, latent_sigma_y=20):
    ''' 
    Create a toy dataset based on a very simple model that we might
    imagine is a noisy physical process:
        1. random x values within a range
        2. latent error aka inherent noise in y
        3. optionally create labelled outliers with larger noise

    Model form: y ~ a + bx + cx^2 + e
    
    NOTE: latent_sigma_y is used to create normally distributed
    'latent error' aka 'inherent noise' in the 'physical process'
    generating these values, rather than experimental measurement error.
    Please don't use the returned `latent_error` values in inferential
    models; they're returned in the dataframe for interest only.
    '''
    rndst = np.random.RandomState(0)
    
    df = pd.DataFrame({'x':rndst.choice(np.arange(max(100, 2*n)), n, replace=False)})
                
    ## create linear or quadratic model
    df['y'] = a + b*(df['x']) + c*(df['x'])**2 

    ## create latent noise and marked outliers
    df['latent_error'] = rndst.normal(0, latent_sigma_y, n)
    df['outlier_error'] = rndst.normal(0, latent_sigma_y*10, n)
    df['outlier'] = rndst.binomial(1, p, n)
    
    ## add noise, with extreme noise for marked outliers
    df['y'] += ((1-df['outlier']) * df['latent_error'])
    df['y'] += (df['outlier'] * df['outlier_error'])
   
    ## round
    for col in ['y','latent_error','outlier_error','x']:
        df[col] = np.round(df[col],3)
       
    return df

Generate Data

NOTE:

  • Dataset is 1000 rows for publishing (n=1000), which is small but still just about large enough to warrant a fast technique.
  • For your own usage, please feel free to increase n. Also try different model parameters p, a, b, c, latent_sigma_y.

In [4]:
n = 1000
df = generate_data(n=n, p=0.01, a=-30, b=9, c=2, latent_sigma_y=40)
df.head()


Out[4]:
      x            y  latent_error  outlier_error  outlier
0   405   331682.422        17.422       -190.070        0
1  1190  2842856.031       -23.969        509.181        0
2  1132  2573007.324         1.324       -678.453        0
3   731  1075236.834       -34.166        292.073        0
4  1754  6168759.202       -28.798       -742.993        0

Create and Run Linear Model


In [5]:
with pm.Model() as mdl:
        
    ## define Normal priors to give Ridge regression
    b0 = pm.Normal('b0', mu=0, sd=100)
    b1 = pm.Normal('b1', mu=0, sd=100)
    b2 = pm.Normal('b2', mu=0, sd=100)
 
    ## define Linear model
    yest = b0 + b1 * df['x'] + b2 * np.power(df['x'], 2)

    ## define Normal likelihood with HalfCauchy noise (fat tails, equiv to HalfT 1DoF)
    sigma_y = pm.HalfCauchy('sigma_y', beta=10)
    likelihood = pm.Normal('likelihood', mu=yest, sd=sigma_y, observed=df['y'])

Metropolis Sampling


In [12]:
with mdl:
    trc_met = pm.sample(10000, njobs=3, step=pm.Metropolis())


100%|██████████| 10000/10000 [01:02<00:00, 160.60it/s]
View traces

In [13]:
ax = pm.traceplot(trc_met[:], figsize=(12,4*1.5), combined=False)


Observe:

  • Great, the model seems reasonably well specified
  • Metropolis, as ever, takes a while to converge
View Model Coeffs

In [15]:
gs = pm.forestplot(trc_met[-1000:], varnames=['b0', 'b1', 'b2'])


Observe:

  • This simple model took only 10 sec to sample using Metropolis.
  • Parameter estimates for b1 and b2 seem good; b0 seems a little off, but overall the parameters are reasonable.

ADVI Estimation


In [6]:
with mdl:
    v_params = pm.variational.advi(n=100000) 

_ = plt.plot(-np.log10(-v_params.elbo_vals))


Average ELBO = -7,476.35: 100%|██████████| 100000/100000 [00:29<00:00, 3433.70it/s]
Finished [100%]: Average ELBO = -7,420.23

Observe:

  • ADVI takes many iterations to converge for this model, but it gets there in the end
  • NOTE: I've plotted the ELBO on a log scale since the values swept through more than 10 orders of magnitude, and on a linear scale it becomes very hard to see convergence
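The log-scale plot above can be sketched in isolation. This is an illustrative example (the ELBO values are made up, not from this run): since the ELBO is negative here, -log10(-ELBO) is monotone increasing in the ELBO, so convergence still reads left-to-right, but the y-axis spans a handful of units rather than many orders of magnitude.

```python
import numpy as np

# Illustrative (made-up) ELBO values sweeping many orders of magnitude,
# as in the optimisation trace above; all negative.
elbo_vals = np.array([-1e12, -1e9, -1e6, -7.5e3])

# The plotted transform: -log10(-ELBO). Monotone increasing in the ELBO,
# so rising-towards-zero convergence is preserved, but the range is now
# readable on a linear axis.
transformed = -np.log10(-elbo_vals)
print(transformed)  # roughly [-12, -9, -6, -3.88]
```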

In [7]:
df_v_means = pd.DataFrame(v_params.means, index=['mean'])
df_v_stds = pd.DataFrame(v_params.stds, index=['std'])
df_v_params = pd.merge(df_v_means.T, df_v_stds.T, left_index=True, right_index=True)
df_v_params


Out[7]:
                  mean        std
b0           14.465333  17.356051
b1            8.713063   0.020899
b2            2.000169   0.000013
sigma_y_log_  5.869877   0.027989

Observe:

  • The fitted parameter values don't look too bad, but there seems to be an issue with b0 in particular.
  • However, I don't really want to report these values anyway; instead, I'll use them to parameterise the NUTS sampler.
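The conversion used further below (mdl.dict_to_array on the stdevs, then squaring, passed to NUTS with is_cov=True) amounts to the following sketch. The dict values are taken from the table above and the parameter ordering is an assumption for illustration; in practice the model defines the ordering:

```python
import numpy as np

# ADVI posterior stdevs keyed by parameter name, mirroring v_params.stds
# above (values copied from the fitted table, for illustration only).
advi_stds = {'b0': 17.356051, 'b1': 0.020899, 'b2': 1.3e-5,
             'sigma_y_log_': 0.027989}

# dict_to_array flattens the dict into one vector in the model's
# parameter ordering (assumed here); squaring the stdevs then gives the
# diagonal of the covariance matrix NUTS uses as its scaling.
order = ['b0', 'b1', 'b2', 'sigma_y_log_']   # assumed ordering
std_vec = np.array([advi_stds[k] for k in order])
cov_diag = np.power(std_vec, 2)
```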

Test NUTS Sampling

1. NUTS initialised at model test point


In [11]:
with mdl:   
    trc_nuts = pm.sample(300, njobs=3, step=pm.NUTS())


100%|██████████| 300/300 [00:00<00:00, 1552.91it/s]
View Traces

In [12]:
ax = pm.traceplot(trc_nuts[:], figsize=(12,4*1.5), combined=False)


View Model Coeffs

In [13]:
gs = pm.forestplot(trc_nuts[-10:], varnames=['b0', 'b1', 'b2'])


Observe:

  • The model took 9.2 sec to sample
  • We didn't hit convergence until after 100 samples, with some slight movement remaining.
  • The estimated model coeffs look pretty good.

2. NUTS initialised using ADVI mean


In [18]:
with mdl:
    trc_nuts_map = pm.sample(draws=300, step=pm.NUTS(scaling=v_params.means), 
                             start=v_params.means)


INFO (theano.gof.compilelock): Refreshing lock /home/wiecki/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.5.2-64/lock_dir/lock
 24%|██▍       | 73/300 [01:30<06:53,  1.82s/it]

NUTS actually gets stuck because the ADVI mean estimate is far off from the true mean.

View Traces

In [20]:
ax = pm.traceplot(trc_nuts_map[:], figsize=(12,4*1.5), combined=False)


View Model Coeffs

In [21]:
gs = pm.forestplot(trc_nuts_map[-100:], varnames=['b0', 'b1', 'b2'])


Observe:

  • The model took 163 sec to sample
  • We hit a sort of convergence quickly after ~50 samples, but the traces afterwards appear to vary a lot, never truly settling.
  • The estimated model coeffs for b1 and b2 look fine, but b0 has a very wide variance.
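A rough intuition for why the unscaled runs struggle: the ADVI stdevs above span roughly six orders of magnitude across parameters, and with a unit (unscaled) mass matrix the integrator's step size is constrained by the narrowest posterior dimension while trajectories must still traverse the widest one. This is only a heuristic sketch, not the exact NUTS cost model:

```python
import numpy as np

# Approximate posterior stdevs from the ADVI fit above.
stds = np.array([17.36, 0.0209, 1.3e-5, 0.028])

# Ratio of widest to narrowest posterior scale: a rough proxy for how
# badly a unit mass matrix mismatches this posterior's geometry.
condition = stds.max() / stds.min()
print('{:.2g}'.format(condition))  # prints 1.3e+06
```

Rescaling by the ADVI variances makes all dimensions comparable, which is why run #3 below samples so much faster.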

3. NUTS initialised using ADVI mean and scaled using ADVI stdevs


In [23]:
with mdl:
    step = pm.NUTS(scaling=np.power(mdl.dict_to_array(v_params.stds), 2), is_cov=True)

    trc_nuts_scale = pm.sample(draws=300, njobs=3, step=step, start=v_params.means)


100%|██████████| 300/300 [00:12<00:00, 24.95it/s]

Note that this initialization is the default for continuous models when you don't specify a step method. Thus, the line above is equivalent to:


In [13]:
with mdl:
    trc_nuts_scale = pm.sample(draws=300, njobs=3)


Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -7,456.53: 100%|██████████| 500000/500000 [00:56<00:00, 8904.41it/s]
Finished [100%]: Average ELBO = -7,460.26
100%|██████████| 1/1 [00:00<00:00, 4215.38it/s]
100%|██████████| 300/300 [00:02<00:00, 108.76it/s]

And in this case, finding the MAP and using it to start ADVI actually lets ADVI converge to the correct solution, yielding even better results. This can be achieved by setting init to 'advi_map'. We can also reduce the number of ADVI iterations.


In [9]:
with mdl:
    trc_nuts_scale = pm.sample(draws=300, init='advi_map', n_init=150000, njobs=3)


Auto-assigning NUTS sampler...
Initializing NUTS using advi_map...
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 5375.931832
         Iterations: 79
         Function evaluations: 150
         Gradient evaluations: 139
Average ELBO = -7,448.56: 100%|██████████| 150000/150000 [01:01<00:00, 2420.36it/s]
Finished [100%]: Average ELBO = -7,456.37
100%|██████████| 300/300 [00:09<00:00, 30.63it/s]
View Traces

In [10]:
def plot_traces(traces, varnames, flatten_chains=False):
    """ Convenience function: plot traces with overlaid means and values """
    
    ax = pm.traceplot(traces, varnames=varnames, figsize=(12,len(varnames)*1.5),
            lines={k: v['mean'] for k, v in pm.df_summary(
                traces, varnames=varnames).iterrows()}, combined=flatten_chains)

    for i, mn in enumerate(pm.df_summary(traces, varnames=varnames)['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5, 10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')

In [11]:
ax = pm.traceplot(trc_nuts_scale, figsize=(12, 4*1.5), combined=False)


View Model Coeffs

In [12]:
gs = pm.forestplot(trc_nuts_scale[-100:], varnames=['b0', 'b1', 'b2'])


Observe:

  • The model samples far quicker than #2 without the scaling, and quicker than #1 initialised at the test point.
  • The traces appear converged right from the start and remain consistent throughout: they look very settled.
  • The estimated model coeffs look good.

In Summary

The final estimates of the model coefficients look quite similar, and close to the true values. Each NUTS sampler was run for the same number of samples, yet the runs took quite different times, since the quality of exploration differed:

  1. NUTS, initialised at test point zero: 9 sec
  2. NUTS, initialised at mean ADVI: 163 sec
  3. NUTS, initialised at mean ADVI, and scaled using ADVI stdevs: 3 sec

The scaling really seems to help speed up the NUTS sampling: the traces appear to converge much more quickly and seem more settled. The parameter estimates are good too.

In general, it seems worth trying to set the NUTS scaling using ADVI stdevs.


Example originally contributed by Jonathan Sedar 2016-10-16
github.com/jonsedar