Running Stan on Jupyter or JuliaBox

Introduction to Stan and Stan.jl on Jupyter and JuliaBox

Stan ( http://mc-stan.org ) is one of several tools available to perform Bayesian sampling. Statistical inference based on Stan's results (the posterior samples in one or more chains) is typically done in R, Python or Matlab. Stan.jl uses Stan as an external program and simplifies running Stan models and the subsequent inference, all from within Julia, thus providing another 'interface' to Stan.

Stan is written in C++ but remains attractive to Julia users interested in Bayesian models, because it samples such models very efficiently. It is also supported by a very capable team and a community willing to help out with modeling and installation issues. Reasons enough for the existence of Stan.jl.

For Julia users who prefer a Notebook-like interface (over a Julia REPL + editor interface), Jupyter is available, and this Notebook demonstrates one way of using Stan.jl in a Notebook, including several options for the graphical display of results.

JuliaBox contributes another major benefit: it removes the burden of installing Stan from the end user. The user still has a 'private', permanent file system to store Julia scripts that include the Stan model and the available data.

Finally, pushing Stan's envelope can require knowledge of C++, C++ templates, the C++ Boost libraries, R, the Stan modeling language, and its extensions (e.g. functions). This can be daunting. Stan.jl can't help here, but Julia's ecosystem provides options written entirely in Julia, such as Mamba.jl ( http://mambajl.readthedocs.org/en/latest/index.html ) and Lora.jl, which might be a better route if access to lower-level MCMC functionality is required. Using Stan.jl will pave the way to such a migration.

Getting started, some preliminaries ...

Running the cell below is occasionally useful to make sure all packages are up to date.


In [1]:
Pkg.update()


INFO: Updating METADATA...
INFO: Updating Geomechanics...
INFO: Updating ParSpMatVec...
INFO: Updating ODE...
INFO: Updating GLAbstraction...
INFO: Updating MeshIO...
INFO: Updating FixedSizeArrays...
INFO: Updating CSoM...
INFO: Updating ModernGL...
INFO: Updating ClassicalLaminateTheory...
INFO: Updating Plots...
INFO: Updating GeometryTypes...
INFO: Updating Gadfly...
INFO: Updating GLVisualize...
INFO: Updating Mustache...
INFO: Updating GLWindow...
INFO: Computing changes...
INFO: No packages to install, update or remove

Stan.jl relies on Mamba.jl's very broad collection of functions for posterior inference and display after the MCMC sampling has been performed by Stan.


In [7]:
using Mamba, Stan;

In this example we'll use one of the examples included in Stan.jl: estimating the distribution of the success probability (theta) based on 4 sets of 10 observed outcomes.

I like to control where I store intermediate and result files. In this case I 'borrow' the Bernoulli directory inside Stan.jl.


In [8]:
old = pwd()
ProjDir = Pkg.dir("Stan", "Examples", "Bernoulli")
cd(ProjDir)

A Stan model and its data

Define a simple Stan model. On JuliaBox it is best to define the model inline, but it could also be stored in an external file and uploaded to JuliaBox.


In [9]:
bernoullimodel = "
data { 
  int<lower=0> N; 
  int<lower=0,upper=1> y[N];
} 
parameters {
  real<lower=0,upper=1> theta;
} 
model {
  theta ~ beta(1,1);
  y ~ bernoulli(theta);
}
";

Define the observed data, in this case the number of observations in each of the 4 experiments and the outcomes of those observations. This is an array of dictionaries. By default Stan will run 4 chains, and in this setup each chain is assigned a dictionary with the results of one experiment. Whether that is an acceptable way to analyze these results depends on the purpose of the 4 experiments; it is used here purely as an illustration. For example, one could increase the probability of success in one of the experiments by updating its y vector (a variant is sketched after the data cell below). This would show up in the overall value of theta below and in the plots for that chain.


In [10]:
bernoullidata = [
  Dict("N" => 10, "y" => [0, 1, 0, 1, 0, 0, 0, 0, 0, 1]),
  Dict("N" => 10, "y" => [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]),
  Dict("N" => 10, "y" => [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]),
  Dict("N" => 10, "y" => [0, 0, 0, 1, 1, 0, 0, 1, 0, 1])
];
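
As mentioned above, a hypothetical variant of the data could raise the success rate in one experiment. A sketch (the modified y vector is made up for illustration):


In [ ]:
# Hypothetical variant: a much higher success rate in the 4th experiment.
# Chain 4's theta would then stand out in the summary and plots below.
bernoullidata2 = copy(bernoullidata);
bernoullidata2[4] = Dict("N" => 10, "y" => [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]);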

Setting up a Stanmodel and sampling from the model

Stanmodel() and stan() are the main functions provided by Stan.jl. For details on the arguments of these functions, see the CmdStan documentation and the README of Stan.jl ( https://github.com/goedman/Stan.jl ).


In [11]:
stanmodel = Stanmodel(update=1200, thin=2, name="bernoulli", model=bernoullimodel);


File /Users/rob/.julia/v0.4/Stan/Examples/Bernoulli/tmp/bernoulli.stan will be updated.


In [13]:
sim1 = stan(stanmodel, bernoullidata, ProjDir, diagnostics=false);


make: `/Users/rob/.julia/v0.4/Stan/Examples/Bernoulli/tmp/bernoulli' is up to date.

Inference for Stan model: bernoulli_model
4 chains: each with iter=(1200,1200,1200,1200); warmup=(0,0,0,0); thin=(1,1,1,1); 4800 iterations saved.

Warmup took (0.0081, 0.0070, 0.0073, 0.0072) seconds, 0.030 seconds total
Sampling took (0.015, 0.015, 0.016, 0.016) seconds, 0.062 seconds total

                Mean     MCSE  StdDev    5%   50%   95%  N_Eff  N_Eff/s    R_hat
lp__            -8.3  1.7e-01    0.78  -9.8  -8.1  -7.6     21      339  1.1e+00
accept_stat__   0.91  1.9e-03    0.13  0.63  0.97   1.0   4800    77820  1.0e+00
stepsize__      0.99  4.3e-02   0.060  0.93   1.0   1.1    2.0       32  2.7e+13
treedepth__      1.7  8.5e-03    0.50   1.0   2.0   2.0   3389    54947  1.0e+00
n_leapfrog__     2.4  2.1e-02     1.1   1.0   3.0   3.0   2849    46197  1.0e+00
n_divergent__   0.00  0.0e+00    0.00  0.00  0.00  0.00   4800    77820      nan
theta           0.35  2.5e-02    0.14  0.14  0.34  0.59     29      470  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

From the above summary you can see that by default Stan.jl runs 4 chains. Stan itself does not perform any thinning here; that is done afterwards, since we defined a thinning factor of 2 in the Stanmodel() call.

Notice that if we were to run the simulation again, the Stan model would not be recompiled unless the model has changed.

Posterior analysis

Stan records 6 sampler variables by default, augmented by model-specific variables (only theta in this example). Not all of these are suitable for posterior inference. In this example we'll only perform posterior analysis on "theta", "lp__" and "accept_stat__".

In some cases Stan might record tens, hundreds or even thousands of model-specific variables. In those cases a variable like 'posterior' below can be used to constrain the set of variables used in posterior analysis.
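
To check which variables were recorded before choosing a subset, the variable names stored in the result can be listed. A minimal sketch, assuming the Chains object returned by stan() exposes its names field as in the Mamba documentation:


In [ ]:
# List all variables recorded by Stan in this simulation
# (the 6 sampler variables plus theta in this example).
sim1.names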


In [14]:
posterior = ["theta", "lp__", "accept_stat__"];

Apply the subset 'posterior' before calling describe().


In [15]:
sim = sim1[:, posterior, :]
describe(sim)


Iterations = 1:2399
Thinning interval = 2
Chains = 1,2,3,4
Samples per chain = 1200

Empirical Posterior Estimates:
                  Mean        SD       Naive SE       MCSE         ESS   
        theta  0.35214848 0.13725089 0.0019810459 0.0062985868  474.83649
         lp__ -8.29628018 0.77869175 0.0112394473 0.0369522593  444.06767
accept_stat__  0.91377397 0.13120771 0.0018938202 0.0030147146 1200.00000

Quantiles:
                   2.5%        25.0%       50.0%      75.0%      97.5%  
        theta   0.117667875  0.25028825  0.3444205  0.4455125  0.6391171
         lp__ -10.481785000 -8.51547000 -8.1382900 -7.7305450 -7.6389692
accept_stat__   0.501218975  0.88367325  0.9699055  1.0000000  1.0000000

Mamba.jl provides a range of functions to diagnose and analyze the simulation results; gelmandiag() is shown below. See the Mamba documentation ( http://mambajl.readthedocs.org/en/latest/index.html ) or the Bernoulli example in Stan.jl for other options.


In [16]:
gelmandiag(sim, mpsrf=true, transform=true)


Out[16]:
               PSRF 97.5%
        theta 1.060 1.172
         lp__ 1.081 1.224
accept_stat__ 1.056 1.098
 Multivariate 1.404   NaN
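
Other Mamba diagnostics can be applied to the same subset. A sketch using gewekediag() and hpd() from the Mamba documentation:


In [ ]:
# Geweke diagnostic: compares the means of the first and last parts of each chain.
gewekediag(sim)

# 95% highest posterior density intervals for the selected variables.
hpd(sim)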

Plot the results


In [17]:
p = plot(sim, [:trace, :mean, :density, :autocor], legend=true);

In [18]:
draw(p, ncol=4)


[Figure: a 4-column grid per variable (accept_stat__, lp__, theta) showing trace, running mean, density and autocorrelation plots, with the 4 chains overlaid.]
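
To keep the plots around, draw() can also write them to a file. A sketch, assuming the filename and fmt keyword arguments described in the Mamba documentation (presumably how the bernoulli-summaryplot files listed below were produced):


In [ ]:
# Write the same 4-column grid of plots to (numbered) SVG files in ProjDir.
draw(p, ncol=4, filename="bernoulli-summaryplot", fmt=:svg)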

Housekeeping ...

Intermediate results are all in the 'tmp' directory. Results that you might want to keep around are left in the current working directory (ProjDir as defined above).


In [19]:
;ls


bernoulli-contourplot.pdf
bernoulli-contourplot.svg
bernoulli-summaryplot-1.pdf
bernoulli-summaryplot-1.svg
bernoulli-variationalplot-1.pdf
bernoulli-variationalplot-1.svg
bernoulli.jl
bernoulli_diagnose.jl
bernoulli_optimize.jl
bernoulli_variational.jl
tmp

What's in the 'tmp' directory?


In [20]:
;ls tmp


bernoulli
bernoulli.hpp
bernoulli.stan
bernoulli_1.data.R
bernoulli_2.data.R
bernoulli_3.data.R
bernoulli_4.data.R
bernoulli_build.log
bernoulli_run.log
bernoulli_samples_1.csv
bernoulli_samples_2.csv
bernoulli_samples_3.csv
bernoulli_samples_4.csv

You can remove the 'tmp' directory. If you do, future simulations will require redefining the Stanmodel() and recompiling when calling stan()!

In [21]:
isdir("tmp") && rm("tmp", recursive=true)
