Author: Aki Vehtari, Tuomas Sivula, Co-authors: Lassi Meronen, Pellervo Ruponen
Last modified 2018-Dec.
License: CC-BY
This notebook contains several examples of how to use Stan in Python with PyStan. It assumes basic knowledge of Bayesian inference and MCMC. The Stan models are stored in separate .stan files. The examples are related to the Bayesian data analysis course by Aki Vehtari.
We begin by importing the PyStan module as well as the matplotlib module for basic graphics facilities. We also import, via the module stan_utility, some utilities by Michael Betancourt introduced in his PyStan workflow case study.
In [73]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import pystan
import arviz as az # For visualization and loo
In [74]:
# add utilities directory to path
import os, sys
util_path = os.path.abspath(os.path.join(os.path.pardir, 'utilities_and_data'))
if util_path not in sys.path and os.path.exists(util_path):
    sys.path.insert(0, util_path)
# import from utilities
import stan_utility
In [75]:
# edit default plot settings
plt.rc('font', size=12)
az.style.use('seaborn-darkgrid')
mpl.rcParams['figure.figsize'] = [7.0, 5.4]
In [76]:
data = dict(N=10, y=[0,1,0,0,1,1,1,0,1,0])
Bernoulli model with a Beta(1,1) (uniform) prior
In [77]:
with open('bern.stan') as file:
    print(file.read())
Given the Stan program, we then use the compile_model function of the stan_utility module to compile the Stan program into a C++ executable. This utility function automatically saves a cached version of the compiled model to disk for possible future use.
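The caching can be sketched roughly as follows. This is a simplified, hypothetical stand-in for stan_utility.compile_model: the cache file name and the compile_fn hook are illustrative, and the real function calls pystan.StanModel directly.

```python
import hashlib
import os
import pickle

def compile_model_cached(filename, cache_dir='.', compile_fn=None):
    """Recompile the Stan program only when its source text has changed."""
    with open(filename) as f:
        model_code = f.read()
    # cache key derived from the program text, so any edit invalidates the cache
    code_hash = hashlib.md5(model_code.encode('utf-8')).hexdigest()
    cache_file = os.path.join(cache_dir, 'cached-model-{}.pkl'.format(code_hash))
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)       # reuse the cached compiled model
    model = compile_fn(model_code)      # e.g. pystan.StanModel(model_code=...)
    with open(cache_file, 'wb') as f:
        pickle.dump(model, f)
    return model
```

The second call with the same file then skips the (slow) compilation entirely, which is why `del model` followed by re-compilation below is fast.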
In [78]:
model = stan_utility.compile_model('bern.stan')
Getting the model again with the utility function uses the cached version automatically.
In [79]:
del model
model = stan_utility.compile_model('bern.stan')
Sample from the posterior, show the summary, and plot a histogram of the posterior draws. In addition, plot the estimated posterior density with the ArviZ posterior plot, including a 95% credible interval. We recommend explicitly specifying the seed of Stan's random number generator, as we have done here, so that we can reproduce these results exactly in the future, at least when using the same machine, operating system, and interface. This is especially helpful for the more subtle pathologies that are not triggered on every run, which can result in seemingly stochastic behavior.
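As a side note, a central 95% interval can be computed directly from the draws with np.percentile (ArviZ's plot_posterior shows an HPD interval, which for a near-symmetric posterior is similar). The Beta draws below are just a stand-in for posterior draws:

```python
import numpy as np

rng = np.random.RandomState(0)
draws = rng.beta(6, 6, size=4000)   # stand-in for posterior draws of theta
# equal-tailed central 95% interval from the draws
lower, upper = np.percentile(draws, [2.5, 97.5])
print('mean {:.3f}, central 95% interval [{:.3f}, {:.3f}]'.format(
    draws.mean(), lower, upper))
```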
In [80]:
fit = model.sampling(data=data, seed=194838)
print(fit)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 33, color='lightblue', edgecolor='grey')
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Number of draws')
# Plot with arviz plot_posterior.
# - round_to: sets accuracy of values in plot
az.plot_posterior(fit, var_names=['theta'], credible_interval=0.95, round_to=2)
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Relative density');
In [81]:
data = dict(N=10, y=7)
Then we use a Binomial model with a Beta(1,1) prior for the probability of success.
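With a Beta(1,1) prior the Binomial likelihood is conjugate, so this posterior is available in closed form as Beta(y+1, N-y+1). This gives a handy analytic check on the MCMC draws (an illustrative aside, not part of the original notebook code):

```python
from scipy import stats

N, y = 10, 7
# conjugate update: Beta(1,1) prior + Binomial likelihood -> Beta(y+1, N-y+1)
posterior = stats.beta(y + 1, N - y + 1)
print('analytic posterior mean: {:.3f}'.format(posterior.mean()))  # (y+1)/(N+2)
lo, hi = posterior.ppf([0.025, 0.975])
print('central 95% interval: [{:.3f}, {:.3f}]'.format(lo, hi))
```

The histogram of the MCMC draws below should closely match this analytic density.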
In [82]:
with open('binom.stan') as file:
    print(file.read())
Sample from the posterior and plot it. The histogram should look similar to the Bernoulli case, except that now the number of successes is 7 instead of 5.
In [83]:
model = stan_utility.compile_model('binom.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 33, color='lightblue', edgecolor='grey')
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Number of draws')
az.plot_posterior(fit, var_names=['theta'], credible_interval=0.95, round_to=2)
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Relative density');
Now we re-run the model with new data (the number of successes is now 70 out of 100). The previously compiled Stan program is reused, which makes the run faster.
In [84]:
data = dict(N=100, y=70)
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 33, color='lightblue', edgecolor='grey') # histogram
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Number of draws')
az.plot_posterior(fit, var_names=['theta'], credible_interval=0.95, round_to=2)
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Relative density');
In the above examples the probability of success θ was declared as:
real<lower=0,upper=1> theta;
Stan automatically transforms constrained variables to the unconstrained space, using the logit transformation for interval constraints and the log transformation for one-sided constraints.
The following example shows how we can also make the transformation explicitly and use the binomial_logit function, which takes the unconstrained parameter as an argument and applies the inverse logit transformation internally. This form can be useful for better numerical stability.
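For reference, the logit transformation and its inverse are available in scipy.special as logit and expit; binomial_logit works with the unconstrained log-odds and maps them back to a probability internally:

```python
import numpy as np
from scipy.special import logit, expit

theta = 0.7
alpha = logit(theta)      # log(theta / (1 - theta)), unconstrained
print(alpha)
print(expit(alpha))       # inverse logit maps back to 0.7
# expit avoids overflow for extreme values of the unconstrained parameter
print(expit(-800.0))      # underflows gracefully to 0.0
```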
In [85]:
with open('binomb.stan') as file:
    print(file.read())
In [86]:
model = stan_utility.compile_model('binomb.stan')
Here we have used a Gaussian prior in the unconstrained space, which produces a close-to-uniform prior for theta.
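This can be checked by simulation: pushing draws from a zero-mean Gaussian through the inverse logit gives a nearly flat distribution on (0, 1). The scale 1.5 used here is illustrative; the exact value depends on the prior in binomb.stan.

```python
import numpy as np
from scipy.special import expit

rng = np.random.RandomState(0)
alpha = rng.normal(0.0, 1.5, size=100000)  # prior draws in the unconstrained space
theta = expit(alpha)                       # implied prior draws for theta
# roughly equal mass in each decile of (0, 1) means close to uniform
freq, _ = np.histogram(theta, bins=10, range=(0, 1))
print(freq / len(theta))
```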
Sample from the posterior and plot it. The histogram should look similar to the one from the previous model.
In [87]:
data = dict(N=100, y=70)
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 33, color='lightblue', edgecolor='grey')
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Number of draws')
az.plot_posterior(fit, var_names=['theta'], credible_interval=0.95, round_to=2)
plt.xlabel(r'Probability of success (theta)')
plt.ylabel('Relative density');
An experiment was performed to estimate the effect of beta-blockers on mortality of cardiac patients. A group of patients were randomly assigned to treatment and control groups:
Data:
In [88]:
data = dict(N1=674, y1=39, N2=680, y2=22)
To analyse whether the treatment is useful, we can use a Binomial model for both groups and compute the odds ratio. If the odds ratio is less than 1, the treatment is beneficial.
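For reference, the odds ratio of two success probabilities is the ratio of their odds p/(1-p). A minimal sketch, here computed as the group-2 odds over the group-1 odds (the actual definition used in the analysis is in binom2.stan):

```python
def odds_ratio(p2, p1):
    """Ratio of the group-2 odds to the group-1 odds."""
    return (p2 / (1 - p2)) / (p1 / (1 - p1))

# plug in the observed proportions: 22/680 in group 2, 39/674 in group 1
print(odds_ratio(22 / 680, 39 / 674))   # ~0.54, i.e. less than 1
```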
In [89]:
with open('binom2.stan') as file:
    print(file.read())
Sample from the posterior and plot the posterior for the odds ratio. We plot the posterior both as a histogram and as an estimated probability density function. Using ArviZ, we can easily see the probability of the odds ratio being less than 1.
In [90]:
model = stan_utility.compile_model('binom2.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['oddsratio'], 33, color='lightblue', edgecolor='grey')
plt.xlabel('oddsratio')
plt.ylabel('Number of draws')
# With az.plot_posterior we use parameter ref_val = 1
# since treatment is beneficial if oddsratio is smaller than 1.
az.plot_posterior(fit, var_names=['oddsratio'], credible_interval=0.95, round_to=2, ref_val=1)
plt.xlabel('oddsratio')
plt.ylabel('Relative density')
plt.savefig('odd-ratio.pdf')
In [91]:
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'kilpisjarvi-summer-temp.csv'
    )
)
d = np.loadtxt(data_path, dtype=np.double, delimiter=';', skiprows=1)
x = d[:, 0]
y = d[:, 4]
N = len(x)
xpred = 2016
Plot the data
In [92]:
# make slightly wider figure
wide_figsize = plt.rcParams['figure.figsize'].copy()
wide_figsize[0] *= 1.4
plt.figure(figsize=wide_figsize)
plt.scatter(x, y)
plt.xlabel('year')
plt.ylabel('summer temperature at Kilpisjarvi');
To analyse whether the average summer month temperature is rising, we use a linear model with a Gaussian model for the unexplained variation.
The following Stan code also allows setting the hyperparameter values as data, which makes it easier to use different priors in different analyses:
In [93]:
with open('lin.stan') as file:
    print(file.read())
Create a dictionary with the data and prior parameters:
In [94]:
data = dict(
    N=N,
    x=x,
    y=y,
    xpred=xpred,
    pmualpha=y.mean(),  # centered
    psalpha=100,        # weakly informative prior
    pmubeta=0,          # a priori increase and decrease are equally likely
    psbeta=(0.1 - (-0.1)) / 6.0  # avg temp probably does not increase more than
                                 # 1 degree per 10 years
)
Run Stan
In [95]:
model = stan_utility.compile_model('lin.stan')
fit_lin = model.sampling(data=data, seed=194838)
samples = fit_lin.extract(permuted=True)
Check the n_eff and Rhat values.
print(fit_lin)
Check the treedepth, E-BFMI, and divergences
In [97]:
stan_utility.check_treedepth(fit_lin)
stan_utility.check_energy(fit_lin)
stan_utility.check_div(fit_lin)
We get a warning that several iterations saturated the maximum treedepth. The reason for this is the very strong posterior correlation between alpha and beta, which is shown in the next plot. This doesn't invalidate the results, but it leads to suboptimal performance. We'll later look at an alternative parameterization which reduces the posterior correlation.
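The source of the correlation can be illustrated without MCMC: for a linear-Gaussian model with a flat prior, the posterior correlation of (alpha, beta) follows the correlation implied by (X^T X)^{-1}, and with x values around 1980 it is essentially -1. Centering x removes it:

```python
import numpy as np

x = np.arange(1952, 2014).astype(float)      # years, far from zero
X = np.column_stack([np.ones_like(x), x])
# for a linear-Gaussian model with a flat prior, the posterior covariance
# of (alpha, beta) is proportional to inv(X^T X)
cov = np.linalg.inv(X.T @ X)
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print('correlation with raw years:      {:.5f}'.format(corr))   # very close to -1
x_c = x - x.mean()
Xc = np.column_stack([np.ones_like(x), x_c])
covc = np.linalg.inv(Xc.T @ Xc)
corrc = covc[0, 1] / np.sqrt(covc[0, 0] * covc[1, 1])
print('correlation with centered years: {:.5f}'.format(corrc))  # essentially 0
```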
In [98]:
samples = fit_lin.extract(permuted=True)
# preview 500 posterior samples of (alpha, beta)
plt.scatter(samples['alpha'][:500], samples['beta'][:500], 50, marker='.', alpha=0.25)
plt.xlabel('alpha')
plt.ylabel('beta');
Compute the probability that the summer temperature is increasing. The beta-parameter represents the slope of the temperature change. Hence, by checking the probability that beta > 0, we get the probability that the temperature is rising.
In [99]:
print('Pr(beta > 0) = {}'.format(np.mean(samples['beta'] > 0)))
Plot the data, the model fit and prediction for year 2016.
In [100]:
# make slightly wider figure of 3 plots
figsize = plt.rcParams['figure.figsize'].copy()
figsize[0] *= 2.2 # width
figsize[1] *= 1.2 # height
fig, ax = plt.subplots(1, 1, figsize=figsize)
# plot 1: scatterplot and lines
color_scatter = 'C0' # 'C0' for default color #0
color_line = 'C3' # 'C1' for default color #1
ax.fill_between(
    x,
    np.percentile(samples['mu'], 5, axis=0),
    np.percentile(samples['mu'], 95, axis=0),
    # lighten color_line
    color=1 - 0.4*(1 - np.array(mpl.colors.to_rgb(color_line)))
)
ax.plot(
    x,
    np.percentile(samples['mu'], 50, axis=0),
    color=color_line,
    linewidth=2,
    alpha=0.8
)
ax.scatter(x, y, 10, color=color_scatter)
ax.set_xlabel('year')
ax.set_ylabel('summer temp. @Kilpisjarvi')
ax.set_xlim((1951.5, 2013.5))
az.plot_posterior(fit_lin, var_names=['beta','sigma','ypred'], credible_interval=0.95, round_to=2, kind='hist', bins=35)
plt.xlabel('y-prediction for x={}'.format(xpred));
For the beta parameter (the slope), let's also plot the relative density with reference value of 0. Now we can easily evaluate the probability for slope being positive.
In [101]:
az.plot_posterior(fit_lin, var_names=['beta'], credible_interval=0.95, round_to=2, ref_val=0, kind='kde')
plt.xlabel('beta')
plt.ylabel('Relative density');
In the above we used the unnormalized data, and as the x values are far from zero, this causes a very strong posterior dependency between alpha and beta. The strong posterior dependency can be removed by normalizing the data to have zero mean. The following Stan code performs the normalization within Stan; in generated quantities we do the corresponding transformation back to the original scale.
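The back-transformation follows from simple algebra: if x_std = (x - mean(x))/sd(x) and mu = alpha_std + beta_std * x_std, then on the original scale beta = beta_std/sd(x) and alpha = alpha_std - beta_std * mean(x)/sd(x). A quick check with an ordinary least-squares fit standing in for the posterior draws (synthetic data; the exact transformation used in the analysis is defined in lin_std.stan):

```python
import numpy as np

rng = np.random.RandomState(1)
x = np.arange(1952, 2014).astype(float)
y = 9.0 + 0.02 * (x - 1980) + rng.normal(0, 0.5, size=x.size)  # synthetic temps

mx, sx = x.mean(), x.std()
x_std = (x - mx) / sx
# least-squares fit on the standardized scale (stand-in for posterior draws)
beta_std, alpha_std = np.polyfit(x_std, y, 1)
# transform the coefficients back to the original scale
beta = beta_std / sx
alpha = alpha_std - beta_std * mx / sx
# the same fit done directly on the raw scale agrees
beta_raw, alpha_raw = np.polyfit(x, y, 1)
print(abs(beta - beta_raw), abs(alpha - alpha_raw))  # both tiny
```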
In [102]:
with open('lin_std.stan') as file:
    print(file.read())
Run Stan
In [103]:
model = stan_utility.compile_model('lin_std.stan')
fit_lin_std = model.sampling(data=data, seed=194838)
samples = fit_lin_std.extract(permuted=True)
Check the n_eff and Rhat values.
In [104]:
print(fit_lin_std)
Now we get much better effective sample sizes n_eff.
Check the treedepth, E-BFMI, and divergences
In [105]:
stan_utility.check_treedepth(fit_lin_std)
stan_utility.check_energy(fit_lin_std)
stan_utility.check_div(fit_lin_std)
Everything is fine now. The next figure shows that with the standardized data there is not much posterior correlation:
In [106]:
samples = fit_lin_std.extract(permuted=True)
plt.figure()
# preview 2000 samples
plt.scatter(samples['alpha'][:2000], samples['beta'][:2000], 50, marker='.', alpha=0.25)
plt.xlabel('alpha')
plt.ylabel('beta');
Compute the probability that the summer temperature is increasing.
In [107]:
print('Pr(beta > 0) = {}'.format(np.mean(samples['beta'] > 0)))
Plot the data, the model fit and prediction for year 2016.
In [108]:
# make slightly wider figure of 3 plots
figsize = plt.rcParams['figure.figsize'].copy()
figsize[0] *= 2.2 # width
figsize[1] *= 1.2 # height
fig, ax = plt.subplots(1, 1, figsize=figsize)
# plot 1: scatterplot and lines
color_scatter = 'C0' # 'C0' for default color #0
color_line = 'C3' # 'C1' for default color #1
ax.fill_between(
    x,
    np.percentile(samples['mu'], 5, axis=0),
    np.percentile(samples['mu'], 95, axis=0),
    # lighten color_line
    color=1 - 0.4*(1 - np.array(mpl.colors.to_rgb(color_line)))
)
ax.plot(
    x,
    np.percentile(samples['mu'], 50, axis=0),
    color=color_line,
    linewidth=2,
    alpha=0.8
)
ax.scatter(x, y, 10, color=color_scatter)
ax.set_xlabel('year')
ax.set_ylabel('summer temp. @Kilpisjarvi')
ax.set_xlim((1951.5, 2013.5))
az.plot_posterior(fit_lin_std, var_names=['beta','sigma','ypred'], credible_interval=0.95, round_to=2, kind='hist', bins=35)
plt.xlabel('y-prediction for x={}'.format(xpred));
Now we can also make a ridge plot of the mu draws. The ridge plot better illustrates that each year is represented by a full posterior distribution for the mu parameter.
In [109]:
# Make complementary ridge plot
fig, axes = az.plot_forest(fit_lin_std,
                           kind='ridgeplot',
                           var_names=['mu'],
                           combined=True,
                           r_hat=False,
                           n_eff=True,
                           ridgeplot_overlap=5,
                           ridgeplot_alpha=0.7,
                           colors='lightblue',
                           figsize=(9, 7))
# adjust labels (default arviz labels are hard to read for this many distributions)
axes[0].set_yticklabels(
    [str(year) if year % 2 == 0 else '' for year in range(2013, 1952-1, -1)])
axes[0].set_ylabel('Year')
axes[0].set_xlabel('Average temperature')
# make figure compact
fig.tight_layout();
The temperatures used in the above analyses are averages over three months, which makes it more likely that they are normally distributed; but there can be extreme events in the weather, so we can check whether the more robust Student's t observation model would give different results. The Student's t distribution has an additional parameter, the degrees of freedom nu.
In [110]:
with open('lin_t.stan') as file:
    print(file.read())
In [111]:
model = stan_utility.compile_model('lin_t.stan')
fit_lin_t = model.sampling(data=data, seed=194838)
samples = fit_lin_t.extract(permuted=True)
Check the n_eff and Rhat values.
In [112]:
print(fit_lin_t)
Again, without standardization we get smaller n_eff values, but they are large enough for practical purposes.
Check the treedepth, E-BFMI, and divergences
In [113]:
stan_utility.check_treedepth(fit_lin_t)
stan_utility.check_energy(fit_lin_t)
stan_utility.check_div(fit_lin_t)
We see again many iterations saturating the maximum tree depth, which is harmful for the efficiency but doesn't invalidate the results.
Compute the probability that the summer temperature is increasing.
In [114]:
samples = fit_lin_t.extract(permuted=True)
print('Pr(beta > 0) = {}'.format(np.mean(samples['beta'] > 0)))
We get a similar probability as with the Gaussian observation model.
Plot the data, the model fit, and the marginal posteriors for sigma and nu.
In [115]:
figsize = plt.rcParams['figure.figsize'].copy()
figsize[0] *= 2.2 # width
figsize[1] *= 1.2 # height
fig, ax = plt.subplots(1, 1, figsize=figsize)
# plot 1: scatterplot and lines
color_scatter = 'C0' # 'C0' for default color #0
color_line = 'C3' # 'C1' for default color #1
ax.fill_between(
    x,
    np.percentile(samples['mu'], 5, axis=0),
    np.percentile(samples['mu'], 95, axis=0),
    # lighten color_line
    color=1 - 0.4*(1 - np.array(mpl.colors.to_rgb(color_line)))
)
ax.plot(
    x,
    np.percentile(samples['mu'], 50, axis=0),
    color=color_line,
    linewidth=2,
    alpha=0.8
)
ax.scatter(x, y, 10, color=color_scatter)
ax.set_xlabel('year')
ax.set_ylabel('summer temp. @Kilpisjarvi')
ax.set_xlim((1951.5, 2013.5))
az.plot_posterior(fit_lin_t, var_names=['beta','sigma'], credible_interval=0.95, round_to=2, kind='hist', bins=35)
az.plot_posterior(fit_lin_t, var_names=['ypred', 'nu'], credible_interval=0.95, round_to=2, kind='hist', bins=35)
plt.xlabel('y-prediction for x={}'.format(xpred));
# make figure compact
fig.tight_layout();
The posterior of the degrees of freedom nu reveals that the Gaussian model is likely to be adequate, as Student's t with nu > 20 is very close to Gaussian.
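The closeness of Student's t to the Gaussian for large nu can be quantified with scipy (an illustrative aside, not part of the original notebook):

```python
import numpy as np
from scipy import stats

# compare the densities of Student's t and the standard normal on a grid
x = np.linspace(-4, 4, 801)
for nu in [2, 5, 20, 100]:
    max_diff = np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x)))
    print('nu = {:3d}: max |t pdf - normal pdf| = {:.4f}'.format(nu, max_diff))
```

The maximum density difference shrinks quickly with nu, so a posterior concentrated above nu = 20 supports using a Gaussian observation model.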
We can use leave-one-out cross-validation to compare the expected predictive performance. For the following code to execute, the log-likelihood needs to be evaluated in the Stan code; this is typically done by computing a log_lik quantity for each observation in the generated quantities block.
In [131]:
# Convert stan objects to ArviZ inference data objects
inference_lin = az.convert_to_inference_data(fit_lin, log_likelihood='log_lik')
inference_lin_t = az.convert_to_inference_data(fit_lin_t, log_likelihood='log_lik')
# Store inference data objects to dictionary
models = dict([
    ("linear_model", inference_lin),
    ("linear_model_t", inference_lin_t)
])
# Compare models with loo
comparison = az.compare(models, ic='loo')
print(comparison)
The loo values are similar for the two models; hence, there is no practical difference between the Gaussian and Student's t observation models for this data.
In the following, for each year, we analyze the monthly average temperatures for the three summer months: June, July, and August. In the previous examples, we used the average of these three months as an estimate of the yearly average summer temperature. Now we would like to know whether there is a difference between the three summer months.
In [117]:
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'kilpisjarvi-summer-temp.csv'
    )
)
# Load data
d = np.loadtxt(data_path, dtype=np.double, delimiter=';', skiprows=1)
# summer months are numbered from 1 to 3
x = np.tile(np.arange(1, 4), d.shape[0]) # create array with repeating numbers 1,2,3 for each year
y = d[:, 1:4].ravel()
N = len(x)
data = dict(
    N=N,
    K=3,  # 3 groups
    x=x,  # group indicators
    y=y   # observations
)
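The alignment of x and y relies on NumPy's row-major ravel: each year's three monthly values are flattened in order, matching the repeating 1, 2, 3 group indicators from np.tile. A toy check:

```python
import numpy as np

# two years, three monthly temperatures each (June, July, August)
temps = np.array([[10.0, 12.0, 11.0],
                  [ 9.0, 13.0, 12.0]])
groups = np.tile(np.arange(1, 4), temps.shape[0])   # 1,2,3,1,2,3
values = temps.ravel()                              # row-major: year by year
print(groups)                 # [1 2 3 1 2 3]
print(values)                 # [10. 12. 11.  9. 13. 12.]
# e.g. pick out all July (group 2) observations
print(values[groups == 2])    # [12. 13.]
```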
In [118]:
with open('grp_aov.stan') as file:
    print(file.read())
Fit the model
In [119]:
model = stan_utility.compile_model('grp_aov.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Check the n_eff and Rhat values for the sampled model.
In [120]:
print(fit)
Check the treedepth, E-BFMI, and divergences
In [121]:
stan_utility.check_treedepth(fit)
stan_utility.check_energy(fit)
stan_utility.check_div(fit)
Plot the group mean distributions and a matrix of probabilities that one mu is larger than another.
In [122]:
# matrix of probabilities that one mu is larger than another
mu = fit.extract(permuted=True)['mu']
ps = np.zeros((3, 3))
for k1 in range(3):
    for k2 in range(k1+1, 3):
        ps[k1, k2] = np.mean(mu[:, k1] > mu[:, k2])
        ps[k2, k1] = 1 - ps[k1, k2]
print("Matrix of probabilities that one mu is larger than another:")
print(ps)
months = ['June', 'July', 'August']
fig, ax = az.plot_forest(fit,
                         kind='ridgeplot',
                         var_names=['mu'],
                         combined=True,
                         ridgeplot_overlap=1,
                         r_hat=False,
                         n_eff=False,
                         figsize=(7, 7))
ax[0].set_yticklabels(months[::-1])  # order of months is reversed, since arviz flips the order of ticks
ax[0].set_xlabel('Average temperature');
In [123]:
with open('grp_prior_mean.stan') as file:
    print(file.read())
Fit the model
In [124]:
model = stan_utility.compile_model('grp_prior_mean.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Plot the group mean distributions and the matrix of probabilities that one mu is larger than another. The probabilities are practically either 1 or 0, since the distributions are almost completely non-overlapping, as can be seen from the ridge plot below.
In [125]:
# matrix of probabilities that one mu is larger than another
mu = fit.extract(permuted=True)['mu']
ps = np.zeros((3, 3))
for k1 in range(3):
    for k2 in range(k1+1, 3):
        ps[k1, k2] = np.mean(mu[:, k1] > mu[:, k2])
        ps[k2, k1] = 1 - ps[k1, k2]
print("Matrix of probabilities that one mu is larger than another:")
print(ps)
months = ['June', 'July', 'August']
fig, ax = az.plot_forest(fit,
                         kind='ridgeplot',
                         var_names=['mu'],
                         combined=True,
                         ridgeplot_overlap=1,
                         r_hat=False,
                         n_eff=False,
                         figsize=(7, 7))
ax[0].set_yticklabels(months[::-1])
ax[0].set_xlabel('Average temperature');
In [126]:
with open('grp_prior_mean_var.stan') as file:
    print(file.read())
Fit the model
In [127]:
model = stan_utility.compile_model('grp_prior_mean_var.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Plot the group mean distributions and the matrix of probabilities that one mu is larger than another:
In [128]:
samples = fit.extract(permuted=True)
print("std(mu0): {:.2f}".format(np.std(samples['mu0'])))
mu = samples['mu']
# matrix of probabilities that one mu is larger than another
ps = np.zeros((3, 3))
for k1 in range(3):
    for k2 in range(k1+1, 3):
        ps[k1, k2] = np.mean(mu[:, k1] > mu[:, k2])
        ps[k2, k1] = 1 - ps[k1, k2]
print("Matrix of probabilities that one mu is larger than another:")
print(ps)
months = ['June', 'July', 'August']
fig, ax = az.plot_forest(fit,
                         kind='ridgeplot',
                         var_names=['mu'],
                         combined=True,
                         ridgeplot_overlap=1,
                         r_hat=False,
                         n_eff=False,
                         figsize=(7, 7))
ax[0].set_yticklabels(months[::-1])
ax[0].set_xlabel('Average temperature');