We begin by importing the PyStan module as well as the matplotlib module for basic graphics facilities. We also import some utilities by Michael Betancourt, introduced in his PyStan workflow case study, from the module stan_utility.
In [1]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pystan
In [2]:
# add utilities directory to path
import os, sys
util_path = os.path.abspath(os.path.join(os.path.pardir, 'utilities_and_data'))
if util_path not in sys.path and os.path.exists(util_path):
    sys.path.insert(0, util_path)
# import from utilities
import stan_utility
In [3]:
# edit default plot settings
plt.rc('font', size=12)
In [4]:
data = dict(N=10, y=[0,1,0,0,1,1,1,0,1,0])
Bernoulli model with a Beta(1,1) (uniform) prior
In [5]:
with open('bern.stan') as file:
    print(file.read())
Given the Stan program, we then use the compile_model function of the stan_utility module to compile the Stan program into a C++ executable. This utility function automatically saves a cached version of the compiled model to disk for possible future use.
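Conceptually, the caching works roughly like this: hash the model code, and if a pickled model with that hash already exists on disk, load it instead of recompiling. The following is only a sketch of the idea (it uses a placeholder compile function instead of PyStan, and is not the actual stan_utility implementation):

```python
import hashlib
import os
import pickle
import tempfile

def cached_compile(model_code, compile_fn, cache_dir):
    """Compile model_code with compile_fn, caching the result on disk."""
    key = hashlib.md5(model_code.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, 'cached-model-{}.pkl'.format(key))
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)   # reuse the cached compiled model
    model = compile_fn(model_code)  # the expensive compilation step
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return model

# placeholder "compiler" that records how many times it is called
calls = []
def fake_compile(code):
    calls.append(code)
    return 'compiled:' + code

with tempfile.TemporaryDirectory() as cache_dir:
    m1 = cached_compile('model {}', fake_compile, cache_dir)
    m2 = cached_compile('model {}', fake_compile, cache_dir)  # served from cache

print(len(calls))  # the compile function ran only once
```

Because the hash is computed from the model code, editing the Stan program automatically triggers a fresh compilation.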
In [6]:
model = stan_utility.compile_model('bern.stan')
Getting the model again with the utility function uses the cached version automatically.
In [7]:
del model
model = stan_utility.compile_model('bern.stan')
Sample from the posterior, show the summary, and plot the histogram of the posterior draws. We recommend explicitly specifying the seed of Stan's random number generator, as we have done here, so that we can reproduce these results exactly in the future, at least when using the same machine, operating system, and interface. This is especially helpful for the more subtle pathologies that may not always be triggered, which can result in seemingly stochastic behavior.
In [8]:
fit = model.sampling(data=data, seed=194838)
print(fit)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 50);
In [9]:
data = dict(N=10, y=8)
Then we use a binomial model with a Beta(1,1) prior for the probability of success.
In [10]:
with open('binom.stan') as file:
    print(file.read())
Sample from the posterior and plot the posterior. The histogram should look similar to the one in the Bernoulli case.
In [11]:
model = stan_utility.compile_model('binom.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 50);
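Because the Beta prior is conjugate to the binomial likelihood, the posterior here is also available in closed form: Beta(1 + y, 1 + N - y) = Beta(9, 3). As a sanity check independent of Stan, we can compare the analytic posterior mean to a simple grid approximation of the posterior (a sketch, not part of the original workflow):

```python
import numpy as np

N, y = 10, 8
a, b = 1 + y, 1 + N - y  # Beta(9, 3) posterior under a Beta(1,1) prior

# grid approximation of the posterior density, proportional to the likelihood
theta = np.linspace(1e-6, 1 - 1e-6, 10000)
dtheta = theta[1] - theta[0]
dens = theta**y * (1 - theta)**(N - y)
dens /= dens.sum() * dtheta  # normalize so the density integrates to 1

grid_mean = (theta * dens).sum() * dtheta
analytic_mean = a / (a + b)  # = 0.75
print(grid_mean, analytic_mean)
```

The MCMC histogram above should be centered close to this analytic mean.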
Re-run the model with new data. Here the same compiled model object is reused.
In [12]:
data = dict(N=10, y=10)
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['theta'], 50);
An experiment was performed to estimate the effect of beta-blockers on mortality of cardiac patients. A group of patients were randomly assigned to treatment and control groups:
In [13]:
data = dict(N1=674, y1=39, N2=680, y2=22)
To analyse whether the treatment is useful, we can use a binomial model for both groups and compute the odds ratio:
In [14]:
with open('binom2.stan') as file:
    print(file.read())
Sample from the posterior and plot the posterior of the odds ratio.
In [15]:
model = stan_utility.compile_model('binom2.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
plt.hist(samples['oddsratio'], 50);
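For reference, the odds ratio is, for each posterior draw, the odds of death in one group divided by the odds in the other. A quick point-estimate sketch using the raw proportions (assuming the Stan program defines oddsratio as the treatment-group odds over the control-group odds; the variable names below are illustrative):

```python
# raw death counts: group 1 = control, group 2 = treatment
N1, y1, N2, y2 = 674, 39, 680, 22
p1, p2 = y1 / N1, y2 / N2  # raw death proportions in the two groups
odds_ratio = (p2 / (1 - p2)) / (p1 / (1 - p1))
print(odds_ratio)  # well below 1: fewer deaths in the treatment group
```

The posterior histogram above should be concentrated around this crude point estimate.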
In [16]:
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'kilpisjarvi-summer-temp.csv'
    )
)
d = np.loadtxt(data_path, dtype=np.double, delimiter=';', skiprows=1)
x = d[:, 0]
y = d[:, 4]
N = len(x)
xpred = 2016
Plot the data
In [17]:
# make slightly wider figure
wide_figsize = plt.rcParams['figure.figsize'].copy()
wide_figsize[0] *= 1.4
plt.figure(figsize=wide_figsize)
plt.scatter(x, y)
plt.xlabel('year')
plt.ylabel('summer temperature at Kilpisjarvi');
To analyse whether the average summer month temperature is rising, we use a linear model with a Gaussian model for the unexplained variation.
The following Stan code also allows setting the hyperparameter values as data, making it easier to use different priors in different analyses:
In [18]:
with open('lin.stan') as file:
    print(file.read())
Create a dictionary with the data and the prior hyperparameters:
In [19]:
data = dict(
    N = N,
    x = x,
    y = y,
    xpred = xpred,
    pmualpha = y.mean(),  # centered
    psalpha = 100,        # weakly informative prior
    pmubeta = 0,          # a priori an increase and a decrease are equally likely
    psbeta = (.1 - -.1) / 6.0  # avg temp probably does not increase more than
                               # 1 degree per 10 years
)
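The prior scale psbeta comes from treating the interval (-0.1, 0.1) degrees per year, i.e. plus or minus 1 degree per decade, as roughly a plus or minus 3 standard deviation range:

```python
# width of the plausible interval divided by 6 gives the prior sd,
# interpreting the interval as covering about +-3 standard deviations
psbeta = (0.1 - (-0.1)) / 6.0
print(psbeta)  # about 0.033 degrees per year
```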
Run Stan
In [20]:
model = stan_utility.compile_model('lin.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Check the n_eff and Rhat
In [21]:
print(fit)
Check the treedepth, E-BFMI, and divergences
In [22]:
stan_utility.check_treedepth(fit)
stan_utility.check_energy(fit)
stan_utility.check_div(fit)
We get a warning that several iterations saturated the maximum treedepth. The reason for this is the very strong posterior correlation between alpha and beta, shown in the following figure. This doesn't invalidate the results, but it leads to suboptimal performance. We'll later look at an alternative which reduces the posterior correlation.
In [23]:
samples = fit.extract(permuted=True)
# preview 500 posterior samples of (alpha, beta)
plt.scatter(samples['alpha'][:500], samples['beta'][:500], 5)
plt.xlabel('alpha')
plt.ylabel('beta');
Compute the probability that the summer temperature is increasing.
In [24]:
print('Pr(beta > 0) = {}'.format(np.mean(samples['beta'] > 0)))
Plot the data, the model fit and prediction for year 2016.
In [25]:
# make slightly wider figure of 3 plots
figsize = plt.rcParams['figure.figsize'].copy()
figsize[0] *= 1.4 # width
figsize[1] *= 2.5 # height
fig, axes = plt.subplots(3, 1, figsize=figsize)
# plot 1: scatterplot and lines
color_scatter = 'C0' # 'C0' for default color #0
color_line = 'C1' # 'C1' for default color #1
# lighten color_line
color_shade = (
1 - 0.1*(1 - np.array(mpl.colors.to_rgb(color_line)))
)
# plot
ax = axes[0]
ax.fill_between(
x,
np.percentile(samples['mu'], 5, axis=0),
np.percentile(samples['mu'], 95, axis=0),
color=color_shade
)
ax.plot(
x,
np.percentile(samples['mu'], 50, axis=0),
color=color_line,
linewidth=1
)
ax.scatter(x, y, 5, color=color_scatter)
ax.set_xlabel('year')
ax.set_ylabel('summer temperature at Kilpisjarvi')
ax.set_xlim((1952, 2013))
# plot 2: histogram
ax = axes[1]
ax.hist(samples['beta'], 50)
ax.set_xlabel('beta')
# plot 3: histogram
ax = axes[2]
ax.hist(samples['ypred'], 50)
ax.set_xlabel('y-prediction for x={}'.format(xpred))
# make figure compact
fig.tight_layout();
In the above we used the unnormalized data, and as the x values are far away from zero, this causes a very strong posterior dependency between alpha and beta (did you use ShinyStan for the above model?). The strong posterior dependency can be removed by normalizing the data to have zero mean. The following Stan code does this normalization within Stan. In the generated quantities block we do the corresponding transformation back to the original scale.
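The back-transformation follows from simple algebra: if x_std = (x - mean(x))/sd(x) and y_std = (y - mean(y))/sd(y), and the model on the standardized scale is y_std = alpha_std + beta_std * x_std, then on the original scale beta = beta_std * sd(y)/sd(x) and alpha = mean(y) + sd(y) * alpha_std - beta * mean(x). A small sketch checking this with an ordinary least-squares fit (the variable names and toy data are illustrative, not from the Stan program):

```python
import numpy as np

# toy data on an "original" scale far from zero, like the years above
xs = np.array([1991., 1995., 2000., 2005., 2010.])
ys = np.array([9.4, 10.2, 9.9, 11.0, 10.7])

mx, sx = xs.mean(), xs.std()
my, sy = ys.mean(), ys.std()
x_std = (xs - mx) / sx
y_std = (ys - my) / sy

# least-squares fits on the standardized and original scales
beta_std, alpha_std = np.polyfit(x_std, y_std, 1)
beta, alpha = np.polyfit(xs, ys, 1)

# back-transform from the standardized scale to the original scale
beta_back = beta_std * sy / sx
alpha_back = my + sy * alpha_std - beta_back * mx
print(np.isclose(beta, beta_back), np.isclose(alpha, alpha_back))
```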
In [26]:
with open('lin_std.stan') as file:
    print(file.read())
Run Stan
In [27]:
model = stan_utility.compile_model('lin_std.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Check the n_eff and Rhat
In [28]:
print(fit)
We now get much better effective sample sizes n_eff.
Check the treedepth, E-BFMI, and divergences
In [29]:
stan_utility.check_treedepth(fit)
stan_utility.check_energy(fit)
stan_utility.check_div(fit)
Everything is fine now. The next figure shows that with the standardized data there is not much posterior correlation:
In [30]:
samples = fit.extract(permuted=True)
plt.figure()
# preview 2000 samples
plt.scatter(samples['alpha'][:2000], samples['beta'][:2000], 5)
plt.xlabel('alpha')
plt.ylabel('beta');
Compute the probability that the summer temperature is increasing.
In [31]:
print('Pr(beta > 0) = {}'.format(np.mean(samples['beta'] > 0)))
Plot the data, the model fit and prediction for year 2016.
In [32]:
# make slightly wider figure of 3 plots
figsize = plt.rcParams['figure.figsize'].copy()
figsize[0] *= 1.4 # width
figsize[1] *= 2.5 # height
fig, axes = plt.subplots(3, 1, figsize=figsize)
# plot 1: scatterplot and lines
color_scatter = 'C0' # 'C0' for default color #0
color_line = 'C1' # 'C1' for default color #1
# lighten color_line
color_shade = (
1 - 0.1*(1 - np.array(mpl.colors.to_rgb(color_line)))
)
# plot
ax = axes[0]
ax.fill_between(
x,
np.percentile(samples['mu'], 5, axis=0),
np.percentile(samples['mu'], 95, axis=0),
color=color_shade
)
ax.plot(
x,
np.percentile(samples['mu'], 50, axis=0),
color=color_line,
linewidth=1
)
ax.scatter(x, y, 5, color=color_scatter)
ax.set_xlabel('year')
ax.set_ylabel('summer temperature at Kilpisjarvi')
ax.set_xlim((1952, 2013))
# plot 2: histogram
ax = axes[1]
ax.hist(samples['beta'], 50)
ax.set_xlabel('beta')
# plot 3: histogram
ax = axes[2]
ax.hist(samples['ypred'], 50)
ax.set_xlabel('y-prediction for x={}'.format(xpred))
# make figure compact
fig.tight_layout();
The temperatures used in the above analyses are averages over three months, which makes it more likely that they are normally distributed, but there can be extreme events in the weather, and we can check whether a more robust Student's t observation model would give different results.
In [33]:
with open('lin_t.stan') as file:
    print(file.read())
In [34]:
model = stan_utility.compile_model('lin_t.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Check the n_eff and Rhat
In [35]:
print(fit)
Without standardization we again get smaller n_eff's, but they are large enough for practical purposes.
Check the treedepth, E-BFMI, and divergences
In [36]:
stan_utility.check_treedepth(fit)
stan_utility.check_energy(fit)
stan_utility.check_div(fit)
We again see many iterations saturating the maximum treedepth, which is harmful for efficiency but doesn't invalidate the results.
Compute the probability that the summer temperature is increasing.
In [37]:
samples = fit.extract(permuted=True)
print('Pr(beta > 0) = {}'.format(np.mean(samples['beta'] > 0)))
We get a similar probability as with the Gaussian observation model.
Plot the data, the model fit, and the marginal posteriors for sigma and nu.
In [38]:
# make slightly wider figure of 4 plots
figsize = plt.rcParams['figure.figsize'].copy()
figsize[0] *= 1.4 # width
figsize[1] *= 3 # height
fig, axes = plt.subplots(4, 1, figsize=figsize)
# plot 1: scatterplot and lines
color_scatter = 'C0' # 'C0' for default color #0
color_line = 'C1' # 'C1' for default color #1
# lighten color_line
color_shade = (
1 - 0.1*(1 - np.array(mpl.colors.to_rgb(color_line)))
)
# plot
ax = axes[0]
ax.fill_between(
x,
np.percentile(samples['mu'], 5, axis=0),
np.percentile(samples['mu'], 95, axis=0),
color=color_shade
)
ax.plot(
x,
np.percentile(samples['mu'], 50, axis=0),
color=color_line,
linewidth=1
)
ax.scatter(x, y, 5, color=color_scatter)
ax.set_xlabel('year')
ax.set_ylabel('summer temperature at Kilpisjarvi')
ax.set_xlim((1952, 2013))
# plot 2: histogram
ax = axes[1]
ax.hist(samples['beta'], 50)
ax.set_xlabel('beta')
# plot 3: histogram
ax = axes[2]
ax.hist(samples['sigma'], 50)
ax.set_xlabel('sigma')
# plot 4: histogram
ax = axes[3]
ax.hist(samples['nu'], 50)
ax.set_xlabel('nu')
# make figure compact
fig.tight_layout();
The posterior of nu reveals that the Gaussian model is likely to be adequate, as Student's t with nu > 20 is very close to Gaussian.
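To see why nu > 20 already behaves almost like a Gaussian, we can compare the standard Student's t density with the standard normal density directly (a self-contained sketch using only the standard library):

```python
import math

def student_t_pdf(x, nu):
    """Density of the standard Student's t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# maximum absolute density difference at a few points for nu = 30
xs = [-3, -2, -1, 0, 1, 2, 3]
max_diff = max(abs(student_t_pdf(x, 30) - normal_pdf(x)) for x in xs)
print(max_diff)  # small: the two densities nearly coincide
```

With small nu the t distribution has much heavier tails, which is what makes it robust to occasional extreme observations.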
In [39]:
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'kilpisjarvi-summer-temp.csv'
    )
)
d = np.loadtxt(data_path, dtype=np.double, delimiter=';', skiprows=1)
# Is there a difference between the summer months?
x = np.tile(np.arange(1, 4), d.shape[0]) # summer months are numbered from 1 to 3
y = d[:, 1:4].ravel()
N = len(x)
data = dict(
    N = N,
    K = 3,  # 3 groups
    x = x,  # group indicators
    y = y   # observations
)
In [40]:
with open('grp_aov.stan') as file:
    print(file.read())
Fit the model
In [41]:
model = stan_utility.compile_model('grp_aov.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Check the n_eff and Rhat
In [42]:
print(fit)
Check the treedepth, E-BFMI, and divergences
In [43]:
stan_utility.check_treedepth(fit)
stan_utility.check_energy(fit)
stan_utility.check_div(fit)
Plot the group mean distributions and the matrix of probabilities that one mu is larger than another.
In [44]:
# matrix of probabilities that one mu is larger than another
mu = fit.extract(permuted=True)['mu']
ps = np.zeros((3, 3))
for k1 in range(3):
    for k2 in range(k1+1, 3):
        ps[k1, k2] = np.mean(mu[:, k1] > mu[:, k2])
        ps[k2, k1] = 1 - ps[k1, k2]
print("Matrix of probabilities that one mu is larger than another:")
print(ps)
# plot violins
plt.violinplot(mu, showmeans=True, showextrema=False)
plt.xticks(np.arange(1, 4))
plt.xlabel('group')
plt.ylabel('mu');
In [45]:
with open('grp_prior_mean.stan') as file:
    print(file.read())
Fit the model
In [46]:
model = stan_utility.compile_model('grp_prior_mean.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Plot the group mean distributions and the matrix of probabilities that one mu is larger than another.
In [47]:
# matrix of probabilities that one mu is larger than another
mu = fit.extract(permuted=True)['mu']
ps = np.zeros((3, 3))
for k1 in range(3):
    for k2 in range(k1+1, 3):
        ps[k1, k2] = np.mean(mu[:, k1] > mu[:, k2])
        ps[k2, k1] = 1 - ps[k1, k2]
print("Matrix of probabilities that one mu is larger than another:")
print(ps)
# plot violins
plt.violinplot(mu, showmeans=True, showextrema=False)
plt.xticks(np.arange(1, 4))
plt.xlabel('group')
plt.ylabel('mu');
In [48]:
with open('grp_prior_mean_var.stan') as file:
    print(file.read())
Fit the model
In [49]:
model = stan_utility.compile_model('grp_prior_mean_var.stan')
fit = model.sampling(data=data, seed=194838)
samples = fit.extract(permuted=True)
Plot the group mean distributions and the matrix of probabilities that one mu is larger than another:
In [50]:
samples = fit.extract(permuted=True)
print("std(mu0): {:.2f}".format(np.std(samples['mu0'])))
mu = samples['mu']
# matrix of probabilities that one mu is larger than another
ps = np.zeros((3, 3))
for k1 in range(3):
    for k2 in range(k1+1, 3):
        ps[k1, k2] = np.mean(mu[:, k1] > mu[:, k2])
        ps[k2, k1] = 1 - ps[k1, k2]
print("Matrix of probabilities that one mu is larger than another:")
print(ps)
# plot violins
plt.violinplot(mu, showmeans=True, showextrema=False)
plt.xticks(np.arange(1, 4))
plt.xlabel('group')
plt.ylabel('mu');