Model Selection

In Bayesian statistics, an objective model comparison is carried out by comparing the models' marginal likelihood. The likelihood function describes the probability (density) of the data, given the parameter values (and thereby the chosen model). By integrating out (marginalizing) the parameter values, one obtains the marginal likelihood (also called the model evidence), which directly measures how well the model at hand explains the data. The model evidence represents a powerful tool for model selection, as it does not rely on specific distributional assumptions (Student's t-test, for example, assumes Gaussian distributed variables) and automatically follows the principle of Occam's razor.
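
Formally, the marginal likelihood of the data $D$ under a model $M$ with parameters $\theta$ is obtained by integrating the likelihood over the prior distribution of the parameters:

$$p(D|M) = \int p(D|\theta, M)\, p(\theta|M)\, \mathrm{d}\theta$$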

The forward-backward algorithm implemented in bayesloop allows one to approximate the model evidence based on the discretized parameter distributions. Details on this method in the context of Hidden Markov models can be found here.
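
In a nutshell, the forward pass processes the data points one at a time, and the model evidence factorizes into the corresponding one-step-ahead predictive likelihoods, each of which arises as the normalization constant of one forward step:

$$p(D|M) = \prod_t p(d_t\,|\,d_1, \ldots, d_{t-1}, M)$$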

This section gives a brief introduction to Bayes factors, illustrated with the coal mining data set, and further introduces combined and serial transition models in bayesloop.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt # plotting
import seaborn as sns           # nicer plots
sns.set_style('whitegrid')      # plot styling

import numpy as np
import bayesloop as bl

# prepare study for coal mining data
S = bl.Study()
S.loadExampleData()

L = bl.om.Poisson('accident_rate', bl.oint(0, 6, 1000))
S.set(L)


+ Created new study.
+ Successfully imported example data.
+ Observation model: Poisson. Parameter(s): ['accident_rate']

Bayes factors

Instead of interpreting the value of the marginal likelihood for a single model, it is common to compare two competing models/explanations $M_1$ and $M_2$ by evaluating the Bayes factor, here denoted as $B_{12}$. The Bayes factor is given by the ratio of the two marginal likelihood values:

$$B_{12} = \frac{p(D|M_1)}{p(D|M_2)}$$

where $p(D|M_i)$ denotes the probability of the data (the marginal likelihood) given model $M_i$. A value of $B_{12} > 1$ therefore indicates that the data is better explained by model $M_1$ than by $M_2$. More detailed information on the interpretation of Bayes factor values can be found here and in the references therein.
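
As a rough orientation, the (base-10) logarithm of the Bayes factor is often translated into qualitative categories of evidence. The following helper function is not part of bayesloop, but sketches one common version of such a scale (the thresholds follow Kass & Raftery, 1995):

import numpy as np

def interpretBayesFactor(B):
    # qualitative evidence category for B = p(D|M1)/p(D|M2),
    # based on the scale of Kass & Raftery (1995)
    log10B = np.log10(B)
    if log10B < 0:
        return 'supports M2 over M1'
    if log10B < 0.5:
        return 'not worth more than a bare mention'
    if log10B < 1:
        return 'substantial evidence for M1'
    if log10B < 2:
        return 'strong evidence for M1'
    return 'decisive evidence for M1'

print(interpretBayesFactor(1.5))     # not worth more than a bare mention
print(interpretBayesFactor(2.6e13))  # decisive evidence for M1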

As a first example, we investigate whether the disaster rate of the coal mining data set should indeed be modeled as a time-varying parameter (a constant rate is an equally valid hypothesis). We first fit the model using the GaussianRandomWalk transition model with a standard deviation of $\sigma = 0.2$ to determine the corresponding model evidence (on a log scale). Subsequently, we use the simpler Static transition model (assuming no change of the disaster rate over time) and compare the resulting model evidence by computing the Bayes factor. Note that for computational efficiency, the keyword argument evidenceOnly=True is used, which evaluates the model evidence but does not store any results for plotting.


In [2]:
# dynamic disaster rate
T = bl.tm.GaussianRandomWalk('sigma', 0.2, target='accident_rate')
S.set(T)
S.fit(evidenceOnly=True)

dynamicLogEvidence = S.log10Evidence

# static disaster rate
T = bl.tm.Static()
S.set(T)
S.fit(evidenceOnly=True)

staticLogEvidence = S.log10Evidence

# determine Bayes factor
B = 10**(dynamicLogEvidence - staticLogEvidence)
print('\nBayes factor: B =', B)


+ Transition model: Gaussian random walk. Hyper-Parameter(s): ['sigma']
+ Started new fit:
    + Formatted data.
    + Set prior (function): jeffreys. Values have been re-normalized.

    + Finished forward pass.
    + Log10-evidence: -74.59055
+ Transition model: Static/constant parameter values. Hyper-Parameter(s): []
+ Started new fit:
    + Formatted data.
    + Set prior (function): jeffreys. Values have been re-normalized.

    + Finished forward pass.
    + Log10-evidence: -88.00564

Bayes factor: B = 2.60066520417e+13

The computed Bayes factor $B = 2.6 \cdot 10^{13}$ shows decisive support for the first hypothesis of a dynamic disaster rate.

While the analysis above clearly favors a time-varying rate, there may still exist a dynamic model that fits the data better than the Gaussian random walk with $\sigma=0.2$. A very simple dynamic model is given by the transition model ChangePoint, which assumes an abrupt change of the disaster rate at a predefined point in time; we choose the year 1890 here. Note that the transition model ChangePoint does not need a target parameter, as it is applied to all parameters of an observation model. We perform a full fit in this case, as we want to plot the result.


In [3]:
# dynamic disaster rate (change-point model)
T = bl.tm.ChangePoint('tChange', 1890)
S.set(T)
S.fit()

dynamicLogEvidence2 = S.log10Evidence

# determine Bayes factor
B = 10**(dynamicLogEvidence2 - dynamicLogEvidence)
print('\nBayes factor: B =', B)

plt.figure(figsize=(8, 4))
plt.bar(S.rawTimestamps, S.rawData, align='center', facecolor='r', alpha=.5)
S.plot('accident_rate')
plt.xlim([1851, 1962])
plt.xlabel('year');


+ Transition model: Change-point. Hyper-Parameter(s): ['tChange']
+ Started new fit:
    + Formatted data.
    + Set prior (function): jeffreys. Values have been re-normalized.

    + Finished forward pass.
    + Log10-evidence: -74.41178

    + Finished backward pass.
    + Computed mean parameter values.

Bayes factor: B = 1.50929883739

The Bayes factor shows support in favor of the new change-point model: the data is about 50% more probable under the change-point model than under the Gaussian random walk model. Some may, however, argue that the value of $B = 1.5$ indicates only very weak support that is not worth more than a bare mention. Based on the data at hand, no clear decision between the two models can be made. While the change-point model is favored because it is more restricted (the number of possible data sets that can be described by this model is much smaller than for the Gaussian random walk) and therefore "simpler", it cannot capture fluctuations of the disaster rate before and after 1890 like the Gaussian random walk model does.

Combined transition models

The discussion above shows that, depending on the data set, different transition models better explain different aspects of the data. For some data sets, a satisfactory transition model may only be found by combining several simple transition models. bayesloop provides a transition model class called CombinedTransitionModel that can be supplied with a sequence of transition models, which are then applied one after another at every time step. Here, we combine the change-point model and the Gaussian random walk model to check whether the combined model yields a better explanation of the data than the change-point model alone:


In [4]:
# combined model (change-point model + Gaussian random walk)
T = bl.tm.CombinedTransitionModel(bl.tm.ChangePoint('tChange', 1890),
                                  bl.tm.GaussianRandomWalk('sigma', 0.2, target='accident_rate'))
S.set(T)
S.fit()

combinedLogEvidence = S.log10Evidence

# determine Bayes factor
B = 10**(combinedLogEvidence - dynamicLogEvidence2)
print('\nBayes factor: B =', B)

plt.figure(figsize=(8, 4))
plt.bar(S.rawTimestamps, S.rawData, align='center', facecolor='r', alpha=.5)
S.plot('accident_rate')
plt.xlim([1851, 1962])
plt.xlabel('year');


+ Transition model: Combined transition model. Hyper-Parameter(s): ['tChange', 'sigma']
+ Started new fit:
    + Formatted data.
    + Set prior (function): jeffreys. Values have been re-normalized.

    + Finished forward pass.
    + Log10-evidence: -74.01460

    + Finished backward pass.
    + Computed mean parameter values.

Bayes factor: B = 2.49563205383

Again, the refined model is favored by a Bayes factor of $B = 2.5$.

Serial transition model

The combined transition models introduced above substantially extend the range of parameter dynamics that can be modeled. These transition models still imply identical parameter dynamics for all time steps, however. In many applications there exist so-called structural breaks, i.e. points in time at which the parameter dynamics change fundamentally. In contrast to a change-point, which marks an abrupt change of the parameter values, a structural break marks an abrupt change of the transition model itself. The class SerialTransitionModel allows one to define a sequence of transition models (including combined transition models) together with a sequence of time steps that denote the structural breaks. Each break-point is defined just like any other transition model, with a unique name and a time stamp (or an array of possible time stamps, see this tutorial on change-point studies). Note, however, that the BreakPoint transition model class can only be used as a sub-model of a SerialTransitionModel instance.

We use this new class of transition model to explore the idea that the number of coal mining disasters did not decrease instantaneously, but rather decreased linearly over the course of a few years. We assume a static disaster rate until 1885, followed by a deterministic decrease of 0.2 disasters per year (deterministic transition models are easily defined by custom Python functions, see below) until 1895. Finally, the disaster rate after the year 1895 is modeled by a Gaussian random walk with a standard deviation of $\sigma=0.1$. Note that we pass the transition models and the corresponding structural breaks (time steps) to the SerialTransitionModel in alternating order. While this order may increase readability, one can also pass the models first, followed by the corresponding time steps.


In [5]:
# Definition of a linear-decrease transition model.
# The first argument of any deterministic model must be the time stamp;
# any hyper-parameters of the model are supplied as keyword arguments,
# with the hyper-parameter values given as default values.
def linear(t, slope=-0.2):
    return slope*t

T = bl.tm.SerialTransitionModel(bl.tm.Static(),
                                bl.tm.BreakPoint('t_1', 1885),
                                bl.tm.Deterministic(linear, target='accident_rate'),
                                bl.tm.BreakPoint('t_2', 1895),
                                bl.tm.GaussianRandomWalk('sigma', 0.1, target='accident_rate'))
S.set(T)
S.fit()

serialLogEvidence = S.log10Evidence

# determine Bayes factor
B = 10**(serialLogEvidence - combinedLogEvidence)
print('\nBayes factor: B =', B)

plt.figure(figsize=(8, 4))
plt.bar(S.rawTimestamps, S.rawData, align='center', facecolor='r', alpha=.5)
S.plot('accident_rate')
plt.xlim([1851, 1962])
plt.xlabel('year');


+ Transition model: Serial transition model. Hyper-Parameter(s): ['slope', 'sigma', 't_1', 't_2']
+ Started new fit:
    + Formatted data.
    + Set prior (function): jeffreys. Values have been re-normalized.

    + Finished forward pass.
    + Log10-evidence: -72.93384

    + Finished backward pass.
    + Computed mean parameter values.

Bayes factor: B = 12.0436384646

The Bayes factor of the serial model compared to the combined model is determined to be $B = 12.0$. This value indicates positive/strong evidence in favor of the serial model. Keep in mind, though, that the time steps of the structural breaks and the slope of the linear decrease are not inferred in this example, but are set by the user. The Bayes factor therefore relies on the assumption that these values are true; the uncertainty tied to them is not reflected in its value. More elaborate studies that take this hyper-parameter uncertainty into account are introduced here and here.
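
To give a flavor of such an analysis, the following minimal sketch uses a HyperStudy to treat the change-point year of the simple change-point model as an unknown quantity, so that the resulting model evidence is averaged over a grid of candidate years (the grid below is an illustrative choice, not part of the analysis above):

# minimal sketch: treat the change-point year as an unknown hyper-parameter
S = bl.HyperStudy()
S.loadExampleData()
S.set(bl.om.Poisson('accident_rate', bl.oint(0, 6, 1000)))

# grid of candidate change-point years (illustrative choice)
T = bl.tm.ChangePoint('tChange', np.arange(1875, 1905))
S.set(T)
S.fit()  # log10-evidence now accounts for the uncertainty in tChange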

The iterative approach of creating new hypotheses and comparing them to the best hypothesis currently available provides an objective, yet straightforward way of exploring new data sets. The subsequent tutorial introduces methods to infer hyper-parameter values that would otherwise have to be set "blindly" by the user.