While most of PyMC3's user-facing features are written in pure Python, it leverages Theano (Bergstra et al., 2010) to transparently transcode models to C and compile them to machine code, thereby boosting performance. Theano is a library that allows expressions to be defined using generalized vector data structures called tensors, which are tightly integrated with the popular NumPy ndarray data structure and, like NumPy arrays, support broadcasting and advanced indexing. Theano also automatically optimizes the likelihood's computational graph for speed and provides simple GPU integration.
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Its features include tight integration with NumPy, transparent use of a GPU, efficient symbolic differentiation, and dynamic C code generation.

Theano is part programming language, part compiler. It is often used to build machine learning models, just as packages like TensorFlow are, though it is not in itself a machine learning toolkit; think of it as a mathematical toolkit.
The easiest way to install Theano is to build it from source, using pip:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
However, if you have PyMC3 installed, then Theano will already be available.
In [ ]:
from theano import function, shared
from theano import tensor as T
import theano
x = T.dscalar('x')
y = T.dscalar('y')
In Theano, all symbols must be typed. In particular, T.dscalar is the type we assign to "0-dimensional arrays (scalar) of doubles (d)". It is a Theano type.
In [ ]:
type(x)
In [ ]:
x.type
In [ ]:
T.dscalar
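The type bundles together a dtype, a number of dimensions, and a broadcasting pattern; a quick way to see this is to inspect the corresponding attributes of the variable's type:
In [ ]:
print(x.type.dtype)          # 'float64'
print(x.type.ndim)           # 0
print(x.type.broadcastable)  # ()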
In [ ]:
z = x + y
z is yet another Variable which represents the addition of x and y. You can use the pp function to pretty-print the computation associated with z.
In [ ]:
from theano.printing import pp
print(pp(z))
In [ ]:
f = function([x, y], z)
The first argument to function() is a list of Variables that will be provided as inputs to the function. The second argument is a single Variable or a list of Variables. In either case, the second argument is what we want to see as output when we apply the function. f may then be used like a normal Python function.
Now we can call the function:
In [ ]:
f(2, 3)
In [ ]:
f(16.3, 12.1)
If you are following along and typing into an interpreter, you may have
noticed that there was a slight delay in executing the function
instruction. Behind the scenes, f was being compiled into C code.
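If compilation time becomes a nuisance during interactive work, Theano's compilation behavior can be adjusted through its configuration; for example, compiling with mode='FAST_COMPILE' trades runtime speed for shorter compile times. A small sketch (the name f_quick is just illustrative):
In [ ]:
# Inspect the current compilation mode and default float dtype
print(theano.config.mode, theano.config.floatX)
# Compile the same expression with less aggressive graph optimization
f_quick = function([x, y], z, mode='FAST_COMPILE')
f_quick(2, 3)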
Internally, Theano builds a graph structure composed of interconnected Variable nodes, op nodes and apply nodes. An apply node represents the application of an op to some variables. It is important to draw the distinction between the definition of a computation, which is represented by an op, and its application to some actual data, which is represented by the apply node.
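We can poke at this graph directly: a Variable's owner attribute is the apply node that produced it, which in turn records the op and the input variables.
In [ ]:
print(z.owner)          # the apply node that produced z
print(z.owner.op)       # the op (elementwise addition)
print(z.owner.inputs)   # the inputs, here [x, y]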
Here is the expression graph corresponding to the addition of x and y:
In [ ]:
from theano import printing
printing.pydotprint(f, 'images/f.png')
In [ ]:
from IPython.display import Image
Image('images/f.png', width='80%')
A Variable is the main data structure you work with when using Theano. By calling T.dscalar with a string argument, you create a Variable representing a floating-point scalar quantity with the given name. If you provide no argument, the symbol will be unnamed. Names are not required, but they can help debugging.
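For example (the names u and v below are arbitrary), a symbol created without an argument has no name, while one created with a string does:
In [ ]:
u = T.dscalar()        # unnamed symbol
v = T.dscalar('v')     # named symbol
print(u.name, v.name)  # None v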
In [ ]:
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
f = function([x, y], z)
dmatrix is the Type for matrices of doubles. Then we can use our new function on 2D arrays:
In [ ]:
f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
The following types are available:
bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4
wscalar, wvector, wmatrix, wrow, wcol, wtensor3, wtensor4
iscalar, ivector, imatrix, irow, icol, itensor3, itensor4
lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4
fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4
dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
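The single-letter prefix encodes the dtype (b is int8, w is int16, i is int32, l is int64, f is float32, d is float64, and c is complex64), while the suffix encodes the dimensionality. A small illustration (the variable names are arbitrary):
In [ ]:
iv = T.ivector('iv')     # vector of 32-bit integers
fm = T.fmatrix('fm')     # matrix of single-precision floats
dt3 = T.dtensor3('dt3')  # 3-dimensional array of doubles
print(iv.dtype, fm.dtype, dt3.dtype)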
An example of a slightly more interesting function is the logistic curve. Let's create a matrix, and apply the logistic transformation to it:
In [ ]:
x = T.dmatrix('x')
The logistic transformation:
In [ ]:
s = 1 / (1 + T.exp(-x))
In [ ]:
logistic = function([x], s)
logistic([[0, 1], [-1, -2]])
Theano supports functions with multiple outputs. For example, we can compute the elementwise difference, absolute difference, and squared difference between two matrices a and b at the same time.
In [ ]:
a, b = T.dmatrices('a', 'b')
# Operations
diff = a - b
abs_diff = abs(diff)
diff_squared = diff ** 2
When we use the function f, it returns the three computed results as a list.
In [ ]:
f = function([a, b], [diff, abs_diff, diff_squared])
f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
Let's say you want to define a function that adds two numbers, except that if you only provide one number, the other input is assumed to be one. In Python, the default value for parameters achieves this effect.
In Theano we make use of the `In` class, which allows you to specify properties of your function's parameters in greater detail. Here we give a default value of 1 for y by creating an In instance with its value field set to 1. Inputs with default values must follow inputs without default values (as with Python's functions). There can be multiple inputs with default values. These parameters can be set positionally or by name, as in standard Python.
In [ ]:
from theano import In
x, y, w = T.dscalars('x', 'y', 'w')
z = (x + y) * w
g = function([x, In(y, value=1), In(w, value=2, name='w_by_name')], z)
In [ ]:
print('g(33) = {}'.format(g(33)))
In [ ]:
print('g(33, 0, 1) = {}'.format(g(33, 0, 1)))
In [ ]:
print('g(33, w_by_name=1) = {}'.format(g(33, w_by_name=1)))
In [ ]:
print('g(33, w_by_name=1, y=0) = {}'.format(g(33, w_by_name=1, y=0)))
It is also possible to make a function with an internal state. For example, let’s say we want to make an accumulator: at the beginning, the state is initialized to zero. Then, on each function call, the state is incremented by the function’s argument.
First let’s define the accumulator function. It adds its argument to the internal state, and returns the old state value.
In [ ]:
state = shared(0)
inc = T.iscalar('inc')
accumulator = function([inc], state, updates=[(state, state+inc)])
This code introduces a couple of new concepts. The shared function constructs so-called shared variables:

state = shared(0)

These are hybrid symbolic and non-symbolic variables whose value may be shared between multiple functions. Shared variables can be used in symbolic expressions, but they also have an internal value that defines the value taken by this symbolic variable in all the functions that use it. It is called a shared variable because its value is shared between many functions. The value can be accessed and modified by the get_value and set_value methods.

The other new thing in this code is the updates parameter of function:

updates=[(state, state+inc)]

updates must be supplied with a list of pairs of the form (shared-variable, new expression). It can also be a dictionary whose keys are shared variables and values are the new expressions. Here, the accumulator replaces state's value with the sum of state and the increment amount inc.
In [ ]:
print(state.get_value())
In [ ]:
print(accumulator(1))
In [ ]:
print(state.get_value())
In [ ]:
print(accumulator(300))
In [ ]:
print(state.get_value())
It is possible to reset the state. Just use the set_value method:
In [ ]:
state.set_value(-1)
In [ ]:
print(accumulator(3))
In [ ]:
print(state.get_value())
As we mentioned above, you can define more than one function to use the same shared variable. These functions can all update the value.
In [ ]:
decrementor = function([inc], state, updates=[(state, state-inc)])
In [ ]:
print(decrementor(2))
In [ ]:
print(state.get_value())
You might be wondering why the updates mechanism exists. You can always achieve a similar result by returning the new expressions, and working with them in NumPy as usual.
While the updates mechanism can be a syntactic convenience, it is mainly there for efficiency. Updates to shared variables can sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix updates).
Also, Theano has more control over where and how shared variables are allocated, which is one of the important elements of getting good performance on the GPU.
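As a rough sketch of the difference (using a second shared variable, state2, so as not to disturb state above; the names state2, add and acc2 are just illustrative), compare pushing a computed value back from Python with letting Theano apply the update itself:
In [ ]:
# Without updates: the function just returns the new value; we must
# write it back to the shared variable ourselves from Python.
state2 = shared(0)
add = function([inc], state2 + inc)
state2.set_value(add(5))
print(state2.get_value())   # 5

# With updates: Theano modifies the shared variable as part of the call.
acc2 = function([inc], state2, updates=[(state2, state2 + inc)])
acc2(5)
print(state2.get_value())   # 10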
As an exercise, fill in the functions below so that the assertion at the end of the cell passes.
In [ ]:
def make_vector():
    """
    Create and return a new Theano vector.
    """
    pass

def make_matrix():
    """
    Create and return a new Theano matrix.
    """
    pass

def elemwise_mul(a, b):
    """
    a: A theano matrix
    b: A theano matrix

    Calculate the elementwise product of a and b and return it
    """
    pass

def matrix_vector_mul(a, b):
    """
    a: A theano matrix
    b: A theano vector

    Calculate the matrix-vector product of a and b and return it
    """
    pass

a = make_vector()
b = make_vector()
c = elemwise_mul(a, b)
d = make_matrix()
e = matrix_vector_mul(d, c)

f = function([a, b, d], e)

import numpy as np
rng = np.random.RandomState([1, 2, 3])
a_value = rng.randn(5).astype(a.dtype)
b_value = rng.rand(5).astype(b.dtype)
c_value = a_value * b_value
d_value = rng.randn(5, 5).astype(d.dtype)
expected = np.dot(d_value, c_value)

actual = f(a_value, b_value, d_value)

assert np.allclose(actual, expected)
print("SUCCESS!")
In [ ]:
import numpy as np
rng = np.random
dose = np.array([-0.86, -0.3 , -0.05, 0.73])
deaths = np.array([0, 1, 3, 5])
training_steps = 1000
These data record, for each of four dose levels, the number of deaths among five test subjects; we will fit a simple logistic regression of the death rate on dose. We first declare Theano symbolic variables:
In [ ]:
x = T.vector("x")
y = T.vector("y")
w = theano.shared(1., name="w")
b = theano.shared(0., name="b")
print("Initial model:", w.get_value(), b.get_value())
... then construct the expression graph:
In [ ]:
# Probability that target = 1
p_1 = 1 / (1 + T.exp(-(x*w + b)))
# The prediction, thresholded at 0.5
prediction = p_1 > 0.5
# Cross-entropy loss (negative log-likelihood for deaths out of 5 subjects)
xent = -y * T.log(p_1) - (5-y) * T.log(1-p_1)
# The cost to minimize
cost = xent.mean()
# Compute the gradient of the cost
gw, gb = T.grad(cost, [w, b])
Compile Theano functions:
In [ ]:
step = theano.shared(10., name='step')
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=((w, w - step * gw), (b, b - step * gb), (step, step * 0.99)))
predict = theano.function(inputs=[x], outputs=prediction)
Train model:
In [ ]:
for i in range(training_steps):
    pred, err = train(dose, deaths)
print("Final model:", w.get_value(), b.get_value())
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
logit = lambda x: 1. / (1 + np.exp(-x))  # actually the inverse-logit (logistic) function
xvals = np.linspace(-1, 1)
plt.plot(xvals, logit(7.8*xvals + .85))
plt.plot(dose, deaths/5., 'ro')
Now try Theano's symbolic differentiation yourself: complete grad_sum below so that the assertion passes.
In [ ]:
def grad_sum(x, y, z):
    """
    x: A theano variable
    y: A theano variable
    z: A theano expression involving x and y

    Returns dz / dx + dz / dy
    """
    pass
x = T.scalar()
y = T.scalar()
z = x + y
s = grad_sum(x, y, z)
assert s.eval({x: 0, y: 0}) == 2
print("SUCCESS!")
Because in Theano you first express everything symbolically and afterwards compile this expression to get functions, using pseudo-random numbers is not as straightforward as it is in NumPy.
The way to think about putting randomness into Theano's computations is to put random variables in your graph. Theano will allocate a NumPy RandomStream object (a random number generator) for each such variable, and draw from it as necessary. We will call this sort of sequence of random numbers a random stream.
In [ ]:
from theano.tensor.shared_randomstreams import RandomStreams
srng = RandomStreams(seed=234)
rv_u = srng.uniform((2,2))
f = function([], rv_u)
In [ ]:
f()
In [ ]:
f()
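A random variable is drawn (and its internal state advanced) at most once per function call. If you want the state left untouched, so that repeated calls return the same numbers, you can pass no_default_updates=True when compiling; re-seeding the stream object resets all of the variables it created (the name g below is just illustrative):
In [ ]:
g = function([], rv_u, no_default_updates=True)
print(g())   # the same draw comes back...
print(g())   # ...every call, since the random state is not advanced

# Reset the generator behind all variables created from this stream
srng.seed(234)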
The scan function provides the ability to write loops in Theano. We are not able to use Python for loops with Theano because Theano needs to be able to build and optimize the expression graph before compiling it into faster code, and to use symbolic differentiation for calculating gradients.
Assume that, given $k$, you want to get $A^k$ using a loop. More precisely, if $A$ is a tensor you want to compute $A^k$ elementwise. The Python code might look like:

result = 1
for i in range(k):
    result = result * A
There are three things here that we need to handle: the initial value assigned to result, the accumulation of results in result, and the unchanging variable A. Unchanging variables are passed to scan as non_sequences. Initialization occurs in outputs_info, and the accumulation happens automatically.
The equivalent Theano code would be:
In [ ]:
k = T.iscalar("k")
A = T.vector("A")
# Symbolic description of the result
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A,
                              n_steps=k)
# We only care about A**k, but scan has provided us with A**1 through A**k.
# Discard the values that we don't care about. Scan is smart enough to
# notice this and not waste memory saving them.
final_result = result[-1]
# compiled function that returns A**k
power = theano.function(inputs=[A,k], outputs=final_result, updates=updates)
print(power(range(10),2))
print(power(range(10),4))
Let us go through the example line by line. First we constructed a function (using a lambda expression) that, given prior_result and A, returns prior_result * A. The order of parameters is fixed by scan: the output of the prior call to fn is the first parameter, followed by all non-sequences.

Next we initialize the output as a tensor with the same shape and dtype as A, filled with ones. We give A to scan as a non-sequence parameter and specify the number of steps k to iterate over our lambda expression.
Scan returns a tuple containing our result (result) and a dictionary of updates (empty in this case). Note that the result is not a single vector, but a matrix containing the value of $A^i$ for each of the $k$ steps. We want the last value (after k steps), so we compile a function to return just that.
Note that there is an optimization that, at compile time, will detect that you are using just the last value of the result, and will ensure that scan does not store all the intermediate values. So do not worry if A and k are large.
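To see what scan hands back before we slice off the last row, we can compile a function that returns the full result (power_all is an illustrative name); each row holds one step of the accumulation:
In [ ]:
power_all = theano.function(inputs=[A, k], outputs=result, updates=updates)
print(power_all(range(5), 3))        # rows are A**1, A**2, A**3
print(power_all(range(5), 3).shape)  # (3, 5)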
In addition to looping a fixed number of times, scan can iterate over the leading dimension of tensors (similar to Python's for x in a_list). The tensor(s) to be looped over should be provided to scan using the sequences keyword argument.
Here’s an example that builds a symbolic calculation of a polynomial from a list of its coefficients:
In [ ]:
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x")
# Generate the components of the polynomial
components, updates = theano.scan(fn=lambda coefficient, power, val: coefficient * (val ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, theano.tensor.arange(1000)],
                                  non_sequences=x)
# Sum them up
polynomial = components.sum()
# Compile a function
calculate_polynomial = theano.function(inputs=[coefficients, x], outputs=polynomial)
# Test
test_coefficients = np.asarray([1, 0, 2], dtype=np.float32)
test_value = 3
print(calculate_polynomial(test_coefficients, test_value))
PyMC3 has the standard sampling algorithms like adaptive Metropolis-Hastings and adaptive slice sampling, but PyMC3's most capable step method is the No-U-Turn Sampler. NUTS is especially useful on models that have many continuous parameters, a situation where other MCMC algorithms work very slowly. It takes advantage of information about where regions of higher probability are, based on the gradient of the log posterior-density. This helps it achieve dramatically faster convergence on large problems than traditional sampling methods achieve.
PyMC3 relies on Theano to analytically compute model gradients via automatic differentiation of the posterior density. NUTS also has several self-tuning strategies for adaptively setting the tunable parameters of Hamiltonian Monte Carlo. For random variables that are undifferentiable (namely, discrete variables) NUTS cannot be used, but it may still be used on the differentiable variables in a model that contains undifferentiable variables.
As an informal comparison, we will demonstrate samples generated from a simple statistical model using both the Metropolis and NUTS samplers in PyMC3. The set of examples includes a univariate linear model that is fit to simulated data via the glm module.

with Model() as model:
    glm.glm('y ~ x', data)
The model contains three parameters (intercept, slope and sampling standard deviation), each of which is continuous, so the model can be fit by either algorithm. We will run a short chain for each, and compare the output graphically:
In [ ]:
from pymc3.examples import glm_linear
from pymc3 import sample, Metropolis, NUTS, traceplot
In [ ]:
with glm_linear.model:
    trace_metropolis = sample(1000, step=Metropolis())
In [ ]:
with glm_linear.model:
    trace_nuts = sample(1000, step=NUTS())
In [ ]:
traceplot(trace_metropolis);
In [ ]:
traceplot(trace_nuts);
The samples from Metropolis show very poor mixing during the first 1000 iterations, and have clearly not converged. The NUTS samples are more homogeneous, with better mixing and less autocorrelation.
Gradient-based MCMC samplers like HMC and NUTS do not support the sampling of discrete parameters because discrete parameters, being non-continuous, do not support the calculation of gradients. In PyMC3, users are left with two options when dealing with models that include discrete random variables:
1. Sample the discrete variables with a non-gradient step method (e.g. Metropolis), while NUTS samples the continuous variables.
2. Marginalize the discrete variables out of the model, leaving only continuous variables.
The advantage of the first option is that it is easy; in fact, PyMC3 will do this for you automatically unless you specify otherwise. The disadvantage is that, at best, you sacrifice the efficiency of gradient-based sampling, since you would still need to run the model an order of magnitude longer (or more) to allow the discrete variables to converge. Also, the effect on the gradient-based sampler of having variables that are not part of the gradient calculation change at each iteration is not well understood.
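As a minimal sketch of this first option (the toy model and all names below are hypothetical; they only illustrate the step assignment), the continuous variable is given to NUTS while Metropolis handles the discrete one, which is essentially what PyMC3 would assign automatically:
In [ ]:
from pymc3 import Model, Normal, DiscreteUniform, Metropolis, NUTS, sample
import numpy as np

# Hypothetical data and model, purely for illustration
fake_data = np.random.normal(1., 1., size=20)

with Model() as mixed_model:
    mu = Normal('mu', mu=0., sd=10.)              # continuous
    k = DiscreteUniform('k', lower=0, upper=10)   # discrete
    y = Normal('y', mu=mu + k, sd=1., observed=fake_data)
    # NUTS for the differentiable variable, Metropolis for the discrete one
    trace_mixed = sample(1000, step=[NUTS(vars=[mu]), Metropolis(vars=[k])])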
The advantage of marginalization is that your entire model can then be fit using gradient-based methods, which maximizes sampling efficiency (in terms of effective sample size per iteration). The disadvantage is that it requires some math in order to correctly marginalize the discrete variables.
Marginalization is feasible if the discrete variables are bounded. We can try re-formulating the coal mining disasters example from earlier in the course. Recall that the model was described as the following:
$$\begin{array}{ccc} (y_t | \tau, \lambda_1, \lambda_2) \sim\text{Poisson}\left(r_t\right), & r_t=\left\{ \begin{array}{lll} \lambda_1 &\text{if}& t< \tau\\ \lambda_2 &\text{if}& t\ge \tau \end{array}\right.,&t\in[t_l,t_h]\\ \tau \sim \text{DiscreteUniform}(t_l, t_h)\\ \lambda_1\sim \text{Exponential}(a)\\ \lambda_2\sim \text{Exponential}(b) \end{array}$$

We wish to eliminate the discrete changepoint $\tau$ by marginalization. That is, we would like to express the factorization of the joint probability as follows:
$$\pi(y, \lambda_1, \lambda_2) = L(y | \lambda_1, \lambda_2) \pi(\lambda_1, \lambda_2)$$

The marginalization proceeds by summing over the values of the discrete parameter (the years in the time series):

$$\begin{aligned} L(y | \lambda_1, \lambda_2) &= \sum_{t=0}^{110} L(y | \lambda_1, \lambda_2, \tau=t)\, p(\tau=t) \\ &= \sum_{t=0}^{110} U(t | 0, 110) \prod_{s=0}^{t-1} \text{Poisson}(y_s | \lambda_1) \prod_{s=t}^{110} \text{Poisson}(y_s | \lambda_2) \end{aligned}$$
In [ ]:
from pymc3 import Exponential, Poisson, switch, log, sum as tsum, exp, Model, Potential
disasters_data = np.array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3, 0, 0,
1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
years = len(disasters_data)
def log_sum_exp(x):
    return log(tsum(exp(x)))

with Model() as marginal_model:

    # Priors for pre- and post-switch mean number of disasters
    early_mean = Exponential('early_mean', lam=1.)
    late_mean = Exponential('late_mean', lam=1.)

    # Generate all possible combinations of switchpoints and years
    year_idx = np.arange(years)
    s, t = np.array([(t, s) for t in year_idx for s in year_idx]).T

    # Marginalize the switchpoint
    log_like = log_sum_exp(-log(years) + Poisson.dist(switch(t < s, early_mean, late_mean))
                           .logp(disasters_data[t]).reshape((years, years)).sum(1))

    # Data likelihood
    disasters = Potential('disasters', log_like)
In [ ]:
with marginal_model:
    trace_marginal = sample(2000)
In [ ]:
from pymc3 import plot_posterior
plot_posterior(trace_marginal[1000:])
In [ ]:
from pymc3 import effective_n
effective_n(trace_marginal[1000:])
Let's compare this with the original formulation:
In [ ]:
from pymc3 import DiscreteUniform
with Model() as original_model:

    # Prior for distribution of switchpoint location
    switchpoint = DiscreteUniform('switchpoint', lower=0, upper=years)

    # Priors for pre- and post-switch mean number of disasters
    early_mean = Exponential('early_mean', lam=1.)
    late_mean = Exponential('late_mean', lam=1.)

    # Allocate appropriate Poisson rates to years before and after current
    # switchpoint location
    idx = np.arange(years)
    rate = switch(switchpoint >= idx, early_mean, late_mean)

    # Data likelihood
    disasters = Poisson('disasters', rate, observed=disasters_data)
In [ ]:
with original_model:
    trace_original = sample(2000, njobs=2)
In [ ]:
effective_n(trace_original[1000:])