In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" }
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
In this notebook, we'll explore TensorFlow Distributions (TFD for short). The goal of this notebook is to get you gently up the learning curve, including understanding TFD's handling of tensor shapes. This notebook tries to present concrete examples before abstract concepts. We'll present canonical easy ways to do things first, and save the most general abstract view until the end. If you're the type who prefers a more abstract and reference-style tutorial, check out Understanding TensorFlow Distributions Shapes. If you have any questions about the material here, don't hesitate to contact (or join) the TensorFlow Probability mailing list. We're happy to help.
Before we start, we need to import the appropriate libraries. Our overall library is tensorflow_probability. By convention, we generally refer to the distributions library as tfd.
TensorFlow Eager is an imperative execution environment for TensorFlow. In TensorFlow Eager, every TF operation is immediately evaluated and produces a result. This is in contrast to TensorFlow's standard "graph" mode, in which TF operations add nodes to a graph which is later executed. This entire notebook is written using TF Eager, although none of the concepts presented here rely on that, and TFP can be used in graph mode.
In [0]:
import collections
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
try:
  tf.compat.v1.enable_eager_execution()
except ValueError:
  pass
import matplotlib.pyplot as plt
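As a quick illustration of the eager behavior described above (a minimal sketch; any TF op would do), an operation returns a concrete value immediately rather than a node in a graph:
In [0]:
# In eager mode, this evaluates immediately and the tensor holds the value 3.
x = tf.add(1, 2)
print(x)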
Let's dive right in and create a normal distribution:
In [3]:
n = tfd.Normal(loc=0., scale=1.)
n
Out[3]:
We can draw a sample from it:
In [4]:
n.sample()
Out[4]:
We can draw multiple samples:
In [5]:
n.sample(3)
Out[5]:
We can evaluate a log prob:
In [6]:
n.log_prob(0.)
Out[6]:
We can evaluate multiple log probabilities:
In [7]:
n.log_prob([0., 2., 4.])
Out[7]:
We have a wide range of distributions. Let's try a Bernoulli:
In [8]:
b = tfd.Bernoulli(probs=0.7)
b
Out[8]:
In [9]:
b.sample()
Out[9]:
In [10]:
b.sample(8)
Out[10]:
In [11]:
b.log_prob(1)
Out[11]:
In [12]:
b.log_prob([1, 0, 1, 0])
Out[12]:
We'll create a multivariate normal with a diagonal covariance:
In [13]:
nd = tfd.MultivariateNormalDiag(loc=[0., 10.], scale_diag=[1., 4.])
nd
Out[13]:
Comparing this to the univariate normal we created earlier, what's different?
In [14]:
tfd.Normal(loc=0., scale=1.)
Out[14]:
We see that the univariate normal has an event_shape of (), indicating it's a scalar distribution. The multivariate normal has an event_shape of 2, indicating the basic event space of this distribution is two-dimensional.
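If you want to check this programmatically, here's a small sketch using the distributions we just built; event_shape and batch_shape are available as properties on any distribution:
In [0]:
# The univariate normal has a scalar event space...
print(n.event_shape)    # ()
# ...while the multivariate normal has a two-dimensional event space.
print(nd.event_shape)   # (2,)
# Neither distribution is batched here.
print(n.batch_shape, nd.batch_shape)  # () ()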
Sampling works just as before:
In [15]:
nd.sample()
Out[15]:
In [16]:
nd.sample(5)
Out[16]:
In [17]:
nd.log_prob([0., 10])
Out[17]:
Multivariate normals do not in general have diagonal covariance. TFD offers multiple ways to create multivariate normals, including a full-covariance specification, which we use here.
In [18]:
nd = tfd.MultivariateNormalFullCovariance(
    loc = [0., 5], covariance_matrix = [[1., .7], [.7, 1.]])
data = nd.sample(200)
plt.scatter(data[:, 0], data[:, 1], color='blue', alpha=0.4)
plt.axis([-5, 5, 0, 10])
plt.title("Data set")
plt.show()
Our first Bernoulli distribution represented a flip of a single fair coin. We can also create a batch of independent Bernoulli distributions, each with their own parameters, in a single Distribution object:
In [19]:
b3 = tfd.Bernoulli(probs=[.3, .5, .7])
b3
Out[19]:
It's important to be clear on what this means. The above call defines three independent Bernoulli distributions, which happen to be contained in the same Python Distribution object. The three distributions cannot be manipulated individually. Note how the batch_shape is (3,), indicating a batch of three distributions, and the event_shape is (), indicating the individual distributions have a univariate event space.
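Again, we can confirm this directly (a quick sketch):
In [0]:
# A batch of three scalar Bernoullis: batched, but with scalar events.
print(b3.batch_shape)  # (3,)
print(b3.event_shape)  # ()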
If we call sample, we get a sample from all three:
In [20]:
b3.sample()
Out[20]:
In [21]:
b3.sample(6)
Out[21]:
If we call prob (this has the same shape semantics as log_prob; we use prob with these small Bernoulli examples for clarity, although log_prob is usually preferred in applications), we can pass it a vector and evaluate the probability of each coin yielding that value:
In [22]:
b3.prob([1, 1, 0])
Out[22]:
Why does the API include batch shape? Semantically, one could perform the same computations by creating a list of distributions and iterating over them with a for loop (at least in Eager mode; in TF graph mode you'd need a tf.while_loop). However, having a (potentially large) set of identically parameterized distributions is extremely common, and the use of vectorized computations whenever possible is a key ingredient in being able to perform fast computations using hardware accelerators.
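For intuition, here's a hedged sketch (the loc values are just illustrative) contrasting the for-loop approach with the batched approach; both compute the same log-probabilities, but the batched version does it in a single vectorized op:
In [0]:
# Loop version: three separate scalar Normal objects, three separate log_prob calls.
locs = [0., 1., 2.]
loop_lps = [tfd.Normal(loc=loc, scale=1.).log_prob(0.) for loc in locs]

# Batched version: one Normal with batch_shape (3,), one vectorized log_prob call.
batched = tfd.Normal(loc=locs, scale=1.)
batched_lps = batched.log_prob(0.)  # shape (3,), same values as loop_lps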
In the previous section, we created b3, a single Distribution object that represented three coin flips. If we called b3.prob on a vector $v$, the $i$th entry was the probability that the $i$th coin takes value $v[i]$.
Suppose we'd instead like to specify a "joint" distribution over independent random variables from the same underlying family. This is a different object mathematically, in that for this new distribution, prob on a vector $v$ will return a single value representing the probability that the entire set of coins matches the vector $v$.
How do we accomplish this? We use a "higher-order" distribution called Independent, which takes a distribution and yields a new distribution with the batch shape moved to the event shape:
In [23]:
b3_joint = tfd.Independent(b3, reinterpreted_batch_ndims=1)
b3_joint
Out[23]:
Compare the shape to that of the original b3:
In [24]:
b3
Out[24]:
As promised, we see that Independent has moved the batch shape into the event shape: b3_joint is a single distribution (batch_shape = ()) over a three-dimensional event space (event_shape = (3,)).
Let's check the semantics:
In [25]:
b3_joint.prob([1, 1, 0])
Out[25]:
An alternate way to get the same result would be to compute probabilities using b3 and do the reduction manually by multiplying (or, in the more usual case where log probabilities are used, summing):
In [26]:
tf.reduce_prod(b3.prob([1, 1, 0]))
Out[26]:
Independent allows the user to more explicitly represent the desired concept. We view this as extremely useful, although it's not strictly necessary.
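As an aside, here's a hedged sketch (purely illustrative, not how TFP actually implements anything) showing that wrapping a batch of scalar Normals in Independent reproduces the log-probabilities of MultivariateNormalDiag with the same parameters:
In [0]:
# A "homemade" diagonal multivariate normal: a batch of two scalar Normals,
# with the batch dimension reinterpreted as the event dimension.
homemade = tfd.Independent(
    tfd.Normal(loc=[0., 10.], scale=[1., 4.]),
    reinterpreted_batch_ndims=1)

# The built-in equivalent.
builtin = tfd.MultivariateNormalDiag(loc=[0., 10.], scale_diag=[1., 4.])

# Both assign the same log-probability to the same two-dimensional event.
print(homemade.log_prob([0., 10.]))
print(builtin.log_prob([0., 10.]))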
Fun facts:
- b3.sample and b3_joint.sample have different conceptual implementations, but indistinguishable outputs: the difference between a batch of independent distributions and a single distribution created from the batch using Independent shows up when computing probabilities, not when sampling.
- MultivariateNormalDiag could be trivially implemented using the scalar Normal and Independent distributions (it isn't actually implemented this way, but it could be), as the sketch above illustrates.

Let's create a batch of three full-covariance two-dimensional multivariate normals:
In [27]:
nd_batch = tfd.MultivariateNormalFullCovariance(
    loc = [[0., 0.], [1., 1.], [2., 2.]],
    covariance_matrix = [[[1., .1], [.1, 1.]],
                         [[1., .3], [.3, 1.]],
                         [[1., .5], [.5, 1.]]])
nd_batch
Out[27]:
We see batch_shape = (3,), so there are three independent multivariate normals, and event_shape = (2,), so each multivariate normal is two-dimensional. In this example, the individual distributions do not have independent elements.
Sampling works:
In [28]:
nd_batch.sample(4)
Out[28]:
Since batch_shape = (3,) and event_shape = (2,), we pass a tensor of shape (3, 2) to log_prob:
In [29]:
nd_batch.log_prob([[0., 0.], [1., 1.], [2., 2.]])
Out[29]:
Abstracting out what we've done so far, every distribution has a batch shape B and an event shape E. Let BE be the concatenation of these shapes:
- For the univariate distributions n and b, BE = ().
- For the two-dimensional multivariate normal nd, BE = (2).
- For both b3 and b3_joint, BE = (3).
- For the batch of multivariate normals nd_batch, BE = (3, 2).
The "evaluation rules" we've been using so far are:
BE
; sampling with a scalar n returns an "n by BE
" tensor.prob
and log_prob
take a tensor of shape BE
and return a result of shape B
.The actual "evaluation rule" for prob
and log_prob
is more complicated, in a way that offers potential power and speed but also complexity and challenges. The actual rule is (essentially) that the argument to log_prob
must be broadcastable against BE
; any "extra" dimensions are preserved in the output.
Let's explore the implications. For the univariate normal n, BE = (), so log_prob expects a scalar. If we pass log_prob a tensor with non-empty shape, those show up as batch dimensions in the output:
In [30]:
n = tfd.Normal(loc=0., scale=1.)
n
Out[30]:
In [31]:
n.log_prob(0.)
Out[31]:
In [32]:
n.log_prob([0.])
Out[32]:
In [33]:
n.log_prob([[0., 1.], [-1., 2.]])
Out[33]:
Let's turn to the two-dimensional multivariate normal nd (parameters changed for illustrative purposes):
In [34]:
nd = tfd.MultivariateNormalDiag(loc=[0., 1.], scale_diag=[1., 1.])
nd
Out[34]:
log_prob "expects" an argument with shape (2,), but it will accept any argument that broadcasts against this shape:
In [35]:
nd.log_prob([0., 0.])
Out[35]:
But we can pass in "more" examples, and evaluate all their log_prob's at once:
In [36]:
nd.log_prob([[0., 0.],
             [1., 1.],
             [2., 2.]])
Out[36]:
Perhaps less appealingly, we can broadcast over the event dimensions:
In [37]:
nd.log_prob([0.])
Out[37]:
In [38]:
nd.log_prob([[0.], [1.], [2.]])
Out[38]:
Broadcasting this way is a consequence of our "enable broadcasting whenever possible" design; this usage is somewhat controversial and could potentially be removed in a future version of TFP.
Now let's look at the three coins example again:
In [0]:
b3 = tfd.Bernoulli(probs=[.3, .5, .7])
Here, using broadcasting to represent the probability that each coin comes up heads is quite intuitive:
In [40]:
b3.prob([1])
Out[40]:
(Compare this to b3.prob([1., 1., 1.]), which we would have used back where b3 was introduced.)
Now suppose we want to know, for each coin, the probability the coin comes up heads and the probability it comes up tails. We could imagine trying:
b3.log_prob([0, 1])
Unfortunately, this produces an error with a long and not-very-readable stack trace. b3 has BE = (3), so we must pass b3.prob something broadcastable against (3,). [0, 1] has shape (2), so it doesn't broadcast and creates an error. Instead, we have to say:
In [41]:
b3.prob([[0], [1]])
Out[41]:
Why? [[0], [1]] has shape (2, 1), so it broadcasts against shape (3) to make a broadcast shape of (2, 3).
Broadcasting is quite powerful: there are cases where it allows order-of-magnitude reduction in the amount of memory used, and it often makes user code shorter. However, it can be challenging to program with. If you call log_prob and get an error, a failure to broadcast is nearly always the problem.
In this tutorial, we've (hopefully) provided a simple introduction. A few pointers for going further:
- event_shape, batch_shape and sample_shape can be arbitrary rank (in this tutorial they are always either scalar or rank 1). This increases the power but again can lead to programming challenges, especially when broadcasting is involved. For an additional deep dive into shape manipulation, see the Understanding TensorFlow Distributions Shapes tutorial.
- TFP includes Bijectors, which, in conjunction with TransformedDistribution, yield a flexible, compositional way to easily create new distributions that are invertible transformations of existing distributions. We'll try to write a tutorial on this soon, but in the meantime, check out the documentation; a brief sketch follows below.