author: Jacob Schreiber
contact: jmschreiber91@gmail.com
This tutorial will cover the various distributions that are implemented in pomegranate. These distributions can be used by themselves, or more typically as a part of a larger compositional model like a hidden Markov model.
The distributions currently available fall into three categories:
Univariate Distributions
Univariate Kernel Densities
Multivariate Distributions
The following methods are supported by all distributions:
1. d.probability(X): Return the probability of a sample, or a vector of samples, under the distribution
2. d.log_probability(X): Return the log probability of a sample, or a vector of samples, under the distribution
3. d.fit(X, weights=None, inertia=0.0): Use MLE to update the parameters of the distribution. An optional weight vector is accepted, as is an inertia that blends the current parameters with the new estimate (new = old * inertia + MLE * (1 - inertia))
4. d.summarize(X, weights=None): Extract and update the stored sufficient statistics from a data set, with optional weights, without updating the parameters yet
5. d.from_summaries(inertia=0.0): Use the stored sufficient statistics to do a parameter update
6. d.sample(n=1): Generate a sample, or a vector of samples, from the distribution
7. d.to_json(): Serialize the distribution to a JSON string
8. Distribution.from_json(s): Load a stored distribution, either from a file if passed a filename ending in '.json', or from the string itself
9. d.copy(): Make a deep copy of the distribution
10. d.freeze(): Freeze the distribution, so that 'fit' and 'from_summaries' no longer change the model parameters (see the sketch after this list)
11. d.thaw(): Unfreeze the distribution so that it can be updated again
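Most of these methods are demonstrated in the cells below, but copy, freeze, and thaw are not, so here is a minimal sketch of how they behave (the variable names are arbitrary):

from pomegranate import NormalDistribution

d = NormalDistribution( 5, 2 )
e = d.copy()           # deep copy -- updating d later will not affect e
d.freeze()
d.fit( [1, 2, 3] )     # ignored: frozen distributions don't update
print d.parameters     # still [5, 2]
d.thaw()
d.fit( [1, 2, 3] )     # now the update goes through
print d.parameters
print e.parameters     # the copy is untouched, still [5, 2]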
Univariate distributions also have a plot( n=1000, **kwargs ) method, which plots a histogram of n randomly generated samples from the distribution and passes any matplotlib keyword arguments through to the plt.hist function.
Let's look at a few examples.
In [1]:
from pomegranate import *
%pylab inline
Let's start off by looking at a normal distribution. It is easy to specify and to take a look at.
In [2]:
d = NormalDistribution( 5, 2 )
d.plot( n=100000, edgecolor='c', color='c', bins=50 )
We can then calculate the log probability of some points under this distribution.
In [3]:
print d.log_probability( 1 )
print d.log_probability( 5 )
print d.log_probability( 9 )
We can also fit this distribution to some data we have.
In [4]:
data = np.random.randn(10000) * 0.4 + 9
In [5]:
d.fit( data )
d.plot( n=100000, edgecolor='c', color='c', bins=50 )
print d.parameters
We generated data from a normal distribution with mean 9 and standard deviation 0.4, and we can see that the distribution object was able to recover it. This is not surprising given the amount of data it received.
The fitting process discards the previous parameters and uses a pure MLE estimate on the new data; in this case, it simply takes the mean and standard deviation of the new data. It does not matter whether the previous distribution was NormalDistribution(1, 5) or NormalDistribution(10000000, 10000000); both would recover the same parameters afterwards.
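As a quick check of that claim, reusing the data array generated above (d1 and d2 are arbitrary names):

d1 = NormalDistribution( 1, 5 )
d2 = NormalDistribution( 10000000, 10000000 )
d1.fit( data )
d2.fit( data )
print d1.parameters    # both end up near [9, 0.4]
print d2.parameters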
One way of taking the current parameters into account is to use inertia, where the update is equal to previous parameters * inertia + new parameters * ( 1 - inertia ). This allows the update to retain some influence from the prior parameters. Let's see an example.
In [6]:
d = NormalDistribution( 10, 1 )
e = NormalDistribution( 200, 1 )
d.fit( data, inertia=0.0 ) # The default setting
e.fit( data, inertia=0.5 )
print "d -- mu: {} sigma {}".format( *d.parameters )
print "e -- mu: {} sigma {}".format( *e.parameters )
Out-of-core updates are also supported for distributions. This means that we can update a distribution using an amount of data that cannot normally fit into memory, while still producing an exact update. This is done by summarizing each chunk of the dataset into sufficient statistics, and then producing a single update at the end.
In [7]:
d = NormalDistribution( 105, 12 )
for i in range(10):
    d.summarize( data[i::10] )
d.from_summaries()
print "d -- mu: {} sigma {}".format( *d.parameters )
In this particular case the data can all fit in memory, but it doesn't have to. You just call summarize on each chunk, load up the next chunk of data, and continue, as in the sketch below.
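A minimal out-of-core sketch; the batch_*.npy filenames are hypothetical stand-ins for whatever source the chunks come from:

d = NormalDistribution( 105, 12 )
for i in range(10):
    chunk = numpy.load( 'batch_{}.npy'.format(i) )   # load one chunk at a time
    d.summarize( chunk )                             # accumulate sufficient statistics
d.from_summaries()                                   # one exact update at the end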
If we want to weight points, we can easily pass in a weight vector. Weights must be non-negative, and the sum of the weight vector must be greater than 0. Numerical stability issues can arise if all of the weights are very small, so try to avoid that if possible.
In [8]:
d = NormalDistribution( 105, 12 )
weights = np.random.exponential( 100, 10000 )
d.fit( data, weights )
print "d -- mu: {} sigma {}".format( *d.parameters )
It doesn't change much in this case since the data was drawn from a perfect normal distribution, but it illustrates the possibility. The analog for summarization is the following:
In [9]:
d = NormalDistribution( 105, 12 )
for i in range(10):
    d.summarize( data[i::10], weights=weights[i::10] )
d.from_summaries()
print "d -- mu: {} sigma {}".format( *d.parameters )
We can also easily sample from distributions by calling the sample method. plot does this internally when we call it, but to be explicit let's reproduce that.
In [10]:
d = NormalDistribution( 105, 12 )
samples = [ d.sample() for i in range(10000) ]
plt.hist( samples, edgecolor='c', color='c', bins=25 )
A notable departure from the numeric distributions is the DiscreteDistribution. Instead of passing parameters as floats, you pass in a dictionary whose keys can be any (hashable) object and whose values are the probabilities of those keys occurring. If you try to calculate the log probability of an item not present in the distribution, the default behavior is to return negative infinity.
In [11]:
d = DiscreteDistribution({'A': 0.1, 'C': 0.25, 'G': 0.50, 'T': 0.15})
print d.log_probability( 'A' )
print d.log_probability( 'G' )
print d.log_probability( '????' )
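The keys are not limited to characters; here is a small sketch with integer keys, assuming they behave like any other hashable key (the loaded-die probabilities are made up):

die = DiscreteDistribution({ 1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1 })
print die.log_probability( 1 )     # the weighted face
print die.log_probability( 6 )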
Updating a discrete distribution is the same as updating other distributions.
In [12]:
items = list('ACGATACACTGAATGACAGCAGTCACTGACAGTAGTACGAGTAGTAGCAGAGAGTAATAAAGAATTAATATATGACACTACGAAAAAAATGCATCG')
d.fit( items )
for char in 'ACGT':
    print "logp({}) = {}".format( char, d.log_probability( char ) )
print d.parameters
We can also take a look at JSON serialization. JSON is a good fit for storing these models: it is human readable, and JSONs can be nested recursively, which more complicated models (such as HMMs) use to store the JSON of each of their distributions. Storing a model is especially useful after the computationally intensive task of training it, so that the training need not be repeated.
In [13]:
print d.to_json()
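A common pattern is to write the JSON to disk after training and reload it later. A minimal sketch, using an arbitrary filename (as noted in the method list, from_json will also accept a filename ending in '.json' directly):

with open( 'discrete.json', 'w' ) as outfile:
    outfile.write( d.to_json() )

with open( 'discrete.json', 'r' ) as infile:
    d2 = Distribution.from_json( infile.read() )
print d2.log_probability( 'A' )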
The Gaussian Kernel Density is a non-parametric distribution which stores a series of points, their associated weights, and the bandwidth representing the range of influence of each point. This will become apparent with an example.
In [14]:
d = GaussianKernelDensity( [0.5, 1.2, 4.2, 1.9, 5.4, 3.4], bandwidth=1 )
d.plot( n=100000, edgecolor='c', color='c', bins=100 )
Reducing the bandwidth parameter is analogous to reducing the variance of a Gaussian. It concentrates the density more tightly around each point, but can cause overfitting.
In [15]:
d = GaussianKernelDensity( [0.5, 1.2, 4.2, 1.9, 5.4, 3.4], bandwidth=0.2 )
d.plot( n=100000, edgecolor='c', color='c', bins=100 )
We can easily see the influence of weights by weighting a single point.
In [16]:
d = GaussianKernelDensity( [0.5, 1.2, 4.2, 1.9, 5.4, 3.4], weights=[1, 1, 1, 3, 1, 1], bandwidth=0.2 )
d.plot( n=100000, edgecolor='c', color='c', bins=100 )
In addition to the Gaussian kernel density, there is a Triangle kernel density, which has a slightly different shape. Mostly I liked how they looked like pointy hats, so I implemented it.
In [17]:
d = TriangleKernelDensity( [0.5, 1.2, 4.2, 1.9, 5.4, 3.4], weights=[1, 1, 1, 3, 1, 1], bandwidth=0.5 )
d.plot( n=500000, edgecolor='c', color='c', bins=250 )
Sometimes a univariate distribution isn't good enough. Data often comes in the form of multiple independent dimensions, each with its own distribution. We can model this with a multivariate distribution which contains a tuple of independent components.
A common example is data whose signal is normally distributed and whose duration is modelled by an exponential.
In [18]:
d = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ])
print d.log_probability( (11, 3) )
We can also weight the distributions. In this example, let's say that the signal is far more important than the duration.
In [19]:
d = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ], weights=[3, 1])
print d.log_probability( (11, 3) )
We see that the point becomes more probable, because the good fit on the signal dimension (the normal distribution) now counts for more than the poor fit on the duration (exponential) dimension. You must be careful with the weightings though, because they don't have to sum to 1; in this sense, weights=[3, 1] is not the same as weights=[0.75, 0.25], as the quick comparison below shows.
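A quick comparison of the two weightings (the variable names are arbitrary):

d_raw = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ], weights=[3, 1])
d_scaled = IndependentComponentsDistribution([ NormalDistribution(10, 1), ExponentialDistribution(0.5) ], weights=[0.75, 0.25])
print d_raw.log_probability( (11, 3) )      # weights sum to 4
print d_scaled.log_probability( (11, 3) )   # same ratio, but weights sum to 1 -- a different value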
We can do updates in the same manner. Drawing samples from a Gaussian with mu = 12 and sigma = 2, and an exponential with mean 5 (a rate of 0.2, which is the parameterization pomegranate uses), we can recover the underlying distributions from 1000 points.
In [20]:
data = numpy.zeros((1000, 2))
data[:,0] = np.random.randn(1000) * 2 + 12
data[:,1] = np.random.exponential(5, 1000)
d.fit(data)
print d.parameters[0][0].parameters, d.parameters[0][1].parameters
It's not always good enough to have independent dimensions, as many signals are heavily correlated. A simple multivariate Gaussian with full covariance updates has been written, as this is the most common use case. You provide a vector of means and a covariance matrix, and you're good to go! Unfortunately, plotting is not natively supported for this distribution.
In [21]:
d = MultivariateGaussianDistribution( np.arange(5) * 5, np.eye(5) )
print d.sample()
In [22]:
data = np.random.randn(1000, 5) + np.arange(5) * 8
d.fit(data)
print d.parameters
In [23]:
print d.sample()
Tables are a different type of multivariate distribution; in essence, a multivariate discrete distribution. To use them, we define the values and give probabilities that sum to 1 for each combination of the parents. Tables must be linked to parent distributions (which can themselves be conditional probability tables). Large networks of tables form Bayesian networks!
Let's code up the Monty Hall problem very quickly, without inference (see the Bayesian network introduction for the inference component).
In [24]:
guest = DiscreteDistribution({'A': 0.33, 'B': 0.33, 'C': 0.33})
prize = DiscreteDistribution({'A': 0.33, 'B': 0.33, 'C': 0.33})
monty = ConditionalProbabilityTable(
[['A', 'A', 'A', 0.0],
['A', 'A', 'B', 0.5],
['A', 'A', 'C', 0.5],
['A', 'B', 'A', 0.5],
['A', 'B', 'B', 0.0],
['A', 'B', 'C', 0.5],
['A', 'C', 'A', 0.5],
['A', 'C', 'B', 0.5],
['A', 'C', 'C', 0.0],
['B', 'A', 'A', 0.0],
['B', 'A', 'B', 0.5],
['B', 'A', 'C', 0.5],
['B', 'B', 'A', 0.5],
['B', 'B', 'B', 0.0],
['B', 'B', 'C', 0.5],
['B', 'C', 'A', 0.5],
['B', 'C', 'B', 0.5],
['B', 'C', 'C', 0.0],
['C', 'A', 'A', 0.0],
['C', 'A', 'B', 0.5],
['C', 'A', 'C', 0.5],
['C', 'B', 'A', 0.5],
['C', 'B', 'B', 0.0],
['C', 'B', 'C', 0.5],
['C', 'C', 'A', 0.5],
['C', 'C', 'B', 0.5],
['C', 'C', 'C', 0.0]], parents=[guest, prize]
)
print monty
In [25]:
print monty.log_probability( ('A', 'A', 'C') )
print monty.log_probability( ('A', 'B', 'B') )
The reason we need the parent distributions to be explicit distribution objects is so that we can seamlessly convert the conditional table to the joint table.
In [26]:
print monty.joint()
There are 18 non-zero entries in the table, each of equal weight. 1 / 18 is approximately 0.0556, so this makes sense. The joint distribution is its own object, which can be stored and used later if desired.
In [27]:
a = monty.joint()
print type(a)
In [28]:
a.log_probability( ('C', 'A', 'A') )
Both conditional and joint distributions can be trained in the same manner as the other distributions.
In [29]:
data = [['A', 'B', 'C'],
['B', 'B', 'A'],
['A', 'B', 'A'],
['C', 'C', 'B'],
['C', 'C', 'A'],
['C', 'C', 'A']]
a.fit(data)
print a
In [30]:
monty.fit(data)
print monty
In the case of conditional probability tables, if a combination of parent values hasn't been seen in the data, a uniform distribution is assumed over it, which explains the prevalence of 0.3333... entries.
Distributions (except tables, right now) support serialization to JSON using to_json() and Distribution.from_json( json ). We can see a few examples easily.
In [31]:
print NormalDistribution( 5, 2 ).to_json()
In [32]:
print DiscreteDistribution( {'A': 0.5, 'B': 0.25, 'C': 0.25} ).to_json()
In [33]:
print Distribution.from_json( GammaDistribution( 5, 2 ).to_json() )
In [34]:
print Distribution.from_json( GaussianKernelDensity( [5,6,7,3,5,7,3,3,5,2] ).to_json() )