Author: Jacob Schreiber

Contact: jmschreiber91@gmail.com

```
In [1]:
```%pylab inline
import seaborn, pandas, itertools, time, random
seaborn.set_style('whitegrid')
from pomegranate import *
numpy.set_printoptions(suppress=True)
random.seed(0)
numpy.random.seed(0)
import matplotlib
matplotlib.rcParams['xtick.labelsize'] = 14
matplotlib.rcParams['ytick.labelsize'] = 14
matplotlib.rcParams['axes.labelsize'] = 16
matplotlib.rcParams['axes.titlesize'] = 16
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['legend.fontsize'] = 16

```
```

pomegranate is a probabilistic modelling library which focuses on combining ease of use with a speedy Cython backend. Its precursor, yahmm, began while I was an undergraduate working in the nanopore group at UC Santa Cruz, which was developing a new DNA sequencing tool. Since the tool was in its beginning stages, sequence analysis with it was primarily done by hand, which was extremely slow and suffered significantly from human bias. Through the use of hidden Markov models implemented in yahmm this process was automated, speeding up the analysis by several orders of magnitude while removing human bias and cutting out around 80% of the error (increasing accuracy from ~90% to ~98%).

pomegranate began when I was in graduate school at the University of Washington and taking machine learning courses. Homework assignments frequently required the implementation of a model, and I noticed that many implementations simply involved rearranging the components of yahmm into other useful models, such as Naive Bayes, mixture models, and Bayesian networks. Thus, pomegranate was born.

While efficient, pomegranate didn't fully utilize the benefits that Cython can provide until last summer, when I was doing a summer internship working on sklearn. During the internship I worked on speeding up decision tree regressors, and a mathematical trick I proposed sped them up by as much as 4x in some cases. This taught me how to write bare-bones, parallelizable Cython code and to use computational tricks to speed up seemingly complex code. I've taken this knowledge to significantly improve the internals of pomegranate and make it extremely fast.

The main models which pomegranate implements and I will be discussing in this talk are

- Probability Distributions
- Naive Bayes
- Markov Chains
- General Mixture Models
- Hidden Markov Models
- Bayesian Networks

One of the core tenets of pomegranate is that every model is really a probability distribution. We're all familiar with simple distributions such as normal or uniform distributions, but even a Bayesian network can be thought of as a distribution which is factored along a graph. This idea makes pomegranate extremely flexible, allowing for models to be stacked within each other to neatly produce hidden Markov models with general mixture model emissions, Naive Bayes classifiers with Bayesian network emissions, and mixtures of hidden Markov models.

Here is a chart of the ways in which models can be stacked. On the left is a chart of what is possible, with dark blue indicating that models can be stacked in a certain way, and light blue indicating a combination which will be added in the near future. On the right is a comparison of pomegranate's models to other known packages, with light red indicating model stacking which exists in other packages but which pomegranate supports more fully. For example, sklearn implements mixture models, but these are limited to Gaussian emissions, while pomegranate supports mixtures of arbitrary distributions. Dark red indicates new model stacking that no other known package supports, such as mixtures of hidden Markov models. If you know of a package which supports these combinations, I'd love to hear about it!

```
In [2]:
```labels = 'Distributions', 'Naive Bayes', 'Markov Chains', 'GMM', 'HMM', 'Bayesian Networks'
X = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 1.0, 0.5, 0.0, 0.0, 0.0],
[1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
[0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
[0.0, 1.0, 0.0, 0.5, 0.5, 0.0]]
Y = [[0.0, 0.5, 0.0, 0.5, 0.5, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
[0.5, 1.0, 0.0, 1.0, 0.5, 0.0],
[0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
plt.figure(figsize=(20, 10))
plt.subplot(121)
plt.title("Model Stacking Allowed", fontsize=24)
plt.imshow(X, interpolation='nearest', cmap='Blues')
plt.xticks(range(6), labels, rotation=90, fontsize=20)
plt.xlabel("..is allowed in this model", fontsize=20)
plt.ylabel("This model...", fontsize=20)
plt.yticks(range(6), labels, fontsize=20)
plt.subplot(122)
plt.title("Model Stacking Compared To Others", fontsize=24)
plt.imshow(Y, interpolation='nearest', cmap='Reds')
plt.xticks(range(6), labels, rotation=90, fontsize=20)
plt.xlabel("..is allowed in this model", fontsize=20)
plt.yticks(range(6), '', fontsize=18)
plt.show()

```
```

All of the models (and stacks of models) in pomegranate support the same core functionality and methods. You can read more about the API at the readthedocs site. These methods are almost all implemented with a Cython backend which releases the GIL. This makes the operations especially fast, but it also makes them all parallelizable using *multithreading*. In many cases, this can be much more efficient than multiprocessing.

To utilize this parallelization, pomegranate has implemented some parallelization wrappers, which can run model prediction, log probability calculations, or model fitting in parallel for any model or stack of models.

Since all models share the same core set of functionality, let's start with a quick API reference. All models have the following methods:

- probability
- log_probability
- fit
- sample
- summarize
- from_summaries
- clear_summaries
- to_json
- from_json

The first three methods are likely the most used---updating the parameters of the distribution, and evaluating samples using the model parameters.

Methods 5-7 implement pomegranate's out-of-core learning scheme. All models can summarize data down to its sufficient statistics using `summarize`, update model parameters exactly using these sufficient statistics using `from_summaries`, and then clear the stored sufficient statistics using `clear_summaries`. This ultimately allows you to decide whether to clear the summaries after each parameter update, or to continue with online learning to get better estimates of a single process.
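
The summarize / from_summaries / clear_summaries cycle can be sketched in plain Python. This is an illustrative stand-in for the pattern, not pomegranate's actual implementation (the hypothetical `NormalSummarizer` class is mine): accumulating the three sufficient statistics of a normal distribution over batches gives exactly the same estimate as a single pass over the full dataset.

```python
import math
import random

# Illustrative sketch of out-of-core learning: accumulate the sufficient
# statistics of a normal distribution batch by batch, then do one exact update.
class NormalSummarizer:
    def __init__(self):
        self.w = self.wx = self.wx2 = 0.0   # sum(w), sum(w*x), sum(w*x^2)

    def summarize(self, batch):
        for x in batch:                     # each point has weight 1 here
            self.w += 1.0
            self.wx += x
            self.wx2 += x * x

    def from_summaries(self):
        mu = self.wx / self.w
        var = self.wx2 / self.w - mu * mu
        return mu, math.sqrt(var)

    def clear_summaries(self):
        self.w = self.wx = self.wx2 = 0.0

random.seed(0)
data = [random.gauss(0, 1) for _ in range(5000)]

s = NormalSummarizer()
for start in range(0, 5000, 1000):          # five chunks, as if streamed from disk
    s.summarize(data[start:start + 1000])
mu, std = s.from_summaries()

# identical (up to floating point) to fitting on the full dataset at once
full_mu = sum(data) / len(data)
full_std = math.sqrt(sum((x - full_mu) ** 2 for x in data) / len(data))
assert abs(mu - full_mu) < 1e-8 and abs(std - full_std) < 1e-8
```

Because the statistics are simple sums, the order and size of the batches make no difference to the final parameters.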

In addition, for models composed of simple distributions, a sklearn-like API is exposed for the common operations:

- predict
- predict_proba
- predict_log_proba

The 'predict' method returns the index of the model which most likely generated the sample, 'predict_proba' returns the posterior probabilities of each model having generated the sample, and 'predict_log_proba' returns the log probabilities.
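
To make the relationship between these three methods concrete, here is a small standard-library sketch (the `normal_logpdf` and `predict_proba` helpers are hypothetical, not the pomegranate API) for two normal components under a uniform prior:

```python
import math

# Sketch of what predict / predict_proba compute for two normal components.
def normal_logpdf(x, mu, sigma):
    return -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict_proba(x, components):
    # posterior P(M|D) under a uniform prior, normalized with log-sum-exp
    logp = [normal_logpdf(x, mu, sigma) for mu, sigma in components]
    m = max(logp)
    total = sum(math.exp(l - m) for l in logp)
    return [math.exp(l - m) / total for l in logp]

components = [(0, 2), (5, 1)]           # (mean, std) of each component
probs = predict_proba(4.0, components)  # posterior probabilities, summing to 1
pred = probs.index(max(probs))          # 'predict' is just the argmax
```

`predict_log_proba` would simply return the logs of these normalized posterior values.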

The basic unit of probabilistic modelling is the probability distribution. Most people are familiar with simple ones like the Normal distribution, but pomegranate supports a wide variety of them. Here is a full list of the probability distributions which pomegranate currently (8/12/2016) supports.

Univariate Distributions

- UniformDistribution
- NormalDistribution
- LogNormalDistribution
- ExponentialDistribution
- BetaDistribution
- GammaDistribution
- DiscreteDistribution
- PoissonDistribution

Kernel Densities

- GaussianKernelDensity
- UniformKernelDensity
- TriangleKernelDensity

Multivariate Distributions

- IndependentComponentsDistribution
- MultivariateGaussianDistribution
- DirichletDistribution
- ConditionalProbabilityTable
- JointProbabilityTable

The first two methods are likely the most used---updating the parameters of the distribution, and evaluating samples using the model parameters. The next three are involved with the out-of-core learning API, where `summarize` will store sufficient statistics from a dataset, `from_summaries` will update the model parameters using the stored sufficient statistics, and `clear_summaries` will reset the sufficient statistics if you don't want past points to influence the update anymore. Lastly, `to_json` and `from_json` allow you to serialize your models.

Let's take a look at one of the simpler distributions, the Normal Distribution. We can specify it easily in terms of the mean $\mu$ and the standard deviation $\sigma$.

```
In [3]:
```a = NormalDistribution(0, 2)
b = NormalDistribution(1, 0.5)

We can evaluate the likelihood of a point under each of these distributions using the `log_probability` method.

```
In [4]:
```print a.log_probability(5)
print b.log_probability(5)

```
```

We can fit any type of distribution to data using the class method `Model.from_samples(X)`. This uses maximum likelihood estimates on the data to derive parameters for the distribution. In the case of the normal distribution, it just calculates the mean and standard deviation of the data.

Let's fit a normal distribution to some random data and take a look at it.

```
In [5]:
```data = numpy.random.randn(100)
a = NormalDistribution.from_samples(data)
x = numpy.arange(-3.5, 3.5, 0.1)
y = a.probability(x)
plt.plot(x, y, c='b')
plt.fill_between(x, 0, y, color='b', alpha=0.3)
plt.scatter(data, numpy.zeros(100)-0.05)
plt.ylabel('Probability')
plt.xlim(-4, 4)
plt.show()

```
```

In addition to the basic fit method, pomegranate also supports out-of-core learning to train on datasets which don't fit in memory through the `summarize` and `from_summaries` methods. Essentially, the `summarize` method will reduce a batch of data down to its sufficient statistics, and the `from_summaries` method will calculate an exact update to the model parameters based on the stored sufficient statistics. A simple example of this is the normal distribution, where the three sufficient statistics are $\sum\limits_{i=1}^{n}w_{i}$, $\sum\limits_{i=1}^{n}w_{i}x_{i}$, and $\sum\limits_{i=1}^{n} w_{i}x_{i}^{2}$. Since these sums can be accumulated in parts, you can run `summarize` as many times as necessary to analyze the entire dataset. Once you're done summarizing data, you can then run `from_summaries`, which calculates the following:

\begin{equation} \mu = \frac{\sum\limits_{i=1}^{n} w_{i}x_{i}}{\sum\limits_{i=1}^{n} w_{i}} \qquad \sigma^{2} = \frac{\sum\limits_{i=1}^{n} w_{i}x_{i}^{2}}{\sum\limits_{i=1}^{n} w_{i}} - \mu^{2} \end{equation}

Let's confirm that this is true by taking a look at two distributions, one fit to the entire dataset at once, and one summarized over five chunks of data.

```
In [6]:
```data = numpy.random.randn(5000)
a = NormalDistribution(5, 2)
b = NormalDistribution(5, 2)
a.fit(data)
b.summarize(data[:1000])
b.summarize(data[1000:2000])
b.summarize(data[2000:3000])
b.summarize(data[3000:4000])
b.summarize(data[4000:])
b.from_summaries()
x = numpy.arange(-3.5, 3.5, 0.1)
ya = map( a.probability, x )
yb = map( b.probability, x )
plt.plot(x, ya, c='b')
plt.plot(x, yb, c='r')
plt.fill_between(x, 0, ya, color='b', alpha=0.3)
plt.fill_between(x, 0, yb, color='r', alpha=0.3)
plt.ylabel('Probability')
plt.show()
print "Fit Mean: {}, Fit STD: {}".format( *a.parameters )
print "Summarize Mean: {}, Summarize STD: {}".format( *b.parameters )

```
```

```
In [7]:
```data = numpy.random.randn(1000)
print "numpy time:"
%timeit -n 100 data.mean(), data.std()
print
print "pomegranate time:"
%timeit -n 100 NormalDistribution.from_samples(data)

```
```

Looks like we're around twice as fast at extracting the mean and standard deviation. This is mostly because the mean and variance are calculated together in pomegranate through the sufficient statistics, instead of passing over the entire array twice to calculate first the mean and then the standard deviation.

Let's take a look at doing this with multivariate Gaussian distributions on large amounts of data.

```
In [8]:
```data = numpy.random.randn(10000000, 10)
print "numpy time:"
%timeit -n 10 data.mean(), numpy.cov(data.T)
print
print "pomegranate time:"
%timeit -n 10 MultivariateGaussianDistribution.from_samples(data)

```
```

```
In [9]:
```from scipy.stats import norm
d = NormalDistribution(0, 1)
print "scipy time:"
%timeit -n 100 norm.logpdf(2, 0, 1)
print
print "pomegranate time:"
%timeit -n 100 NormalDistribution(0, 1).log_probability(2)
print
print "pomegranate with (w/ created object)"
%timeit -n 100 d.log_probability(2)
print
print "logp difference: {}".format( norm.logpdf(2, 0, 1) - NormalDistribution(0, 1).log_probability( 2 ) )

```
```

One of the reasons pomegranate can be so fast at calculating log probabilities is that it caches parts of the log-pdf equation so that it doesn't need to redo all of the calculations each time. For example, let's look at the normal distribution pdf equation:

\begin{equation} P(X|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \end{equation}

We can take the log of this to simplify it.

\begin{equation} \log P(X|\mu, \sigma) = -\log \left(\sqrt{2\pi}\sigma \right) - \frac{(x-\mu)^{2}}{2\sigma^{2}} \end{equation}

pomegranate speeds up this calculation by caching $-\log(\sqrt{2\pi}\sigma)$ and $2\sigma^{2}$ when the object is created, so the equation simplifies to the following:

\begin{equation} \log P(X|\mu, \sigma) = \alpha - \frac{(x - \mu)^{2}}{\beta} \end{equation}

We don't need to calculate any logs or exponentials here, just a difference, a multiplication, a division, and a subtraction.
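
A minimal sketch of this caching trick in plain Python (a hypothetical `CachedNormal` class of my own, not pomegranate's Cython implementation):

```python
import math

# The two constant terms are computed once, at construction time.
class CachedNormal:
    def __init__(self, mu, sigma):
        self.mu = mu
        self.alpha = -math.log(math.sqrt(2 * math.pi) * sigma)  # cached -log(sqrt(2*pi)*sigma)
        self.beta = 2 * sigma ** 2                              # cached 2*sigma^2

    def log_probability(self, x):
        # only a difference, a multiplication, a division, and a subtraction remain
        return self.alpha - (x - self.mu) ** 2 / self.beta

d = CachedNormal(0, 1)
# matches the full log-pdf formula evaluated directly
assert abs(d.log_probability(2) - (-0.5 * math.log(2 * math.pi) - 2.0)) < 1e-12
```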

It is worth noting that numpy and scipy support vectorized operations across an array of samples at the same time. As the size of this array grows, it may become faster to use numpy or scipy than pomegranate. However, most of the models inside pomegranate only require the calculation of one value at a time, so it's more useful for pomegranate to have fast one-sample calculations than vectorized ones.

Let's turn now to a practical and useful example of how probabilistic modelling can help us.

A few years ago, my girlfriend wanted me to watch *Gossip Girl* with her. The show follows a group of angsty teenagers in Manhattan who take turns being in relationships with each other while constantly disappointing their parents. In the backdrop, an enigmatic figure known as 'Gossip Girl' sends out text message 'blasts' to stir up trouble--usually just in time to cause needless drama which could be solved if the characters ever decided to talk to each other. A central mystery of the show is 'Who is Gossip Girl?', with the implication that one of the main characters is secretly Gossip Girl.

Let's take a look at some of the blasts.

```
In [10]:
```data = pandas.read_excel('GGBlasts.xlsx')
blasts = data['Blast']
print blasts[1]
print
print blasts[5]

```
```

I took that last blast as a challenge. I wanted to figure out which of the characters was gossip girl, and I wanted to use data science in order to do it.

To do this, each blast was first encoded as being either in favor of someone's agenda or against it. Presumably, the character whom the blasts help the most would be Gossip Girl. So let's now take a look at a bar chart of the number of positive blasts minus the number of negative blasts for each character.

```
In [11]:
```data = data.fillna(0)
data.sum(axis=0).plot(kind='bar', facecolor='m', edgecolor='m')
plt.ylabel('Positive Blasts - Negative Blasts')
plt.show()

```
```

This chart seems to have some important information in it. Specifically, it seems like Blair has by far the most blasts working against her agenda, so it is unlikely that she is Gossip Girl. Likewise, it seems unlikely that Chuck is Gossip Girl due to the negativity. This leaves Dan, Nate, and Vanessa as the most likely contenders to be Gossip Girl based entirely on the negativity index. In order to break this tie, let's use probabilistic modelling to see if we can figure out who it is.

This brings us to the concept of the beta distribution. Basically, the beta distribution is a probability distribution over values between 0 and 1, parameterized by two shape parameters, $\alpha$ and $\beta$. Typically, it represents a distribution over the probability that an event occurs, with $\alpha$ representing the number of times the event happened, and $\beta$ representing the number of times it did not happen.

A classic example of using a beta distribution is modelling the probability of a coin coming up heads. A coin coming up heads is considered a 'success', and a coin coming up tails is considered a 'failure'. Typically a count of 1 is added to each parameter of the beta distribution on top of the number of observations, for math reasons. Let's take a look.

```
In [12]:
```a = BetaDistribution(0+1, 0+1)
b = BetaDistribution(0+1, 2+2)
c = BetaDistribution(1+1, 2+2)
d = BetaDistribution(25+1, 25+1)
plt.figure(figsize=(10, 6))
x = numpy.arange(0, 1, 0.01)
ya = map( a.probability, x )
yb = map( b.probability, x )
yc = map( c.probability, x )
yd = map( d.probability, x )
plt.plot(x, ya, c='b', label="0 heads, 0 tails")
plt.plot(x, yb, c='c', label="0 heads, 2 tails")
plt.plot(x, yc, c='r', label="1 head, 2 tails")
plt.plot(x, yd, c='g', label="25 heads, 25 tails")
plt.fill_between(x, 0, ya, color='b', alpha=0.3)
plt.fill_between(x, 0, yb, color='c', alpha=0.3)
plt.fill_between(x, 0, yc, color='r', alpha=0.3)
plt.fill_between(x, 0, yd, color='g', alpha=0.3)
plt.legend()
plt.ylabel('Probability')
plt.show()

```
```

This seems intuitive. If we haven't seen any heads or any tails (blue), we really have a uniform distribution over the possible values of the probability of the coin flipping heads. If we've seen two tails (cyan), then it becomes impossible that the coin always comes up heads, and more of the probability mass shifts toward low probabilities of heads. However, as soon as we see a single head (red), the probability of only flipping tails drops to 0, which makes sense, because we've observed a head. Continuing, if we flip 25 heads and 25 tails (green), we get a tighter distribution around 0.5, representing our increased confidence that the parameter is around there. This makes sense, because if we flip a coin 2 times and get 1 head and 1 tail, we're less confident than if we've flipped it 2000 times and gotten 1000 heads and 1000 tails.

Let's apply this concept to the *Gossip Girl* data. We can start off by saying that a blast in favor of a character's agenda is a 'success', and a blast against their agenda is a 'failure'. Let's then learn beta distributions for each character for each of the four seasons, and see how these distributions evolve over time.

```
In [13]:
```characters = {
    'Blair'   : [[1, 1, 1, 1], [1, 1, 1, 1]],
    'Serena'  : [[1, 1, 1, 1], [1, 1, 1, 1]],
    'Dan'     : [[1, 1, 1, 1], [1, 1, 1, 1]],
    'Vanessa' : [[1, 1, 1, 1], [1, 1, 1, 1]],
    'Jenny'   : [[1, 1, 1, 1], [1, 1, 1, 1]],
    'Chuck'   : [[1, 1, 1, 1], [1, 1, 1, 1]]
}

for i, row in data.iterrows():
    season_s = row['Season/episode']
    if isinstance( season_s, unicode ):
        season = int( row['Season/episode'][1] ) - 1
        for character in characters.keys():
            if row[character] == 1:
                characters[character][0][season] += 1.
            elif row[character] == -1:
                characters[character][1][season] += 1.

for character, (good, bad) in characters.items():
    characters[character] = [ np.cumsum(good), np.cumsum(bad) ]

x = np.linspace(0, 1, 100)
plt.figure( figsize=(16, 8))

colors = {
    'Blair'   : 'r',
    'Serena'  : 'y',
    'Dan'     : 'g',
    'Jenny'   : 'c',
    'Vanessa' : 'm',
    'Chuck'   : 'b'
}

for k in xrange(4):
    plt.subplot(len(good) / 2, 2, k + 1)
    plt.xlabel("After Season {}".format( k+1 ))

    for character, (good, bad) in characters.items():
        B = BetaDistribution(good[k], bad[k])
        y = numpy.exp([B.log_probability(i) for i in x])
        plt.plot( x, y, label=character, c=colors[character] )
        plt.fill_between(x, 0, y, color=colors[character], alpha=0.2)
        plt.vlines( good[k] / ( good[k]+bad[k] ), 0, 9, colors=colors[character], linestyles='dashed' )

    plt.legend()

plt.tight_layout()

```
```

Above we've plotted both the distributions and their maximum likelihood estimates as vertical dashed lines. We can see that over the course of the show, the distributions tighten. Two interesting observations are that Dan appears to be the favorite to be Gossip Girl across all seasons, though not by much, and that, except at the end of Season 3, Blair is not the least likely to be Gossip Girl, being surpassed by Chuck. In fact, by the end of Season 1, she is above the median for being Gossip Girl.

Hopefully this practical example has shown how even basic probabilistic modelling can be applied to derive insight into real-world problems.

Now that we've covered basic probability distributions, a typical thing you may want to do is create a machine learning classifier using them. After all, you can fit these distributions to some data, and use them to evaluate the likelihood of some future points. Wouldn't it be simple to just see which distribution produced the highest likelihood of a given point, and then assign that label to the point?

This is the basis of Naive Bayes classifiers, a form of probabilistic classifier which relies on Bayes' rule, shown below, where M stands for model and D stands for data.

\begin{equation} P(M|D) P(D) = P(D|M) P(M) \end{equation}

Let's look at the different parts of this equation. The first term is P(M|D), the probability of the model given the data. This is ultimately what we would like to calculate for each model (distribution) in the Naive Bayes classifier; we then return the model label corresponding to the model with the highest P(M|D) for each point.

However, P(D) is a bit more ambiguous. What exactly is the probability of that dataset? Fortunately, we don't have to worry about it: since it does not depend on the model, it is the same for every component, so we can drop it and are left with the following proportionality:

\begin{equation} P(M|D) \propto P(D|M) P(M) \end{equation}

Next, we get to P(D|M), which is our basic likelihood function. All distributions implement `model.log_probability(X)`, which calculates logP(D|M).

Lastly, we come to P(M), which is the prior distribution over the distributions in the Naive Bayes classifier. This term represents the probability that, without even having seen the data point yet, we would assign a specific label to it. This typically weights the likelihood by the frequency of each label, so that labels which occur far more frequently are recognized as being more likely.
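
To make the role of the prior concrete, here is a tiny standard-library sketch (the `posterior` helper is hypothetical, not the pomegranate API) showing how the same likelihoods can lead to different winners under different priors:

```python
# Bayes' rule with a prior: the same likelihoods P(D|M), reweighted by how
# frequently each class occurs.
def posterior(likelihoods, priors):
    unnorm = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(unnorm)                  # P(D): dropped from the argmax, but
    return [u / z for u in unnorm]   # needed to normalize into probabilities

likelihoods = [0.10, 0.30]                      # P(D|M) for two models
uniform = posterior(likelihoods, [0.5, 0.5])    # model 1 wins: [0.25, 0.75]
skewed  = posterior(likelihoods, [0.9, 0.1])    # model 0 wins: [0.75, 0.25]
```

With a uniform prior the classification depends only on the likelihoods; a sufficiently skewed prior can flip the decision even though the data has not changed.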

pomegranate implements Naive Bayes in a fast, flexible, and simple to use format. Let's see a basic example using two normal distributions.

We can specify the model in two ways, depending on if we want to fit the model to a dataset, or have specific distributions in mind which we'd like to use to predict.

```
In [14]:
```a = NormalDistribution(0, 2)
b = NormalDistribution(5, 1)
x = numpy.arange(-8, 14, 0.1)
ya = a.probability(x)
yb = b.probability(x)
plt.plot(x, ya, c='b')
plt.plot(x, yb, c='r')
plt.fill_between(x, 0, ya, color='b', alpha=0.2)
plt.fill_between(x, 0, yb, color='r', alpha=0.2)
plt.ylabel("Probability")
plt.show()

```
```

```
In [15]:
```data = numpy.arange(-5, 10, 0.5).reshape(-1, 1)
clf = NaiveBayes([a, b])
y = clf.predict(data)
n0 = data.shape[0] - y.sum()
n1 = y.sum()
x = numpy.arange(-8, 14, 0.1)
ya = a.probability(x)
yb = b.probability(x)
plt.plot(x, ya, c='b', alpha=0.2)
plt.plot(x, yb, c='r', alpha=0.2)
plt.fill_between(x, 0, ya, color='b', alpha=0.2)
plt.fill_between(x, 0, yb, color='r', alpha=0.2)
plt.scatter( data[y==0], numpy.zeros(n0)-0.02, color='b', edgecolor='b' )
plt.scatter( data[y==1], numpy.zeros(n1)+0.42, color='r', edgecolor='r' )
plt.ylabel("Probability")
plt.show()

```
```

This makes sense. The model switches from predicting 0 (blue) to 1 (red) at the position in the plot where the red distribution has a higher probability than the blue distribution. This is pretty much what we would expect from a Naive Bayes classifier with no prior distribution on the components.
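
We can sanity-check where that switch happens with a quick standard-library sketch (not the pomegranate API), scanning for the first point where the N(5, 1) log-density exceeds the N(0, 2) log-density:

```python
import math

def normal_logpdf(x, mu, sigma):
    # log of the normal pdf, same form as the equation discussed earlier
    return -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

# scan left to right in steps of 0.01 for the first x where N(5, 1)
# overtakes N(0, 2)
boundary = next(x / 100.0 for x in range(-800, 1400)
                if normal_logpdf(x / 100.0, 5, 1) > normal_logpdf(x / 100.0, 0, 2))
print(boundary)  # -> 3.07
```

Note that because the blue distribution is much wider, it overtakes the red one again far out in the right tail; the decision boundary we see in the plot is just the first crossing.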

Now, we can get a probabilistic view of the classification by looking at the predicted probabilities instead of just a hard classification. This ends up being P(M|D), a vector of probabilities that each model component generated that sample, which sums to 1 for each sample. We can get these values using `model.predict_proba`.

```
In [16]:
```data = numpy.arange(-3, 7, 0.25).reshape(-1, 1)
y = clf.predict_proba(data)
x = numpy.arange(-8, 14, 0.1)
ya = a.probability(x)
yb = b.probability(x)
plt.plot(x, ya, c='b', alpha=0.2)
plt.plot(x, yb, c='r', alpha=0.2)
plt.fill_between(x, 0, ya, color='b', alpha=0.2)
plt.fill_between(x, 0, yb, color='r', alpha=0.2)
plt.scatter( data[y[:,1] <= 0.5], y[y[:,1] <= 0.5, 1]*0.4, color='b', edgecolor='b' )
plt.scatter( data[y[:,1] > 0.5], y[y[:,1] > 0.5, 1]*0.4, color='r', edgecolor='r' )
plt.yticks(numpy.arange(0, 0.41, 0.05), numpy.arange(0, 1.1, 0.125))
plt.ylim(-0.05, 0.45)
plt.ylabel("P(Red Model | Data)")
plt.show()

```
```

These results mesh pretty well with what we saw before. As the input value increases, there is a higher probability that the sample was generated by the red distribution. The point at which the dots switch colors lines up with the point at which the red distribution has a higher probability than the blue distribution.

We can fit Naive Bayes classifiers in two ways. The first is by passing in an explicit list of instantiated models, possibly with a prior distribution over them, and then calling fit. The second is similar to sklearn, where we only pass a callable for the type of distribution which we would like to fit. We know the number of distributions which we'd like to fit from the label set.

```
In [17]:
```X = numpy.concatenate([numpy.random.normal(0, 2, (100, 1)), numpy.random.normal(4, 1.5, (100, 1))])
y = numpy.concatenate([numpy.zeros(100), numpy.ones(100)])
clf = NaiveBayes.from_samples(NormalDistribution, X, y)
print(clf)

```
```

```
In [18]:
```from sklearn.naive_bayes import GaussianNB
def create_dataset(n_samples, n_dim, n_classes):
    """Create a random dataset with n_samples in each class."""
    X = numpy.concatenate([numpy.random.randn(n_samples, n_dim) + i for i in range(n_classes)])
    y = numpy.concatenate([numpy.zeros(n_samples) + i for i in range(n_classes)])
    return X, y

def plot( fit, predict, skl_error, pom_error, sizes, xlabel ):
    """Plot the results."""
    idx = numpy.arange(fit.shape[1])

    plt.figure( figsize=(14, 4))
    plt.plot( fit.mean(axis=0), c='c', label="Fitting")
    plt.plot( predict.mean(axis=0), c='m', label="Prediction")
    plt.plot( [0, fit.shape[1]-1], [1, 1], c='k', label="Baseline" )
    plt.fill_between( idx, fit.min(axis=0), fit.max(axis=0), color='c', alpha=0.3 )
    plt.fill_between( idx, predict.min(axis=0), predict.max(axis=0), color='m', alpha=0.3 )
    plt.xticks(idx, sizes, rotation=65, fontsize=14)
    plt.xlabel('{}'.format(xlabel), fontsize=14)
    plt.ylabel('pomegranate is x times faster', fontsize=14)
    plt.legend(fontsize=12, loc=4)
    plt.show()

    plt.figure( figsize=(14, 4))
    plt.plot( 1 - skl_error.mean(axis=0), alpha=0.5, c='c', label="sklearn accuracy" )
    plt.plot( 1 - pom_error.mean(axis=0), alpha=0.5, c='m', label="pomegranate accuracy" )
    plt.fill_between( idx, 1-skl_error.min(axis=0), 1-skl_error.max(axis=0), color='c', alpha=0.3 )
    plt.fill_between( idx, 1-pom_error.min(axis=0), 1-pom_error.max(axis=0), color='m', alpha=0.3 )
    plt.xticks(idx, sizes, rotation=65, fontsize=14)
    plt.xlabel('{}'.format(xlabel), fontsize=14)
    plt.ylabel('Accuracy', fontsize=14)
    plt.legend(fontsize=14)
    plt.show()

def evaluate_models():
    sizes = numpy.around(numpy.exp( numpy.arange(7, 13))).astype('int')
    n, m = sizes.shape[0], 5

    skl_predict, pom_predict = numpy.zeros((m, n)), numpy.zeros((m, n))
    skl_fit, pom_fit = numpy.zeros((m, n)), numpy.zeros((m, n))
    skl_error, pom_error = numpy.zeros((m, n)), numpy.zeros((m, n))

    for i in range(m):
        for j, size in enumerate(sizes):
            X, y = create_dataset(size, 20, 2)

            # bench fit times
            tic = time.time()
            skl = GaussianNB()
            skl.fit( X, y )
            skl_fit[i, j] = time.time() - tic

            tic = time.time()
            pom = NaiveBayes.from_samples(NormalDistribution, X, y)
            pom_fit[i, j] = time.time() - tic

            # bench predict times
            tic = time.time()
            skl_predictions = skl.predict( X )
            skl_predict[i, j] = time.time() - tic

            tic = time.time()
            pom_predictions = pom.predict( X )
            pom_predict[i, j] = time.time() - tic

            # check number wrong
            skl_e = (y != skl_predictions).mean()
            pom_e = (y != pom_predictions).mean()

            skl_error[i, j] = min(skl_e, 1-skl_e)
            pom_error[i, j] = min(pom_e, 1-pom_e)

    fit = skl_fit / pom_fit
    predict = skl_predict / pom_predict

    plot(fit, predict, skl_error, pom_error, sizes, "samples per component")

evaluate_models()

```
```

A common case in the analysis of signals is to have a signal which is a mixture of two different underlying phenomena, which each have different characteristics. For example, this could be the electricity usage in a house when the lights are on versus off, or a pedometer which measures when a person is running versus walking.

Let's assume for now that we have a perfect segmenter, which will divide the signal into regions that are different from the adjacent regions. The task becomes to take a segment of the signal and classify which phenomenon is happening during it. Let's take a look at some data, colored by the underlying phenomenon.

```
In [19]:
```def plot_signal(X, n):
    plt.figure(figsize=(16, 6))
    t_current = 0
    for i in range(n):
        mu, std, t = X[i]
        t = int(t)
        chunk = numpy.random.normal(mu, std, t)
        plt.plot(numpy.arange(t_current, t_current+t), chunk, c='cm'[i % 2])
        t_current += t
    plt.ylim(20, 40)
    plt.show()

def create_signal(n):
    X, y = [], []
    for i in range(n):
        mu = numpy.random.normal(30.0, 0.4)
        std = numpy.random.lognormal(-0.1, 0.4)
        t = int(numpy.random.exponential(45)) + 1
        X.append([mu, std, t])
        y.append(0)

        mu = numpy.random.normal(30.5, 0.4)
        std = numpy.random.lognormal(-0.4, 0.4)
        t = int(numpy.random.exponential(250)) + 1
        X.append([mu, std, t])
        y.append(1)
    return numpy.array(X), numpy.array(y)

X_train, y_train = create_signal(1000)
X_test, y_test = create_signal(250)
plot_signal(X_train, 20)

```
```

Each segment is drawn from a normal distribution whose mean is drawn from a normal distribution, whose standard deviation is drawn from a log normal distribution, and whose duration is drawn from an exponential distribution. The two classes are fairly similar to each other, but together the features may differ enough to produce a good classifier. Let's say that when the classifier gets a segment of signal, it will classify based on the mean, standard deviation, and duration of the segment, and not the raw data points. This turns into a basic classification task with three features.

We can create two classifiers in pomegranate and also see how the more widely used sklearn classifier does. The first uses the appropriate underlying distribution for each feature: a normal distribution for the mean, a log normal distribution for the standard deviation, and an exponential distribution for the duration. We can pass these directly into the `from_samples` method. The second classifier is a more typical multivariate Gaussian Naive Bayes model.

Let's fit our models to 1000 example segments, and then test the validation performance on 250 new segments.

```
In [20]:
```from sklearn.naive_bayes import GaussianNB
distributions = [NormalDistribution, LogNormalDistribution, ExponentialDistribution]
model = NaiveBayes.from_samples(distributions, X_train, y_train)
print "Naive Bayes Using Normal, LogNormal, Exponential: ", (model.predict(X_test) == y_test).mean()
model = NaiveBayes.from_samples(NormalDistribution, X_train, y_train)
print "Multivariate Gaussian Naive Bayes: ", (model.predict(X_test) == y_test).mean()
clf = GaussianNB()
clf.fit(X_train, y_train)
print "sklearn Multivariate Gaussian Naive Bayes: ", ( clf.predict(X_test) == y_test).mean()

```
```

Looks like using a multivariate Gaussian model is not a bad choice for this data, but using the proper distribution to model each of the features does allow you to get significantly better performance and generalizability. This improved performance is not just in the hard classifications; the soft calls (using `model.predict_proba`) will also be more reliable and accurate.

A difference between the pomegranate Naive Bayes classifier and the sklearn classifier is that pomegranate will learn the full covariance matrix for the classifier, while sklearn will only learn the diagonal. In many applications, learning the full covariance matrix can improve performance when the underlying features are correlated. In this example, all of the data is generated from independent distributions, and so there isn't much correlation.
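To see why the off-diagonal terms can matter, here is a small numpy sketch (independent of pomegranate) with two artificially correlated features; a diagonal model simply throws the correlation away:

```python
import numpy

numpy.random.seed(0)

# Two correlated features: the second is the first plus some noise.
x1 = numpy.random.normal(0, 1, 1000)
x2 = x1 + numpy.random.normal(0, 0.5, 1000)
X = numpy.vstack([x1, x2]).T

# The full covariance matrix captures the off-diagonal dependence...
full_cov = numpy.cov(X.T)

# ...while a diagonal model keeps only the per-feature variances.
diag_cov = numpy.diag(numpy.diag(full_cov))

# The correlation between the two features is strong, so the full and
# diagonal models assign very different likelihoods to this data.
correlation = full_cov[0, 1] / numpy.sqrt(full_cov[0, 0] * full_cov[1, 1])
```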

Bayesian networks are just distributions which are factorized along a graphical structure. As such, we can plug them into a Bayes classifier to compare different structures. For example, let's compare a Bayesian network to a simple independent-components distribution. This will let us see whether adding structure over the variables does better than treating them independently.

```
In [21]:
```intelligence = DiscreteDistribution({0: 0.7, 1: 0.3})
difficulty = DiscreteDistribution({0: 0.4, 1: 0.6})
SAT = ConditionalProbabilityTable([[0, 0, 0.95],
                                   [0, 1, 0.05],
                                   [1, 0, 0.20],
                                   [1, 1, 0.80]], [intelligence])
grade = ConditionalProbabilityTable([[0, 0, 0, 0.70],
                                     [0, 0, 1, 0.30],
                                     [0, 1, 0, 0.95],
                                     [0, 1, 1, 0.05],
                                     [1, 0, 0, 0.90],
                                     [1, 0, 1, 0.10],
                                     [1, 1, 0, 0.30],
                                     [1, 1, 1, 0.70]], [intelligence, difficulty])
letter = ConditionalProbabilityTable([[0, 0, 0.90],
                                      [0, 1, 0.10],
                                      [1, 0, 0.40],
                                      [1, 1, 0.60]], [grade])
s1 = State(intelligence, name="intelligence")
s2 = State(difficulty, name="difficulty")
s3 = State(SAT, name="SAT")
s4 = State(grade, name="grade")
s5 = State(letter, name="letter")
dN = BayesianNetwork()
dN.add_nodes(s1, s2, s3, s4, s5)
dN.add_edge(s1, s3)
dN.add_edge(s1, s4)
dN.add_edge(s2, s4)
dN.add_edge(s4, s5)
dN.bake()

intelligence = DiscreteDistribution({0: 0.95, 1: 0.05})
difficulty = DiscreteDistribution({0: 0.4, 1: 0.6})
SAT = ConditionalProbabilityTable([[0, 0, 0.95],
                                   [0, 1, 0.05],
                                   [1, 0, 0.20],
                                   [1, 1, 0.80]], [intelligence])
grade = DiscreteDistribution({0: 0.2, 1: 0.8})
letter = ConditionalProbabilityTable([[0, 0, 0.90],
                                      [0, 1, 0.10],
                                      [1, 0, 0.40],
                                      [1, 1, 0.60]], [grade])
s1 = State(intelligence, name="intelligence")
s2 = State(difficulty, name="difficulty")
s3 = State(SAT, name="SAT")
s4 = State(grade, name="grade")
s5 = State(letter, name="letter")
dC = BayesianNetwork()
dC.add_nodes(s1, s2, s3, s4, s5)
dC.add_edge(s1, s3)
dC.add_edge(s4, s5)
dC.bake()

```
```

Now we have two fully specified Bayesian networks: one for honest students, and one for students who are cheating, in which grade has no dependencies. The two changes we make for the cheating network are, first, that the distribution over intelligence favors those who are not exceptional, because in this example they are more likely to cheat, and second, that we remove the dependency of grade on difficulty and intelligence, because cheating yields a good grade regardless.

Let's build a Bayes classifier using these two networks to try to identify students who are cheating just based on this network structure.

```
In [22]:
```X = numpy.array([[0, 0, 1, 0, 0],
                 [1, 0, 1, 1, 0],
                 [1, 1, 0, 0, 0],
                 [0, 0, 1, 0, 1],
                 [0, 0, 1, 0, 1],
                 [0, 1, 1, 0, 1],
                 [1, 0, 1, 0, 1],
                 [0, 1, 0, 1, 0]])
clf = BayesClassifier([dN, dC])
clf.predict_proba(X)

```
Out[22]:
```

The student who is most likely to not be cheating makes a lot of sense: a smart student taking an easy class who does not do well in it, but does well on the SAT and ultimately gets a letter of recommendation. Conversely, the student most likely to be cheating was a poor student taking a difficult class who did not do well on the SAT, but somehow got a good grade in the class and did not get a letter.

It is easy to see how these networks could be expanded to allow for more granularity at each level. For example, each grade could be the actual letter grade and SAT performance might be the number of standard deviations above or below the mean. More variables and connections could allow for more sophisticated differences between the networks. No matter how complicated these networks get, they can be fed into pomegranate.

Markov chains are a simple model based on conditional probability, where a sequence is modelled as the product of conditional probabilities. An n-th order Markov chain looks back n emissions to base its conditional probability on. For example, a 3rd order Markov chain models $P(X_{t}|X_{t-1},X_{t-2},X_{t-3})$.

However, a full Markov chain also needs to model the first n observations, which don't yet have a full n steps of history behind them. The first observation can't be modelled using $P(X_{t}|X_{t-1},X_{t-2},X_{t-3})$, but can be modelled by $P(X_{t})$; the second has to be modelled by $P(X_{t}|X_{t-1})$, and so on. This means that these distributions have to be passed into the Markov chain as well.
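This factorization can be sketched in a few lines of plain Python. The dictionaries below hold only a handful of entries (values taken from the first-order table in the next cell), just enough for the example sequence:

```python
import math

# First-order chain: P(X_1) plus P(X_t | X_{t-1}).
initial = {'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10}
# Only a few transition entries are shown, enough for the example below.
transition = {('C', 'A'): 0.10, ('A', 'G'): 0.30, ('G', 'C'): 0.45}

def log_probability(sequence):
    # log P(sequence) = log P(x_1) + sum over t of log P(x_t | x_{t-1})
    logp = math.log(initial[sequence[0]])
    for prev, curr in zip(sequence, sequence[1:]):
        logp += math.log(transition[(prev, curr)])
    return logp

# P('CAGC') = 0.40 * 0.10 * 0.30 * 0.45
```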

We can initialize a Markov chain easily enough by passing in a list of the distributions.

```
In [23]:
```a = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
b = ConditionalProbabilityTable([['A', 'A', 0.10],
                                 ['A', 'C', 0.50],
                                 ['A', 'G', 0.30],
                                 ['A', 'T', 0.10],
                                 ['C', 'A', 0.10],
                                 ['C', 'C', 0.40],
                                 ['C', 'T', 0.40],
                                 ['C', 'G', 0.10],
                                 ['G', 'A', 0.05],
                                 ['G', 'C', 0.45],
                                 ['G', 'G', 0.45],
                                 ['G', 'T', 0.05],
                                 ['T', 'A', 0.20],
                                 ['T', 'C', 0.30],
                                 ['T', 'G', 0.30],
                                 ['T', 'T', 0.20]], [a])
model = MarkovChain([a, b])

```
```

Markov chains have `log_probability`, `fit`, `summarize`, and `from_summaries` methods implemented. They do not have classification capabilities by themselves, but when combined with a Naive Bayes classifier they can be used to discriminate between multiple models.

Let's see the log probability of some data.

```
In [24]:
```model.log_probability( list('CAGCATCAGT') )

```
Out[24]:
```

```
In [25]:
```model.log_probability( list('C') )

```
Out[25]:
```

```
In [26]:
```model.log_probability( list('CACATCACGACTAATGATAAT') )

```
Out[26]:
```

```
In [27]:
```model.fit( map( list, ('CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT') ) )
print model.log_probability( list('CAGCATCAGT') )
print model.log_probability( list('C') )
print model.log_probability( list('CACATCACGACTAATGATAAT') )

```
```

```
In [28]:
```print model.distributions[0]

```
```

```
In [29]:
```data = numpy.concatenate((numpy.random.normal(1, 2, (250, 1)), numpy.random.normal(7.85, 1.2, (500, 1))))
numpy.random.shuffle(data)
plt.figure(figsize=(10, 4))
plt.hist( data, edgecolor='c', color='c', bins=50 )
plt.show()

```
```

This data is clearly made up of two Gaussian distributions, one centered around 8, and one centered around 1. To be exact, the distributions from which the samples were generated are centered around 1 and 7.85 respectively. If we had labels for each sample, we could easily train a Naive Bayes classifier using this data, or just learn the two underlying distributions simply by grouping points with the same label and fitting a Gaussian to that data.

Unfortunately, we don't have labels for these data points. Without labels for this data, we can't directly fit a Gaussian distribution to some points. If we had explicit parameters for a distribution, we could try to infer labels based on the likelihood of each sample under each distribution (like a Naive Bayes from earlier), but we don't have explicit parameters for the distributions. We're stuck in this chicken-and-egg problem of being able to figure out one given the other, but ultimately having neither.

We can solve this problem using the Expectation-Maximization (EM) algorithm. The gist is that you start off with a rough estimate of the parameters (such as randomly choosing a sample to be the $\mu$ and using unit variance for each component in a 1d Gaussian mixture), and then you iterate between inferring labels and performing parameter updates.

**Expectation:** Calculate 'responsibility' of each component for each sample, which is the probability that component generated that sample. This sums to 1 for each sample.

**Maximization:** Use weighted maximum likelihood estimates to update the parameters of each component, with the weight being the probability of that component generating that sample.

The difficulty can be in getting an appropriate initial estimate. pomegranate currently defaults to using a single iteration of k-means clustering to get initial estimates. Those clusters are then used to fit the initial distributions, and EM is run to refine those estimates until convergence.
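The full loop can be sketched in a few lines of numpy. This is an illustration rather than pomegranate's actual implementation, and it substitutes a crude mean-split initialization for the k-means step:

```python
import numpy

numpy.random.seed(0)
data = numpy.concatenate([numpy.random.normal(1, 2, 250),
                          numpy.random.normal(7.85, 1.2, 500)])

# Crude initialization: split the data on its overall mean (a stand-in
# for the k-means step pomegranate uses).
split = data.mean()
mu = numpy.array([data[data < split].mean(), data[data >= split].mean()])
sigma = numpy.array([data[data < split].std(), data[data >= split].std()])
w = numpy.array([0.5, 0.5])

def pdf(x, mu, sigma):
    # Univariate Gaussian density, broadcast over components.
    return numpy.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * numpy.sqrt(2 * numpy.pi))

for _ in range(50):
    # Expectation: responsibility of each component for each point.
    r = w * pdf(data[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)

    # Maximization: weighted maximum likelihood updates.
    n = r.sum(axis=0)
    mu = (r * data[:, None]).sum(axis=0) / n
    sigma = numpy.sqrt((r * (data[:, None] - mu) ** 2).sum(axis=0) / n)
    w = n / len(data)
```

After convergence the component means land near the generating values (1 and 7.85), and the weights near one third and two thirds.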

```
In [30]:
```model = GeneralMixtureModel.from_samples(NormalDistribution, 2, data, verbose=True)

```
```

```
In [31]:
```a, b = model.distributions
plt.figure(figsize=(10, 4))
x = numpy.arange(-4, 12, 0.1)
p = numpy.exp(model.log_probability(x))
plt.plot(x, p, color='r')
plt.hist( data, edgecolor='c', color='c', normed=True, bins=50)
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(-4, 12)
plt.show()
print numpy.exp(model.weights)

```
```

It seems like it did a fairly good job on this simple example. The two learned distributions do seem, by eye, to fit the data well. In addition, the weights are pretty close to those of the process which generated the data (one third in the left hand group, two thirds in the right hand group).

We can now use these underlying distributions and Bayes' rule to calculate the probability that each sample was generated by each distribution (the posterior probability, P(M|D)). This is the exact same calculation as in the Naive Bayes classifier: we have some underlying models and we want to know which one was most likely to have generated the sample.
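Here is a sketch of that calculation for two univariate Gaussians, using hypothetical parameters resembling (but not exactly matching) those recovered above:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical component parameters and weights, close to the fit above.
params = [(1.0, 2.0), (7.85, 1.2)]
weights = [1.0 / 3.0, 2.0 / 3.0]

def posterior(x):
    # Bayes' rule: P(M_i | D) = w_i P(D | M_i) / sum over j of w_j P(D | M_j)
    joint = [w * normal_pdf(x, mu, sigma) for (mu, sigma), w in zip(params, weights)]
    total = sum(joint)
    return [j / total for j in joint]
```

A point near 0 is assigned almost entirely to the first component and a point near 8 to the second, while points between the two modes get genuinely soft assignments.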

```
In [32]:
```y = model.predict(data)
plt.figure(figsize=(10, 4))
plt.hist( data[y==0], edgecolor='b', color='b', normed=True, bins=25 )
plt.hist( data[y==1], edgecolor='r', color='r', normed=True, bins=25 )
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(-4, 12)
plt.show()

```
```

```
In [33]:
```x = numpy.array([numpy.arange(-3, 11, .2)]).T
y = model.predict_proba(x)
plt.figure(figsize=(10, 5))
plt.hist( data, edgecolor='c', color='c', normed=True, bins=50, alpha=0.3 )
plt.scatter(x, y[:,0]*0.30, c='b', edgecolor='b', label="P(Model 1|D)")
plt.scatter(x, y[:,1]*0.30, c='r', edgecolor='r', label="P(Model 2|D)")
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(numpy.arange(0, 0.31, 0.06), numpy.arange(0, 1.1, 0.2), fontsize=14)
plt.xlim(-4, 12)
plt.legend(loc=6, fontsize=14)
plt.show()

```
```

These results make sense, and they're also pretty cool: we can take entirely unlabelled data and derive results similar to those of a Naive Bayes classifier, which requires labelled data.

The reason the model is called a General Mixture Model is that it supports mixtures of all types of distributions, not just Gaussians. This is one reason I chose not to abbreviate it to GMM, just to make this point explicit. We can create a mixture model of exponential distributions just as easily.

```
In [34]:
```data = numpy.concatenate([ numpy.random.exponential(5, (1000, 1)), numpy.random.exponential(0.25, (1000, 1))])
plt.figure(figsize=(10, 5))
plt.hist( data[data < 20], edgecolor='c', color='c', normed=True, bins=100 )
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(loc=6, fontsize=14)
plt.show()

```
```

```
In [35]:
```model = ExponentialDistribution.from_samples(data)
x = numpy.arange(0, 20, 0.1)
y = map(model.probability, x)
plt.plot(x, y, color='r')
plt.hist(data[data < 20], edgecolor='c', facecolor='c', normed=True, bins=250)
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 20)
plt.legend(loc=6, fontsize=14)
plt.show()

```
```

```
In [36]:
```model = GeneralMixtureModel.from_samples(ExponentialDistribution, 2, data, verbose=True, stop_threshold=1e-2)

```
```

```
In [37]:
```a, b = model.distributions
print a
print b

```
```

```
In [38]:
```x = numpy.arange(0, 20, 0.1)
y = model.probability(x)
plt.plot(x, y, color='r')
plt.hist(data[data < 20], edgecolor='c', facecolor='c', normed=True, bins=250)
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 20)
plt.legend(loc=6, fontsize=14)
plt.show()

```
```

```
In [39]:
```print "Mixture Fit: ", model.log_probability(data).sum()
print "Exp 1 Fit : ", sum(a.log_probability(d) for d in data)
print "Exp 2 Fit : ", sum(b.log_probability(d) for d in data)

```
```

```
In [40]:
```model = GeneralMixtureModel([ExponentialDistribution(5), UniformDistribution(0, 20)])
model.fit(data, stop_threshold=1e-1, verbose=True)
x = numpy.arange(0, 20, 0.1)
y = model.probability(x)
plt.plot(x, y, color='r')
plt.hist(data[data < 20], edgecolor='c', facecolor='c', normed=True, bins=250)
plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 20)
plt.legend(loc=6, fontsize=14)
plt.show()

```
```