author: Jacob Schreiber
contact: jmschreiber91@gmail.com
Neural networks have become exceedingly popular recently due, in part, to their ability to achieve state-of-the-art performance on a variety of tasks without requiring complicated feature extraction pipelines. These models are frequently applied to domains where there is a great deal of raw structured data, such as computer vision, where neighboring pixels are strongly correlated, and natural language processing, where words are organized and modified in specific ways to convey meaning.
Neural networks and probabilistic models have also been merged. For example, deep hidden Markov models (DHMMs) are models in which a neural network takes an observation, such as an image, as input and outputs the probability that the observation belongs to each state in the hidden Markov model. These probabilities are then treated as the likelihood function P(D|M) by the model, regularized using the transition matrix, and re-normalized to get the posterior probabilities. Another example is a deep mixture model, where expectation-maximization is used to train the neural network on unlabeled samples.
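To make this concrete, below is a minimal numpy sketch of a single filtering step in such a model. The names (network, transitions, belief) are hypothetical, and the code only illustrates the recipe described above rather than any particular library's implementation; it assumes network.predict returns a softmax over the hidden states, as a keras classifier would.

import numpy

def deep_hmm_step(network, transitions, belief, observation):
    # The network's softmax output for this observation is treated as the
    # likelihood P(D|M) of the observation under each hidden state.
    likelihood = network.predict(observation.reshape(1, -1))[0]

    # Regularize with the transition matrix: a prior over the next state
    # given the current belief about which state we are in.
    prior = belief.dot(transitions)

    # Re-normalize the product to get the posterior over hidden states.
    posterior = likelihood * prior
    return posterior / posterior.sum()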
Thus far, pomegranate has stuck to probabilistic models that are not coupled with a neural network. However, with the recent inclusion of custom distributions, one can use a quick hack in order to turn pomegranate's models into deep models.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy
import seaborn; seaborn.set_style('whitegrid')
from pomegranate import *
numpy.random.seed(0)
%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate
Two natural questions emerge when considering the notion of combining a neural network and a probabilistic model. The first is: what is the benefit? The second: how does one do that?
There are many benefits of merging a neural network and a probabilistic model. The simplest is that one can expand the expressiveness of probabilistic models by incorporating a neural network. Instead of forcing your data to come neatly in the form of a single distribution or a mixture of distributions, one can process data that comes in more structured forms such as images. Additionally, a neural network may be able to extract more complicated relationships amongst the features of existing data sets than a simple covariance matrix could.
Practically, this works by having a single neural network replace the entire set of distributions in your model, with the network having as many outputs as there were probability distributions. For example, an HMM with eight states would be coupled with a neural network that makes predictions over eight classes. Each observation in your sequence is first run through the neural network to get predictions for that observation. This calculation is essentially the same as calculating the probability of the observation given each state and then normalizing, i.e., taking a softmax.
We can hack pomegranate models by using a custom distribution that wraps our neural network (built using the package of your choice). The following class stores a pointer to your model and an identifier of which class it represents, and can train or make predictions for that class. When asked for a log probability, it returns the posterior probability of the observation given the class it's storing. When training, it stores the data and labels (its own class ID) and then performs a single round of training to update the network. Note that this is the exact code that is already implemented in pomegranate, except that this version is called "NeuralNetworkWrapperExample" instead of "NeuralNetworkWrapper", and it is shown here only for reference as to what's going on in the backend.
In [2]:
class NeuralNetworkWrapperExample():
    '''A wrapper for a neural network model for use in pomegranate.

    This wrapper will store a pointer to the model, as well as an indicator
    of what class it represents. It needs information about the number of
    dimensions of the input and the total number of classes in the input.
    It is currently built to work with keras models, but can be easily
    modified to work with the package of your choice.

    Training works in a somewhat hacky manner. Internally, pomegranate
    will scan over all components of the model, calling `summarize` for each
    one, and then `from_summaries` at the end. During the EM procedure, the
    samples and their associated weights are passed in to the `summarize`
    function. The associated weights are the responsibilities calculated
    during the EM algorithm. In theory, one could simply update the model
    using the samples and their corresponding weights. In practice, it's much
    better to reconstruct the whole responsibility matrix for the batch of
    data and then train on soft labels.

    The process for training is as follows. When pomegranate starts a round of
    optimization, this wrapper will store a pointer to the data set to the
    neural network model object. This data set is the same one passed to each
    NeuralNetworkWrapper, the only difference being the ~weights~. Thus, each
    successive call will store the weights that are passed in (the responsibilities
    from the EM algorithm) to an associated label matrix. The result is a single
    copy of the data batch and a corresponding matrix of soft labels. Keras
    allows us to train a classifier on soft labels, and this is the preferred
    strategy.

    Parameters
    ----------
    model : object
        The neural network model being utilized.

    i : int
        The class that this distribution represents.

    n_dimensions : int
        The number of dimensions in the input.

    n_classes : int
        The total number of classes that the model can output.
    '''

    def __init__(self, model, i, n_dimensions, n_classes):
        self.d = n_dimensions
        self.n_classes = n_classes
        self.model = model
        self.i = i

        self.model.X = []
        self.model.y = []
        self.model.w = []

    def log_probability(self, X):
        '''Return pseudo-log probabilities from the neural network.

        This method returns the log probability of the class that this
        wrapper represents given the model. Thus, it's not strictly a
        likelihood function, but rather, a posterior. However, because
        the HMM takes log probabilities, multiplies them with a prior,
        and then normalizes them, mathematically they work out to be
        equivalent.

        This method uses the `predict` function from the neural network,
        which should take in a single batch of data, and returns the
        posterior probability of each class given the network. Typically,
        this is calculated using a softmax function on the outputs. The
        output of this function should be a matrix of size (n, k), where
        n is the number of samples and k is the number of classes, where
        the sum over each sample is equal to 1.

        Parameters
        ----------
        X : numpy.ndarray, shape=(n, d)
            The batch of data to calculate probabilities over.
        '''

        return numpy.log(self.model.predict(X)[:, self.i])

    def summarize(self, X, w):
        '''When shown a batch of data, store the data.

        This will store the batch of data, and associated weights, to the
        object. The actual update occurs when `from_summaries` is called.

        Parameters
        ----------
        X : numpy.ndarray, shape=(n, d)
            The batch of data to be passed in.

        w : numpy.ndarray, shape=(n,)
            The associated weights. These can be uniform if unweighted.

        Returns
        -------
        None
        '''

        if self.i == 0:
            self.model.X = X.copy()
            self.model.y = numpy.zeros((X.shape[0], self.n_classes))

        self.model.y[:, self.i] = w

    def from_summaries(self, inertia=0.0):
        '''Perform a single gradient update to the network.

        This will perform a single gradient update step, using the
        `train_on_batch` method from the network. This is already implemented
        for keras networks, but for other networks this method will have to
        be implemented. It should take in a single batch of data, along with
        associated sample weights, and update the model weights.

        Parameters
        ----------
        inertia : double, optional
            This parameter is ignored for neural networks, but required for
            compatibility reasons.

        Returns
        -------
        None
        '''

        if self.i == 0:
            self.model.train_on_batch(self.model.X, self.model.y)

        self.clear_summaries()

    def clear_summaries(self):
        self.model.X = None
        self.model.y = None

    @classmethod
    def from_samples(self, X, weights):
        '''The training of this wrapper on data should be performed in the main model.

        This method should not be used directly to train the network.
        '''

        return self
The first model we'll merge with a deep network is the general mixture model. Here, our neural network can have any architecture as long as it outputs a number of classes equal to the number of components that you'd like in the mixture model. Our goal is then to train the neural network in an unsupervised manner using EM. This can be fragile and difficult to do correctly, so use it with care.
Let's demonstrate this on the MNIST handwritten digits data set, where we want to learn a model that can distinguish 2's from 9's without any training labels.
In [3]:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train[(y_train == 2) | (y_train == 9)]
X_test = X_test[(y_test == 2) | (y_test == 9)]
y_test = y_test[(y_test == 2) | (y_test == 9)]
n, d = X_train.shape[0], 784
X_train = X_train.reshape(n, d)
X_test = X_test.reshape(X_test.shape[0], d)
Now, we have to implement the initialization step by hand because it's currently not built in. First, we use K-means to cluster the data and assign preliminary labels. Then, we train our neural network for some amount of time on those labels. Lastly, we use EM to refine our labels in case the initial K-means clustering was not good enough. This procedure is essentially identical to the built-in initialization for mixtures, with the only difference being that all the complexities of training a neural network replace the simplicity of fitting a probability distribution.
In [4]:
from sklearn.cluster import KMeans
from pomegranate import GeneralMixtureModel
from pomegranate import NeuralNetworkWrapper
from keras.layers import Dense
from keras.models import Sequential
numpy.random.seed(111)
# Use k-means to get an initial clustering of our data
y_init = KMeans(2).fit_predict(X_train)
y_kmeans = numpy.zeros((n, 2))
y_kmeans[numpy.arange(n), y_init] = 1
# Define the neural network model. Here we use a simple model for illustration.
nn = Sequential([
    Dense(128, input_dim=d, activation='relu'),
    Dense(2, activation='softmax')
])
nn.compile(loss='categorical_crossentropy', optimizer='adam')
# Now we fit to the k-means defined labels.
nn.fit(X_train, y_kmeans, epochs=10, verbose=False)
# Here we define our wrappers---one for each component in our mixture / each class in our neural model
d1 = NeuralNetworkWrapper(nn, 0, d, 2)
d2 = NeuralNetworkWrapper(nn, 1, d, 2)
# We feed the wrapper into the mixture model as you would a normal distribution
model1 = GeneralMixtureModel([d1, d2])
# Now we fit the model using EM, just as one would with built-in distributions
model1.fit(X_train, max_iterations=10, stop_threshold=0.1)
# And make predictions just as one would with normal distributions
y_hat = model1.predict(X_test)
Let's take a look at the clusters that are found using this approach.
In [5]:
plt.figure(figsize=(10, 3))
X_test_r = X_test.reshape(X_test.shape[0], 28, 28)
for i in range(1, 11):
    plt.subplot(2, 10, i)
    plt.imshow(X_test_r[y_hat == 0][i], cmap='Greens')
    plt.axis('off')

    plt.subplot(2, 10, i+10)
    plt.imshow(X_test_r[y_hat == 1][i], cmap='Reds')
    plt.axis('off')

plt.tight_layout()
Looks good! One cluster clearly seems to be 2's, and one clearly appears to be 9's.
But how much of this was just because K-means did a good job of clustering MNIST, which is a notoriously easy data set? We can evaluate that by comparing the predictions from K-means alone against those from our deep mixture model. Let's print out the accuracy using the predicted labels, and again with those labels flipped, because we don't know which cluster ends up corresponding to which original label when using unsupervised training.
In [6]:
y_kmeans_hat = KMeans(2).fit(X_train).predict(X_test)
y_test[y_test == 2] = 0
y_test[y_test == 9] = 1
print("kmeans: ", (y_kmeans_hat == y_test).mean(), ((1 - y_kmeans_hat) == y_test).mean())
print("deep mixture model: ", (y_hat == y_test).mean(), ((1 - y_hat) == y_test).mean())
It looks like we get a small improvement in performance. Now, MNIST is notoriously easy, and one could likely get much better performance by tuning the K-means model. However, this does demonstrate that deep mixture models have potential for modeling much larger and more complicated phenomena.
The next type of model we could merge with a neural network is a hidden Markov model. In this situation, the classes in the neural network correspond to hidden states in the HMM, and the edges in the HMM correspond to probabilistic constraints on the network's predictions over a sequence.
To demonstrate how this would work, consider the situation in which one has a pre-trained object recognition network and wants to adapt it to work over videos (sequences of images) but doesn't have labels for each frame. The goal is not to refine the network, which has been pre-trained, but to train the HMM so that it learns a useful transition matrix that constrains the predictions the network would make on each frame in isolation.
Let's use the MNIST data set again, but this time use data from each of the 10 classes.
In [7]:
from keras.utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1] * X_train.shape[2])
y_train_ohe = to_categorical(y_train, num_classes=10)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1] * X_test.shape[2])
Because keras doesn't have any pre-trained MNIST models, let's pre-train a simple model on the training set and then freeze it. We use L2 regularization because unregularized dense networks with lots of parameters have a tendency to predict probabilities of 1 rather than smooth probability distributions.
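As a quick illustration of that tendency (this snippet is not part of the pipeline), the same relative preferences produce very different softmax outputs depending on the scale of the logits, which is what the L2 penalty keeps in check:

import numpy

def softmax(z):
    e = numpy.exp(z - z.max())
    return e / e.sum()

print(softmax(numpy.array([2., 1., 0.])))    # moderate logits: roughly [0.67, 0.24, 0.09]
print(softmax(numpy.array([20., 10., 0.])))  # large logits: essentially [1.0, 0.0, 0.0]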
In [8]:
from keras.regularizers import l2
# Define our neural network model. In this case it's a small dense network for fast training, but in
# practice it can be anything that has 10 classes as the output.
nn = Sequential([
    Dense(32, activation='relu', input_dim=784, kernel_regularizer=l2(1.)),
    Dense(10, activation='softmax')
])
nn.compile(loss='categorical_crossentropy', optimizer='adam')
# Now we fit our model to the data to get a "pre-trained" model
nn.fit(X_train, y_train_ohe, epochs=25, verbose=False, batch_size=256)
# After training we go through and freeze the model so that training our HMM later on doesn't affect the parameters
for layer in nn.layers:
    layer.trainable = False
nn.compile(loss='categorical_crossentropy', optimizer='adam')
Now we have to generate sequences of images, our "videos", from the training data. We add structure by enforcing that successive images show digits that either keep the same value or increase. For example, a video might show the digits 1, 1, 2, and 5 in that order, but not 1, 4, and 2. Our goal is for the HMM to learn this constraint.
In [9]:
def generate_videos(X_train, y_train, n=1000):
    seqs, labels = [], []

    for _ in range(n):
        seq, label, number = [], [], 0

        # Our videos can be a maximum of 25 images
        for i in range(25):
            chance = numpy.random.choice(4)

            # There's a 25% chance that the next image skips a number, but only as long
            # as the digit stays at or below 9, e.g. a 1 followed by a 3.
            if chance == 2:
                if number < 8:
                    number += 2

            # There's a 25% chance that the next image proceeds to the next number,
            # e.g. a 1 followed by a 2
            elif chance == 3:
                if number < 9:
                    number += 1

            # Otherwise (a 50% chance), the next image is of the same number,
            # e.g. a 1 followed by a 1

            # If we've seen at least two images already, there's an independent 10% chance
            # that the sequence terminates, to give us a distribution of sequence lengths
            if i > 1 and numpy.random.choice(10) == 9:
                break

            idx = numpy.random.randint((y_train == number).sum())
            seq.append(X_train[y_train == number][idx])
            label.append(number)

        labels.append(label)
        seqs.append(numpy.array(seq))

    return seqs, labels
X_train_seqs, y_train_seqs = generate_videos(X_train, y_train, 1000)
X_test_seqs, y_test_seqs = generate_videos(X_test, y_test, 100)
Next, let's hand-define a model where you can either stay in the same hidden state or proceed to any of the next three states. In this case, a state corresponds to a number / class in the MNIST data set. If one doesn't have prior knowledge of what the transition structure should look like, they can train the model similarly to the deep mixture model example above. However, hand-defining the structure here will allow us to see whether training the model using EM actually improves on rough hand-specified constraints.
In [10]:
d0 = NeuralNetworkWrapper(nn, 0, 28*28, 10)
d1 = NeuralNetworkWrapper(nn, 1, 28*28, 10)
d2 = NeuralNetworkWrapper(nn, 2, 28*28, 10)
d3 = NeuralNetworkWrapper(nn, 3, 28*28, 10)
d4 = NeuralNetworkWrapper(nn, 4, 28*28, 10)
d5 = NeuralNetworkWrapper(nn, 5, 28*28, 10)
d6 = NeuralNetworkWrapper(nn, 6, 28*28, 10)
d7 = NeuralNetworkWrapper(nn, 7, 28*28, 10)
d8 = NeuralNetworkWrapper(nn, 8, 28*28, 10)
d9 = NeuralNetworkWrapper(nn, 9, 28*28, 10)
s0 = State(d0, "s0")
s1 = State(d1, "s1")
s2 = State(d2, "s2")
s3 = State(d3, "s3")
s4 = State(d4, "s4")
s5 = State(d5, "s5")
s6 = State(d6, "s6")
s7 = State(d7, "s7")
s8 = State(d8, "s8")
s9 = State(d9, "s9")
model = HiddenMarkovModel()
model.add_states(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9)
model.add_transition(model.start, s0, 0.33)
model.add_transition(model.start, s1, 0.33)
model.add_transition(model.start, s2, 0.34)
model.add_transition(s0, s0, 0.24)
model.add_transition(s0, s1, 0.24)
model.add_transition(s0, s2, 0.24)
model.add_transition(s0, s3, 0.24)
model.add_transition(s0, model.end, 0.04)
model.add_transition(s1, s1, 0.24)
model.add_transition(s1, s2, 0.24)
model.add_transition(s1, s3, 0.24)
model.add_transition(s1, s4, 0.24)
model.add_transition(s1, model.end, 0.04)
model.add_transition(s2, s2, 0.24)
model.add_transition(s2, s3, 0.24)
model.add_transition(s2, s4, 0.24)
model.add_transition(s2, s5, 0.24)
model.add_transition(s2, model.end, 0.04)
model.add_transition(s3, s3, 0.24)
model.add_transition(s3, s4, 0.24)
model.add_transition(s3, s5, 0.24)
model.add_transition(s3, s6, 0.24)
model.add_transition(s3, model.end, 0.04)
model.add_transition(s4, s4, 0.24)
model.add_transition(s4, s5, 0.24)
model.add_transition(s4, s6, 0.24)
model.add_transition(s4, s7, 0.24)
model.add_transition(s4, model.end, 0.04)
model.add_transition(s5, s5, 0.24)
model.add_transition(s5, s6, 0.24)
model.add_transition(s5, s7, 0.24)
model.add_transition(s5, s8, 0.24)
model.add_transition(s5, model.end, 0.04)
model.add_transition(s6, s6, 0.24)
model.add_transition(s6, s7, 0.24)
model.add_transition(s6, s8, 0.24)
model.add_transition(s6, s9, 0.24)
model.add_transition(s6, model.end, 0.04)
model.add_transition(s7, s7, 0.25)
model.add_transition(s7, s8, 0.25)
model.add_transition(s7, s9, 0.25)
model.add_transition(s7, model.end, 0.25)
model.add_transition(s8, s8, 0.33)
model.add_transition(s8, s9, 0.33)
model.add_transition(s8, model.end, 0.34)
model.add_transition(s9, s9, 0.5)
model.add_transition(s9, model.end, 0.5)
model.bake()
Now we can make predictions using the maximum a posteriori (MAP) algorithm.
In [11]:
y_pred = numpy.concatenate([model.predict(seq) for seq in X_test_seqs])
Alternatively, we could apply our pre-trained model to each image individually to make predictions. If our deep HMM does not outperform this baseline, there is no benefit to combining the neural network with a probabilistic model.
In [12]:
y_nn_pred = numpy.argmax(nn.predict(numpy.concatenate(X_test_seqs)), axis=1)
Let's evaluate our models using simple accuracy at predicting the value of each image.
In [13]:
y_true = numpy.concatenate(y_test_seqs)
print("pre-trained model: ", (y_nn_pred == y_true).mean())
print("deep HMM before training: ", (y_pred == y_true).mean())
It looks like there is a significant improvement from using the HMM, even before refining the transition matrix with EM. This is because there are predictions that the pre-trained neural network alone got wrong, but that became correct once the transition matrix was taken into account. For example, the network alone might think that a 0 came after a 3, even though our transition matrix says that is impossible. Essentially, the HMM allows the model to take context into account.
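As a purely illustrative example of this effect (all of the numbers below are made up), suppose the previous frame was a 3 and the network is torn between predicting a 0 and a 4 for the current frame. The real model combines evidence across the whole sequence, but even this single-step picture shows how the transition matrix can flip a prediction:

# Hypothetical per-frame probabilities from the network: it slightly prefers 0 over 4.
frame_probs = numpy.array([0.40, 0.00, 0.00, 0.02, 0.38, 0.05, 0.05, 0.04, 0.03, 0.03])

# Transition probabilities out of s3 in the hand-built model above (ignoring the
# end state): a 3 can only be followed by a 3, 4, 5, or 6.
transition_from_3 = numpy.array([0.00, 0.00, 0.00, 0.24, 0.24, 0.24, 0.24, 0.00, 0.00, 0.00])

print(numpy.argmax(frame_probs))                      # 0: the network alone gets it wrong
print(numpy.argmax(frame_probs * transition_from_3))  # 4: the constrained prediction is right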
Now, what happens when we train the model? Remember that we froze our model, so this refinement is only done on the transition matrix of the HMM, not the parameters of the pre-trained neural network.
In [14]:
_ = model.fit(X_train_seqs, max_iterations=10, verbose=True)
In [15]:
y_pred2 = numpy.concatenate([model.predict(seq) for seq in X_test_seqs])
print("pre-trained model: ", (y_nn_pred == y_true).mean())
print("deep HMM before training: ", (y_pred == y_true).mean())
print("deep HMM after training: ", (y_pred2 == y_true).mean())
It looks like the constraints by themselves lead to a large improvement in performance, and refining the transition matrix leads to further gains.
Even if we don't know the structure of the HMM beforehand we can still get good performance.
In [16]:
nn = Sequential([
    Dense(32, activation='relu', input_dim=784, kernel_regularizer=l2(1.)),
    Dense(10, activation='softmax')
])
nn.compile(loss='categorical_crossentropy', optimizer='adam')
nn.fit(X_train, y_train_ohe, epochs=25, verbose=False, batch_size=256)
for layer in nn.layers:
    layer.trainable = False
nn.compile(loss='categorical_crossentropy', optimizer='adam')
y_nn_pred = numpy.argmax(nn.predict(numpy.concatenate(X_test_seqs)), axis=1)
distributions = [NeuralNetworkWrapper(nn, i, 28*28, 10) for i in range(10)]
model = HiddenMarkovModel.from_samples(distributions, 10, X_train_seqs, max_iterations=10, verbose=True)
y_pred = numpy.concatenate([model.predict(seq) for seq in X_test_seqs])
print("pre-trained model: ", (y_nn_pred == y_true).mean())
print("deep HMM before training: ", (y_pred == y_true).mean())
In a situation where one has labels for all images in a sequence, a recurrent neural network would likely outperform an HMM at this type of classification. However, when one doesn't have frame-level labels, deep HMMs can be a very powerful tool.