author: Jacob Schreiber
contact: jmschreiber91@gmail.com
Frequently, the data you have is not explained by a single underlying distribution. If we want to recover the underlying distributions, we need a model with multiple components. The following data is an example.
In [5]:
from pomegranate import *
%pylab inline
import numpy as np
In [6]:
data = np.concatenate( (np.random.randn(250, 1) * 2.75 + 1.25, np.random.randn(500, 1) * 1.2 + 7.85) )
np.random.shuffle(data)
plt.hist( data, edgecolor='c', color='c', bins=20 )
Out[6]:
We can create our initial estimate of what this distribution is using a General Mixture Model. This is a model comprised of multiple distributions, together with weights on those distributions representing the prior probability of a point falling under each distribution given no knowledge of the point itself (defaults to equal). We can have univariate mixture models by using univariate distributions, or multivariate mixture models by using multivariate distributions.
In [7]:
d = GeneralMixtureModel( [NormalDistribution(2.5, 1), NormalDistribution(8, 1)] )
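The model above uses the default of equal weights on the two components. If you have a prior belief about how often each component generates data, you could pass explicit priors instead. A minimal sketch, assuming the constructor accepts a weights argument matching the description above:
# A sketch: the same two-component mixture with explicit prior weights.
# Assumes the constructor takes a weights argument as described above.
d_weighted = GeneralMixtureModel(
    [NormalDistribution(2.5, 1), NormalDistribution(8, 1)],
    weights=np.array([1.0 / 3, 2.0 / 3])
)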
We can now predict the class labels of each point under this mixture.
In [8]:
labels = d.predict( data )
print labels[:5]
print "{} 1 labels, {} 0 labels".format( labels.sum(), labels.shape[0] - labels.sum() )
This is fairly close to the number of underlying points from each distribution, off by 17 for each label. We still don't know whether the individual labels are accurate, only that the counts are close.
In [9]:
plt.hist( data[ labels == 0 ], edgecolor='r', color='r', bins=20 )
plt.hist( data[ labels == 1 ], edgecolor='c', color='c', bins=20 )
Out[9]:
It is slightly more difficult to update the underlying components of the model because we don't have labels indicating which point came from which distribution. We could try to use the labels inferred from the model; the split above looks clean, but what if our initial estimate was not very good? It could be difficult to get a good update starting from a bad prior.
Another possibility is to predict the probability of each point under each component, to get a softer estimate of the labels. Let's take a look.
In [10]:
labels = d.predict_proba( data )
print labels[:5]
print labels.sum(axis=0)
This is slightly closer to the truth, off by 15.2 instead of 17, around 10% closer.
This is the beginning of a common unsupervised training algorithm called expectation maximization. It has two steps, expectation and maximization. The expectation step is what we just did: assigning weights based on the probability of each point being generated by each component. The next step, maximization, maximizes the probability that the distributions generated these points by performing weighted MLE.
This process must be iterated until convergence. Sometimes this requires only a single update, but for overlapping distributions (such as this one) it can sometimes take many iterations.
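pomegranate's fit method runs this loop for us, but it can help to see one iteration written out. The sketch below uses predict_proba for the expectation step and plain numpy for the weighted MLE updates; the weighted mean and standard deviation formulas are standard, not a pomegranate API.
# One EM iteration by hand, using d and data from above.
resp = d.predict_proba(data)      # E-step: responsibilities, shape (n, 2)
x = data.ravel()

for k in range(2):
    w = resp[:, k]
    mu = np.sum(w * x) / w.sum()                           # weighted mean
    sigma = np.sqrt(np.sum(w * (x - mu) ** 2) / w.sum())   # weighted standard deviation
    pi = w.mean()                                          # updated prior weight
    print("Component {}: mu={:.2f}, sigma={:.2f}, weight={:.2f}".format(k, mu, sigma, pi))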
In [11]:
d.fit( data, verbose=True )
Out[11]:
We can do the same with multivariate distributions just as easily.
In [12]:
mu = np.arange(5)
cov = np.eye(5)
mgs = [ MultivariateGaussianDistribution( mu*i, cov ) for i in range(5) ]
gmm = GeneralMixtureModel( mgs )
In [13]:
data = numpy.random.randn(1000, 5) * 5
for i in range(5):
    data[i::5] += np.arange(5)*i
Let's see how well some points fit under the mixture model.
In [14]:
for i in range(10):
print "Point {}: logp {}".format( i, gmm.log_probability(data[i]) )
In [15]:
gmm.fit(data, verbose=True, stop_threshold=1)
Out[15]:
Now let's see how well the previous points fit.
In [16]:
for i in range(10):
print "Point {}: logp {}".format( i, gmm.log_probability(data[i]) )
Looks like they're being fit significantly better than before! Training works.
In addition to mixture models over continuous distributions, we can also have mixture models over discrete distributions. This is useful in many bioinformatics contexts, particularly sequence analysis. Let's use the toy example of analyzing CG island distribution.
The problem is the following: DNA is made up of long sequences of the four canonical nucleotides, abbreviated 'A', 'C', 'G', and 'T'. These nucleotides are not distributed randomly; there is a significant amount of structure in the sequence, and a major field of bioinformatics is trying to interpret that structure. One structured element is the CG island, a region where Cs and Gs appear more commonly than in the background. If we want to estimate CG percentages in these islands, we can use a mixture model.
In [17]:
d1 = DiscreteDistribution( {'A' : 0.25, 'C': 0.25, 'G' : 0.25, 'T': 0.25 } ) # Background
d2 = DiscreteDistribution( {'A' : 0.05, 'C': 0.45, 'G' : 0.45, 'T': 0.05 } ) # CG rich regions
gmm = GeneralMixtureModel( [d1, d2] )
In [18]:
seq = numpy.array(list('CGACATCTGACTACGGCGCGCCTACTACTTGATCGATACGGCGTCAGCGACGACGATGATCGGCATCAGTCACTAC'))
gmm.fit(seq)
Out[18]:
In [19]:
print gmm.distributions
print
print numpy.exp(gmm.weights)
Looks like the concept was sound, in that there are many CG-rich regions, but our initial estimates of the percentages were off. We can use a GMM like the one above to both identify and study the composition of these regions at the same time, updating the parameters of the distributions using expectation-maximization, as sketched below. We will go into a more complex way of doing this using HMMs in the next tutorial.
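For instance, we can ask the trained mixture which component it thinks generated each position of the sequence. This is just a sketch, assuming predict accepts the sequence in the same form that fit did above:
# A sketch: per-nucleotide labels (0 = background, 1 = CG rich), assuming
# predict accepts the same 1D character array that fit accepted above.
labels = gmm.predict(seq)
print(''.join(seq))
print(''.join(map(str, labels)))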
General Mixture Models support serialization to JSON using to_json() and from_json( json ). This is useful if you want to train a GMM on a large amount of data, which can take a significant amount of time, and then use the model in the future without having to repeat this computationally intensive step.
In [20]:
print gmm.to_json()
Not the prettiest thing to look at right now. However, we can easily load it and use it immediately.
In [21]:
gmm_2 = GeneralMixtureModel.from_json( gmm.to_json() )
gmm_2.distributions
Out[21]:
gmm_2 is now ready to go, as if it were the original model!
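Since the typical workflow is to train once and reuse the model later, a sketch of round-tripping through a file on disk (the filename here is arbitrary) looks like this:
# Persist the trained model to disk and load it back later.
with open('cg_gmm.json', 'w') as f:
    f.write(gmm.to_json())

with open('cg_gmm.json') as f:
    gmm_loaded = GeneralMixtureModel.from_json(f.read())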
General Mixture Models are extremely powerful tools for unsupervised learning, and pomegranate makes it easy to define, fit, and make predictions with them. All the examples above used the same distribution type for every component of the mixture, but that is not a requirement; you can mix distribution types if your application calls for it, as in the sketch below.
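For example, a mixture of a normal component and an exponential component is defined the same way. This is just a sketch of the construction with arbitrary placeholder parameters, not tied to any particular dataset:
# A mixture with heterogeneous component types: one Gaussian, one exponential.
# Parameter values are arbitrary placeholders.
model = GeneralMixtureModel([NormalDistribution(5, 2), ExponentialDistribution(0.5)])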