I was inspired by this blog post about Karl Pearson to learn a bit about his method-of-moments approach to solving mixture models. I also figured it would be a good way to learn sympy a little better.
The basic gist of this notebook: the moments of a mixture of Gaussians are nice polynomials in the means and variances. So if you calculate the moments from your data, you can set them equal to those polynomials and solve for the parameters of the mixture model. The catch is that for the numerical solver to converge, you have to pick a starting point somewhat near the real solution. Also, I don't know much about stats. Feel free to educate me.
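In symbols, the plan is to equate the first five sample moments to the corresponding mixture moments and solve the resulting polynomial system for the five parameters:

$$\hat{m}_n = \frac{1}{N}\sum_{i=1}^{N} x_i^n \;=\; r\,\mathbb{E}[X_1^n] + (1-r)\,\mathbb{E}[X_2^n], \qquad n = 1, \dots, 5,$$

where $X_1 \sim \mathcal{N}(\mu_1, \sigma_1)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2)$.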
I'm going to start writing more algorithmshop.com stuff just in IPython notebooks. This might make its way into a polished post someday, but I would rather just put things out there and talk to folks. I'm justinvf on Twitter. If you live in the bay (or on the internet) and want to talk math, please reach out! I think I made my site nicer than I'm willing to keep up regularly. I would rather be learning stuff than making my site pretty.
In [1]:
from sympy.stats import Normal, moment
from sympy import Symbol, simplify, init_printing, symbols, pretty_print
from sympy.solvers import nsolve
init_printing()

import numpy as np
from numpy.linalg import norm
np.random.seed(43)

from matplotlib import pyplot as plt
%matplotlib inline
In [2]:
# The moments to use in the calculations
MOMENTS_USED = [1, 2, 3, 4, 5]
# The parameters of the 2 Gaussians, and the mixture param (mu_1, sigma_1, mu_2, sigma_2, r)
REAL_PARAMS = (0, 2, 10, 3, .7)
NUM_POINTS = 5000
In [3]:
mu = Symbol('mu')
sigma = Symbol('sigma', positive=True)
X = Normal('X', mu, sigma)

# moment(X, i) is the raw moment E[X**i], a polynomial in mu and sigma.
moments = [simplify(moment(X, i)) for i in MOMENTS_USED]
for (i, m) in zip(MOMENTS_USED, moments):
    print('\nMoment {}'.format(i))
    pretty_print(m)
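For reference, the printed moments work out to the familiar polynomials:

$$\mathbb{E}[X] = \mu, \quad \mathbb{E}[X^2] = \mu^2 + \sigma^2, \quad \mathbb{E}[X^3] = \mu^3 + 3\mu\sigma^2,$$

$$\mathbb{E}[X^4] = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4, \quad \mathbb{E}[X^5] = \mu^5 + 10\mu^3\sigma^2 + 15\mu\sigma^4.$$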
Now let's generate data from a Gaussian mixture.
In [4]:
def generate_data(mu_1, sigma_1, mu_2, sigma_2, r, n):
    """mu and sigmas are the parameters of the two Gaussians.
    r is the probability of picking from the first Gaussian.
    n is the number of points to generate.
    """
    # For each point, flip a coin that comes up True with probability r,
    # then take that point from the first or second Gaussian accordingly.
    s = np.random.rand(n) < r
    return np.select([s, ~s],
                     [np.random.normal(mu_1, sigma_1, n),
                      np.random.normal(mu_2, sigma_2, n)])
In [5]:
fake_data = generate_data(*REAL_PARAMS, n=NUM_POINTS)
In [6]:
plt.hist(fake_data, bins=50);
If we assume our data is a mixture of two Gaussians, $N_1$ and $N_2$, with mixture parameter $r$, then we know what the moments of the data should look like: by the law of total expectation, each mixture moment is just the $r$-weighted average of the corresponding moments of $N_1$ and $N_2$.
In [7]:
mu_1, sigma_1, mu_2, sigma_2, r = symbols('mu_1 sigma_1 mu_2 sigma_2 r')
In [8]:
def mixture_moment(m):
    return (r * m.subs({mu: mu_1, sigma: sigma_1})
            + (1 - r) * m.subs({mu: mu_2, sigma: sigma_2}))

mixture_moments = [mixture_moment(m) for m in moments]
for (i, m) in zip(MOMENTS_USED, mixture_moments):
    print('\nMixture Moment {}'.format(i))
    pretty_print(m)
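The first mixture moment, for instance, is just the mixture mean $r\mu_1 + (1 - r)\mu_2$; the higher ones grow into the messier polynomials printed above.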
In [9]:
def numerical_moment(a, n):
    # Raw (non-central) sample moment: the mean of a**n, matching moment(X, n) above.
    return np.sum(a ** n) / len(a)

actual_moments = [numerical_moment(fake_data, i) for i in MOMENTS_USED]
for (i, m) in zip(MOMENTS_USED, actual_moments):
    print('\nActual Moment {}: {}'.format(i, m))
We have 5 moments and 5 unknowns ($r$ plus the 2 parameters for each of the 2 normal distributions). So now we should be able to use the numerical moments to solve for the parameters.
In [10]:
# Each of these should be solved for zero
equations = [m - value for (m, value) in zip(mixture_moments, actual_moments)]
for (i, m) in zip(MOMENTS_USED, equations):
    print('\nEqn for moment {}'.format(i))
    pretty_print(m)
In [11]:
print("True Paramaters for mu_1, sigma_1, mu_2, sigma_2, r: {}".format(REAL_PARAMS))
In [12]:
def solve_numerically(equations, initial_guess):
    # If I start off near-ish to the true point, then nsolve can find it.
    solved = nsolve(equations, (mu_1, sigma_1, mu_2, sigma_2, r), list(initial_guess))
    print('Numeric solution')
    pretty_print(solved.T)
    # Blarg. I get arbitrary-precision (mpmath) stuff out and I don't need that.
    to_float_array = lambda mpf_array: np.array(list(map(float, mpf_array)))
    distance_to_real_array = norm(np.array(REAL_PARAMS) - to_float_array(solved.T))
    # Not sure how I should really be talking about distance...
    print('L2 distance from true solution (after sampling {} sample points): {}'.format(
        NUM_POINTS, distance_to_real_array))
    return solved
If we solve numerically starting at the real parameters, then the remaining error is just sampling error.
In [13]:
solve_numerically(equations, np.array(REAL_PARAMS))
If we move the initial guess slightly, we are still fine.
In [14]:
solve_numerically(equations, np.array(REAL_PARAMS) + np.array([.3, .2, .2, .2, 0]))
But then if we start a little farther away...
In [15]:
solve_numerically(equations, np.array(REAL_PARAMS) + np.array([2, 1, -1, 1, .2]))
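An aside (my addition, not in the original notebook): when the starting point is too far off, `nsolve` typically fails with a `ValueError` raised by mpmath's underlying `findroot`, rather than quietly returning a bad answer. A hypothetical wrapper like this makes that failure mode explicit:

```python
def try_solve(equations, guess):
    """Hypothetical wrapper: return the solution, or None if nsolve
    cannot converge from this starting point."""
    try:
        return nsolve(equations, (mu_1, sigma_1, mu_2, sigma_2, r), list(guess))
    except ValueError as e:
        print('Solver failed to converge: {}'.format(e))
        return None
```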
If we looked at the data, we could make reasonable initial guesses for the mus and sigmas. Here is the histogram from above again:
In [16]:
plt.hist(fake_data, bins=50);
The left hump looks a little more massive, so I would guess $.7$ for $r$. Then $\mu_1$ looks to be around 0 and $\mu_2$ around 10. And I would guess a larger standard deviation for the second Gaussian.
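That eyeballing could even be automated. Here is a rough, hypothetical heuristic (my sketch, not part of the original method): split the data at its overall mean and use each half's statistics as the starting point.

```python
def eyeball_guess(data):
    """Crude initial guess: split at the overall mean and treat each
    side as (roughly) coming from one component."""
    cut = data.mean()
    left, right = data[data < cut], data[data >= cut]
    return np.array([left.mean(), left.std(),
                     right.mean(), right.std(),
                     len(left) / len(data)])
```

This will be biased when the humps overlap, but it only needs to land inside the solver's basin of attraction.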
In [17]:
solve_numerically(equations, np.array([0, 3, 10, 5, .7]))
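As a quick visual sanity check (my addition; it assumes `scipy` is available and uses the solution that `solve_numerically` returns), we can overlay the fitted mixture density on the histogram:

```python
from scipy.stats import norm as gaussian

solved = solve_numerically(equations, np.array([0, 3, 10, 5, .7]))
m1, s1, m2, s2, p = (float(v) for v in solved)
xs = np.linspace(fake_data.min(), fake_data.max(), 400)
pdf = p * gaussian.pdf(xs, m1, s1) + (1 - p) * gaussian.pdf(xs, m2, s2)
# density=True normalizes the histogram (normed=True on older matplotlib)
plt.hist(fake_data, bins=50, density=True)
plt.plot(xs, pdf);
```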
And that works... But it's not satisfying. I should read more papers about this :).
My main questions left from this exercise (many probably obvious; I know like zero stats):