Example in Python

This is a fairly minimal example, demonstrating the slightly different calling convention in the Python version of LRGS, compared with the R version.

One notable and practical difference is that the Python version is significantly slower than the R version, despite using vectorization more efficiently in places. In the benchmarks I've done so far, it takes between 1.3 and 3 times longer to produce the same number of samples, depending on the model being fit.

We'll just look at univariate regression, with p(x) a single Gaussian.



In [ ]:

    
import lrgs
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

One immediate difference is that we need to pass x and y as Numpy matrices, with each row corresponding to a data point. (So, in the univariate case, column vectors.)



In [ ]:

    
## Use the same setup as the simple example in the R documentation
## ... but with a bigger scatter and fewer data points, just because.
x = np.matrix(np.random.normal(0.0, 5.0, size=100)).T
y = np.matrix(np.random.normal(np.pi*x, np.exp(1.0)))



In [ ]:

    
plt.plot(x, y, 'o');

If we were going to pass measurement errors, it would be as a list of matrices, e.g.



In [ ]:

    
# M = [np.asmatrix(np.eye(2)*1e-4) for i in range(x.shape[0])]

Instead of doing everything in a single function call, like in R, we instead have one call to set up the model and a second to run the sampler. This does the first step (the first argument is the size of the Gaussian mixture):



In [ ]:

    
par = lrgs.Parameters_GaussMix(1, x, y)

This object holds the current values of all the model parameters (currently rough default guesses) and has methods to update individual parameters or all parameters. We could call these directly to sample the parameters "in place", but of course we actually want to store the chain. For that, we use a second declaration. Here the second argument is the length of the chain we want to run, and the third encodes the parameters that we want to store (in the usual LRGS way):



In [ ]:

    
chain = lrgs.Chain(par, 100, 'bsmt')

In the "run" call, we specify any parameters to be fixed (x and y in this case):



In [ ]:

    
chain.run(fix='xy')

The chain can now be converted to a dict or recarray (and further converted to a pandas data frame, for e.g.). Note that the "proper names" of the parameters to include must be specified here, and that the resulting keys differ slightly from the column naming scheme used in the R function Gibbs.post2dataframe. (In particular, indices are offset by underscores and start from zero, consistent with Python indexing.)



In [ ]:

    
dchain = chain.to_dict(["B", "Sigma", "mu", "Tau"])

For example, dchain["B_1_0"] is the slope; in R it would have been called B21.

Plot things. Red lines show the input values.



In [ ]:

    
keep = range(10, chain.nmc) # throw out a little burn-in



In [ ]:

    
plt.plot(dchain["B_0_0"][keep], 'o');
plt.axhline(y=0.0, color='r');



In [ ]:

    
plt.plot(dchain["B_1_0"][keep], 'o');
plt.axhline(y=np.pi, color='r');



In [ ]:

    
plt.plot(dchain["B_0_0"][keep], dchain["B_1_0"][keep], 'o');
plt.plot(0.0, np.pi, 'or');



In [ ]:

    
plt.plot(dchain["Sigma_0_0"][keep], 'o');
plt.axhline(y=np.exp(2.0), color='r');



In [ ]:

    
plt.plot(dchain["mu_0_0"][keep], 'o');
plt.axhline(y=0.0, color='r');



In [ ]:

    
plt.plot(dchain["Tau_0_0_0"][keep], 'o');
plt.axhline(y=5.0**2, color='r');