Guide to using carma_pack

The purpose of this notebook is to illustrate most of the functionality of carma_pack so that users can start applying it to their own data.

Background

carma_pack implements Bayesian inference for Gaussian continuous-time autoregressive moving average (CARMA) models via Markov Chain Monte Carlo (MCMC) sampling. CARMA models provide a flexible framework for modeling irregularly-sampled time series. A zero-mean CARMA process $y(t)$ of order $(p, q)$ is defined according to the stochastic differential equation:

$$\frac{d^p y(t)}{dt^p} + \alpha_{p-1} \frac{d^{p-1} y(t)}{dt^{p-1}} + \ldots + \alpha_0 y(t) = \beta_q \frac{d^q \epsilon(t)}{dt^q} + \beta_{q-1} \frac{d^{q-1} \epsilon(t)}{dt^{q-1}} + \ldots + \epsilon(t).$$

Here, $\epsilon(t)$ is a Gaussian white noise process with mean zero and variance $\sigma^2$, and the free parameters are $\alpha$, $\beta$, and $\sigma^2$. A CARMA(p,q) process has a power spectrum given by a ratio of polynomials,

$$ P(f) = \sigma^2 \frac{\left| \sum_{j=0}^q \beta_j (2\pi i f)^j \right|^2}{\left| \sum_{k=0}^p \alpha_k (2\pi i f)^k \right|^2},$$

and an autocovariance function given by a sum of exponential decays and exponentially-damped sinusoids:

$$ R(\tau) = \sigma^2 \sum_{k=1}^p \frac{ \left[\sum_{l=0}^q \beta_l r_k^l \right] \left[\sum_{l=0}^q \beta_l (-r_k)^l \right] \exp(r_k \tau) }{ -2 \operatorname{Re}(r_k) \prod_{l=1, l \neq k}^p (r_l - r_k)(r^*_l + r_k) }.$$

The terms $r_k$ are the roots of the autoregressive polynomial $A(z) = \sum_{k=0}^p \alpha_k z^k$. Note that the exponentially-damped sinusoids occur when $r_k$ is complex. A CARMA process is stationary when $q < p$ and the real parts of $r_k$ are negative.

So what does all this math mean? Basically, that CARMA models have a very flexible parametric form for their power spectrum and autocorrelation function, and because of this they are able to model a broad range of non-deterministic time series. CARMA models therefore provide a flexible framework for analyzing irregularly-sampled time series when some degree of stochasticity is expected. They also have the advantage that many of the computations involved in fitting, interpolating, and forecasting can be performed efficiently using the Kalman filter and smoother. And finally, Gaussian measurement noise is naturally incorporated into the analysis. Further details can be found in the carma_pack paper (Kelly et al. 2014).
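
To make the power-spectrum formula concrete, here is a minimal numpy sketch that evaluates $P(f)$ directly from the polynomial ratio above. The helper name carma_psd is ours, not part of carma_pack; the package's own cm.power_spectrum routine, used later in this notebook, should give the same result for matching coefficient conventions.


In [ ]:
import numpy as np

def carma_psd(freqs, sigsqr, alpha, beta):
    # Evaluate P(f) = sigsqr * |B(2*pi*i*f)|^2 / |A(2*pi*i*f)|^2, where
    # alpha and beta list the coefficients from order zero upward
    # (alpha_0 ... alpha_p, beta_0 ... beta_q), matching the text.
    omega = 2.0 * np.pi * 1j * np.asarray(freqs)
    # np.polyval expects the highest-order coefficient first, so reverse.
    # Note: np.poly (used below for ar_coefs) already returns highest order
    # first, so those coefficients would need ar_coefs[::-1] reversed back
    # before being passed to this helper.
    numer = np.abs(np.polyval(beta[::-1], omega)) ** 2
    denom = np.abs(np.polyval(alpha[::-1], omega)) ** 2
    return sigsqr * numer / denom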

The use of Bayesian inference is important for these models. CARMA models typically exhibit multiple modes in their likelihood function and complicated parameter uncertainties. This means that using a single best-fit value and approximating the uncertainties as normally distributed may give a poor estimate of the uncertainty in, for example, the inferred power spectrum, and lead to degraded forecasting.

Simulating a CARMA process

To start I will show how to use carma_pack to compute and plot the stationary power spectrum and autocorrelation function of a CARMA process, and how to simulate a CARMA(5,3) process. First let's set the parameter values. Because the power spectrum of a CARMA process can be expressed as a sum of Lorentzian functions, below we parameterize the CARMA model in terms of the parameters of these Lorentzian functions. This can be more intuitive than using the values of $\alpha$ directly.


In [4]:
%matplotlib inline
import carmcmc as cm
import numpy as np
import matplotlib.pyplot as plt

# set the CARMA model parameters
sigmay = 2.3  # dispersion in the time series
p = 5  # order of the AR polynomial
mu = 17.0  # mean of the time series
qpo_width = np.array([1.0/100.0, 1.0/300.0, 1.0/200.0])  # widths of the Lorentzian components
qpo_cent = np.array([1.0/5.0, 1.0/25.0])  # centroids of Lorentzian components
ar_roots = cm.get_ar_roots(qpo_width, qpo_cent) # compute the roots r_k from the Lorentzian function parameters
ar_coefs = np.poly(ar_roots)
ma_coefs = np.array([1.0, 4.5, 1.25, 0.0, 0.0])
# convert CARMA model variance to variance in the driving white noise
sigsqr = sigmay ** 2 / cm.carma_variance(1.0, ar_roots, ma_coefs=ma_coefs)  # carma_variance computes the autocovariance function
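
As a quick sanity check of the stationarity condition stated above, we can verify that all of the AR roots have negative real parts (assuming, as the usage with np.poly suggests, that get_ar_roots returns a complex numpy array):


In [ ]:
# Stationarity requires Re(r_k) < 0 for every root of the AR polynomial.
print(np.all(ar_roots.real < 0.0))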

Let's plot the power spectrum of this CARMA process.


In [5]:
frequencies = np.logspace(-4.0, 0.0, 200)
pspec = cm.power_spectrum(frequencies, np.sqrt(sigsqr), ar_coefs, ma_coefs)
plt.loglog(frequencies, pspec)
plt.xlabel('Frequency')
plt.ylabel('Power Spectrum [variance / frequency]')


[Figure: log-log plot of the CARMA(5,3) power spectrum versus frequency]

The two peaks in the power spectrum correspond to quasi-periodic modes with centroids and widths given by the Lorentzian parameters of our CARMA(5,3) model. Also note that the turnover to a flat power spectrum occurs at a frequency given by the third Lorentzian width parameter. Let's take a look at the autocorrelation function:


In [6]:
lag = np.linspace(0.0, 300.0, 300)
autocovar = cm.carma_variance(sigsqr, ar_roots, ma_coefs, lag)
autocorr = autocovar / autocovar[0]  # normalize by stationary variance
plt.plot(lag, autocorr)
plt.xlabel('Lag')
plt.ylabel('Autocorrelation Function')


[Figure: autocorrelation function of the CARMA(5,3) process versus lag]

The autocorrelation function has the appearance of an exponentially-damped sinusoid. In reality, it is a sum of two exponentially-damped sinusoids and one exponential decay, with damping time scales inversely proportional to the widths of the Lorentzian functions. The oscillation frequencies of the two damped sinusoids are given by the centroids of the Lorentzian functions, as the sketch below illustrates.
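
To make the connection concrete, the sketch below overlays a single damped sinusoid built from the first Lorentzian's width and centroid on the full autocorrelation function. The pairing of widths with centroids is assumed to follow the order passed to get_ar_roots above, and the amplitude and phase of each term in $R(\tau)$ depend on all of the CARMA parameters, so only the damping and oscillation time scales are expected to line up.


In [ ]:
# Illustrative only: exp(Re(r_k) * tau) * cos(Im(r_k) * tau) for the root pair
# associated with the first Lorentzian, i.e. damping rate 2*pi*qpo_width[0]
# and oscillation frequency qpo_cent[0]. Amplitude and phase are not matched.
damped_sin = np.exp(-2.0 * np.pi * qpo_width[0] * lag) * np.cos(2.0 * np.pi * qpo_cent[0] * lag)
plt.plot(lag, autocorr, label='CARMA(5,3) ACF')
plt.plot(lag, damped_sin, '--', label='single damped sinusoid')
plt.xlabel('Lag')
plt.legend()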

We can use carma_pack to simulate a CARMA process using these parameters. Let's go ahead and simulate a CARMA process irregularly-sampled over three seasons:


In [7]:
ny = 270
t = np.zeros(ny)
dt = np.random.uniform(1.0, 3.0, ny)
t[0:90] = np.cumsum(dt[0:90])  # season 1
t[90:2*90] = 180 + t[90-1] + np.cumsum(dt[90:2*90])  # season 2, 180 time units between seasons
t[2*90:] = 180 + t[2*90-1] + np.cumsum(dt[2*90:])  # season 3
y = mu + cm.carma_process(t, sigsqr, ar_roots, ma_coefs=ma_coefs)
for i in range(3):
    plt.plot(t[90*i:90*(i+1)], y[90*i:90*(i+1)], 'k')
    plt.plot(t[90*i:90*(i+1)], y[90*i:90*(i+1)], 'b.')
plt.xlabel('Time')
plt.ylabel('CARMA(5,3) Process')
plt.xlim(0, t.max())


[Figure: simulated CARMA(5,3) process, irregularly sampled over three seasons]

Analysis of a Real Time Series

While plotting power spectra and time series generated from different CARMA processes is helpful for exploring the 'machinery' of the CARMA process, the primary purpose of carma_pack is to model real time series. Here I will illustrate how to do this using an astronomical time series. I will use a time series of magnitudes (a measure of brightness) for a variable star from the OGLE-III variable star catalog. This time series is also contaminated by Gaussian measurement error. Further details may be found in our paper.

First let's load the data and plot the time series:


In [8]:
data = np.genfromtxt('OGLE-LMC-LPV-00007.dat')
juldate = data[:,0]  # time is given by the julian date
t = juldate - juldate.min()  # set first time value to zero for convenience. units of t are days.
y = data[:,1]  # i-band magnitudes
yerr = data[:,2]  # standard deviation in the measurement noise

plt.errorbar(t, y, yerr=yerr)
plt.plot(t, y, 'k.')
plt.xlim(t.min(), t.max())
plt.xlabel('Time [days]')
plt.ylabel('magnitude')


[Figure: OGLE-III variable star time series with measurement error bars]

Running the MCMC Sampler

There are two classes in carma_pack: CarmaModel and CarmaSample (the class Car1Sample is just a special case of CarmaSample for $p = 1, q = 0$). The CarmaModel class defines a CARMA model, while the CarmaSample class holds a sample of CARMA model parameters generated by the MCMC sampler.

First we need to create a CARMA model:


In [9]:
model = cm.CarmaModel(t, y, yerr, p=6, q=0)

The CARMA model order input is optional. If we want to choose the CARMA order automatically by minimizing the AICc, we can use the CarmaModel.choose_order method. Because this can take a long time, here we simply set the order to $(p,q)=(6,0)$, as this is what we found in our paper from minimizing the AICc.

A note on choosing the order of the CARMA model: in order to calculate the maximum-likelihood estimate needed for the AICc, we use scipy.optimize. Unfortunately, these optimization algorithms seem to be unstable for CARMA models, so choose_order may perform poorly. An alternative approach is to use the CarmaSample.DIC method, which calculates the DIC from the MCMC samples. Because the MCMC sampler is more robust than the optimizer, this provides a more reliable estimate. However, because MCMC is more expensive than computing the maximum-likelihood estimate, I usually just search over $p$ and set $q = p - 1$, as sketched below.
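
Here is a hedged sketch of that search strategy. The no-argument DIC() call is an assumption on our part (consult the carma_pack documentation for the exact signature), and the sample size is kept small purely to save time; running this loop still means several full MCMC runs.


In [ ]:
# Hedged sketch of a manual order search: fit CARMA(p, p-1) for a range of p
# and keep the order with the lowest DIC (lower is better).
best_dic, best_order = np.inf, None
for ptry in range(1, 8):
    qtry = max(ptry - 1, 0)
    trial_sample = cm.CarmaModel(t, y, yerr, p=ptry, q=qtry).run_mcmc(10000)
    dic = trial_sample.DIC()  # assumed call signature
    if dic < best_dic:
        best_dic, best_order = dic, (ptry, qtry)
print(best_order)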

Now that we have a CarmaModel object, let's run the MCMC sampler. We will generate 20,000 samples of the CARMA parameters from their probability distribution given the measured time series (this is called the 'posterior distribution'). We can also specify the number of burn-in iterations, the number of chains used by the parallel tempering algorithm, and the thinning interval. Here we just use the default values.

Running the MCMC sampler will probably take about 10-15 minutes.


In [10]:
%%capture capt
sample = model.run_mcmc(20000)

Assessing the Appropriateness of a CARMA Model

Before using our samples, let's first make sure the CARMA model provides an adequate description of the time series.


In [11]:
sample.assess_fit()


This routine generates four panels that we can use to assess the quality of the fit. The upper left panel compares the data with the interpolated CARMA model and its standard error, based on the parameters that had the highest posterior probability. The upper right panel shows the residuals from the one-step-ahead predictions, standardized by their standard deviations. If the assumption of a Gaussian CARMA process is adequate, the residuals should be uncorrelated and follow a standard normal distribution (shown by the orange line). The two lower panels show the autocorrelation functions of the residuals and of their squares, along with the 95% confidence intervals expected for white noise. Everything seems to check out: there is no strong reason to reject the Gaussian CARMA process as an adequate model for this time series.
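
For reference, the bands in the lower two panels are presumably the standard white-noise (Bartlett) bands: for white noise, the sample autocorrelation at a nonzero lag is approximately distributed as $N(0, 1/N)$, so the 95% band can be computed directly (an assumption about what assess_fit plots; check against the figure):


In [ ]:
# For white noise, the sample ACF at nonzero lags is ~ N(0, 1/N),
# so the 95% confidence band is approximately +/- 1.96 / sqrt(N).
conf_band = 1.96 / np.sqrt(len(y))
print(conf_band)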

Plotting the Inferred Power Spectrum

Now let's see what the power spectrum looks like:


In [12]:
psd_low, psd_hi, psd_mid, frequencies = sample.plot_power_spectrum(percentile=95.0, nsamples=5000)


Here we have plotted the region containing 95% of the posterior probability of the power spectrum as a function of frequency. The routine CarmaSample.plot_power_spectrum returns a tuple containing the lower bound on the power spectrum, the upper bound, the posterior median, and the frequencies used. We tell the routine to use only 5000 samples when computing the power spectrum bounds in order to increase the speed.

It can be helpful to get a sense of where the measurement noise level lies on the power spectrum. We can estimate this as $2 \bar{\Delta t} \bar{\sigma^2_y}$, where $\bar{\Delta t}$ and $\bar{\sigma^2_y}$ are the average time spacing and the average measurement noise variance, respectively. The median may also be used and may be preferable in certain cases (a median-based variant is sketched after the next cell). Because there is no point in interpreting the power spectrum too far below the measurement noise level, we reset the vertical plot limits to focus on the region not dominated by measurement noise.


In [13]:
dt = t[1:] - t[:-1]
noise_level = 2.0 * np.mean(dt) * np.mean(yerr ** 2)
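
As noted above, a median-based variant of the same estimate may be preferable when the time sampling or the quoted error bars contain outliers:


In [ ]:
# Median-based variant of the noise-level estimate; more robust to outliers
# in the time spacing or the measurement errors.
noise_level_med = 2.0 * np.median(dt) * np.median(yerr ** 2)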

In [14]:
plt.loglog(frequencies, psd_mid)
plt.fill_between(frequencies, psd_hi, y2=psd_low, alpha=0.5)
plt.loglog(frequencies, np.ones(frequencies.size) * noise_level, color='grey', lw=2)
plt.ylim(noise_level / 10.0, plt.ylim()[1])
plt.xlim(frequencies.min(), frequencies[psd_hi > noise_level].max() * 10.0)
plt.ylabel('Power Spectrum')
plt.xlabel('Frequency [1 / day]')
plt.annotate("Measurement Noise Level", (1.25 * plt.xlim()[0], noise_level / 1.5))


[Figure: inferred power spectrum with 95% credibility region and measurement noise level]

Forecasting and Interpolation

For many applications we will want to predict future values of the time series. We can use the CARMA model to forecast future values given the measured time series, and carma_pack provides two routines for this. The first is CarmaSample.predict, which computes the expected value of future data points and their variance given the best-fit CARMA model (the default is to use the model with the highest posterior probability). Here is an example:


In [15]:
tpredict = t.max() + np.linspace(0.0, 250.0, 250)
ypredict, yp_var = sample.predict(tpredict)
plt.errorbar(t, y, yerr=yerr, color='black')
plt.plot(t, y, 'ko', ms=4)
plt.fill_between(tpredict, ypredict+np.sqrt(yp_var), y2=ypredict-np.sqrt(yp_var), color='DodgerBlue', alpha=0.5)
plt.plot(tpredict, ypredict, 'b-')
plt.xlim(tpredict.min() - 100, tpredict.max())
plt.xlabel('Time [days]')
plt.ylabel('magnitude')
plt.title('Forecasting, Expected Value')


[Figure: forecast expected value with standard-error band]

The second option we have for forecasting is to simulate a CARMA process at future times. This is attractive because we can simulate many future paths given the measured time series: first randomly choose one set of CARMA parameters from those returned by the MCMC sampler, then generate a time series based on those parameters. This way we get a distribution of future paths after marginalizing over the uncertainty in the CARMA parameters, which is important because ignoring that uncertainty when predicting future values can lead to an overly optimistic estimate of the uncertainty in the forecast.

Here's how it works:


In [16]:
tpredict = t.max() + np.linspace(0.0, 200.0, 250)
npaths = 5
plt.errorbar(t, y, yerr=yerr, color='black')
plt.plot(t, y, 'ko', ms=4)
for i in range(npaths):
    ysim = sample.simulate(tpredict, bestfit='random')  # use a random draw of the CARMA parameters from their posterior
    plt.plot(tpredict, ysim)
plt.xlabel('Time [days]')
plt.ylabel('magnitude')
plt.xlim(tpredict.min()-100, tpredict.max())
plt.title('Forecasting, Simulated Paths')


[Figure: five simulated forecast paths]

So, in summary: in the preceding plot we simulated paths conditional only on the measured time series, while in the first option we computed the expected path and its variance conditional on both the measured time series and the best-fit value of the CARMA parameters.

Let's compare the probability distribution for a 10-day-ahead forecast when using the best-fit value and when averaging over the uncertainty in the CARMA model parameters.


In [35]:
tpredict = t.max() + 10.0  # forecast 10 days past the last measurement
nsim = 10000
ysim = np.zeros(nsim)
for i in range(nsim):
    ysim[i] = sample.simulate(tpredict, bestfit='random')
yhat, yhvar = sample.predict(tpredict)
plt.hist(ysim, bins=100, alpha=0.5, histtype='stepfilled', normed=True, label='Includes Uncertainty')
ymin, ymax = plt.xlim()
ygrid = np.linspace(ymin, ymax, 200)
predictive_pdf = 1.0 / np.sqrt(2.0 * np.pi * yhvar) * np.exp(-0.5 * (ygrid - yhat) ** 2 / yhvar)
plt.plot(ygrid, predictive_pdf, color='DarkOrange', lw=2, label='Best-fit')
plt.xlim(ymin, ymax + 0.1 * (ymax - ymin))
plt.legend(loc='upper right')


[Figure: 10-day-ahead forecast distributions, with and without parameter uncertainty]

In this case, the probability distributions for the two forecasts are close, but we can see that the forecasts are slightly tilted toward larger values when averaging over the uncertainty in the CARMA parameters.

Of course, we are not limited to forecasting. We can also backcast or interpolate between measured values:


In [23]:
tinterp = t.max() - 1000 + np.linspace(0.0, 1000.0, 500)
npaths = 3
for i in range(npaths):
    ysim = sample.simulate(tinterp, bestfit='random')
    plt.plot(tinterp, ysim)
plt.plot(t, y, 'ko', ms=4)
plt.xlabel('Time [days]')
plt.ylabel('magnitude')
plt.xlim(tinterp.min(), tinterp.max())
plt.title('Interpolation, Simulated Paths')


[Figure: three simulated interpolation paths through the measured time series]

Working with the Sampled Parameter Values

Finally, in many cases we will want to access the MCMC samples of the CARMA parameters. The list of parameter names is stored as a member of the CarmaSample object:


In [24]:
sample.parameters


Out[24]:
['quad_coefs',
 'logpost',
 'ar_coefs',
 'mu',
 'psd_centroid',
 'loglik',
 'psd_width',
 'var',
 'measerr_scale',
 'sigma',
 'ma_coefs',
 'ar_roots']

The parameters are:

  • logpost: The logarithm of the posterior probability distribution.
  • loglik: The logarithm of the likelihood function.
  • mu: The mean of the CARMA model.
  • var: The variance of the CARMA model.
  • sigma: The standard deviation in the driving white noise.
  • ar_coefs: The coefficients for the autoregressive polynomial, $\alpha$.
  • ma_coefs: The coefficients of the moving average polynomial, $\beta$.
  • quad_coefs: This is the parameterization of the autoregressive parameters used in the MCMC sampler. These will almost never be of interest.
  • ar_roots: The roots of the autoregressive polynomial, $r_k$.
  • psd_centroid: The centroid frequencies of the Lorentzian functions, under the sum-of-Lorentzians interpretation of the power spectrum.
  • psd_width: The widths of the Lorentzian functions.
  • measerr_scale: A scaling parameter applied to the supplied measurement noise variances, constrained to lie between 0.5 and 2. The true measurement noise variances are assumed to be $A\sigma^2_y$, where $A$ is the scaling parameter.

We can get the sampled values using the CarmaSample.get_samples method. For example,


In [25]:
centroid_samples = sample.get_samples('psd_centroid')
centroid_samples.shape


Out[25]:
(20000, 6)
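
The returned samples are ordinary numpy arrays (one row per MCMC iteration, one column per element), so we can compute any summary directly. For example, the median and 95% credibility interval for the first element, which we can check against the summary routine shown next:


In [ ]:
# Direct numpy summaries of the MCMC samples for element 0.
elem = centroid_samples[:, 0]
print(np.median(elem))
print(np.percentile(elem, [2.5, 97.5]))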

We can also get a useful summary of the parameter's posterior:


In [26]:
sample.posterior_summaries('psd_centroid')


Calculating effective number of samples
Posterior summary for parameter psd_centroid  element 0
----------------------------------------------
Effective number of independent samples: 155.756401975
Median: 0.0652739872614
Standard deviation: 0.200251281333
68% credibility interval: [0.062688402152411032, 0.29372709466078145]
95% credibility interval: [0.060629337872255225, 0.82109971864699494]
99% credibility interval: [0.056881915981414073, 0.97260754457123588]
Posterior summary for parameter psd_centroid  element 1
----------------------------------------------
Effective number of independent samples: 155.756401975
Median: 0.0652739872614
Standard deviation: 0.200251281333
68% credibility interval: [0.062688402152411032, 0.29372709466078145]
95% credibility interval: [0.060629337872255225, 0.82109971864699494]
99% credibility interval: [0.056881915981414073, 0.97260754457123588]
Posterior summary for parameter psd_centroid  element 2
----------------------------------------------
Effective number of independent samples: 126.267736744
Median: 0.0433002590408
Standard deviation: 0.00969813568924
68% credibility interval: [0.039999618200317458, 0.056962709650421328]
95% credibility interval: [0.036181372287221548, 0.062752578989610897]
99% credibility interval: [0.0, 0.066098175988176952]
Posterior summary for parameter psd_centroid  element 3
----------------------------------------------
Effective number of independent samples: 126.267736744
Median: 0.0433002590408
Standard deviation: 0.00969813568924
68% credibility interval: [0.039999618200317458, 0.056962709650421328]
95% credibility interval: [0.036181372287221548, 0.062752578989610897]
99% credibility interval: [0.0, 0.066098175988176952]
Posterior summary for parameter psd_centroid  element 4
----------------------------------------------
Effective number of independent samples: 237.598760109
Median: 0.0
Standard deviation: 0.00117745379968
68% credibility interval: [0.0, 0.0]
95% credibility interval: [0.0, 0.0051714369313628772]
99% credibility interval: [0.0, 0.0073833389635405049]
Posterior summary for parameter psd_centroid  element 5
----------------------------------------------
Effective number of independent samples: 237.598760109
Median: 0.0
Standard deviation: 0.00117745379968
68% credibility interval: [0.0, 0.0]
95% credibility interval: [0.0, 0.0051714369313628772]
99% credibility interval: [0.0, 0.0073833389635405049]

We can plot the trace of a parameter (value of the parameter as a function of MCMC iteration), a histogram of its values, and the autocorrelation function of the MCMC samples. For example, to plot these quantities for the width of the fifth Lorentzian function:


In [27]:
sample.plot_parameter('psd_width', 4, doShow=True)


Plotting parameter summary

Finally, we can make nice plots of the estimated joint posterior distributions of pairs of parameters using CarmaSample.plot_2dkde. For example, to plot the joint distribution of the time series mean and the measurement error scale parameter we would do


In [28]:
sample.plot_2dkde('mu', 'measerr_scale', doShow=True)


Plotting 2d PDF w KDE

Similarly, we could do


In [29]:
sample.plot_2dpdf('mu', 'measerr_scale', doShow=True)


Plotting 2d PDF

That's it! Enjoy using carma_pack on your own data.