TIGRAMITE
TIGRAMITE is a Python module for time series analysis. It allows you to reconstruct graphical models (conditional independence graphs) from discrete or continuous-valued time series based on the PCMCI method, and to create high-quality plots of the results.
PCMCI is described here: J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, D. Sejdinovic, Detecting and quantifying causal associations in large nonlinear time series datasets. Sci. Adv. 5, eaau4996 (2019) https://advances.sciencemag.org/content/5/11/eaau4996
This tutorial explains how to use PCMCI to obtain optimal predictors. See the following paper for theoretical background: Runge, Jakob, Reik V. Donner, and Jürgen Kurths. 2015. “Optimal Model-Free Prediction from Multivariate Time Series.” Phys. Rev. E 91 (5): 052909. https://doi.org/10.1103/PhysRevE.91.052909.
Last, the following Nature Communications Perspective paper provides an overview of causal inference methods in general, identifies promising applications, and discusses methodological challenges (exemplified in Earth system sciences): https://www.nature.com/articles/s41467-019-10105-3
In [1]:
# Imports
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
## use `%matplotlib notebook` for interactive figures
# plt.style.use('ggplot')
import sklearn
import sklearn.linear_model
import sklearn.gaussian_process
import sklearn.preprocessing
import tigramite
from tigramite import data_processing as pp
from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr, GPDC, CMIknn, CMIsymb
from tigramite.models import LinearMediation, Prediction
In [2]:
np.random.seed(42)
T = 150
links_coeffs = {0: [((0, -1), 0.6)],
                1: [((1, -1), 0.6), ((0, -1), 0.8)],
                2: [((2, -1), 0.5), ((1, -1), 0.7)],  # ((0, -1), c)],
                }
N = len(links_coeffs)
data, true_parents = pp.var_process(links_coeffs, T=T)
dataframe = pp.DataFrame(data, var_names=[r'$X^0$', r'$X^1$', r'$X^2$'])
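Written out, links_coeffs specifies the following linear autoregressive process (the noise terms $\eta^j_t$ are generated by pp.var_process; independent Gaussian innovations are assumed here):

$X^0_t = 0.6\,X^0_{t-1} + \eta^0_t$
$X^1_t = 0.6\,X^1_{t-1} + 0.8\,X^0_{t-1} + \eta^1_t$
$X^2_t = 0.5\,X^2_{t-1} + 0.7\,X^1_{t-1} + \eta^2_t$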
We initialize the Prediction class with cond_ind_test=ParCorr(). Second, we choose sklearn.linear_model.LinearRegression() here for prediction. Last, we scale the data via data_transform; the class takes care of rescaling the data for prediction. The parameters train_indices and test_indices divide the data into a training set and a test set. The test set is optional, since new data can be supplied later. The training set is used to select predictors and fit the model.
In [3]:
pred = Prediction(dataframe=dataframe,
                  cond_ind_test=ParCorr(),  # alternative: CMIknn
                  prediction_model=sklearn.linear_model.LinearRegression(),
                  # prediction_model=sklearn.gaussian_process.GaussianProcessRegressor(),
                  # prediction_model=sklearn.neighbors.KNeighborsRegressor(),
                  data_transform=sklearn.preprocessing.StandardScaler(),
                  train_indices=range(int(0.8*T)),
                  test_indices=range(int(0.8*T), T),
                  verbosity=1
                  )
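With T = 150, the index ranges above give a simple chronological 80/20 split. A quick check, using only values defined above:

print(len(range(int(0.8 * T))))     # 120 training samples
print(len(range(int(0.8 * T), T)))  # 30 test samples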
Now we estimate causal predictors using get_predictors for the target variable 2, taking into account a maximum past lag of tau_max. We use pc_alpha=None, which optimizes this significance level based on the Akaike score. Note that the predictors are different for each prediction horizon. For example, at a prediction horizon of steps_ahead=1 we get the causal parents from the model plus some others:
In [4]:
target = 2
tau_max = 5
predictors = pred.get_predictors(
    selected_targets=[target],
    steps_ahead=1,
    tau_max=tau_max,
    pc_alpha=None
)
link_matrix = np.zeros((N, N, tau_max + 1), dtype='bool')
for j in [target]:
    for p in predictors[j]:
        link_matrix[p[0], j, abs(p[1])] = True
# Plot time series graph
tp.plot_time_series_graph(
    figsize=(6, 3),
    val_matrix=np.ones(link_matrix.shape),
    link_matrix=link_matrix,
    var_names=None,
    link_colorbar_label='',
)
Note that get_predictors is based only on the first step of PCMCI and skips the MCI step, since strictly controlled false positive rates are not that relevant for prediction and the first step alone is faster. Now we set steps_ahead=2 and get different predictors:
In [5]:
tau_max = 30
steps_ahead = 2
target = 2
all_predictors = pred.get_predictors(
    selected_targets=[target],
    steps_ahead=steps_ahead,
    tau_max=tau_max,
    pc_alpha=None
)
link_matrix = np.zeros((N, N, tau_max + 1), dtype='bool')
for j in [target]:
    for p in all_predictors[j]:
        link_matrix[p[0], j, abs(p[1])] = True
# Plot time series graph
tp.plot_time_series_graph(
    figsize=(6, 3),
    val_matrix=np.ones(link_matrix.shape),
    link_matrix=link_matrix,
    var_names=None,
    link_colorbar_label='',
)
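get_predictors returns a dictionary mapping each selected target to a list of (variable, lag) tuples; printing it is a quick way to inspect which lagged variables were selected (the exact output depends on the random sample; all lags are at most -steps_ahead):

# Inspect the selected predictors for the target variable
print(all_predictors[target])
# e.g. [(1, -2), (0, -2), ...]  -- (variable index, lag) pairs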
These predictors now efficiently avoid overfitting in the following model fit. Here one can specify whether multiple target variables should be fit at once (assuming that predictors have been estimated for all of them beforehand).
In [6]:
pred.fit(target_predictors=all_predictors,
         selected_targets=[target],
         tau_max=tau_max)
Out[6]:
Now we are ready to predict the target variable at the test samples:
In [7]:
predicted = pred.predict(target)
true_data = pred.get_test_array()[0]
plt.scatter(true_data, predicted)
plt.title(r"NRMSE = %.2f" % (np.abs(true_data - predicted).mean()/true_data.std()))
plt.plot(true_data, true_data, 'k-')
plt.xlabel('True test data')
plt.ylabel('Predicted test data')
Out[7]:
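Here NRMSE denotes the root-mean-square error normalized by the standard deviation of the true test values, so a value of 1 corresponds to simply predicting the test mean. As a small stand-alone helper (a sketch, not part of the tigramite API):

def nrmse(true, pred):
    """Root-mean-square error normalized by the std of the true values."""
    return np.sqrt(np.mean((true - pred) ** 2)) / true.std()

nrmse(true_data, predicted) reproduces the value shown in the plot title.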
We can also predict new data by supplying a new dataframe to new_data. Because we have more samples here, the NRMSE estimate is more reliable.
In [8]:
new_data = pp.DataFrame(pp.var_process(links_coeffs, T=200)[0])
predicted = pred.predict(target, new_data=new_data)
true_data = pred.get_test_array()[0]
plt.scatter(true_data, predicted)
plt.title(r"NRMSE = %.2f" % (np.abs(true_data - predicted).mean()/true_data.std()))
plt.plot(true_data, true_data, 'k-')
plt.xlabel('True test data')
plt.ylabel('Predicted test data')
Out[8]:
This prediction is much better than using all past variables, which leads to overfitting:
In [9]:
whole_predictors = {target: [(i, -tau) for i in range(N) for tau in range(1, tau_max + 1)]}
pred.fit(target_predictors=whole_predictors,
         selected_targets=[target],
         tau_max=tau_max)
# new_data = pp.DataFrame(pp.var_process(links_coeffs, T=100)[0])
predicted = pred.predict(target, new_data=new_data)
# predicted = pred.predict(target)
true_data = pred.get_test_array()[0]
plt.scatter(true_data, predicted)
plt.plot(true_data, true_data, 'k-')
plt.title(r"NRMSE = %.2f" % (np.abs(true_data - predicted).mean()/true_data.std()))
plt.xlabel('True test data')
plt.ylabel('Predicted test data')
Out[9]:
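One way to see why the full-past model overfits is to count regressors: the linear model above must fit N * tau_max = 3 * 30 = 90 coefficients from only 120 training samples, while the causally selected set is far smaller. A quick check on the dictionaries defined above:

# Number of regressors in each linear model
print(len(whole_predictors[target]))  # 90 lagged regressors (all past variables)
print(len(all_predictors[target]))    # only the few causally selected predictors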
Above, we rescaled the data before fitting, and the class then also rescales the test data internally. We can also leave the data unscaled:
In [10]:
pred = Prediction(dataframe=dataframe,
                  cond_ind_test=ParCorr(),
                  prediction_model=sklearn.linear_model.LinearRegression(),
                  # data_transform=sklearn.preprocessing.StandardScaler(),
                  train_indices=range(int(0.8*T)),
                  test_indices=range(int(0.8*T), T),
                  verbosity=1
                  )
pred.fit(target_predictors=all_predictors,
         selected_targets=[target],
         tau_max=tau_max)
predicted = pred.predict(target)
true_data = pred.get_test_array()[0]
plt.scatter(true_data, predicted)
plt.plot(true_data, true_data, 'k-')
plt.title(r"NRMSE = %.2f" % (np.abs(true_data - predicted).mean()/true_data.std()))
plt.xlabel('True test data')
plt.ylabel('Predicted test data')
Out[10]:
Note the different scales on the x- and y-axes.
Last, let's try a Gaussian process regressor in conjunction with GPDC for predictor selection. Here we pass a non-default kernel (an RBF plus a white-noise kernel) and alpha=0. to the GaussianProcessRegressor, because the sklearn defaults don't work well for this data.
In [11]:
tau_max = 10
steps_ahead = 2
target = 2
T = 500
dataframe = pp.DataFrame(pp.var_process(links_coeffs, T=T)[0])
pred = Prediction(dataframe=dataframe,
                  cond_ind_test=GPDC(),  # alternatives: CMIknn, ParCorr
                  prediction_model=sklearn.gaussian_process.GaussianProcessRegressor(
                      alpha=0.,
                      kernel=sklearn.gaussian_process.kernels.RBF()
                             + sklearn.gaussian_process.kernels.WhiteKernel()),
                  # prediction_model=sklearn.neighbors.KNeighborsRegressor(),
                  data_transform=sklearn.preprocessing.StandardScaler(),
                  train_indices=range(int(0.8*T)),
                  test_indices=range(int(0.8*T), T),
                  verbosity=1
                  )
all_predictors = pred.get_predictors(
    selected_targets=[target],
    steps_ahead=steps_ahead,
    tau_max=tau_max,
    pc_alpha=0.2
)
pred.fit(target_predictors=all_predictors,
         selected_targets=[target],
         tau_max=tau_max)
predicted = pred.predict(target)
true_data = pred.get_test_array()[0]
plt.scatter(true_data, predicted)
plt.plot(true_data, true_data, 'k-')
plt.title(r"NRMSE = %.2f" % (np.abs(true_data - predicted).mean()/true_data.std()))
plt.xlabel('True test data')
plt.ylabel('Predicted test data')
Out[11]: