Math behind LinearExplainer with correlation feature perturbation

When we use LinearExplainer(model, prior, feature_perturbation="correlation_dependent") we do not use $E[f(x) \mid do(X_S = x_S)]$ to measure the impact of a set $S$ of features, but rather use $E[f(x) \mid X_S = x_s]$ under the assumption that the random variable $X$ (representing the input features) follows a multivariate guassian distribution. To compute SHAP values this way we need to compute conditional expectations under the multivariate guassian distribution for all subset of features. This would be a lot of matrix match for an exponential number of terms, and it hence intractable for models with more than just a few features.

This document briefly outlines the math we have used to precompute all of the required linear algebra using a sampling procedure that can be done just once, and then applied to as many samples as we like. This drastically speed up the computation compared to a brute force approach. Note that all these calculations depend on the fact that we are explaining a linear model $f(x) = \beta x$.

The permutation definition of SHAP values in the interventional form used by most explainers is

$$ \phi_i = \frac{1}{M!} \sum_R E[f(X) \mid do(X_{S_i^R \cup i} = x_{S_i^R \cup i})] - E[f(X) \mid do(X_{S_i^R} = x_{S_i^R})] $$

but here we will use the non-interventional conditional expectation form (where we have simplified the notation by dropping the explicit reference to the random variable $X$).

$$ \phi_i = \frac{1}{M!} \sum_R E[f(x) \mid x_{S_i^R \cup i}] - E[f(x) \mid x_{S_i^R}] $$

where $f(x) = \beta x + b$ with $\beta$ a row vector and $b$ a scalar.

If we replace f(x) with the linear function definition we get:

\begin{align} \phi_i = \frac{1}{M!} \sum_R E[\beta x + b \mid x_{S_i^R \cup i}] - E[\beta x + b \mid x_{S_i^R}] \\ = \beta \frac{1}{M!} \sum_R E[x \mid x_{S_i^R \cup i}] - E[x \mid x_{S_i^R}] \end{align}

Assume the inputs $x$ follow a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$. Denote the projection matrix that selects a set $S$ as $P_S$, then we get:

\begin{align} E[x \mid x_S] = [P_{\bar S} \mu + P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} ( P_S x - P_S \mu)] P_{\bar S} + x P_S^T P_S \\ = [P_{\bar S} \mu + P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S (x - \mu)] P_{\bar S} + x P_S^T P_S \\ = [\mu + \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S (x - \mu)] P_{\bar S}^T P_{\bar S} + x P_S^T P_S \\ = P_{\bar S}^T P_{\bar S} [\mu + \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S (x - \mu)] + P_S^T P_S x \\ = P_{\bar S}^T P_{\bar S} \mu + P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S x - P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S \mu + P_S^T P_S x \\ = [P_{\bar S}^T P_{\bar S} - P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S] \mu + [P_S^T P_S + P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S] x \end{align}

if we let $R_S = P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S$ and $Q_S = P_S^T P_S$ then we can write

\begin{align} E[x \mid x_S] = [Q_{\bar S} - R_S] \mu + [Q_S + R_S] x \end{align}

\begin{align} E[x \mid x_{S_i^R \cup i}] = [Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] \mu + [Q_{S_i^R \cup i} + R_{S_i^R \cup i}] x \end{align}

leading to the Shapley equation of

\begin{align} \phi_i = \beta \frac{1}{M!} \sum_R [Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] \mu + [Q_{S_i^R \cup i} + R_{S_i^R \cup i}] x - [Q_{\bar{S_i^R}} - R_{S_i^R}] \mu - [Q_{S_i^R} + R_{S_i^R}] x \\ = \beta \frac{1}{M!} \sum_R ([Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] - [Q_{\bar{S_i^R}} - R_{S_i^R}]) \mu + ([Q_{S_i^R \cup i} + R_{S_i^R \cup i}] - [Q_{S_i^R} + R_{S_i^R}]) x \\ = \beta \left [ \frac{1}{M!} \sum_R ([Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] - [Q_{\bar{S_i^R}} - R_{S_i^R}]) \right ] \mu + \beta \left [ \frac{1}{M!} \sum_R ([Q_{S_i^R \cup i} + R_{S_i^R \cup i}] - [Q_{S_i^R} + R_{S_i^R}]) \right ] x \end{align}

$$ \phi = \beta T x $$

This means that we can precompute the transform matrix $T$ by drawing random permutations $R$ many times and averaging our results. Once we have computed $T$ we can explain any number of samples (or models for that matter) by just using matrix multiplication.