Definition 1 Formally, suppose we have a function $F: \mathbb{R}^n \mapsto [0, 1]$ that represents a deep network, and an input $\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^n$. An attribution of the prediction at input $\mathbf{x}$ relative to a baseline input $\mathbf{x}'$ is a vector $A_F(\mathbf{x}, \mathbf{x}') = (a_1, \dots, a_n) \in \mathbb{R}^n$ where $a_i$ is the contribution of $x_i$ to the prediction $F(\mathbf{x})$.
As noted in previous papers on the attribution problem, a significant challenge in designing attribution techniques is that they are hard to evaluate empirically: it is difficult to tease apart errors that stem from misbehavior of the model versus misbehavior of the attribution method.
Remark 1. A common way for humans to perform attribution relies on counterfactual intuition: when we assign blame to a certain cause, we implicitly consider the absence of that cause as a baseline for comparing outcomes. In a deep network, we model the absence using a single baseline input. For most deep networks, a natural baseline exists in the input space where the prediction is neutral; for instance, in object recognition networks, it is the black image. The need for a baseline has also been pointed out by prior work on attribution (Shrikumar et al., 2016; Binder et al., 2016).
Two networks are functionally equivalent if their outputs are equal for all inputs, despite having very different implementations. Attribution methods should satisfy Implementation Invariance, i.e., the attributions are always identical for two functionally equivalent networks. To motivate this, notice that attribution can be colloquially defined as assigning the blame (or credit) for the output to the input features. Such a definition does not refer to implementation details.
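As a toy illustration (the functions and names below are hypothetical, not from the original), two different implementations of the same sum-of-ReLUs function are functionally equivalent, so an implementation-invariant attribution method must give them identical attributions:

```python
import numpy as np

# Two functionally equivalent "networks": identical outputs for every input,
# despite different implementations.
def f1(x):
    return np.maximum(x, 0.0).sum()   # ReLU written with np.maximum

def f2(x):
    return (x * (x > 0)).sum()        # ReLU written with a boolean mask

x = np.random.randn(5)
# Same function, so an implementation-invariant attribution method
# must assign the same attributions to both.
assert np.isclose(f1(x), f2(x))
```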
We consider the straight-line path in $\mathbb{R}^n$ from the baseline $\mathbf{x}'$ to the input $\mathbf{x}$, and compute the gradients at all points along the path. The integrated gradient along the $i$th dimension for an input $\mathbf{x}$ and baseline $\mathbf{x}'$ is defined as follows:
$$ \text{IntegratedGrads}_i(\mathbf{x}) := (x_i - x_i') \cdot \int_{\alpha=0}^1 \frac{\partial F(\mathbf{x}' + \alpha(\mathbf{x} - \mathbf{x}'))}{\partial x_i}\, d\alpha $$

Axiom: Completeness Integrated gradients satisfies an axiom called Completeness: the attributions add up to the difference between the output of $F$ at the input $\mathbf{x}$ and at the baseline $\mathbf{x}'$.
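As a simple illustration of both the definition and Completeness, take the product function $F(x_1, x_2) = x_1 x_2$ (chosen purely for illustration) with the zero baseline $\mathbf{x}' = (0, 0)$. The path point is $\alpha \mathbf{x}$, so $\partial F / \partial x_1$ along the path equals $\alpha x_2$, and

$$ \text{IntegratedGrads}_1(\mathbf{x}) = (x_1 - 0) \cdot \int_{\alpha=0}^1 \alpha x_2 \, d\alpha = \frac{x_1 x_2}{2}, \qquad \text{IntegratedGrads}_2(\mathbf{x}) = \frac{x_1 x_2}{2} $$

by symmetry. The two attributions sum to $x_1 x_2 = F(\mathbf{x}) - F(\mathbf{x}')$, which is exactly what Proposition 1 below asserts.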
Proposition 1 If $F: \mathbb{R}^n \mapsto \mathbb{R}$ is differentiable almost everywhere, then:
$$ \sum_{i=1}^n \text{IntegratedGrads}_i(\mathbf{x}) = F(\mathbf{x}) - F(\mathbf{x}') $$

For most deep networks, it is possible to choose a baseline such that the prediction at the baseline is near zero ($F(\mathbf{x}') \approx 0$). In such cases, the resulting attributions have an interpretation that ignores the baseline and amounts to distributing the output to the individual input features.
Remark 2 Integrated gradients satisfies Sensitivity(a): Completeness implies Sensitivity(a) and is thus a strengthening of the Sensitivity(a) axiom. Sensitivity(a) refers to the case where the baseline and the input differ in only one variable yet have different predictions, and requires that this variable receive a non-zero attribution; Completeness asserts that in this case the difference in the two output values is exactly the attribution to that variable. Attributions generated by integrated gradients satisfy Implementation Invariance since they are based only on the gradients of the function represented by the network.
Axiom: Sensitivity(b) If the function implemented by the deep network does not depend mathematically on some variable, then the attribution to that variable is always zero.
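For instance, if $F(x_1, x_2) = x_1$, then $\partial F / \partial x_2 = 0$ at every point on the path, so integrated gradients assigns $x_2$ an attribution of zero for every input and baseline, as the axiom requires.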
Axiom: Linearity Suppose that we linearly compose two deep networks modeled by the functions $f_1$ and $f_2$ to form a third network that models the function $a \cdot f_1 + b \cdot f_2$, i.e., a linear combination of the two networks. Then we would like the attributions for $a \cdot f_1 + b \cdot f_2$ to be the weighted sum of the attributions for $f_1$ and $f_2$ with weights $a$ and $b$ respectively. Intuitively, we would like the attributions to preserve any linearity within the network.
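Written with the attribution notation $A_F$ from Definition 1, the requirement is

$$ A_{a \cdot f_1 + b \cdot f_2}(\mathbf{x}, \mathbf{x}') = a \cdot A_{f_1}(\mathbf{x}, \mathbf{x}') + b \cdot A_{f_2}(\mathbf{x}, \mathbf{x}') $$

Integrated gradients satisfies this because both differentiation and integration are linear in $F$.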
Proposition 2 Path methods are the only attribution methods that always satisfy Implementation Invariance, Linearity and Completeness. A path method generalizes integrated gradients by integrating gradients along some monotone path from the baseline $\mathbf{x}'$ to the input $\mathbf{x}$; integrated gradients is the path method for the straight-line path.
Symmetry-preserving Two input variables are symmetric with respect to a function if swapping them does not change the function. For instance, $x$ and $y$ are symmetric with respect to $F$ if and only if $F(x, y) = F(y, x)$ for all values of $x$ and $y$. An attribution method is symmetry-preserving if, for all inputs that have identical values for symmetric variables and baselines that have identical values for symmetric variables, the symmetric variables receive identical attributions.
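For instance, the product function $F(x_1, x_2) = x_1 x_2$ from the worked example above is symmetric in its two variables: with input $(1, 1)$ and baseline $(0, 0)$, a symmetry-preserving method must give $x_1$ and $x_2$ identical attributions, and integrated gradients indeed assigns $\tfrac{1}{2}$ to each.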
Theorem 1 Integrated gradients is the unique path method that is symmetry-preserving.
Computing integrated gradients The integral of integrated gradients can be efficiently approximated via a summation. We simply sum the gradients at points occurring at sufficiently small intervals along the straight-line path from the baseline $\mathbf{x}'$ to the input $\mathbf{x}$.
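Concretely, using $m$ steps along the path, the Riemann-sum approximation can be written as

$$ \text{IntegratedGrads}_i^{\text{approx}}(\mathbf{x}) := (x_i - x_i') \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\big(\mathbf{x}' + \tfrac{k}{m}(\mathbf{x} - \mathbf{x}')\big)}{\partial x_i} $$

where $m$ denotes the number of steps used to approximate the integral.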
We recommend that developers check that the attributions approximately add up to the difference between the score at the input and that at the baseline (i.e., that Completeness approximately holds), and, if not, increase the number of steps $m$.
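The sketch below is a minimal NumPy illustration of this procedure, assuming a toy scalar function `f` whose gradient `grad_f` is available in closed form; the names `f`, `grad_f`, and `integrated_gradients` are hypothetical stand-ins for a real network and its autodiff gradients.

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, m=100):
    """Riemann-sum approximation of integrated gradients.

    f        : callable mapping an input vector to a scalar score.
    grad_f   : callable returning the gradient of f at an input vector.
    x        : input vector (np.ndarray).
    baseline : baseline vector x' (np.ndarray), same shape as x.
    m        : number of steps along the straight-line path.
    """
    alphas = np.arange(1, m + 1) / m                  # k/m for k = 1..m
    total_grad = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        point = baseline + alpha * (x - baseline)     # x' + (k/m)(x - x')
        total_grad += grad_f(point)
    avg_grad = total_grad / m
    # Scale the averaged gradients elementwise by (x_i - x_i').
    return (x - baseline) * avg_grad

# Toy usage: F(x1, x2) = x1 * x2 with the zero baseline.
f = lambda z: z[0] * z[1]
grad_f = lambda z: np.array([z[1], z[0]])

x = np.array([3.0, 2.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(f, grad_f, x, baseline, m=200)

# Completeness check: attributions should sum to F(x) - F(x').
print(attributions)                               # approx [3.0, 3.0]
print(attributions.sum(), f(x) - f(baseline))     # approx 6.0 vs 6.0
```

In a real network, `grad_f` would come from the framework's automatic differentiation, and the path points would typically be batched for efficiency.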