Assume we use a neural network for a regression problem, producing a set of predictions $\hat y$. Let's say that instead of the conventional MSE, we use correlation as the loss to optimize. We further assume that our signal (our $y$) is a time series, and that it's long, so it will be split up into several batches. Unfortunately, correlation does not work well in this situation because, unlike the MSE, it is not additive over batches.
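
To make the contrast concrete, here is a small NumPy check (my own illustration, not part of the original setup): the total squared error decomposes exactly into per-batch sums, so the MSE is a fixed rescaling of something batch-additive, while per-batch correlations averaged together do not recover the full-series correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=1000))          # a toy long time series
y_hat = y + rng.normal(scale=5.0, size=1000)  # noisy predictions

batches = np.array_split(np.arange(1000), 10)

# Squared error: the per-batch sums add up exactly to the full-data sum.
sse_full = np.sum((y - y_hat) ** 2)
sse_batched = sum(np.sum((y[b] - y_hat[b]) ** 2) for b in batches)
print(np.isclose(sse_full, sse_batched))      # True

# Correlation: averaging per-batch correlations does not recover the
# correlation computed over the full series.
corr_full = np.corrcoef(y, y_hat)[0, 1]
corr_batched = np.mean([np.corrcoef(y[b], y_hat[b])[0, 1] for b in batches])
print(corr_full, corr_batched)                # generally differ
```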

Let's derive a metric that has some of the same properties as correlation, but without this problem. Our first step is to lower our sights; rather than emulating the full set of properties of correlation, let's try to match those of a related similarity function, cosine similarity. Desirable properties of this similarity function $L(y, \hat y)$ are (a quick numerical check against cosine similarity follows the list):

  1. Invariant to scaling, i.e. for nonzero scalars $\beta$, $\gamma$, $L(y, \hat y) = L(\beta y, \hat y) = L(y, \gamma \hat y)$
  2. Maximum value of 1, attained when the prediction matches the target: $L(y, y) = 1$
  3. Equal to 0 for orthogonal vectors. $L(y, \hat y) = 0$ if $\langle y, \hat y \rangle = 0$.
  4. (not part of cosine similarity, but important for us): additive over examples.
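
As a quick sanity check (my own code, not the author's), cosine similarity does satisfy the first three properties; like correlation, though, it lacks the fourth, which is what the rest of the derivation addresses.

```python
import numpy as np

def cosine_similarity(y, y_hat):
    return np.dot(y, y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat))

rng = np.random.default_rng(0)
y, y_hat = rng.normal(size=100), rng.normal(size=100)

# 1. Invariant to (positive) scaling of either argument.
print(np.isclose(cosine_similarity(y, y_hat), cosine_similarity(3.0 * y, y_hat)))
print(np.isclose(cosine_similarity(y, y_hat), cosine_similarity(y, 0.5 * y_hat)))

# 2. Equal to 1 when the prediction matches the target exactly.
print(np.isclose(cosine_similarity(y, y), 1.0))

# 3. Equal to 0 for orthogonal vectors.
e1, e2 = np.eye(2)
print(np.isclose(cosine_similarity(e1, e2), 0.0))
```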

Consider the metric:

$$L(y, \hat y) = 1 - \frac{\min_\alpha ||y - \alpha \hat y||^2_2}{||y||_2^2}$$
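
Here is a minimal NumPy sketch of this metric (my own code; the function name is made up). It uses the fact that the inner minimization is an ordinary least-squares fit of $y$ on $\hat y$, so the optimal $\alpha$ has the closed form $\alpha^* = \langle y, \hat y \rangle / ||\hat y||_2^2$:

```python
import numpy as np

def scale_invariant_similarity(y, y_hat):
    alpha = np.dot(y, y_hat) / np.dot(y_hat, y_hat)  # closed-form minimizer of ||y - alpha*y_hat||^2
    residual = np.sum((y - alpha * y_hat) ** 2)       # the minimized value
    return 1.0 - residual / np.sum(y ** 2)

rng = np.random.default_rng(0)
y, y_hat = rng.normal(size=100), rng.normal(size=100)

# Properties 1-3 from the list above:
print(np.isclose(scale_invariant_similarity(y, y_hat),
                 scale_invariant_similarity(2.0 * y, -3.0 * y_hat)))  # scaling invariance
print(np.isclose(scale_invariant_similarity(y, y), 1.0))              # equals 1 at y_hat = y
e1, e2 = np.eye(2)
print(np.isclose(scale_invariant_similarity(e1, e2), 0.0))            # 0 for orthogonal vectors

# With the closed-form alpha, the metric works out to the squared cosine
# similarity between y and y_hat (an observation, not a claim from the post):
cos = np.dot(y, y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat))
print(np.isclose(scale_invariant_similarity(y, y_hat), cos ** 2))
```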

Let's show each property holds:

  1. Invariance to scaling in the first argument: replacing $y$ with $\beta y$ multiplies the denominator $||y||_2^2$ by $\beta^2$, and (after reparameterizing $\alpha \to \beta \alpha$) the numerator by $\beta^2$ as well, so the ratio is unchanged. Invariance in the second argument follows from the free $\alpha$, which absorbs any rescaling of $\hat y$.
  2. Substituting $\hat y = y$, the minimum of $||y - \alpha y||^2_2$ is 0, attained at $\alpha = 1$, so $L(y, y) = 1$. Since the minimized norm is never negative, 1 is also the largest value $L$ can take.
  3. $||y - \alpha \hat y||^2_2$ expands to $\sum_i (y_i^2 - 2 \alpha y_i \hat y_i + \alpha^2 \hat y_i^2)$; the cross term vanishes by orthogonality, leaving $||y||_2^2 + \alpha^2 ||\hat y||_2^2$, which is minimized at $\alpha = 0$ with value $||y||_2^2$. The ratio is then 1, so $L$ is 0.
  4. We can rewrite the metric as
$$L(y, \hat y) = \max_\alpha \left( 1 - \frac{1}{||y||_2^2} \sum_i (y_i - \alpha \hat y_i)^2 \right)$$
For any fixed $\alpha$, the expression inside the maximum is a sum of per-example terms $-(y_i - \alpha \hat y_i)^2 / ||y||_2^2$ plus a constant, so it can be accumulated batch by batch; a numerical check follows below.
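
The check below (again my own code) confirms the decomposition numerically: for a fixed $\alpha$ and a cached $||y||_2^2$, the per-batch contributions sum exactly to the full-data value.

```python
import numpy as np

rng = np.random.default_rng(0)
y, y_hat = rng.normal(size=1000), rng.normal(size=1000)
alpha = 0.7                                   # any fixed estimate of alpha
y_norm_sq = np.sum(y ** 2)                    # cached once over the whole series

full_value = 1.0 - np.sum((y - alpha * y_hat) ** 2) / y_norm_sq

batch_sum = 0.0
for b in np.array_split(np.arange(1000), 10):
    batch_sum += np.sum((y[b] - alpha * y_hat[b]) ** 2) / y_norm_sq
batched_value = 1.0 - batch_sum

print(np.isclose(full_value, batched_value))  # True
```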

Thus, assuming we have an estimate of the optimal $\alpha$ (obtained, for example, by stochastic gradient descent) and that we've cached the value of $||y||_2^2$, we can estimate the contribution of a single example to the loss using only local information.
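
To make this concrete, here is a rough PyTorch sketch of one way to wire this into training; the post doesn't prescribe a framework, and the class and variable names below are my own. The idea is to keep $\alpha$ as an extra learnable parameter updated by the same optimizer as the network, and to minimize the per-batch residual term, which is equivalent to maximizing $L$.

```python
import torch
from torch import nn

class BatchwiseScaleInvariantLoss(nn.Module):
    def __init__(self, y_norm_sq: float):
        super().__init__()
        self.y_norm_sq = y_norm_sq                    # cached ||y||_2^2 for the full series
        self.alpha = nn.Parameter(torch.tensor(1.0))  # running estimate of the optimal alpha

    def forward(self, y_batch, y_hat_batch):
        # This batch's share of min_alpha ||y - alpha*y_hat||^2 / ||y||^2.
        # Minimizing the sum of these over all batches maximizes L(y, y_hat).
        return torch.sum((y_batch - self.alpha * y_hat_batch) ** 2) / self.y_norm_sq

# Usage sketch (model, data loader, and learning rate are placeholders):
# model = ...                                   # some regression network
# y = ...                                       # the full target series, a 1-D tensor
# criterion = BatchwiseScaleInvariantLoss(float(torch.sum(y ** 2)))
# optimizer = torch.optim.SGD(list(model.parameters()) + list(criterion.parameters()), lr=1e-3)
# for x_batch, y_batch in loader:
#     optimizer.zero_grad()
#     loss = criterion(y_batch, model(x_batch))
#     loss.backward()
#     optimizer.step()
```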