Assume we use a neural network for a regression problem, producing a set of predictions $\hat y$. Let's say that instead of the conventional MSE, we use correlation as the loss to optimize. We further assume that our signal (our $y$) is a time series, and that it's long, so it will be split up into several batches. Unfortunately, correlation does not work well in this situation because, unlike the MSE, it is not additive over batches: the correlation over the full series is not the mean of the per-batch correlations.
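To make the failure concrete, here is a minimal NumPy sketch with toy data (the series length, noise scale, and batch count are arbitrary illustrative choices): averaging per-batch Pearson correlations does not recover the full-series correlation, while averaging per-batch MSEs does.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=1000))           # toy time-series signal
y_hat = y + rng.normal(scale=2.0, size=1000)   # noisy "predictions"

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

batches = np.array_split(np.arange(1000), 10)  # 10 equal-sized batches

full_corr = pearson(y, y_hat)
mean_batch_corr = np.mean([pearson(y[b], y_hat[b]) for b in batches])

full_mse = np.mean((y - y_hat) ** 2)
mean_batch_mse = np.mean([np.mean((y[b] - y_hat[b]) ** 2) for b in batches])

print(full_corr, mean_batch_corr)  # differ: correlation is not additive
print(full_mse, mean_batch_mse)    # agree: equal-sized batch MSEs average exactly
```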
Let's derive a metric that has some of the same properties as correlation, but without the problems. Our first step is to lower our sights; rather than emulating the full set of properties of correlation, let's try to match those of a related similarity function, cosine similarity. Desirable properties of this similarity function $L(y, \hat y)$ are:

1. Scale invariance: $L(y, c \hat y) = L(y, \hat y)$ for any $c \neq 0$, so the model is not penalized for mispredicting the overall scale of the signal.
2. Boundedness: $L \le 1$, with the maximum attained exactly when $\hat y$ is a scalar multiple of $y$.
3. Additivity over batches: the loss decomposes into a sum of per-example contributions, given a small amount of cached global state.
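As a quick sanity check of property 1 for plain cosine similarity, a small NumPy sketch (the vectors and the rescaling factor are arbitrary toy choices):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
y, y_hat = rng.normal(size=50), rng.normal(size=50)
print(np.isclose(cosine(y, y_hat), cosine(y, 3.7 * y_hat)))  # True: scale cancels
```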
Consider the metric:
$$L(y, \hat y) = 1 - \min_\alpha ||y - \alpha \hat y||^2_2 / ||y||_2^2$$

Let's show each property holds. Expanding the quadratic and minimizing over $\alpha$ gives

$$\alpha^* = \frac{\langle y, \hat y \rangle}{||\hat y||_2^2}, \qquad \min_\alpha ||y - \alpha \hat y||_2^2 = ||y||_2^2 - \frac{\langle y, \hat y \rangle^2}{||\hat y||_2^2},$$

so that

$$L(y, \hat y) = \frac{\langle y, \hat y \rangle^2}{||y||_2^2 \, ||\hat y||_2^2} = \cos^2 \angle(y, \hat y).$$

Property 1 holds because any factor $c \neq 0$ on $\hat y$ cancels in this ratio. Property 2 follows from the Cauchy–Schwarz inequality: $0 \le L \le 1$, with $L = 1$ exactly when $\hat y$ is a scalar multiple of $y$. Property 3 holds because, at a fixed $\alpha$, the residual $||y - \alpha \hat y||_2^2 = \sum_i (y_i - \alpha \hat y_i)^2$ is a sum over examples, so each example's contribution depends only on $y_i$, $\hat y_i$, $\alpha$, and the cached $||y||_2^2$.
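A numerical check of the closed-form $\alpha^*$ and the $L = \cos^2$ identity, again as a NumPy sketch with arbitrary toy vectors:

```python
import numpy as np

def metric(y, y_hat):
    alpha = (y @ y_hat) / (y_hat @ y_hat)  # closed-form optimal alpha
    return 1 - np.sum((y - alpha * y_hat) ** 2) / np.sum(y ** 2)

rng = np.random.default_rng(2)
y, y_hat = rng.normal(size=200), rng.normal(size=200)

cos2 = (y @ y_hat) ** 2 / ((y @ y) * (y_hat @ y_hat))
print(np.isclose(metric(y, y_hat), cos2))                     # L == cos^2(y, y_hat)
print(np.isclose(metric(y, y_hat), metric(y, -3.0 * y_hat)))  # property 1: scale invariance
print(np.isclose(metric(y, 0.5 * y), 1.0))                    # property 2: max when y_hat ∝ y
```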
Thus, assuming we maintain an estimate of the optimal $\alpha$ (obtained, for example, by stochastic gradient descent) and have cached the value of $||y||_2^2$, we can compute the contribution of a single example to the loss using only local information.
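As one possible realization, here is a PyTorch sketch in which $\alpha$ is a learnable scalar trained jointly with the network; the class and variable names (`BatchedCosineLoss`, `total_sq_norm`) are illustrative assumptions, not part of the derivation above. Since $||y||_2^2$ is a fixed constant, minimizing the summed per-batch residuals is equivalent to maximizing $L$.

```python
import torch
import torch.nn as nn

class BatchedCosineLoss(nn.Module):
    """Hypothetical batch-wise surrogate: each call returns one batch's
    contribution to (1 - L), i.e. ||y_b - alpha * y_hat_b||^2 / ||y||^2."""

    def __init__(self, total_sq_norm: float):
        super().__init__()
        self.total_sq_norm = total_sq_norm            # cached ||y||_2^2 over the full series
        self.alpha = nn.Parameter(torch.tensor(1.0))  # SGD-trained estimate of alpha*

    def forward(self, y_batch: torch.Tensor, y_hat_batch: torch.Tensor) -> torch.Tensor:
        # Sum of per-example residuals: purely local to this batch given
        # alpha and the cached global norm.
        residual = torch.sum((y_batch - self.alpha * y_hat_batch) ** 2)
        return residual / self.total_sq_norm

# Sketch of usage: summing these losses over all batches and subtracting
# from 1 recovers the full-series metric.
#   loss_fn = BatchedCosineLoss(total_sq_norm=float((y ** 2).sum()))
#   loss = loss_fn(y_batch, model(x_batch)); loss.backward()
```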