bias-variance tradeoff

so:

we sample some datapoints from a population
we fit parameters to these sampled datapoints
then we create predictions, using these parameters, for each input $x_n$
we're going to fix the input data points $\mathcal{X} = \{x_1, x_2, \dots, x_N \}$, otherwise how can we average across datasets?

So, we have:

$N$ samples in each dataset
$M$ datasets

Input data $\mathcal{X}$ is common across all datasets:

$\mathcal{X} = x_{1}, x_{2}, \dots, x_{N}$

This means that ground-truth $\mathcal{Y}^*$ is also common across all datasets:

$\mathcal{Y}^* = \{ y^*_{1}, y^*_{2}, \dots, y^*_{N} \}$

Then, the following are per-dataset:

targets $\mathcal{Y}_m = \{ y_{m,1}, y_{m,2}, \dots, y_{m,N} \}$
parameters $\theta_m$ (fitted to above $\mathcal{X}$ and $\mathcal{Y}_m$
predictions $\hat{\mathcal{Y}}_m = \{ \hat{y}_{m,1}, \hat{y}_{m,2}, \dots, \hat{y}_{m,N} \}$

Then conceptually:

bias and variance are first calculated for each datapoint $x_n$, and then averaged over all datapoints
for each datapoint, the bias is the difference between the expected prediction, and the ground truth, squared, ie:

$$ \text{bias}_n = \left( y^*_n - \frac{1}{M} \sum_{m=1}^M \hat{y}_{m,n} \right)^2 $$

So, the bias is:

$$ \text{bias} = \frac{1}{N} \sum_{n=1}^N \left( y_n^* - \frac{1}{M} \sum_{m=1}^M \hat{y}_{m,n} \right)^2 $$

Similarly, variance is:

$$ \text{variance} = \frac{1}{N} \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \left( \hat{y}_{m,n} - \frac{1}{M} \sum_{m=1}^M \hat{y}_{m,n} \right)^2 $$

For the noise, we have:

$$ \text{noise}_n = \frac{1}{M} \sum_{m=1}^M \left( y_{m,n} - \frac{1}{M} \sum_{m=1}^M y_{m,n} \right)^2 $$

And:

$$ \text{noise} = \frac{1}{N} \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \left( y_{m,n} - \frac{1}{M} \sum_{m=1}^M y_{m,n} \right)^2 $$

Lastly, mse is averaged over each dataset. For each dataset, we calculate the mse over all datapoints, ie:

$$ \text{mse} = \frac{1}{M} \sum_{m=1}^M \frac{1}{N} \sum_{n=1}^N \left( y^*_n - \hat{y}_{m,n} \right)^2 \\ = \frac{1}{N} \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \left( y^*_n - \hat{y}_{m,n} \right)^2 $$



In [92]:

    
import matplotlib.pyplot as plt
import numpy as np
import torch

# N = 2000
N = 10
# K = 1
num_runs = 1000
true_a = 1.23
true_b = 0.34
# true_epsilon = 0.23
true_epsilon = 0.51
# true_epsilon = 0.0

print('ground truth a=%s b=%s e=%s' % (true_a, true_b, true_epsilon))

def generate_X():
#     torch.manual_seed(seed)
    torch.manual_seed(123)
    X = torch.rand(N, 1)
#     Y = X.view(-1) * true_a + true_b + torch.randn(N) * true_epsilon
#     return X, Y
    return X

def generate_Y(X, seed):
    torch.manual_seed(seed)
#     X = torch.rand(N, 1)
    Y = X.view(-1) * true_a + true_b + torch.randn(N) * true_epsilon
    return Y

def generate_features(X, order):
    X2 = torch.zeros(N, order + 1)
    for k in range(order + 1):
        X2[:, k] = X[:, 0].pow(k)
    return X2

def fit(X, Y):
    W = (X.transpose(0, 1) @ X)
    W = W.inverse()
    W = W @ X.transpose(0, 1)
    W = W @ Y.view(-1, 1)
    return W

def calc_stats(Y_star, Y, preds, Ys):
    bias_sum = 0
    variance_sum = 0
    noise_sum = 0
    mse_sum = 0
    
    N = Y.size()[0]
    print('N', N)
    num_samples = len(preds)
    for n in range(N):
#         yv = torch.zero(num_samples)
        predv = torch.zeros(num_samples)
        yv = torch.zeros(num_samples)
        for j in range(num_samples):
#             print('preds[j][n]', preds[j][n])
            predv[j] = preds[j][n]
            yv[j] = Ys[j][n]
            mse_sqrt = preds[j][n] - Y[n]
            mse = mse_sqrt * mse_sqrt
            mse_sum += mse
        pred_avg = predv.mean()
        bias_sqrt = pred_avg - Y_star[n]
        bias = bias_sqrt * bias_sqrt
        bias_sum += bias
        variance = predv.var()
#         help(torch.var)
#         asdf
        variance_sum += variance
        noise = yv.var()
        noise_sum += noise
    return bias_sum / N, variance_sum / N, \
        noise_sum / N, \
        mse_sum / N / num_samples
    
#     err = pred - Y.view(-1, 1)
#     mse = (err * err).sum() / N
#     bias = (err * err).sum() / N
#     print('mse', mse)
#     print('bias', bias)
#     variance = 
#     asdfasdf
#     bias = err.sum() * err.sum() / N / N
#     variance = (pred * pred).sum() / N - pred.sum() * pred.sum() / N / N
#     noise = mse - bias - variance

def run(order):
    print('')
    print('order %s' % order)
    preds = []
    Ys = []
    X = generate_X()
    Y_star = (X * true_a + true_b).view(-1)
    X2 = generate_features(X=X, order=order)
    Ws = torch.zeros(num_runs, order + 1)
    for i in range(num_runs):
        Y = generate_Y(X=X, seed=i)
        Ys.append(Y)
        W = fit(X2, Y)
        Ws[i] = W
        pred = X2 @ W
        pred = pred.view(-1)
        preds.append(pred)
    bias, variance, noise, mse = calc_stats(
        Y_star=Y_star, Y=Y, preds=preds, Ys=Ys)
    noise2 = mse - bias - variance
    W_avg = Ws.mean(0)
    print('W_avg', W_avg.view(1, -1))
    print('bias %.3f' % bias, 'variance %.4f' % variance,
          'noise %.4f noise2 %.4f' % (noise, noise2),
          'mse %.4f' % mse)
#         calc_stats(Y=Y, pred=pred)
#         run_one(order=order, seed=i)

run(order=0)
run(order=1)
run(order=2)
run(order=3)
# run(order=4)









    



ground truth a=1.23 b=0.34 e=0.51

order 0
N 10
W_avg 
 1.0177
[torch.FloatTensor of size 1x1]

bias 0.048 variance 0.0273 noise 0.2596 noise2 0.4214 mse 0.4967

order 1
N 10
W_avg 
 0.3303  1.2448
[torch.FloatTensor of size 1x2]

bias 0.000 variance 0.0538 noise 0.2596 noise2 0.3206 mse 0.3744

order 2
N 10
W_avg 
 0.3944  0.9423  0.3054
[torch.FloatTensor of size 1x3]

bias 0.000 variance 0.0791 noise 0.2596 noise2 0.3205 mse 0.3996

order 3
N 10
W_avg 
 0.6098 -0.6744  3.9526 -2.5258
[torch.FloatTensor of size 1x4]

bias 0.000 variance 0.1042 noise 0.2596 noise2 0.3170 mse 0.4213

Bias-variance decomposition of squared error

"Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real values $y_i$, associated with each point $x_i$. We assume that there is a function with noise $y = f(x) + \epsilon$, where the noise, $\epsilon$, has zero mean and variance $\sigma^2$.

"We want to find a function $\hat{f}(x)$, that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm. We make 'as well as possible' precise by measuring the mean squared error between $y$ and $\hat{f}(x)$: we want $(y - \hat{f}(x))^2$ to be minimal, both for $x_1, \dots, x_n$, and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\epsilon$; this means we must be prepared to accept an irreducible error in any function we come up with.

"Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ as follows:

$$ \def\Exp{\mathbb{E}} \def\Bias{\text{Bias}} \def\Var{\text{Var}} \def\fhat{\hat{f}} \Exp[(y - \fhat(x))^2] = \Bias[\fhat(x)]^2 + \Var[\fhat(x)] + \sigma^2 $$

Where:

$$ \Bias[\fhat(x)] = \Exp[\fhat(x) - f(x)] = \Exp[\fhat(x)] - \Exp[f(x)] $$

and:

$$ \Var[\fhat(x)] = \Exp[\fhat(x)^2] - \Exp[\fhat(x)]^2 $$

"The expectation ranges over different choices of the training set $x_1, \dots, x_n, y_1, \dots, y_n$, all sampled from the same joint distribution $P(x,y)$. The three terms represent:

the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. Eg, when approximating a non-linear function $f(x)$ using a learning method for linear models, there will be error in the estimates $\fhat(x)$ due to this assumption
the variance of the learning method, or, intuitively, how much the learning method $\fhat(x)$ will move around its mean
the irreducible error $\sigma^2$. Since all three terms are non-negative, this forms a lower coutn on the expected error on unseen samples.

"The more complex the model $\fhat(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model 'move' more to capture the data points, and hence its variance will be larger."

Derivation

"The derivation of the bias-variance decomposition for squared error proceeds as follows. For notational convenience, abbreviate $f = f(x)$ and $\fhat = \fhat(x)$. First recall that, by definition, for any random variable $X$, we have

$$ \Var[X] = \Exp[X^2] - \Exp[X]^2 $$

Rearranging, we get

$$ \Exp[X^2] = \Var[X] + \Exp[X]^2 $$

Since $f$ is deterministic

$$ \Exp[f] = f $$

Thus, given $y = f+ \epsilon$ and $\Exp[\epsilon] = 0$, implies $\Exp[y] = \Exp[f + \epsilon] = \Exp[f] = f$.

Also, since $\Var[\epsilon] = \sigma^2$:

$$ \Var[y] = \Exp[(y - \Exp[y])^2] = \Exp[(y -f)^2] = \Exp[(f + \epsilon -f )^2] = \Exp[\epsilon^2] = \Var[\epsilon] + \Exp[\epsilon]^2 = \sigma^2 $$

Thus, since $\epsilon$ and $\fhat$ are independent, we can write

$$ \Exp[(y-\fhat)^2] = \Exp[y^2 + \fhat^2 - 2y\fhat] \\ = \Exp[y^2] + \Exp[\fhat^2] - \Exp[2y\fhat] \\ = \Var[y] + \Exp[y]^2 + \Var[\fhat] + \Exp[\fhat]^2 - 2f\Exp[\fhat] \\ = \Var[y] + \Var[\fhat] + (f^2 - 2f\Exp[\fhat] + \Exp[\fhat]^2) \\ = \Var[y] + \Var[\fhat] + ( f - \Exp[\fhat])^2 \\ = \sigma^2 + \Var[\fhat] + \Bias[\fhat]^2 $$