So, we have the following setup. The input data $\mathcal{X}$ is common across all datasets, which means that the ground truth $\mathcal{Y}^*$ is also common across all datasets. The noisy targets $y_{m,n}$ and the predictions $\hat{y}_{m,n}$, on the other hand, are per-dataset.

Conceptually, then, the bias is:
$$ \text{bias} = \frac{1}{N} \sum_{n=1}^N \left( y_n^* - \frac{1}{M} \sum_{m=1}^M \hat{y}_{m,n} \right)^2 $$

Similarly, the variance is:

$$ \text{variance} = \frac{1}{N} \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \left( \hat{y}_{m,n} - \frac{1}{M} \sum_{m'=1}^M \hat{y}_{m',n} \right)^2 $$

For the noise, we have:

$$ \text{noise}_n = \frac{1}{M} \sum_{m=1}^M \left( y_{m,n} - \frac{1}{M} \sum_{m'=1}^M y_{m',n} \right)^2 $$

And:

$$ \text{noise} = \frac{1}{N} \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \left( y_{m,n} - \frac{1}{M} \sum_{m'=1}^M y_{m',n} \right)^2 $$

Lastly, the mse is averaged over datasets: for each dataset we compute the mse over all datapoints, then average over datasets, i.e.:
$$ \text{mse} = \frac{1}{M} \sum_{m=1}^M \frac{1}{N} \sum_{n=1}^N \left( y^*_n - \hat{y}_{m,n} \right)^2 \\ = \frac{1}{N} \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \left( y^*_n - \hat{y}_{m,n} \right)^2 $$
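To make these formulas concrete, here is a small vectorized sketch that computes all four quantities at once. The names are assumptions: `preds` and `Ys` are hypothetical $(M, N)$ tensors holding the per-dataset predictions $\hat{y}_{m,n}$ and noisy targets $y_{m,n}$, and `Y_star` is a length-$N$ tensor holding the ground truth $y^*_n$.

import torch

def calc_stats_vectorized(Y_star, preds, Ys):
    # preds, Ys: (M, N); Y_star: (N,). Names and shapes are assumptions.
    pred_mean = preds.mean(dim=0)                       # (1/M) sum_m yhat_{m,n}
    bias = ((Y_star - pred_mean) ** 2).mean()           # average over n
    variance = preds.var(dim=0, unbiased=False).mean()  # 1/M inside, as in the formula
    noise = Ys.var(dim=0, unbiased=False).mean()        # spread of the noisy targets
    mse = ((Y_star.unsqueeze(0) - preds) ** 2).mean()   # average over m and n
    return bias, variance, noise, mse

Note that, following the formula above, the mse here is measured against the ground truth $y^*_n$.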
In [92]:
import matplotlib.pyplot as plt
import numpy as np
import torch

N = 10            # datapoints per dataset
num_runs = 1000   # number of datasets M

true_a = 1.23
true_b = 0.34
true_epsilon = 0.51   # standard deviation of the target noise

print('ground truth a=%s b=%s e=%s' % (true_a, true_b, true_epsilon))


def generate_X():
    # fixed seed: the inputs X are common across all datasets
    torch.manual_seed(123)
    X = torch.rand(N, 1)
    return X


def generate_Y(X, seed):
    # per-dataset seed: the noisy targets differ between datasets
    torch.manual_seed(seed)
    Y = X.view(-1) * true_a + true_b + torch.randn(N) * true_epsilon
    return Y


def generate_features(X, order):
    # polynomial features [1, x, x^2, ..., x^order]
    X2 = torch.zeros(N, order + 1)
    for k in range(order + 1):
        X2[:, k] = X[:, 0].pow(k)
    return X2


def fit(X, Y):
    # ordinary least squares via the normal equations: W = (X^T X)^{-1} X^T Y
    W = (X.transpose(0, 1) @ X)
    W = W.inverse()
    W = W @ X.transpose(0, 1)
    W = W @ Y.view(-1, 1)
    return W


def calc_stats(Y_star, Y, preds, Ys):
    bias_sum = 0
    variance_sum = 0
    noise_sum = 0
    mse_sum = 0
    N = Y.size()[0]
    num_samples = len(preds)
    for n in range(N):
        predv = torch.zeros(num_samples)
        yv = torch.zeros(num_samples)
        for j in range(num_samples):
            predv[j] = preds[j][n]
            yv[j] = Ys[j][n]
            mse_sqrt = preds[j][n] - Y[n]   # error against the noisy targets Y
            mse_sum += mse_sqrt * mse_sqrt
        pred_avg = predv.mean()
        bias_sqrt = pred_avg - Y_star[n]
        bias_sum += bias_sqrt * bias_sqrt
        variance_sum += predv.var()   # spread of the predictions across datasets
        noise_sum += yv.var()         # spread of the noisy targets across datasets
    return bias_sum / N, variance_sum / N, \
        noise_sum / N, \
        mse_sum / N / num_samples


def run(order):
    print('')
    print('order %s' % order)
    preds = []
    Ys = []
    X = generate_X()
    Y_star = (X * true_a + true_b).view(-1)   # noise-free ground truth
    X2 = generate_features(X=X, order=order)
    Ws = torch.zeros(num_runs, order + 1)
    for i in range(num_runs):
        Y = generate_Y(X=X, seed=i)
        Ys.append(Y)
        W = fit(X2, Y)
        Ws[i] = W.view(-1)
        pred = (X2 @ W).view(-1)
        preds.append(pred)
    bias, variance, noise, mse = calc_stats(
        Y_star=Y_star, Y=Y, preds=preds, Ys=Ys)
    noise2 = mse - bias - variance   # noise recovered from the decomposition
    W_avg = Ws.mean(0)
    print('W_avg', W_avg.view(1, -1))
    print('bias %.3f' % bias, 'variance %.4f' % variance,
          'noise %.4f noise2 %.4f' % (noise, noise2),
          'mse %.4f' % mse)
    # also return the statistics so they can be reused (e.g. for plotting)
    return bias, variance, noise, mse


run(order=0)
run(order=1)
run(order=2)
run(order=3)
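As a minimal follow-up sketch, the bias/variance trade-off against the polynomial order could be plotted from the values that run() returns, using the matplotlib import from the cell above (this re-runs each experiment, so the prints repeat):

# sketch only: collect the returned statistics and plot them against model order
orders = [0, 1, 2, 3]
stats = [run(order=o) for o in orders]
plt.plot(orders, [float(s[0]) for s in stats], marker='o', label='bias')
plt.plot(orders, [float(s[1]) for s in stats], marker='o', label='variance')
plt.plot(orders, [float(s[3]) for s in stats], marker='o', label='mse')
plt.xlabel('polynomial order')
plt.legend()
plt.show()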
"Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real values $y_i$, associated with each point $x_i$. We assume that there is a function with noise $y = f(x) + \epsilon$, where the noise, $\epsilon$, has zero mean and variance $\sigma^2$.
"We want to find a function $\hat{f}(x)$, that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm. We make 'as well as possible' precise by measuring the mean squared error between $y$ and $\hat{f}(x)$: we want $(y - \hat{f}(x))^2$ to be minimal, both for $x_1, \dots, x_n$, and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\epsilon$; this means we must be prepared to accept an irreducible error in any function we come up with.
"Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ as follows:
$$ \Exp\left[(y - \fhat(x))^2\right] = \Bias[\fhat(x)]^2 + \Var[\fhat(x)] + \sigma^2 $$

Where:

$$ \Bias[\fhat(x)] = \Exp[\fhat(x) - f(x)] = \Exp[\fhat(x)] - \Exp[f(x)] $$

and:

$$ \Var[\fhat(x)] = \Exp[\fhat(x)^2] - \Exp[\fhat(x)]^2 $$

"The expectation ranges over different choices of the training set $x_1, \dots, x_n, y_1, \dots, y_n$, all sampled from the same joint distribution $P(x,y)$. The three terms represent the square of the bias, the variance, and the irreducible error due to the noise $\sigma^2$.
"The more complex the model $\fhat(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model 'move' more to capture the data points, and hence its variance will be larger."
"The derivation of the bias-variance decomposition for squared error proceeds as follows. For notational convenience, abbreviate $f = f(x)$ and $\fhat = \fhat(x)$. First recall that, by definition, for any random variable $X$, we have
$$ \Var[X] = \Exp[X^2] - \Exp[X]^2 $$

Rearranging, we get

$$ \Exp[X^2] = \Var[X] + \Exp[X]^2 $$

Since $f$ is deterministic,

$$ \Exp[f] = f $$

Thus, given $y = f + \epsilon$ and $\Exp[\epsilon] = 0$, we have $\Exp[y] = \Exp[f + \epsilon] = \Exp[f] = f$.
Also, since $\Var[\epsilon] = \sigma^2$:
$$ \Var[y] = \Exp[(y - \Exp[y])^2] = \Exp[(y - f)^2] = \Exp[(f + \epsilon - f)^2] = \Exp[\epsilon^2] = \Var[\epsilon] + \Exp[\epsilon]^2 = \sigma^2 $$

Thus, since $\epsilon$ and $\fhat$ are independent, we can write
$$ \Exp[(y-\fhat)^2] = \Exp[y^2 + \fhat^2 - 2y\fhat] \\ = \Exp[y^2] + \Exp[\fhat^2] - \Exp[2y\fhat] \\ = \Var[y] + \Exp[y]^2 + \Var[\fhat] + \Exp[\fhat]^2 - 2f\Exp[\fhat] \\ = \Var[y] + \Var[\fhat] + (f^2 - 2f\Exp[\fhat] + \Exp[\fhat]^2) \\ = \Var[y] + \Var[\fhat] + ( f - \Exp[\fhat])^2 \\ = \sigma^2 + \Var[\fhat] + \Bias[\fhat]^2 $$
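As a quick numerical sanity check of this identity, here is a small simulation sketch. The setup is an arbitrary choice for illustration: $f(x)$ is a fixed constant at one test point, and $\fhat$ is deliberately biased (the training-sample mean shrunk towards zero), so all three terms are non-zero.

import torch

torch.manual_seed(0)
f_x = 2.0      # true f(x) at a fixed test point
sigma = 0.5    # noise standard deviation
n_train = 5    # training points per simulated training set
M = 200000     # number of simulated training sets

# one training set per row; fhat = 0.8 * mean of the noisy training targets
train_y = f_x + sigma * torch.randn(M, n_train)
fhat = 0.8 * train_y.mean(dim=1)

# a fresh, independent noisy test target y for each simulated training set
y = f_x + sigma * torch.randn(M)

mse = ((y - fhat) ** 2).mean()
bias_sq = (fhat.mean() - f_x) ** 2
variance = fhat.var()
print(float(mse), float(sigma ** 2 + variance + bias_sq))  # the two numbers should nearly match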