Consider a sequence of IID random variables: $$ X_i = \begin{cases} 100 & \text{ with prob. } 0.02 \\ 0 & \text{ with prob. } 0.97 \\ -100 & \text{ with prob. } 0.01 \\ \end{cases} $$ The true mean of $X_i$ is $$ 0.02 \times 100 + 0.97 \times 0 + 0.01 \times (-100) = 1 $$
We want to estimate the true mean of this distribution, and we will compare two different estimators. Suppose you take three samples $X_1, X_2, X_3$ and compute their empirical mean $Z=\frac{X_1 + X_2 + X_3}{3}$ and empirical median $Y$ (recall that the median is obtained by sorting $X_1, X_2, X_3$ and then choosing the middle, i.e. 2nd, entry).
The empirical mean of $n$ IID random variables is always an unbiased estimate of the true mean. However, the empirical mean estimator can have high variance. Here it is $ \text{Var}(Z) = \frac{\text{Var}(X_i)}{3} = \frac{0.02 \times (100-1)^2 + 0.97 \times (0-1)^2 + 0.01 \times (-100-1)^2}{3} = \frac{299}{3} = 99 \tfrac 2 3.$
The median, on the other hand, is a biased estimator of the true mean. Its distribution is a little tedious to work out exactly, but here goes: the median equals $100$ only if at least two of the three samples equal $100$, and similarly for $-100$: $$ \text{median} = \begin{cases} 100 & \text{w.p. } 0.02^3 + \binom{3}{2}\, 0.02^2 \times 0.98 \\ -100 & \text{w.p. } 0.01^3 + \binom{3}{2}\, 0.01^2 \times 0.99 \\ 0 & \text{otherwise} \end{cases} $$ If you work this out, you see that the median on average is about $0.089$. This means that $\text{bias}^2 \approx (1-0.089)^2$, which is no more than 1. Using a similar argument, you can check that the variance of the median is no more than 20. This can be checked experimentally, as in the simulation below!
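Here is a short simulation sketch (using numpy; the number of trials is arbitrary) that draws many batches of three samples and estimates the bias and variance of both the empirical mean and the empirical median:
In [ ]:
import numpy as np
np.random.seed(0)
trials = 200000
true_mean = 1.0
#draw many batches of three IID samples from the distribution above
samples = np.random.choice([100, 0, -100], p=[0.02, 0.97, 0.01], size=(trials, 3))
Z = samples.mean(axis=1)        #empirical mean of each batch
Y = np.median(samples, axis=1)  #empirical median of each batch
#the mean should be (nearly) unbiased with variance close to 99.67
print('mean:   bias^2 = %.3f, variance = %.2f' % ((Z.mean() - true_mean)**2, Z.var()))
#the median should average roughly 0.089, with variance well below 20
print('median: bias^2 = %.3f, variance = %.2f' % ((Y.mean() - true_mean)**2, Y.var()))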
Assume that we have noisy data, modeled by $f = y + \epsilon$, where $y$ is the true value and $\epsilon \sim \mathcal{N}(0,\sigma^2)$. Given an estimator $\hat{f}$ (assumed independent of the noise $\epsilon$), the expected squared error can be derived as follows:
$$ \begin{align} \mathbb{E}\left[\left(\hat{f} - f\right)^2\right] &= \mathbb{E}\left[\hat{f}^2 - 2f\hat{f} + f^2\right]\\ &= \mathbb{E}\left[\hat{f}^2\right] + \mathbb{E}\left[f^2\right] - 2\mathbb{E}\left[f\hat{f}\right] \text{ by linearity of expectation} \\ \end{align} $$ Now, by definition, $\text{Var}(x) = \mathbb{E}\left[x^2\right] - \left(\mathbb{E}\left[x\right]\right)^2$, so $\mathbb{E}\left[x^2\right] = \text{Var}(x) + \left(\mathbb{E}\left[x\right]\right)^2$. Moreover, since $\hat{f}$ is independent of the noise $\epsilon$, we have $\mathbb{E}\left[f\hat{f}\right] = \mathbb{E}[f]\,\mathbb{E}[\hat{f}]$, with $\mathbb{E}[f] = y$ and $\text{Var}(f) = \sigma^2$. Substituting into the equation above, we get: $$ \begin{align} \mathbb{E}\left[\hat{f}^2\right] + \mathbb{E}\left[f^2\right] - 2\mathbb{E}\left[f\hat{f}\right] &= \text{Var}(\hat{f}) + \left(\mathbb{E}[\hat{f}]\right)^2 + \text{Var}(f) + \left(\mathbb{E}[f]\right)^2 - 2\,\mathbb{E}[f]\,\mathbb{E}[\hat{f}] \\ &= \text{Var}(\hat{f}) + \text{Var}(f) + \left(\mathbb{E}[\hat{f}] - \mathbb{E}[f]\right)^2\\ &= \boxed{\sigma^2 + \text{Var}(\hat{f}) + \left(\mathbb{E}[\hat{f}] - y\right)^2} \end{align} $$
The first term $\sigma^2$ is the irreducible error due to the noise in the data (it comes from the distribution of $\epsilon$ and cannot be removed by any estimator). The second term is the variance of the estimator $\hat{f}$, and the final term is the squared bias of the estimator. There is an inherent tradeoff between the bias and the variance of an estimator. Generally, more complex estimators (think of high-degree polynomials as an example) have low bias, since they can fit the sampled data very closely. However, that close fit will not be maintained if we resample the data and refit, which means that the variance of such an estimator is high.
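To make the decomposition concrete, here is a small numerical sanity check (a sketch; the values of $y$, $\sigma$, the shrinkage factor, and the sample size $m$ are made up for illustration). We build a deliberately biased estimator by shrinking the average of $m$ fresh noisy observations toward zero, and compare its empirical squared error against $\sigma^2 + \text{Var}(\hat{f}) + \text{bias}^2$.
In [ ]:
import numpy as np
np.random.seed(0)
y, sigma, m, trials = 2.0, 0.5, 5, 200000
#fresh noisy observation f = y + eps for every trial
f = y + np.random.normal(0, sigma, trials)
#a deliberately biased estimator: shrink the average of m independent noisy samples toward 0
shrink = 0.8
f_hat = shrink * (y + np.random.normal(0, sigma, (trials, m))).mean(axis=1)
mse = np.mean((f_hat - f)**2)
decomposition = sigma**2 + f_hat.var() + (f_hat.mean() - y)**2
#the two numbers should agree up to simulation noise
print('empirical squared error = %.4f, sigma^2 + variance + bias^2 = %.4f' % (mse, decomposition))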
In [8]:
import numpy as np
import matplotlib.pyplot as plt
from numpy.matlib import repmat
from sklearn.preprocessing import PolynomialFeatures
degrees = [1,2,3,4,5]
#define data
n = 20
sub = 1000
mean = 0
std = 0.25
#define test set
Xtest = np.random.random((n,1))*2*np.pi
ytest = np.sin(Xtest) + np.random.normal(mean,std,(n,1))
#pre-allocate variables
preds = np.zeros((n,sub))
bias = np.zeros(len(degrees))
variance = np.zeros(len(degrees))
mse = np.zeros(len(degrees))
values = np.expand_dims(np.linspace(0,2*np.pi,100),1)
Let's try several polynomial fits to the data:
In [ ]:
for j,degree in enumerate(degrees):
    for i in range(sub):
        #create data - sample from sine wave
        x = np.random.random((n,1))*2*np.pi
        y = np.sin(x) + np.random.normal(mean,std,(n,1))
        #TODO
        #create features corresponding to degree - ex: 1, x, x^2, x^3...
        A =
        #TODO:
        #fit model using least squares solution (linear regression)
        #later include ridge regression/normalization
        coeffs =
        #store predictions for each sampling
        preds[:,i] = poly.fit_transform(Xtest).dot(coeffs)[:,0]
        #plot 9 images
        if i < 9:
            plt.subplot(3,3,i+1)
            plt.plot(values,poly.fit_transform(values).dot(coeffs),x,y,'.b')
            plt.axis([0,2*np.pi,-2,2])
    plt.suptitle('PolyFit = %i' % (degree))
    plt.show()
    #TODO
    #Calculate mean bias, variance, and MSE (UNCOMMENT CODE BELOW!)
    #bias[j] =
    #variance[j] =
    #mse[j] =
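If you want to check your work on the TODOs above, here is one possible completion (a sketch, not the only valid answer, with the plotting omitted for brevity). It builds the degree features with sklearn's PolynomialFeatures, fits the coefficients with np.linalg.lstsq, and then estimates bias, variance, and MSE from the stored predictions; measuring the bias and variance against the noiseless sine values at the test points is a choice made here for illustration.
In [ ]:
#one possible completion of the loop above (plotting omitted)
for j,degree in enumerate(degrees):
    poly = PolynomialFeatures(degree)
    for i in range(sub):
        x = np.random.random((n,1))*2*np.pi
        y = np.sin(x) + np.random.normal(mean,std,(n,1))
        A = poly.fit_transform(x)                      #features 1, x, x^2, ..., x^degree
        coeffs = np.linalg.lstsq(A, y, rcond=None)[0]  #least squares (linear regression) fit
        preds[:,i] = poly.fit_transform(Xtest).dot(coeffs)[:,0]
    ybar = np.sin(Xtest)[:,0]                          #noiseless sine values at the test points
    bias[j] = np.mean((preds.mean(axis=1) - ybar)**2)  #squared bias, averaged over test points
    variance[j] = np.mean(preds.var(axis=1))           #variance, averaged over test points
    mse[j] = np.mean((preds - ytest)**2)               #MSE against the noisy test labels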
Let's plot the bias, variance, and MSE of the estimators as a function of the polynomial degree!
In [ ]:
plt.subplot(3,1,1)
plt.plot(degrees,bias)
plt.title('bias')
plt.subplot(3,1,2)
plt.plot(degrees,variance)
plt.title('variance')
plt.subplot(3,1,3)
plt.plot(degrees,mse)
plt.title('MSE')
plt.show()