EECS 445: Machine Learning

Hands On 10: Bias Variance Tradeoff

Consider a sequence of IID random variable: $$ X_i = \begin{cases} 100 & \text{ with prob. } 0.02 \\ 0 & \text{ with prob. } 0.97 \\ -100 & \text{ with prob. } 0.01 \\ \end{cases} $$ The true mean of $X_i$ is $$ 0.02 \times 100 + 0.97 \times 0 + 0.01 \times -100 = 1 $$

We want to estimate the true mean of this distribution. We will consider two different estimators of the true mean. Let's say you take three samples $X_1, X_2, X_3$, and you compute the empirical mean $Z=\frac{X_1 + X_2 + X_3}{3}$ and empirical median $Y$ of these three samples (recall that the median is obtained by sorting $X_1, X_2, X_3$ and then choosing the middle (2nd) entry).

What is the bias-variance tradeoff of the $Y$ and $Z$ for estimating the true mean of the above distribution?

  • They are both unbiased estimators of the true mean, and have the same variance.
  • The median has higher bias and higher variance.
  • The mean has higher bias and higher variance.
  • They both have no bias, but the mean has lower variance.
  • The mean has no bias but some variance, and the median has non-zero bias but less variance

Solution

The last answer is correct.

The empirical mean of a sample of random $n$ IID random variables is always an unbiased estimate of the true mean. However, the empirical mean estimator can have high variance. Here it is $ \text{Var}(Z) = \frac{\text{Var}(X_i)}{3} = \frac{(100-1)^2 \times 0.02 + (-100 - 1)^2 \times 0.01 + (0-1)^2 \times 0.97}{3} = 99 \frac 2 3.$

The median, on the other hand, is a biased estimator. It is a little bit hard to calculate exactly, but here goes: $$ median = \begin{cases} 100 & w.p. 0.02^3 + \binom{3}{1} 0.02^2 \times 0.98 \\ -100 & w.p. 0.01^3 + \binom{3}{1} 0.01^2 \times 0.99 \end{cases} $$ If you work this out, you see that the median on average is $0.089$. This means that the $\text{bias}^2 \approx (1-0.089)^2$ which is no more than 1. Using a similar argument, you can check that the variance of the median is no more than 20. This can be checked experimentally!

Derivation of Bias-Variance Tradeoff eqaution

Assume that we have noisy data, modeled by $f = y + \epsilon$, where $\epsilon \in \mathcal{N}(0,\sigma)$. Given an estimator $\hat{f}$, the squared error can be derived as follows:

$$ \begin{align} \mathbb{E}\left[\left(\hat{f} - f\right)^2\right] &= \mathbb{E}\left[\hat{f}^2 - 2f\hat{f} + f^2\right]\\ &= \mathbb{E}\left[\hat{f}^2\right] + \mathbb{E}\left[f^2\right] - 2\mathbb{E}\left[f\hat{f}^2\right] \text{ By linearity of expectation} \\ \end{align} $$

Now, by definition, $Var(x) = \mathbb{E}\left[x^2\right] - \left(\mathbb{E}\left[x\right]\right)^2$. Subsituting this definition into the eqaution above, we get: $$ \begin{align} \mathbb{E}\left[\hat{f}^2\right] + \mathbb{E}\left[f^2\right] - 2\mathbb{E}\left[f\hat{f}^2\right] &= Var(\hat{f}) + \left(\mathbb{E}[\hat{f}]\right)^2 + Var(f) + \left(\mathbb{E}[f]\right)^2 - 2f\mathbb{E}[\hat{F}^2] \\ &= Var(\hat{f}) + Var(f) + \left(\mathbb{E}[\hat{f}] - f\right)^2\\ &= \boxed{\sigma + Var(\hat{f}) + \left(\mathbb{E}[\hat{f}] - f\right)^2} \end{align} $$

The first term $\sigma$ is the irreducible error due to the noise in the data (from the distribution of $\epsilon$). The second term is the variance of the estimator $\hat{f}$ and the final term is the bias of the estimator. There is an inherent tradeoff between the bias and variance of an estimator. Generally, more complex estimators (think of high-degree polynomials as an example) will have a low bias since they will fit the sampled data really well. However, this accuracy will not be maintained if we continued to resample the data, which implies that the variance of this estimator is high.

Activity 1: Bias Variance Tradeoff

We will now see try to see the inherent tradeoff between bias and variance of estimators through linear regression. Consider the following dataset.


In [8]:
import numpy as np
import matplotlib.pyplot as plt
from numpy.matlib import repmat

from sklearn
degrees = [1,2,3,4,5]


#define data
n = 20
sub = 1000
mean = 0
std = 0.25

#define test set
Xtest = np.random.random((n,1))*2*np.pi
ytest = np.sin(Xtest) + np.random.normal(mean,std,(n,1))

#pre-allocate variables
preds = np.zeros((n,sub))
bias = np.zeros(len(degrees))
variance = np.zeros(len(degrees))
mse = np.zeros(len(degrees))
values = np.expand_dims(np.linspace(0,2*np.pi,100),1)

Let's try several polynomial fits to the data:


In [ ]:
for j,degree in enumerate(degrees):
    
    for i in range(sub):
            
        #create data - sample from sine wave     
        x = np.random.random((n,1))*2*np.pi
        y = np.sin(x) + np.random.normal(mean,std,(n,1))
        
        #TODO
        #create features corresponding to degree - ex: 1, x, x^2, x^3...
        A = 
        
        #TODO:        
        #fit model using least squares solution (linear regression)
        #later include ridge regression/normalization
        coeffs = 
                
        #store predictions for each sampling
        preds[:,i] = poly.fit_transform(Xtest).dot(coeffs)[:,0]
        
        #plot 9 images
        if i < 9:
            plt.subplot(3,3,i+1)
            plt.plot(values,poly.fit_transform(values).dot(coeffs),x,y,'.b')

    plt.axis([0,2*np.pi,-2,2])
    plt.suptitle('PolyFit = %i' % (degree))
    plt.show()

    #TODO
    #Calculate mean bias, variance, and MSE (UNCOMMENT CODE BELOW!)
    #bias[j] = 
    #variance[j] = 
    #mse[j] =

Let's plot the data with the estimators!


In [ ]:
plt.subplot(3,1,1)
plt.plot(degrees,bias)
plt.title('bias')
plt.subplot(3,1,2)
plt.plot(degrees,variance)
plt.title('variance')
plt.subplot(3,1,3)
plt.plot(degrees,mse)
plt.title('MSE')
plt.show()