Bias vs. Variance

  • We need to distinguish whether bias or variance is the problem contributing to bad predictions.
  • High bias corresponds to underfitting and high variance to overfitting. Ideally, we need to find a balance (a golden mean) between the two.

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both $J_{train}(\theta)$ and $J_{cv}(\theta)$ will be high. Also, $J_{cv}(\theta) \approx J_{train}(\theta)$.

High variance (overfitting): $J_{train}(\theta)$ will be low and $J_{cv}(\theta)$ will be much greater than $J_{train}(\theta)$.
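
As a rough illustration of this diagnosis, the sketch below fits polynomials of increasing degree to a small synthetic 1-D dataset (the data and model choices here are assumptions for illustration only) and prints the training error next to a 5-fold cross-validation error. Low degrees should show both errors high (high bias); very high degrees should show a low training error with a much larger cross-validation error (high variance).

# Sketch: training error vs. cross-validation error as the polynomial
# degree d grows. The synthetic data below is purely illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + 0.3 * rng.randn(30)

for d in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring='neg_mean_squared_error').mean()
    print('d={}  J_train={:.3f}  J_cv={:.3f}'.format(d, train_mse, cv_mse))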

Model Selection and Train/Validation/Test Sets

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could overfit, and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.

Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.

One way to break down our dataset into the three sets is:

  • Training set: 60%
  • Cross validation set: 20%
  • Test set: 20%
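
A minimal sketch of such a split, assuming scikit-learn's train_test_split and placeholder arrays X and y: split off 40% first, then halve that remainder into the cross-validation and test sets.

# Sketch: a 60/20/20 train / cross-validation / test split via two calls
# to train_test_split. X and y are placeholder arrays.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_cv), len(X_test))  # 60 20 20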

Regularization

Penalize some of the θ parameters to make them really small; here, θ3 and θ4:

$\begin{equation}\begin{split}\min_\theta\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2 + (1000\,\theta_3^2 + 1000\,\theta_4^2)\end{split}\end{equation}$

  • The added penalty term (in parentheses above) is a modification of our cost function to help penalize θ3 and θ4
    • So here we end up with $\theta_3$ and $\theta_4$ being close to zero (because the constants are massive)
    • So we're basically left with a quadratic function

$\theta_0 + \theta_1x + \theta_2x^2 +\underline {\theta_3x^3} + \underline{\theta_4x^4}$

  • In this example, we penalized two of the parameter values
    • More generally, regularization is as follows

Cost function

$\begin{equation}\begin{split}J(\theta) = \frac{1}{2m}\left[ \sum_{i=1}^{m} (h_\theta(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]\end{split}\end{equation}$
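
A minimal NumPy sketch of this cost, assuming the design matrix X already includes a leading column of ones and that $\theta_0$ is excluded from the penalty (all names here are illustrative):

# Sketch: regularized linear-regression cost J(theta).
# X: (m, n+1) with a leading column of ones, y: (m,), theta: (n+1,).
import numpy as np

def regularized_cost(theta, X, y, lam):
    m = len(y)
    residuals = X.dot(theta) - y
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return (np.sum(residuals ** 2) + penalty) / (2 * m)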

  • Note: update $\theta_0$ as before, without the $\frac{\lambda}{m}\theta_0$ term; regularization does not penalize $\theta_0$, so it is treated slightly differently from the other parameters

$\begin{equation}\begin{split}\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]\\(j = 1, 2, \ldots, n)\end{split}\end{equation}$
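
A matching sketch of one gradient descent step under the same assumptions; the only difference from the unregularized update is the extra $\frac{\lambda}{m}\theta_j$ term, which is zeroed out for $\theta_0$:

# Sketch: one regularized gradient descent step. theta_0 is updated
# without the lambda/m term; all other theta_j get the extra shrinkage.
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    m = len(y)
    grad = X.T.dot(X.dot(theta) - y) / m   # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not penalize theta_0
    return theta - alpha * (grad + reg)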

k-fold cross-validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
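
A minimal scikit-learn sketch of 5-fold cross-validation on a small synthetic dataset (the data and model here are assumptions for illustration):

# Sketch: k-fold cross-validation with k = 5. Each fold is used exactly
# once as the validation set; the k scores are then averaged.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X.dot([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores)          # one R^2 score per fold
print(scores.mean())   # averaged cross-validation estimate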

Debugging learning algorithms

  • Getting more training examples: Fixes high variance
  • Trying smaller sets of features: Fixes high variance
  • Adding features: Fixes high bias
  • Adding polynomial features: Fixes high bias
  • Decreasing λ: Fixes high bias
  • Increasing λ: Fixes high variance (λ's effect on bias and variance is sketched in code below)
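
As a rough illustration of the two λ-related points, the sketch below fits a fixed high-degree polynomial with ridge regression (scikit-learn's Ridge, whose alpha parameter plays the role of λ) on a small synthetic dataset and prints the cross-validation error for several regularization strengths; very small values tend toward high variance and very large values toward high bias.

# Sketch: effect of the regularization strength on cross-validation error
# for a deliberately flexible (degree-8) polynomial model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + 0.3 * rng.randn(30)

for alpha in [0.0001, 0.01, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(degree=8), Ridge(alpha=alpha))
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring='neg_mean_squared_error').mean()
    print('alpha={:<8}  J_cv={:.3f}'.format(alpha, cv_mse))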

Feature Scaling

Min-Max Normalization

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

In [3]:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data), end='\n\n')

print(scaler.data_max_, end='\n\n')

print(scaler.transform(data), end='\n\n')

print(scaler.transform([[2, 2]]))


MinMaxScaler(copy=True, feature_range=(0, 1))

[  1.  18.]

[[ 0.    0.  ]
 [ 0.25  0.25]
 [ 0.5   0.5 ]
 [ 1.    1.  ]]

[[ 1.5  0. ]]

Standardization

$$ x' = \frac{x - \bar{x}}{\sigma} $$

where $ \bar{x} $ is the mean and $ \sigma $ is the standard deviation.


In [5]:
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print(scaler.fit(data), end='\n\n')

print(scaler.mean_, end='\n\n')

print(scaler.transform(data), end='\n\n')

print(scaler.transform([[2, 2]]))


StandardScaler(copy=True, with_mean=True, with_std=True)

[ 0.5  0.5]

[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]

[[ 3.  3.]]

Normalization

$$ x' = \frac{x}{\left\lVert x \right\rVert} $$

where $ \left\lVert x \right\rVert $ is the norm of vector $ x $ (the Euclidean, i.e. L2, norm in most cases).


In [21]:
import numpy as np
from sklearn.preprocessing import Normalizer

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
normalizer = Normalizer()
print(normalizer.fit(data), end='\n\n')

print(normalizer.transform(data), end='\n\n')
print(np.linalg.norm(normalizer.transform(data), axis=1))


Normalizer(copy=True, norm='l2')

[[-0.4472136   0.89442719]
 [-0.08304548  0.99654576]
 [ 0.          1.        ]
 [ 0.05547002  0.99846035]]

[ 1.  1.  1.  1.]

On the importance of feature scaling: sklearn example

Outliers