The training error will tend to decrease as we increase the degree d of the polynomial.
At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.
High bias (underfitting): both $J_{train}(\theta)$ and $J_{cv}(\theta)$ will be high. Also, $J_{cv}(\theta)$≈$J_{train}(\theta)$.
High variance (overfitting): $J_{train}(\theta)$ will be low and $J_{cv}(\theta)$ will be much greater than $J_{train}(\theta)$.
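As a hedged sketch of this behaviour, the snippet below fits polynomials of several degrees to a small synthetic dataset (the data, the degree range, and the 70/30 split are assumptions for illustration) and prints $J_{train}(\theta)$ and $J_{cv}(\theta)$ as mean squared errors; the training error typically keeps falling as the degree grows, while the cross-validation error eventually turns back up.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

for d in (1, 2, 4, 8, 12):
    poly = PolynomialFeatures(degree=d)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    j_train = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    j_cv = mean_squared_error(y_cv, model.predict(poly.transform(X_cv)))
    print(f'd={d:2d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}')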
Just because a learning algorithm fits a training set well does not mean it is a good hypothesis. It could overfit, and as a result its predictions on the test set would be poor. The error of your hypothesis as measured on the data set used to train the parameters will be lower than the error on any other data set.
Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. To choose the model for your hypothesis, train a model for each degree of polynomial and compare the resulting errors (see the sketch after the split below).
One way to break down our dataset into the three sets is:
Training set: 60%
Cross validation set: 20%
Test set: 20%
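One way to put the 60/20/20 split and the model-selection procedure together is sketched below; the synthetic data, the degree range 1–10, and the use of scikit-learn's train_test_split, PolynomialFeatures, and LinearRegression are assumptions for illustration. The degree is chosen by its cross-validation error, and the generalization error of that choice is then reported on the held-out test set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# 60% training, 20% cross-validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_cv_error = None, np.inf
for d in range(1, 11):
    poly = PolynomialFeatures(degree=d)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    cv_error = mean_squared_error(y_cv, model.predict(poly.transform(X_cv)))
    if cv_error < best_cv_error:
        best_degree, best_cv_error = d, cv_error

# refit the selected degree and estimate its generalization error on the test set
poly = PolynomialFeatures(degree=best_degree)
model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
test_error = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
print(best_degree, best_cv_error, test_error)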
One way to address overfitting is regularization. The intuition: take a high-order hypothesis such as
$\theta_0 + \theta_1x + \theta_2x^2 + \underline{\theta_3x^3} + \underline{\theta_4x^4}$
and penalize $\theta_3$ and $\theta_4$ so that they end up really small:
$\begin{equation}\begin{split}\min_\theta\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2\end{split}\end{equation}$
More generally, we add a regularization term weighted by $\lambda$ to the cost function (the bias term $\theta_0$ is not regularized):
$\begin{equation}\begin{split}J(\theta) = \frac{1}{2m}[ \sum_{i=1}^{m} (h_\theta(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2]\end{split}\end{equation}$
The corresponding gradient descent update is:
$\begin{equation}\begin{split}\theta_j := \theta_j - \alpha [\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j]\\{(j=1,2,\dots,n)}\end{split}\end{equation}$
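Below is a minimal numpy sketch of this update for linear regression; the function and variable names (gradient_descent_step, X, y, theta, alpha, lam) are assumptions, and the bias term $\theta_0$ (the first entry of theta) is deliberately left out of the penalty, matching the $j \geq 1$ range above.

import numpy as np

def gradient_descent_step(theta, X, y, alpha=0.01, lam=1.0):
    m = len(y)
    h = X @ theta                      # h_theta(x) for every training example
    grad = (X.T @ (h - y)) / m         # 1/m * sum((h - y) * x_j), the unregularized gradient
    grad[1:] += (lam / m) * theta[1:]  # add (lambda/m) * theta_j for j >= 1 only
    return theta - alpha * grad        # simultaneous update of all theta_j

# toy usage: X carries a leading column of ones for theta_0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1, lam=0.1)
print(theta)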
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
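A hedged sketch of 5-fold cross-validation with scikit-learn follows; the toy linear dataset and the choice of LinearRegression with mean-squared-error scoring are assumptions for illustration.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X.ravel() + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring='neg_mean_squared_error')
print(-scores)         # mean squared error on each of the 5 validation folds
print(-scores.mean())  # single estimate averaged over the folds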
MinMaxScaler rescales each feature to a given range, $[0, 1]$ by default.
In [3]:
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data), end='\n\n')        # learn per-feature min and max
print(scaler.data_max_, end='\n\n')        # maximum of each feature: [ 1. 18.]
print(scaler.transform(data), end='\n\n')  # each feature rescaled to [0, 1]
print(scaler.transform([[2, 2]]))          # values outside the fitted range can fall outside [0, 1]
StandardScaler standardizes each feature to zero mean and unit variance, $ x' = \frac{x - \bar{x}}{\sigma} $, where $ \bar{x} $ is the mean and $ \sigma $ is the standard deviation.
In [5]:
from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print(scaler.fit(data), end='\n\n')        # learn per-feature mean and standard deviation
print(scaler.mean_, end='\n\n')            # mean of each feature: [0.5 0.5]
print(scaler.transform(data), end='\n\n')  # standardized data: zero mean, unit variance per feature
print(scaler.transform([[2, 2]]))          # (2 - 0.5) / 0.5 = 3.0 for both features
Normalizer rescales each sample (row) to unit norm, $ x' = \frac{x}{\left\lVert x \right\rVert} $, where $ \left\lVert x \right\rVert $ is the norm of vector $ x $ (the Euclidean norm by default).
In [21]:
import numpy as np
from sklearn.preprocessing import Normalizer
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
normalizer = Normalizer()
print(normalizer.fit(data), end='\n\n')        # fit is a no-op for Normalizer; kept for API consistency
print(normalizer.transform(data), end='\n\n')  # each row divided by its Euclidean norm
print(np.linalg.norm(normalizer.transform(data), axis=1))  # every row now has norm 1
On the importance of feature scaling: sklearn example
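Below is a small illustrative sketch (not the scikit-learn example itself) of why scaling matters: a distance-based classifier such as k-nearest neighbours typically performs noticeably better when its inputs are standardized first. The dataset and model choices here are assumptions for illustration.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_train, y_train)                               # unscaled features
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)  # standardized features

print('without scaling:', raw.score(X_test, y_test))
print('with scaling:   ', scaled.score(X_test, y_test))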