We will talk about different validation techniques used to do model selection. The goal of model selection is to use the available data to choose the model (from a class/set of models) that will work best in the future. One important cornerstone of model selection is that historical performance does not imply future performance. As we saw before, optimizing for historical performance can often cause overfitting, where the model's future performance is far worse than its historical performance. The term generalization error describes the error of a model on previously unseen data (i.e., future data).
We start the discussion with one way to select models: we would like the model which loses the least information relative to the true reality. We will denote truth as $f$ and have a set of models $g_i( x, \theta )$ where $\theta$ denotes the model parameters.
We define the difference in information between two distributions as:
$$ KL( f \,||\, g ) = \int_x f(x) \cdot \log\left(\frac{f(x)}{g(x)}\right) dx $$
This is the Kullback–Leibler divergence and denotes the amount of information which must be added to $g$ to get to $f$ (or, equivalently, the information which is lost by using $g$ when $f$ is reality).
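As a concrete illustration, here is a minimal sketch of computing the KL divergence numerically, assuming two hypothetical discrete distributions `f` and `g` over the same three-point support (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical discrete distributions over the same support
f = np.array([0.5, 0.3, 0.2])   # "truth"
g = np.array([0.4, 0.4, 0.2])   # candidate model

# KL(f || g) = sum_x f(x) * log(f(x) / g(x))
kl = np.sum(f * np.log(f / g))
print(f"KL(f || g) = {kl:.4f}")  # nonnegative, zero only when f == g
```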
In general, machine learning algorithms are trained with data (hence the learning part), so we will explicitly denote this observed data as $y$. This data is actual data observed from reality, so $y \sim f$ since $y$ is a sample dataset from reality $f$. Now we would like to select the model, using the data $y$, that minimizes the information loss from reality: $$ \min \int_x f(x) \cdot \log\left( \frac{f(x)}{g(x, \theta(y))} \right) dx = \min \left[ \int_x f(x) \cdot \log( f(x) )\, dx - \int_x f(x) \cdot \log( g(x, \theta(y)) )\, dx \right] $$
We see that the first term involves only reality $f$, so we can ignore it since it is a constant with respect to the different possible $g_i( x, \theta(y) )$ models we are selecting from. This means we end up minimizing: $$ \min \; - \int_x f(x) \cdot \log[ g(x,\theta(y)) ]\, dx $$
Let us define $\theta_0$ as the parameters for $g$ which minimize the information loss according to the above equation. This is in essence the best $\theta$ for the model in terms of information loss.
We don't really know $f$ (otherwise we would not be trying to model it). So we will instead try to minimize the expected information loss with respect to the data we have seen: $$ \min E_y\left[ - \int_x f(x) \cdot \log[ g(x,\theta(y)) ]\, dx \right] = \min \; - E_y \left[ E_x \left[ \log g(x,\theta(y) ) \right] \right] $$
One idea is to simply use the maximum likelihood estimator (MLE) for $\theta_0$. We will call the MLE $\hat{\theta}(y)$ and plug it in, estimating the inner expectation with the maximized log-likelihood: $$ E_y \left[ E_x \left[ \log g(x,\hat{\theta}(y) ) \right] \right] \approx \log L(\hat{\theta}(y) \,|\, \text{data}) $$
where $L(\theta \,|\, \text{data})$ is the likelihood. However, it turns out that this estimate is BIASED.
It can be shown that, under certain conditions, the bias of the MLE plug-in estimate is approximately $K$, where $K$ is the number of parameters in the model. This leads us to an approximately unbiased estimator of the information loss of a model $g_i$: $$ loss = -\log L(\hat{\theta} \,|\, \text{data}) + K $$
This is the Akaike Information Criterion (AIC) for a set of models. We choose the model with the minimum AIC as our estimate of the model which has the smallest information loss.
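Below is a minimal sketch of this AIC-style selection, assuming hypothetical 1-D Gaussian data and two candidate models (one with a fixed mean and $K=1$, one with a free mean and $K=2$); the data and models are illustrative, not taken from the notes above:

```python
import numpy as np
from scipy import stats

# Hypothetical sample dataset y ~ f
rng = np.random.default_rng(0)
y = rng.normal(loc=0.5, scale=1.0, size=200)

def loss_fixed_mean(y):
    # Model 1: mean fixed at 0, variance estimated by MLE (K = 1 parameter)
    sigma = np.sqrt(np.mean(y ** 2))
    log_lik = stats.norm.logpdf(y, loc=0.0, scale=sigma).sum()
    return -log_lik + 1

def loss_free_mean(y):
    # Model 2: mean and variance both estimated by MLE (K = 2 parameters)
    mu, sigma = y.mean(), y.std()
    log_lik = stats.norm.logpdf(y, loc=mu, scale=sigma).sum()
    return -log_lik + 2

# Choose the model with the smaller estimated information loss
print("fixed-mean loss:", loss_fixed_mean(y))
print("free-mean loss :", loss_free_mean(y))
```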
Another way to choose a model is to actually try to estimate the future performance of the model. In an ideal world, we would have access to $f$ directly and could in fact ask for a sample dataset $y \sim f$ at any time. Then we could simply ask for enough sample datasets to truly characterize the performance of a model $g(x,\theta(y))$ in the reality $f$.
This basic idea leads us to cross-validation as an option. Here, we take our entire single dataset $y \sim f$ and split it into a train group and a test group. Denote the test group as $y_{test}$ and the train group as $y_{train}$. Now we judge the performance of any particular model $g(x,\theta(y_{train}))$ using the test group: $$performance = g( y_{test}, \theta(y_{train}) )$$
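A minimal sketch of a single train/test split, assuming a hypothetical linear-regression dataset and using scikit-learn's `train_test_split` (the data and choice of model are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical regression dataset y ~ f
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit theta on the training group, judge performance on the held-out test group
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
performance = mean_squared_error(y_test, model.predict(X_test))
print(f"held-out test MSE: {performance:.4f}")
```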
Now, since this is an estimate of performance, we might want to take more than one measurement of the performance. Rather than splitting into a single test/train grouping, we can split the dataset into a set of folds (sections) and use each fold as a test dataset while using the other folds as training data. In this way we get $k$ observations of the performance. In the extreme case we get leave-one-out cross-validation, where each datapoint is its own fold and we measure the performance with each single datapoint as the test point. For such k-fold cross-validation we generally choose the model with the best expected performance given the observed performances across the folds.
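A minimal sketch of k-fold and leave-one-out cross-validation, again on hypothetical regression data and assuming scikit-learn's `KFold`, `LeaveOneOut`, and `cross_val_score` helpers:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Hypothetical regression dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

kfold_mse = -cross_val_score(ridge, X, y, cv=kfold, scoring="neg_mean_squared_error")
loo_mse = -cross_val_score(ridge, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")

print("5-fold mean MSE:", kfold_mse.mean())   # k observations of performance
print("LOO    mean MSE:", loo_mse.mean())     # n observations, one per datapoint
```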
A second approach is to randomly sample test and training datasets from $y$ a number of times and estimate the performance as the average performance over the sampled test/train splits. This saves us from having to iterate over all of the k folds. It also allows us to generate test/train datasets that are representative of the original dataset but do not appear anywhere in it (if we use sampling with replacement, we can make some arguments about the distribution over the test/train splits, provided the original dataset was sampled from reality and was large enough).
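A minimal sketch of repeated random test/train splits, here via scikit-learn's `ShuffleSplit` on hypothetical data. Note that `ShuffleSplit` samples the splits without replacement; a bootstrap variant would resample with replacement as discussed above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Hypothetical regression dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 20 independent random test/train splits, 25% held out each time
splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
mse = -cross_val_score(Ridge(alpha=1.0), X, y, cv=splitter,
                       scoring="neg_mean_squared_error")

print("mean MSE over 20 random splits:", mse.mean())
```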
We will now consider the setting of empirical risk minimization. Suppose that we have a supervised learning algorithm. There exists a distribution over data points $x$ and labels $y$ defined as $P(x,y)$. Now our algorithm has a training dataset $Z=\{(x_i,y_i)\} \sim P$ of $|Z| = n$ samples. We further have a loss function $L(\hat{y},y)$ which measures the difference of a predicted $\hat{y}$ from a true $y$. We define the risk as the expectation of the loss function: $$R(g) = E_{(x,y) \sim P}\left[ L( g(x), y ) \right]$$
However, we do not actually know $P$, so we will minimize the empirical risk instead: $$ R_{emp}(g) = \frac{1}{n}\sum_{(x_i,y_i) \in Z} L( g(x_i), y_i ) $$
We will choose the model $g$ which minimizes this empirical risk.
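A minimal sketch of empirical risk minimization for a hypothetical linear model $g(x) = x^\top w$ with squared-error loss, minimizing $R_{emp}$ by plain gradient descent (the data, model, and optimizer are all illustrative assumptions):

```python
import numpy as np

# Hypothetical training dataset Z = {(x_i, y_i)} of n samples
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

def empirical_risk(w, X, y):
    # R_emp(g) = (1/n) * sum_i L(g(x_i), y_i) with squared-error loss
    return np.mean((X @ w - y) ** 2)

# Minimize the empirical risk over w by gradient descent
w = np.zeros(3)
for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w -= 0.1 * grad

print("learned w:", w)
print("empirical risk:", empirical_risk(w, X, y))
```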
In this setting, there is a direct connection between a version of stability and a notion of generalizability in a strict sense: the more stable a model, the more general it is (so higher stability implies lower generalization error).