Suppose we are able to take a sample of size $N$, from an infinite bin, in an attempt to determine the bin frequency, $\mu$, which is unknown. Hoeffding's Inequality allows us to calculate a probability bound between a sample frequency, $\nu$, and the unknown bin frequency, $\mu$. This is given by:
$$\mathbb{P} \left[ \left| \nu - \mu \right| > \epsilon \right] \le 2e^{-2\epsilon^2 N}$$This is valid for all $N$ and $\epsilon$ and the bound does not depend on $\mu$. It shows how $\nu$ generalizes to $\mu$ for a single hypothesis. It also illustrates the tradeoff between $N, \epsilon$, and the bound.
In practice, however, the multiple hypothesis case is what we are really interested in. Hoeffding's Inequality can be generalised to the case of multiple bins (and therefore multiple hypotheses). Using the approach of the 'union bound', the probability bounds for the multiple bin case can be shown to be:
$$\mathbb{P} \left[ \left| E_{in}(h) - E_{out}(h) \right| > \epsilon \right] \le 2Me^{-2\epsilon^2 N}$$where $E_{in}(h) = \nu$ and $E_{out}(h) = \mu$, and $M$ is the number of bins.
This result is useful when we have a finite number of bins, but this is not the case when $M \rightarrow \infty$, i.e. we have an infinite number of hypothesis sets, as in cases such as linear regression. There is a way to deal with this problem by finding a tighter bound on the probability, via the Vapnik Chervonenkis inequality.
The Vapnik-Chervonenkis (VC) Inequality, addresses the issue of "infinite hypothesis sets" in the multiple bin Hoeffding's Inequality by removing $M$, and introducing a growth function, $m_\mathcal{H}(N)$. The growth function counts the most number of dichotomies on any $N$ points, and can be written as:
$$m_\mathcal{H}(N) = \underset{\mathbf{x_1,\dots,x_N} \in \mathcal{X}}{\max} \left| \mathcal{H}\left(\mathbf{x_1,\dots,x_N}\right)\right|$$The growth function satisfies:
$$m_\mathcal{H}(N) \le 2^N$$In practice, however, the inequality, $m_\mathcal{H}(N) \le 2^N$ on the growth function may sometimes be too pessimistic. A possible situation that we might see could look like the following:
In the first scenario, we say that the $N$ points are fully shattered by $\mathcal{H}$. In the second scenario, we say $k$ is a break point for $\mathcal{H}$ because no dataset of size $k$ can be 'shattered' by $\mathcal{H}$.
Now if a break point, $k$, exists, it can be shown that the growth function is polynomial (rather than the exponential $m_\mathcal{H}(N) = 2^N$), and satisfies the following inequality:
$$m_\mathcal{H}(N) \le B(N,k) \le \sum\limits_{i=0}^{k-1} \left(\begin{array}{c} N \\ i \end{array}\right)$$$$m_\mathcal{H}(N) \le B(N,k) \le \sum\limits_{i=0}^{d_{VC}} \left(\begin{array}{c} N \\ i \end{array}\right)$$which is a polynomial with a maximum power of $N^{d_{VC}}$ where $d_{VC}$ is the VC dimension.
The removal of the quantity, $M$ (which was introduced via the union bound approach), means that the issue where $M \rightarrow \infty$ (infinite hypothesis sets) is no longer relevant. When the growth function is introduced, we arrive at the VC inequality given by (proof of the VC inequality is available here):
$$\mathbb{P}\left[ \left| E_{in}(g) - E_{out}(g) \right| > \epsilon \right] \le 4 m_\mathcal{H}(2N) \exp\left(-\frac{1}{8}\epsilon^2 N\right)$$This is the most important theoretical result in machine learning.
In other words,
If we know that the VC dimension is finite, then we can say that we can generalize (i.e. $g \in \mathcal{H}$ will generalize). This result is:
Let us introduce a probability, $\delta$, given by:
$$\delta = 4m_\mathcal{H}(2N)\exp\left(-\frac{1}{8}\epsilon^2 N\right)$$As we saw earlier, the growth function, $m_\mathcal{H}(N)$, is a polynomial with a maximum power of $N^{d_{VC}}$. To better understand how the probability bounds behave, it helps to make the following approximation $m_\mathcal{H}(N) \approx N^{d_{VC}}$. We thus have:
$$\delta \approx 4 (2N)^{d_{VC}}\exp\left(-\frac{1}{8}\epsilon^2 N\right)$$In the case of large $N$, the term $(2N)^{d_{VC}}$ can become quite large, while on the other hand, the exponential term could become very small. Hence, for numerics, we choose to calculate $\delta$ via its logarithm, i.e.:
$$\ln \delta \approx \ln 4 + d_{VC} \ln (2N) -\frac{1}{8}\epsilon^2 N$$This is implemented in the functions below
In [1]:
using PyPlot
using Optim
In [2]:
function get_ln_delta(N, d_VC, epsilon)
return log(4) + d_VC * log(2N) -(epsilon^2)*N/8
end;
Now lets put some numbers in and see how the probability behaves. Consider a hypothesis, $\mathcal{H}$, with $d_{VC} = 10$. Suppose that we need a generalization error that is at most 0.05 (i.e. $\epsilon = 0.05$), and we need to know this with 95% confidence (i.e. $\delta = 0.05$). In such a case, how many data points, $N$, will we need?
There are two ways to solve this--one way is to try out a few values of $N$ and see which corresponds to the preferred value of $\delta$, i.e.:
In [3]:
d_VC = 10;
epsilon = 0.05;
N_arr = [400000. 420000. 440000. 460000. 480000.];
for N in N_arr
delta = exp(get_ln_delta(N, d_VC, epsilon));
println("N = $N, delta = $delta")
end
A more general approach, however, would be to solve for $N$. First, an 'error function' needs to be defined (we choose the squared error), and the optimize
function in Julia will optimize $N$ to the value that minimizes the 'error function' that we just specified, i.e.
In [4]:
function get_error(N, d_VC, epsilon, target_delta)
delta = exp(get_ln_delta(N, d_VC, epsilon));
return (delta - target_delta)^2;
end;
result = optimize(N -> get_error(N, 10, 0.05, 0.05), 400000, 500000)
Out[4]:
In [5]:
N = result.minimum;
err = result.f_minimum;
if result.converged == true
println("Solution: N = $N (error = $err)")
end
The probability, $\delta$ can be inverted to find the "level of tracking" (or tolerance), $\epsilon$:
$$\delta = 4 m_\mathcal{H}(2N) \exp\left(-\frac{1}{8}\epsilon^2 N\right)$$$$\ln \frac{\delta}{4 m_\mathcal{H}(2N)} = -\frac{1}{8}\epsilon^2 N$$$$\epsilon = \underbrace{\sqrt{\frac{8}{N} \ln \frac{4 m_\mathcal{H}(2N)}{\delta}}}_{\Omega}$$The "good event", which happens with a probability greater or equal to $1-\delta$, is that $E_{in}$ tracks $E_{out}$:
$$\left|E_{out} - E_{in}\right| \le \Omega\left(N, \mathcal{H}, \delta\right)$$We can also say with the same probability (note that generally, $E_{in}$ will be smaller than $E_{out}$, i.e. the in-sample error tends to be smaller than the out-of-sample error):
$$ E_{out} - E_{in} \le \Omega\left(N, \mathcal{H}, \delta\right)$$$$ E_{out} \le E_{in} + \Omega$$Here, $E_{out} - E_{in}$ can be viewed as a generalization error. The tradeoff here is that with a bigger hypothesis set (i.e. more parameters), $E_{in}$ will decrease, but $\Omega$ (the generalization error) will increase. There is a way to find the "optimum" tradeoff, which is a technique called "regularization".
We may wish to compare the original VC bounds to other bounds:
Using the approximation, $m_\mathcal{H}(N) \approx N^{d_{VC}}$, and splitting the logarithmic terms for numerics, we have:
In [6]:
function get_vc_bound(N,d_VC,delta)
inner = log(4) + d_VC * log(2N) - log(delta);
return sqrt(8inner/N);
end;
function get_rademacher_penalty(N,d_VC,delta)
inner = 2(log(2N) + d_VC * log(N))/N
return sqrt(inner) + sqrt(-2log(delta)/N) + 1/N;
end;
function get_parrondo_and_van_den_broek(N,d_VC,delta)
pvdb_err(epsilon,N,d_VC,delta) = (epsilon - sqrt((2epsilon + log(6) + d_VC*log(2N) - log(delta))/N))^2;
result = optimize(epsilon -> pvdb_err(epsilon,N,d_VC,delta), 0.0, 10.0);
if result.converged==true
return result.minimum
else
return NaN
end;
end;
function get_devroye(N,d_VC,delta)
devroye_err(epsilon,N,d_VC,delta) = (epsilon - sqrt((4epsilon*(1+epsilon) + log(4) + d_VC*log(N^2) - log(delta))/(2N)))^2;
result = optimize(epsilon -> devroye_err(epsilon,N,d_VC,delta), 0.0, 20.0);
if result.converged==true
return result.minimum
else
return NaN
end;
end;
In [7]:
d_VC = 50;
delta = 0.05;
In [8]:
#N_arr = logspace(0,4,11); #from 1 to 10k
N_arr = [5, 10, 50, 100, 500, 1000, 5000, 10000];
epsilon_vc = [ get_vc_bound(N,d_VC,delta) for N in N_arr ];
epsilon_rp = [ get_rademacher_penalty(N,d_VC,delta) for N in N_arr ];
epsilon_pv = [ get_parrondo_and_van_den_broek(N,d_VC,delta) for N in N_arr ];
epsilon_dv = [ get_devroye(N,d_VC,delta) for N in N_arr ];
In [9]:
function plot_epsilons(N_arr,epsilon_arr,label_arr)
fig = figure(figsize=(7,4))
ax = fig[:add_subplot](1,1,1)
for i in 1:size(label_arr)[1]
#ax[:loglog](N_arr,epsilon_arr[:,i],label=label_arr[i]);
loglog(N_arr,epsilon_arr[:,i],"x-",label=label_arr[i]);
end;
#ax[:legend](loc="best",frameon=false);
legend(loc="best",frameon=false,fontsize=10)
grid(true);
end;
In [10]:
epsilon_arr = [ epsilon_vc epsilon_rp epsilon_pv epsilon_dv ];
label_arr = [ "VC", "Rademacher", "Parrondo & Van den Broek", "Devroye" ];
plot_epsilons(N_arr,epsilon_arr,label_arr);
What we are seeing here is that all of these bounds follow roughly the same kind of trends. And while it is possible to improve on the VC bound, the maximum improvement here is an improvement by a factor of about 2.5 ~ 3. For example, we could do a comparison for small $N$ and large $N$ as follows:
Comparison of Bounds for large $N$ = 10,000
In [11]:
println("N = $(N_arr[end])")
println("ϵ (VC) = $(epsilon_vc[end])")
println("ϵ (Rademacher) = $(epsilon_rp[end])")
println("ϵ (Parrondo & Van den Broek) = $(epsilon_pv[end])")
println("ϵ (Devroye) = $(epsilon_dv[end])")
Comparison of Bounds for small $N$ = 5
In [12]:
println("N = $(N_arr[1])")
println("ϵ (VC) = $(epsilon_vc[1])")
println("ϵ (Rademacher) = $(epsilon_rp[1])")
println("ϵ (Parrondo & Van den Broek) = $(epsilon_pv[1])")
println("ϵ (Devroye) = $(epsilon_dv[1])")
Let's write $E_{out}$ as an expected value of the squared error between the hypothesis, $g^{(\mathcal{D})}(\mathbf{x})$ and the target function, $f(\mathbf{x})$, i.e.:
$$E_{out}\left(g^{(\mathcal{D})}\right) = \mathbb{E}_{\mathbf{x}} \left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$$where we have introduced the symbol $\mathcal{D}$ to say that this is specific to a particular dataset (if a different dataset is used, we would arrive at a different target function).
Now we wish to get rid of the dependency on the particular dataset, $\mathcal{D}$. This is done by repeating this over multiple datasets so that we can calculate an 'expected value':
$$\mathbb{E}_\mathcal{D}\left[ E_{out}\left(g^{(\mathcal{D})}\right) \right] = \mathbb{E}_\mathcal{D}\left[ \mathbb{E}_{\mathbf{x}} \left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] \right]$$$$\mathbb{E}_\mathcal{D}\left[ E_{out}\left(g^{(\mathcal{D})}\right) \right] = \mathbb{E}_{\mathbf{x}} \left[ \mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] \right]$$where in the last step, the order of expectations has been 'reversed'. We are allowed to do this because we are 'integrating', where the integrand is strictly non-negative.
Let us consider the case where we are learning from multiple datasets. From each dataset that we 'learn' from, we get a hypothesis. If we take the 'average' of each of these hypotheses, we will arrive at the concept of an 'average' hypothesis, $\bar g(\mathbf{x})$. If we have $K$ datasets, $\mathcal{D_1, D_2, \dots D_K}$, then we have:
$$\bar g(\mathbf{x}) = \frac{1}{K} \sum\limits_{k=1}^K g^{(\mathcal{D_k})}\left(\mathbf{x}\right) $$$$\bar g(\mathbf{x}) = \mathbb{E}_\mathcal{D}\left[ g^{(\mathcal{D})}\left(\mathbf{x}\right) \right] $$Now we expand the following quantity
$$\mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] = \mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - \bar g(\mathbf{x}) + \bar g(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$$$$= \mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - \bar g(\mathbf{x})\right)^2 + \left( \bar g(\mathbf{x}) - f(\mathbf{x})\right)^2 + 2\left( g^{(\mathcal{D})}(\mathbf{x}) - \bar g(\mathbf{x})\right) \left( \bar g(\mathbf{x}) - f(\mathbf{x})\right) \right]$$Let us now look at the 'cross term', $\mathbb{E}_\mathcal{D}\left[ 2\left( g^{(\mathcal{D})}(\mathbf{x}) - \bar g(\mathbf{x})\right) \left( \bar g(\mathbf{x}) - f(\mathbf{x})\right) \right]$, the second factor is a constant, and that the first factor is zero because $\mathbb{E}_\mathcal{D}\left[ g^{(\mathcal{D})}(\mathbf{x}) \right] = \bar g(\mathbf{x})$ we can see that the entire 'cross term' is zero (the reason why the 'squared error' approach was chosen earlier was because with this approach, it could be shown that the cross term would be zero, i.e. it 'decomposes cleanly'). Hence we arrive at the following expression which shows the relationship between:
$$\mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] = \underbrace{ \mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - \bar g(\mathbf{x})\right)^2 \right] }_{\text{var}(\mathbf{x})} + \underbrace{ \left( \bar g(\mathbf{x}) - f(\mathbf{x})\right)^2 }_{\text{bias}(\mathbf{x})}$$Hence we have:
$$\mathbb{E}_\mathcal{D}\left[ E_{out}\left(g^{(\mathcal{D})}\right) \right] = \mathbb{E}_{\mathbf{x}} \left[ \mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right] \right]$$$$\mathbb{E}_\mathcal{D}\left[ E_{out}\left(g^{(\mathcal{D})}\right) \right] = \mathbb{E}_{\mathbf{x}} \left[ \text{bias}(\mathbf{x}) + \text{var}(\mathbf{x}) \right] = \text{bias + variance}$$where we have introduced the following 'definitions' for bias and variance:
$$\text{bias} = \mathbb{E}_{\mathbf{x}} \left[ \left( \bar g(\mathbf{x}) - f(\mathbf{x})\right)^2 \right]$$$$\text{variance} = \mathbb{E}_{\mathbf{x}} \left[ \mathbb{E}_\mathcal{D}\left[ \left( g^{(\mathcal{D})}(\mathbf{x}) - \bar g(\mathbf{x})\right)^2 \right] \right]$$In the bias variance trade-off, we typically plot the following quantities
Note:
Here, we examine the bias-variance tradeoff using a simple example.
Consider the following target function:
$f: \left[-1,1\right] \rightarrow \mathbb{R}$ given by $f(x) = \sin\left(\pi x\right)$
Assume an input probability distribution uniform on [-1, 1] and that we use only 2 examples for the training set to find the best hypothesis of the following form:
$$h(x) = ax$$Suppose we choose a least squared error measure:
$$e(\mathbf{x}) = \frac{1}{N} \sum\limits_{i=1}^N \left[ h(x_i) - \sin(\pi x_i) \right]^2$$$$e(\mathbf{x}) = \frac{1}{N} \sum\limits_{i=1}^N \left[ ax_i - \sin(\pi x_i) \right]^2$$Setting the derivative to zero
$$\frac{\partial e(\mathbf{x})}{\partial a} = \frac{2}{N} \sum\limits_{i=1}^N x_i \left[ ax_i - \sin(\pi x_i) \right] = 0$$$$\sum\limits_{i=1}^N \left[ ax_i^2 - x_i \sin(\pi x_i) \right] = 0 \Leftrightarrow \sum\limits_{i=1}^N ax_i^2 = \sum\limits_{i=1}^N x_i \sin(\pi x_i)$$$$a = \frac{\sum\limits_{i=1}^N x_i \sin(\pi x_i)}{\sum\limits_{i=1}^N x_i^2} = \frac{\sum\limits_{i=1}^N x_i y_i}{\sum\limits_{i=1}^N x_i^2}$$
In [13]:
function generate_dataset(;n_points=200)
x = linspace(-1,1,n_points);
y = sin(π*x);
x_sample = rand(2);
y_sample = sin(π*x_sample);
return x,y,x_sample,y_sample;
end;
function fit_a(x_sample,y_sample)
sum = x_sample[1]*y_sample[1]+x_sample[2]*y_sample[2]
return sum/(x_sample[1]^2+x_sample[2]^2);
end;
function predict_y_a(x,a)
return a .* x;
end;
In [14]:
function plot_xy(x,y,x_sample,y_sample,y_pred)
figure(figsize=(7,5));
plot(x,y,"m-",label=L"$\sin(\pi x)$");
plot(x,y_pred,"g--",label="predicted");
plot(x_sample,y_sample,"bo",label="sample data");
grid(true);
legend(loc="best",frameon=false);
end;
x,y,x_sample,y_sample = generate_dataset();
y_pred = predict_y_a(x,fit_a(x_sample,y_sample));
plot_xy(x,y,x_sample,y_sample,y_pred);
We now run 1 million trials to obtain $\bar g(\mathbf{x})$. This will give us the coefficient, $a$, of the 'best possible' hypothesis, $h(\mathbf{x}) = a\mathbf{x}$
$$\bar g(\mathbf{x}) = \mathbb{E}_\mathcal{D}\left[ g^{(\mathcal{D})}\left(\mathbf{x}\right) \right] = \lim\limits_{K\rightarrow\infty} \frac{1}{K} \sum\limits_{k=1}^K g^{(\mathcal{D_k})}\left(\mathbf{x}\right)$$
In [15]:
function get_avg_a(n_trials; n_points=200)
a_arr = Float64[];
for i in 1:n_trials
x_sample = rand(2);
y_sample = sin(π*x_sample);
a = fit_a(x_sample,y_sample);
push!(a_arr,a);
end
return mean(a_arr);
end;
a = get_avg_a(1e6); #1.423, 1.425, 1.424, 1.427
println("<g(x)> = $(round(a,2))x");
With the $\bar g(\mathbf{x})$ that we have just obtained, we can now calculate the bias and variance as follows:
$$\text{bias} = \mathbb{E}_{\mathbf{x}} \left[ \left( \bar g(\mathbf{x}) - f(\mathbf{x})\right)^2 \right]$$$$\text{variance} = \mathbb{E}_{\mathbf{x}} \left[ \mathbb{E}_\mathcal{D}\left[ \left(\bar g(\mathbf{x}) - g^{(\mathcal{D})}(\mathbf{x})\right)^2 \right] \right]$$
In [16]:
function get_bias_variance(avg_a, n_trials; n_points=200)
bias = Float64[];
var = Float64[];
for i in 1:n_trials
x_sample = rand(2);
y_sample = sin(π*x_sample);
a = fit_a(x_sample,y_sample);
g_bar_x = avg_a*x_sample;
push!(bias,mean((g_bar_x - y_sample).^2));
push!(var,mean((g_bar_x - a .* x_sample).^2));
end
return bias, var;
end;
In [17]:
bias,var = get_bias_variance(a, 1e7);
println("<bias> = $(mean(bias))");
println("<var> = $(mean(var))");