Asymptotic Analysis and Consistency

February 2017

In this notebook we discuss the asymptotic properties of decision rules. After a general introduction we restrict attention to empirical loss minimizing decision rules, establish sufficient conditions for consistency to hold and discuss the rate at which the convergence takes place.

  • Consistency of a decision rule states that the loss of the action it selects converges in probability to the minimal loss attainable within the action class as the sample size grows to infinity.
  • In the case of empirical loss minimization, consistency is tightly linked to the uniform law of large numbers that we present in detail.
  • We introduce tail and concentration bounds to establish the uniform law of large numbers and analyze the (non-asymptotic) rate at which the convergence takes place.

Introduction

Knowing the true mechanism $P$ while facing a decision problem $(\mathcal{H},L,\mathcal{A})$ naturally leads to the best in class action $a^{*}_{L, P, \mathcal{A}}$ as the optimal action within the prespecified class $\mathcal{A}$. Although in reality we do not know $P$, it is illuminating to consider this case first, as we naturally require that at least with a sufficiently large (in fact, infinite) sample the true mechanism---from a statistical point of view---can be revealed. The property that guarantees this outcome is the ergodicity of the assumed statistical models.

This argument suggests using the best in class action, $a^{*}_{L, P, \mathcal{A}}$, as the optimal large sample solution. In other words, a minimal property to require from any sensible decision rule is that, as the sample size grows, the distribution of the decision rule converges to a degenerate distribution concentrated at the point $a^{*}_{L, P, \mathcal{A}}$.

As an example, consider again the MLE estimator in the coin tossing problem, for which the above property is satisfied. The following figure represents how the different confidence bands associated with the induced action distribution evolve as the sample size goes to infinity. Evidently, for sufficiently large sample sizes, the confidence bands concentrate around the true value, which, due to the correct specification, is equivalent to the best in class action.

This property of the decision rule is called consistency. One of the main objectives of this notebook is to investigate the conditions under which it can be established.

Given that asymptotically the true $P$ can be learned, a natural question to ask is why not set $\mathcal{A}$ large enough so as to guarantee $\gamma(P)\in\mathcal{A}$, i.e. a correctly specified model. We will see the sense in which requiring consistency ties our hands in terms of the "size" of $\mathcal{A}$.

Although it is hard to provide generally applicable sufficient conditions for consistency, roughly speaking, we can identify two big classes of decision rules for which powerful results are available.

  • Bayes decision rules
  • Frequentist decision rules building on the empirical distribution $P_n$, where
$$P_n(z) : = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{Z_i \leq z\}$$

and $\mathbf{1}\{ A \}$ is the indicator function of the set $A$.
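As a minimal numerical sketch (the function name `empirical_cdf` is ours), the empirical distribution $P_n$ can be evaluated directly from a sample:

```python
import numpy as np

def empirical_cdf(sample):
    """Return P_n, the empirical CDF induced by the sample."""
    sample = np.asarray(sample)

    def P_n(z):
        # Fraction of observations with Z_i <= z
        return np.mean(sample <= z)

    return P_n

# Example: with observations {1, 2, 3, 4}, P_n(2.5) = 1/2
P_n = empirical_cdf([1.0, 2.0, 3.0, 4.0])
```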

In this notebook we will focus exclusively on the latter approach, considering decision rules that assign actions by minimizing the empirical analog of the population loss. Hence, the procedure is known as empirical loss minimization or analog estimation.


Consistency of decision rules

To start our asymptotic inquiry, we first define consistency more precisely than we did before. There are two slightly different notions, depending on the tractability and objectives of the decision problem at hand.

  • Consistency in terms of the loss function: a decision rule $d: \mathcal{S} \mapsto \mathcal{A}$ is consistent in terms of the loss function relative to $(\mathcal{H}, \mathcal{A})$, if for all $P \in \mathcal{H}$,
$$L(P, d(Z^n)) \ \ \underset{n \to \infty}{\overset{\text{P}}{\rightarrow}} \ \ \inf_{a \in \mathcal{A}}L(P, a) = L\left(P, a^{*}_{L, P,\mathcal{A}}\right).$$
  • Consistency in terms of the action: a decision rule $d: \mathcal{S} \mapsto \mathcal{A}$ is consistent in terms of the action relative to $(\mathcal{H}, \mathcal{A})$, if for all $P \in \mathcal{H}$, $$m\left(d\left(Z^n\right), a^*_{L, P, \mathcal{A}}\right) \ \ \underset{n \to \infty}{\overset{\text{P}}{\rightarrow}} \ \ 0.$$

where $m(\cdot, \cdot)$ is a metric on the action space $\mathcal{A}$. In the case of analog estimators, a necessary condition for this notion of consistency is the identifiability of $a^*_{L, P, \mathcal{A}}$, which we define as follows:

Identification:

$a^*_{L, P, \mathcal{A}}$ is identified relative to $(P, \mathcal{A})$ if $a^*_{L, P, \mathcal{A}}$ is the unique minimizer of $L(P, \cdot)$ over $\mathcal{A}$.

$a^*_{L, P, \mathcal{A}}$ is (uniformly) identified relative to $(\mathcal{H}, \mathcal{A})$ if it is identified relative to $(P, \mathcal{A})$ for all $P\in\mathcal{H}$.

Under identification, the two notions of consistency are equivalent. As the above definitions suggest, however, the former is more general to the extent that it also allows for partially identified statistical models. Unless otherwise noted, we will work with consistency in terms of the loss function and call it simply consistency.

In the above definitions, the set of assumed statistical models $\mathcal{H}$ plays a key role: it outlines the set of distributions under which the particular notion of convergence is required. Ideally, we would like convergence under the true distribution; however, because we do not know $P$, the "robust" approach is to be as agnostic as possible regarding $\mathcal{H}$. The central insight of statistical decision theory is to highlight that this approach is not limitless: one needs to find a balance between $\mathcal{H}$ and $\mathcal{A}$ to obtain decision rules with favorable properties.

Consistency has strong implications for the risk functional in large samples: the degenerate limiting distribution of the decision rule implies that asymptotically the variability of the decision rule vanishes and so

$$ R_{\infty}(P,d) := \lim_{n\to \infty} \ R_n(P, d) \overset{a.s.}{=} L\left(P, a^{*}_{L, P,\mathcal{A}}\right). $$

Uniform law of large numbers

As mentioned before, in this notebook we are focusing on decision rules arising from empirical loss minimization. The basic idea behind this approach is to utilize the consistency definitions directly, but instead of using the quantity $L(P, a)$, which we cannot actually evaluate, substitute the empirical distribution into the loss function and work with the empirical loss $L(P_n, a)$.

REMARK: The empirical loss is not defined if $P_n$ does not lie in the domain of $L$---as is the case, for example, with non-parametric density estimation. Then one has to either extend the domain of $L$ or map $P_n$ into the original domain of the loss function $L$. An excellent discussion can be found in Manski (1988).

Provided that the loss function is continuous in its first argument, the law of large numbers (LLN), combined with the continuous mapping theorem, implies

$$L(P_n, a) \ \ \underset{n \to \infty}{\overset{\text{P}}{\to}} \ \ L(P, a)$$

that is, for a fixed action $a$, the empirical loss converges in probability to the true loss as the sample size goes to infinity. Notice, however, that consistency

  • is not a property of a given action, but of a whole decision rule; more precisely, it is about the convergence of the sequence $\{d(z^n)\}_{n\geq 1} \subseteq \mathcal{A}$
  • requires convergence to the minimum loss (within $\mathcal{A}$) that we ought to take into account while generating decision rules

In the spirit of the analogy principle, decision rules minimizing the empirical loss can be defined as follows

$$d(z^n) := \arg\min_{a\in\mathcal{A}} L(P_n, a), \quad\quad\quad (1)$$

where the dependence on the sample is embodied by the empirical distribution $P_n$. The heuristic idea is that because $d(z^n)$ minimizes $L(P_n, \cdot)$, and $P_n$ converges to $P$, then "ideally" $d(z^n)$ should go to $a^{*}_{L, P, \mathcal{A}}$, the minimizer of $L(P, a)$, as the sample size grows to infinity.

What do we need for this argument to hold? First, notice that the standard law of large numbers is inadequate for this purpose. To illustrate this, we now turn to a concept closely related to empirical loss minimization: generalization.

Using the empirical loss as a substitute for the true loss while determining the decision rule, a critical question is how much error we introduce with this approximation. The object of interest regarding this question is the following quantity, defined for a given action $a$ and realization of the sample $z^n$:

$$\xi(a, z^n) = \left| L(P_n, a) - L(P, a) \right|.$$

We say that the action $a$ generalizes well from a given sample $z^n$, if the corresponding $\xi(a, z^n)$ is small. Taking the absolute value is important: $L(P_n, a)$ can easily be smaller than $L(P, a)$, as is the case with overfitting.

Due to $P_n$'s dependence on the particular sample, however, the quantity $\xi(a, \cdot)$ is a random variable. It might be small for a given sample $z^n$, but what we really need in order to justify the method of empirical loss minimization is that $\xi$ is small for "most samples", or more precisely

$$P\left(z^n : \left| L(P_n, a) - L(P, a) \right| > \delta \right) \quad \text{is small.}$$

Since $L$ is continuous in its first argument and $P_n \to P$, this so called tail probability converges to zero for any fixed $a\in\mathcal{A}$. The figure below displays such tail probabilities for the MLE estimator in the coin tossing example (with quadratic loss) for different actions as functions of the sample size.

One can see that even though the tail probabilities converge to $0$ for all $a\in\mathcal{A}$, the rate of convergence depends on the particular $a$. In other words, $\forall \varepsilon>0$ and $\forall a\in\mathcal{A}$, there is an action dependent minimum sample size $N(a, \varepsilon, \delta)$ (represented by the vertical solid lines) that we need in order to guarantee that the tail probability falls below $\varepsilon$.
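A tail probability of this kind is easy to approximate by simulation. The following sketch (parameter values and function name ours) estimates $P\left(\left|L(P_n, a) - L(P, a)\right| > \delta\right)$ in the coin tossing problem with quadratic loss, where the true loss is $L(P, a) = \theta(1-\theta) + (\theta - a)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_prob(a, n, theta=0.5, delta=0.05, n_sim=2000):
    """Monte Carlo estimate of P(|L(P_n, a) - L(P, a)| > delta)
    for Bernoulli(theta) data with quadratic loss."""
    true_loss = theta * (1 - theta) + (theta - a) ** 2
    samples = rng.binomial(1, theta, size=(n_sim, n))
    emp_loss = ((samples - a) ** 2).mean(axis=1)   # L(P_n, a) per sample
    return np.mean(np.abs(emp_loss - true_loss) > delta)

# The tail probability shrinks as the sample size grows
probs = [tail_prob(a=0.2, n=n) for n in (10, 100, 1000)]
```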

Suppose now that we have a decision rule and look at its induced sequence of actions $\{a_n\}_{n\geq 1}\subset \mathcal{A}$ for a fixed realization $z^{\infty}$. Again, for consistency, we need the associated sequence of tail probabilities to converge to zero for "most $z^{\infty}$".

  • When the set $\mathcal{A}$ has finitely many elements, this is not a problem: we can simply define
$$ N(\varepsilon, \delta) : = \max_{a\in\mathcal{A}} N(a, \varepsilon, \delta) $$

and observe that for all $n\geq N(\varepsilon, \delta)$, the tail probabilities are smaller than $\varepsilon$, for all $a\in \{a_n\}_{n\geq 1} \subseteq \mathcal{A}$.

  • However, because the critical sample size $N(a, \varepsilon, \delta)$ depends on the action, if $\mathcal{A}$ is "too big", it is possible that there is no $n$ so that $n > N(a_n, \varepsilon, \delta)$, hence $L(P_n, a_n)$ may never approach $L(P, a_n)$.

A possible approach to avoid this complication is to require the existence of an integer $N(\varepsilon, \delta)$, independent of $a$, such that for all sample sizes larger than $N(\varepsilon, \delta)$ the tail probabilities fall below $\varepsilon$ simultaneously for all $a \in \mathcal{A}$. This amounts to requiring

$$ \sup_{a \in\mathcal{A}} \left| L(P_n, a) - L(P, a) \right| \ \overset{\text{P}}{\to} \ 0. $$

This notion is called the uniform law of large numbers referring to the fact that the convergence is guaranteed simultaneously for all actions in $\mathcal{A}$.


A special case -- plug-in estimators

There are special cases where we do not need to worry about uniformity. When the true loss function has a minimizer $a^{*}_{L, P, \mathcal{A}}$ that admits a closed form in the sense that it can be expressed as a continuous function of $P$, the empirical loss minimizer is simply a sample analog of $a^{*}_{L, P, \mathcal{A}}$. Effectively, there is no need to explicitly minimize the empirical loss, so in spirit it is as if we kept a particular action fixed. In these cases the standard LLN is sufficient to establish consistency.

  • Sample mean:

    Consider the quadratic loss function with the feature $\gamma(P) = E[Z]$, i.e. the loss is $L(P,a) = E[(a - E[Z])^2]$. Evidently, the minimizer of this quantity is $E[Z]$. Although we could proceed by minimizing the empirical loss to derive the decision rule, we do not need to do that, because we know up front that it is equal to the sample analog of $E[Z]$. The decision rule is the plug-in estimator $d(z^n) = \frac{1}{n}\sum_{i=1}^{n}z_i$.

  • OLS estimator:

    Suppose that $Z=(Y, X)$ and consider the quadratic loss $L(P, a) = E[(Y - X'a)^2]$, where the action $a$ is a coefficient vector. Assume that $(Y, X)$ follows a multivariate normal distribution, so that the best predictor of $Y$ given $X$ is linear; then the minimizer of $L$ is given by $a = \left(E[XX']\right)^{-1}E[XY]$. Consequently, there is no need for explicit minimization: we can use the least squares estimator as a plug-in estimator for $a$.
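The plug-in logic can be sketched numerically. In the following (simulated data, parameter values ours), the least squares coefficient is obtained purely from the sample analogs of the population moments $E[XX']$ and $E[XY]$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate (Y, X): the best linear predictor coefficient is
# a* = E[XX']^{-1} E[XY]; its plug-in (OLS) analog replaces the
# population moments with sample moments.
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(size=n)

# Plug-in estimator: sample analogs of E[XX'] and E[XY]
a_hat = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
```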

More generally, the empirical loss minimizer is itself a functional of the empirical distribution. If the minimum functional

$$f : P_n \mapsto \min_{a\in\mathcal{A}} \ L(P_n, a)$$

is continuous, then an application of the continuous mapping theorem together with the Glivenko-Cantelli theorem (i.e. $\lVert P_n - P \rVert_{\infty} \to 0$) would yield consistency. For reference, see van der Vaart (2000).


Uniform law of large numbers and consistency

In more general nonlinear models, however, $a^{*}_{L, P, \mathcal{A}}$ has no analytical form, so we cannot avoid minimizing the empirical loss in order to derive the decision rule. In these cases, guaranteeing the validity of uniform convergence becomes essential. In fact, it can be shown that consistency of any decision rule derived by minimizing some empirical loss function is equivalent to the uniform LLN.

Theorem. (Vapnik and Chervonenkis, 1971): Uniform convergence

$$ \lim_{n\to \infty} P\left\{\sup_{a \in\mathcal{A}} \left| L(P_n, a) - L(P, a) \right| < \varepsilon\right\}= 1 $$

for all $\varepsilon >0$ is necessary and sufficient for consistency of decision rules arising from empirical loss minimization over $\mathcal{A}$.

Although this characterization of consistency is theoretically intriguing, it is not all that useful in practice unless we find easy-to-check conditions which ensure uniform convergence of the empirical loss over a particular function class $\mathcal{A}$. As preparation for discussing these conditions, we first introduce some notation.

Almost all loss functions used in practice can be cast in the following form. There exist

  • $l : \mathcal{A}\times \mathbb{R}^d \mapsto \mathbb{R}^{m}$ and
  • a continuous $r : \mathbb{R}^{m} \mapsto \mathbb{R}_+$ such that
$$ L(P, a) = r\left( \int_Z l(a, z)dP (z)\right).$$

Continuity of $r$ implies that in order to establish consistency it is enough to show that

$$ \sup_{a\in\mathcal{A}} \left| \int_Z l(a, z)dP_n (z) - \int_Z l(a, z)dP(z)\right| \ \ \overset{\text{P}}{\to} \ \ 0 .$$

Formally, this requires that the class of functions $\mathcal{L}_{\mathcal{A}}:= \left\{l(a, \cdot) : a \in \mathcal{A} \right\}$ is a Glivenko-Cantelli class for $P$.


Definition (Glivenko-Cantelli class): Let $\mathcal{G}$ be a class of integrable real-valued functions of the random variable $Z$ having distribution $P$. Consider the random variable

$$ \Vert P_n - P \Vert_{\mathcal{G}} : = \sup_{g\in\mathcal{G}} \left| P_n g - Pg \right| $$

We say that $\mathcal{G}$ is a Glivenko-Cantelli class for $P$, if $\Vert P_n - P \Vert_{\mathcal{G}}$ converges to zero in probability as $n\to\infty$.
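The classical Glivenko-Cantelli theorem is the special case where $\mathcal{G}$ is the class of indicators $\{\mathbf{1}\{\cdot \leq z\} : z \in \mathbb{R}\}$, so that $\Vert P_n - P \Vert_{\mathcal{G}} = \sup_z |F_n(z) - F(z)|$. A minimal sketch (uniform sample, function name ours) computes this sup-norm exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

def sup_deviation(sample):
    """sup_z |F_n(z) - F(z)| for a Uniform(0, 1) sample, i.e. the
    Glivenko-Cantelli sup-norm over the class of indicators 1{. <= z}.
    The sup is attained at sample points, checked from both sides."""
    x = np.sort(sample)
    n = len(x)
    cdf = x                                  # F(z) = z for Uniform(0, 1)
    upper = np.arange(1, n + 1) / n - cdf    # F_n(x_i) - F(x_i)
    lower = cdf - np.arange(0, n) / n        # F(x_i) - F_n(x_i^-)
    return max(upper.max(), lower.max())

# The deviation shrinks as the sample size grows
devs = [sup_deviation(rng.uniform(size=n)) for n in (100, 10_000)]
```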


The usefulness of this formulation comes from the fact that it allows us to bring to bear powerful techniques from empirical process theory (Pollard, 1984). This is because the random variable

$$\left| \int_Z l(a, z)dP_n (z) - \int_Z l(a, z)dP(z)\right|$$

can be considered as an empirical process indexed by the function class $\mathcal{L}_{\mathcal{A}}$. For each action $a$, let $l_a:=l(a, \cdot)$ and note that $l_a$ is a function of the random variable $Z$. Using typical notation of the empirical process literature, we can rewrite the above absolute difference as

$$\left\{\left|P_n l_a - P l_a\right|\right\}_{l_a\in\mathcal{L}_{\mathcal{A}}}.$$

Notice that $P_n l_a$ is a random variable because the empirical distribution is a function of the random sample, while $Pl_a$ is just the expectation of $l_a$ and so it is a non-random scalar.

Sufficient conditions for ULLN

Variables of the form $\Vert P_n - P \Vert_{\mathcal{G}}$ are ubiquitous in statistics and econometrics, and there are well-known sufficient conditions that guarantee their convergence.

For parametric estimation, probably the most widely used (at least in econometrics) sufficient condition includes the following three assumptions (or some form thereof). Suppose that the action space $\mathcal{A}$ is indexed by a finite dimensional vector $\theta\in\Theta \subset \mathbb{R}^{p}$. If

  • $\Theta$ is compact
  • $l(\theta, z)$ is a function such that $l(\cdot, z)$ is continuous on $\Theta$ with probability one
  • $l(\theta, z)$ is dominated by a function $B(z)$, i.e. $|l(\theta, z)|\leq B(z)$ for all $\theta\in\Theta$, such that $\mathbb{E}[B(Z)]<\infty$

then $\mathcal{L}_{\mathcal{A}} := \left\{l(a, \cdot) : a \in \mathcal{A} \right\}$ is a Glivenko-Cantelli class for $P$ and so the associated analog estimator is consistent. Among others, these assumptions form the basis for the consistency of the (Quasi-)Maximum Likelihood (White, 1994) and the Generalized Method of Moments (Hansen, 1982) estimators.

A somewhat different approach focuses on the "effective size" of the class $\mathcal{L}_{\mathcal{A}}$ and frames sufficient conditions in terms of particular complexity measures such as the Rademacher complexity, Vapnik-Chervonenkis dimension, covering/packing numbers, etc. The idea is to find tail bounds for the probability that $\Vert P_n - P \Vert_{\mathcal{L}_{\mathcal{A}}}$ deviates substantially above the complexity of $\mathcal{L}_{\mathcal{A}}$. In the following section we present the main ideas behind the derivation of non-asymptotic tail bounds.

An interesting necessary condition: If $\mathcal{L}_{\mathcal{A}}$ is a collection of indicator functions and the data generating process is assumed to be i.i.d., a necessary and sufficient condition for distribution free uniform convergence is that the Vapnik-Chervonenkis complexity of $\mathcal{L}_{\mathcal{A}}$ is finite.


Non-asymptotic bounds

As we have seen above, one way to establish consistency for analog estimators is to require uniform convergence of the empirical loss to the true loss over the entire action space, i.e. guarantee that the uniform law of large numbers holds. This section presents an approach to the laws of large numbers (uniform or "standard") building on a finite sample perspective.

We study the concentration of centered averages around their means for different finite sample sizes. More generally, we want results of the form

$$P\left(z^n \ : \ \left|g(z^n) - E[g(z^n)]\right| \geq \delta_n\right) \leq \varepsilon\quad\text{with}\quad\delta_n \to 0\quad\quad\text{OR}\quad\quad P\left( z^n \ : \ \left|g(z^n) - E[g(z^n)]\right| \geq \delta\right) \leq \varepsilon_n\quad\text{with}\quad\varepsilon_n \to 0$$

Such results are known as concentration inequalities. Ensuring that the centered average converges to a degenerate distribution at its mean---"perfectly" concentrates around it---establishes the law of large numbers. Moreover, studying concentration and tail bounds has the advantage that we obtain approximate rates of convergence as a byproduct. These rates of convergence give us information regarding the minimum sample size above which generalization from the empirical loss to the true loss is justified.

First, we study how the concentration inequalities work for a single random variable. This is going to parallel the law of large numbers. Second, we study concentration inequalities applicable uniformly over a class of random variables, i.e. uniform bounds of the form

$$P\left(z^n \ : \ \sup_{g\in\mathcal{G}} \ \left|g(z^n) - E[g(z^n)]\right| \geq \delta_n\right) \leq \varepsilon\quad\text{with}\quad\delta_n \to 0$$

over a class of functions $\mathcal{G}$. This is going to parallel the uniform law of large numbers. For ease of notation we present the results for a general class of measurable functions $g \in \mathcal{G}$---but remember that in almost all applications we will substitute $\mathcal{L}_{\mathcal{A}}$ for it.

This section builds on the material in Wasserman (2016) and Bousquet et al. (2004).

Markov's inequality

Markov's inequality states that for any non-negative scalar random variable $Z$ and $t>0$ we have

$$ P\{Z \geq t\}\leq \frac{\mathbb{E}[Z]}{t}. $$

The idea behind this inequality is fairly simple and in fact it follows from the following obvious inequality

$$ \mathbf{1}\{Z\geq t\}\leq \frac{Z}{t} $$

illustrated by the figure below. Clearly, Markov's inequality can be established for any probability measure $P$ over $Z$ with a finite first moment.

It follows from Markov's inequality that for any strictly monotonically increasing non-negative function $\phi$ and any random variable $Z$ (not necessarily non-negative) we have that

$$ P\{Z \geq t\} = P\{\phi(Z) \geq \phi(t)\} \leq \frac{\mathbb{E}[\phi(Z)]}{\phi(t)}. $$

Taking the centered random variable $| Z - \mathbb{E}[Z]|$ and $\phi(x) = x^q$ for some $q>0$ leads to tail bounds expressed in terms of the moments of $Z$.

$$ P\{| Z - \mathbb{E}[Z]| \geq t\} \leq \frac{\mathbb{E}[| Z - \mathbb{E}[Z]|^q]}{t^q}. $$

Note that for $q = 2$, this form delivers the Chebyshev inequality (see the right panel of the figure). This approach to bounding tail probabilities for a wide range of possible $P$ is quite general. Controlling higher order moments---restricting $\mathcal{H}$, i.e. the range of probabilities for which the inequality holds---leads to (weakly) tighter bounds on the tail probabilities.
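These inequalities are straightforward to check by simulation. The sketch below (the choice of an exponential distribution is ours) compares simulated tail probabilities with the Markov and Chebyshev bounds:

```python
import numpy as np

rng = np.random.default_rng(3)

# Exponential(1) sample: non-negative, with E[Z] = 1
z = rng.exponential(1.0, size=100_000)
t = 3.0

# Markov: P(Z >= t) <= E[Z] / t
markov_bound = z.mean() / t
tail = np.mean(z >= t)

# Chebyshev (q = 2) on the centered variable:
# P(|Z - E[Z]| >= t) <= Var(Z) / t^2
cheb_bound = np.mean((z - z.mean()) ** 2) / t ** 2
tail_centered = np.mean(np.abs(z - z.mean()) >= t)
```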

Chernoff bounds

A related idea is at the core of the so called Chernoff bound. For that one takes the transformation $\phi(x) = e^{\lambda x}$ applied to the centered random variable $(Z - \mathbb{E}[Z])$ which yields

$$ P\{(Z - \mathbb{E}[Z]) \geq t\} \leq \frac{\mathbb{E}\left[e^{\lambda(Z - \mathbb{E}[Z])}\right]}{e^{\lambda t}}. $$

Minimizing the bound over $\lambda$ (provided the moments exist) would lead us to the Chernoff bound

$$ \log P\{(Z - \mathbb{E}[Z]) \geq t\} \leq - \sup_{\lambda} \left\{\lambda t - \log \mathbb{E}\left[e^{\lambda(Z - \mathbb{E}[Z])}\right]\right\}.$$

If the centered random variable is an iid sum of other random variables, the Chernoff bound gains additional structure.
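As a numerical sketch (parameter values ours), the Chernoff bound for a centered Bernoulli variable can be computed by maximizing $\lambda t - \log \mathbb{E}[e^{\lambda(Z - \mathbb{E}[Z])}]$ over a grid of $\lambda$ values:

```python
import numpy as np

# Chernoff bound for a centered Bernoulli(theta) variable: maximize
# lambda * t - log E[exp(lambda * (Z - E[Z]))] over a grid of lambdas.
theta, t = 0.5, 0.25
lambdas = np.linspace(0.0, 20.0, 2001)
log_mgf = np.log(theta * np.exp(lambdas * (1 - theta))
                 + (1 - theta) * np.exp(-lambdas * theta))
rate = np.max(lambdas * t - log_mgf)     # optimized exponent at t
chernoff_bound = np.exp(-rate)           # P(Z - E[Z] >= t) <= this
```

For $\theta = 1/2$ and $t = 1/4$, the true tail probability is $P(Z = 1) = 1/2$, and the bound evaluates to roughly $0.877$, so it indeed lies above the truth.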

Hoeffding bounds

Suppose first that the sample $z^n$ is generated by an iid process and recall that

$$P_n g - P g = \frac{1}{n}\sum_{i=1}^n g(Z_i) - \mathbb{E}[g(Z)] = \frac{1}{n}\sum_{i=1}^n \left(g(Z_i) - \mathbb{E}[g(Z)]\right).$$

Bounding the tails via Chernoff's method yields

$$ P\left\{\frac{1}{n}\sum_{i=1}^n \left(g(Z_i) - \mathbb{E}[g(Z)]\right) \geq t\right\} \leq \frac{\mathbb{E}\left[e^{\lambda \frac{1}{n}\sum_{i=1}^n \left(g(Z_i) - \mathbb{E}[g(Z)]\right)}\right]}{e^{\lambda t}} = e^{-\lambda t} \prod_{i=1}^n \mathbb{E}\left[e^{\lambda \frac{1}{n}\left(g(Z_i) - \mathbb{E}[g(Z)] \right)} \right] $$

where the equality follows from independence. Hence, the problem of deriving a tight bound boils down to bounding the moment generating function of

$$ \frac{1}{n}\left(g(Z_i) - \mathbb{E}[g(Z)] \right). $$

For obtaining bounds on moment generating functions, a particularly important class of random variables is the class of so called sub-Gaussian variables.

Definition. A random variable $Z$ is called sub-Gaussian if there is a positive number $\sigma$ such that for all $\lambda \in \mathbb{R}$

$$ \mathbb{E}\left[e^{\lambda \left(Z - \mathbb{E}[Z] \right)} \right] \leq e^{\sigma^2\lambda^2/2}$$

REMARK: A Gaussian variable with variance $\sigma^2$ is sub-Gaussian with parameter $\sigma$. There are also non-Gaussian random variables which are sub-Gaussian---for example, all bounded random variables.

The property of sub-Gaussianity is preserved by linear operations, which leads to the following important result.

Theorem. (Hoeffding) Let the variables $g(Z_i)$ be iid and sub-Gaussian with parameter $\sigma$. Then, for all $t>0$ we have that

$$P\left\{\left|\frac{1}{n}\sum_{i=1}^n g(Z_i) - \mathbb{E}[g(Z)]\right| \geq t\right\} \leq 2\exp\left\{-\frac{t^2 n}{2\sigma^2}\right\}.$$

For a single random variable $g(Z)$, we see how the Hoeffding inequality is at the core of the law of large numbers. As $n$ goes to infinity, the upper bound on the right goes to zero, implying that the probability on the left must go to zero as well for all $P$ that make $g(Z)$ sub-Gaussian. In fact, further inspection reveals that the upper bound also provides an exponential rate of convergence: $t^2/(2\sigma^2)$.
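The Hoeffding bound can be checked against simulated tail probabilities. The sketch below (parameter values ours) uses Bernoulli variables, which are bounded on $[0, 1]$ and hence sub-Gaussian with $\sigma = 1/2$, so the bound becomes $2\exp(-2nt^2)$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Bernoulli(1/2) variables take values in [0, 1], hence are sub-Gaussian
# with parameter sigma = 1/2, so Hoeffding reads
#     P(|mean - 1/2| >= t) <= 2 exp(-2 n t^2).
n, t, n_sim = 200, 0.1, 5000
means = rng.binomial(1, 0.5, size=(n_sim, n)).mean(axis=1)
empirical_tail = np.mean(np.abs(means - 0.5) >= t)
hoeffding_bound = 2 * np.exp(-2 * n * t ** 2)
```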

However, in order to talk about concentration properties of decision rules, $d: \mathcal{S}\mapsto\mathcal{A}$ we need bounds for tail probabilities applicable uniformly over the whole action space or alternatively, uniformly across $\mathcal{L}_{\mathcal{A}}$. This consideration leads us to uniform bounds.


Uniform Tail and Concentration Bounds for classes of finite cardinality

As a first simple example we consider uniform tail bounds over sets of finite cardinality. Let $\mathcal{G} = \{g_j : j = 1, \ldots, J\}$. A conservative estimate of the uniform tail bound in this case is the union bound.

Corollary. (Hoeffding union bound) Let the variables $\{g_j(Z_i): j=1,\ldots, J\}$ be iid and sub-Gaussian with common parameter $\sigma$. Then, for all $t > 0$ we have

$$ P\left\{\sup_{g\in\mathcal{G}}\left| \frac{1}{n}\sum_{i=1}^n g(Z_i) - \mathbb{E}[g(Z)]\right| \geq t \right\} \leq \sum_{j = 1}^{J} P\left\{\left|\frac{1}{n}\sum_{i=1}^n g_j(Z_i) - \mathbb{E}[g_j(Z)]\right| \geq t \right\} \leq 2J\exp\left\{-\frac{t^2 n}{2\sigma^2}\right\}. $$

The difference between the uniform and individual Hoeffding bounds is just a scaling factor, $J$. If there are only finitely many actions in $\mathcal{A}$, then of course the function class $\mathcal{L}_{\mathcal{A}}$ has finite cardinality. Our objective, however, is to extend the above analysis to action spaces with infinite cardinality. To this end, we need to find a better "measure" of the size of a function space than its cardinality. Next, we introduce one such complexity measure, which proves to be extremely useful for characterizing tail bounds for analog estimators.

Uniform tail bounds for classes of infinite cardinality - Rademacher complexity

In order to work with sets of infinitely many functions we would like to capture the size of these infinite sets for the purpose of statistical analysis. One such measure of statistical size is the Rademacher complexity of a class of real-valued functions.

For a given realization of a sample, $z^n$, of size $n$, consider the set of vectors

$$\mathcal{G}(z^n) := \left\{ (g(z_1), \ldots, g(z_n)) \in \mathbb{R}^n \mid g \in \mathcal{G} \right\}.$$

This embodies the number of ways one can label points of a sample using functions in the class $\mathcal{G}$. It is often called the projection of the function class $\mathcal{G}$ to the sample $z^n$. The empirical Rademacher complexity of $\mathcal{G}$ for fixed $z^n$ is defined as

$$\mathsf{R} \left(\mathcal{G}(z^n) \right) := \mathbb{E}_{\epsilon}\left[\sup_{g\in \mathcal{G}}\Big| \frac{1}{n}\sum_{i=1}^n \epsilon_i g(z_i) \Big| \right],$$

where the expectation is taken with respect to the iid Rademacher random variables $\epsilon_i$, which take values in $\{-1, 1\}$ with equal probability.

The Rademacher complexity of $\mathcal{G}$ for sample size $n$ is then defined as

$$\mathsf{R}\left(\mathcal{G}, n\right) := \mathbb{E}_{Z^n} \Big[\mathsf{R}\left(\mathcal{G}(Z^n) \right)\Big],$$

where the expectation is taken with respect to the sample. Rademacher complexity has an intuitive interpretation. It is the average (across samples) of the maximum correlation between the vectors $\big(g(z_1), \ldots, g(z_n)\big)$ and the pure noise vector $\big(\epsilon_1, \ldots, \epsilon_n\big)$. The function class $\mathcal{G}$ is too "large" (complex) for statistical purposes if we can always choose a function $g\in\mathcal{G}$ that has high correlation with a randomly drawn noise vector, that is, when the corresponding Rademacher complexity does not go to zero as the sample size grows to infinity.
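The empirical Rademacher complexity can be approximated by Monte Carlo over the Rademacher draws. A sketch (grid of actions, sample size and function names ours) for the quadratic loss class of the coin tossing example:

```python
import numpy as np

rng = np.random.default_rng(5)

def empirical_rademacher(projection, n_draws=500):
    """Monte Carlo estimate of the empirical Rademacher complexity given
    the projection of the class onto the sample: one row per function g,
    one column per observation, entry g(z_i)."""
    n = projection.shape[1]
    eps = rng.choice([-1, 1], size=(n_draws, n))        # Rademacher draws
    # For each draw, sup over functions of |(1/n) sum_i eps_i g(z_i)|
    sups = np.abs(eps @ projection.T / n).max(axis=1)
    return sups.mean()

# Projection of the quadratic class l(a, z) = (z - a)^2, a in [0, 1],
# onto a Bernoulli(1/2) sample, with a on a fine grid
n = 500
z = rng.binomial(1, 0.5, size=n)
grid = np.linspace(0.0, 1.0, 101)
proj = (z[None, :] - grid[:, None]) ** 2                # (len(grid), n)
rad = empirical_rademacher(proj)
```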

Illustration of Rademacher complexity

In the following we consider two examples. Although both examples use a function class that can be parametrized by a single free parameter, we will see that one of them has drastically smaller Rademacher complexity than the other.

Coin tossing example revisited

Consider again the coin tossing example with the quadratic loss function discussed before. In this case $Z^n$ is an iid sample from a Bernoulli distribution parametrized by $\theta\in[0, 1]$. The class of functions is given by the quadratic class

$$ \mathcal{L}_{\mathcal{A}} := \{l(a, \cdot) : l(a, z) = (z - a)^2, \ \ a \in [0, 1]\subseteq \mathbb{R}\}.$$

The following figure depicts the empirical Rademacher complexity $\mathsf{R} \left(\mathcal{L}_{\mathcal{A}}(z^n) \right)$ along alternative sample paths (on the left) and the Rademacher complexity $\mathsf{R}\left(\mathcal{L}_{\mathcal{A}}, n\right)$ for different sample sizes (on the right).

  • Clearly, both $\mathsf{R} \left(\mathcal{L}_{\mathcal{A}}(z^n) \right)$ and $\mathsf{R}\left(\mathcal{L}_{\mathcal{A}}, n\right)$ converge to zero as the sample size grows to infinity.
  • The Rademacher complexity is nearly indistinguishable from the empirical Rademacher complexity; in fact, the latter does not show much variation across the different samples $z^n$ (several lines lie on top of each other). This suggests that in the simple example of coin tossing, the different ensemble samples (with more than 300 data points) encode very similar information about the complexity of the function class used in the estimation.

Sinusoid classification

Consider next a classification problem, where the observable is $Z=(Y, X)$ and assume that the binary random variable $Y$---taking values in $\{-1, 1\}$---depends on the covariate $X$ in some way. Suppose that a statistician uses the following classifiers to capture this dependence

$$ \mathcal{L}_{\mathcal{A}} := \Big\{ \mathbb{1}\{\sin(ax) \geq 0\} - \mathbb{1}\{\sin(ax) < 0\} : a \in \mathbb{R}_+ \Big\}. $$

That is, the classifier boundary is given by sine functions, $\sin(aX)$, parametrized by the frequency $a$. For a given $a$, one can determine the range of covariate values for which this classifier assigns the value $1$ or $-1$. As a result, choosing $a\in\mathcal{A}$ is equivalent to choosing particular subsets of the range of $X$. To determine the optimal $a$ we use the $0/1$ loss.

The following figure represents the empirical loss minimizing classifiers for two random samples (depicted by the red dots) with different sizes. In order to have a closed form solution for the optimal classifier we are selecting the sample at convenient points. This is without loss of generality and useful for illustrative purposes. The shaded area indicates the ranges of $X$, where the classifier assigns $1$ or $-1$.

  • As we can see, by choosing a sufficiently high frequency $a$ we can always find a curve which classifies the data perfectly, i.e. the in-sample predictions are always "correct".
  • For better illustration, consider the empirical Rademacher complexity associated with the sample, $z^n$, depicted by the red dots. Since $\mathcal{L}_{\mathcal{A}}$ is always able to perfectly separate the sample, the (empirical) Rademacher complexity always takes its maximum value (one) irrespective of the sample size.

Consequently, for statistical purposes the family of sine curves is too complex. This example highlights that for general non-linear functions the number of free parameters---here only one---does not correspond to the complexity of the function class---which is infinite.
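The shattering claim above can be checked numerically. The sketch below uses the classic construction for the sine class: for points $x_i = 2^{-i}$ (chosen for convenience, as the sample points in the text are) and any labels $y_i \in \{-1, 1\}$, the frequency $a = \pi\big(1 + \sum_i b_i 2^i\big)$ with $b_i = (1 - y_i)/2$ reproduces the labels exactly, so every one of the $2^n$ labelings is realized and the empirical Rademacher complexity equals one.

```python
import itertools
import math

def shattering_frequency(labels):
    """Frequency a such that sign(sin(a * 2**-i)) reproduces labels[i-1].

    Classic construction for the sine class: a = pi * (1 + sum_i b_i 2^i),
    with b_i = (1 - y_i) / 2 in {0, 1}.
    """
    return math.pi * (1 + sum(((1 - y) // 2) * 2 ** i
                              for i, y in enumerate(labels, start=1)))

n = 6
xs = [2.0 ** -i for i in range(1, n + 1)]   # sample points x_i = 2**-i
n_realized = 0
for labels in itertools.product([-1, 1], repeat=n):
    a = shattering_frequency(labels)
    preds = [1 if math.sin(a * x) >= 0 else -1 for x in xs]
    n_realized += (preds == list(labels))

print(f"{n_realized} of {2 ** n} labelings realized by the sine class")
```

With a single free parameter the class nevertheless separates any such sample perfectly, which is exactly why parameter counting fails as a complexity measure here.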

Uniform bounds using the Rademacher complexity

With the introduced concepts we are now in a position to present an important concentration inequality for classes of infinite cardinality.

Theorem: For a uniformly bounded function class, $\Vert g \Vert_{\infty} \leq B$ for all $g \in\mathcal{G}$, and for each $\eta>0$ we have that

$$ P \Big\{ \Vert P_n - P \Vert_{\mathcal{G}} \geq 2\mathsf{R}\left(\mathcal{G}, n\right) + \eta \Big\} \leq 2 \exp\Big\{- \frac{2 n \eta^2}{B^2} \Big\}. \quad\quad\quad(*)$$

and for the empirical Rademacher complexity

$$ P\Big\{ \Vert P_n - P \Vert_{\mathcal{G}} \geq 2 \mathsf{R} \left(\mathcal{G}(z^n) \right) + \eta \Big\} \leq 2 \exp\Big\{-\frac{n \eta^2}{4 B^2} \Big\}. \quad\quad\quad(**) $$

It is apparent from the above inequalities that the tightness of the finite sample uniform bounds will be (partly) determined by the Rademacher complexity. The change in Rademacher complexity as the sample size grows will determine the (non-asymptotic) rate of convergence. Hence, if $\mathsf{R}\left(\mathcal{G}, n\right) = o(1)$, then $\Vert P_n - P \Vert_{\mathcal{G}} \to 0$ in probability, or, put differently, $\mathcal{G}$ is a Glivenko-Cantelli class.

Notice that the bound in $(**)$ is data dependent in the sense that it involves only quantities that, in principle, we can compute without knowing the true probability distribution.
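To see what the data-dependent bound $(**)$ looks like concretely, the sketch below estimates the empirical Rademacher complexity by Monte Carlo over the sign variables $\sigma$ and plugs it into $(**)$. The Bernoulli sample, the quadratic loss $g_a(z) = (z - a)^2$ (bounded by $B = 1$) and the action grid are illustrative assumptions, not taken from the notebook's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 300, 0.6            # sample size and (hypothetical) true coin bias
z = rng.binomial(1, theta, size=n)

# Assumed function class: quadratic losses g_a(z) = (z - a)^2, a on a grid
grid = np.linspace(0.0, 1.0, 101)
G = (z[None, :] - grid[:, None]) ** 2        # |grid| x n matrix of g_a(z_i)

# Monte Carlo estimate of the empirical Rademacher complexity
#   R(G(z^n)) = E_sigma[ sup_a (1/n) sum_i sigma_i g_a(z_i) ]
m = 2000
sigma = rng.choice([-1.0, 1.0], size=(m, n))
emp_rad = np.mean(np.max(sigma @ G.T / n, axis=1))

# Data-dependent bound (**): with probability >= 1 - eps,
#   ||P_n - P||_G <= 2 R(G(z^n)) + sqrt(4 B^2 log(2/eps) / n),  here B = 1
eps = 0.05
bound = 2 * emp_rad + np.sqrt(4 * np.log(2 / eps) / n)
print(f"empirical Rademacher complexity ~ {emp_rad:.3f}, bound ~ {bound:.3f}")
```

Everything on the right-hand side of the bound is computable from the sample alone, which is precisely the sense in which $(**)$ is practical.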

Illustration of the tail bounds in the coin tossing example

An alternative way to express the uniform bounds is to specify a fixed tail probability $\varepsilon$ and use the change of variable $\varepsilon = 2 \exp\left(- \frac{2 n \eta^2}{B^2}\right)$ to obtain

$$ P \left\{ \Vert P_n - P \Vert_{\mathcal{G}} \geq 2\mathsf{R}\left(\mathcal{G}, n\right) + \sqrt{\frac{B^2}{2n}\log\left(\frac{2}{\varepsilon}\right)} \right\} \leq \varepsilon. $$

This formulation of the bound is useful because it makes transparent the sense in which it is uniform over the action space $\mathcal{A}$. Similarly, for fixed action $a$, we can compute the sequence $\{\delta_n(a)\}_{n\geq 1}$ that makes the following expression true for all $n$

$$P \left\{ \left| L(P_n, a) - L(P, a)\right| \geq \delta_n(a) \right\} \leq \varepsilon.$$

Note that conceptually this is the same exercise that we did for tail probabilities above. The difference is that while there we fixed $\delta_n=\delta$ for all $n$ and treated $\varepsilon$ (tail probability) as sample size dependent, here, we are switching the roles and investigating the implied $n$-dependent bounds for fixed tail probability $\varepsilon$.
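For a fixed action, a Hoeffding-type choice of $\{\delta_n(a)\}$ can be sketched as follows. The quadratic loss $g_a(z) = (z - a)^2$ on $z \in \{0, 1\}$ is an assumption made for illustration; under it the loss range at action $a$ is $|1 - 2a|$, so $\delta_n(a)$ shrinks at the $1/\sqrt{n}$ rate and vanishes at $a = 1/2$, where the loss does not depend on $z$ at all.

```python
import numpy as np

eps = 0.05  # fixed tail probability

def delta_n(a, n):
    """Hoeffding bound for a fixed action a: with probability >= 1 - eps,
    |L(P_n, a) - L(P, a)| <= delta_n(a).

    Assumes the quadratic loss g_a(z) = (z - a)^2 with z in {0, 1},
    whose range over z is c(a) = |1 - 2a|.
    """
    c = abs(1 - 2 * a)
    return c * np.sqrt(np.log(2 / eps) / (2 * n))

for a in (0.2, 0.5, 0.8):
    print(f"a = {a}:", [round(delta_n(a, n), 4) for n in (100, 400, 1600)])
```

Quadrupling the sample size halves $\delta_n(a)$, and the bound is symmetric in $a$ around $1/2$, mirroring the action-dependent curves in the figure.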

The left panel of the following figure compares the uniform bound derived from Rademacher complexity along with the $\{\delta_n(a)\}$ sequences for different actions $a$ in the coin-tossing example.

  • The bound expressed in terms of Rademacher complexity clearly dominates the absolute difference between the empirical and true loss for all actions.
  • We can see the sense in which the tail bound represents a worst-case viewpoint: by bounding the supremum of the difference between the empirical and true loss, one can be sure that generalization works uniformly well over the action space, but the bound is loose for most actions.

Calculation of the Rademacher complexity in $(*)$ requires knowledge of the true distribution. In this sense, inequality $(**)$ provides a bound that is more useful in practice. As we have seen above, this inequality incorporates the empirical Rademacher complexity and uses a looser scaling factor in the upper bound. On the other hand, this bound is agnostic in the sense that it is applicable to any true distribution $\theta\in[0,1]$, i.e. to any statistical model in $\mathcal{H}$. The right panel of the above figure demonstrates this fact.

  • For fixed $a$, different $\theta$ distributions induce different tail probabilities. The bound constructed from the empirical Rademacher complexity works uniformly for all such alternative true probabilities.
  • One might consider this bound as a result of a worst-case analysis taking into account both the action space $\mathcal{A}$ and the class of statistical models $\mathcal{H}$.

Bounding expectations

Closely related to the above tail bounds, one can also bound the expectation of the supremum of an empirical process using the Rademacher complexity. This result is often called the symmetrization inequality.

Theorem: For any $P$-integrable function class $\mathcal{G}$, we have that

$$\mathbb{E}_{Z^n}\Big[\Vert P_n - P \Vert_{\mathcal{G}} \Big] \leq 2\mathsf{R}\left(\mathcal{G}, n\right) = 2 \mathbb{E}_{Z^n}\Big[\mathsf{R} \left(\mathcal{G}(Z^n) \right)\Big]. $$
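The inequality can be illustrated by simulation. The sketch below compares a Monte Carlo estimate of $\mathbb{E}\Vert P_n - P\Vert_{\mathcal{G}}$ with twice the estimated Rademacher complexity in a coin-tossing setup; the quadratic loss, the action grid and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 50, 0.4                 # hypothetical coin-tossing example
grid = np.linspace(0, 1, 21)       # actions a; assumed loss g_a(z) = (z - a)^2

def true_loss(a):
    # E[(Z - a)^2] for Z ~ Bernoulli(theta)
    return theta * (1 - a) ** 2 + (1 - theta) * a ** 2

reps, lhs, rhs = 500, [], []
for _ in range(reps):
    z = rng.binomial(1, theta, size=n)
    G = (z[None, :] - grid[:, None]) ** 2                # g_a(z_i)
    emp_loss = G.mean(axis=1)
    lhs.append(np.max(np.abs(emp_loss - true_loss(grid))))   # ||P_n - P||_G
    sigma = rng.choice([-1.0, 1.0], size=(200, n))
    rhs.append(np.mean(np.max(sigma @ G.T / n, axis=1)))     # emp. Rademacher

print(f"E||P_n - P||_G ~ {np.mean(lhs):.3f}  <=  "
      f"2 R(G, n) ~ {2 * np.mean(rhs):.3f}")
```

The simulated expectation of the supremum sits comfortably below twice the Rademacher complexity, consistent with the factor-of-two slack in the symmetrization argument.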

Unfortunately, computing the Rademacher complexity directly is only feasible in special cases (see one example below). There are various techniques, however, which give bounds on the Rademacher complexity. For different classes of functions different techniques prove to be useful, so one usually proceeds on a case-by-case basis. The most common ways of bounding the Rademacher complexity are via the Vapnik-Chervonenkis dimension for binary functions and via metric entropy for bounded real valued functions.


The code for the simulations and producing the graphs can be found here.


References

Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced lectures on machine learning (pp. 169-207). Springer Berlin Heidelberg.

Chervonenkis, A., & Vapnik, V. (1971). Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Automation and Remote Control, 32, 207-217.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029-1054.

Manski, Charles F. (1988). Analog estimation methods in econometrics. Chapman and Hall.

Pollard, David (1984). Convergence of Stochastic Processes. Springer-Verlag.

Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge University Press.

Wasserman, Larry (2016). Lecture notes for Statistical Machine Learning at CMU. Link

White, Halbert (1994), Estimation, Inference and Specification Analysis (Econometric Society Monographs). Cambridge University Press.
