Mathematical Formulas in the DDS Code

Computing parameters for the Cormode & Garofalakis (TODS) model

Assume that we are given a maximum overall error $\beta$ for monitoring in the TODS method, with probability of failure equal to $\gamma$.

The sizing parameters of the model are the size of the sketch $D$ and $L$, the number of local sites $k$, and the local threshold $\theta$. From the AGMS paper, a sketch of dimensions $D\times L$ can answer queries with accuracy $\varepsilon = \frac{4}{\sqrt{L}}$, with probability at least $1-2^{-D/2}$.

Note that the accuracy means the following: the sketch product at time $t$, say $Q(t)$, will, with probability at least $1-\gamma$, be $$ Q(t) \in f_1\cdot f_2 \pm \beta \|f_1(t)\| \|f_2(t)\|,$$ where $f_i(t)$ are the actual frequency vectors.

On the other hand, the TODS method requires a local threshold $\theta$, and guarantees that the coordinator can answer queries with accuracy $$\beta = \varepsilon + (1+\varepsilon)^2 (2\theta + \theta^2),$$ with probability of failure at most $$\gamma = 2k\cdot 2^{-D/2},$$.

With respect to $L=\frac{4}{\varepsilon^2}$ and $\theta$, we have a trade-off, for any given $\beta$. We can choose a value for $\varepsilon \in (0, \beta)$, and compute a matching $\theta$.

A good rule of the thumb, derived experimentally, seems to be to choose $\varepsilon = \beta/2$. Whatever the choice for $\varepsilon$ is, we can the proceed as follows:

$D = \lceil 2\log_2(\frac{2k}{\gamma})\rceil$
$L = \lceil \frac{16}{\varepsilon^2} \rceil$
$\theta = \sqrt{1+\frac{\beta-\varepsilon}{(1+\varepsilon)^2}} -1$.

Computing parameters for the Geometric Method

Assume that we are given a maximum overall error $\beta \in (0,1)$ for monitoring in the Geometric Method, with probability of failure equal to $\gamma$.

Again, we have a choice of $\varepsilon \in (0,\beta)$ for the (global) sketches. Given this choice, the sketch size can be chosen:

$D = \lceil 2\log(\frac{1}{\gamma}) \rceil$
$L = \lceil \frac{16}{\varepsilon^2} \rceil$

Now, the task is to monitor the query approximately, so that the overall error is at most $\beta$. To this effect, we will wish to create a safe zone for the sketch query $Q(t)$. That is, $Q(t)$ (which is unknown by the system) is the value of the sketch-product of the (unknown) global state---the sum of all local states. The AGMS theorem tells us that, with probability at least $1-\gamma$, $$ Q(t) \in f_1\cdot f_2 \pm \varepsilon \|f_1\| \|f_2\|, $$ where $f_i$ are the unsketched frequency vectors.

The system's estimate of $Q(t)$ is $Q_0$, the query value of the current estimate sketches at the coordinator. What we need is to construct a safe zone for $Q(t)$, that guarantees that $Q_0$ is a good estimate.

Theorem For the twoway-join query between frequency vectors $f_1$ and $f_2$, the admissible region condition $$ \frac{1+\varepsilon}{1+\beta} Q_0 \leq Q(t) \leq \frac{1-\varepsilon}{1-\beta} Q_0 $$ ensures that, with probability at least $1-\gamma$, it is $$ Q_0 \in f_1\cdot f_2 \pm \beta \|f_1\| \|f_2\|.$$

Proof Define $a_{\pm} = \frac{\beta-\varepsilon}{1\pm\varepsilon}$, The admissible region condition is equivalent to $$ (1- a_{-}) Q(t) \leq Q_0 \leq (1+a_{+})Q(t). $$ With probability at least $1-\gamma$, $$Q(t) \in f_1\cdot f_2 \pm \varepsilon \| f_1\| \| f_2\|. $$ Therefore, with at least the same probability it will be $$ (1-a_{-})\bigl(f_1\cdot f_2 - \varepsilon \|f_1\| \|f_2\|\bigr) \leq Q_0 \leq (1+a_{+}) \bigl(f_1\cdot f_2 + \epsilon \|f_1\| \|f_2\|\bigr). $$ This relation implies $$ f_1\cdot f_2 - \bigl(a_{-} + \varepsilon (1-a_{-})\bigr) \|f_1\| \|f_2\| \leq Q_0 \leq f_1\cdot f_2 + \bigl(a_{+} + \varepsilon (1+a_{+})\bigr) \|f_1\| \|f_2\|. $$ The proof is complete by observing that $$ a_{\pm} + \varepsilon (1\pm a_{\pm}) = \varepsilon + a_{\pm}(1\pm \varepsilon) = \beta. $$ QED

When the query is selfjoin of frequency vector $f$, the formulas hold exactly, with $f=f_1=f_2$.