Agoraphobic cashiers - Agile barman

See the report https://hal.archives-ouvertes.fr/hal-01476889v1


In [1]:
# Numpy & Matplotlib
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Plotly
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=False)



In [3]:
# Where the csv files are saved
directory = "./"

We consider the following queueing model:

  • $N$ servers;
  • Each server has capacity $\mu > 0$;
  • The total arrival rate in the queue is $N \lambda$ (so that $\lambda$ is the arrival rate per server);
  • Load of the queue $\rho = \frac{N \lambda}{N \mu} = \frac\lambda\mu$;
  • Upon arrival, each job is assigned to $d \ge 1$ servers chosen uniformly at random, independently of the current state of the queue and of the previous and future assignments of other jobs;
  • Each job can be processed in parallel by several servers, so that its total service rate is the sum of the capacities of the servers that are processing it;
  • The servers process their jobs in order of arrival (FCFS);
  • All jobs have independent, exponentially distributed sizes, with mean $1$, and each job leaves the queue immediately after service completion.

Since all servers are exchangeable and each job is assigned to $d$ servers, the mean queue length of each server is $\frac{d \mathbb{E}({\bf X})}N$, where ${\bf X}$ is the random variable that counts the total number of jobs in the queue. Since this quantity tends to infinity when the load tends to $1$, we normalize it by the mean queue size $\rho / (1-\rho)$ of the servers in the worst case, when $d = 1$ or $d = N$ (see Section 3). By K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, M. Velednitsky, and S. Zbarsky. 2016. Redundancy-d: The Power of d Choices for Redundancy. Submitted to Operations Research, we have: $$ \frac{ \frac{d \mathbb{E}({\bf X})}N }{ \frac\rho{1-\rho} } = \frac{d}N \sum_{m=d}^N \frac{1-\rho}{ \frac{\binom{N-1}{d-1}}{\binom{m-1}{d-1}} - \rho }. $$ We would like to minimize the mean queue length of the servers. The plots below suggest that $d = \sqrt{N}$ is optimal when the load is low or intermediate. The optimal value of $d$ seems to decrease when the load increases, and reaches $d = 2$ when the load is very close to $1$. Here we will first observe the evolution of the mean queue length of the servers, depending on $N$, $d$ and $\rho$. Then we will try to prove the observations.
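
As a sanity check, this closed-form expression can also be evaluated directly from the binomial coefficients. Below is a minimal sketch (the function name is illustrative, and scipy is assumed to be available); the cumprod-based functions further down avoid forming the large factorials explicitly.


In [ ]:
# Direct evaluation of the normalized mean queue length formula above,
# useful to cross-check the faster cumprod-based functions below.
import numpy as np
from scipy.special import comb

def mean_relative_queue_length_direct(N, d, ρ):
    # (d/N) * sum_{m=d}^{N} (1-ρ) / ( C(N-1,d-1)/C(m-1,d-1) - ρ )
    m = np.arange(d, N + 1)
    ratio = comb(N - 1, d - 1) / comb(m - 1, d - 1)
    return d * np.sum((1 - ρ) / (ratio - ρ)) / N

# Example: should match mean_relative_queue_length_d(100, [10], .5)[0] below
print(mean_relative_queue_length_direct(100, 10, .5))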

Functions


In [4]:
# Mean queue length as a function of d

def mean_relative_queue_length_d(N, dd, ρ):
    mean = zeros(len(dd))
    
    for i, d in enumerate(dd):
        # binom[j] = C(d-1+j, d-1), built by cumulative products to avoid
        # large factorials, so that binom[-1] / binom gives the ratios
        # C(N-1,d-1) / C(m-1,d-1) for m = d,...,N.
        binom = ones(N-d+1)
        binom[1:] = cumprod( 1 + (d-1) / arange(1,N-d+1) )
        mean[i] = d * (1-ρ) * sum( 1 / (binom[-1] / binom - ρ) ) / N
    
    return mean

In [5]:
# Mean queue length as a function of ρ

def mean_relative_queue_length_ρ(N, d, ρρ):
    mean = zeros(len(ρρ))
    
    # binom[-1] / binom gives the ratios C(N-1,d-1) / C(m-1,d-1) for
    # m = d,...,N (see mean_relative_queue_length_d); d is fixed here,
    # so the coefficients are computed once outside the loop.
    binom = ones(N-d+1)
    binom[1:] = cumprod( 1 + (d-1) / arange(1,N-d+1) )
    
    for i, ρ in enumerate(ρρ):
        mean[i] = d * (1-ρ) * sum( 1 / (binom[-1] / binom - ρ) ) / N
    
    return mean

The following mean field approximation was proved in Theorem 10 of Gardner et al. (2016), cited above. Observe that this quantity depends neither on $N$ nor on $d$.


In [6]:
# Mean field approximation which is independent of d

def mean_field(ρ):
    return - (1.-ρ) * log(1.-ρ) / ρ

In [7]:
# Distribution of the collisions when there are 2 jobs in the queue (hyper-geometric)

def distribution_collisions(N, d):
    # Support of the number of collisions and pmf, built from the term
    # ratio p_k / p_{k-1} = (d-k+1)² / (k (N-2d+k)) to avoid factorials.
    k = arange( max(2*d-N, 0), d+1, 1 )
    p = ones(len(k))
    p[1:] = cumprod( (d - k[1:] + 1) * (d - k[1:] + 1) / ( k[1:] * (N - 2*d + k[1:]) ) )
    return k, p / sum(p)

In [8]:
# Truncated Poisson approximation for the distribution of the collisions

def poisson(N, d):
    # Poisson pmf built from the term ratio p_k / p_{k-1} = (d²/N) / k,
    # truncated to the support of the hyper-geometric distribution
    # and renormalized.
    k = arange( max(2*d-N, 0), d+1, 1 )
    p = ones(len(k))
    p[1:] = cumprod( (d*d/N) / k[1:] )
    return p / sum(p)

Relative mean queue length as a function of $\rho$

In the limit when $\rho \to 1$, the normalized mean queue length tends to $\frac{d}N$ for $d \ge 2$ (for $d = 1$ it is identically equal to $1$).
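
This limit can be checked numerically (a quick sketch reusing mean_relative_queue_length_d from above):


In [ ]:
# Heavy-load check: for d >= 2 the relative mean queue length
# should be close to d/N when ρ is very close to 1.
N = 100
for d in [2, 10, 50]:
    value = mean_relative_queue_length_d(N, [d], 1 - 1e-9)[0]
    print("d =", d, " value =", value, " d/N =", d / N)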


In [9]:
N = 100
ρρ = arange(.001, 1, .001)
dd = [2, 3, 10, 15, 20, 60, 90, 100]

# Plots
data_comp = [
    go.Scatter(x = ρρ, y = mean_field(ρρ), name = "Mean field", line = dict(dash = 'dash'))
]

for d in dd:
    # Files
    # savetxt(directory + "exact-N" + str(N) + "-d" + str(d) + "-rho0_1.csv",
    #         np.c_[ρρ, mean_relative_queue_length_ρ(N, d, ρρ)])

    data_comp += [
        go.Scatter(
            x = ρρ, y = mean_relative_queue_length_ρ(N, d, ρρ),
            name = "d = " + str(d), mode = 'lines',
            line = dict(width=1,),
        ),
    ]

layout_comp = go.Layout(
    title="N = " + str(N),
    xaxis = dict(title='ρ'), yaxis = dict(title='Relative mean queue length'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))



In [10]:
N = 1000
ρρ = arange(.001, 1, .001)
dd = [20, 30, 100, 150, 200, 600, 900, 1000]

# Plots
data_comp = [
    go.Scatter(x = ρρ, y = mean_field(ρρ), name = "Mean field", line = dict(dash = 'dash'))
]

for d in dd:
    # Files
    # savetxt(directory + "exact-N" + str(N) + "-d" + str(d) + "-rho0_1.csv",
    #         np.c_[ρρ, mean_relative_queue_length_ρ(N, d, ρρ)])

    data_comp += [
        go.Scatter(
            x = ρρ, y = mean_relative_queue_length_ρ(N, d, ρρ),
            name = "d = " + str(d), mode = 'lines',
            line = dict(width=1,),
        ),
    ]

layout_comp = go.Layout(
    title="N = " + str(N),
    xaxis = dict(title='ρ'), yaxis = dict(title='Relative mean queue length'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


As $N$ increases, the mean field approximation seems to coincide with the minimum.
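
This observation can be checked numerically (a quick sketch reusing mean_field and mean_relative_queue_length_d from above; it prints both values side by side without asserting anything):


In [ ]:
# Compare the minimum over d of the exact formula with the mean field
# approximation, for a few values of the load.
N = 1000
dd = arange(1, N+1, 1)
for ρ in [.3, .6, .9]:
    print(ρ, min(mean_relative_queue_length_d(N, dd, ρ)), mean_field(ρ))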

Optimal $d$ as a function of $\rho$


In [11]:
N = 700
ρρ = arange(.01, 1, .01)
dd = arange(1, N+1, 1)

data_comp = []

dd_opt = zeros(len(ρρ))
for i, ρ in enumerate(ρρ):
    dd_opt[i] = dd[argmin(mean_relative_queue_length_d(N, dd, ρ))]

# Plots
data_comp = [
    go.Scatter(
        x = ρρ, y = dd_opt,
        mode = 'lines', name = "Optimal d", line = dict(width=1,),
    ),
    go.Scatter(
        x = ρρ, y = sqrt(N) * pow(1-ρρ, 1/4),
        mode = 'lines', name = "N^(1/2) (1-ρ)^(1/4)$", line = dict(width=1,),
    ),
]

layout_comp = go.Layout(
    title="N = " + str(N),
    xaxis = dict(title='ρ'), yaxis = dict(title='d optimal'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


We observe that the curve of the optimal value of $d$ is close to $\sqrt{N} (1-\rho)^{\frac14}$, but we have not tried to go further yet.

Optimality of $\sqrt{N}$ under low load


In [12]:
ρρ = concatenate((arange(.001, 1, .001), arange(.999, 1, .0001)))

data_comp = []

for N in [100, 400, 900]:
    # Ratio between the optimal mean queue length and the one obtained with
    # d = ceil(sqrt(N)); a value close to 1 means sqrt(N) is nearly optimal.
    ratio = zeros(len(ρρ))
    dd = arange(1, N+1, 1)
    
    for i, ρ in enumerate(ρρ):
        mean = mean_relative_queue_length_d(N, dd, ρ)
        ratio[i] = min(mean) / mean[int(ceil(sqrt(N))) - 1]

    # Files
    # savetxt(directory + "optimald-N" + str(N) + "-rho0_1.csv", np.c_[ρρ, ratio])

    # Plots
    data_comp += [
        go.Scatter(
            x = ρρ, y = ratio,
            mode = 'lines', name = "N = " + str(N), line = dict(width=1,),
        ),
    ]

layout_comp = go.Layout(
    title = "Ratio between the mean queue size of the servers when d = N^(1/2) and the optimal value",
    xaxis = dict(title='ρ'), yaxis = dict(title='d optimal'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


We observe that $d = \sqrt{N}$ remains close to optimal even for high values of the load, and that the approximation improves as $N$ increases.

Approximation under low load


In [13]:
N = 100
ρ = 1/N
dd = arange(1, N+1, 1)

data_comp = []

approximation = (1 - ρ) / (1 - ρ / (2 - (1/dd+dd/N)))
    
# Files
#savetxt(directory + "empty-exact.csv",
#        np.c_[dd, mean_relative_queue_length_d(N, dd, ρ)])
#savetxt(directory + "empty-appro.csv",
#        np.c_[dd, approximation])

data_comp += [
    go.Scatter(
        x = dd, y = mean_relative_queue_length_d(N, dd, ρ),
        name = "Exact", mode = 'lines', line = dict(width=1,),
    ),
    go.Scatter(
        x = dd, y = approximation,
        name = "Approximation", mode = 'lines', line = dict(width=1,),
    ),
]

layout_comp = go.Layout(
    title="N = " + str(N) + " et ρ = " + str(ρ),
    xaxis = dict(title='d', type="log"), yaxis = dict(title='Relative mean queue size'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


Approximation under heavy load

Here we plot the approximation we found in Section 5:


In [14]:
N = 100
ρ = 1 - 1/(N*N)
dd = arange(1, N+1, 1)

data_comp = []

approximation = ones(N)
approximation[1:] = dd[1:] / N
    
# Files
# savetxt(directory + "full-exact.csv",
#         np.c_[dd, mean_relative_queue_length_d(N, dd, ρ)])
# savetxt(directory + "full-appro.csv",
#     np.c_[dd, approximation])

data_comp += [
    go.Scatter(
        x = dd, y = mean_relative_queue_length_d(N, dd, ρ),
        name = "Exact", mode = 'lines', line = dict(width=1,),
    ),
    go.Scatter(
        x = dd, y = approximation,
        name = "Approximation", mode = 'lines', line = dict(width=1,),
    ),
]

layout_comp = go.Layout(
    title="N = " + str(N) + " et ρ = " + str(ρ),
    xaxis = dict(title='d', type="log"), yaxis = dict(title='Relative mean queue size'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


It is possible to obtain another approximation when the load is small, by using a Taylor expansion. We have successively:
\begin{align*}
\delta &= \frac1{N \mu} \sum_{m=d}^N \frac{\binom{m-1}{d-1}}{\binom{N-1}{d-1}} \frac1{1 - \frac{\binom{m-1}{d-1}}{\binom{N-1}{d-1}} \rho} = \frac1{N \mu} \sum_{m=d}^N \frac{\binom{m-1}{d-1}}{\binom{N-1}{d-1}} \left[ 1 + \frac{\binom{m-1}{d-1}}{\binom{N-1}{d-1}} \rho + o(\rho) \right] \\
&= \frac1{N \mu} \left\{ \frac{\sum_{m=d}^N \binom{m-1}{d-1}}{\binom{N-1}{d-1}}
    + \rho \sum_{m=d}^N \left[ \frac{\binom{m-1}{d-1}}{\binom{N-1}{d-1}} \right]^2
    + o(\rho)
\right\}.
\end{align*}
Using the equality
$$ \sum_{m=0}^n \binom{m}{i} \binom{n-m}{k-i} = \binom{n+1}{k+1}, \quad \forall n \in \mathbb{N}^*, \quad \forall i, k \in \{0,\ldots,n\} $$
in the special case where $i = k$, we have
$$ \sum_{m=0}^n \binom{m}{k} = \binom{n+1}{k+1}, \quad \forall n \in \mathbb{N}^*, \quad \forall k \in \{0,\ldots,n\}. $$
We obtain
$$ \sum_{m=d}^N \binom{m-1}{d-1} = \sum_{m=d-1}^{N-1} \binom{m}{d-1} = \sum_{m=0}^{N-1} \binom{m}{d-1} = \binom{N}{d}, $$
so that
$$ \frac{\sum_{m=d}^N \binom{m-1}{d-1}}{\binom{N-1}{d-1}} = \frac{\binom{N}{d}}{\binom{N-1}{d-1}} = \frac{N}{d}. $$
Hence,
$$ \delta = \frac1{N \mu} \left\{ \frac{N}{d}
    + \rho \sum_{m=d}^N \left[ \frac{\binom{m-1}{d-1}}{\binom{N-1}{d-1}} \right]^2
    + o(\rho)
\right\}
\quad \text{i.e.} \quad
\frac\gamma{d} = \frac1{
    \frac1\mu
    + \rho \frac{d}{N \mu} \frac{\sum\limits_{m=d}^N \binom{m-1}{d-1}^2}{\binom{N-1}{d-1}^2}
    + o(\rho)
}.
$$
We need to study the variations of
$$
\alpha_d = \frac{d}{N \mu} \frac{\sum_{m=d}^N \binom{m-1}{d-1}^2}{\binom{N-1}{d-1}^2}, \qquad d = 1,\ldots,N,
$$
but this is not so easy...
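
As a first numerical look at these variations, here is a sketch that reuses the %pylab namespace, takes $\mu = 1$, and forms the ratios $c_m = \binom{m-1}{d-1} / \binom{N-1}{d-1}$ with the same cumprod trick as the functions above (the variable names are illustrative):


In [ ]:
# Numerical evaluation of α_d (with μ = 1): α_d = (d/N) Σ_m c_m².
N = 100
αα = zeros(N)
for i, d in enumerate(arange(1, N+1)):
    binom = ones(N-d+1)
    binom[1:] = cumprod( 1 + (d-1) / arange(1,N-d+1) )
    c = binom / binom[-1]          # c_m for m = d,...,N
    αα[i] = d * sum(c * c) / N
print("d minimizing α_d:", argmin(αα) + 1)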

Intermediate load


In [15]:
N = 100
ρ = .8
dd = arange(1, N+1, 1)

data_comp = []
    
# Files
# savetxt(directory + "intermediary-exact.csv",
#         np.c_[dd, mean_relative_queue_length_d(N, dd, ρ)])

data_comp += [
    go.Scatter(
        x = dd, y = mean_relative_queue_length_d(N, dd, ρ),
        name = "Exact", mode = 'lines', line = dict(width=1,),
    ),
]

layout_comp = go.Layout(
    title="N = " + str(N) + " et ρ = " + str(ρ),
    xaxis = dict(title='d', type="log"), yaxis = dict(title='Relative mean queue size'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


Distribution of the collisions

When the load is low, there are almost always few customers in the queue. One can show that the only states which "matter" are those with $0$, $1$ or $2$ customers in the queue. When there is $0$ or $1$ customer, we know exactly the number of active servers: $0$ when the number of customers is $0$, and $d$ when there is $1$ customer. When there are $2$ customers in the queue, the number of active servers is no longer deterministic: it is distributed according to a hyper-geometric distribution. To prove it, we consider the queueing model of Gardner et al. (2016), cited above, and we define an aggregate state which is more practical for our purpose.

Equivalent per-class model

  • The class of a job is given by the set of servers to which it is assigned.
  • There are $\binom{N}{d}$ classes of jobs.
  • The set of job classes is the family of subsets of $\{1,\ldots,N\}$ with $d$ elements: $\Sigma = \{ {\cal S} \subset \{1,\ldots,N\}: |{\cal S}| = d \}$.
  • Queue state = sequence of sets of servers assigned to each job, by order of arrival: $$ ({\cal S}_1,\ldots,{\cal S}_n) \in \Sigma^* $$ where $n$ is the number of jobs in the queue and ${\cal S}_k$ is the set of servers that can process the job in position $k$, for each $k = 1,\ldots,n$.
  • Stability condition given in Gardner et al. (2016): $\lambda < \mu$, that is $\rho < 1$.
  • The stationary distribution of this detailed queue state satisfies the recursion \begin{align*}
    \pi({\cal S}_1,\ldots,{\cal S}_n)
    = \frac{ \frac{N \lambda}{\binom{N}d} }{ \bigg| \bigcup\limits_{k=1}^n {\cal S}_k \bigg| \mu }
    \pi({\cal S}_1,\ldots,{\cal S}_{n-1})
    = \frac{N \rho}{\bigg| \bigcup\limits_{k=1}^n {\cal S}_k \bigg| \binom{N}{d}  }
    \pi({\cal S}_1,\ldots,{\cal S}_{n-1})
    \end{align*} for all $n \ge 1$ and ${\cal S}_1,\ldots,{\cal S}_n \in \Sigma$.

Aggregate state

  • We describe the queue state by a sequence $a = (a_1,\ldots,a_n)$, where $n$ is the number of jobs in the queue and $a_k$ is the number of servers that are (currently) processing the $k$-th job in the queue, for each $k = 1,\ldots,n$.
  • $a$ defines a stochastic process on its state space
  • For each $n \in \mathbb{N}$, let ${\cal A}_n$ denote the set of possible aggregate states with $n$ jobs: $$ {\cal A}_n = \left\{ a = (a_1,\ldots,a_n) \in \mathbb{N}^n :~ \forall k = 1,\ldots,n,~ a_k \le d \wedge (N - a_1 - \ldots - a_{k-1})^+ \right\} $$
  • The corresponding state space is ${\cal A} = \bigcup_{n=0}^{+\infty} {\cal A}_n$.

We define the stationary distribution of $a$ by $$ \pi(a) = \sum_{\substack{ {\cal S}_1,\ldots,{\cal S}_n \in \Sigma: \\ \forall k = 1,\ldots,n,~ \big| {\cal S}_k \cap \bigcup_{\ell=1}^{k-1} {\cal S}_\ell \big| = d - a_k }} \pi({\cal S}_1,\ldots,{\cal S}_n). $$ We can compute it using the stationary distribution of the detailed queue state, and we get: $$ \pi(a) = \frac{N \rho}{|a|} \frac{ \binom{a_1 + \ldots + a_{n-1}}{d - a_n} \binom{N - a_1 - \ldots - a_{n-1}}{a_n} }{\binom{N}{d}} \pi\left( a_1,\ldots,a_{n-1} \right). $$

Proof: \begin{align*} \pi(a) &= \sum_{\substack{ {\cal S}_1,\ldots,{\cal S}_n \in \Sigma: \\ \forall k = 1,\ldots,n,~ \big| {\cal S}_k \cap \bigcup_{\ell=1}^{k-1} {\cal S}_\ell \big| = d - a_k }} \pi({\cal S}_1,\ldots,{\cal S}_n), \\ &= \frac{N \rho}{|a| \binom{N}d} \sum_{\substack{ {\cal S}_1,\ldots,{\cal S}_{n-1} \in \Sigma: \\ \forall k = 1,\ldots,n-1,~ \big| {\cal S}_k \cap \bigcup_{\ell=1}^{k-1} {\cal S}_\ell \big| = d - a_k }} \sum_{\substack{ {\cal S}_n \in \Sigma: \big| {\cal S}_n \cap \bigcup_{\ell=1}^{n-1} {\cal S}_\ell \big| = d - a_n }} \pi({\cal S}_1,\ldots,{\cal S}_{n-1}), \\ &= \frac{N \rho}{|a| \binom{N}d} \binom{a_1 + \ldots + a_{n-1}}{d - a_n} \binom{N - a_1 - \ldots - a_{n-1}}{a_n} \sum_{\substack{ {\cal S}_1,\ldots,{\cal S}_{n-1} \in \Sigma: \\ \forall k = 1,\ldots,n-1,~ \big| {\cal S}_k \cap \bigcup_{\ell=1}^{k-1} {\cal S}_\ell \big| = d - a_k }} \pi({\cal S}_1,\ldots,{\cal S}_{n-1}), \\ &= \frac{N \rho}{|a|} \frac{ \binom{a_1 + \ldots + a_{n-1}}{d - a_n} \binom{N - a_1 - \ldots - a_{n-1}}{a_n} }{ \binom{N}d } \pi(a_1,\ldots,a_{n-1}). \end{align*} $\Box$

Another interpretation: hyper-geometric distribution of the number of collisions. $$ \pi(a) = \frac{N \rho}{|a|} p_{d-a_n | a_1+\ldots+a_{n-1}} \pi(a_1,\ldots,a_{n-1}) $$ where $$ p_{k|a} = \frac{ \binom{a}{k} \binom{N-a}{d-k} }{ \binom{N}d } $$ is the probability of getting $k$ collisions when drawing $d$ servers among $N$, of which $a$ are already active. This can be rewritten as $$ \pi(a) = \frac{N \lambda p_{d-a_n | a_1+\ldots+a_{n-1}}}{|a| \mu} \pi(a_1,\ldots,a_{n-1}) $$ where $N \lambda p_{d-a_n | a_1 + \ldots + a_{n-1}}$ is the arrival rate of the classes that have exactly $d - a_n$ collisions with the $a_1 + \ldots + a_{n-1}$ servers that are already active.
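
Since $p_{k|a}$ is just a hyper-geometric pmf, it can be evaluated directly with scipy.stats (a minimal sketch; the function name is illustrative, scipy is assumed to be available, and the underscored names avoid clobbering the notebook's globals):


In [ ]:
# p_{k|a}: probability of k collisions when drawing d servers among N,
# of which a are already active, i.e. hypergeom(N, a, d).pmf(k).
from scipy.stats import hypergeom

def p_collisions(N, d, a, k):
    return hypergeom(N, a, d).pmf(k)

# Sanity check: the probabilities sum to 1 over k = 0,...,d.
N_, d_, a_ = 100, 10, 30
print(sum(p_collisions(N_, d_, a_, k) for k in range(d_ + 1)))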

One can also observe that $$ \forall a \in {\cal A}, \quad \sum_{\ell=0}^{d \wedge (N - |a|)^+} \pi(a_1,\ldots,a_n,\ell) = \sum_{\ell=0}^{d \wedge (N - |a|)^+} \frac{ \binom{|a|}{d - \ell} \binom{N - |a|}\ell }{\binom{N}{d}} \frac{N \rho}{|a| + \ell} \pi(a) \le \frac{N \rho}d \pi(a), $$ since the terms with $d - \ell > |a|$ vanish, which implies in particular $$ \sum_{a \in {\cal A}_{n+1}} \pi(a) = \sum_{a \in {\cal A}_n} \sum_{\ell=0}^{d \wedge (N - |a|)^+} \pi(a_1,\ldots,a_n,\ell) \le \frac{N \rho}d \sum_{a \in {\cal A}_n} \pi(a). $$
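
This geometric bound can be checked numerically on a small system by enumerating the aggregate states level by level (a sketch with illustrative names; π is computed unnormalized from the recursion above, and scipy is assumed):


In [ ]:
# Enumerate the aggregate states level by level, compute the unnormalized
# π(a) from the recursion above, and check Σ_{A_{n+1}} π ≤ (Nρ/d) Σ_{A_n} π.
# Underscored names avoid clobbering the notebook's globals.
from scipy.stats import hypergeom

N_, d_, ρ_ = 10, 3, .5
levels = [{(): 1.0}]                      # A_0 contains only the empty state
for n in range(4):
    nxt = {}
    for a, π_a in levels[-1].items():
        s = sum(a)                        # |a| = number of active servers
        # only values of ℓ with p_{d-ℓ | |a|} > 0 contribute
        for ell in range(max(0, d_ - s), min(d_, N_ - s) + 1):
            p = hypergeom(N_, s, d_).pmf(d_ - ell)
            nxt[a + (ell,)] = N_ * ρ_ / (s + ell) * p * π_a
    levels.append(nxt)
    print(sum(nxt.values()) <= N_ * ρ_ / d_ * sum(levels[-2].values()) + 1e-12)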

Approximate a hyper-geometric distribution by a Poisson distribution

Under a suitable scaling regime, this hyper-geometric distribution can be approximated by a Poisson distribution with parameter $\gamma = \frac{ad}N$.

Proof: http://math.stackexchange.com/questions/427491/approximating-hypergeometric-distribution-with-poisson We will prove the following approximation: $$ p_{k|a} \to \frac{\gamma^k}{k!} e^{-\gamma} \quad \text{when} \quad N, a, d \to +\infty ~ \text{and} ~ \frac{ad}{N} \to \gamma > 0. $$ We decompose $p_{k|a}$ into three factors: $$ p_{k|a} = \frac{\binom{a}{k} \binom{N-a}{d-k}}{\binom{N}{d}} = \frac1{k!} \times \frac{a!}{(a-k)!} \frac{d!}{(d-k)!} \frac{(N-k)!}{N!} \times \frac{(N-a)! (N-d)!}{(N-a-d+k)! (N-k)!} $$ One easily shows that the second factor tends to $\gamma^k$: \begin{align*} \frac{a!}{(a-k)!} \frac{d!}{(d-k)!} \frac{(N-k)!}{N!} &= \frac{ \frac{a!}{(a-k)!} \frac{d!}{(d-k)!} }{ \frac{N!}{(N-k)!} } = \prod_{j=0}^{k-1} \frac{ (a-j)(d-j) }{N-j} = \left( \frac{ad}N \right)^k \times \prod_{j=0}^{k-1} \frac{ \left( 1 - \frac{j}a \right) \left( 1- \frac{j}d \right) }{1 - \frac{j}N}, \\ &\to \gamma^k \quad \text{when} \quad N, a, d \to +\infty ~ \text{and} ~ \frac{ad}{N} \to \gamma > 0. \end{align*} We now show that the third factor tends to $e^{-\gamma}$. We have $$ \frac{(N-a)! (N-d)!}{(N-a-d+k)! (N-k)!} = \frac{ \frac{(N-a)!}{(N-a-(d-k))!} }{ \frac{(N-k)!}{(N-k-(d-k))!} } = \prod_{j=0}^{d-k-1} \frac{N-a-j}{N-k-j} = \prod_{j=0}^{d-k-1} \left( 1 - \frac{a-k}{N-k-j} \right), $$ then, taking logarithms, $$ \ln\left( \frac{(N-a)! (N-d)!}{(N-a-d+k)! (N-k)!} \right) = \sum_{j=0}^{d-k-1} \ln\left( 1 - \frac{a-k}{N-k-j} \right). $$ Since the sequence $\left( 1 - \frac{a-k}{N-k-j} \right)_{j=0,\ldots,d-k-1}$ is decreasing, we obtain the following bounds: $$ (d-k) \ln\left( 1 - \frac{a-k}{N-d+1} \right) \le \ln\left( \frac{(N-a)! (N-d)!}{(N-a-d+k)! (N-k)!} \right) \le (d-k) \ln\left( 1 - \frac{a-k}{N-k} \right). $$ Both the left-hand and right-hand sides tend to $-\gamma$:

  • For the left-hand side, we have: $$ \frac{a-k}{N-d+1} = \frac{ \frac1d - \frac{k}{ad} }{ \frac{N}{ad} - \frac1a + \frac1{ad} } \to 0, $$ so that \begin{align*} (d-k) \ln\left( 1 - \frac{a-k}{N-d+1} \right) &= - \frac{(d-k)(a-k)}{N-d+1} (1 + o(1)) \\ &= - \frac{ \left( 1- \frac{k}d \right) \left( 1- \frac{k}a \right) } { \frac{N}{ad} - \frac1a + \frac1{ad} } (1 + o(1)) \to - \gamma \end{align*}
  • For the right-hand side, we have: $$ \frac{a-k}{N-k} \sim \frac\gamma{d} \quad \text{so that} \quad \frac{a-k}{N-k} \to 0, $$ hence $$ (d-k) \ln\left( 1 - \frac{a-k}{N-k} \right) = - \frac{ (d-k) (a-k) }{N-k} (1 + o(1)) \to - \gamma. $$ We can therefore apply the squeeze theorem and conclude that $$ \ln\left( \frac{(N-a)! (N-d)!}{(N-a-d+k)! (N-k)!} \right) \to - \gamma, $$ and then, by continuity of the exponential function, $$ \frac{(N-a)! (N-d)!}{(N-a-d+k)! (N-k)!} \to e^{-\gamma}. $$ $\Box$
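
This convergence can be illustrated numerically by letting $N$ grow with $ad/N$ held (approximately) fixed. The sketch below assumes scipy is available, uses illustrative underscored names, and aliases scipy's poisson so as not to shadow the truncated-Poisson function defined above:


In [ ]:
# Distance between the hyper-geometric pmf and its Poisson limit,
# for increasing N with a = d ≈ sqrt(γ N), so that a d / N ≈ γ.
import numpy as np
from scipy.stats import hypergeom, poisson as poisson_dist

γ = 4.0
for N_ in [100, 1000, 10000]:
    a_ = d_ = int(np.sqrt(γ * N_))
    k = np.arange(d_ + 1)
    exact = hypergeom(N_, a_, d_).pmf(k)
    approx = poisson_dist(a_ * d_ / N_).pmf(k)
    # half L1 distance over k = 0,...,d (≈ total variation distance)
    print(N_, .5 * np.sum(np.abs(exact - approx)))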

Approximation when we don't scale

The number of collisions with $2$ jobs in the queue has a hyper-geometric distribution with parameters $N$, $d$ and $d$: $$ p_k = \frac{ \binom{d}{k} \binom{N-d}{d-k} }{ \binom{N}d }, \quad \forall k = (2d-N)^+, \ldots, d. $$ The approximation by a Poisson distribution is good when the system scales, but for finite sizes it is not very good, because the variance $\frac{d^2}N$ of the Poisson distribution keeps increasing with $d$, while the variance $\frac{d^2 (N-d)^2}{N^2 (N-1)}$ of the hyper-geometric distribution is not monotonic (it increases up to $d = N/2$ and decreases again). That's what we see below.
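
A quick sketch quantifying this mismatch (it reuses the %pylab namespace; the two closed-form variances are the standard ones recalled above):


In [ ]:
# Variance of the number of collisions: hyper-geometric vs Poisson.
N = 100
d = arange(1, N+1)
var_hyper = d**2 * (N - d)**2 / (N**2 * (N - 1))
var_poisson = d**2 / N
print("d maximizing the hyper-geometric variance:", d[argmax(var_hyper)])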


In [16]:
N = 100
dd = arange(10, 100, 10, dtype = int)

data_comp = []

for d in dd:
    k, p = distribution_collisions(N, d)
    # Unused leftover weighting (kept commented out):
    # terms = (1 / (2*d-k)) * p
    # terms /= sum(terms)
    
    data_comp += [
        go.Scatter(
            x = k, y = p,
            name = "Exact, d = " + str(d), mode = 'lines+markers',
        ),
        go.Scatter(
            x = k, y = poisson(N, d),
            name = "Poisson, d = " + str(d), mode = 'lines',
        ),
        # plot the mean of each distribution
        # go.Scatter(
        #     x = [d*d/N, d*d/N], y = [0,.4],
        #     name = "d = " + str(d), mode = 'lines',
        # ),
    ]

layout_comp = go.Layout(
    title="N = " + str(N), xaxis = dict(title='Number of collisions'), yaxis = dict(title='Probability'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))


However, one can observe a symmetry in the hyper-geometric distribution (see Appendix B3): whenever $d \ge \frac{N}2$, two sets of $d$ servers chosen among $N$ always overlap, so that there are (deterministically) at least $2d-N$ collisions. The number of additional collisions follows a hyper-geometric distribution with parameters $N$, $N-d$ and $N-d$. Using this intuition, we try to approximate the hyper-geometric distribution by the renormalized product of two truncated Poisson distributions. It seems to give a better fit.


In [17]:
N = 100
dd = arange(10, 100, 10, dtype = int)

data_comp = []

for d in dd:
    k, p = distribution_collisions(N, d)
    approximation = poisson(N, N-d) * poisson(N, d)
    
    data_comp += [
        go.Scatter(
            x = k, y = p,
            name = "Exact, d = " + str(d), mode = 'lines+markers',
        ),
        go.Scatter(
            x = k, y = approximation / sum(approximation),
            name = "Product approx., d = " + str(d), mode = 'lines',
        ),
        # plot the mean of each distribution
        # go.Scatter(
        #     x = [d*d/N, d*d/N], y = [0,.4],
        #     name = "d = " + str(d), mode = 'lines',
        # ),
    ]

layout_comp = go.Layout(
    title="N = " + str(N), xaxis = dict(title='Number of collisions'), yaxis = dict(title='Probability'),
    autosize = False, width = 800, height = 600,
)

iplot(go.Figure(data = data_comp, layout = layout_comp))