ACN 903 S1: Galton-Watson Process

Céline Comte & Fabien Mathieu


In [ ]:
%pylab inline

Introduction

For the first session, we propose to investigate the Galton-Watson process. This process was first introduced to analyse the propagation of a feature (family surname, DNA) through generations, and in particular to estimate its probability of extinction.

A Galton-Watson process can be depicted as a (directed) random tree. The nodes represent individuals and the arrows family relations. The tree is built generation by generation as follows. At generation $0$, there is a single individual, the population ancestor, represented by the tree root. Then, each individual from a given generation $ i $ gives birth to a random number of individuals at generation $ i+1 $, represented by its direct successors in the tree.

The unique parameter of the process is the distribution of the number of children per individual. Specifically, we assume that the numbers of children are independent between individuals and drawn from the same distribution $(p_k)_{k\in \mathbb{N}}$ (with $\sum_{k=0}^{\infty} p_k = 1$). We let $\mu$ denote the mean number of children per individual, which we assume to be finite: $$ \mu = \sum_{k=0}^{\infty} k p_k <+\infty. $$

Note that, in practice, there are multiple ways to explore the tree. We will consider two of them in the pratical:

  • Generation by generation. This is the method we described above. For each $i \in \mathbb{N}$, we let $ G_i $ denote the random variable that counts the number of nodes at generation $ i $.
  • Active node by active node. The nodes are visited one by one to decide on their number of children. We keep track of the number of nodes that are active in the sense that we have discovered them but we haven't drawn their number of children yet. As long as there is at least one active node, we can perform a termination that consists in desactivating a node, drawing its number of children according to the distribution $(p_k)_{k \in \mathbb{N}}$, and adding these children (if any) to the set of active nodes. For each $t \in \mathbb{N}$, we let $ X_t $ denote the number of nodes that are active after $ t $ terminations, with the convention that $X_0 = 1$. Observe that $(X_t)_{t \in \mathbb{N}}$ defines a Markov process that is similar to a birth-and-death process, except that state $0$ is absorbing.

The goal of the practical is to play with these two viewpoints in order to better understand the Galton-Watson process. We will focus on two metrics: the mean population size $\chi$ and the extinction probability $P_{ext}$ (i.e., the probability that the population is finite).

If you want to deepen your theoretical knowledge of this process, you can read Chapter 1 from the book Epidemics and Rumours in Complex Networks (which is not mandatory).

1. Bimodal distribution

Throughout the practical, we will first gain insight into the Galton-Watson process by running simulations with a specific distribution of the number of children, and then generalize our observations to any distribution.

For the simulations, we will focus on the bimodal distribution, in which a node can only have 0 or 2 children: $p_0 = 1-\frac{\mu}{2}$, $p_2 = \frac{\mu}{2}$, and $p_k = 0$ for $k \notin \{0,2\}$. Note that the mean $\mu$ can range from $0$ to $2$. The objective of this first exercice is to build a realization of the Galton-Watson process with the bimodal distribution, using either the generation by generation traversal or the active node by active node traversal.

Note: Try to write a flexible code, as the parameters and distributions will change later.

Question 1

We consider the generation by generation traversal.

First, complete the function generation_growth(μ, imax) that returns the values of $G_i$ observed during a realization of the process, up to generation imax. The function random.rand from numpy package may be handy.

Second, complete the function display_generation_growth(μ, imax = 20, n = 20) using the function plot from matplotlib package. You should plot the number of nodes as a function of the generation. The additional argument n gives the number of realizations of the process to plot on the same graph.

 Answer:


In [ ]:
def generation_growth(μ, imax):
    g = ones(imax + 1, dtype = int)
    # to be completed
    return g

In [ ]:
def display_generation_growth(μ, imax = 20, n = 20):
    figure()
    for i in range(n):
        # to be completed
    
    # just to make the plot prettier
    xlim([0, imax])
    axes = gca(); yl = axes.get_ylim(); ylim([0, yl[1]])
    xlabel('Generation'); ylabel('Number of nodes')
    title("μ = %.1f" % μ)
    show()

In [ ]:
for μ in arange(.4, 1.5, .2):
    display_generation_growth(μ)

Question 2

We now consider the active node by active node traversal.

First, complete the function active_growth(μ, tmax) that returns the values of $X_t$ observed during a realization of the process, up to tmax terminations. Look out! Your function shouldn't return negative values.

Second, complete the function display_active_growth(μ, tmax = 50, n = 10) using the function plot from matplotlib package. You should plot the number of active nodes as a function of the termination step.

Answer:


In [ ]:
def active_growth(μ, tmax):
    x = zeros(tmax + 1, dtype = int)
    # to be completed
    return x

In [ ]:
def display_active_growth(μ, tmax = 50, n = 10):
    figure()
    
    for i in range(n):
        # to be completed
    
    xlim([0, tmax])
    axes = gca(); yl = axes.get_ylim(); ylim([0, yl[1]])
    xlabel('Termination step'); ylabel('Number of active nodes')
    title("μ = %.1f" % μ)
    show()

In [ ]:
for μ in arange(.4, 1.5, .2):
    display_active_growth(μ)

2. Mean population size

The first quantity we evaluate is the mean population size. The results obtained by simulation under the bimodal distribution will prove representative of what we obtain under any distribution.

Question 1

Complete the function chi_bim_generation(μ, imax, n = 1000) that estimates $\mathbb{E}(G_i)$ for $i$ up to imax by averaging the results of generation_growth over n independent runs.

For $\mu = \frac12$ and $\mu = \frac43$, plot the results and comment them. We advise you to start with $\mu = \frac43$ and use a semi-log plot.

Code:


In [ ]:
def chi_bim_generation(μ, imax, n = 1000):
    # to be completed

Discussion:

Your turn!

Question 2

Write $\mathbb{E}(G_{i+1})$ as a function of $\mathbb{E}(G_i)$ for each $i \in \mathbb{N}$. Derive an explicit expression of $\mathbb{E}(G_i)$ for each $i \in \mathbb{N}$.

Answer:

Question 3

Use the result of the previous question to derive the value of the mean population size $\chi$.

Answer:

Question 4 (Optional)

Redo Question 2.1 with the active node by active node traversal. Specifically, write a function chi_bim_active(μ, tmax, n = 1000) that estimates $\mathbb{E}(X_t)$ for $t$ up to tmax. Plot the result. What do you observe? Can you explain it?

Answer:

Question 5 (Optional)

Write a function chi_bim_conditional(μ, imax, n = 1000) that evaluates the mean value of $G_i$ for $i$ up to imax, given that the run has lead to extinction. Discuss the results.

Answer:

3. Extinction probability

We now focus on the extinction probability $P_{ext}$, that is, the probability that the total population is finite.

 Question 1

What do the results of Exercice 2 suggest about the extinction probability?

Answer:

Question 2

Complete the function pext_bim_trials(range_μ, tmax, n) that estimates the probability that the population extincts after no more than tmax termination steps by averaging the results of active_growth over n independent runs, for several values of $\mu$.

Plot the results and comment them. Suggested values: range_μ = (np.)linspace(2, 0, 40, endpoint = False), tmax = 10, 100, 1000, n = 1000.

 Code:


In [ ]:
def pext_bim_trials(range_μ, tmax, n):
    # to be completed

Discussion:

Question 3

Give an equality that relates the extinction probability $P_{ext}$ and the distribution $(p_k)_{k\in \mathbb{N}}$ of the number of children per node. In the sequel, we will admit that $P_{ext}$ is the smallest solution of this equation in the interval $[0,1]$.

Answer:

Question 4

In the special case of the bimodal distribution, use this equation to write $P_{ext}$ as a function of $\mu$. Write a (very) small function pext_bim_exact(range_μ) that computes $P_{ext}$ for a list of values of $\mu$ and display the results against the empirical values obtained in Question 3.2.

Answer:

Question 5 (Optional)

Evaluating the results by simulation has an inherent lack of accuracy. Write a function pext_bim_distrib(range_μ, tmax) that computes exactly the probability of an extinction after no more than tmax terminations. Display the results and compare them with those obtained by simulation.

Hint: write a function pop_after_t(p, tmax) that computes the distribution of the number of active nodes after tmax termination steps as a function of $ p = (p_k)_{k \in \mathbb{N}}$. The function convolve from numpy package may be handy.

Answer:

4. Other Distributions

We again focus on the probability $P_{ext}$ of an extinction. For the bimodal distribution, we observed the following phase transition:

  • If $\mu < 1$, then $P_{ext} = 1$.
  • If $\mu > 1$, then $P_{ext} < 1$.

This result is valid for any distribution of the number of children per node, as long as its mean $\mu$ is finite. The proof will be given orally during the practical. You can also find it in Epidemics and Rumours in Complex Networks, along with more details (in particular, what happens when $\mu = 1$). We now consider two other distributions of the number of children per node and look at the extinction probability when $\mu > 1$.

Question 1

We now consider a geometric distribution $p_k=(1-a)a^k$, $k \in \mathbb{N}$, for some $0\leq a<1$. Relate $a$ and $\mu$ and study the extinction probability as you did for the bimodal case. In particular,

  • Give the equation that $ P_{ext} $ should verify and write $ P_{ext} $ as a function of $ \mu $, for $ \mu\in [0, 2] $.
  • Validate the results experimentally. You should, for instance:
    • Run multiple simulations using a geometric generator. The function random.geometric from numpy package may be handy.
    • Adapt the function pop_after_t of Question 3.5 to compute the probability of an extinction after tmax = 1000 termination steps for a truncated geometric distribution.

Answer:

Question 2

We finally consider a Poisson distribution of parameter $\mu$: $$ p_k = e^{-\mu}\frac{\mu^k}{k!}, \quad \forall k \in \mathbb{N}. $$ Study the extinction probability. In particular,

  • Give the equation that $ P_{ext} $ should verify and compute $ P_{ext} $ as a function of $ \mu $ for $ \mu\in [0, 2] $. When there is no explicit expression of $P_{ext}$ as a function of $\mu$, you can use the fixed-point equation to compute the solution iteratively.
  • Validate the results experimentally. You should, for instance:
    • Run multiple simulations using a Poisson generator. The function random.poisson from numpy package may be handy.
    • Adapt pop_after_t of Question 3.5 to compute the probability of an extinction after t = 1000 termination steps for a truncated Poisson distribution.

Answer:

Question 3

Plot the three theoretical $ P_{ext} $ you obtained on the same figure, as a function of mean number $\mu$ of children per node. Discuss the differences (informally).

Answer:

5. Generalization (Optional)

Redo the exercices of this Notebook for a Galton-Watson process that starts with two nodes (you can re-use results from above).