Population Genetics

Önder Kartal, University of Zurich

This is a collection of elementary exercises that introduces you to the most fundamental concepts of population genetics. We use Python to explore these topics and solve problems.

The exercises have been chosen for a one day workshop on modeling with 2.5 hours exercises preceded by approx. 3 hours of lectures (a primer on population genetics and probability theory). Evidently, it is not possible to cover a lot of material in this time; but upon finishing this workshop, you should feel comfortable picking up a textbook on population genetics and exploring the many software packages that are available for population genetics.

Note: You can skip the exercises marked by an asterisk and tackle them if time permits.

Preliminaries

All exercises can in principle be solved by only using the Python standard library and a plotting library. However, if you like and it feels more comfortable to you, you can use as well the libraries numpy and pandas. Note, that you have a link to the documentation of Python and standard scientific libraries in the "Help" menu of the Jupyter/IPython notebook.

IPython has so-called magic commands (starting with %) to facilitate certain tasks. In our case, we want to import libraries for efficient handling of numeric data (numpy) and for plotting data (matplotlib). Evaluate the following two commands by pressing shift+enter in the cell; they import the necessary libraries and enable inline display of figures (it make take a few seconds).


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Let us define two vector variables (a regular sequence and a random one) and print them.


In [2]:
x, y = np.arange(10), np.random.rand(10)
print(x, y, sep='\n')


[0 1 2 3 4 5 6 7 8 9]
[ 0.43191072  0.17692517  0.42616389  0.73142861  0.52226624  0.54028909
  0.19863538  0.08375168  0.03776252  0.73497815]

The following command plots $y$ as a function of $x$ and labels the axes using $\LaTeX$.


In [3]:
plt.plot(x, y, linestyle='--', color='r', linewidth=2)
plt.xlabel('time, $t$')
plt.ylabel('frequency, $f$')


Out[3]:
<matplotlib.text.Text at 0xad64d92c>

From the tutorial: "matplotlib.pyplot is a collection of command style functions that make matplotlib work like MATLAB."

Comment: The tutorial is a good starting point to learn about the most basic functionalities of matplotlib, especially if you are familiar with MATLAB. Matplotlib is a powerful library but sometimes too complicated for making statistical plots à la R. However, there are other libraries that, in part, are built on matplotlib and provide more convenient functionality for statistical use cases, especially in conjunction with the data structures that the library pandas provides (see pandas, seaborn, ggplot and many more).

Hardy-Weinberg Equilibrium

These exercises should make you comfortable with the fundamental notions of population genetics: allele and genotype frequencies, homo- and heterozygosity, and inbreeding.

We will use data from a classical paper on enzyme polymorphisms at the alkaline phosphatase (ALP) locus in humans (Harris 1966). In this case, the alleles have been defined in terms of protein properties. Harris could electrophoretically distinguish three proteins by their migration speed and called them S (slow), F (fast), and I (intermediate).

We use a Python dictionary to store the observed numbers of genotypes at the ALP locus in a sample from the English people.


In [4]:
alp_genotype = {'obs_number':
                {'SS': 141, 'SF': 111, 'FF': 28, 'SI': 32, 'FI': 15, 'II': 5}
               }

1. Calculate the observed genotype frequencies at the ALP locus.


In [ ]:

2. Calculate the observed allele frequencies at the ALP locus.


In [ ]:

3. Calculate the expected genotype frequencies if the ALP locus were in Hardy-Weinberg equilibrium.


In [ ]:

[$\ast$] 4. Calculate the estimate of the inbreeding coefficient $F$ for the ALP locus.

The inbreeding coefficient is defined as $$F = 1 - \frac{h_{\mathrm{obs}}}{h_{\mathrm{exp}}},$$ where $h$ denotes the (observed and expected) frequency of heterozygotes. Can you interpret the result in simple terms?


In [ ]:

Genetic Drift

Not all gametes that are produced by an organism pass over to the next generation. Due to numerous possible influences there is only a finite sample that contributes to the next generation. Therefore, not all alleles of a gene are guaranteed to appear in the next generation in proportions equal to those in the present generation. As long as we cannot specify a process that leads to a specific selection of alleles and we have no reason to believe that the allele itself has a bearing upon its selection, sampling is an undirected (i.e. random) cause of allele frequency changes in a population. We call such an undirected carry-over of genes genetic drift.

Genetic drift does not introduce any new assumptions compared to the Hardy-Weinberg case; it just drops the assumption of infinite population size.

5. Write a function that runs the Wright-Fisher model of genetic drift.

The function must at least take the following arguments:

  • number of generations
  • size of the population (i.e. number of diploid individuals)
  • initial allele frequency (we have only two alleles, so considering a single allele is enough)

The function should return a list (or an array if you like) that represents the trajectory of the allele over the generations.


In [ ]:

6. Plot several trajectories (i.e. replicate populations) of the Wright-Fisher model and study genetic drift with different parameter values.

  • What is the long-term behaviour of the locus?
  • What is the effect of small/large population sizes on the trajectories?
  • Do the trajectories of the replicate populations differ?
  • Do rare alleles become extinct more often than abundant alleles?

In [ ]:

[$\ast$] 7. Plot the distribution of allele frequency under genetic drift.

There is another way to look at the dynamics of a locus under genetic drift. If we have a large collection of replicate populations, we can take, at each time point, the allele frequencies of all these populations and plot a histogram. Thus, instead of looking at individual trajectories, we can observe how the distribution of this allele changes due to genetic drift across all replicate populations. This viewpoint of looking at a time-dependent probability density, is central for understanding the diffusion approximation to genetic drift (Kimura 1955).

Write a function that takes the output of the Wright-Fisher model, a list of generation times and plots a series of histograms of allele frequencies. What can you observe?


In [ ]:

8. Model genetic drift as a Markov chain.

The temporal evolution of the probability distribution is actually governed by a deterministic equation, the Markov chain. To simulate it, we have to only know the transition probabilities and the initial frequencies of all possible states. Since we are looking at a population, the possible states are given by the number of reference alleles $A$; for a population of $2N$ alleles, we have the states $X(A)=0, 1, 2, \ldots, 2N$. The transition probability from $X(A)=i$ to $X(A)=j$ is given by the binomial distribution:

$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^j \left(1-\frac{i}{2N}\right)^{2N-j}$$
  • Write a function that gives the transition matrix for a given population size $N$. Tip: For an efficient implementation, you can use the binom() function from scipy.stats. How can you test if your matrix is consistent?
  • For $N=4$, calculate the probability that a population with 4 copies of allele A transitions into a state with 3, 4, 5 copies. Why should these values be symmetric around 4 copy numbers?
  • Use the function matrix_power from numpy.linalg to compute the distribution for 19 generations with the parameters $N=16$ and initial population frequency of the reference allele of $\frac{1}{2}$. The state probability vector after $t$ transitions is given by $$p(t) = p(0)T^t$$

In [ ]:

Mutation

You have seen yourself that genetic drift removes variation from the population. Since we can observe standing variation, it is evident that genetic drift cannot be the only evolutionary force. There must be something that causes variation. To a certain extent, new variants can arise in a population due to migration, that is an influx of new alleles. However, the ultimate cause of allelic variation is mutation.

To study the interplay of drift and mutation, we will focus on the decay of heterozygosity or the dynamics of the inbreeding coefficient. With mutation, inbreeding changes according to the formula

$$ F_t = \left[ \frac{1}{2N} + \left( 1 - \frac{1}{2N} \right) F_{t-1} \right] \left( 1 - u \right)^2 $$

where $(1-u)^2$ is the probability that no mutation occured in either of the two alleles and $u$ is the mutation probability (also called mutation rate).

9. Simulate the dynamics of the inbreeding coefficient with and without mutation and observe the stationary state. Pick a population size not too small. Play with the number of generations.


In [ ]:

References

Gillespie (2004) Population Genetics: A Concise Guide The Johns Hopkins University Press

Hartl & Clark (2007) Principles of Population Genetics Sinauer Associates, Inc.