This is a collection of elementary exercises that introduces you to the most fundamental concepts of population genetics. We use Python to explore these topics and solve problems.
The exercises have been chosen for a one day workshop on modeling with 2.5 hours exercises preceded by approx. 3 hours of lectures (a primer on population genetics and probability theory). Evidently, it is not possible to cover a lot of material in this time; but upon finishing this workshop, you should feel comfortable picking up a textbook on population genetics and exploring the many software packages that are available for population genetics.
Note: You can skip the exercises marked by an asterisk and tackle them if time permits.
All exercises can in principle be solved by only using the Python standard library and a plotting library. However, if you like and it feels more comfortable to you, you can use as well the libraries numpy and pandas. Note, that you have a link to the documentation of Python and standard scientific libraries in the "Help" menu of the Jupyter/IPython notebook.
IPython has so-called magic commands (starting with %) to facilitate certain tasks. In our case, we want to import libraries for efficient handling of numeric data (numpy) and for plotting data (matplotlib). Evaluate the following two commands by pressing shift+enter in the cell; they import the necessary libraries and enable inline display of figures (it make take a few seconds).
In [1]:
import random
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom
%matplotlib inline
Let us define two vector variables (a regular sequence and a random one) and print them.
In [2]:
x, y = np.arange(10), np.random.rand(10)
#print(x, y, sep='\n')
The following command plots $y$ as a function of $x$ and labels the axes using $\LaTeX$.
In [3]:
plt.plot(x, y, linestyle='--', color='r', linewidth=2)
plt.xlabel('time, $t$')
plt.ylabel('frequency, $f$')
Out[3]:
From the tutorial: "matplotlib.pyplot is a collection of command style functions that make matplotlib work like MATLAB."
Comment: The tutorial is a good starting point to learn about the most basic functionalities of matplotlib, especially if you are familiar with MATLAB. Matplotlib is a powerful library but sometimes too complicated for making statistical plots à la R. However, there are other libraries that, in part, are built on matplotlib and provide more convenient functionality for statistical use cases, especially in conjunction with the data structures that the library pandas provides (see pandas, seaborn, ggplot and many more).
These exercises should make you comfortable with the fundamental notions of population genetics: allele and genotype frequencies, homo- and heterozygosity, and inbreeding.
We will use data from a classical paper on enzyme polymorphisms at the alkaline phosphatase (ALP) locus in humans (Harris 1966). In this case, the alleles have been defined in terms of protein properties. Harris could electrophoretically distinguish three proteins by their migration speed and called them S (slow), F (fast), and I (intermediate).
We use a Python dictionary to store the observed numbers of genotypes at the ALP locus in a sample from the English people.
In [4]:
alp_genotype = {'obs_number':
{'SS': 141, 'SF': 111, 'FF': 28, 'SI': 32, 'FI': 15, 'II': 5}
}
In [5]:
number = alp_genotype['obs_number']
print(number)
In [6]:
total_number = sum(list(number.values()))
print(total_number)
In [7]:
frequency = dict()
for genotype in number:
frequency[genotype] = number[genotype]/total_number
print(frequency)
In [8]:
alp_genotype['obs_frequency'] = frequency
print(alp_genotype)
In [9]:
alp_allele = dict()
for allele in 'SIF':
allele_freq = 0
for genotype, frequency in alp_genotype['obs_frequency'].items():
allele_freq += genotype.count(allele)/2*frequency
alp_allele[allele] = allele_freq
In [10]:
alp_allele
Out[10]:
In [11]:
sum(list(alp_allele.values()))
Out[11]:
In [12]:
alp_exp_frequency = dict()
# s**2 + i**2 + f**2 + 2*s*i + 2*s*f + 2*i*f
for genotype in alp_genotype['obs_number']:
alleles = [allele for allele in genotype]
exp_frequency = alp_allele[alleles[0]]*alp_allele[alleles[1]]
if alleles[0] != alleles[1]:
exp_frequency = 2*exp_frequency
alp_exp_frequency[genotype] = exp_frequency
alp_genotype['exp_frequency'] = alp_exp_frequency
In [13]:
alp_genotype
Out[13]:
The inbreeding coefficient is defined as $$F = 1 - \frac{h_{\mathrm{obs}}}{h_{\mathrm{exp}}},$$ where $h$ denotes the (observed and expected) frequency of heterozygotes. Can you interpret the result in simple terms?
In [14]:
obs_het, exp_het = list(), list()
for gt in alp_genotype['obs_number']:
if len(set(gt)) == 2:
obs_het.append(alp_genotype['obs_frequency'][gt])
exp_het.append(alp_genotype['exp_frequency'][gt])
inbreeding_coef = 1 - sum(obs_het)/sum(exp_het)
print(inbreeding_coef)
Not all gametes that are produced by an organism pass over to the next generation. Due to numerous possible influences there is only a finite sample that contributes to the next generation. Therefore, not all alleles of a gene are guaranteed to appear in the next generation in proportions equal to those in the present generation. As long as we cannot specify a process that leads to a specific selection of alleles and we have no reason to believe that the allele itself has a bearing upon its selection, sampling is an undirected (i.e. random) cause of allele frequency changes in a population. We call such an undirected carry-over of genes genetic drift.
Genetic drift does not introduce any new assumptions compared to the Hardy-Weinberg case; it just drops the assumption of infinite population size.
The function must at least take the following arguments:
The function should return a list (or an array if you like) that represents the trajectory of the allele over the generations.
In [15]:
def drift(initial_frequency=0.5, generations=100, population_size=20):
allele_sample_size = 2*population_size
allele_trajectory = [initial_frequency]
for generation in range(generations):
allele_count = 0
# alternative: allele_count = np.random.binomial(allele_sample_size, initial_frequency)
for trial in range(allele_sample_size):
if random.random() < initial_frequency:
allele_count += 1
new_allele_frequency = allele_count/allele_sample_size
allele_trajectory.append(new_allele_frequency)
initial_frequency = new_allele_frequency
return allele_trajectory
In [17]:
replicates = 5
for i in range(replicates):
fA = drift(initial_frequency=0.5, generations=10, population_size=10)
plt.plot(fA, linewidth=2, alpha=0.5)
There is another way to look at the dynamics of a locus under genetic drift. If we have a large collection of replicate populations, we can take, at each time point, the allele frequencies of all these populations and plot a histogram. Thus, instead of looking at individual trajectories, we can observe how the distribution of this allele changes due to genetic drift across all replicate populations. This viewpoint of looking at a time-dependent probability density, is central for understanding the diffusion approximation to genetic drift (Kimura 1955).
Write a function that takes the output of the Wright-Fisher model, a list of generation times and plots a series of histograms of allele frequencies. What can you observe?
In [18]:
def drift_replicates(replicates=10, generations=100, initial_frequency=0.5, population_size=20):
trajectory = np.zeros((replicates, generations + 1))
for rep in range(replicates):
trajectory[rep,] = drift(generations=generations,
initial_frequency=initial_frequency,
population_size=population_size)
return trajectory
In [19]:
n = drift_replicates(replicates=20, population_size=10)
In [20]:
plt.matshow(n)
plt.colorbar()
Out[20]:
In [21]:
slices = range(0, 100, 20)
for generation in slices:
plt.figure()
plt.hist(n[:, generation], color='red', alpha=.5)
The temporal evolution of the probability distribution is actually governed by a deterministic equation, the Markov chain. To simulate it, we have to only know the transition probabilities and the initial frequencies of all possible states. Since we are looking at a population, the possible states are given by the number of reference alleles $A$; for a population of $2N$ alleles, we have the states $X(A)=0, 1, 2, \ldots, 2N$. The transition probability from $X(A)=i$ to $X(A)=j$ is given by the binomial distribution:
$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^j \left(1-\frac{i}{2N}\right)^{2N-j}$$
In [22]:
def tmatrix(N=10):
twoN = 2*N
states = twoN + 1
t = np.zeros((states, states))
j = range(0, states)
for i in j:
p = i/twoN
t[i,] = binom.pmf(j, twoN, p)
return t
In [23]:
N = 4
T = tmatrix(N)
p0 = np.zeros(2*N + 1)
p0[N] = 1.
In [26]:
plt.plot(np.dot(p0, np.linalg.matrix_power(T, 10)))
Out[26]:
You have seen yourself that genetic drift removes variation from the population. Since we can observe standing variation, it is evident that genetic drift cannot be the only evolutionary force. There must be something that causes variation. To a certain extent, new variants can arise in a population due to migration, that is an influx of new alleles. However, the ultimate cause of allelic variation is mutation.
To study the interplay of drift and mutation, we will focus on the decay of heterozygosity or the dynamics of the inbreeding coefficient. With mutation, inbreeding changes according to the formula
$$ F_t = \left[ \frac{1}{2N} + \left( 1 - \frac{1}{2N} \right) F_{t-1} \right] \left( 1 - u \right)^2 $$where $(1-u)^2$ is the probability that no mutation occured in either of the two alleles and $u$ is the mutation probability (also called mutation rate).
In [32]:
gen, N, u = 1000, 100, 0.001
rate = 1/(2*N)
prob = (1 - u)**2
F = [0]
for i in range(gen):
F.append((rate + (1 - rate)*F[-1])*prob)
plt.plot(F, linewidth=2)
plt.axhline(1/(1+4*N*u), color='k', linestyle='--', linewidth=2)
plt.axis([0, gen, 0, 1])
Out[32]: