A Wright-Fisher model has a fixed population size N and discrete non-overlapping generations. Each generation, each individual has a random number of offspring whose mean is proportional to the individual's fitness. Each generation, mutation may occur. Mutations may increase or decrease individual's fitness, which affects the chances of that individual's offspring in subsequent generations.
Here, I'm using a fitness model where some proportion of the time a mutation will have a fixed fitness effect, increasing or decreasing fitness by a fixed amount.
In [1]:
import numpy as np
import itertools
In [2]:
pop_size = 100
In [3]:
seq_length = 10
In [4]:
alphabet = ['A', 'T']
In [5]:
base_haplotype = "AAAAAAAAAA"
In [6]:
fitness_effect = 1.1 # fitness effect if a functional mutation occurs
In [7]:
fitness_chance = 0.1 # chance that a mutation has a fitness effect
Store this as a lightweight Dictionary that maps a string to a count. All the sequences together will have count N.
In [8]:
pop = {}
In [9]:
pop["AAAAAAAAAA"] = 40
In [10]:
pop["AAATAAAAAA"] = 30
In [11]:
pop["AATTTAAAAA"] = 30
Map haplotype string to fitness float.
In [12]:
fitness = {}
In [13]:
fitness["AAAAAAAAAA"] = 1.0
In [14]:
fitness["AAATAAAAAA"] = 1.05
In [15]:
fitness["AATTTAAAAA"] = 1.10
In [16]:
pop["AAATAAAAAA"]
Out[16]:
In [17]:
fitness["AAATAAAAAA"]
Out[17]:
In [18]:
mutation_rate = 0.005 # per gen per individual per site
In [19]:
def get_mutation_count():
mean = mutation_rate * pop_size * seq_length
return np.random.poisson(mean)
In [20]:
def get_random_haplotype():
haplotypes = pop.keys()
frequencies = [x/float(pop_size) for x in pop.values()]
total = sum(frequencies)
frequencies = [x / total for x in frequencies]
return np.random.choice(haplotypes, p=frequencies)
In [21]:
def get_mutant(haplotype):
site = np.random.randint(seq_length)
possible_mutations = list(alphabet)
possible_mutations.remove(haplotype[site])
mutation = np.random.choice(possible_mutations)
new_haplotype = haplotype[:site] + mutation + haplotype[site+1:]
return new_haplotype
Mutations have fitness effects
In [22]:
def get_fitness(haplotype):
old_fitness = fitness[haplotype]
if (np.random.random() < fitness_chance):
return old_fitness * fitness_effect
else:
return old_fitness
In [23]:
get_fitness("AAAAAAAAAA")
Out[23]:
If a mutation event creates a new haplotype, assign it a random fitness.
In [24]:
def mutation_event():
haplotype = get_random_haplotype()
if pop[haplotype] > 1:
pop[haplotype] -= 1
new_haplotype = get_mutant(haplotype)
if new_haplotype in pop:
pop[new_haplotype] += 1
else:
pop[new_haplotype] = 1
if new_haplotype not in fitness:
fitness[new_haplotype] = get_fitness(haplotype)
In [25]:
mutation_event()
In [26]:
pop
Out[26]:
In [27]:
fitness
Out[27]:
In [28]:
def mutation_step():
mutation_count = get_mutation_count()
for i in range(mutation_count):
mutation_event()
Fitness weights the multinomial draw.
In [29]:
def get_offspring_counts():
haplotypes = pop.keys()
frequencies = [pop[haplotype]/float(pop_size) for haplotype in haplotypes]
fitnesses = [fitness[haplotype] for haplotype in haplotypes]
weights = [x * y for x,y in zip(frequencies, fitnesses)]
total = sum(weights)
weights = [x / total for x in weights]
return list(np.random.multinomial(pop_size, weights))
In [30]:
get_offspring_counts()
Out[30]:
In [31]:
def offspring_step():
counts = get_offspring_counts()
for (haplotype, count) in zip(pop.keys(), counts):
if (count > 0):
pop[haplotype] = count
else:
del pop[haplotype]
In [32]:
def time_step():
mutation_step()
offspring_step()
In [33]:
generations = 5
In [34]:
def simulate():
for i in range(generations):
time_step()
We want to keep a record of past population frequencies to understand dynamics through time. At each step in the simulation, we append to a history object.
In [35]:
history = []
In [36]:
def simulate():
clone_pop = dict(pop)
history.append(clone_pop)
for i in range(generations):
time_step()
clone_pop = dict(pop)
history.append(clone_pop)
In [37]:
simulate()
In [38]:
def get_distance(seq_a, seq_b):
diffs = 0
length = len(seq_a)
assert len(seq_a) == len(seq_b)
for chr_a, chr_b in zip(seq_a, seq_b):
if chr_a != chr_b:
diffs += 1
return diffs / float(length)
In [39]:
def get_diversity(population):
haplotypes = population.keys()
haplotype_count = len(haplotypes)
diversity = 0
for i in range(haplotype_count):
for j in range(haplotype_count):
haplotype_a = haplotypes[i]
haplotype_b = haplotypes[j]
frequency_a = population[haplotype_a] / float(pop_size)
frequency_b = population[haplotype_b] / float(pop_size)
frequency_pair = frequency_a * frequency_b
diversity += frequency_pair * get_distance(haplotype_a, haplotype_b)
return diversity
In [40]:
def get_diversity_trajectory():
trajectory = [get_diversity(generation) for generation in history]
return trajectory
In [41]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
In [42]:
def diversity_plot():
mpl.rcParams['font.size']=14
trajectory = get_diversity_trajectory()
plt.plot(trajectory, "#447CCD")
plt.ylabel("diversity")
plt.xlabel("generation")
In [43]:
def get_divergence(population):
haplotypes = population.keys()
divergence = 0
for haplotype in haplotypes:
frequency = population[haplotype] / float(pop_size)
divergence += frequency * get_distance(base_haplotype, haplotype)
return divergence
In [44]:
def get_divergence_trajectory():
trajectory = [get_divergence(generation) for generation in history]
return trajectory
In [45]:
def divergence_plot():
mpl.rcParams['font.size']=14
trajectory = get_divergence_trajectory()
plt.plot(trajectory, "#447CCD")
plt.ylabel("divergence")
plt.xlabel("generation")
In [46]:
def get_frequency(haplotype, generation):
pop_at_generation = history[generation]
if haplotype in pop_at_generation:
return pop_at_generation[haplotype]/float(pop_size)
else:
return 0
In [47]:
def get_trajectory(haplotype):
trajectory = [get_frequency(haplotype, gen) for gen in range(generations)]
return trajectory
In [48]:
def get_all_haplotypes():
haplotypes = set()
for generation in history:
for haplotype in generation:
haplotypes.add(haplotype)
return haplotypes
In [49]:
colors = ["#781C86", "#571EA2", "#462EB9", "#3F47C9", "#3F63CF", "#447CCD", "#4C90C0", "#56A0AE", "#63AC9A", "#72B485", "#83BA70", "#96BD60", "#AABD52", "#BDBB48", "#CEB541", "#DCAB3C", "#E49938", "#E68133", "#E4632E", "#DF4327", "#DB2122"]
In [50]:
colors_lighter = ["#A567AF", "#8F69C1", "#8474D1", "#7F85DB", "#7F97DF", "#82A8DD", "#88B5D5", "#8FC0C9", "#97C8BC", "#A1CDAD", "#ACD1A0", "#B9D395", "#C6D38C", "#D3D285", "#DECE81", "#E8C77D", "#EDBB7A", "#EEAB77", "#ED9773", "#EA816F", "#E76B6B"]
In [51]:
def stacked_trajectory_plot(xlabel="generation"):
mpl.rcParams['font.size']=18
haplotypes = get_all_haplotypes()
trajectories = [get_trajectory(haplotype) for haplotype in haplotypes]
plt.stackplot(range(generations), trajectories, colors=colors_lighter)
plt.ylim(0, 1)
plt.ylabel("frequency")
plt.xlabel(xlabel)
In [52]:
def get_snp_frequency(site, generation):
minor_allele_frequency = 0.0
pop_at_generation = history[generation]
for haplotype in pop_at_generation.keys():
allele = haplotype[site]
frequency = pop_at_generation[haplotype] / float(pop_size)
if allele != "A":
minor_allele_frequency += frequency
return minor_allele_frequency
In [53]:
def get_snp_trajectory(site):
trajectory = [get_snp_frequency(site, gen) for gen in range(generations)]
return trajectory
Find all variable sites.
In [54]:
def get_all_snps():
snps = set()
for generation in history:
for haplotype in generation:
for site in range(seq_length):
if haplotype[site] != "A":
snps.add(site)
return snps
In [55]:
def snp_trajectory_plot(xlabel="generation"):
mpl.rcParams['font.size']=18
snps = get_all_snps()
trajectories = [get_snp_trajectory(snp) for snp in snps]
data = []
for trajectory, color in itertools.izip(trajectories, itertools.cycle(colors)):
data.append(range(generations))
data.append(trajectory)
data.append(color)
fig = plt.plot(*data)
plt.ylim(0, 1)
plt.ylabel("frequency")
plt.xlabel(xlabel)
Here, we scale up to more interesting parameter values.
In [56]:
pop_size = 50
seq_length = 100
generations = 500
mutation_rate = 0.0001 # per gen per individual per site
fitness_effect = 1.1 # fitness effect if a functional mutation occurs
fitness_chance = 0.1 # chance that a mutation has a fitness effect
In this case there are $\mu$ = 0.01 mutations entering the population every generation.
In [57]:
seq_length * mutation_rate
Out[57]:
And the population genetic parameter $\theta$, which equals $2N\mu$, is 1.
In [58]:
2 * pop_size * seq_length * mutation_rate
Out[58]:
In [59]:
base_haplotype = ''.join(["A" for i in range(seq_length)])
pop.clear()
fitness.clear()
del history[:]
pop[base_haplotype] = pop_size
fitness[base_haplotype] = 1.0
In [60]:
simulate()
In [61]:
plt.figure(num=None, figsize=(14, 14), dpi=80, facecolor='w', edgecolor='k')
plt.subplot2grid((3,2), (0,0), colspan=2)
stacked_trajectory_plot()
plt.subplot2grid((3,2), (1,0), colspan=2)
snp_trajectory_plot()
plt.subplot2grid((3,2), (2,0))
diversity_plot()
plt.subplot2grid((3,2), (2,1))
divergence_plot()