Tom Ellis, August 2017
Before committing to the time and cost of genotyping samples for a paternity study, it is always sensible to run simulations to test the likely statistical power of your data set. This can help with important questions regarding study design, such as finding an appropriate balance between the number of families vs. offspring per family, or identifying a minimum number of loci to type. Simulated data can also be useful in verifying the results of an analysis.
FAPS provides tools to run such simulations. In this notebook we look at simulating populations of parents and their offspring, adding genotyping errors, checking the accuracy of sibship clustering against the known pedigree, and automating these steps with make_power.
It is worth noting that I relied on loops for a lot of these tools, for the purely selfish reason that it was easy to code. Loops are of course slow, so if you work with these tools a lot there is ample scope for speeding things up (see especially the functions make_offspring, make_sibships and make_power).
Simulations are built using genotypeArrays. See the section on these here for more information.
make_parents generates a population of reproductive adults from population allele frequencies. This example creates ten individuals. Note that this population will be in Hardy-Weinberg equilibrium, but yours may not.
In [11]:
import numpy as np
import faps as fp
import matplotlib.pylab as plt
import pandas as pd
from time import time, localtime, asctime
np.random.seed(37)
allele_freqs = np.random.uniform(0.2, 0.5, 50)
adults = fp.make_parents(10, allele_freqs, family_name='adult')
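You can check what was created by inspecting attributes of the new genotypeArray; for example, names and size (both used again later in this notebook) hold the individual labels and the number of individuals.
print(adults.names) # individual labels, e.g. 'adult_0', 'adult_1', ...
print(adults.size)  # number of individuals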
There are multiple ways to mate adults to generate offspring. If you supply a set of adults and an integer number of offspring, make_offspring mates adults at random.
In [12]:
family1 = fp.make_offspring(parents = adults, noffs=5)
family1.parents
Out[12]:
You can also supply an explicit list of dams and sires, in which case the adults are paired in the order they appear in each list.
In [13]:
family2 = fp.make_offspring(parents = adults, dam_list=[7,1,8,8,0], sire_list=[2,6,3,0,7])
family2.parents
Out[13]:
Usually we really want to simulate half-sib arrays. This can be done using make_sibships, which mates a single mother to a set of males.
In [14]:
family3 = fp.make_sibships(parents=adults, dam=0, sires=[1,2,3,4], family_size=5)
family3.parents
Out[14]:
For uneven sibship sizes, give a list of sizes for each family of the same length as sires.
In [15]:
family4 = fp.make_sibships(parents=adults, dam=0, sires=[1,2,3,4], family_size=[5,4,3,2])
family4.parents
Out[15]:
Real data almost always contain errors. For SNP data, these take the form of genotyping errors, where one allele is mistyped as another, and dropouts, where a genotype fails and is recorded as missing data.
These are straightforward to include in simulated data. First generate some clean data again, and mate the parents.
In [16]:
np.random.seed(85)
allele_freqs = np.random.uniform(0.2, 0.5, 50)
adults = fp.make_parents(10, allele_freqs, family_name='adult')
progeny = fp.make_sibships(parents=adults, dam=0, sires=[1,2,3,4], family_size=5)
It is best to create the progeny before adding errors. Set the error rates and add errors at random.
In [17]:
d, mu = 0.01, 0.0015 # values for dropout rate and genotyping error rate.
# add genotyping errors
adults_mu = adults.mutations(mu)
progeny_mu = progeny.mutations(mu)
# add dropouts (to the mutated data)
adults_mu = adults_mu.dropouts(d)
progeny_mu = progeny_mu.dropouts(d)
mutations and dropouts make copies of the genotypeArray, so the original data remain unchanged. For example:
In [18]:
print(adults.missing_data().mean())
print(adults_mu.missing_data().mean())
Now create a new set of parents and offspring, then generate a paternityArray and cluster the offspring into sibships as usual.
In [19]:
np.random.seed(85)
allele_freqs = np.random.uniform(0.4, 0.5, 50)
adults = fp.make_parents(10, allele_freqs, family_name='adult')
progeny = fp.make_sibships(parents=adults, dam=0, sires=[1,2,3,4], family_size=5)
mothers = adults.subset(progeny.mothers)
patlik = fp.paternity_array(progeny, mothers, adults, mu=0.0015, missing_parents=0.01)
sc = fp.sibship_clustering(patlik)
A very useful tool is the accuracy subfunction for sibshipCluster objects. When the paternity and sibship structure are known (seldom the case in real life, but true for simulated data), this returns an array of handy information about the analysis:
In [20]:
sc.accuracy(progeny, adults)
Out[20]:
In this example, accuracy is high, but the probability of a missing sire is NaN because all the sires are present, and this number is calculated only for offspring whose sire was absent.
We can adjust the paternityArray to see how much this affects the results. For example, if we remove the sire of the first family (i.e. the male indexed by 1), there is a drop in the accuracy for full-sibling relationships, although half-sibling relationships are unaffected.
In [21]:
patlik.purge = 'adult_1'
patlik.missing_parents=0.25
sc = fp.sibship_clustering(patlik)
sc.accuracy(progeny, adults)
Out[21]:
In contrast, imagine we had an idea that selfing was strong. How would this affect things?
In [22]:
patlik.selfing_rate=0.5
sc = fp.sibship_clustering(patlik)
sc.accuracy(progeny, adults)
Out[22]:
The results are identical to the unmodified case; FAPS has identified the correct partition structure in spite of the (incorrect) strong prior for high selfing.
It can be tedious to put together your own simulation for every analysis.
FAPS has an automated function, make_power, that repeatedly creates genotype data, clusters the offspring into sibships and calls the accuracy function. You can supply lists of variables and it will evaluate each combination. There are a lot of possible inputs, so have a look at the help page by running fp.make_power?.
For example, this code creates four families of five full siblings with a genotyping error rate of 0.0015. It considers 30, 40 and 50 loci for 100, 250 or 500 candidate fathers. Each parameter combination is replicated 10 times. In reality you would want to do more than this; I have found that results tend to asymptote with 300 simulations.
In [25]:
# Common simulation parameters
r = 10 # number of replicates
nloci = [30,40,50] # number of loci
allele_freqs = [0.25, 0.5] # draw allele frequencies
nadults = [100,250,500] # size of the adults population
mu = 0.0015 #genotype error rates
sires = 4
offspring = 5
np.random.seed(614)
eventab = fp.make_power(
    replicates = r,
    nloci = nloci,
    allele_freqs = allele_freqs,
    candidates = nadults,
    sires = sires,
    offspring = offspring,
    missing_loci = 0,
    mu_real = mu,
    unsampled_input = 0.01
)
For convenience, make_power provides a summary of the input parameters. This can be turned off by setting verbose to False. Similarly, the progress bar can be removed by setting progress to False. This bar uses IPython widgets, and probably won't work outside of IPython, so it may be necessary to turn it off.
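As a minimal sketch (reusing the parameters defined above; eventab_quiet is just an illustrative name), a quiet run might look like this:
eventab_quiet = fp.make_power(
    replicates = r,
    nloci = nloci,
    allele_freqs = allele_freqs,
    candidates = nadults,
    sires = sires,
    offspring = offspring,
    missing_loci = 0,
    mu_real = mu,
    unsampled_input = 0.01,
    verbose = False,   # suppress the summary of input parameters
    progress = False   # suppress the IPython-widget progress bar
)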
The results of make_power are basically the output from the accuracy function we saw before, but include information on simulation parameters, and the time taken to create the paternityArray and sibshipCluster objects. View them by inspecting eventab.
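For instance (a hedged sketch: this assumes eventab behaves like a pandas DataFrame, which may depend on your FAPS version, and that columns are named after the simulation parameters), you could peek at the first rows and summarise by parameter combination:
eventab.head()                                # assumes a pandas DataFrame
eventab.groupby(['nloci', 'nadults']).mean()  # column names here are assumptions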
Arguments to set up the population work much like those to create genotypeArrays, and are quite flexible. Have a look at the help file (run make_power? in Python) for more. You can also take a look at the simulations in support of the main FAPS paper, which considered a range of contrasting demographic scenarios; the example above is adapted from there.
Error rates and missing candidates are important topics to get a handle on.
We can estimate these parameters (e.g. by genotyping some individuals twice and counting how many loci are different), but we can never be completely sure how close to reality we are.
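As a rough illustration of that idea (a hypothetical sketch using plain NumPy arrays rather than FAPS objects; run1 and run2 are invented names), duplicate genotyping runs can be compared directly:
# Hypothetical sketch: estimate a genotyping error rate from duplicate runs.
# run1 and run2 hold genotype calls (0, 1 or 2) for the same 20 individuals
# at 50 loci typed twice; discordant calls give a crude error-rate estimate.
run1 = np.random.binomial(2, 0.3, size=(20, 50))
run2 = run1.copy()
resample = np.random.rand(20, 50) < 0.002            # sites re-drawn in the second run
run2[resample] = np.random.binomial(2, 0.3, resample.sum())
error_rate_estimate = (run1 != run2).mean()
print(error_rate_estimate)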
With that uncertainty in mind, make_power allows you to simulate data with 'true' values of mu and the proportion of missing sires, but run the analysis with different input values. The idea is to estimate how wrong you could be before the analysis fails.
For example, this code would simulate the case where you thought that the error rate was 0.0015, and 5% of the candidates went unsampled, but in reality both parameters were double that amount.
In [26]:
fp.make_power(r, nloci, allele_freqs, nadults, sires, offspring, 0,
              mu_input = 0.003,
              mu_real = 0.0015,
              unsampled_real = 0.1,
              unsampled_input = 0.05);
If you want to perform downstream analysis, you can tell make_power to also export each paternityArray and/or sibshipCluster object. This is done by setting return_paternities and return_clusters to True. For example, this code pulls out the distribution of family sizes from each sibshipCluster object, and plots it.
In [28]:
eventab, evenclusters = fp.make_power(
    replicates = r,
    nloci = nloci,
    allele_freqs = allele_freqs,
    candidates = nadults,
    sires = sires,
    offspring = offspring,
    missing_loci = 0,
    mu_real = mu,
    unsampled_input = 0.01,
    return_clusters = True,
    verbose = False
)
even_famsizes = np.array([evenclusters[i].family_size() for i in range(len(evenclusters))])
plt.plot(even_famsizes.mean(0))
plt.show()
Once you are familiar with the basic building blocks for generating data and running analyses, creating your own simulations is largely a case of setting up combinations of parameters and looping over them. Given the vast array of possible scenarios you could want to simulate, it is impossible to be comprehensive here, so it must suffice to give a couple of examples for inspiration.
In this example I was interested in the performance of the likelihood estimator for a sire being absent. This is the likelihood of generating the offspring genotype if paternal alleles come from population allele frequencies, and is what the attribute lik_absent in a paternityArray tells you.
Ideally this likelihood should be below the likelihood of paternity for the true sire, but higher than that of the other candidates. I suspected this would not be the case when minor allele frequency is low and there are many candidates.
This cell sets up the simulation. I'm considering 50 loci, and mu=0.0015, but varying sample size and allele frequency.
In [31]:
# Common simulation parameters
nreps = 10 # number of replicates
nloci = [50] # number of loci
allele_freqs = [0.1, 0.2, 0.3, 0.4, 0.5] # draw allele frequencies
nadults = [10, 100, 250, 500, 750, 1000] # size of the adults population
mu_list = [0.0015] #genotype error rates
nsims = nreps * len(nloci) * len(allele_freqs) * len(nadults) * len(mu_list) # total number of simulations to run
dt = np.zeros([nsims, 7]) # empty array to store data
This cell simulates genotype data and clusters the offspring into full sibships. The code pulls out the mean probability that each sire is absent, and the rank of the likelihood for a missing sire among the likelihoods of paternity for the candidates.
In [46]:
t0 = time()
counter = 0
print("Beginning simulations on {}.".format(asctime(localtime(time()))))
for r in range(nreps):
    for l in range(len(nloci)):
        for a in range(len(allele_freqs)):
            for n in range(len(nadults)):
                for m in range(len(mu_list)):
                    af = np.repeat(allele_freqs[a], nloci[l])
                    adults = fp.make_parents(nadults[n], af)
                    progeny = fp.make_offspring(adults, 100)
                    mi = progeny.parent_index('m', adults.names) # maternal index
                    mothers = adults.subset(mi)
                    patlik = fp.paternity_array(progeny, mothers, adults, mu_list[m], missing_parents=0.01)
                    # Find the rank of the missing-sire term within the array.
                    rank = [np.where(np.sort(patlik.prob_array()[i]) == patlik.prob_array()[i,-1])[0][0] for i in range(progeny.size)]
                    rank = np.array(rank).mean() / nadults[n]
                    # Get the mean posterior probability for the missing-sire term.
                    prob_missing = np.exp(patlik.prob_array()[:, -1]).mean()
                    # Export data
                    dt[counter] = np.array([r, nloci[l], allele_freqs[a], nadults[n], mu_list[m], rank, prob_missing])
                    # Update counter
                    counter += 1
print("Completed in {} hours.".format(round((time() - t0)/3600, 2)))
head = ['rep', 'nloci', 'allele_freqs', 'nadults', 'mu', 'rank', 'prob_missing']
dt = pd.DataFrame(dt, columns=head)
There is a strong dependency on minor allele frequency. As the MAF increases from 0.1 to 0.5, the effectiveness of identifying a missing sire using this likelihood estimator goes from 'basically useless' to 'useful'.
In [47]:
dt.groupby('allele_freqs').mean()
Out[47]:
In contrast, there is no effect of the number of adults.
In [48]:
dt.groupby('nadults').mean()
Out[48]:
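To visualise the dependency on allele frequency (a minimal sketch using the pandas and matplotlib objects already imported; column names are those defined in head above), you could plot the mean rank of the missing-sire term against MAF:
# Sketch: mean rank of the missing-sire term as a function of allele frequency.
fig, ax = plt.subplots()
dt.groupby('allele_freqs')['rank'].mean().plot(ax=ax, marker='o')
ax.set_xlabel('Minor allele frequency')
ax.set_ylabel('Mean rank of missing-sire term (proportion of candidates)')
plt.show()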