In [11]:
%pwd
Out[11]:
In [12]:
import dadi
I have not installed dadi globally on huluvu. Instead, I left it in my Downloads directory '/home/claudius/Downloads/dadi'. For Python to find the module, I need to add that directory to Python's module search path (sys.path).
In [13]:
import sys
sys.path # show Python's module search path
Out[13]:
In [14]:
sys.path.insert(0, '/home/claudius/Downloads/dadi') # add path to dadi at beginning of list
sys.path
Out[14]:
In [15]:
import dadi, numpy
I am going to analyse an example data set that comes with dadi.
In [16]:
% ll /home/claudius/Downloads/dadi/examples/YRI_CEU/
The Python script YRI_CEU.py contains example code for the analysis of these two human population samples.
In [17]:
! head /home/claudius/Downloads/dadi/examples/YRI_CEU/demographic_models.py
The file demographic_models.py contains function definitions for custom demographic models.
In [18]:
%quickref
In [19]:
# insert path where file with model functions resides
sys.path.insert(0, '/home/claudius/Downloads/dadi/examples/YRI_CEU/')
# load custom demographic model functions into current namespace
import demographic_models
In [20]:
% ll /home/claudius/Downloads/dadi/examples/YRI_CEU/
In [21]:
! cat /home/claudius/Downloads/dadi/examples/YRI_CEU/YRI_CEU.fs
This is a 2D SFS in the format understood by dadi. It pertains to the two human populations YRI and CEU, with 10 individuals sampled from each population. The first 21 numbers correspond to [YRI: 0, CEU: 0-20], the following 21 numbers to [YRI: 1, CEU: 0-20], and so on.
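The row-major layout can be illustrated with a small numpy sketch (toy values, not the actual YRI/CEU spectrum):

```python
import numpy as np

# A toy 21 x 21 unfolded 2D SFS: entry [i, j] holds the number of SNPs
# at which the derived allele was seen i times in YRI and j times in CEU.
fs = np.arange(21 * 21).reshape(21, 21)

# The flat file stores the matrix row by row (row-major order), so flat
# index k corresponds to YRI count k // 21 and CEU count k % 21.
flat = fs.flatten()
k = 45
assert flat[k] == fs[k // 21, k % 21]  # k = 45 -> YRI: 2, CEU: 3
```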
In [22]:
# read in the unfolded 2D SFS from file
data = dadi.Spectrum.from_file('/home/claudius/Downloads/dadi/examples/YRI_CEU/YRI_CEU.fs')
In [23]:
ns = data.sample_sizes
ns
Out[23]:
Number of samples (or sample size) refers to the number of sampled gene copies: 20 per population here (from 10 diploid individuals), which is why each dimension of the spectrum has 21 entries (derived allele counts 0-20).
In [24]:
# flatten the array and get its length
len([ elem for row in data.data for elem in row ])
Out[24]:
In [25]:
21*21
Out[25]:
The Spectrum is a 21 $\times$ 21 matrix.
In [ ]:
%quickref
In [ ]:
# print the docstring of the fs object
%pdoc data
In [ ]:
%pinfo data
In [ ]:
# this prints the source code for the object
%psource data
In [26]:
pts_l = [40, 50, 60]
dadi will solve the partial differential equation at these three grid sizes and then extrapolate to an infinitely fine grid.
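make_extrap_log_func handles the extrapolation internally; purely as an illustration of the idea (and not dadi's actual algorithm), here is a Richardson-style extrapolation that fits the solver output as a polynomial in the grid spacing and reads off the value at zero spacing:

```python
import numpy as np

# Toy illustration of grid-size extrapolation (not dadi's actual
# algorithm): suppose a numerical solver has error terms linear and
# quadratic in the grid spacing h, so solver(h) = f_exact + a*h + b*h**2.
f_exact, a, b = 5.0, 2.0, -1.0
solver = lambda h: f_exact + a * h + b * h ** 2

# Evaluate at the three spacings implied by pts_l = [40, 50, 60] ...
hs = np.array([1 / 40, 1 / 50, 1 / 60])
vals = np.array([solver(h) for h in hs])

# ... then fit the quadratic error model exactly and read off the
# value at h = 0, i.e. on an infinitely fine grid.
A = np.vander(hs, 3)                  # columns: h**2, h, 1
extrapolated = np.linalg.solve(A, vals)[-1]
print(extrapolated)  # recovers f_exact = 5.0 (up to float error)
```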
The demographic model we'll fit is defined in the function demographic_models.prior_onegrow_mig. Let's have a look at its definition:
In [ ]:
%psource demographic_models.prior_onegrow_mig
The function first records the required grid size.
It then initialises the phi distribution on this grid.
Next it specifies a stepwise change in the size of the ancestral population. This initial ancestral population is implicitly taken as the reference population: the population size parameters nu1F, nu2B and nu2F are relative to the size of this initial ancestral population, which is set to 1. That means that if nu1F is greater than 1, the model specifies a stepwise increase in population size. The population stays at this size for a time Tp, which is specified in units of $2N_{ref}$ generations.
Next the model function specifies a population split. One of the daughter populations keeps the size of the ancestral population (presumably the African one). The other daughter population starts at a size of nu2B and then grows exponentially to a size of nu2F. During the divergence time T, the two populations exchange gene copies at a rate m in each direction.
The function Spectrum.from_phi then converts the phi distribution obtained under the specified model and parameter values into the expected SFS, which the model function finally returns.
In [27]:
func = demographic_models.prior_onegrow_mig
Next we turn the model function into a version that can do the extrapolation.
In [28]:
func_ex = dadi.Numerics.make_extrap_log_func(func)
func_ex is the function we are going to use for optimisation.
The model function takes a list of 6 parameters as its first argument. See the docstring for their description.
In [151]:
%psource func
It is necessary to confine the search space to reasonable values.
In [152]:
upper_bounds = [100, 100, 100, 10, 3, 3]
This specifies that the maximal size the ancestral population can grow to (nu1F) is 100$\times$ its initial size. Similarly, the maximal time the ancestral population stays at this size before it splits into two populations is 3$\times 2N_{ref}$ generations. Note that this is 1.5 times the expected time to the MRCA of a sample of $n$ gene copies when $n$ is very large, which approaches $2 \times 2N_{ref}$ generations under the standard neutral model (see p. 76 in Wakeley2009).
In [153]:
lower_bounds = [1e-2, 1e-2, 1e-2, 0, 0, 0]
The lower bound of the population size parameters is set to 1/100th of the reference population size, and the lower bounds of the migration rate and time parameters are set to 0.
In [154]:
# define starting values for model parameters
p0 = [2, 0.1, 2, 1, 0.2, 0.2]
Since the optimisation algorithms are not guaranteed to find the global optimum, it is important to run several optimisations for each data set, each with different starting values.
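The point is easy to demonstrate on a toy objective with two local optima (a generic scipy sketch, unrelated to dadi's optimisers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy illustration of why multiple starting values matter (generic
# scipy sketch, not dadi code): this function has two local minima,
# and a local optimiser only finds the one whose basin of attraction
# the starting value falls into.
f = lambda x: (x[0] ** 2 - 1.0) ** 2 + 0.5 * x[0]

starts = [-2.0, 0.5, 2.0]
results = [minimize(f, [s]) for s in starts]
for s, r in zip(starts, results):
    print('start {0:+.1f} -> x = {1:+.3f}, f = {2:+.3f}'.format(s, r.x[0], r.fun))

# Keep the best of all runs as the final estimate.
best = min(results, key=lambda r: r.fun)
```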
In [155]:
%psource dadi.Misc.perturb_params
In [156]:
p0 = dadi.Misc.perturb_params(p0, upper_bound=upper_bounds, lower_bound=lower_bounds, fold=1.5)
The name of the function argument for the "number of factors to disturb by" is very unfortunate in this context, since it has nothing to do with folding the spectrum. Anyway, a higher value leads to greater perturbation.
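In case the behaviour is unclear, here is a hypothetical re-implementation of such a multiplicative perturbation (a sketch of the idea, not dadi's actual code):

```python
import numpy as np

# Hypothetical sketch of multiplicative parameter perturbation, NOT
# dadi's actual implementation: multiply each parameter by 2**(fold*u)
# with u uniform on [-1, 1], then clip the result to the search bounds.
def perturb(params, fold, lower, upper, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1.0, 1.0, size=len(params))
    perturbed = np.asarray(params, dtype=float) * 2.0 ** (fold * u)
    return np.clip(perturbed, lower, upper)

p0 = [2, 0.1, 2, 1, 0.2, 0.2]
lower_bounds = [1e-2, 1e-2, 1e-2, 0, 0, 0]
upper_bounds = [100, 100, 100, 10, 3, 3]
print(perturb(p0, fold=1.5, lower=lower_bounds, upper=upper_bounds))
```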
In [92]:
p0
Out[92]:
In [93]:
popt_1 = dadi.Inference.optimize_log(p0, data, func_ex, pts_l, \
lower_bound=lower_bounds, upper_bound=upper_bounds, \
verbose=10, maxiter=10)
In [94]:
p_names = ("nu1F", "nu2B", "nu2F", "m", "Tp", "T")
for n,p in zip(p_names, popt_1):
print str(n) + "\t" + "{0:.3f}".format(p)
Let's see how robust these estimates are to different starting values.
In [95]:
# define starting values for model parameters
p0 = [2, 0.1, 2, 1, 0.2, 0.2]
# create new starting values for parameters
p0 = dadi.Misc.perturb_params(p0, upper_bound=upper_bounds, lower_bound=lower_bounds, fold=1.5)
# run optimisation with 10 iterations
popt_2 = dadi.Inference.optimize_log(p0, data, func_ex, pts_l, \
lower_bound=lower_bounds, upper_bound=upper_bounds, \
verbose=10, maxiter=10)
In [104]:
for n, p1, p2 in zip(p_names, popt_1, popt_2):
print str(n) + "\t" + "{0:.3f}".format(p1) + "\t" + "{0:.3f}".format(p2)
In [105]:
# define starting values for model parameters
p0 = [2, 0.1, 2, 1, 0.2, 0.2]
# create new starting values for parameters
p0 = dadi.Misc.perturb_params(p0, upper_bound=upper_bounds, lower_bound=lower_bounds, fold=1.5)
# run optimisation with 10 iterations
popt_3 = dadi.Inference.optimize_log(p0, data, func_ex, pts_l, \
lower_bound=lower_bounds, upper_bound=upper_bounds, \
verbose=10, maxiter=10)
In [106]:
for n, p1, p2, p3 in zip(p_names, popt_1, popt_2, popt_3):
print str(n) + "\t" + "{0:.3f}".format(p1) + "\t" + "{0:.3f}".format(p2) + "\t" + "{0:.3f}".format(p3)
With just 10 iterations, the optimisation does not seem to converge for all parameters.
In [85]:
# best fit parameter values (from YRI_CEU.py)
popt = [1.881, 0.0710, 1.845, 0.911, 0.355, 0.111]
In [86]:
# get best-fit model AFS
model = func_ex(popt, ns, pts_l)
In [31]:
model
Out[31]:
In [32]:
model.data.sum()
Out[32]:
I do not understand what is in this model spectrum. I thought it would be expected proportions, so that the sum across the spectrum would be 1. I think these are expected counts (not proportions) assuming a $\theta$ of 1.
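If that interpretation is right, the overall scale of the model spectrum is a free multiplier. Under a Poisson likelihood for the SFS entries, the optimal $\theta$ has the closed form data.sum()/model.sum(), which I believe is what optimal_sfs_scaling returns. A toy numpy check (hypothetical numbers, not dadi code):

```python
import numpy as np

# Poisson log likelihood of a data SFS given a model SFS scaled by
# theta, dropping the constant log(data!) term and ignoring the
# masked corner entries (toy sketch, not dadi's actual code).
def poisson_ll(theta, model, data):
    m = theta * model
    return np.sum(data * np.log(m) - m)

model = np.array([4.0, 2.0, 1.0, 0.5])     # toy expected SFS for theta = 1
data = np.array([120.0, 55.0, 31.0, 14.0])

# Setting the derivative w.r.t. theta to zero gives the closed form
# theta_hat = data.sum() / model.sum(); a grid search agrees.
theta_hat = data.sum() / model.sum()
grid = np.linspace(0.5 * theta_hat, 2.0 * theta_hat, 1001)
best = grid[np.argmax([poisson_ll(t, model, data) for t in grid])]
print(theta_hat, best)
```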
In [87]:
# Log likelihood of the data given the model
ll = dadi.Inference.ll_multinom(model, data)
ll
Out[87]:
In [34]:
%psource dadi.Inference.ll_multinom
In [35]:
# the optimal value of theta0 given the model
theta0 = dadi.Inference.optimal_sfs_scaling(model, data)
theta0
Out[35]:
In [36]:
import pylab
%matplotlib inline
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
In [37]:
# plot a comparison of the model SFS with the SFS from the data
dadi.Plotting.plot_2d_comp_multinom(model, data, vmin=1, resid_range=3, pop_ids=("YRI", "CEU"))
In [162]:
# print the docstring of the function
%pdoc dadi.Plotting.plot_2d_comp_multinom
The following requires that ms is installed.
In [38]:
# generate the core of a ms command with the optimised model parameter values
mscore = demographic_models.prior_onegrow_mig_mscore(popt)
In [39]:
# generate full ms command
mscommand = dadi.Misc.ms_command(1., ns, mscore, int(1e5))
Note, the ms command specifies a $\theta$ of 1 for better efficiency. The simulated spectra can be rescaled later with the theta0 from above.
In [40]:
mscommand
Out[40]:
In [41]:
import os
return_code = os.system('{0} > test.msout'.format(mscommand))
In [42]:
% ll
In [43]:
msdata = dadi.Spectrum.from_ms_file('test.msout')
In [44]:
dadi.Plotting.plot_2d_comp_multinom(model, theta0*msdata, vmin=1, pop_ids=['YRI', 'CEU'])
The spectrum simulated with ms (averaged across iterations, I believe) is almost identical to the model spectrum. This confirms that $\partial$a$\partial$i's deterministic approximation is very good. One could now compare the ms-simulated spectra to the observed spectrum.
In order to obtain confidence intervals for the parameter estimates, one needs to create conventional bootstraps over unlinked loci, i.e. over contigs instead of over nucleotide sites. From these bootstrapped data sets one can generate site frequency spectra and estimate model parameters in the same way as for the full observed data. However, this is computationally expensive. A more efficient alternative is to calculate uncertainties via the Godambe Information Matrix (GIM), estimated from the bootstrapped data sets (see Coffman2016 for details).
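A conventional bootstrap over contigs could be sketched like this (hypothetical per-contig spectra, not dadi code):

```python
import numpy as np

# Hypothetical sketch of a conventional bootstrap over unlinked loci
# (not dadi code): given one small 2D spectrum per contig, resample
# contigs with replacement and sum their spectra to form each
# bootstrap replicate of the full SFS.
rng = np.random.default_rng(1)
n_contigs = 200
contig_spectra = rng.poisson(2.0, size=(n_contigs, 21, 21)).astype(float)

def bootstrap_sfs(spectra, rng):
    idx = rng.integers(0, len(spectra), size=len(spectra))
    return spectra[idx].sum(axis=0)

all_boot = [bootstrap_sfs(contig_spectra, rng) for _ in range(100)]
print(len(all_boot), all_boot[0].shape)  # 100 replicates, each 21 x 21
```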
In [48]:
# the examples directory contains site frequency spectra from bootstraps
% ll examples/YRI_CEU/bootstraps/
In [58]:
# load spectra from bootstraps of the data into an array
all_boot = [ dadi.Spectrum.from_file('examples/YRI_CEU/bootstraps/{0:02d}.fs'.format(i)) for i in range(100) ]
In [56]:
print ['{0:02d}.fs'.format(i) for i in range(100)]
In [64]:
%%time
uncerts = dadi.Godambe.GIM_uncert(func_ex, pts_l, all_boot, popt, data, multinom=True)
In [65]:
print 'Estimated parameter standard deviations from GIM: {0}'.format(uncerts)
In [67]:
# These are the optimal parameters when the spectrum is folded. They can be
# found simply by passing data.fold() to the above call to optimize_log.
popt_fold = numpy.array([1.907, 0.073, 1.830, 0.899, 0.425, 0.113])
In [68]:
# get standard deviations for model parameters
uncerts_folded = dadi.Godambe.GIM_uncert(func_ex, pts_l, all_boot, popt_fold, data.fold(), multinom=True)
In [72]:
print 'Folding increases parameter uncertainties by factors of: {}'.format(uncerts_folded/uncerts)
Outgroup information greatly increases power!
The following will compare the model with migration with a model without migration, thus testing whether the inferred migration rate is significantly different from 0.
In [74]:
# the model without migration is also defined in the demographic_models script
func_nomig = demographic_models.prior_onegrow_nomig
func_ex_nomig = dadi.Numerics.make_extrap_log_func(func_nomig)
In [75]:
# these are the best-fit parameters for the model without migration,
# as provided in YRI_CEU.py
popt_nomig = numpy.array([ 1.897, 0.0388, 9.677, 0.395, 0.070])
In [76]:
# get the expected AFS from the model without migration
model_nomig = func_ex_nomig(popt_nomig, ns, pts_l)
In [91]:
# get the likelihood of the data given the model without migration
ll_nomig = dadi.Inference.ll_multinom(model_nomig, data)
Out[91]:
In [93]:
print 'The log likelihood of the model with migration was: {0:.1f}'.format(ll)
print 'The log likelihood of the model without migration is: {0:.1f}'.format(ll_nomig)
The more complex model with migration (one extra parameter) has a greater likelihood, as expected. But is that difference significant, or is the extra parameter merely fitting noise in the data?
In [79]:
p_lrt = popt
p_lrt[3] = 0
print p_lrt
print popt
# note: the first line above only created a reference to the list, not a copy,
# which is why popt was modified as well
In [80]:
# best fit parameter values for the model with migration (from YRI_CEU.py)
popt = [1.881, 0.0710, 1.845, 0.911, 0.355, 0.111]
In [82]:
p_lrt = popt[:] # copy parameter list
p_lrt[3] = 0
print p_lrt
print popt
We need to calculate an adjustment factor for the likelihood-ratio statistic that corrects for non-independence among sites due to linkage (see Coffman2016).
In [84]:
adj = dadi.Godambe.LRT_adjust(func_ex, pts_l, all_boot, p_lrt, data, nested_indices=[3], multinom=True)
In [88]:
D_adj = adj * 2 * (ll - ll_nomig)
In [89]:
print 'Adjusted D statistic: {0:.4f}'.format(D_adj)
Verbatim from YRI_CEU.py:
"Because this is test of a parameter on the boundary of parameter space (m cannot be less than zero), our null distribution is an even proportion of chi^2 distributions with 0 and 1 d.o.f. To evaluate the p-value, we use the point percent function for a weighted sum of chi^2 dists."
See also the manual and Coffman2016.
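With scipy, the tail probability of this mixture can be computed by hand (a sketch of the statistics involved, not dadi's sum_chi2_ppf implementation):

```python
from scipy.stats import chi2

# Sketch of the boundary-case p-value (hypothetical helper, not dadi's
# code): the null is a 50:50 mixture of chi^2 distributions with 0 and
# 1 d.o.f. A chi^2 with 0 d.o.f. is a point mass at zero, so for D > 0
# only the 1-d.o.f. component contributes to the tail probability.
def mixture_pval(D, weights=(0.5, 0.5)):
    return weights[0] * 0.0 + weights[1] * chi2.sf(D, df=1)

print(mixture_pval(2.71))  # close to the one-sided 5% threshold
```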
In [90]:
pval = dadi.Godambe.sum_chi2_ppf(D_adj, weights=(0.5, 0.5))
In [94]:
print 'p-val for rejecting the no-migration model: {0:.4f}'.format(pval)
Adding the migration parameter does not significantly improve the fit of the model to the data. According to this data (and analysis), gene flow between YRI and CEU has not significantly affected the distribution of genetic variation as summarised in the joint SFS.
In [ ]: