I have not installed dadi globally on huluvu. Instead, I left it in my Downloads directory '/home/claudius/Downloads/dadi'. In order for Python to find that module, I need to add that directory to the PYTHONPATH variable.
In [2]:
import sys
sys.path
Out[2]:
In [3]:
sys.path.insert(0, '/home/claudius/Downloads/dadi')
In [4]:
sys.path
Out[4]:
In [5]:
import dadi
In [6]:
import pylab
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
%matplotlib inline
In [7]:
% ll dadiExercises/
In [8]:
% cat dadiExercises/ERY.FOLDED.sfs.dadi_format
I have turned the 1D folded SFS's from realSFS
into $\delta$d$\delta$i format by hand according to the description in section 3.1 of the manual.
Note, that the last line, indicating the mask, has length 37, but the folded spectrum has length 19. Dadi wants to mask counts from invariable sites. For an unfolded spectrum, i. e. polarised with respect to an inferred ancestral allele at each site, the first and the last count classes would correspond to invariable sites. In a folded spectrum, i. e. with counts of the minor allele at each site, the last count class corresponds to SNP's with minor sample allele frequency of $n/2$ (with even sample size).
In [9]:
fs_ery = dadi.Spectrum.from_file('dadiExercises/ERY.FOLDED.sfs.dadi_format')
In [10]:
%pdoc dadi.Spectrum.from_file
In [11]:
fs_ery
Out[11]:
In [12]:
ns = fs_ery.sample_sizes
ns
Out[12]:
In [13]:
fs_ery.pop_ids = ['ery'] # must be an array, otherwise leads to error later on
In [14]:
# the number of segregating sites in the spectrum
fs_ery.sum()
Out[14]:
According to the number of segregating sites, this spectrum should have good power to distinguish between alternative demographic models (see Adams2004). However, the noise in the data is extreme, as can be seen below, which might compromise this power and maybe even lead to false inferences.
In [15]:
%pdoc dadi.Plotting.plot_1d_fs
In [16]:
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
dadi.Plotting.plot_1d_fs(fs_ery, show=False)
In [17]:
# show modules within dadi
dir(dadi)
Out[17]:
In [18]:
dir(dadi.Demographics1D)
Out[18]:
In [19]:
# show the source of the 'Demographics1D' method
%psource dadi.Demographics1D
In [20]:
# create link to method
func = dadi.Demographics1D.snm
In [21]:
# make the extrapolating version of the demographic model function
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [22]:
# setting the smallest grid size slightly larger than the largest population sample size
pts_l = [40, 50, 60]
The snm
function does not take parameters to optimize. I can therefore get directly the expected model. The snm
function does not take a fold
argument. I am therefore going to calculated an unfolded expected spectrum and then fold.
In [23]:
# calculate unfolded AFS under standard neutral model (up to a scaling factor theta)
model = func_ex(0, ns, pts_l)
model
Out[23]:
In [24]:
dadi.Plotting.plot_1d_fs(model.fold()[:19], show=False)
What's happening in the 18th count class?
In [25]:
# get the source of the fold method, which is part of the Spectrum object
%psource dadi.Spectrum.fold
In [26]:
# get the docstring of the Spectrum object
%pdoc dadi.Spectrum
In [27]:
# retrieve the spectrum array from the Spectrum object
model.data
Out[27]:
I am going to fold manually now.
In [28]:
# reverse spectrum and add to itself
model_fold = model.data + model.data[::-1]
model_fold
Out[28]:
In [29]:
# discard all count classes >n/2
model_fold = model_fold[:19]
model_fold
Out[29]:
When the sample size is even, then highest sample frequency class corresponds to just one unfolded class (18). This has been added to itself and those SNP's are counted twice at the moment. I need to divide this class by 2 to get the correct count for this folded class.
In [30]:
# divide highest sample frequency class by 2
model_fold[18] = model_fold[18]/2.0
In [31]:
model_fold
Out[31]:
In [32]:
# create dadi Spectrum object from array, need to specify custom mask
model_folded = dadi.Spectrum(data=model_fold, mask_corners=False, mask= [1] + [0]*18)
model_folded
Out[32]:
In [33]:
dadi.Plotting.plot_1d_fs(model_folded)
The folded expected spectrum is correct. Also, see figure 4.5 in Wakeley2009.
In [34]:
# fold the unfolded model
model_folded = model.fold()
#model_folded = model_folded[:(ns[0]+1)]
model_folded.pop_ids = ['ery'] # be sure to give an array, not a scalar string
model_folded
Out[34]:
In [35]:
ll_model_folded = dadi.Inference.ll_multinom(model_folded, fs_ery)
In [36]:
print 'The log composite likelihood of the observed ery spectrum given a standard neutral model is {0:.3f}.'.format(ll_model_folded)
In [37]:
theta = dadi.Inference.optimal_sfs_scaling(model_folded, fs_ery)
In [38]:
print 'The optimal value of theta is {0:.3f}.'.format(theta)
This theta estimate is a little bit higher than what I estimated with curve fitting in Fist_Steps_with_dadi.ipynb
, which was 10198.849.
What effective ancestral population size would that imply?
According to section 4.4 in the dadi manual:
$$ \theta = 4 N_{ref} \mu_{L} \qquad \text{L: sequence length} $$Let's assume the mutation rate per nucleotide site per generation is $3\times 10^{-9}$ (see e. g. Liu2017). Then
$$ \mu_{L} = \mu_{site} \times L $$So
$$ \theta = 4 N_{ref} \mu_{site} \times L $$and
$$ N_{ref} = \frac{\theta}{4 \mu_{site} L} $$
In [39]:
mu = 3e-9
L = fs_ery.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The total sequence length is " + str(L)
N_ref = theta/L/mu/4
print "The effective ancestral population size (in number of diploid individuals) implied by this theta is: {0}.".format(int(N_ref))
This effective population size is consistent with those reported in Lynch2016 for other insect species.
Begin Digression:
In [71]:
x = pylab.arange(0, 100)
y = 0.5**(x)
pylab.plot(x, y)
Out[71]:
In [73]:
x[:10] * y[:10]
Out[73]:
In [72]:
sum(x * y)
Out[72]:
End Digression
In [ ]:
In [37]:
model_folded * theta
Out[37]:
In [40]:
pylab.semilogy(model_folded * theta, "bo-", label='SNM')
pylab.plot(fs_ery, "ro-", label='ery')
pylab.legend()
Out[40]:
In [41]:
%psource dadi.Plotting.plot_1d_comp_Poisson
In [42]:
# compare model prediction and data visually with dadi function
dadi.Plotting.plot_1d_comp_multinom(model_folded[:19], fs_ery[:19], residual='linear')
The lower plot is for the scaled Poisson residuals.
$$ residuals = (model - data)/\sqrt{model} $$The model is the expected counts in each frequency class. If these counts are Poisson distributed, then their variance is equal to their expectation. The differences between model and data are therefore scaled by the expected standard deviation of the model counts.
The observed counts deviate by up to 30 standard deviations from the model!
What could be done about this?
The greatest deviations are seen for the first two frequency classes, the ones that should provide the greatest amount of information (Fu1994) for theta and therefore probably also other parameters. Toni has suggested that the doubleton class is inflated due to "miscalling" heterozygotes as homozygotes. When they contain a singleton they will be "called" as homozygote and therefore contribute to the doubleton count. This is aggravated by the fact that the sequenced individuals are all male which only possess one X chromosome. The X chromosome is the fourth largest of the 9 chromosomes of these grasshoppers (8 autosomes + X) (see Gosalvez1988, fig. 2). That is, about 1/9th of the sequenced RAD loci are haploid but ANGSD assumes all loci to be diploid. The genotype likelihoods it calculates are all referring to diploid genotypes.
I think one potential reason for the extreme deviations is that the genotype likelihoods are generally biased toward homozygote genotypes (i. e. also for autosomal loci) due to PCR duplicates (see eq. 1 in Nielsen2012). So, one potential improvement would be to remove PCR duplicates.
Another potential improvement could be found by subsampling 8/9th to 8/10th of the contigs in the SAF files and estimating an SFS from these. Given enough subsamples, one should eventually be found that maximally excludes loci from the X chromosome. This subsample is expected to produce the least squared deviations from an expected SFS under the standard neutral model. However, one could argue that this attempt to exclude problematic loci could also inadvertently remove loci that strongly deviate from neutral expectations due to non-neutral evolution, again reducing power to detect deviations from the standard neutral model. I think one could also just apply the selection criterion of the second MAF class to be lower than the first and just save all contig subsamples and SFS's that fulfill that criterioin, since that should be true for all demographic scenarios.
As seen above in the folded model spectrum, dadi just masks out entries that are not sensical in a folded spectrum, but keeps the length of the spectrum the same as the unfolded. That way the sample size (i. e. number of chromosomes) is determined correctly. Let's create a correct folded spectrum object for ery.
In [43]:
fs_ery
Out[43]:
In [44]:
# make copy of spectrum array
data_abc = fs_ery.data.copy()
In [45]:
# resize the array to the unfolded length
data_abc.resize((37,))
data_abc
Out[45]:
In [46]:
fs_ery_ext = dadi.Spectrum(data_abc)
fs_ery_ext
Out[46]:
In [47]:
fs_ery_ext.fold()
Out[47]:
In [48]:
fs_ery_ext = fs_ery_ext.fold()
fs_ery_ext.pop_ids = ['ery']
fs_ery_ext
Out[48]:
In [49]:
fs_ery_ext.sample_sizes
Out[49]:
Now, the reported sample size is correct and we have a Spectrum
object that dadi can handle correctly.
Does estimating an unfolded spectrum with ANGSD and then folding yield a sensible folded SFS when the sites are not polarised with respect to an ancestral allele but with respect to the reference allele? Matteo Fumagalli thinks that this is sensible.
In [51]:
% cat dadiExercises/ERY.FOLDED.sfs.dadi_format
In [52]:
# load the spectrum that was created from folded SAF's
fs_ery_folded_by_Angsd = dadi.Spectrum.from_file('dadiExercises/ERY.FOLDED.sfs.dadi_format')
fs_ery_folded_by_Angsd
Out[52]:
In [53]:
# extract unmasked entries of the SFS
m = fs_ery_folded_by_Angsd.mask
fs_ery_folded_by_Angsd[m == False]
Out[53]:
In [54]:
% ll ../ANGSD/SFS/ERY/
I have copied the unfolded SFS into the current directory.
In [55]:
% ll
In [56]:
% cat ERY.unfolded.sfs
In [57]:
# load unfolded spectrum
fs_ery_unfolded_by_ANGSD = dadi.Spectrum.from_file('ERY.unfolded.sfs')
fs_ery_unfolded_by_ANGSD
Out[57]:
In [58]:
# fold unfolded spectrum
fs_ery_unfolded_by_Angsd_folded = fs_ery_unfolded_by_ANGSD.fold()
fs_ery_unfolded_by_Angsd_folded
Out[58]:
In [59]:
# plot the two spectra
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
pylab.plot(fs_ery_folded_by_Angsd, 'ro-', label='folded by ANGSD')
pylab.plot(fs_ery_unfolded_by_Angsd_folded, 'bo-', label='folded by DADI')
pylab.legend()
pylab.savefig('ery_fold_comp.png')
In [59]:
%psource dadi.Plotting.plot_1d_comp_Poisson
In [60]:
dadi.Plotting.plot_1d_comp_Poisson(fs_ery_folded_by_Angsd[:19], fs_ery_unfolded_by_Angsd_folded[:19], \
residual='linear')
The sizes of the residuals (scaled by the Poisson standard deviations) indicate that the two versions of the folded SFS of ery are significantly different.
Now, what does the parallelus data say?
In [61]:
% ll dadiExercises/
In [62]:
% cat dadiExercises/PAR.FOLDED.sfs.dadi_format
In [63]:
# load the spectrum folded by ANGSD
fs_par_folded_by_Angsd = dadi.Spectrum.from_file('dadiExercises/PAR.FOLDED.sfs.dadi_format')
fs_par_folded_by_Angsd
Out[63]:
In [64]:
% cat PAR.unfolded.sfs
In [65]:
# load spectrum that has been created from unfolded SAF's
fs_par_unfolded_by_Angsd = dadi.Spectrum.from_file('PAR.unfolded.sfs')
fs_par_unfolded_by_Angsd
Out[65]:
In [66]:
fs_par_unfolded_by_Angsd_folded = fs_par_unfolded_by_Angsd.fold()
fs_par_unfolded_by_Angsd_folded
Out[66]:
In [67]:
dadi.Plotting.plot_1d_comp_Poisson(fs_par_folded_by_Angsd[:19], fs_par_unfolded_by_Angsd_folded[:19], \
residual='linear')
In [68]:
#pylab.subplot(2,1,1)
pylab.plot(fs_par_folded_by_Angsd[:19], 'ro-', label='folded by ANGSD')
#pylab.subplot(2,1,2)
pylab.plot(fs_par_unfolded_by_Angsd_folded, 'bo-', label='folded by DADI')
pylab.legend()
pylab.savefig('par_fold_comp.png')
The unfolded spectrum folded by dadi seems to be a bit better behaved than the one folded by ANGSD. I really wonder whether folding in ANGSD is needed.
The folded 2D spectrum from ANGSD is a 19 x 19 matrix. This is not a format that dadi can understand.
In [83]:
%psource dadi.Spectrum.from_data_dict
See this thread on the dadi forum.
In [69]:
# show the source of the 'Demographics1D' method
%psource dadi.Demographics1D.growth
In [134]:
# create link to function that specifies a simple growth or decline model
func = dadi.Demographics1D.growth
In [135]:
# create extrapolating version of the function
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [136]:
# set lower and upper bounds to nu and T
upper_bound = [100, 3]
lower_bound = [1e-2, 0]
In [137]:
# set starting value
p0 = [1, 1] # corresponds to constant population size
In [218]:
%pdoc dadi.Misc.perturb_params
In [138]:
# perturb starting values by up to a factor of 2
p0 = dadi.Misc.perturb_params(p0, fold=1, upper_bound=upper_bound, lower_bound=lower_bound)
In [141]:
p0
Out[141]:
In [1]:
%psource dadi.Inference.optimize_log
In [142]:
# run optimisation of paramters
popt = dadi.Inference.optimize_log(p0=p0, data=fs_ery, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=0, maxiter=100, full_output=False)
In [143]:
popt
Out[143]:
I need to run the simulation with different starting values to check convergence.
I would like to do these runs in parallel. I have 12 cores available on huluvu.
In [144]:
from ipyparallel import Client
In [145]:
cl = Client()
In [146]:
cl.ids
Out[146]:
I now have connections to 11 engines. I started the engines with ipcluster start -n 11 &
in the terminal.
In [147]:
# create load balanced view of the engines
lbview = cl.load_balanced_view()
In [148]:
lbview.block
Out[148]:
In [149]:
# create direct view of all engines
dview = cl[:]
In [150]:
# set starting value for all engines
dview['p0'] = [1, 1]
dview['p0']
Out[150]:
In [151]:
# set lower and upper bounds to nu and T for all engines
dview['upper_bound'] = [100, 3]
dview['lower_bound'] = [1e-2, 0]
In [152]:
dview['fs_ery'] = fs_ery
cl[0]['fs_ery']
Out[152]:
In [153]:
dview['func_ex'] = func_ex
dview['pts_l'] = pts_l
In [155]:
with dview.sync_imports():
import sys
In [156]:
dview.execute('sys.path.insert(0, \'/home/claudius/Downloads/dadi\')')
Out[156]:
In [157]:
cl[0]['sys.path']
Out[157]:
In [158]:
with dview.sync_imports():
import dadi
In [163]:
@lbview.parallel(block=True)
def run_dadi(x): # for the function to be called with map, it needs to have one input variable
# perturb starting values by up to a factor of 2
p1 = dadi.Misc.perturb_params(p0, fold=1, upper_bound=upper_bound, lower_bound=lower_bound)
# run optimisation of paramters
popt = dadi.Inference.optimize_log(p0=p1, data=fs_ery, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=0, maxiter=100, full_output=False)
return popt
In [164]:
run_dadi.map(range(20))
Out[164]:
In [ ]:
In [78]:
popt
Out[78]:
In [60]:
# set starting value
p0 = [1, 1]
# perturb starting values by up to a factor of 2
p0 = dadi.Misc.perturb_params(p0, fold=1, upper_bound=upper_bound, lower_bound=lower_bound)
# run optimisation of paramters
popt = dadi.Inference.optimize_log(p0=p0, data=fs_ery_ext, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=0, maxiter=100, full_output=False)
popt
Out[60]:
In [85]:
def exp_growth(x):
p0 = [1, 1]
# perturb starting values by up to a factor of 2
p0 = dadi.Misc.perturb_params(p0, fold=1, upper_bound=upper_bound, lower_bound=lower_bound)
# run optimisation of paramters
popt = dadi.Inference.optimize_log(p0=p0, data=fs_ery_ext, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=0, maxiter=100, full_output=False)
return popt
In [86]:
popt = map(exp_growth, range(10))
In [87]:
# this will run a few minutes
# popt
Out[87]:
In [89]:
import ipyparallel as ipp
In [90]:
c = ipp.Client()
c.ids
Out[90]:
In [94]:
%%time
dview = c[:]
popt = dview.map_sync(exp_growth, range(10))
Unfortunately, parallelisation is not as straightforward as it should be.
In [95]:
popt
Out[95]:
Except for the last iteration, the two parameter estimates seem to have converged.
In [102]:
ns = fs_ery_ext.sample_sizes
ns
Out[102]:
In [105]:
print popt[0]
print popt[9]
What is the log likelihood of the model given these two different parameter sets?
In [109]:
model_one = func_ex(popt[0], ns, pts_l)
ll_model_one = dadi.Inference.ll_multinom(model_one, fs_ery_ext)
ll_model_one
Out[109]:
In [110]:
model_two = func_ex(popt[9], ns, pts_l)
ll_model_two = dadi.Inference.ll_multinom(model_two, fs_ery_ext)
ll_model_two
Out[110]:
The lower log-likelihood for the last set of parameters inferred indicates that the optimisation got trapped in a local minimum in the last run of the optimisation.
What the majority of the parameter sets seem to indicate is that at about time $0.007 \times 2 N_{ref}$ generations in the past the ancestral population started to shrink exponentially, reaching a population size of about $0.14 \times N_{ref}$ at present.
In [118]:
print 'The model suggests that exponential decline in population size started {0:.0f} generations ago.'.format(popt[0][1] * 2 * N_ref)
In [119]:
dir(dadi.Demographics1D)
Out[119]:
In [122]:
%psource dadi.Demographics1D.two_epoch
This model specifies a stepwise change in population size some time ago. It assumes that the population size has stayed constant since the change.
In [124]:
func = dadi.Demographics1D.two_epoch
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [130]:
upper_bound = [10, 3]
lower_bound = [1e-3, 0]
pts_l = [40, 50, 60]
In [131]:
def stepwise_pop_change(x):
# set initial values
p0 = [1, 1]
# perturb initial parameter values randomly by up to 2 * fold
p0 = dadi.Misc.perturb_params(p0, fold=1.5, \
upper_bound=upper_bound, lower_bound=lower_bound)
# run optimisation
popt = dadi.Inference.optimize_log(p0, fs_ery_ext, func_ex, pts_l, \
upper_bound=upper_bound, lower_bound=lower_bound,
verbose=0, maxiter=10)
return popt
In [132]:
stepwise_pop_change(1)
Out[132]:
In [133]:
stepwise_pop_change(1)
Out[133]:
In [134]:
popt = map(stepwise_pop_change, range(10))
In [135]:
popt
Out[135]:
This model does not converge on a set of parameter values.
In [137]:
nu = [i[0] for i in popt]
nu
Out[137]:
In [138]:
T = [i[1] for i in popt]
T
Out[138]:
In [146]:
pylab.rcParams['font.size'] = 14.0
pylab.loglog(nu, T, 'bo')
pylab.xlabel(r'$\nu$')
pylab.ylabel('T')
Out[146]:
Both parameters seem to be correlated. With the available data, it may not be possible to distinguish between a moderate reduction in population size a long time ago (topright in the above figure) and a drastic reduction in population size a short time ago (bottomleft in the above figure).
In [91]:
%psource dadi.Demographics1D
This model has three parameters. $\nu_B$ is the ratio of the population size (with respect to the ancestral population size $N_{ref}$) after the first stepwise change at time T in the past. The population is then asumed to undergo exponential growth/decline to a ratio of population size $\nu_F$ at present.
In [84]:
func = dadi.Demographics1D.bottlegrowth
func_ex = dadi.Numerics.make_extrap_log_func(func)
upper_bound = [100, 100, 3]
lower_bound = [1e-3, 1e-3, 0]
pts_l = [40, 50, 60]
In [79]:
def bottleneck_growth(x):
p0 = [1, 1, 1] # corresponds to constant population size
# perturb initial parameter values randomly by up to 2 * fold
p0 = dadi.Misc.perturb_params(p0, fold=1.5, \
upper_bound=upper_bound, lower_bound=lower_bound)
# run optimisation
popt = dadi.Inference.optimize_log(p0, fs_ery_ext, func_ex, pts_l, \
upper_bound=upper_bound, lower_bound=lower_bound,
verbose=0, maxiter=10)
return popt
In [85]:
%%time
popt = map(bottleneck_growth, range(10))
In [90]:
popt
Out[90]:
There is no convergence of parameters estimates. The parameter combinations stand for vastly different demographic scenarios. Most seem to suggest a population increase (up to 100 times the ancestral population size), followed by exponential decrease to about the ancestral population size.
In [92]:
func = dadi.Demographics1D.three_epoch
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [ ]:
%psource dadi.Demographics1D.three_epoch
This model tries to estimate three parameters. The populations is expected to undergo a stepwise population size change (bottleneck) at time TF + TB. At time TF it is expected to recover immediately to the current population size.
In [94]:
upper_bound = [100, 100, 3, 3]
lower_bound = [1e-3, 1e-3, 0, 0]
pts_l = [40, 50, 60]
In [95]:
def opt_three_epochs(x):
p0 = [1, 1, 1, 1] # corresponds to constant population size
# perturb initial parameter values randomly by up to 2 * fold
p0 = dadi.Misc.perturb_params(p0, fold=1.5, \
upper_bound=upper_bound, lower_bound=lower_bound)
# run optimisation
popt = dadi.Inference.optimize_log(p0, fs_ery_ext, func_ex, pts_l, \
upper_bound=upper_bound, lower_bound=lower_bound,
verbose=0, maxiter=10)
return popt
In [96]:
%%time
popt = map(opt_three_epochs, range(10))
In [97]:
popt
Out[97]:
Note, that only one of the optimisations inferred a bottleneck (4th). All others either inferred a constant population size or an increase in population size. Contemporary population sizes are mostly inferred to be similar to ancestral population sizes. The two time parameters vary wildly.
In [ ]: