I have started 11 IPython engines in the terminal with the following command:
ipcluster start -n 11 &
In [1]:
from ipyparallel import Client
cl = Client()
cl.ids
Out[1]:
In [2]:
# clear the namespace in engines
cl.clear()
Out[2]:
In [3]:
%px %who
In [4]:
%%px --local
# run whole cell on all engines as well as in the local IPython session
import numpy as np
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [5]:
%%px --local
# import 1D spectrum of ery on all engines:
fs_ery = dadi.Spectrum.from_file('dadiExercises/ERY.FOLDED.sfs.dadi_format')
# import 1D spectrum of par on all engines:
fs_par = dadi.Spectrum.from_file('dadiExercises/PAR.FOLDED.sfs.dadi_format')
In [7]:
fs_par
Out[7]:
In [6]:
%%px --local
ns = fs_ery.sample_sizes # both populations have the same sample size
fs_ery.pop_ids = ['ery']
fs_par.pop_ids = ['par']
# setting the smallest grid size slightly larger than the largest population sample size (36)
pts_l = [50, 60, 70]
In [7]:
%psource dadi.Demographics1D.two_epoch
The built-in two-epoch model defines a piecewise-constant history: at some time $T$ in the past the ancestral population instantaneously changed to the contemporary population size, which is a ratio $\nu$ of the ancestral population size.
Note that $\nu$ and $T$ are confounded parameters, meaning that different combinations of $\nu$ and $T$ can lead to very similar site frequency spectra. For instance, a strong but recent bottleneck can produce a similar SFS to an ancient but weak bottleneck.
In [7]:
%%px --local
# create link to function that specifies a simple growth or decline model
func = dadi.Demographics1D.two_epoch
# create extrapolating version of the function
func_ex = dadi.Numerics.make_extrap_log_func(func)
# set lower and upper bounds to nu and T
upper_bound = [1e3, 4]
lower_bound = [1e-3, 1e-4]
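To illustrate the confounding of $\nu$ and $T$ mentioned above, one could compare the expected spectra for two different parameter combinations; the values below are arbitrary and only meant as a sketch.
In [ ]:
# sketch: different (nu, T) combinations can produce very similar spectra
sfs_a = func_ex([0.05, 0.02], ns, pts_l)   # strong, recent size reduction
sfs_b = func_ex([0.1, 0.05], ns, pts_l)    # weaker, older size reduction
# compare the folded, normalised spectra
print abs(sfs_a.fold()/sfs_a.fold().sum() - sfs_b.fold()/sfs_b.fold().sum()).max()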
In [8]:
# create range of starting values for nu and T, evenly distributed in log space
p0_nu = np.logspace(-3, 3, base=10.0, num=14)
p0_T = np.logspace(-4, np.log10(4), base=10, num=14)
In [9]:
print p0_nu[0], p0_nu[-1]
print p0_T[0], p0_T[-1]
Note that these ranges also include parameter values at the boundaries I set above.
In [10]:
print "There are %d parameter combinations." % (14*14)
In [11]:
%px %who
In [12]:
# create load balanced view of engines
lbview = cl.load_balanced_view()
In [13]:
%who
In [14]:
def run_dadi(p_init): # for the function to be called with map, it needs to have one input variable
"""
p_init: initial parameter values to run optimisation from
"""
if perturb == True:
p_init = dadi.Misc.perturb_params(p_init, fold=fold,
upper_bound=upper_bound, lower_bound=lower_bound)
# note upper_bound and lower_bound variables are expected to be in the namespace of each engine
# run optimisation of parameters
popt = dadi_opt_func(p0=p_init, data=sfs, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=verbose, maxiter=maxiter, full_output=full_output)
return popt
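Before mapping run_dadi across the engines, it can be checked with a single sequential call in the local session. This requires the globals the function relies on (dadi_opt_func, sfs, perturb, fold, maxiter, verbose, full_output) to be defined locally as well; the call below is only a sketch and is left commented out.
In [ ]:
# local smoke test of run_dadi (sketch; assumes the engine globals are also set locally)
# dadi_opt_func = dadi.Inference.optimize_log
# sfs = fs_ery
# perturb, fold, maxiter, verbose, full_output = False, 1, 10, 0, True
# run_dadi( (1.0, 0.5) )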
In [15]:
from itertools import product
In [16]:
%%px
# set up global variables on engines required for run_dadi function call
dadi_opt_func = dadi.Inference.optimize_log # uses gradient based BFGS algorithm
sfs = fs_ery # use ERY spectrum
perturb = False
fold = 1
maxiter = 10 # run a maximum of 10 iterations
verbose = 0
full_output = True # need to have full output to get the warnflags (see below)
In [17]:
%px %who
In [31]:
# run optimisations with all combinations of starting values
#ar_ery = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [34]:
ar_ery.progress
Out[34]:
There were 196 combinations of starting values.
In [35]:
ar_ery.ready()
Out[35]:
In [36]:
ar_ery.wall_time
Out[36]:
In [25]:
from collections import defaultdict
In [38]:
warnflag = defaultdict(int)
for res in ar_ery:
if res[6] == 1:
warnflag[1] +=1
elif res[6] == 2:
warnflag[2] += 1
elif res[6] == 0:
warnflag[0] += 1
else:
warnflag[999] +=1
In [39]:
print "success", warnflag[0]
print "Maximum number of iterations exceeded.", warnflag[1]
print "Gradient and/or function calls not changing.", warnflag[2]
print "unknown flag", warnflag[999]
In [21]:
from scipy.optimize import fmin_bfgs
In [22]:
# this is the function that dadi calls
?fmin_bfgs
# the warnflag has index 6 in the returned array
Only 21 of the 196 starting parameter combinations led to a successful optimisation. I could gain another 28 successful runs if I allowed enough iterations. However, I don't know what to do about the 147 combinations for which the gradient didn't change. Maybe they start in a very flat part of the likelihood surface?
In [45]:
from itertools import izip
In [48]:
for p_init, popt in izip(product(p0_nu, p0_T), ar_ery):
if popt[6] == 0:
print p_init, popt[0], popt[6]
Of the runs that did return an optimisation result, most simply returned the initial parameter values.
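A rough way to quantify this (assuming perturb=False, so the exact starting values were passed to the optimiser) is to count how many of the successful runs returned their starting values essentially unchanged:
In [ ]:
# count successful BFGS runs whose optimised parameters equal their starting values
unchanged = 0
for p_init, popt in izip(product(p0_nu, p0_T), ar_ery):
    if popt[6] == 0 and np.allclose(p_init, popt[0]):
        unchanged += 1
print unchanged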
In [20]:
?dadi.Inference.optimize_lbfgsb
In [ ]:
?dadi.Inference.optimize_grid
In [21]:
?dadi.Inference.optimize_log
In [22]:
?dadi.Inference.optimize_log_fmin
In [24]:
?dadi.Inference.optimize_log_powell
I have decided to change to the Nelder-Mead optimisation algorithm.
In [25]:
%%px
# use Nelder-Mead algorithm
dadi_opt_func = dadi.Inference.optimize_log_fmin
In [72]:
# do same optimisation as above, but now with Nelder-Mead algorithm
#ar_ery = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [82]:
ar_ery.progress
Out[82]:
In [90]:
ar_ery.wall_time
Out[90]:
Note that Nelder-Mead is much slower.
In [23]:
%psource dadi.Inference.optimize_log_fmin
In [ ]:
warnflag = defaultdict(int)
for res in ar_ery:
if res[4] == 1:
warnflag[1] +=1
elif res[4] == 2:
warnflag[2] += 1
elif res[4] == 0:
warnflag[0] += 1
else:
warnflag[999] +=1
print "success", warnflag[0]
print "Maximum number of function evaluations made.", warnflag[1]
print "Maximum number of iterations reached.", warnflag[2]
print "unknown flag", warnflag[999]
None of the runs converged before hitting the limit on the number of iterations, so no usable results were returned.
In [91]:
cl[:]['maxiter']
Out[91]:
maxiter is still set to 10 on all engines.
In [92]:
%%px
# increase maximum number of iterations
maxiter = 100
In [93]:
# run optimisation as above, but now with up to 100 iterations per optimisation
#ar_ery = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [97]:
ar_ery.progress
Out[97]:
In [103]:
ar_ery.wall_time/60
Out[103]:
This took 17 minutes to finish.
In [146]:
def get_flag_count(ar, NM=True):
"""
ar: asyncresult object from BFGS or Nelder-Mead optimisation
"""
if NM: # if ar from Nelder-Mead
i = 4 # the warnflag is reported at index position 4 in the output array
else: # ar from BFGS optimisation
i = 6
warnflag = defaultdict(int)
for res in ar:
if res[i] == 1:
warnflag[1] +=1
elif res[i] == 2:
warnflag[2] += 1
elif res[i] == 0:
warnflag[0] += 1
else:
warnflag[999] +=1
if NM:
print "success", warnflag[0]
print "Maximum number of function evaluations made.", warnflag[1]
print "Maximum number of iterations reached.", warnflag[2]
print "unknown flag", warnflag[999]
else:
print "success", warnflag[0]
print "Maximum number of iterations exceeded.", warnflag[1]
print "Gradient and/or function calls not changing.", warnflag[2]
print "unknown flag", warnflag[999]
In [ ]:
get_flag_count(ar_ery, NM=True)
This looks much better. I don't want to increase maxiter even more, due to the already long running time.
In [102]:
for p_init, popt in izip(product(p0_nu, p0_T), ar_ery):
print p_init, popt[0], popt[4]
In [130]:
# get optimised parameters (if optimisation was successful)
popt = [p[0] for p in ar_ery if p[4] == 0]
len(popt)
Out[130]:
In [26]:
# create parallel function with load balancing
@lbview.parallel(block=True)
def get_ll(p):
"""
p: parameter combination
First calculates the expected model SFS for parameter combination p.
Then returns the log likelihood of the observed SFS given the expected SFS.
"""
expected_sfs = func_ex(p, ns, pts_l)
return dadi.Inference.ll_multinom(expected_sfs, sfs)
# expected_sfs does not need to be folded, ll_multinom does that automatically
# make sure that sfs points to right spectrum
In [27]:
cl[0]['sfs']
Out[27]:
In [155]:
ar_ery.get()
Out[155]:
In [156]:
# get log likelihood values for all parameters returned
ll = get_ll.map([p[0] for p in ar_ery])
In [157]:
len(ll)
Out[157]:
In [168]:
# get all starting parameter combinations, corresponding optimised parameters and log likelihood values
# if the optimisation was successful:
rows = [(p0[0], p0[1], xopt[0][0], xopt[0][1], logL) for p0, xopt, logL in izip(product(p0_nu, p0_T), ar_ery, ll) if xopt[4] == 0]
In [169]:
len(rows)
Out[169]:
In [137]:
import pandas as pd
In [174]:
df = pd.DataFrame(data=rows, columns=['nu_0', 'T_0', 'nu_opt', 'T_opt', 'logL'])
In [175]:
df.sort_values(by='logL', ascending=False)
Out[175]:
Dadi wants to infer a very drastic bottleneck in the very recent past. The optimisation hits the lower bound on $T$ that I set above.
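A quick check (a sketch using the bounds set above) of how many of the successfully optimised parameter combinations lie essentially on the lower bound:
In [ ]:
# count optimised parameter combinations that sit (within 1%) on the lower bound
on_bound = 0
for nu, T in (p[0] for p in ar_ery if p[4] == 0):
    if np.isclose(T, lower_bound[1], rtol=0.01) or np.isclose(nu, lower_bound[0], rtol=0.01):
        on_bound += 1
print on_bound, "of", len([p for p in ar_ery if p[4] == 0]), "successful runs hit the lower bound"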
In [28]:
%%px
# further reduce lower bound on T (on all engines)
lower_bound = [1e-4, 1e-6]
In [29]:
# check maxiter setting on engines
cl[:]['maxiter']
Out[29]:
In [30]:
cl[:]['lower_bound']
Out[30]:
In [31]:
cl[:]['dadi_opt_func']
Out[31]:
Still set to use the Nelder-Mead algorithm.
In [32]:
cl[0]['sfs']
Out[32]:
In [184]:
# rerun optimisation from above with reduced lower bound for T
#ar_ery = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [185]:
ar_ery.progress
Out[185]:
In [192]:
ar_ery.wall_time/60
Out[192]:
It took more than 1 hour to run this on 11 cores!
In [32]:
import dill
In [207]:
# save optimisation results to file
with open("1D_two_epoch_opt_res_ERY.dill", "w") as fh:
dill.dump(list(ar_ery.get()), fh) # unfortunately, dill cannot save the ar_ery object
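In a later session, the saved results can be restored from the dill file (only the plain list of result tuples is recovered, not the AsyncResult object itself):
In [ ]:
# restore the list of optimisation result tuples saved above
with open("1D_two_epoch_opt_res_ERY.dill") as fh:
    opt_res_ery = dill.load(fh)
len(opt_res_ery)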
In [189]:
get_flag_count(ar_ery)
Optimisations starting from 51 of the initial parameter combinations did not converge before the maximum number of iterations (100) was reached.
In [ ]:
# get log likelihood values for all parameters returned
ll = get_ll.map([p[0] for p in ar_ery])
In [ ]:
# create rows for dataframe
rows = [(p0[0], p0[1], xopt[0][0], xopt[0][1], logL) for p0, xopt, logL in izip(product(p0_nu, p0_T), ar_ery, ll) if xopt[4] == 0]
In [ ]:
# create dataframe
df = pd.DataFrame(data=rows, columns=['nu_0', 'T_0', 'nu_opt', 'T_opt', 'logL'])
In [198]:
# look and see
df.sort_values(by='logL', ascending=False).head(40)
Out[198]:
In [199]:
cl[0]['lower_bound']
Out[199]:
The optimisations with the highest likelihood are hitting the lower bound of the parameter range that I've set. The time parameter $T$ that dadi infers with the highest likelihood is too low to make sense: it needs to be multiplied by $2N_{ref}$ to get the time in generations, and with any reasonable ancestral effective population size this implies an extremely recent event. That event is inferred to be a very strong bottleneck. I don't think these values can possibly make sense.
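To illustrate the time scale, dadi's $T$ (in units of $2N_{ref}$ generations) can be converted into generations for an assumed ancestral effective population size; the $N_{ref}$ below is purely hypothetical.
In [ ]:
# illustrative conversion only; N_ref is a hypothetical ancestral effective size
N_ref = 1e6
T_best = df.sort_values(by='logL', ascending=False).iloc[0]['T_opt']
print "inferred time of size change: %.0f generations" % (T_best * 2 * N_ref)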
In [222]:
p_opt = np.array( df.iloc[0, 2:4] )
p_opt
Out[222]:
In [223]:
# calculate the best-fit model SFS given the parameter combination with the highest likelihood
expected_sfs = func_ex(p_opt, ns, pts_l)
expected_sfs.fold()
Out[223]:
In [226]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [12.0, 10.0]
print(p_opt)
dadi.Plotting.plot_1d_comp_multinom(expected_sfs.fold()[:19], fs_ery[:19], residual='linear')
Does this model provide a better fit than a standard neutral model?
In [227]:
# create link to function that specifies a standard neutral model
snm = dadi.Demographics1D.snm
# make extrapolating function
snm_ex = dadi.Numerics.make_extrap_log_func(snm)
In [228]:
?snm
In [238]:
# calculate the expected SFS under the standard neutral model (no free demographic parameters)
expected_sfs = snm_ex(0, ns, pts_l)
# calculate the log likelihood of the observed SFS given the expected model SFS
# no need to fold the model spectrum; this is done automatically when the data is folded
dadi.Inference.ll_multinom(expected_sfs, fs_ery)
Out[238]:
In [239]:
# calculate the expected SFS given the best-fit parameter combination
expected_sfs = func_ex(p_opt, ns, pts_l)
# calculate the log likelihood of the observed SFS given the expected model SFS
# no need to fold the model spectrum; this is done automatically when the data is folded
dadi.Inference.ll_multinom(expected_sfs, fs_ery)
Out[239]:
Compared to a standard neutral model, the two-epoch model with the best-fitting parameter values does provide a much better fit to the data spectrum. However, it also requires two (instead of zero) parameters, and the inferred parameter values cannot be trusted since they are hitting the boundary of sensible values.
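A rough way to account for the difference in the number of parameters is an AIC-style comparison. This treats the composite likelihood as a true likelihood, which is optimistic for linked SNPs, so the sketch below is only indicative.
In [ ]:
# AIC-style comparison (sketch): smaller is better; k = number of free demographic parameters
ll_snm = dadi.Inference.ll_multinom(snm_ex(0, ns, pts_l), fs_ery)
ll_2ep = dadi.Inference.ll_multinom(func_ex(p_opt, ns, pts_l), fs_ery)
print "AIC snm      :", -2*ll_snm          # k = 0
print "AIC two_epoch:", -2*ll_2ep + 2*2    # k = 2 (nu, T)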
In [193]:
%%px
dadi_opt_func = dadi.Inference.optimize_log # BFGS
In [195]:
ar_ery_BFGS = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [200]:
ar_ery_BFGS.progress
Out[200]:
In [203]:
get_flag_count(ar_ery_BFGS, NM=False)
Clearly, the BFGS algorithm cannot be used for this problem. It is not limited by the maximum number of iterations, but by the lack of a gradient in the likelihood function.
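To see whether the likelihood surface really is locally flat, one could probe it around one of the failing starting values; the values below are arbitrary and only meant as a sketch.
In [ ]:
# probe the likelihood surface around a failing start value (arbitrary example values)
for nu in [1e-3, 1.01e-3, 1.1e-3]:
    model = func_ex([nu, 1e-4], ns, pts_l)
    print nu, dadi.Inference.ll_multinom(model, fs_ery)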
In [209]:
%%px
# set up global variables on engines required for run_dadi function call
dadi_opt_func = dadi.Inference.optimize_log # BFGS
sfs = fs_par # use spectrum from PAR in optimisation
perturb = False
fold = 1
maxiter = 10 # keep it low for now
verbose = 0
full_output = True
In [208]:
# create range of starting values for nu and T, evenly distributed in log space
p0_nu = np.logspace(-3, 3, base=10.0, num=10)
p0_T = np.logspace(-4, np.log10(4), base=10, num=10)
In [210]:
ar_par_BFGS = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [213]:
ar_par_BFGS.progress
Out[213]:
In [214]:
get_flag_count(ar_par_BFGS, NM=False)
Again, the BFGS algorithm cannot be used. The majority of initial parameter combinations don't converge because of a lack of gradient in the likelihood surface.
In [215]:
%%px
dadi_opt_func = dadi.Inference.optimize_log_fmin # Nelder-Mead
maxiter = 100 # provide plenty of iterations
In [216]:
ar_par_NM = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [240]:
ar_par_NM.progress
Out[240]:
In [241]:
ar_par_NM.elapsed/60
Out[241]:
This took 40 minutes to complete, with most of that time spent on a single parameter combination.
In [242]:
get_flag_count(ar_par_NM, NM=True)
In [244]:
with open("1D_two_epoch_opt_res_PAR.dill", "w") as fh:
dill.dump(list(ar_par_NM.get()), fh)
Which initial parameter combinations did not converge, i.e. for which did the maximum number of iterations (100) not suffice? These are also likely to be the ones that ran for so long.
In [248]:
for p0, xopt in izip(product(p0_nu, p0_T), ar_par_NM):
if xopt[4] == 2:
print p0
In [ ]:
# get log likelihood values for all parameters returned
ll = get_ll.map([p[0] for p in ar_par_NM])
In [249]:
# create rows for dataframe
rows = [(p0[0], p0[1], xopt[0][0], xopt[0][1], logL) for p0, xopt, logL in izip(product(p0_nu, p0_T), ar_par_NM, ll) if xopt[4] == 0]
In [251]:
df = pd.DataFrame(data=rows, columns=['nu_0', 'T_0', 'nu_opt', 'T_opt', 'logL'])
In [252]:
df.sort_values(by='logL', ascending=False)
Out[252]:
The optimal parameter combination returned is hitting the upper parameter bound that I set.
In [253]:
cl[0]['upper_bound']
Out[253]:
In [255]:
%%px
upper_bound = [1e3, 5]
In [256]:
ar_par_NM = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [276]:
ar_par_NM.progress
Out[276]:
In [277]:
get_flag_count(ar_par_NM, NM=True)
In [278]:
for r in ar_par_NM:
if r[4] == 2:
print r[0]
In [305]:
x = [(p0, xopt[0], m['completed'] - m['started']) for p0, xopt, m in izip(product(p0_nu, p0_T), ar_par_NM, ar_par_NM.metadata)]
y = sorted(x, key=lambda x: x[2])
for p in y:
print "nu_0 = %.6f, T_0 = %.6f \t nu = %.6f, T = %.6f" % (p[0][0], p[0][1], p[1][0], p[1][1]), "\t", p[2]
This has printed the initial parameter combinations, the final parameter values and the processing time for each combination. Two jobs took a very long time to complete. Note that these are not the ones that exceeded maxiter (see above). Neither of the two long-running jobs hits the boundaries of the parameter space that can be searched.
In [294]:
cl[:]['upper_bound']
Out[294]:
In [295]:
cl[:]['lower_bound']
Out[295]:
In [296]:
# get log likelihood values for all parameters returned
ll = get_ll.map([p[0] for p in ar_par_NM])
In [297]:
# create rows for dataframe, without the parameter values on non-convergent runs
rows = [(p0[0], p0[1], xopt[0][0], xopt[0][1], logL) for p0, xopt, logL in izip(product(p0_nu, p0_T), ar_par_NM, ll) if xopt[4] == 0]
In [298]:
df = pd.DataFrame(data=rows, columns=['nu_0', 'T_0', 'nu_opt', 'T_opt', 'logL'])
In [299]:
df.sort_values(by='logL', ascending=False)
Out[299]:
The most likely parameter combinations are hitting the upper parameter bound for the time parameter again. I am not sure whether an even higher $T$ can reasonably be assumed.
The expected time to the most recent common ancestor (MRCA) in a neutral genealogy is:
$$ E\big[T_{MRCA}\big] = 2 \Big(1-\frac{1}{n}\Big) $$
measured in units of $2N_e$ generations. Note that $E[T_{MRCA}]$ is already close to its large-sample limit of 2 for moderate sample sizes.
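This expectation follows from summing the coalescent waiting times $T_k$ during which the genealogy has $k$ lineages, a telescoping sum:
$$ E\big[T_{MRCA}\big] = \sum_{k=2}^{n} E\big[T_k\big] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2\sum_{k=2}^{n}\Big(\frac{1}{k-1}-\frac{1}{k}\Big) = 2\Big(1-\frac{1}{n}\Big) $$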
In [37]:
# the above formula with the sample size used here yields:
2 * (1-1.0/ns[0])
Out[37]:
See figure 3.4, p. 79, in Wakeley (2009) for the distribution of $T_{MRCA}$. For $n=36$ it has a mode close to 1.2 and an expected value of 1.94. Values of $T_{MRCA}$ greater than 4 are very unlikely under a standard coalescent model, but may be more likely under models including population expansion or gene flow from another population.
In [45]:
%%px
upper_bound = [1e3, 9]
In [320]:
cl[0]['upper_bound']
Out[320]:
In [321]:
cl[0]['lower_bound']
Out[321]:
In [310]:
ar_par_NM = lbview.map(run_dadi, product(p0_nu, p0_T), block=False, order=True)
In [311]:
ar_par_NM.progress
Out[311]:
In [312]:
get_flag_count(ar_par_NM, NM=True)
In [322]:
for p0, r in izip(product(p0_nu, p0_T), ar_par_NM):
if r[4] == 2: # if run hit maxiter limit
print p0, r[0]
In [326]:
x = [(p0, xopt[0], m['completed'] - m['started']) for p0, xopt, m in izip(product(p0_nu, p0_T), ar_par_NM, ar_par_NM.metadata)]
y = sorted(x, key=lambda x: x[2], reverse=True)
for i, p in enumerate(y):
if i > 1: break
print "nu_0 = %.6f, T_0 = %.6f \t nu = %.6f, T = %.6f" % (p[0][0], p[0][1], p[1][0], p[1][1]), "\t", p[2]
I want to exclude these starting values, since the runs starting from them take too long to complete.
In [335]:
excl = [ex[0] for ex in y[:2]]
excl[0] == excl[0]
Out[335]:
In [342]:
combs = [comb for comb in product(p0_nu, p0_T) if not (comb == excl[0] or comb == excl[1])]
print excl[1] in combs
print excl[0] in combs
print combs[22] in combs
OK, I think the new parameter list does not include the combinations that led to such extreme running times.
In [343]:
ar_par_NM = lbview.map(run_dadi, combs, block=False, order=True)
In [347]:
ar_par_NM.progress
Out[347]:
In [349]:
ar_par_NM.elapsed/60
Out[349]:
11 instead of 40 minutes.
In [348]:
get_flag_count(ar_par_NM, NM=True)
In [350]:
# get log likelihood values for all parameters returned
ll = get_ll.map([p[0] for p in ar_par_NM])
In [351]:
# create rows for dataframe, without the parameter values on non-convergent runs
rows = [(p0[0], p0[1], xopt[0][0], xopt[0][1], logL) for p0, xopt, logL in izip(product(p0_nu, p0_T), ar_par_NM, ll) if xopt[4] == 0]
In [352]:
df = pd.DataFrame(data=rows, columns=['nu_0', 'T_0', 'nu_opt', 'T_opt', 'logL'])
In [353]:
df.sort_values(by='logL', ascending=False)
Out[353]:
Still hitting the boundary of $T$. Something is not working here.
I am not sure whether it is worth continuing.
Ryan Gutenkunst:
This indicates that dadi is having trouble fitting your data. One possibility is that the history of the population includes important events that aren’t in your models. Another possibility is that your data is biased in ways that aren’t in your models. For example, maybe you’re missing calls for rare alleles.
The data is clearly biased or at least very noisy. However, the models that include only a single event of growth or decline might also be too simplistic. So, before I go on and try to further clean the data, I should try the two remaining, more complex 1D models that come with dadi: bottlegrowth and three_epoch. If I cannot reliably estimate parameter values for those models, then it is most likely a data issue.
In [34]:
?dadi.Demographics1D.bottlegrowth
In [35]:
?dadi.Demographics1D.three_epoch
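Judging from their docstrings, bottlegrowth takes parameters (nuB, nuF, T) (an instantaneous size change followed by exponential growth), and three_epoch takes (nuB, nuF, TB, TF) (two successive constant-size epochs). Below is a minimal sketch of how their extrapolating versions could be set up, mirroring the two-epoch setup above; the parameter orders should be verified against the actual docstrings.
In [ ]:
# sketch: wrap the two more complex 1D models like two_epoch above
func_bg = dadi.Demographics1D.bottlegrowth      # params: (nuB, nuF, T)
func_bg_ex = dadi.Numerics.make_extrap_log_func(func_bg)
func_3ep = dadi.Demographics1D.three_epoch      # params: (nuB, nuF, TB, TF)
func_3ep_ex = dadi.Numerics.make_extrap_log_func(func_3ep)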
These two models will be analysed in separate notebooks.
In [ ]: