I am going to use the same parallel framework as for the bottlegrowth model, i.e. using the subprocess module to spawn separate processes for different optimisations and using pgrep together with kill to abort jobs that run too long. For smaller batches of optimisations, I am going to use ipyparallel.
I have written a Python script called run_dadi.py, which takes many command line arguments and, in the following parallel framework, replaces the run_dadi function used previously.
In [1]:
%less run_dadi.py
In [2]:
! ./run_dadi.py -h
In [1]:
import multiprocessing
import subprocess32 as subprocess
import time, signal, os, pickle
import numpy as np
In [4]:
multiprocessing.cpu_count()
Out[4]:
In [2]:
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [55]:
?dadi.Demographics1D.three_epoch
The built-in three_epoch model specifies a piecewise-constant history, i.e. with instantaneous rather than gradual population size changes. At $(TB+TF) \times 2N_{ref}$ generations in the past the ancient population underwent a size change, stayed at the new size for $TB \times 2N_{ref}$ generations and then underwent a second size change, $TF \times 2N_{ref}$ generations in the past, to the contemporary population size. The model therefore has two population size parameters, $\nu_B$ and $\nu_F$, as well as two time parameters, $TB$ and $TF$.
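To make the parameterisation concrete, here is a minimal sketch (not part of the original analysis) of how an expected spectrum under this model could be computed and compared to data; the parameter values are purely illustrative, and the file name, sample size and grid sizes are taken from further below.
In [ ]:
# sketch only: compute the expected SFS under three_epoch for one illustrative parameter set
fs = dadi.Spectrum.from_file('ERY.unfolded.sfs').fold()           # observed folded spectrum (loaded further below)
model_func = dadi.Numerics.make_extrap_log_func(dadi.Demographics1D.three_epoch)
p_example = [2.0, 0.25, 0.5, 0.5]                                 # (nuB, nuF, TB, TF): double, then shrink to 1/4 of N_ref
expected = model_func(p_example, fs.sample_sizes, [50, 60, 70]).fold()
print dadi.Inference.ll_multinom(expected, fs)                    # log-likelihood of the data under this model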
In [5]:
# set lower and upper bounds to nuB, nuF and T
upper_bound = [1e4, 1e4, 4, 4]
lower_bound = [1e-4, 1e-4, 0, 0]
In [6]:
# create range of starting values evenly distributed in log space
p0_nuB = np.logspace(-3, 3, base=10.0, num=6)
p0_nuF = np.logspace(-3, 3, base=10.0, num=6)
p0_TF = np.logspace(-4, np.log10(4), base=10, num=6)
p0_TB = np.logspace(-4, np.log10(4), base=10, num=6)
In [7]:
print "There are %d starting parameter values" % (6**4)
This is already an enormous number of parameter combinations. I think I will need to keep maxiter low, so that optimisations that do not converge end quickly once they hit the maximum number of iterations. I will also frequently have to prune long-running optimisations from the available cores. This can be done with the following command:
pgrep -of "python ./run_dadi.py" | xargs kill
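A hypothetical helper (not part of run_dadi.py) that kills only matching processes older than a given threshold could look like the following sketch; it assumes a Linux ps that understands the etimes format specifier.
In [ ]:
import subprocess

def prune_long_runs(max_seconds=3600, pattern="python ./run_dadi.py"):
    """Kill all processes matching pattern that have been running longer than max_seconds."""
    try:
        pids = subprocess.check_output(["pgrep", "-f", pattern]).split()
    except subprocess.CalledProcessError:      # pgrep exits non-zero if nothing matches
        return
    for pid in pids:
        # elapsed running time of the process in seconds
        etime = int(subprocess.check_output(["ps", "-o", "etimes=", "-p", pid]))
        if etime > max_seconds:
            subprocess.call(["kill", pid])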
The following variables will be passed as string values to run_dadi.py:
In [8]:
path_to_spectrum_file = "dadiExercises/ERY.FOLDED.sfs.dadi_format"
dadi_model = "dadi.Demographics1D.three_epoch"
upper = str(upper_bound)
lower = str(lower_bound)
dadi_opt_func = "dadi.Inference.optimize_log_fmin" # Nelder-Mead
stub = "ERY_three_epoch_NM"
maxiter = 10 # keep maxiter low, but not too low to allow convergence of some optimisations
In [11]:
%run run_dadi.py -h
In [9]:
from itertools import product
In [10]:
# check creation of command line
for i, p_init in enumerate(product(p0_nuB, p0_nuF, p0_TF, p0_TB)):
cmd = "./run_dadi.py -p %s -m %s -u '%s' -l '%s' -d %s -s %s --maxiter %s -i '%s'" \
% (path_to_spectrum_file, dadi_model, upper, lower, dadi_opt_func, stub, str(maxiter), str(p_init))
print cmd
if i >= 0: break
Next, I need to define an executor class that takes care of submitting jobs as slots become available. This is taken from Tiago Antao's Cookbook:
In [12]:
class executor:
def __init__(self, limit):
self.limit = limit
self.ncores = multiprocessing.cpu_count()
self.running = []
self.finished = 0
def submit(self, cmd, p_init):
self.wait()
if hasattr(self, 'out'):
out = self.out
else:
out = '/dev/null'
if hasattr(self, 'err'):
err = self.err
else:
err = '/dev/null'
if err == 'stderr':
errSt = ''
else:
errSt = '2> ' + err
# this is where the magic happens:
proc = subprocess.Popen('%s > %s %s' % (cmd, out, errSt), shell=True)
self.running.append(proc)
#
if hasattr(self, 'out'):
del self.out
if hasattr(self, 'err'):
del self.err
def wait(self, for_all=False):
self.clean_done()
#numWaits = 0
if self.limit > 0 and type(self.limit) == int:
cond = 'len(self.running) >= self.ncores - self.limit'
elif self.limit < 0:
cond = 'len(self.running) >= - self.limit'
else:
cond = 'len(self.running) >= self.ncores * self.limit'
while eval(cond) or (for_all and len(self.running) > 0):
time.sleep(1)
self.clean_done() # updates self.running, removes finished jobs from the running queue
#numWaits += 1
def clean_done(self):
terminated = []
for i, p in enumerate(self.running):
if p.poll() is not None: # poll() returns None while the process is still running, so not None means it has finished
terminated.append(i)
for idx in reversed(terminated):
del self.running[idx]
self.finished += 1
def progress(self):
self.clean_done()
print self.finished
In [ ]:
E = executor(limit=-18) # use up to 18 cores
# this will block
for i, p_init in enumerate(product(p0_nuB, p0_nuF, p0_TF, p0_TB)):
cmd = "./run_dadi.py -p %s -m %s -u '%s' -l '%s' -d %s -s %s --maxiter %s -i '%s'" \
% (path_to_spectrum_file, dadi_model, upper, lower, dadi_opt_func, stub, str(maxiter), str(p_init))
E.err = "stderr"
E.submit(cmd, p_init)
After 191 output files had been created, I loaded those pickle files into another notebook (because the command above blocks the current session). A check of the returned optimisation flags showed that all 191 optimisations had reached the maximum number of iterations, i.e. none had converged.
After 414 optimisations without a single success, I decided to abort this sweep through the parameter space. Given its size (even with just 6 values per parameter), increasing the maximum number of allowed iterations per optimisation would lead to prohibitive running times for what is still an exploratory analysis.
I think my only option for finding optimal parameter values for this model is to start from some reasonable parameter combination, randomly perturb it and run optimisations from those slightly perturbed initial values.
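dadi provides dadi.Misc.perturb_params for the perturbation step. A quick illustration (my understanding is that each parameter is perturbed by up to a factor of 2**fold, uniformly in log space, while respecting the bounds):
In [ ]:
p0 = [1., 1., 1., 1.]                        # (nuB, nuF, TB, TF)
upper_bound = [1e4, 1e4, 4, 4]
lower_bound = [1e-4, 1e-4, 0, 0]
for _ in range(3):
    print dadi.Misc.perturb_params(p0, fold=2, upper_bound=upper_bound, lower_bound=lower_bound)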
Let's first try to do this with ipyparallel.
In [1]:
from ipyparallel import Client
cl = Client()
cl.ids
Out[1]:
In [2]:
%%px --local
# run whole cell on all engines as well as in the local IPython session
import numpy as np
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [5]:
%ll
In [6]:
! cat ERY.unfolded.sfs
In [3]:
%%px --local
# import 1D spectrum of ery on all engines:
fs_ery = dadi.Spectrum.from_file('ERY.unfolded.sfs').fold()
# import 1D spectrum of par on all engines:
fs_par = dadi.Spectrum.from_file('PAR.unfolded.sfs').fold()
In [4]:
%%px --local
ns = fs_ery.sample_sizes # both populations have the same sample size
fs_ery.pop_ids = ['ery']
fs_par.pop_ids = ['par']
# setting the smallest grid size slightly larger than the largest population sample size (36)
pts_l = [50, 60, 70]
In [5]:
%%px --local
# create link to function that specifies the model
func = dadi.Demographics1D.three_epoch
# create extrapolating version of the model function
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [6]:
%%px
# set lower and upper bounds to nuB, nuF and T
upper_bound = [1e4, 1e4, 4, 4]
lower_bound = [1e-4, 1e-4, 0, 0]
In [7]:
# create load balanced view of engines
lbview = cl.load_balanced_view()
In [12]:
def run_dadi(p_init): # for the function to be called with map, it needs to have one input variable
"""
p_init: initial parameter values to run optimisation from
"""
if perturb == True:
p_init = dadi.Misc.perturb_params(p_init, fold=fold,
upper_bound=upper_bound, lower_bound=lower_bound)
# note upper_bound and lower_bound variables are expected to be in the namespace of each engine
# run optimisation of parameters
popt = dadi_opt_func(p0=p_init, data=sfs, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=verbose, maxiter=maxiter, full_output=full_output)
return p_init, popt
In [26]:
%%px
# set up global variables on engines required for run_dadi function call
dadi_opt_func = dadi.Inference.optimize_log_fmin # uses Nelder-Mead algorithm
sfs = fs_ery # use ERY spectrum
perturb = True
fold = 2
maxiter = 20 # run a maximum of 20 iterations
verbose = 0
full_output = True # need to have full output to get the warnflags (see below)
In [8]:
from itertools import repeat
In [27]:
# specify the initial parameter values; they will be randomly perturbed by up to a factor of 4 (2**fold, with fold = 2)
p0 = [1, 1, 1, 1]
# these parameter values correspond to no size changes, with epoch boundaries at 1 and 2 (x 2N_ref) generations in the past
ar_ery = lbview.map(run_dadi, repeat([1, 1, 1, 1], 40), block=False, order=True)
In [28]:
ar_ery.wall_time
Out[28]:
In [9]:
def get_flag_count(out, NM=True):
"""
out: list of tuples, each containing p_init and popt + additional info, including warnflags
as produced by run_dadi.py
"""
from collections import defaultdict
if NM: # if ar from Nelder-Mead
i = 4 # the warnflag is reported at index position 4 in the output array
else: # ar from BFGS optimisation
i = 6
warnflag = defaultdict(int)
for res in out:
if res[1][i] == 1: # res[0] is p_init, res[1] is the full optimisation output
warnflag[1] +=1
elif res[1][i] == 2:
warnflag[2] += 1
elif res[1][i] == 0:
warnflag[0] += 1
else:
warnflag[999] +=1
if NM:
print "success", warnflag[0]
print "Maximum number of function evaluations made.", warnflag[1]
print "Maximum number of iterations reached.", warnflag[2]
print "unknown flag", warnflag[999]
else:
print "success", warnflag[0]
print "Maximum number of iterations exceeded.", warnflag[1]
print "Gradient and/or function calls not changing.", warnflag[2]
print "unknown flag", warnflag[999]
In [30]:
get_flag_count(ar_ery, NM=True)
OK, I obviously need to increase the number of iterations. Since the run above was extremely fast, I am bold enough to increase maxiter to 100.
In [31]:
%%px
maxiter = 100
In [32]:
cl[0]['maxiter']
Out[32]:
In [33]:
# run 40 optimisations with perturbed p0 and maxiter=100
ar_ery = lbview.map(run_dadi, repeat(p0, 40), block=False, order=False)
In [36]:
ar_ery.wall_time
Out[36]:
In [37]:
get_flag_count(ar_ery, NM=True)
In [38]:
for out in ar_ery:
if out[1][4] == 0:
print out[1][0], out[1][1]
In [11]:
import dill
In [141]:
# save optimisation results to file
dill.dump(list(ar_ery.get()), open("OUT_three_epoch/ERY_perturb_ar_ery.dill", "w"))
In [10]:
def flatten(array):
"""
Returns a flat list of the elements of all nested lists (or tuples)
****RECURSIVE****
"""
import numpy
res = []
for el in array:
if isinstance(el, (list, tuple, numpy.ndarray)):
res.extend(flatten(el))
continue
res.append(el)
return list( res )
In [42]:
successfull_popt = [flatten(out)[:9] for out in ar_ery if out[1][4] == 0]
In [12]:
import pandas as pd
In [44]:
df = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', '-logL'])
df.sort_values(by='-logL', ascending=True) # smaller is better
Out[44]:
Fairly different parameter combinations have practically identical likelihoods. A reduction of the contemporary population size to 1/4 of the ancient population size is quite consistently inferred ($\nu_F$). The ancient population size change is not inferred consistently.
I am going to try one more p0. In the two epoch model fitting, erythropus had shown an ancient increase in population size.
In [45]:
# specify the initial parameter values; they will be randomly perturbed by up to a factor of 4 (2**fold, with fold = 2)
p0 = [40, 0.4, 0.5, 0.5]
ar_ery_1 = lbview.map(run_dadi, repeat(p0, 40), block=False, order=True)
In [46]:
ar_ery_1.ready()
Out[46]:
In [49]:
get_flag_count(ar_ery_1, NM=True)
Aha!
In [48]:
successfull_popt = [flatten(out)[:9] for out in ar_ery_1 if out[1][4] == 0]
df = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', '-logL'])
df.sort_values(by='-logL', ascending=True) # smaller is better
Out[48]:
OK, it looks like several quite different population histories are about equally likely. These parameter values suggest that erythropus first underwent a population increase to $\nu_B$ times the ancient population size (ranging from 23$\times$ to $>$7000$\times$!), stayed at that size for $TB \times 2N_{ref}$ generations, then dramatically reduced its population size to a fraction $\nu_F$ of the ancient size and stayed there for the remaining $TF \times 2N_{ref}$ generations.
In [140]:
dill.dump(list(ar_ery_1.get()), open("OUT_three_epoch/ERY_perturb_ar_ery_1.dill", "w"))
It would be good to visualise these different scenarios.
Combine both data frames:
In [124]:
successfull_popt = [flatten(out)[:9] for out in ar_ery if out[1][4] == 0]
df_1 = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', '-logL'])
successfull_popt = [flatten(out)[:9] for out in ar_ery_1 if out[1][4] == 0]
df_2 = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', '-logL'])
df = df_1.append(df_2)
df.sort_values(by='-logL')
Out[124]:
In [126]:
# remove one parameter combination with slightly lower logL than the others
df = df.sort_values(by='-logL').head(-1)
In [127]:
df['TB+TF'] = pd.Series(df['TB_opt']+df['TF_opt'])
In [128]:
df.head(1)
Out[128]:
In [129]:
import matplotlib.pyplot as plt
In [130]:
%matplotlib inline
In [131]:
# extract columns from table
nuB = df.loc[:,'nuB_opt']
nuF = df.loc[:, 'nuF_opt']
Tb_Tf = df.loc[:, 'TB+TF']
TF = df.loc[:, 'TF_opt']
In [132]:
# calculate reference (ancient) population size from theta estimate (derived elsewhere by fitting a neutral spectrum)
theta_ery = 10617.328085456724
mu = 3e-9
L = fs_ery.data.sum() # this sums over all entries in the spectrum, including masked ones, i.e. it also counts invariable sites
print "The total sequence length for the ery spectrum is {0:,}.".format(int(L))
N_ref_ery = theta_ery/L/mu/4
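In other words, $N_{ref} = \frac{\theta}{4 \mu L}$, i.e. the last line simply rearranges $\theta = 4 N_{ref} \mu L$ for the ancestral effective population size.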
In [133]:
# turn nu into absolute Ne and T into generations
nuB_n = nuB*N_ref_ery
nuF_n = nuF*N_ref_ery
Tb_Tf_g = Tb_Tf*2*N_ref_ery
TF_g = TF*2*N_ref_ery
In [137]:
anc = [N_ref_ery] * len(nuB)
pres = [1] * len(nuB)
past = [max(Tb_Tf_g)+1000] * len(nuB)
In [139]:
plt.rcParams['figure.figsize'] = [12.0, 10.0]
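# each 'run' is a 12-tuple: the first 6 entries are the x coordinates (times in generations,
# from the past towards the present) and the last 6 are the matching population sizes,
# so each semilogy call below traces one piecewise-constant size history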
for run in zip(past, Tb_Tf_g, Tb_Tf_g, TF_g, TF_g, pres, anc, anc, nuB_n, nuB_n, nuF_n, nuF_n):
plt.semilogy(run[:6], run[6:])
plt.xlabel('generations in the past')
plt.ylabel('effective population size')
Out[139]:
In [14]:
%%px
# set up global variables on engines required for run_dadi function call
dadi_opt_func = dadi.Inference.optimize_log_fmin # uses Nelder-Mead algorithm
sfs = fs_par # use PAR spectrum
perturb = True
fold = 3 # perturb randomly by up to a factor of 8 (2**fold)
maxiter = 100 # run a maximum of 100 iterations
verbose = 0
full_output = True # need to have full output to get the warnflags (see below)
outname = "OUT_three_epoch/PAR_perturb" # set file name stub for opt. result files
In [103]:
def run_dadi(p_init): # for the function to be called with map, it needs to have one input variable
"""
p_init: initial parameter values to run optimisation from
"""
if perturb == True:
p_init = dadi.Misc.perturb_params(p_init, fold=fold,
upper_bound=upper_bound, lower_bound=lower_bound)
# note upper_bound and lower_bound variables are expected to be in the namespace of each engine
# run optimisation of parameters
popt = dadi_opt_func(p0=p_init, data=sfs, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=verbose, maxiter=maxiter, full_output=full_output)
# pickle to file
import dill
name = outname[:] # make copy of file name stub!
for p in p_init:
name += "_%.4f" % (p)
with open(name + ".dill", "w") as fh:
dill.dump((p_init, popt), fh)
return p_init, popt
In [16]:
# specify the initial parameter values; they will be randomly perturbed by up to a factor of 2**fold (here 8, since fold = 3)
p0 = [1, 1, 1, 1]
ar_par = lbview.map(run_dadi, repeat(p0, 20), block=False, order=True)
In [23]:
ar_par.progress
Out[23]:
In [25]:
ar_par.elapsed/60
Out[25]:
I had to kill the last two long-running jobs.
In [28]:
import glob
In [30]:
ar_par = []
for filename in glob.glob("OUT_three_epoch/PAR_perturb*.dill"):
ar_par.append(dill.load(open(filename)))
In [31]:
get_flag_count(ar_par, NM=True)
Six optimisations converged.
In [34]:
i = 4 # index of flag with NM algorithm
successfull_popt = [flatten(out)[:9] for out in ar_par if out[1][i] == 0]
df = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', 'logL'])
df.sort_values(by='logL', ascending=True)
Out[34]:
It would be nice to get a few more replicates for these results. I am going to use these optimal parameter combinations, perturb them randomly and use them as starting values for new optimisations.
In [35]:
df.loc[:, ['nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt']]
Out[35]:
In [40]:
# specify a new set of initial parameter values (the optimal sets repeated 3 times)
p0 = []
for _ in range(3):
p0.extend(np.array(df.loc[:, ['nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt']]))
p0
Out[40]:
In [41]:
%%px
fold = 2 # perturb by up to a factor of 4 (2**fold)
In [42]:
# run optimisations
ar_par_2 = lbview.map(run_dadi, p0, block=False, order=True)
In [48]:
ar_par_2.progress
Out[48]:
In [49]:
get_flag_count(ar_par_2, NM=True)
Wow! All 18 runs were successful?!
In [87]:
# add new output to previous output
successfull_popt.extend( [flatten(out)[:9] for out in ar_par_2 if out[1][4] == 0] )
# create data frame
df = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', 'logL'])
# sort data frame by the optimised contemporary population size (nuF_opt)
df.sort_values(by='nuF_opt', ascending=True)
Out[87]:
I am going to visualise these population size histories.
In [57]:
# add time for ancient size change
df['TB+TF'] = pd.Series(df['TB_opt']+df['TF_opt'])
In [58]:
# extract columns from table
nuB = df.loc[:,'nuB_opt']
nuF = df.loc[:, 'nuF_opt']
Tb_Tf = df.loc[:, 'TB+TF']
TF = df.loc[:, 'TF_opt']
In [59]:
# calculate reference (ancient) population size from theta estimate (derived elsewhere by fitting a neutral spectrum)
theta_par = 10632.738922047551
mu = 3e-9
L = fs_par.data.sum() # this sums over all entries in the spectrum, including masked ones, i.e. it also counts invariable sites
print "The total sequence length for the par spectrum is {0:,}.".format(int(L))
N_ref_par = theta_par/L/mu/4
In [60]:
# turn nu into absolute Ne and T into generations
nuB_n = nuB*N_ref_par
nuF_n = nuF*N_ref_par
Tb_Tf_g = Tb_Tf*2*N_ref_par
TF_g = TF*2*N_ref_par
In [61]:
anc = [N_ref_par] * len(nuB)
pres = [1] * len(nuB)
past = [max(Tb_Tf_g)+1000] * len(nuB)
In [62]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [12.0, 10.0]
for run in zip(past, Tb_Tf_g, Tb_Tf_g, TF_g, TF_g, pres, anc, anc, nuB_n, nuB_n, nuF_n, nuF_n):
plt.semilogy(run[:6], run[6:])
plt.xlabel('generations in the past')
plt.ylabel('effective population size')
Out[62]:
The ancient size change is generally inferred to be more than 1 million generations ago (even up to 10 million)?!
Is the strength of the first population size change, $(\frac{1}{\nu_B})^{TB}$, correlated with the strength of the second population size change, $(\frac{1}{\nu_F})^{TF}$?
In [90]:
plt.rcParams['font.size'] = 12.0
plt.plot((1.0/df['nuB_opt'])**df['TB_opt'], (1.0/df['nuF_opt'])**df['TF_opt'], 'bo')
plt.xlabel('strength of first size change')
plt.ylabel('strength of second size change')
Out[90]:
There is no correlation between the strength of the two size changes.
In [54]:
import dill
dump_this = ar_par
dump_this.extend(list(ar_par_2.get()))
dill.dump(dump_this, open("OUT_three_epoch/PAR_perturb_ar_par.dill", "w"))
In [55]:
p0 = [100, 1e-2, 1, 2] # start from more extreme initial parameter values
ar_par_3 = lbview.map(run_dadi, repeat(p0, 10), block = False, order=False)
In [67]:
ar_par_3.progress
Out[67]:
In [68]:
get_flag_count(ar_par_3, NM=True)
In [93]:
dill.dump(list(ar_par_3.get()), open("OUT_three_epoch/PAR_perturb_extreme_ar_par.dill", "w"))
In [70]:
# collect converged output from this run (kept separate from the previous output)
successfull_popt_3 = [flatten(out)[:9] for out in ar_par_3 if out[1][4] == 0]
# create data frame
df_3 = pd.DataFrame(data=successfull_popt_3, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', 'logL'])
# sort data frame by negative log likelihood
df_3.sort_values(by='logL', ascending=True)
Out[70]:
In [92]:
plt.plot(df_3['nuB_opt']**df_3['TB_opt'], (1.0/df_3['nuF_opt'])**df_3['TF_opt'], 'bo')
Out[92]:
I cannot see a correlation here either.
All optimisations consistently show a reduction of the contemporary population size relative to the ancient population size ($\nu_F$). The ancient population size change is much less clear and ranges from a reduction by one half to a doubling. Also, this event is inferred to have occurred quite far in the past, $(T_F + T_B) \times 2N_{ref}$ generations ago. This makes me think that the ancient size change cannot be inferred reliably. There only seems to be evidence for one size change.
My attempts to fit a two epoch model (i.e. a single size change) to the parallelus 1D SFS had previously failed because the inferred parameter values kept hitting ever more extreme parameter limits. Given that I was able to fit a 4-parameter model to the 1D SFS of parallelus reasonably well (even though it does not show strong evidence for two size changes), I would be surprised if it wasn't possible to fit the two epoch model to the parallelus spectrum.
In [1]:
%who
In [2]:
from ipyparallel import Client
cl = Client()
cl.ids
Out[2]:
In [3]:
%%px --local
# run whole cell on all engines as well as in the local IPython session
import numpy as np
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [4]:
%%px --local
# create link to function that specifies the model
func = dadi.Demographics1D.two_epoch
# create extrapolating version of the model function
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [5]:
%%px --local
# import 1D spectrum of ery on all engines:
fs_ery = dadi.Spectrum.from_file('ERY.unfolded.sfs').fold()
# import 1D spectrum of par on all engines:
fs_par = dadi.Spectrum.from_file('PAR.unfolded.sfs').fold()
In [6]:
%%px
# set up global variables on engines required for run_dadi function call
dadi_opt_func = dadi.Inference.optimize_log_fmin # uses Nelder-Mead algorithm
sfs = fs_par # use PAR spectrum
perturb = True
fold = 2 # perturb randomly by up to a factor of 4 (2**fold)
maxiter = 100 # run a maximum of 100 iterations
verbose = 0
full_output = True # need to have full output to get the warnflags (see below)
outname = "OUT_two_epoch/PAR_perturb" # set file name stub for opt. result files
# set lower and upper bounds to nu and T
upper_bound = [1e4, 5]
lower_bound = [1e-4, 0]
ns = fs_ery.sample_sizes # both populations have the same sample size
fs_ery.pop_ids = ['ery']
fs_par.pop_ids = ['par']
# setting the smallest grid size slightly larger than the largest population sample size (36)
pts_l = [50, 60, 70]
In [7]:
lbview = cl.load_balanced_view()
In [32]:
def run_dadi(p_init): # for the function to be called with map, it needs to have one input variable
"""
p_init: initial parameter values to run optimisation from
"""
if perturb == True:
p_init = dadi.Misc.perturb_params(p_init, fold=fold,
upper_bound=upper_bound, lower_bound=lower_bound)
# note upper_bound and lower_bound variables are expected to be in the namespace of each engine
# run optimisation of parameters
popt = dadi_opt_func(p0=p_init, data=sfs, model_func=func_ex, pts=pts_l, \
lower_bound=lower_bound, upper_bound=upper_bound, \
verbose=verbose, maxiter=maxiter, full_output=full_output)
# pickle to file
import dill
name = outname[:] # make copy of file name stub!
for p in p_init:
name += "_%.4f" % (p)
with open(name + ".dill", "w") as fh:
dill.dump((p_init, popt), fh)
return p_init, popt
In [9]:
from itertools import repeat
In [10]:
# specify the initial parameter values, they will be randomly perturbed by up to a factor of 4
p0 = [0.1, 2.5]
ar_par_te = lbview.map(run_dadi, repeat(p0, 20), block=False, order=True)
In [14]:
ar_par_te.progress
Out[14]:
In [13]:
def get_flag_count(out, NM=True):
"""
out: list of tuples, each containing p_init and popt + additional info, including warnflags
as produced by run_dadi.py
"""
from collections import defaultdict
if NM: # if ar from Nelder-Mead
i = 4 # the warnflag is reported at index position 4 in the output array
else: # ar from BFGS optimisation
i = 6
warnflag = defaultdict(int)
for res in out:
if res[1][i] == 1: # res[0] is p_init, res[1] is the full optimisation output
warnflag[1] +=1
elif res[1][i] == 2:
warnflag[2] += 1
elif res[1][i] == 0:
warnflag[0] += 1
else:
warnflag[999] +=1
if NM:
print "success", warnflag[0]
print "Maximum number of function evaluations made.", warnflag[1]
print "Maximum number of iterations reached.", warnflag[2]
print "unknown flag", warnflag[999]
else:
print "success", warnflag[0]
print "Maximum number of iterations exceeded.", warnflag[1]
print "Gradient and/or function calls not changing.", warnflag[2]
print "unknown flag", warnflag[999]
In [16]:
ar_par_te = []
import glob, dill
for filename in glob.glob("OUT_two_epoch/PAR_perturb*dill"):
ar_par_te.append( dill.load(open(filename)) )
In [17]:
get_flag_count(ar_par_te, NM=True)
In [18]:
def flatten(array):
"""
Returns a flat list of the elements of all nested lists (or tuples)
****RECURSIVE****
"""
import numpy
res = []
for el in array:
if isinstance(el, (list, tuple, numpy.ndarray)):
res.extend(flatten(el))
continue
res.append(el)
return list( res )
In [19]:
import pandas as pd
In [20]:
i = 4 # index of flag with NM algorithm
successfull_popt = [flatten(out)[:5] for out in ar_par_te if out[1][i] == 0]
df = pd.DataFrame(data=successfull_popt, \
columns=['nu_0','T_0', 'nu_opt', 'T_opt', 'logL'])
df.sort_values(by='logL', ascending=True)
Out[20]:
Note that these two epoch models have the same likelihood as the three epoch models above.
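Since the two epoch model is nested within the three epoch model (set $\nu_B = 1$ or $TB = 0$ and it collapses to a single size change), a likelihood-ratio test could be used to ask whether the two extra parameters are justified. A sketch with made-up placeholder numbers (the real values would be minus the smallest entries of the logL / -logL columns above):
In [ ]:
from scipy.stats import chi2
# placeholder log-likelihoods purely for illustration; substitute the maximised
# log-likelihoods (= -fopt) of the best two epoch and three epoch runs
ll_two_epoch = -2000.0
ll_three_epoch = -1999.0
D = 2 * (ll_three_epoch - ll_two_epoch)      # the three epoch model has 2 extra parameters
print "D = %.2f, p = %.3g" % (D, chi2.sf(D, 2))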
In [21]:
# specify the initial parameter values, they will be randomly perturbed by up to a factor of 4
p0 = [10, 0.5]
ar_par_te = lbview.map(run_dadi, repeat(p0, 20), block=False, order=True)
In [22]:
ar_par_te.progress
Out[22]:
In [24]:
get_flag_count(ar_par_te, NM=True)
In [25]:
i = 4 # index of flag with NM algorithm
successfull_popt = [flatten(out)[:5] for out in ar_par_te if out[1][i] == 0]
df = pd.DataFrame(data=successfull_popt, \
columns=['nu_0','T_0', 'nu_opt', 'T_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[25]:
The likelihood of these parameter values is much higher than the ones above. However, the time parameter is hitting the upper boundary.
In [35]:
%%px
upper_bound = [1e4, 6]
In [36]:
# specify the initial parameter values, they will be randomly perturbed by up to a factor of 4
p0 = [1, 1]
ar_par_te = lbview.map(run_dadi, repeat(p0, 10), block=False, order=True)
In [28]:
#ar_par_te = []
#import glob, dill
#for filename in glob.glob("OUT_two_epoch/PAR_perturb*dill"):
# ar_par_te.append( dill.load(open(filename)) )
In [37]:
get_flag_count(ar_par_te, NM=True)
In [38]:
i = 4 # index of flag with NM algorithm
successfull_popt = [flatten(out)[:5] for out in ar_par_te if out[1][i] == 0]
df = pd.DataFrame(data=successfull_popt, \
columns=['nu_0','T_0', 'nu_opt', 'T_opt', 'logL'])
df.sort_values(by='logL', ascending=True)
Out[38]:
In [40]:
dill.dump(list(ar_par_te.get()), open("OUT_two_epoch/PAR_perturb_ar_par.dill", "w"))
It is very hard for me to believe that the log likelihood of all parameter combinations is almost exactly the same AND also the same as the likelihood for all parameter combinations of the three epoch model above!
In [92]:
[get_ll(out[1][0]) for out in ar_par_te if out[1][4] == 0]
Out[92]:
In [93]:
ar_par_te.get()
Out[93]:
In [94]:
from scipy.optimize import fmin
In [95]:
?fmin
fopt is the negative log likelihood of the data at the optimal parameter combination. I therefore don't need to recalculate the likelihood with get_ll; it is already returned by the optimisation function.
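For reference, this is how the pieces of the stored full output are laid out (assuming the standard scipy.optimize.fmin ordering of xopt, fopt, iter, funcalls, warnflag, which the indexing in get_flag_count above also relies on):
In [ ]:
for p_init, full_out in ar_par_te.get():
    xopt, fopt, niter, funcalls, warnflag = full_out[:5]
    if warnflag == 0:           # converged
        print xopt, fopt        # optimised parameters and their negative log-likelihood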
OK, it may be that the likelihood values are correct and that there is a ridge in the likelihood surface.
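One way to check for such a ridge would be to evaluate the two epoch likelihood directly on a grid of parameter values, as in the following sketch (not run here; the grid ranges are arbitrary):
In [ ]:
nu_grid = np.logspace(-2, 2, 20)
T_grid = np.linspace(0.1, 5, 20)
ll_surface = np.zeros((len(nu_grid), len(T_grid)))
for i, nu in enumerate(nu_grid):
    for j, T in enumerate(T_grid):
        model = func_ex([nu, T], fs_par.sample_sizes, [50, 60, 70]).fold()
        ll_surface[i, j] = dadi.Inference.ll_multinom(model, fs_par)
# a contour plot of ll_surface (e.g. with plt.contourf) should reveal any ridge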
I previously also wasn't able to find reasonable optimal parameter values for the two epoch model with the erythropus spectrum. I want to make another attempt here, given the apparently successful optimisation for the three epoch model above.
In [99]:
%%px
# set up global variables on engines required for run_dadi function call
dadi_opt_func = dadi.Inference.optimize_log_fmin # uses Nelder-Mead algorithm
sfs = fs_ery # use ERY spectrum
perturb = True
fold = 1
maxiter = 100 # run a maximum of 100 iterations
verbose = 0
full_output = True # need to have full output to get the warnflags (see below)
In [100]:
# specify the initial parameter values, they will be randomly perturbed by up to a factor of 2
p0 = [1, 1]
ar_ery_te = lbview.map(run_dadi, repeat(p0, 20), block=False, order=True)
In [101]:
ar_ery_te.elapsed
Out[101]:
In [102]:
get_flag_count(ar_ery_te, NM=True)
In [103]:
ar_ery_te.get()[0]
Out[103]:
In [105]:
successfull_popt_ery_te = [flatten(out)[:5] for out in ar_ery_te if out[1][4] == 0]
df = pd.DataFrame(data=successfull_popt_ery_te, \
columns=['nuB_0', 'T_0', 'nuB_opt', 'T_opt', 'logL'])
df.sort_values(by='logL', ascending=False)
Out[105]:
These parameters are hitting the parameter limits set above and therefore cannot be regarded as converged.
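An explicit check for boundary hits could look like the following sketch (the bounds are those set for the two epoch model above; the np.isclose tolerances are arbitrary):
In [ ]:
upper_bound = [1e4, 6]
lower_bound = [1e-4, 0]
for nu, T in df.loc[:, ['nuB_opt', 'T_opt']].values:
    at_limit = (np.isclose([nu, T], upper_bound, rtol=1e-2).any() or
                np.isclose([nu, T], lower_bound, atol=1e-4).any())
    print nu, T, 'at boundary' if at_limit else 'ok'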
In [106]:
# specify the initial parameter values, they will be randomly perturbed by up to a factor of 2
p0 = [0.1, 0.001]
ar_ery_te = lbview.map(run_dadi, repeat(p0, 20), block=False, order=True)
In [107]:
get_flag_count(ar_ery_te, NM=True)
Oho!
In [108]:
successfull_popt_ery_te = [flatten(out)[:5] for out in ar_ery_te if out[1][4] == 0]
df = pd.DataFrame(data=successfull_popt_ery_te, \
columns=['nuB_0', 'T_0', 'nuB_opt', 'T_opt', 'logL'])
df.sort_values(by='logL', ascending=False)
Out[108]:
Still no luck. It really looks as though the two epoch model cannot be fit to the erythropus spectrum.
In [ ]: