Here I am going to summarise, compare and interpret the results of the dadi analysis of spectra modified with Ludovic's correction (see 02_modified_2D_models for details). For the implementation of Ludovic's correction see Ludovics_correction.
In [1]:
import numpy as np
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [2]:
from glob import glob
import dill
from utility_functions import *
import pandas as pd
# turn on floating point division by default, old behaviour via '//'
from __future__ import division
In [3]:
%matplotlib inline
import pylab
pylab.rcParams['figure.figsize'] = [12, 10]
pylab.rcParams['font.size'] = 14
In [5]:
# load modified spectra, Ludovic's correction applied
fs_ery = dadi.Spectrum.from_file("ERY_modified.sfs")
fs_par = dadi.Spectrum.from_file("PAR_modified.sfs")
In [10]:
# get standard neutral model spectra
# make the extrapolating version of the standard neutral model function
func_ex = dadi.Numerics.make_extrap_log_func(dadi.Demographics1D.snm)
# setting the smallest grid size slightly larger than the largest population sample size
pts_l = [40, 50, 60]
ns = fs_ery.sample_sizes
# calculate unfolded AFS under standard neutral model (up to a scaling factor theta)
snm = func_ex(0, ns, pts_l)
snm = snm.fold()
# scaled snm spectrum for ERY
snm_ery = dadi.Inference.optimally_scaled_sfs(snm, fs_ery)
# scaled snm spectrum for PAR
snm_par = dadi.Inference.optimally_scaled_sfs(snm, fs_par)
In [13]:
snm_ery - snm_par
Out[13]:
The scaling of the neutral model spectrum towards the ery or par spectrum is almost identical.
In [16]:
pylab.plot(snm_ery, 'bs-', label="snm")
pylab.plot(fs_ery, 'ro-', label="ery", linewidth=2.0, markersize=10)
pylab.plot(fs_par, 'g^-', label="par", linewidth=2.0, markersize=10)
pylab.grid()
pylab.legend()
pylab.xlabel("minor allele frequency")
pylab.ylabel("number of SNP's")
pylab.title("corrected spectra")
Out[16]:
The PAR spectrum was corrected with a $p$ of 0.36 and the ERY spectrum with a $p$ of 0.34, the optimal values of $p$ for each spectrum.
For ERY as well as for PAR the time parameter is hitting the upper parameter boundary (8) with the corrected spectrum. With uncorrected spectrum, optimisation was successful for ERY but not for PAR, because of lack of convergence.
The two epoch model could not be fit to the ERY nor the PAR spectrum with Ludovic's correction applied due hitting of upper boundary on the time parameter (8). A 2-epoch model could also not be fit to the ERY nor PAR with the uncorrected spectra due to hitting the lower and upper boundary on the time parameter, respectively.
With the uncorrected spectra, the optimal parameter values for ERY suggested an ancient population size of >21 million, which is unrealistic. For PAR the optimisation did not converge.
With the corrected spectra, the optimal parameter values imply an effective population size of >14 million for erythropus (before decline). I think such a high effective population size is unrealistic. The fitting to the corrected spectrum of PAR even estimated an initial increase to 36 million. Such high effective population sizes are obviously unrealistic. I must therefore regard the fitting of this model to the corrected spectra as failed.
With the corrected spectrum for ERY, this model unambiguously (i. e. in all optimisation runs) infers a long time with around 10-fold increased population size followed by a stepwise decrease to almost the size of the ancient population. The inference from the uncorrected spectrum ranged from a 186-fold increase to a reduction to 1/3 of the ancient population size.
Similar to the ERY spectrum, the inference from the corrected PAR spectrum showed a consistent population size increase (ranging from 38-fold to a ridiculous 7772-fold). This was followed by a very recent radical reduction in population size to only a fraction of the ancient size.
So, except for the consistent inference of an ancient population size increase (stronger in parallelus), the inference from corrected spectra is qualitatively quite similar and not more conclusive than the inference from the uncorrected spectra.
In [17]:
# load spectrum modified with Ludovic's correction, p=0.35
sfs2d = dadi.Spectrum.from_file("EryPar_modified.2dsfs")
In [18]:
dadi.Plotting.plot_single_2d_sfs(sfs2d, vmin=1, cmap=pylab.cm.jet)
Out[18]:
The 2D SFS was corrected with a $p$ of 0.35, the optimal $p$ when fit to a simple divergence in isolation model with parameters $\nu_1=1$, $\nu_2=1$ and $T=1$.
In [19]:
# number of SNP's in the spectrum
sfs2d.S()
Out[19]:
In [27]:
ns = sfs2d.sample_sizes # both populations have the same sample size
# setting the smallest grid size slightly larger than the largest population sample size (36)
pts_l = [40, 50, 60]
In [21]:
def split_nomig(params, ns, pts):
"""
params = (nu1,nu2,T)
ns = (n1,n2)
Split into two populations of specifed size, no migration.
nu1: Size ratio of population 1 after split (with respect to ancestral population size Na)
nu2: Size ratio of population 2 after split (with respect to ancestral population size Na)
T: Time in the past of split (in units of 2*Na generations)
n1,n2: Sample sizes of resulting Spectrum
pts: Number of grid points to use in integration.
"""
nu1,nu2,T = params
xx = dadi.Numerics.default_grid(pts)
phi = dadi.PhiManip.phi_1D(xx)
phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=0, m21=0)
fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
return fs
In [23]:
ar_split_nomig = []
for filename in glob("OUT_2D_models/split_nomig_[!p]*dill"):
ar_split_nomig.append(dill.load(open(filename)))
In [24]:
returned = [flatten(out)[:7] for out in ar_split_nomig]
df = pd.DataFrame(data=returned, \
columns=['nu1_0','nu2_0', 'T_0', 'nu1_opt', 'nu2_opt', 'T_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[24]:
This is how convergence looks like.
In [28]:
# get best-fit parameter combination
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0, 3:6])
# create extrapolating version of the model function
func_ex = dadi.Numerics.make_extrap_log_func(split_nomig)
# get unscaled, best-fit model spectrum
model = func_ex(popt, ns, pts_l)
# get logL of best-fit model
ll_model = dadi.Inference.ll_multinom(model, sfs2d)
# get theta
theta = dadi.Inference.optimal_sfs_scaling(model, sfs2d)
# get total sequence length in spectrum
L = sfs2d.data.sum()
# get theta per site
print "The optimal value of theta per site for the ancestral population is {0:.4f}.".format(theta/L)
In [29]:
mu = 3e-9 # assumed per generation per site mutation rate
# get ancestral population size
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ERY and PAR (in number of diploid individuals) implied by this theta is:\n {0:,}.".format(int(N_ref))
In [32]:
print "The ancestral population of ERY and PAR split apart {0:,} generations ago.".format(int(popt[2]*2*N_ref)),
print "Since then until present the ERY population had a constant size of {0:,} and the PAR population of {1:,}.".format(int(popt[0]*N_ref), int(popt[1]*N_ref))
The following table compares the best-fit parameter values from the uncorrected and corrected spectrum:
parameter | uncorrected | corrected |
---|---|---|
$N_a$ | 688,875 | 631,166 |
$\nu_1$ | 438,776 | 481,209 |
$\nu_2$ | 847,045 | 963,542 |
T | 400,188 | 381,147 |
-logL | 20387 | 15791 |
The values are translated to absolute units, i. e. $\nu_x$ in individuals and T in generations.
In [33]:
dadi.Plotting.plot_2d_comp_multinom(model, sfs2d, vmin=1)
The best-fit model spectrum predicts far fewer SNP's with frequency [1,1] than the observed (corrected) spectrum.
This model adds symmetrical migration.
In [36]:
ar_split_mig = []
for filename in glob("OUT_2D_models/split_mig_[0-9]*dill"):
ar_split_mig.append(dill.load(open(filename)))
In [37]:
# get all optimisations
returned = [flatten(out)[:9] for out in ar_split_mig]
df = pd.DataFrame(data=returned, \
columns=['nu1_0','nu2_0', 'T_0', 'm_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm_opt', '-logL'])
df.sort_values(by='-logL', ascending=True).head(10)
Out[37]:
This is very good convergence.
In [38]:
func_ex = dadi.Numerics.make_extrap_log_func(dadi.Demographics2D.split_mig)
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0, 4:8])
# get unscaled, best-fit model spectrum
model = func_ex(popt, ns, pts_l)
# get theta
theta = dadi.Inference.optimal_sfs_scaling(model, sfs2d)
# get total sequence length in spectrum
L = sfs2d.data.sum()
# get theta per site
print "The optimal value of theta per site for the ancestral population is {0:.4f}.".format(theta/L)
In [39]:
mu = 3e-9 # assumed per generation per site mutation rate
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is:\n {0:,}.".format(int(N_ref))
In [40]:
nu1, nu2, T, m = popt
In [41]:
print "The ancestral population split apart {0:,} generation ago.".format(int(T*2*N_ref)),
print "Since then, ERY and PAR had a constant population size of {0:,} and {1:,}, respectively.".format(int(popt[0]*N_ref), int(popt[1]*N_ref)),
print "Since the split, a fraction of {0:.2e} of the population size of ERY were made up each generation of new immigrant individuals from PAR".format(m/2/N_ref/nu1),
print "and a fraction of {0:.2e} of the population size of PAR were made up each generation of new immigrant individuals of ERY.".format(m/2/N_ref/nu2)
print "Put another way:",
print "Since the split ERY received a constant number of {0:.2f} new immigrant alleles per generation, while PAR received a constant number of {1:.2f} per generation.".format(m*nu1, m*nu2)
The following table compares the parameter estimates for the split-migration model derived with the uncorrected and the corrected spectrum in their absolute units:
parameter | uncorrected | corrected |
---|---|---|
$N_a$ | 468,295 | 308,029 |
$N_{ERY}$ | 465,572 | 488,151 |
$N_{PAR}$ | 827,069 | 881,645 |
T (gen.) | 864,128 | 1,063,520 |
$p_{par->ery}$ | 2.69e-07 | 2.16e-07 |
$p_{ery->par}$ | 1.52e-07 | 1.20e-07 |
-logL | 18574 | 12809 |
$N_x$ have unit individuals, T has unit generations, $p_x$ are proportions of new immigrant alleles per generation.
There is a marked difference in the inferred ancestral population size ($N_a$ eq. to $N_{ref}$), which affects all other parameters. The estimated contemporary population sizes for ERY and PAR are slightly higher for the corrected spectrum, but not much so. The inferred divergence time is 1/4 higher with the corrected spectrum. The inferred migration rates (as proportion of new immigrant individuals per generation) are both slightly smaller with the corrected spectrum, but not dramatically.
In [43]:
dadi.Plotting.plot_2d_comp_multinom(data=sfs2d, model=model, vmin=1)
The difference in the frequency class [1,1] between the observed (corrected) and model spectrum has dramatically decreased after introducing migration. The introduction of migration has improved the log likelihood of the model by 2982 units. So there is a strong signal of gene flow in the 2D spectrum.
I have also fit the above two models to spectra corrected with p=0.30 and p=0.39, both derived by using a slightly different null model for the optimisation of p, that is with $T=2$ and $T=0.5$ respectively (see notebook).
Since the degree of correction does not seem to have a strong influence on parameter estimation (see notebook), I am going to use only the spectrum corrected with $p=0.35$ in the following.
In [45]:
def split_asym_mig(params, ns, pts):
"""
params = (nu1,nu2,T,m1,m2)
ns = (n1,n2)
Split into two populations of specifed size, with potentially asymmetric migration.
nu1: population size ratio of population 1 after split (with respect to Na)
nu2: population size ratio of population 2 after split (with respect to Na)
T: Time in the past of split (in units of 2*Na generations)
m1: Migration rate from ery into par (in units of 2*Na ind per generation)
m2: Migration rate from par into ery (in units of 2*Na ind per generation)
n1,n2: Sample sizes of resulting Spectrum
pts: Number of grid points to use in integration.
"""
nu1,nu2,T,m1,m2 = params
xx = dadi.Numerics.default_grid(pts)
phi = dadi.PhiManip.phi_1D(xx)
# split
phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
# divergence with potentially asymmetric migration
phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=m2, m21=m1)
fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
return fs
In [46]:
ar_split_asym_mig = []
for filename in glob("OUT_2D_models/split_asym_mig_[0-9]*dill"):
ar_split_asym_mig.append(dill.load(open(filename)))
In [47]:
l = 2*5+1
returned = [flatten(out)[:l] for out in ar_split_asym_mig]
df = pd.DataFrame(data=returned, \
columns=['nu1_0','nu2_0', 'T_0', 'm1_0', 'm2_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm1_opt', 'm2_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[47]:
This looks like convergence. Allowing for asymmetric migration rates improves the model fit by 450 logL units and is therefore highly significant.
In [48]:
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0, 5:10])
func_ex = dadi.Numerics.make_extrap_log_func(split_asym_mig)
# calculate best-fit model spectrum
model = func_ex(popt, ns, pts_l)
theta = dadi.Inference.optimal_sfs_scaling(model, sfs2d)
L = sfs2d.data.sum()
print "The optimal value of theta per site for the ancestral population is {0:.5f}.".format(theta/L)
In [50]:
mu = 3e-9
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ERY and PAR (in number of diploid individuals) implied by this theta is:\n {0:,}.".format(int(N_ref))
In [51]:
nu1, nu2, T, m1, m2 = popt
In [52]:
print "The ancestral population split apart {0:,} generations ago.".format(int(T*2*N_ref)),
print "Immediately after the split the ERY population changed to a size of {0:,} and the PAR population to {1:,}.".format(int(nu1*N_ref), int(nu2*N_ref)),
print "Since the split of the ancestral population, PAR received 1 individual from ERY every {0:.2f} generations,".format(1.0/(m1/2*nu2)),
print "while ERY received 1 PAR individual every {0:.2f} generations.".format(1.0/(m2/2*nu1)),
print "Put another way: The PAR population contained a constant proportion of {0:.2e} of new immigrant alleles each generation".format(m1/2/N_ref/nu2),
print "and the ERY population contained a constant proportion of {0:.2e} of new immigrant alleles each generation.".format(m2/2/N_ref/nu1)
The following table compares the inferred parameters for the asymmetric migration model from the uncorrected and corrected spectrum in their absolute units:
parameter | uncorrected | corrected |
---|---|---|
$N_{a}$ | 410,678 | 255,715 |
$N_{ery}$ | 378,198 | 397,740 |
$N_{par}$ | 919,352 | 975,012 |
T | 1,022,329 | 1,198,692 |
$p_{ery->par}$ | 4.58e-08 | 3.91e-08 |
$p_{par->ery}$ | 6.54e-07 | 4.37e-07 |
-logL | 18104 | 12359 |
$N_x$ have unit individuals, T has unit generations, $p_x$ are proportions of new immigrant alleles per generation.
There is marked difference in the inferred size of the ancestral population between corrected and uncorrected spectra. The inferred time of the split in number of generations for the unmodified spectrum was 1,022,329. Not a dramatic difference from the one inferred with this corrected spectrum, but still 176,000 years difference.
In [53]:
dadi.Plotting.plot_2d_comp_multinom(data=sfs2d, model=model, vmin=1)
Note the asymmetry in the model spectrum for frequency classes [1, x] and [x, 1].
So far, the split coincided with a stepwise population size change. This model adds a proportion $s$ of the ancestral population size that is allocated to each daughter population after the split and exponential population size change afterwards.
None of the parameter combinations has higher likelihood (12888.96) than the best-fit parameter combinations of the previous model (lowest neg. logL 12,359). Also the three best-fit parameter combinations are very close to the upper limit of parameter bound that I set for the time parameter $T$ and the population size ratio for PAR implies a current effective population size of several millions (something I think is unlikely given the disperal rate and the patchiness of the habitat of Chorthippus parallelus).
This model changes the population size growth or decline from exponential to linear.
There is no good convergence, the inferred $s$ is very high (close to 1 or 0) and the inferred T is close to the upper boundary that I set. Furthermore, the best -logL (12516.95) is still higher than the best -logL achieved with the asymmetric migration model (12,359).
In [54]:
def split_asym_mig_iso(params, ns, pts):
"""
params = (nu1,nu2,Tc,m1,m2,Ti)
ns = (n1,n2)
Split into two populations of specifed size, with potentially asymmetric migration
for a time Tc followed by complete isolation until present.
nu1: population size ratio of population 1 after split (with respect to Na)
nu2: population size ratio of population 2 after split (with respect to Na)
Tc: Time of gene flow after split (in units of 2*Na generations)
m1: Migration rate from ery into par (in units of 2*Na ind per generation)
m2: Migration rate from par into ery (in units of 2*Na ind per generation)
Ti: Time of isolation after cessation of gene flow
The split lies Tc+Ti * 2Na generations in the past.
n1,n2: Sample sizes of resulting Spectrum
pts: Number of grid points to use in integration.
"""
nu1,nu2,Tc,m1,m2,Ti = params
xx = dadi.Numerics.default_grid(pts)
phi = dadi.PhiManip.phi_1D(xx)
# split
phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
# divergence with potentially asymmetric migration
phi = dadi.Integration.two_pops(phi, xx, Tc, nu1, nu2, m12=m2, m21=m1)
# divergence without gene flow
phi = dadi.Integration.two_pops(phi, xx, Ti, nu1, nu2, m12=0, m21=0)
fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
return fs
In [56]:
ar_split_asym_mig_iso = []
for filename in glob("OUT_2D_models/split_asym_mig_iso*dill"):
ar_split_asym_mig_iso.append(dill.load(open(filename)))
In [58]:
l = 2*6+1
success = [flatten(out)[:l] for out in ar_split_asym_mig_iso]
df = pd.DataFrame(data=success, \
columns=['nu1_0','nu2_0', 'Tc_0', 'm1_0', 'm2_0', 'Ti_0', 'nu1_opt', 'nu2_opt', 'Tc_opt', 'm1_opt', 'm2_opt', 'Ti_opt', '-logL'])
df.sort_values(by='-logL', ascending=True).head(10)
Out[58]:
These are the 10 best parameter combinations of an extensive and refined search. There is still quite some variation in the estimated population size for PAR (nu2) and the time of primary contact (Tc). The addition of a recent time of complete isolation (Ti) has improved the fit by 203 log likelihood units and is therefore highly significant. With the uncorrected spectrum the improvment in fit was 512 log likelihood units, so more than twice as strong.
In [61]:
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0,6:12])
func_ex = dadi.Numerics.make_extrap_log_func(split_asym_mig_iso)
model = func_ex(popt, ns, pts_l)
theta = dadi.Inference.optimal_sfs_scaling(model, sfs2d)
L = sfs2d.data.sum()
print "The optimal value of theta per site for the ancestral population is {0:.5f}.".format(theta/L)
In [62]:
mu = 3e-9
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is:\n {0:,}.".format(int(N_ref))
This is quite a small effective population size for an insect.
In [63]:
nu1, nu2, Tc, m1, m2, Ti = popt
In [64]:
print "The ancestral population split apart {0:,} generations ago.".format(int((Tc+Ti)*2*N_ref)),
print "Immediately after the split the ERY population changed to a size of {0:,} and the PAR population to {1:,}.".format(int(nu1*N_ref), int(nu2*N_ref)),
print "Since the split of the ancestral population, PAR received 1 individual from ERY every {0:.2f} generations,".format(1.0/(m1/2*nu2)),
print "while ERY received 1 PAR individual every {0:.2f} generations.".format(1.0/(m2/2*nu1)),
print "Put another way: The PAR population contained a constant proportion of {0:.2e} of new immigrant alleles each generation".format(m1/2/N_ref/nu2),
print "and the ERY population contained a constant proportion of {0:.2e} of new immigrant alleles each generation.".format(m2/2/N_ref/nu1),
print "ERY and PAR remained in contact for {0:,} generations.".format(int(Tc*2*N_ref)),
print "{0:,} generations ago gene flow between ERY and PAR had ceased.".format(int(Ti*2*N_ref))
The absolute values of the two time parameters are striking. This model infers a very ancient origin of ERY and PAR and a very recent cessation of gene flow. Obviously, this model is an abstraction of the certainly more complicated history of the two subspecies, as one would expect given that there have been at least three ice ages since the inferred origin of ERY and PAR.
The following table compares the inferred parameters for the ancient migration model from the uncorrected and corrected spectrum in their absolute units:
parameter | uncorrected | corrected |
---|---|---|
$N_{a}$ | 125,977 | 113,522 |
$N_{ery}$ | 363,349 | 389,237 |
$N_{par}$ | 886,821 | 950,409 |
$T_c$ | 1,748,505 | 1,525,640 |
$p_{ery->par}$ | 2.21e-08 | 2.14e-08 |
$p_{par->ery}$ | 3.27e-07 | 2.60e-07 |
$T_i$ | 43,628 | 26,970 |
D | 1044 | 406 |
$N_x$ have unit individuals, $T_x$ has unit generations, $p_x$ are proportions of new immigrant alleles per generation and D is two times the log likelihood ratio between the best fit parameters of this model with the hitherto best performing asymmetric migration model.
Note that with the uncorrected spectrum, I could not achieve satisfactory convergence of parameter estimations. The parameters inferred with the corrected spectrum are therefore more reliable. The inferred population sizes for ERY and PAR aren't very different between corrected and uncorrected spectra and so is the inferred time of contact ($T_c$). The migration rates are slightly reduced with the corrected spectrum and the time since cessation of gene flow ($T_i$) is inferred to be even more recent than with the uncorrected spectrum.
The addition of $T_i$ improves the model by 203 log likelihood units as compared to the model without a period of complete isolation (split_asym_mig
above). If the better fit is not just due to fitting noise and bias in the data, then this means that the final period of complete isolation is significantly greater than 0. I think the fact that both the raw and the corrected spectrum lead to very similar parameter estimates provides some support for the interpretation that $T_i$ is actually greater than zero and that this is not just due to fitting noise or bias in the data.
In [67]:
dadi.Plotting.plot_2d_comp_multinom(model=model, data=sfs2d, vmin=1)
The two populations are relatively close to the hybrid zone centre at the Col de la Quillane in the Pyrenees. The clines for some characters have been shown to be many kilometers wide. It can therefore be assumed that there is some recent gene flow between the two populations. Can this gene flow be detected from this spectrum?
In [1]:
def split_mig_iso_mig(params, ns, pts):
"""
params = (nu1,nu2,Tc,m1,m2,Ti,Tsc)
ns = (n1,n2)
Split into two populations of specifed size, with potentially asymmetric migration
for a time Tc followed by a period of complete isolation Ti which is followed by a restart of
migration until present. Migration rates in the two epochs are assumed to be equal (and constant).
nu1: Size of population 1 after split.
nu2: Size of population 2 after split.
Tc: Time of gene flow after split (in units of 2*Na generations)
m1: Migration rate from ery into par (in units of 2*Na ind per generation)
m2: Migration rate from par into ery (in units of 2*Na ind per generation)
Ti: Time of isolation after cessation of gene flow
Tsc: Time of gene flow after period of isolation until present
The split lies Tc+Ti+Tsc * 2Na generations in the past.
n1,n2: Sample sizes of resulting Spectrum
pts: Number of grid points to use in integration.
"""
nu1,nu2,Tc,m1,m2,Ti,Tsc = params
xx = dadi.Numerics.default_grid(pts)
phi = dadi.PhiManip.phi_1D(xx)
# split
phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
# divergence with potentially asymmetric migration
phi = dadi.Integration.two_pops(phi, xx, Tc, nu1, nu2, m12=m2, m21=m1)
# divergence without gene flow
phi = dadi.Integration.two_pops(phi, xx, Ti, nu1, nu2, m12=0, m21=0)
# divergence with potentially asymmetric migration
phi = dadi.Integration.two_pops(phi, xx, Tsc, nu1, nu2, m12=m2, m21=m1)
fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
return fs
In [69]:
ar_mig_iso_mig = []
for filename in glob("OUT_2D_models/split_mig_iso_mig*dill"):
ar_mig_iso_mig.append(dill.load(open(filename)))
In [70]:
l = 2*7+1
# show all parameter combinations
returned = [flatten(out)[:l] for out in ar_mig_iso_mig]
df = pd.DataFrame(data=returned, \
columns=['nu1_0','nu2_0', 'Tc_0', 'm1_0', 'm2_0', 'Ti_0', 'Tsc_0', 'nu1_opt', 'nu2_opt', 'Tc_opt', 'm1_opt', 'm2_opt', 'Ti_opt', 'Tsc_opt', '-logL'])
df.sort_values(by='-logL', ascending=True).head(10)
Out[70]:
This is good convergence.
The best-fit ancient migration model had a log likelihood of -12156. So only one log likelihood unit worse.
In [72]:
ll_s = 12156.061309
ll_c = 12155.538032
D = 2 * (ll_s - ll_c)
D
Out[72]:
In [73]:
# calculate p-value for Chi-square dist.
# the weights specify a weighted sum of chi^2 distributions with 0 and 1 d.o.f
# this is because Tsc is 0 in the split_asym_mig_iso model and at the boundary of the parameter space
p = dadi.Godambe.sum_chi2_ppf(D, weights=(0.5, 0.5))
p
Out[73]:
A recent restart of gene flow cannot be detected from this corrected spectrum. I arrived at the same result with the uncorrected spectrum. Although not significant, the inferred Tsc corresponds to 1,135 generations. With the uncorrected spectrum, the inferred Tsc corresponded to 2 generations.
I have a fit a model that adds to the ancient migration model a potential size change in the ancestral population to ERY and PAR.
Refining runs with Nelder-Mead and BFGS algorithms could not achieve good convergence. Quite different demographic scenarios have almost identical likelihood (12,156). The ancient migration model from above had -logL of 12,156 with it's best fit parameter values. A population size change in the ancestral population to ERY and PAR can therefore not be inferred from this spectrum.
This ancestral population size change model predicts a very similar demographic history as the ancient migration model. That is, it does not predict a very different time of split between ERY and PAR, for instance. That is still estimated to be ~1.5 million years ago. The same is true for the other parameters that overlap with the ancient migration model.
There is no major difference in the parameter estimates from the corrected and uncorrected spectrum. Apart from a slightly younger but still ancient origin of ERY and PAR of 1,525,640 years, the corrected spectrum infers a more recent cessation of gene flow (26,970 years) than the uncorrected spectrum. There is however better convergence of parameter optimisations with the corrected spectrum, which may indicate that the estimates with the corrected spectrum are more reliable.
The numbers on the right in the upper plot are a bit misleading. It's a 27 (ky ago) and a zero.