2.1 Interpretation of parameters

3 compare marginal spectra

3.1 plot residuals

4 adding gene flow

4.1 plot residuals

4.2 Interpretation of parameters

4.2.1 Translation of m

4.3 LRT

5 Recent secondary contact model

5.1 plot residuals

5.2 Interpretation of parameters

5.3 Compare best fit IM with RSC model spectrum

6 adding asymmetric gene flow

6.1 plot residuals

6.2 Interpretation of parameters

6.3 LRT

6.4 Is the migration rate from ery into par (m1) statistically significant?

6.5 compare symmetric with asymmetric gene flow

7 adding gradual size changes

7.1 plot residuals

7.2 compare asym_mig with IM model

8 Ancient migration model

8.1 plot residuals

8.2 parameter interpretation

8.3 comparison with continuous gene flow model

9 ancient migration, isolation and secondary contact

10 Conclusion



In [1]:

    
import numpy

import sys

sys.path.insert(0, '/home/claudius/Downloads/dadi')

import dadi









    



error in importing Two Locus modules



In [2]:

    
from glob import glob
import dill
from utility_functions import *
import pandas as pd
import numpy as np
# turn on floating point division by default, old behaviour via '//'
from __future__ import division



In [3]:

    
# import 2D unfolded spectrum

sfs2d_unfolded = dadi.Spectrum.from_file('dadiExercises/EryPar.unfolded.2dsfs.dadi_format')

# add population labels
sfs2d_unfolded.pop_ids = ["ery", "par"]



In [4]:

    
ns = sfs2d_unfolded.sample_sizes
pts_l = [40, 50, 60]



In [5]:

    
# print number of segregating sites in the SFS

sfs2d_unfolded.S()









    Out[5]:





60573.584426000001

The 2D spectrum contains counts from 60k sites that are variable in par or ery or both.

For the estimation of the 2D SFS, realSFS has only taken sites that had data from at least 9 individuals in each population (see assembly.sh, lines 1430 onwards).

plot the data



In [6]:

    
import pylab

%matplotlib inline

pylab.rcParams['font.size'] = 14.0
pylab.rcParams['figure.figsize'] = [14.0, 12.0]

dadi.Plotting.plot_single_2d_sfs(sfs2d_unfolded, vmin=1, cmap=pylab.cm.jet)









    Out[6]:





<matplotlib.colorbar.Colorbar at 0x7fd65f86c190>

Cells with counts below 1 are masked and appear white in the upper plot.



In [7]:

    
sfs2d = sfs2d_unfolded.fold()



In [8]:

    
# plot the folded GLOBAL minor allele frequency spectrum

dadi.Plotting.plot_single_2d_sfs(sfs2d, vmin=1, cmap=pylab.cm.jet)









    Out[8]:





<matplotlib.colorbar.Colorbar at 0x7fd657c3f050>



In [9]:

    
ns = sfs2d.sample_sizes

# setting the smallest grid size slightly larger than the largest population sample size (36)
pts_l = [40, 50, 60]

divergence in isolation model



In [10]:

    
def split_nomig(params, ns, pts):
    """
    params = (nu1,nu2,T)
    ns = (n1,n2)

    Split into two populations of specifed size, no migration.

    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    T: Time in the past of split (in units of 2*Na generations) 
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    nu1,nu2,T = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)

    phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=0, m21=0)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs

This model specifies a split of the ancestral population $T \times 2N_{ref}$ generations ago. The two daughter populations have a constant population size until the present of $\nu_1 \times N_{ref}$ and $\nu_2 \times N_{ref}$, respectively.



In [11]:

    
# load optimisation results from file

ar_split_nomig = []

for filename in glob("OUT_2D_models/split_nomig*dill"):
    ar_split_nomig.append(dill.load(open(filename)))



In [12]:

    
success = [flatten(out)[:7] for out in ar_split_nomig if out[1][4] == 0]

df = pd.DataFrame(data=success, \
                  columns=['nu1_0','nu2_0', 'T_0', 'nu1_opt', 'nu2_opt', 'T_opt', '-logL'])

df.sort_values(by='-logL', ascending=True)









    Out[12]:






  
    
      
      nu1_0
      nu2_0
      T_0
      nu1_opt
      nu2_opt
      T_opt
      -logL
    
  
  
    
      2
      0.877109
      0.741007
      0.360840
      0.636945
      1.229606
      0.290465
      20387.227270
    
    
      9
      0.136711
      7.063745
      0.075557
      0.637653
      1.229853
      0.290442
      20387.435815
    
    
      3
      1.461323
      4.088909
      2.236886
      0.637660
      1.229850
      0.290445
      20387.435815
    
    
      11
      0.854249
      1.682438
      0.240684
      0.637655
      1.229849
      0.290445
      20387.435815
    
    
      0
      0.146030
      1.659384
      0.282181
      0.637644
      1.229881
      0.290447
      20387.435816
    
    
      5
      0.324028
      1.847701
      0.160416
      0.637682
      1.229931
      0.290459
      20387.435816
    
    
      4
      0.248861
      4.984686
      0.085569
      0.637683
      1.229911
      0.290450
      20387.435817
    
    
      7
      0.201540
      0.651162
      0.031593
      0.637639
      1.229858
      0.290441
      20387.435817
    
    
      10
      0.533901
      0.839080
      0.139849
      0.637653
      1.229835
      0.290444
      20387.435818
    
    
      1
      0.218471
      1.740969
      0.295402
      0.637657
      1.229890
      0.290440
      20387.435819
    
    
      8
      0.147376
      0.642951
      0.091625
      0.637672
      1.229838
      0.290449
      20387.435820
    
    
      6
      0.277689
      1.509007
      0.044130
      0.637684
      1.229796
      0.290440
      20387.435837

The optimal parameter values look very robust: very different starting values converge to almost identical optimal parameter values.

Interpretation of parameters



In [13]:

    
# create link
func = split_nomig

# create extrapolating version of the model function
func_ex = dadi.Numerics.make_extrap_log_func(func)



In [14]:

    
# calculate best-fit model spectrum

model_spectrum = func_ex((0.636945, 1.229606, 0.290465), ns, pts_l)



In [15]:

    
theta = dadi.Inference.optimal_sfs_scaling(model_spectrum, sfs2d)



In [16]:

    
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The optimal value of theta per site for the ancestral population is {0:.4f}.".format(theta/L)









    



The optimal value of theta per site for the ancestral population is 0.0083.



In [17]:

    
mu = 3e-9
print "The total sequence length for the 2D spectrum is {0:,}.".format(int(L))
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref))









    



The total sequence length for the 2D spectrum is 1,130,775.
The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: 688,875.

I am assuming that $\nu_1$ refers to ery and $\nu_2$ refers to par. The split_nomig model with its optimal parameter values suggests the following:



In [18]:

    
print "The ancestral population of ery and par split apart {0:,} generations ago.".format(int(0.290465*2*N_ref)),
print "Since then until present the ery population had a size of {0:,} and the par population of {1:,}.".format(int(0.636945*N_ref)
, int(1.229606*N_ref))









    



The ancestral population of ery and par split apart 400,188 generations ago. Since then until present the ery population had a size of 438,776 and the par population of 847,045.

Note, that the fitting of 1D models to the spectra of each population had indicated a population size reduction for both populations and more so for parallelus, which seems to contradict what this 2D model says about the two populations.

compare marginal spectra

As mentioned above, the population size history inferred from the 2D spectrum contradicts what was inferred from the individual 1D spectra of each population. The 2D spectrum is not computed from the same number of sites as the two 1D spectra. While inclusion of a site in each 1D spectrum required read data from at least 9 individuals, the 2D spectrum required read data from at least 9 individuals in both populations. That is because each population's SAF files were created with minInd 9 (see lines 1426-1435 in assembly.sh). When estimating the 2D SFS with realSFS only sites that occurred in each population's SAF files were included (see lines 1661-1664 in assembly.sh).



In [19]:

    
# get marginal 1D spectra from the 2D spectrum

fsm_ery = sfs2d.marginalize([1])
fsm_par = sfs2d.marginalize([0])



In [20]:

    
# get original 1D spectra

# import 1D unfolded spectrum of ery
fs_ery_unfolded = dadi.Spectrum.from_file('ERY.unfolded.sfs')
fs_ery = fs_ery_unfolded.fold()

# import 1D unfolded spectrum of par
fs_par_unfolded = dadi.Spectrum.from_file('PAR.unfolded.sfs')
fs_par = fs_par_unfolded.fold()



In [21]:

    
pylab.plot(fs_par, 'go-', label='individual', linewidth=2.5, markersize=12)
pylab.plot(fsm_par, 'g>--', label='marginal', linewidth=2.5, markersize=12)
pylab.grid()
pylab.title('PAR')
pylab.legend()









    Out[21]:





<matplotlib.legend.Legend at 0x7fd657bec890>

The individual 1D spectrum for PAR and the 1D spectrum from marginalising over ERY in the 2D spectrum do not look drastically different.



In [22]:

    
pylab.plot(fs_ery, 'ro-', label='individual', linewidth=2.5, markersize=12)
pylab.plot(fsm_ery, 'r>--', label='marginal', linewidth=2.5, markersize=12)
pylab.grid()
pylab.title('ERY')
pylab.legend()









    Out[22]:





<matplotlib.legend.Legend at 0x7fd657aeb450>

The marginal 1D spectrum for ERY looks quite different. It mostly looks shifted down. This makes sense, when a smaller number of sites (those that are present in the SAF files of both ERY and PAR) go into the 2D spectrum than into the individual 1D spectrum.

Since the individual and marginal 1D spectra have similar shape, they should lead to similar demographic inference. The difference in the inference of population sizes for the two populations from 1D and 2D models can therefore not be explained by sytematic differences between 1D and 2D spectra.

plot residuals



In [73]:

    
dadi.Plotting.plot_2d_comp_multinom(model_spectrum, sfs2d, vmin=1)

The residuals show that the best-fit model has a deficit of fixed divergent SNP's as well as of low frequency shared SNP's. Maybe the fit can be improved by introducing gene flow.

adding gene flow



In [23]:

    
# use dadi's built-in model function
func = dadi.Demographics2D.split_mig

func_ex = dadi.Numerics.make_extrap_log_func(func)

This model adds a single migration rate parameter $m$ . It is assumed that migration is of equal size $m$ in each direction.



In [32]:

    
# load optimisation results from file

ar_split_mig = []

for filename in glob("OUT_2D_models/split_mig_[0-9]*dill"):
    ar_split_mig.append(dill.load(open(filename)))



In [33]:

    
success = [flatten(out)[:9] for out in ar_split_mig if out[1][4] == 0]

df = pd.DataFrame(data=success, \
                  columns=['nu1_0','nu2_0', 'T_0', 'm_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm_opt', '-logL'])

df.sort_values(by='-logL', ascending=True)









    Out[33]:






  
    
      
      nu1_0
      nu2_0
      T_0
      m_0
      nu1_opt
      nu2_opt
      T_opt
      m_opt
      -logL
    
  
  
    
      8
      2.059032
      0.360002
      0.682463
      0.192790
      0.994185
      1.766127
      0.922632
      2.506880e-01
      18574.119915
    
    
      31
      0.620927
      0.849787
      1.980000
      1.827165
      0.994247
      1.766242
      0.922690
      2.507990e-01
      18574.120405
    
    
      2
      2.894370
      11.363772
      1.980000
      0.272765
      0.993365
      1.764405
      0.921871
      2.509119e-01
      18574.121486
    
    
      28
      0.631049
      2.924698
      0.247732
      0.205392
      0.993307
      1.765318
      0.921817
      2.509740e-01
      18574.122241
    
    
      17
      0.837500
      2.063526
      0.058201
      0.000110
      0.996572
      1.771058
      0.928727
      2.508888e-01
      18574.138486
    
    
      18
      5.535163
      3.398472
      0.309525
      0.169667
      0.993863
      1.767477
      0.925657
      2.517209e-01
      18574.159471
    
    
      15
      0.318463
      0.315764
      0.587406
      1.138654
      0.989922
      1.761280
      0.914716
      2.507331e-01
      18574.164848
    
    
      1
      0.277233
      8.221059
      0.976626
      0.051974
      1.001148
      1.776879
      0.937823
      2.505879e-01
      18574.226541
    
    
      0
      5.653430
      3.362330
      1.980000
      0.101259
      0.995994
      1.769480
      0.927834
      2.507786e-01
      18574.341908
    
    
      33
      0.208178
      2.525835
      0.609693
      0.000767
      0.997653
      1.771560
      0.931469
      2.510613e-01
      18574.373801
    
    
      34
      0.683625
      3.770899
      0.327624
      0.960373
      0.986428
      1.750484
      0.899651
      2.503218e-01
      18574.507709
    
    
      7
      1.009844
      1.115355
      0.012909
      0.000309
      0.992355
      1.760054
      0.901086
      2.470464e-01
      18574.974381
    
    
      5
      2.556454
      1.352414
      1.569898
      0.135823
      1.194640
      2.112713
      1.324262
      2.303226e-01
      18613.113722
    
    
      26
      0.156669
      0.773484
      1.980000
      0.217233
      1.412143
      2.455870
      1.850644
      2.074998e-01
      18670.013487
    
    
      9
      6.348379
      0.230012
      1.457648
      0.117155
      1.464282
      2.548749
      1.990301
      2.024862e-01
      18681.403063
    
    
      24
      0.409099
      0.405881
      0.168447
      0.000052
      0.637198
      1.230384
      0.290581
      6.497219e-08
      20387.225419
    
    
      23
      2.175460
      0.556770
      0.086338
      0.000014
      0.637198
      1.230385
      0.290581
      1.748376e-20
      20387.226053
    
    
      12
      0.265271
      0.129476
      0.157514
      0.000045
      0.637196
      1.230384
      0.290580
      2.583554e-12
      20387.226053
    
    
      25
      0.085281
      0.402881
      0.364705
      0.000042
      0.637198
      1.230385
      0.290581
      3.044384e-18
      20387.226053
    
    
      16
      0.115098
      0.103779
      0.172020
      0.000113
      0.637162
      1.230734
      0.290565
      4.850122e-15
      20387.226666
    
    
      27
      3.731189
      0.119188
      0.015205
      0.000017
      0.637549
      1.230015
      0.290740
      9.454649e-15
      20387.229881
    
    
      19
      1.308999
      0.089865
      0.482895
      0.000052
      0.636599
      1.228599
      0.290307
      4.483462e-10
      20387.232239
    
    
      32
      2.699647
      0.315091
      0.058909
      0.000035
      0.636512
      1.227689
      0.290267
      8.437228e-12
      20387.240897
    
    
      20
      0.063430
      0.646475
      0.030202
      0.000377
      0.640787
      1.227274
      0.289655
      3.001320e-06
      20387.309598
    
    
      14
      0.109039
      1.752206
      0.068154
      0.000323
      0.637666
      1.229887
      0.290450
      1.278566e-07
      20387.434569
    
    
      11
      0.355181
      0.212621
      0.029912
      0.000157
      0.637670
      1.229889
      0.290449
      2.826076e-08
      20387.435537
    
    
      21
      0.199078
      0.146460
      0.338920
      0.000173
      0.637674
      1.229910
      0.290452
      2.509801e-08
      20387.435569
    
    
      30
      0.450847
      0.463437
      0.108087
      0.000060
      0.637667
      1.229892
      0.290450
      1.971549e-08
      20387.435620
    
    
      13
      0.318447
      0.199418
      0.790081
      0.000020
      0.637670
      1.229881
      0.290451
      3.715424e-09
      20387.435776
    
    
      6
      0.211271
      0.683610
      0.135427
      0.000021
      0.637667
      1.229884
      0.290449
      3.917260e-13
      20387.435812
    
    
      4
      0.670075
      0.343239
      1.980000
      0.080259
      0.632513
      1.349728
      2.000000
      4.635045e-01
      21059.797882
    
    
      29
      0.239562
      0.556576
      1.980000
      0.031463
      0.329749
      0.596793
      2.000000
      1.044005e+00
      23734.741735
    
    
      10
      5.874411
      4.471090
      1.980000
      1.041556
      6.145361
      3.497838
      1.999943
      1.044327e+00
      68992.522571
    
    
      22
      6.807827
      3.585435
      0.328641
      1.217099
      9.899311
      1.285312
      0.766758
      1.552602e+00
      71150.567474
    
    
      3
      2.157397
      13.379040
      1.980000
      1.878685
      1.662449
      13.064076
      1.999990
      2.253679e+00
      79308.735709

From many different initial parameter value combinations the optimisations converge on essentially the same set of parameters that includes a migration rate that is not 0. This optimal parameter combination is much more likely than the other combinations. From many initial parameter values the optimisations converge to a model that is essentially the same as the split-no-migration model from above. However, these parameter combinations have much lower likelihood.



In [27]:

    
popt = numpy.array(df.sort_values(by='-logL', ascending=True).iloc[0, 4:8])

# calculate best-fit model spectrum

model_spectrum = func_ex(popt, ns, pts_l)

theta = dadi.Inference.optimal_sfs_scaling(model_spectrum, sfs2d)

print "The optimal value of theta per site for the ancestral population is {0:.4f}.".format(theta/L)
mu = 3e-9
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The total sequence length for the 2D spectrum is {0:,}.".format(int(L))
N_ref = theta/L/mu/4

print "The effective size of the ancestral population of ery and par (in number of individuals) implied by this theta is: {0:,}.".format(int(N_ref))









    



The optimal value of theta per site for the ancestral population is 0.0122.
The total sequence length for the 2D spectrum is 1,130,775.
The effective size of the ancestral population of ery and par (in number of individuals) implied by this theta is: 1,016,810.

plot residuals



In [78]:

    
dadi.Plotting.plot_2d_comp_multinom(model_spectrum, sfs2d, vmin=1)

This has reduced the residuals for low frequency shared polymorphisms along the diagonal, but the model still cannot fit divergently fixed SNP's well.

Interpretation of parameters

Translation of m

The dadi manual says about the migration rate parameter (on page 11):

The migration parameter m12 specifies the rate of migration from pop 2 into pop 1. It is equal to the fraction of individuals each generation in pop 1 that are new migrants from pop 2, times the 2Nref.

So the fraction of new immigrant alleles each generation from $pop_2$ in $pop_1$ is:

$$ p_{m12} = \frac{m_{12}}{2N_{ref}} $$

Since dadi's migration rate is scaled by 2Nref, the migration rate parameter as reported by dadi has units of immigrant gametes per generation. However, whenever pop 1 does not have the same size as the ancestral population ($N_{ref}$), dadi's $m_{12}$ needs to be rescaled by multiplication with $\nu_1$ to get the number of immigrant gametes per generation ($M_{12}$):

$$ \begin{align} p_{m12} &= \frac{m_{12}}{2N_{ref}\nu_1} \\[5pt] M_{12} &= p_{m12} \times 2N_{ref}\nu_1 \\[5pt] M_{12} &= \nu_1 m_{12} \end{align} $$

See this confirm units of dadi thread on the dadi forum.

What this model fit says in plain English:



In [79]:

    
print "The ancestral population split apart {0:,} generation ago.".format(int(popt[2]*2*N_ref)), 
print "Since then, ery and par had a constant population size of {0:,} and {1:,}, respectively.".format(int(popt[0]*N_ref), int(popt[1]*N_ref)), 
print "Since the split, a fraction of {0:.2e} of the population size of ERY were made up each generation of new immigrant individuals from PAR".format(popt[3]/2/N_ref/popt[0]),
print " and a fraction {0:.2e} of the population size of PAR were made up each generation of new immigrant individuals from ERY.".format(popt[3]/2/N_ref/popt[1]),
print "Put another way:",
print "Since the split ery received a constant number of {0:.2f} new immigrant alleles per generation, while par received a constant number of {1:.2f} per generation.".format(popt[3]*popt[0], popt[3]*popt[1])









    



The ancestral population split apart 864,128 generation ago. Since then, ery and par had a constant population size of 465,572 and 827,069, respectively. Since the split, a fraction of 2.69e-07 of the population size of ERY were made up each generation of new immigrant individuals from PAR  and a fraction 1.52e-07 of the population size of PAR were made up each generation of new immigrant individuals from ERY. Put another way: Since the split ery received a constant number of 0.25 new immigrant alleles per generation, while par received a constant number of 0.44 per generation.

LRT

Is the model with migration significantly better than the model without migration?

From the dadi manual:

The full likelihood ratio test statistic is equal to $D = 2(ll_c - ll_s)$ where $ll_c$ and $ll_s$ are the likelihoods of the complex and simple model, respectively.

$D$ should be $\chi^2$ distributed

with degrees of freedom equal to the difference in number of parameters between the simple and complex model.

Not all SNP's in the spectrum will be unlinked with each other. Therefore, the likelihoods calculated are composite likelihoods. Dadi can calculate an adjustment factor for $D$ that corrects for linkage. However, this requires bootstrapped data sets, which I haven't created yet. In the following calculation I am assuming that all SNP's can be assumed independent data points.



In [80]:

    
# calculate logL for split-migration model
ll_c = dadi.Inference.ll_multinom(model_spectrum, sfs2d)

func = split_nomig

# create extrapolating version of the model function
func_ex = dadi.Numerics.make_extrap_log_func(func)

# calculate best-fit model spectrum with optimal parameter values
model_spectrum = func_ex((0.636945, 1.229606, 0.290465), ns, pts_l)

# calculate logL for split-no-migration model
ll_s = dadi.Inference.ll_multinom(model_spectrum, sfs2d)



In [81]:

    
D = 2*(ll_c - ll_s)
D









    Out[81]:





3626.2146620022322

From the dadi example script YRI_CEY.py:

Because this is a test of a parameter on the boundary of the parameter space (m cannot be less than zero), our null distribution is an even proportion of chi^2 distributions with 0 and 1 d.o.f. To evaluate the p-value, we use the point percent function for a weighted sum of chi^2 dists.

See Self2007 for more complex cases.



In [82]:

    
# calculate p-value for Chi-square dist. with 1 degree of freedom
p = dadi.Godambe.sum_chi2_ppf(D, weights=(0.5, 0.5))
p









    Out[82]:





0.0

Gene flow significantly improves the fit to the observed spectrum. I doubt that there could be enough linkage in the data to compromise this result.

Recent secondary contact model

Phylogeographic studies as well as paleoclimatic data strongly suggest that the two subspecies, erythropus and parallelus, diverged in isolation during at least the last ice age and only recently (~10,000 YBP) came into secondary contact. Clines of several morphological as well as molecular markers have been shown to be very wide, which indicates substantial gene flow across the hybrid zone into adjacent pure populations. The two population samples come from such "pure" populations in close proximity to the hybrid zone. They may therefore show evidence of recent gene flow between the two subspecies.

I am now defining a model function that is a mixture of the split_nomig and split_mig models. It specifies a split, divergence in isolation for a time $T_i$, then secondary contact of the two diverged populations since $T_c$ times $2N_{ref}$ generations in the past.



In [28]:

    
def secondary_contact(params, ns, pts):
    """
    params = (nu1,nu2,Ti,Tc,m)
    ns = (n1,n2)

    Split into two populations of specified size, 
    then diverge in isolation,
    finally come into secondary contact.

    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    Ti: Time of divergence in isolation (in units of 2*Na generations)
    Tc: Time in the past when secondary contact began (and continues till present)
    m: symmetrical migration rate (in units of 2*Na per generation)
    
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    nu1,nu2,Ti,Tc,m = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi) # split

    # divergence in isolation
    phi = dadi.Integration.two_pops(phi, xx, Ti, nu1, nu2, m12=0, m21=0) 
    
    # secondary contact
    phi = dadi.Integration.two_pops(phi, xx, Tc, nu1, nu2, m12=m, m21=m)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs



In [37]:

    
ar_sec_contact = []

for filename in glob("OUT_2D_models/sec_contact*dill"):
    ar_sec_contact.append(dill.load(open(filename)))



In [38]:

    
# get parameters from all runs into a table

returned = ([flatten(out)[:11] for out in ar_sec_contact])

df = pd.DataFrame(data=returned, \
                  columns=['nu1_0','nu2_0', 'Ti_0', 'Tc_0', 'm_0', 'nu1_opt', 'nu2_opt', 'Ti_opt', 'Tc_opt', 'm_opt', '-logL'])

df.sort_values(by='-logL', ascending=True).head(10)









    Out[38]:






  
    
      
      nu1_0
      nu2_0
      Ti_0
      Tc_0
      m_0
      nu1_opt
      nu2_opt
      Ti_opt
      Tc_opt
      m_opt
      -logL
    
  
  
    
      16
      0.800833
      1.775923
      0.481231
      0.034021
      0.318631
      0.993736
      1.766119
      6.606356e-02
      0.854641
      0.250961
      18574.115189
    
    
      11
      2.291806
      0.941107
      0.417754
      0.701156
      0.280979
      0.995790
      1.768780
      4.236017e-09
      0.928104
      0.250866
      18574.136842
    
    
      0
      1.236540
      2.847184
      0.230733
      3.068632
      0.849626
      0.995633
      1.767191
      8.952130e-02
      0.836080
      0.251181
      18574.175975
    
    
      9
      1.153465
      2.238480
      0.164853
      0.034364
      1.314128
      0.989694
      1.757785
      1.310731e-01
      0.779922
      0.251502
      18574.445432
    
    
      25
      0.390701
      1.897315
      0.759622
      0.446801
      0.700774
      0.994296
      1.771438
      1.354936e-01
      0.779559
      0.250514
      18574.532394
    
    
      1
      1.006920
      2.471800
      0.560705
      1.489177
      0.540249
      1.007125
      1.785494
      1.446806e-01
      0.789880
      0.249414
      18574.820474
    
    
      26
      0.228105
      5.126482
      0.093396
      0.087684
      0.331095
      1.005337
      1.778832
      1.211617e-03
      0.932793
      0.246241
      18574.947916
    
    
      28
      0.402956
      1.380227
      0.550776
      2.572429
      0.157063
      0.987372
      1.754340
      1.760467e-01
      0.726734
      0.252789
      18575.195890
    
    
      37
      1.127186
      4.079809
      0.302806
      0.931041
      0.188286
      1.002210
      1.779032
      3.747498e-02
      0.918896
      0.254661
      18575.351564
    
    
      30
      0.591352
      3.382051
      0.243697
      0.081228
      0.530071
      1.027357
      1.796094
      3.927616e-02
      0.932124
      0.249578
      18576.313944

The most likely parameter combination has the same likelihood as the best-fit for the split_mig model (i. e. the model with $T_i = 0$). The best-fit parameters for this secondary contact model suggest only a short time of complete isolation followed by a long time of secondary contact, which is qualitatively equivalent to a model without complete isolation. A $T_i$ greater 0 does not improve the likelihood, so I think it is not significant.



In [34]:

    
func = secondary_contact

# create extrapolating version of the model function
func_ex = dadi.Numerics.make_extrap_log_func(func)



In [35]:

    
import numpy as np



In [39]:

    
# determine theta

popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0, 5:10])

# get the unscaled optimal secondary contact model spectrum
model_spectrum = func_ex(popt, ns, pts_l)

# get theta
theta = dadi.Inference.optimal_sfs_scaling(model_spectrum.fold(), sfs2d)

theta









    Out[39]:





6357.6920380892325

plot residuals



In [40]:

    
dadi.Plotting.plot_2d_comp_multinom(model_spectrum, sfs2d, vmin=1)

Interpretation of parameters



In [41]:

    
mu = 3e-9
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The total sequence length for the 2D spectrum is {0:,}.".format(int(L))
N_ref = theta/L/mu/4
print "The effective size of the ancestral population Nref of ery and par (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref))









    



The total sequence length for the 2D spectrum is 1,130,775.
The effective size of the ancestral population Nref of ery and par (in number of diploid individuals) implied by this theta is: 468,535.



In [42]:

    
print "After the split of the ancestral population {0:,} generations ago, the daughter populations increased in size (instantaneously).".format(int((popt[2]+popt[3])*2*N_ref)),
print "ERY increased to about the same size as the ancestral population (Nref), while PAR increased to {0:.2f} times of Nref.".format(popt[1]),
print "After the split ERY and PAR diverged in isolation for {0:,} generations.".format(int(popt[2]*2*N_ref)),
print "The two populations came into secondary contact {0:,} generations ago.".format(int(popt[3]*2*N_ref)),
print "ERY and PAR stayed in continuous secondary contact until the present and",
print "during that time exchanged one individual from par to ery every {0:.1f} generations".format(1.0/(popt[4]/2*popt[0])),
print "and one individual from ery into par every {0:.1f} generations.".format(1.0/(popt[4]/2*popt[1]))









    



After the split of the ancestral population 862,764 generations ago, the daughter populations increased in size (instantaneously). ERY increased to about the same size as the ancestral population (Nref), while PAR increased to 1.77 times of Nref. After the split ERY and PAR diverged in isolation for 61,906 generations. The two populations came into secondary contact 800,858 generations ago. ERY and PAR stayed in continuous secondary contact until the present and during that time exchanged one individual from par to ery every 8.0 generations and one individual from ery into par every 4.5 generations.

Compare best fit IM with RSC model spectrum

I would like to compare expected spectra from a recent secondary contact model with expected spectra from an isolation with continuous migration model (previous section).



In [43]:

    
success = [flatten(out)[:9] for out in ar_split_mig if out[1][4] == 0]

df_split_mig = pd.DataFrame(data=success, \
                  columns=['nu1_0','nu2_0', 'T_0', 'm_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm_opt', '-logL'])

popt_split_mig = np.array(df_split_mig.sort_values(by='-logL', ascending=True).iloc[0, 4:8])
popt_split_mig









    Out[43]:





array([ 0.99418511,  1.76612674,  0.92263183,  0.25068798])



In [44]:

    
# get split_mig model function (built-in)
func_ex_split_mig = dadi.Numerics.make_extrap_log_func(dadi.Demographics2D.split_mig)

# get optimally scaled split-migration model spectrum
model_spectrum_split_mig = dadi.Inference.optimally_scaled_sfs(func_ex_split_mig(popt_split_mig, ns, pts_l), sfs2d)



In [45]:

    
# parameter combination that specifies a recent secondary contact model
p_sec_contact = [0.67858642,  1.29659005,  0.35015044,  0.00483189,  1.92454244]

These parameters says the following about the history of PAR and ERY:



In [46]:

    
print "After the split of the ancestral population {0:,} generations ago, the daughter populations increased in size (instantaneously).".format(int((p_sec_contact[2]+p_sec_contact[3])*2*N_ref)),
print "ERY decreased to about {0:.2f} the size of the ancestral population (Nref), while PAR increased to {1:.2f} times of Nref.".format(p_sec_contact[0],p_sec_contact[1]),
print "After the split ERY and PAR diverged in isolation for {0:,} generations.".format(int(p_sec_contact[2]*2*N_ref)),
print "The two populations came into secondary contact {0:,} generations ago.".format(int(p_sec_contact[3]*2*N_ref)),
print "ERY and PAR stayed in continuous secondary contact until present and",
print "during that time contained a proportion {0:.2e} new immigrant alleles each generations.".format(p_sec_contact[4]/2/N_ref)









    



After the split of the ancestral population 332,643 generations ago, the daughter populations increased in size (instantaneously). ERY decreased to about 0.68 the size of the ancestral population (Nref), while PAR increased to 1.30 times of Nref. After the split ERY and PAR diverged in isolation for 328,115 generations. The two populations came into secondary contact 4,527 generations ago. ERY and PAR stayed in continuous secondary contact until present and during that time contained a proportion 2.05e-06 new immigrant alleles each generations.



In [47]:

    
# get the optimally scaled secondary contact model spectrum

func_ex = dadi.Numerics.make_extrap_log_func(secondary_contact)

SC_model_spectrum = dadi.Inference.optimally_scaled_sfs(func_ex(p_sec_contact, ns, pts_l), sfs2d)



In [48]:

    
# get the log likelihood of this RSC parameterisation
ll_RSC = dadi.Inference.ll_multinom(SC_model_spectrum, sfs2d)
ll_RSC









    Out[48]:





-19711.069711547018



In [49]:

    
18574-19711









    Out[49]:





-1137

This RSC parameterisation reduces the fit to the data by 1137 log likelihood units cmpared to the best-fit parameters which suggest practically no time of complete isolation.



In [50]:

    
# I have edited the functions `dadi.Plotting.plot_2d_comp_multinom` and `dadi.Plotting.plot_2d_comp_Poisson`
# in `Plotting.py` in the dadi repo to add the argument `title`
dadi.Plotting.plot_2d_comp_multinom(data=SC_model_spectrum.fold(), model=model_spectrum_split_mig.fold(), \
                                    vmin=1, title=['RSC', 'continuous migration'], pop_ids = ['ery', 'par'])

The recent secondary contact model (with the parameters given above) predicts far fewer low frequency shared polymorphisms along the diagonal. Despite the long time of isolation it also predicts fewer highly diverged SNP's (red in the residual plot).

On the other hand, recent secondary contact predicts more high frequency shared polymorphisms than isolation with continuous migration (blue in the residual plot).

adding asymmetric gene flow

Define an IM model that allows for unequal migration rates between the two populations.



In [44]:

    
def split_asym_mig(params, ns, pts):
    """
    params = (nu1,nu2,T,m1,m2)
    ns = (n1,n2)

    Split into two populations of specifed size, with potentially asymmetric migration.

    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    T: Time in the past of split (in units of 2*Na generations) 
    m1: Migration rate from ery into par (in units of 2*Na ind per generation)
    m2: Migration rate from par into ery (in units of 2*Na ind per generation)
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    nu1,nu2,T,m1,m2 = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    
    # split
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi) 
    
    # divergence with potentially asymmetric migration
    phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=m2, m21=m1)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs



In [99]:

    
# load optimal parameters from files

ar_split_asym_mig = []

for filename in glob("OUT_2D_models/split_asym_mig_[0-9]*dill"):
    ar_split_asym_mig.append(dill.load(open(filename)))



In [100]:

    
# create table

l = 2*5+1

success = [flatten(out)[:l] for out in ar_split_asym_mig]

df = pd.DataFrame(data=success, \
                  columns=['nu1_0','nu2_0', 'T_0', 'm1_0', 'm2_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm1_opt', 'm2_opt', '-logL'])

df.sort_values(by='-logL', ascending=True).head(10)









    Out[100]:






  
    
      
      nu1_0
      nu2_0
      T_0
      m1_0
      m2_0
      nu1_opt
      nu2_opt
      T_opt
      m1_opt
      m2_opt
      -logL
    
  
  
    
      22
      0.782220
      1.448548
      4.639001
      0.072158
      0.638342
      0.920911
      2.238617
      1.244681
      0.084147
      0.494918
      18104.966267
    
    
      0
      0.661003
      7.163811
      1.265469
      0.257890
      0.190122
      0.920526
      2.236116
      1.241691
      0.084361
      0.494500
      18104.971893
    
    
      2
      0.303698
      1.230424
      0.319167
      0.222183
      0.228972
      0.922813
      2.243404
      1.254244
      0.083942
      0.494334
      18104.983699
    
    
      1
      0.287213
      1.386083
      0.493232
      0.096763
      0.103789
      0.915567
      2.223840
      1.228988
      0.084270
      0.497170
      18105.001298
    
    
      12
      0.234347
      0.568305
      0.310253
      0.112591
      0.538129
      0.912096
      2.209951
      1.210935
      0.084060
      0.497975
      18105.141289
    
    
      17
      0.445903
      1.955658
      2.155467
      0.025188
      0.607712
      0.928113
      2.253559
      1.279568
      0.084935
      0.493929
      18105.232355
    
    
      11
      0.690910
      1.962277
      0.655486
      0.716114
      0.081910
      0.929836
      2.267261
      1.289605
      0.085435
      0.493187
      18105.358461
    
    
      14
      0.840609
      1.431975
      0.454631
      0.032701
      0.460575
      0.933747
      2.248026
      1.278518
      0.086974
      0.490424
      18105.547423
    
    
      29
      0.474111
      0.694113
      1.372862
      0.274980
      0.360411
      0.915513
      2.217017
      1.244177
      0.087882
      0.492707
      18105.676702
    
    
      30
      0.318476
      1.235161
      0.915553
      0.202189
      0.451816
      0.889937
      2.161759
      1.149623
      0.084294
      0.506266
      18106.119019

These parameter combinations consistently indicate a much smaller gene flow from erythropus into parallelus than in the other direction. This seems plausible when considering the observed expansion of the range of parallelus towards the south of the Pyrenees. Shapes of neutral clines across the hybrid zone should also be less steep on the erythropus side than on the parallelus side. A more trivial explanation is that the sampling location for ERY (Greixer) is closer to the cline centre than the sampling location for PAR (Aunat). A quick look at the map does not seem to support this, but migration barriers, also those that do not exist anymore, like forests, will have had an effect on the past gene flow.

plot residuals



In [101]:

    
# get optimal parameter combination
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0, 5:10])
popt









    Out[101]:





array([ 0.92091098,  2.2386172 ,  1.24468148,  0.08414741,  0.49491818])



In [39]:

    
func = split_asym_mig

func_ex = dadi.Numerics.make_extrap_log_func(func)

# calculate best-fit model spectrum
model_spectrum = func_ex(popt, ns, pts_l)

# optimally scale model spectrum
model_spectrum_asym_mig_scaled = dadi.Inference.optimally_scaled_sfs(model_spectrum, sfs2d)

theta = dadi.Inference.optimal_sfs_scaling(model_spectrum, sfs2d)
theta









    Out[39]:





5572.6260851273273



In [40]:

    
dadi.Plotting.plot_2d_comp_multinom(model_spectrum_asym_mig_scaled, sfs2d, vmin=1)

Interpretation of parameters



In [50]:

    
mu = 3e-9
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The optimal value of theta per site for the ancestral population is {0:.5f}.".format(theta/L)
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref))









    



The optimal value of theta per site for the ancestral population is 0.00493.
The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: 410,678.



In [49]:

    
print "The ancestral population split apart {0:,} generations ago.".format(int(popt[2]*2*N_ref)),
print "Immediately after the split the ERY population changed to a size of {0:,} and the PAR population to {1:,}.".format(int(popt[0]*N_ref), int(popt[1]*N_ref)),
print "Since the split of the ancestral population, PAR received 1 individual from ERY every {0:.2f} generations,".format(1.0/(popt[3]/2*popt[1])),
print "while ERY received 1 PAR individual every {0:.2f} generations.".format(1.0/(popt[4]/2*popt[0])),
print "Put another way: The PAR population contained a constant proportion of {0:.2e} of new immigrant alleles each generation".format(popt[3]/2/N_ref/popt[1]),
print "and the ERY population contained a constant proportion of {0:.2e} of new immigrant alleles each generation.".format(popt[4]/2/N_ref/popt[0])









    



The ancestral population split apart 1,022,329 generations ago. Immediately after the split the ERY population changed to a size of 378,198 and the PAR population to 919,352. Since the split of the ancestral population, PAR received 1 individual from ERY every 10.62 generations, while ERY received 1 PAR individual every 4.39 generations. Put another way: The PAR population contained a constant proportion of 4.58e-08 of new immigrant alleles each generation and the ERY population contained a constant proportion of 6.54e-07 of new immigrant alleles each generation.

This model therefore indicates a much stronger introgression from parallelus into erythropus than in the opposite direction. Is the asymmetrical gene flow significant?

LRT



In [85]:

    
ll_c = dadi.Inference.ll_multinom(model_spectrum_asym_mig_scaled, sfs2d)
ll_c









    Out[85]:





-18104.966266885222



In [86]:

    
# get maximum log likelihood of best fit parameter combination for the model
# with symmetrical migration rates

func = dadi.Demographics2D.split_mig

func_ex = dadi.Numerics.make_extrap_log_func(func)

# best-fit paramter combination for the split_mig model (see section 3 above)
popt_split_mig = [0.994185, 1.766127, 0.922632, 0.250688]

model_spectrum_sym_mig = func_ex(popt_split_mig, ns, pts_l)

model_spectrum_sym_mig_scaled = dadi.Inference.optimally_scaled_sfs(model_spectrum_sym_mig, sfs2d)

ll_s = dadi.Inference.ll_multinom(model_spectrum_sym_mig, sfs2d)
ll_s









    Out[86]:





-18574.11991248127



In [87]:

    
D = 2*(ll_c - ll_s)
D









    Out[87]:





938.30729119209718



In [88]:

    
# calculate p-value for Chi-square dist. with 1 degree of freedom
p = dadi.Godambe.sum_chi2_ppf(D)
p









    Out[88]:





0.0

It looks as though allowing for asymmetrical gene flow significantly improves the fit to the data. Linkage is unlikely to push the p-value above a significance value of 1e-2, for instance.

Is the migration rate from ery into par (m1) statistically significant?



In [89]:

    
ar_split_asym_mig_nom1 = []

for filename in glob("OUT_2D_models/split_asym_mig_nom1*dill"):
    ar_split_asym_mig_nom1.append(dill.load(open(filename)))



In [90]:

    
l = 2*5+1

returned = [flatten(out)[:l] for out in ar_split_asym_mig_nom1]

df = pd.DataFrame(data=returned, \
                  columns=['nu1_0','nu2_0', 'T_0', 'm1_0', 'm2_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm1_opt', 'm2_opt', '-logL'])

df.sort_values(by='-logL', ascending=True).head(10)









    Out[90]:






  
    
      
      nu1_0
      nu2_0
      T_0
      m1_0
      m2_0
      nu1_opt
      nu2_opt
      T_opt
      m1_opt
      m2_opt
      -logL
    
  
  
    
      4
      0.604286
      1.689639
      1.086538
      0.048727
      0.585935
      0.720082
      1.962785
      0.791503
      0.0
      0.664581
      18340.935861
    
    
      8
      1.197798
      1.060943
      0.387950
      0.070337
      0.177612
      0.718508
      1.960294
      0.788789
      0.0
      0.664332
      18340.951096
    
    
      6
      0.588394
      3.336045
      2.176182
      0.035238
      0.433038
      0.721147
      1.961776
      0.791353
      0.0
      0.664706
      18340.964537
    
    
      3
      0.400697
      4.483709
      0.486604
      0.282474
      0.612249
      0.721305
      1.961497
      0.790718
      0.0
      0.663976
      18340.965653
    
    
      1
      0.250388
      0.933745
      0.381047
      0.123940
      1.702059
      0.717236
      1.953896
      0.786673
      0.0
      0.667389
      18340.972028
    
    
      9
      1.163790
      2.959924
      0.753380
      0.124571
      0.155288
      0.719228
      1.967604
      0.794908
      0.0
      0.669281
      18341.008850
    
    
      7
      0.331430
      2.693409
      0.905852
      0.078054
      0.398861
      0.743397
      2.058666
      0.871767
      0.0
      0.663042
      18345.753837
    
    
      0
      0.258239
      2.546149
      2.410812
      0.024995
      0.808097
      0.658928
      1.875663
      0.777684
      0.0
      0.783119
      18374.981997
    
    
      2
      1.546874
      0.697055
      2.829271
      0.246889
      0.420687
      1.453423
      4.604630
      3.820350
      0.0
      0.406251
      18687.709775
    
    
      5
      1.561103
      0.681461
      2.626801
      0.037716
      0.265306
      0.460973
      1.135742
      0.295629
      0.0
      0.584516
      19282.370153



In [91]:

    
ll_s = -18340.9

D = 2*(ll_c - ll_s)
D









    Out[91]:





471.8674662295598



In [92]:

    
# calculate p-value for Chi-square dist.
# the weights specify a weighted sum of a chi^2 distributions with 1 and 2 d.o.f
# 
p = dadi.Godambe.sum_chi2_ppf(D, weights=(0.5, 0.5))
p









    Out[92]:





0.0

The gene flow from ery into par is highly significant. It is unlikely that this result can be compromised by linkage between the SNP's.

compare symmetric with asymmetric gene flow



In [93]:

    
# get maximum log likelihood of best fit parameter combination for the model
# with symmetrical migration rates

func = dadi.Demographics2D.split_mig

func_ex = dadi.Numerics.make_extrap_log_func(func)

# best-fit paramter combination for the split_mig model (see section 2 above)
popt_split_mig = [0.994185, 1.766127, 0.922632, 0.250688]

model_spectrum_sym_mig = func_ex(popt_split_mig, ns, pts_l)

model_spectrum_sym_mig_scaled = dadi.Inference.optimally_scaled_sfs(model_spectrum_sym_mig, sfs2d)



In [94]:

    
dadi.Plotting.plot_2d_comp_multinom(model=model_spectrum_asym_mig_scaled.fold(), \
                                    data=model_spectrum_sym_mig_scaled.fold(), vmin=1,\
                                   title=['sym mig', 'asym mig'], pop_ids=['ery', 'par'])

The model spectrum for asymmetric gene flow has an excess (brown) of low to medium MAF SNP's in PAR that occur at low frequency in ERY. It has fewer medium to high frequency SNP's that are private to PAR and more medium to high frequency SNP's that are private to ERY. It has far fewer low frequency SNP's that are private to ERY.



In [95]:

    
popt_split_mig = popt_split_mig + [popt_split_mig[-1]]
popt_split_mig









    Out[95]:





[0.994185, 1.766127, 0.922632, 0.250688, 0.250688]



In [102]:

    
popt









    Out[102]:





array([ 0.92091098,  2.2386172 ,  1.24468148,  0.08414741,  0.49491818])



In [103]:

    
pd.DataFrame(data=np.array([popt_split_mig, popt]), index=['sym_mig', 'asym_mig'], columns=['nu1', 'nu2', 'T', 'm1', 'm2'])

The excess of SNP's in MAF categroy 1 that did not introgress from PAR into ERY could be due to the greater population size increase inferred for PAR in the asymmetric gene flow model, although this model also inferred a greater time since the population size increase which should lead to greater equilibration of the initial rise in low frequency variants.

adding gradual size changes

I would now like to add a gradual population size change after the split of the ancestral population. So far, the split coincided with a stepwise population size change.

Dadi's built-in IM model function contains what I would like to model now. To avoid confusion of migration rate parameters, I am anyway going to specify a slightly modified version of this model that uses the same naming convention and order of m as used above.



In [6]:

    
def IM(params, ns, pts):
    """
    ns = (n1,n2)
    params = (s,nu1,nu2,T,m1,m2)

    Isolation-with-migration model with exponential pop growth or decline.

    s: Size of pop 1 after split. (Pop 2 has size 1-s.)
    nu1: Final size of pop 1.
    nu2: Final size of pop 2.
    T: Time in the past of split (in units of 2*Na generations) 
    m1: Migration from pop 1 to pop 2
    m2: Migration from pop 2 to pop 1 (2*Na*m12)
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    s,nu1,nu2,T,m1,m2 = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)

    nu1_func = lambda t: s * (nu1/s)**(t/T)
    nu2_func = lambda t: (1-s) * (nu2/(1-s))**(t/T)
    phi = dadi.Integration.two_pops(phi, xx, T, nu1_func, nu2_func,
                               m12=m2, m21=m1)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs



In [105]:

    
% ll OUT_2D_models/IM*dill









    



-rw-rw-r-- 1 claudius 348 May 20 11:23 OUT_2D_models/IM_0.2031_0.8545_4.6009_0.5113_0.0858_0.5280.dill
-rw-rw-r-- 1 claudius 348 May 20 11:28 OUT_2D_models/IM_0.2226_0.8121_1.4049_1.3324_0.0311_1.1951.dill
-rw-rw-r-- 1 claudius 348 May 20 11:23 OUT_2D_models/IM_0.2310_1.2361_1.8753_1.2658_0.0658_0.2513.dill
-rw-rw-r-- 1 claudius 348 May 20 09:12 OUT_2D_models/IM_0.2509_1.8483_3.4565_3.8366_0.0445_1.6234.dill
-rw-rw-r-- 1 claudius 348 May 20 10:41 OUT_2D_models/IM_0.2565_0.5092_1.1938_1.7695_0.0523_0.5498.dill
-rw-rw-r-- 1 claudius 348 May 20 11:34 OUT_2D_models/IM_0.2876_0.5848_2.4915_0.5777_0.1549_0.9120.dill
-rw-rw-r-- 1 claudius 348 May 20 10:37 OUT_2D_models/IM_0.3112_0.7834_1.5277_1.5106_0.0975_0.5320.dill
-rw-rw-r-- 1 claudius 348 May 20 11:27 OUT_2D_models/IM_0.3233_2.5430_4.0198_2.7578_0.0704_1.1637.dill
-rw-rw-r-- 1 claudius 347 May 20 10:29 OUT_2D_models/IM_0.3407_0.6037_2.7433_1.9915_0.1010_0.9500.dill
-rw-rw-r-- 1 claudius 347 May 20 11:22 OUT_2D_models/IM_0.3456_0.5854_1.1924_0.5849_0.0421_0.5837.dill
-rw-rw-r-- 1 claudius 348 May 20 08:50 OUT_2D_models/IM_0.3491_0.9462_0.6957_1.0377_0.0261_0.6846.dill
-rw-rw-r-- 1 claudius 348 May 20 11:29 OUT_2D_models/IM_0.3491_1.5808_0.8167_0.5517_0.1343_0.5405.dill
-rw-rw-r-- 1 claudius 348 May 20 11:19 OUT_2D_models/IM_0.3495_2.4268_0.9492_0.9825_0.0801_0.4741.dill
-rw-rw-r-- 1 claudius 348 May 20 09:37 OUT_2D_models/IM_0.3527_0.4200_1.3329_1.2097_0.0213_1.6102.dill
-rw-rw-r-- 1 claudius 348 May 20 08:48 OUT_2D_models/IM_0.3707_0.3217_1.0568_0.5839_0.0953_0.2555.dill
-rw-rw-r-- 1 claudius 348 May 20 12:07 OUT_2D_models/IM_0.3712_2.4312_0.8172_2.9642_0.0675_0.2363.dill
-rw-rw-r-- 1 claudius 348 May 20 09:00 OUT_2D_models/IM_0.4170_0.3797_3.3358_1.6557_0.0497_0.4566.dill
-rw-rw-r-- 1 claudius 348 May 20 08:49 OUT_2D_models/IM_0.4191_0.3297_1.3333_0.3281_0.0396_0.5835.dill
-rw-rw-r-- 1 claudius 348 May 20 11:30 OUT_2D_models/IM_0.4248_0.7593_2.5618_0.6372_0.1382_1.0413.dill
-rw-rw-r-- 1 claudius 348 May 20 10:41 OUT_2D_models/IM_0.4319_1.4778_2.7629_0.6366_0.0679_0.3063.dill
-rw-rw-r-- 1 claudius 348 May 20 11:22 OUT_2D_models/IM_0.4525_0.3556_1.7123_0.6932_0.1401_1.3618.dill
-rw-rw-r-- 1 claudius 348 May 20 10:41 OUT_2D_models/IM_0.4634_0.7381_3.9058_0.7398_0.0586_0.4418.dill
-rw-rw-r-- 1 claudius 348 May 20 10:38 OUT_2D_models/IM_0.4808_0.5443_2.6985_1.2743_0.1082_0.2680.dill
-rw-rw-r-- 1 claudius 348 May 20 11:13 OUT_2D_models/IM_0.5131_0.3484_4.4150_0.6846_0.1322_0.4705.dill
-rw-rw-r-- 1 claudius 348 May 20 11:44 OUT_2D_models/IM_0.5251_0.6289_0.8518_1.3208_0.1161_0.3603.dill
-rw-rw-r-- 1 claudius 348 May 20 08:51 OUT_2D_models/IM_0.5385_1.2558_1.7027_0.6353_0.1126_0.2594.dill
-rw-rw-r-- 1 claudius 348 May 20 08:55 OUT_2D_models/IM_0.5619_3.4493_0.6549_3.5055_0.0375_0.8759.dill
-rw-rw-r-- 1 claudius 348 May 20 11:45 OUT_2D_models/IM_0.5884_2.1814_2.4257_3.2182_0.0327_0.7525.dill
-rw-rw-r-- 1 claudius 348 May 20 11:20 OUT_2D_models/IM_0.6583_0.9867_1.2219_0.9665_0.2024_0.6639.dill
-rw-rw-r-- 1 claudius 348 May 20 11:21 OUT_2D_models/IM_0.6884_1.7269_4.6350_0.9372_0.1095_0.6473.dill
-rw-rw-r-- 1 claudius 348 May 20 10:30 OUT_2D_models/IM_0.6901_0.7354_2.6172_0.9073_0.0537_0.5421.dill
-rw-rw-r-- 1 claudius 348 May 20 10:43 OUT_2D_models/IM_0.6915_0.9151_3.3370_0.6759_0.1638_0.8073.dill
-rw-rw-r-- 1 claudius 348 May 20 10:29 OUT_2D_models/IM_0.7242_1.4047_1.9813_1.4717_0.0431_0.5205.dill
-rw-rw-r-- 1 claudius 347 May 20 11:25 OUT_2D_models/IM_0.7673_0.9486_2.1605_1.9110_0.0541_1.0402.dill
-rw-rw-r-- 1 claudius 348 May 20 10:28 OUT_2D_models/IM_0.7732_1.2302_1.4930_1.0542_0.1328_0.6447.dill
-rw-rw-r-- 1 claudius 348 May 20 11:44 OUT_2D_models/IM_0.8743_0.9611_3.0516_3.5202_0.0756_0.1761.dill
-rw-rw-r-- 1 claudius 348 May 20 09:17 OUT_2D_models/IM_0.9900_0.6185_1.5880_0.8341_0.0347_0.7268.dill
-rw-rw-r-- 1 claudius 348 May 20 12:52 OUT_2D_models/IM_0.9900_0.7123_2.6093_1.3836_0.0412_0.2112.dill
-rw-rw-r-- 1 claudius 348 May 20 09:22 OUT_2D_models/IM_0.9900_1.2623_1.5577_0.4044_0.0358_0.3795.dill
-rw-rw-r-- 1 claudius 348 May 20 12:11 OUT_2D_models/IM_0.9900_1.4617_3.2959_1.1040_0.1360_0.2038.dill



In [106]:

    
ar_IM = []

for filename in glob("OUT_2D_models/IM*dill"):
    ar_IM.append(dill.load((open(filename))))



In [107]:

    
l = 2*6+1

returned = [flatten(out)[:l] for out in ar_IM]

df = pd.DataFrame(data=returned, columns=['s_0' ,'nu1_0','nu2_0', 'T_0', 'm1_0', 'm2_0', 's_opt' ,'nu1_opt', 'nu2_opt', 'T_opt', 'm1_opt', 'm2_opt', '-logL'])

df.sort_values(by='-logL', ascending=True).head(10)









    Out[107]:






  
    
      
      s_0
      nu1_0
      nu2_0
      T_0
      m1_0
      m2_0
      s_opt
      nu1_opt
      nu2_opt
      T_opt
      m1_opt
      m2_opt
      -logL
    
  
  
    
      23
      0.287604
      0.584816
      2.491513
      0.577716
      0.154881
      0.912029
      0.394336
      1.028757
      3.117404
      3.130620
      0.077685
      0.576641
      18814.402521
    
    
      0
      0.463402
      0.738050
      3.905796
      0.739800
      0.058625
      0.441791
      0.406139
      1.080665
      3.270704
      3.494185
      0.074393
      0.547156
      18814.762357
    
    
      34
      0.538469
      1.255813
      1.702693
      0.635345
      0.112638
      0.259371
      0.388115
      1.063798
      3.199012
      3.234170
      0.078406
      0.557619
      18814.877903
    
    
      11
      0.345561
      0.585417
      1.192358
      0.584863
      0.042089
      0.583673
      0.409575
      1.037432
      3.163748
      3.314343
      0.075790
      0.574198
      18814.902965
    
    
      9
      0.691487
      0.915053
      3.337032
      0.675904
      0.163826
      0.807328
      0.401345
      1.127900
      3.433437
      3.748322
      0.069455
      0.527414
      18815.275911
    
    
      19
      0.417002
      0.379670
      3.335798
      1.655678
      0.049658
      0.456647
      0.379121
      1.082723
      3.258400
      3.404997
      0.074637
      0.545341
      18815.329404
    
    
      36
      0.370682
      0.321735
      1.056795
      0.583936
      0.095256
      0.255517
      0.380064
      1.084249
      3.291107
      3.415423
      0.071002
      0.551352
      18815.491487
    
    
      2
      0.349065
      1.580840
      0.816700
      0.551687
      0.134324
      0.540489
      0.413618
      0.989663
      3.030744
      3.048123
      0.078635
      0.599699
      18815.663484
    
    
      5
      0.231016
      1.236144
      1.875334
      1.265815
      0.065793
      0.251310
      0.389511
      0.968089
      2.940081
      2.803636
      0.082062
      0.612002
      18815.733030
    
    
      32
      0.222590
      0.812110
      1.404866
      1.332359
      0.031135
      1.195093
      0.424820
      0.974611
      3.066101
      3.252257
      0.074601
      0.609123
      18818.545393

This model is hard to fit to the data. The best-fit parameter combinations contain an extremely long divergence time T of around 3, i. e. $6 \times N_{ref}$ generations. No parameter combination with higher likelihood could be found than in the previous runs and they are all much lower than from the split_asym_mig model (-18104), which imposed a stepwise size change coincident with the population split and otherwise constant population sizes.

plot residuals



In [108]:

    
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[0, 6:12])
popt









    Out[108]:





array([ 0.3943357 ,  1.02875659,  3.11740404,  3.13061996,  0.07768455,
        0.57664111])



In [109]:

    
func_ex = dadi.Numerics.make_extrap_log_func(IM)

model_spectrum = func_ex(popt, ns, pts_l)



In [110]:

    
dadi.Plotting.plot_2d_comp_multinom(model_spectrum, sfs2d, vmin=1)

compare `asym_mig` with `IM` model



In [111]:

    
# optimally scale the best-fit IM model spectrum,
# by multiplication with optimal theta
model_spectrum_scaled = dadi.Inference.optimally_scaled_sfs(model_spectrum, sfs2d)



In [112]:

    
# make extraplating function from `split_asym_mig` model function
func_ex = dadi.Numerics.make_extrap_log_func(split_asym_mig)

# these were the optimal parameters for the `split_asym_mig` model
popt_asym_mig = [0.92091098,  2.2386172 ,  1.24468148,  0.08414741,  0.49491818]

# get (unscaled) expected SFS for optimal `split_asym_mig` model
model_spectrum_asym_mig = func_ex(popt_asym_mig, ns, pts_l)

# optimally scale spectrum
model_spectrum_asym_mig = dadi.Inference.optimally_scaled_sfs(model_spectrum_asym_mig, sfs2d)



In [113]:

    
dadi.Plotting.plot_2d_comp_multinom(data=model_spectrum_asym_mig.fold(), model=model_spectrum_scaled.fold(), \
                                    vmin=1, title=['asym_mig', 'IM'], pop_ids=['ery', 'par'])

The IM model with gradual population size changes predicts far fewer low frequency SNP's that are private to PAR and far more high frequency SNP's that are private to ERY.

Ancient migration model

So far, a model with an ancient split and low but constant gene flow (more from par into ery than in the other direction) has fit the data best. I would like to know whether I can achieve an improvement if I let gene flow cease completely at some point in time.



In [47]:

    
def split_asym_mig_iso(params, ns, pts):
    """
    params = (nu1,nu2,Tc,m1,m2,Ti)
    ns = (n1,n2)

    Split into two populations of specifed size, with potentially asymmetric migration
    for a time Tc followed by complete isolation until present.

    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    Tc: Time of gene flow after split (in units of 2*Na generations) 
    m1: Migration rate from ery into par (in units of 2*Na ind per generation)
    m2: Migration rate from par into ery (in units of 2*Na ind per generation)
    Ti: Time of isolation after cessation of gene flow
    The split lies Tc+Ti * 2Na generations in the past.
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    nu1,nu2,Tc,m1,m2,Ti = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    
    # split
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi) 
    
    # divergence with potentially asymmetric migration
    phi = dadi.Integration.two_pops(phi, xx, Tc, nu1, nu2, m12=m2, m21=m1)
    
    # divergence without gene flow
    phi = dadi.Integration.two_pops(phi, xx, Ti, nu1, nu2, m12=0, m21=0)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs

This is a piecewise model. I could make gene flow fade out gradually, but that would require four more parameters that specify the initial and final levels of gene flow. Also, it would make the model less comparable to the split_asym_mig model, which is also piecewise.



In [48]:

    
ar_split_asym_mig_iso = []

for filename in glob("OUT_2D_models/split_asym_mig_iso*dill"):
    ar_split_asym_mig_iso.append(dill.load(open(filename)))



In [49]:

    
l = 2*6+1

# get all optimisation results
returned = [flatten(out)[:l] for out in ar_split_asym_mig_iso]

df = pd.DataFrame(data=returned, \
                  columns=['nu1_0','nu2_0', 'Tc_0', 'm1_0', 'm2_0', 'Ti_0', 'nu1_opt', 'nu2_opt', 'Tc_opt', 'm1_opt', 'm2_opt', 'Ti_opt', '-logL'])

df.sort_values(by='-logL', ascending=True).head(10)









    Out[49]:






  
    
      
      nu1_0
      nu2_0
      Tc_0
      m1_0
      m2_0
      Ti_0
      nu1_opt
      nu2_opt
      Tc_opt
      m1_opt
      m2_opt
      Ti_opt
      -logL
    
  
  
    
      38
      3.122675
      6.565590
      4.556457
      0.025178
      0.150502
      0.193779
      2.884250
      7.039549
      6.939774
      0.039132
      0.237419
      0.173159
      17582.884413
    
    
      15
      2.457166
      7.524601
      5.393413
      0.108182
      0.288238
      0.119209
      2.439955
      5.984046
      5.708981
      0.045731
      0.279332
      0.146457
      17583.559438
    
    
      6
      3.803361
      5.073436
      4.055461
      0.089821
      0.147407
      0.074291
      2.507140
      6.203819
      6.007240
      0.042590
      0.272537
      0.148739
      17584.306651
    
    
      10
      2.988730
      4.738118
      4.034295
      0.054839
      0.356824
      0.157766
      2.092882
      5.117888
      4.727453
      0.053452
      0.326719
      0.125626
      17584.511983
    
    
      48
      2.942805
      4.976983
      2.734166
      0.069029
      0.212524
      0.093116
      2.033029
      4.975807
      4.562200
      0.054811
      0.336354
      0.122034
      17584.788795
    
    
      37
      1.698863
      3.380579
      5.956428
      0.041756
      0.236200
      0.099436
      2.012841
      4.921590
      4.465928
      0.055141
      0.338755
      0.120221
      17585.405692
    
    
      9
      3.311864
      5.144676
      3.960000
      0.109423
      0.383802
      0.129523
      1.798721
      4.397479
      3.879859
      0.061955
      0.379276
      0.107981
      17585.882553
    
    
      25
      3.772272
      5.313562
      3.820628
      0.075429
      0.294183
      0.246429
      2.276085
      5.525343
      5.185373
      0.051899
      0.292315
      0.134893
      17585.939796
    
    
      34
      2.097129
      5.327970
      5.500053
      0.083428
      0.525671
      0.092544
      1.842790
      4.510400
      4.019504
      0.060138
      0.371075
      0.110615
      17586.022421
    
    
      46
      1.642916
      6.103883
      3.960000
      0.054931
      0.349428
      0.058007
      1.781764
      4.362300
      3.835582
      0.062103
      0.383396
      0.106964
      17586.026218

It was difficult to achieve convergence. Parameter combinations with quite different time of primary contact ($T_C$) have very similar likelihood.

The log likelihood of the split_asym_mig model with best-fit parameters was: -18104.96.



In [50]:

    
D = 2*(-17582.88 - -18104.96)
D









    Out[50]:





1044.1599999999962



In [51]:

    
# calculate p-value for Chi-square dist.
# the weights specify a weighted sum of chi^2 distributions with 1 and 2 d.o.f
# this is because Ti is 0 in the split_asym_mig model and at the boundary of the parameter space
p = dadi.Godambe.sum_chi2_ppf(D, weights=(0.5, 0.5))
p









    Out[51]:





0.0

This ancient migration model fits the data significantly better than the split asym mig model. That also means that the recent period of complete isolation ($T_i$) is significant. Assuming that I had found the best-fit parameters for that model, this means that a model that allows for a very ancient split (>1 million generations ago) and that specifies a recent cessation of gene flow fits the data better than a model with a less ancient split and constant migration until the present.

plot residuals



In [52]:

    
popt_mig_iso = np.array(df.sort_values(by='-logL', ascending=True).iloc[0,6:12])
popt_mig_iso









    Out[52]:





array([ 2.8842495 ,  7.03954944,  6.93977407,  0.03913244,  0.23741935,
        0.1731595 ])



In [53]:

    
# create unscaled model spectrum

func_ex = dadi.Numerics.make_extrap_log_func(split_asym_mig_iso)

model_mig_iso = func_ex(popt_mig_iso, ns, pts_l)



In [54]:

    
dadi.Plotting.plot_2d_comp_multinom(model=model_mig_iso, data=sfs2d, vmin=1)

parameter interpretation



In [55]:

    
theta = dadi.Inference.optimal_sfs_scaling(model_mig_iso, sfs2d)
theta









    Out[55]:





1709.4208758269674



In [56]:

    
mu = 3e-9
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The total sequence length for the 2D spectrum is {0:,}.".format(int(L))
N_ref = theta/L/mu/4
print "The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref))









    



The total sequence length for the 2D spectrum is 1,130,775.
The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: 125,977.



In [57]:

    
nu1, nu2, Tc, m1, m2, Ti = popt_mig_iso



In [58]:

    
popt = popt_mig_iso
print "The ancestral population split apart {0:,} generations ago.".format(int((Tc+Ti)*2*N_ref)),
print "Immediately after the split the ERY population changed to a size of {0:,} and the PAR population to {1:,}.".format(int(nu1*N_ref), int(nu2*N_ref)),
print "Since the split of the ancestral population, PAR received 1 individual from ERY every {0:.2f} generations,".format(1.0/(m1/2*nu2)),
print "while ERY received 1 PAR individual every {0:.2f} generations.".format(1.0/(m2/2*nu1)),
print "Put another way: The PAR population contained a constant proportion of {0:.2e} of new immigrant alleles each generation".format(m1/2/N_ref/nu2),
print "and the ERY population contained a constant proportion of {0:.2e} of new immigrant alleles each generation.".format(m2/2/N_ref/nu1),
print "ERY and PAR remained in contact for {0:,} generations.".format(int(Tc*2*N_ref)),
print "{0:,} generations ago gene flow between ERY and PAR had ceased.".format(int(Ti*2*N_ref))









    



The ancestral population split apart 1,792,133 generations ago. Immediately after the split the ERY population changed to a size of 363,349 and the PAR population to 886,821. Since the split of the ancestral population, PAR received 1 individual from ERY every 7.26 generations, while ERY received 1 PAR individual every 2.92 generations. Put another way: The PAR population contained a constant proportion of 2.21e-08 of new immigrant alleles each generation and the ERY population contained a constant proportion of 3.27e-07 of new immigrant alleles each generation. ERY and PAR remained in contact for 1,748,505 generations. 43,628 generations ago gene flow between ERY and PAR had ceased.

This is another indication that a recent secondary contact model is not compatible with the data. If the better fit of this ancient migration model (compared to constant migration) is not just due to fitting noise or biases in the data, then it would also mean that there is no detectable gene flow since the last ice age, despite the fact that the two populations are just a few kilometers apart and the clines of several markers are very wide across this transect of the hybrid zone.

comparison with continuous gene flow model



In [42]:

    
model_mig_iso_scaled = dadi.Inference.optimally_scaled_sfs(model_mig_iso, sfs2d)



In [45]:

    
func_ex = dadi.Numerics.make_extrap_log_func(split_asym_mig)

popt_asym_mig = [0.92091098,  2.2386172 ,  1.24468148,  0.08414741,  0.49491818]

model_asym_mig = func_ex(popt_asym_mig, ns, pts_l)

model_asym_mig_scaled = dadi.Inference.optimally_scaled_sfs(model_asym_mig, sfs2d)



In [46]:

    
dadi.Plotting.plot_2d_comp_multinom(data=model_mig_iso_scaled.fold() , model=model_asym_mig_scaled.fold(), vmin=1, title=['ancient mig', 'continuous mig'], pop_ids=['ery', 'par'])

Apparently the continuous migration model predicts many more variants at frequency 1 in erythropus that are shared with parallelus than the ancient migration model.

ancient migration, isolation and secondary contact

The two populations are relatively close to the hybrid zone centre at the Col de la Quillane in the Pyrenees. The clines for some characters have been shown to be many kilometers wide. It can therefore be assumed that there is some recent gene flow between the two populations. Can this gene flow be detected from this spectrum?



In [59]:

    
def split_mig_iso_mig(params, ns, pts):
    """
    params = (nu1,nu2,Tc,m1,m2,Ti,Tsc)
    ns = (n1,n2)

    Split into two populations of specifed size, with potentially asymmetric migration
    for a time Tc followed by a period of complete isolation Ti which is followed by a restart of
    migration until present. Migration rates in the two epochs are assumed to equal (and constant).

    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    Tc: Time of gene flow after split (in units of 2*Na generations) 
    m1: Migration rate from ery into par (in units of 2*Na ind per generation)
    m2: Migration rate from par into ery (in units of 2*Na ind per generation)
    Ti: Time of isolation after cessation of gene flow
    Tsc: Time of gene flow after period of isolation until present
    
    The split lies Tc+Ti+Tsc * 2Na generations in the past.
    
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    nu1,nu2,Tc,m1,m2,Ti,Tsc = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    
    # split
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi) 
    
    # divergence with potentially asymmetric migration
    phi = dadi.Integration.two_pops(phi, xx, Tc, nu1, nu2, m12=m2, m21=m1)
    
    # divergence without gene flow
    phi = dadi.Integration.two_pops(phi, xx, Ti, nu1, nu2, m12=0, m21=0)
    
    # divergence with potentially asymmetric migration
    phi = dadi.Integration.two_pops(phi, xx, Tsc, nu1, nu2, m12=m2, m21=m1)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs

This model just adds another epoch to the ancient migration model in the previous section. The period without gene flow is now followed by another period where gene flow recommences and lasts until the present. Note that migration rates are assumed to be the same in both epochs of gene flow.



In [60]:

    
ar_mig_iso_mig = []

for filename in glob("OUT_2D_models/split_mig_iso_mig*dill"):
    ar_mig_iso_mig.append(dill.load(open(filename)))



In [61]:

    
import pandas as pd

l = 2*7+1

# show all parameter combinations
returned = [flatten(out)[:l] for out in ar_mig_iso_mig]

df = pd.DataFrame(data=returned, \
                  columns=['nu1_0','nu2_0', 'Tc_0', 'm1_0', 'm2_0', 'Ti_0', 'Tsc_0', 'nu1_opt', 'nu2_opt', 'Tc_opt', 'm1_opt', 'm2_opt', 'Ti_opt', 'Tsc_opt', '-logL'])

df.sort_values(by='-logL', ascending=True).head(10)









    Out[61]:






  
    
      
      nu1_0
      nu2_0
      Tc_0
      m1_0
      m2_0
      Ti_0
      Tsc_0
      nu1_opt
      nu2_opt
      Tc_opt
      m1_opt
      m2_opt
      Ti_opt
      Tsc_opt
      -logL
    
  
  
    
      10
      3.393662
      10.566381
      7.920000
      0.027990
      0.147762
      0.138175
      0.000030
      2.976386
      7.290183
      7.236861
      0.037756
      0.229864
      0.178659
      0.000019
      17582.754156
    
    
      11
      3.738671
      6.291449
      7.920000
      0.067358
      0.181861
      0.141373
      0.000033
      2.933155
      7.168308
      7.104183
      0.038650
      0.233178
      0.176059
      0.000012
      17582.768290
    
    
      18
      1.310714
      3.824504
      4.369948
      0.049779
      0.307464
      0.237146
      0.033322
      2.362064
      5.842982
      5.547352
      0.046664
      0.290175
      0.140767
      0.000027
      17583.780419
    
    
      22
      1.837307
      10.757118
      7.645129
      0.059177
      0.229560
      0.213559
      0.000032
      2.242446
      5.484794
      5.156093
      0.050021
      0.304977
      0.134595
      0.000019
      17583.969711
    
    
      15
      2.096425
      4.660374
      4.265668
      0.045547
      0.244521
      0.129254
      0.000019
      2.172535
      5.310612
      4.954453
      0.051699
      0.314579
      0.130400
      0.000015
      17584.202249
    
    
      29
      2.330807
      4.339575
      7.920000
      0.064959
      0.400673
      0.087552
      0.000033
      2.156040
      5.304803
      4.934450
      0.050758
      0.318995
      0.129395
      0.000015
      17584.406282
    
    
      14
      1.779585
      5.280102
      4.381530
      0.024307
      0.165301
      0.198832
      0.000030
      2.036089
      4.973092
      4.563480
      0.054334
      0.337488
      0.123699
      0.000012
      17585.299903
    
    
      37
      1.305890
      3.807041
      4.293586
      0.049922
      0.207666
      0.180538
      0.000050
      1.968934
      4.846930
      4.420515
      0.056884
      0.346840
      0.118008
      0.000100
      17585.995715
    
    
      35
      1.749685
      2.948985
      7.920000
      0.023613
      0.395618
      0.196917
      0.000033
      1.739157
      4.251127
      3.725114
      0.063843
      0.392371
      0.104367
      0.000066
      17587.372433
    
    
      20
      2.313656
      3.765053
      5.049405
      0.046481
      0.379796
      0.276645
      0.000047
      1.671541
      4.090976
      3.534206
      0.065836
      0.409312
      0.100302
      0.000093
      17588.436814

The best parameter combination from the ancient migration model had a likelihood of 17582.88. So the addition of $T_{SC}$ to the model had not improved the fit significantly. A time of secondary contact cannot be inferred from this spectrum.



In [62]:

    
popt = np.array(df.sort_values(by='-logL', ascending=True).iloc[1,7:14])
popt









    Out[62]:





array([  2.93315472e+00,   7.16830807e+00,   7.10418266e+00,
         3.86497143e-02,   2.33177903e-01,   1.76059241e-01,
         1.18413266e-05])



In [65]:

    
func_ex = dadi.Numerics.make_extrap_log_func(split_mig_iso_mig)

model = func_ex(popt, ns, pts_l)

theta = dadi.Inference.optimal_sfs_scaling(model, sfs2d)

mu = 3e-9
L = sfs2d.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites

N_ref = theta/L/mu/4

print "The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref))
print "A Tsc of {0:.6f} corresponds to {1:,} generations.".format(popt[-1], int(popt[-1]*2*N_ref))









    



The effective size of the ancestral population of ery and par (in number of diploid individuals) implied by this theta is: 123,659.
A Tsc of 0.000012 corresponds to 2 generations.

Such a low time since secondary contact is obviously nonsensical.

Conclusion

The split between ery and par is ancient, maybe >1 Mya. A much higher effective population size is inferred for parallelus than for erythropus. There is a strong signal of gene flow in the data, which is inferred to have been much higher in the direction par-->ery than the other way round. The gene flow has not begun after a long period of complete isolation (i. e. no recent secondary contact). Rather, there seems to be a strong signal of ancient gene flow that recently has ceased completely. A model that allowed for exponential (instead of instantaneous) population size change did not fit the data better.

	nu1_0	nu2_0	T_0	nu1_opt	nu2_opt	T_opt	-logL
2	0.877109	0.741007	0.360840	0.636945	1.229606	0.290465	20387.227270
9	0.136711	7.063745	0.075557	0.637653	1.229853	0.290442	20387.435815
3	1.461323	4.088909	2.236886	0.637660	1.229850	0.290445	20387.435815
11	0.854249	1.682438	0.240684	0.637655	1.229849	0.290445	20387.435815
0	0.146030	1.659384	0.282181	0.637644	1.229881	0.290447	20387.435816
5	0.324028	1.847701	0.160416	0.637682	1.229931	0.290459	20387.435816
4	0.248861	4.984686	0.085569	0.637683	1.229911	0.290450	20387.435817
7	0.201540	0.651162	0.031593	0.637639	1.229858	0.290441	20387.435817
10	0.533901	0.839080	0.139849	0.637653	1.229835	0.290444	20387.435818
1	0.218471	1.740969	0.295402	0.637657	1.229890	0.290440	20387.435819
8	0.147376	0.642951	0.091625	0.637672	1.229838	0.290449	20387.435820
6	0.277689	1.509007	0.044130	0.637684	1.229796	0.290440	20387.435837

	nu1_0	nu2_0	T_0	m_0	nu1_opt	nu2_opt	T_opt	m_opt	-logL
8	2.059032	0.360002	0.682463	0.192790	0.994185	1.766127	0.922632	2.506880e-01	18574.119915
31	0.620927	0.849787	1.980000	1.827165	0.994247	1.766242	0.922690	2.507990e-01	18574.120405
2	2.894370	11.363772	1.980000	0.272765	0.993365	1.764405	0.921871	2.509119e-01	18574.121486
28	0.631049	2.924698	0.247732	0.205392	0.993307	1.765318	0.921817	2.509740e-01	18574.122241
17	0.837500	2.063526	0.058201	0.000110	0.996572	1.771058	0.928727	2.508888e-01	18574.138486
18	5.535163	3.398472	0.309525	0.169667	0.993863	1.767477	0.925657	2.517209e-01	18574.159471
15	0.318463	0.315764	0.587406	1.138654	0.989922	1.761280	0.914716	2.507331e-01	18574.164848
1	0.277233	8.221059	0.976626	0.051974	1.001148	1.776879	0.937823	2.505879e-01	18574.226541
0	5.653430	3.362330	1.980000	0.101259	0.995994	1.769480	0.927834	2.507786e-01	18574.341908
33	0.208178	2.525835	0.609693	0.000767	0.997653	1.771560	0.931469	2.510613e-01	18574.373801
34	0.683625	3.770899	0.327624	0.960373	0.986428	1.750484	0.899651	2.503218e-01	18574.507709
7	1.009844	1.115355	0.012909	0.000309	0.992355	1.760054	0.901086	2.470464e-01	18574.974381
5	2.556454	1.352414	1.569898	0.135823	1.194640	2.112713	1.324262	2.303226e-01	18613.113722
26	0.156669	0.773484	1.980000	0.217233	1.412143	2.455870	1.850644	2.074998e-01	18670.013487
9	6.348379	0.230012	1.457648	0.117155	1.464282	2.548749	1.990301	2.024862e-01	18681.403063
24	0.409099	0.405881	0.168447	0.000052	0.637198	1.230384	0.290581	6.497219e-08	20387.225419
23	2.175460	0.556770	0.086338	0.000014	0.637198	1.230385	0.290581	1.748376e-20	20387.226053
12	0.265271	0.129476	0.157514	0.000045	0.637196	1.230384	0.290580	2.583554e-12	20387.226053
25	0.085281	0.402881	0.364705	0.000042	0.637198	1.230385	0.290581	3.044384e-18	20387.226053
16	0.115098	0.103779	0.172020	0.000113	0.637162	1.230734	0.290565	4.850122e-15	20387.226666
27	3.731189	0.119188	0.015205	0.000017	0.637549	1.230015	0.290740	9.454649e-15	20387.229881
19	1.308999	0.089865	0.482895	0.000052	0.636599	1.228599	0.290307	4.483462e-10	20387.232239
32	2.699647	0.315091	0.058909	0.000035	0.636512	1.227689	0.290267	8.437228e-12	20387.240897
20	0.063430	0.646475	0.030202	0.000377	0.640787	1.227274	0.289655	3.001320e-06	20387.309598
14	0.109039	1.752206	0.068154	0.000323	0.637666	1.229887	0.290450	1.278566e-07	20387.434569
11	0.355181	0.212621	0.029912	0.000157	0.637670	1.229889	0.290449	2.826076e-08	20387.435537
21	0.199078	0.146460	0.338920	0.000173	0.637674	1.229910	0.290452	2.509801e-08	20387.435569
30	0.450847	0.463437	0.108087	0.000060	0.637667	1.229892	0.290450	1.971549e-08	20387.435620
13	0.318447	0.199418	0.790081	0.000020	0.637670	1.229881	0.290451	3.715424e-09	20387.435776
6	0.211271	0.683610	0.135427	0.000021	0.637667	1.229884	0.290449	3.917260e-13	20387.435812
4	0.670075	0.343239	1.980000	0.080259	0.632513	1.349728	2.000000	4.635045e-01	21059.797882
29	0.239562	0.556576	1.980000	0.031463	0.329749	0.596793	2.000000	1.044005e+00	23734.741735
10	5.874411	4.471090	1.980000	1.041556	6.145361	3.497838	1.999943	1.044327e+00	68992.522571
22	6.807827	3.585435	0.328641	1.217099	9.899311	1.285312	0.766758	1.552602e+00	71150.567474
3	2.157397	13.379040	1.980000	1.878685	1.662449	13.064076	1.999990	2.253679e+00	79308.735709

	nu1_0	nu2_0	Ti_0	Tc_0	m_0	nu1_opt	nu2_opt	Ti_opt	Tc_opt	m_opt	-logL
16	0.800833	1.775923	0.481231	0.034021	0.318631	0.993736	1.766119	6.606356e-02	0.854641	0.250961	18574.115189
11	2.291806	0.941107	0.417754	0.701156	0.280979	0.995790	1.768780	4.236017e-09	0.928104	0.250866	18574.136842
0	1.236540	2.847184	0.230733	3.068632	0.849626	0.995633	1.767191	8.952130e-02	0.836080	0.251181	18574.175975
9	1.153465	2.238480	0.164853	0.034364	1.314128	0.989694	1.757785	1.310731e-01	0.779922	0.251502	18574.445432
25	0.390701	1.897315	0.759622	0.446801	0.700774	0.994296	1.771438	1.354936e-01	0.779559	0.250514	18574.532394
1	1.006920	2.471800	0.560705	1.489177	0.540249	1.007125	1.785494	1.446806e-01	0.789880	0.249414	18574.820474
26	0.228105	5.126482	0.093396	0.087684	0.331095	1.005337	1.778832	1.211617e-03	0.932793	0.246241	18574.947916
28	0.402956	1.380227	0.550776	2.572429	0.157063	0.987372	1.754340	1.760467e-01	0.726734	0.252789	18575.195890
37	1.127186	4.079809	0.302806	0.931041	0.188286	1.002210	1.779032	3.747498e-02	0.918896	0.254661	18575.351564
30	0.591352	3.382051	0.243697	0.081228	0.530071	1.027357	1.796094	3.927616e-02	0.932124	0.249578	18576.313944

	nu1_0	nu2_0	T_0	m1_0	m2_0	nu1_opt	nu2_opt	T_opt	m1_opt	m2_opt	-logL
22	0.782220	1.448548	4.639001	0.072158	0.638342	0.920911	2.238617	1.244681	0.084147	0.494918	18104.966267
0	0.661003	7.163811	1.265469	0.257890	0.190122	0.920526	2.236116	1.241691	0.084361	0.494500	18104.971893
2	0.303698	1.230424	0.319167	0.222183	0.228972	0.922813	2.243404	1.254244	0.083942	0.494334	18104.983699
1	0.287213	1.386083	0.493232	0.096763	0.103789	0.915567	2.223840	1.228988	0.084270	0.497170	18105.001298
12	0.234347	0.568305	0.310253	0.112591	0.538129	0.912096	2.209951	1.210935	0.084060	0.497975	18105.141289
17	0.445903	1.955658	2.155467	0.025188	0.607712	0.928113	2.253559	1.279568	0.084935	0.493929	18105.232355
11	0.690910	1.962277	0.655486	0.716114	0.081910	0.929836	2.267261	1.289605	0.085435	0.493187	18105.358461
14	0.840609	1.431975	0.454631	0.032701	0.460575	0.933747	2.248026	1.278518	0.086974	0.490424	18105.547423
29	0.474111	0.694113	1.372862	0.274980	0.360411	0.915513	2.217017	1.244177	0.087882	0.492707	18105.676702
30	0.318476	1.235161	0.915553	0.202189	0.451816	0.889937	2.161759	1.149623	0.084294	0.506266	18106.119019

	nu1_0	nu2_0	T_0	m1_0	m2_0	nu1_opt	nu2_opt	T_opt	m2_opt	-logL
4	0.604286	1.689639	1.086538	0.048727	0.585935	0.720082	1.962785	0.791503	0.664581	18340.935861
8	1.197798	1.060943	0.387950	0.070337	0.177612	0.718508	1.960294	0.788789	0.664332	18340.951096
6	0.588394	3.336045	2.176182	0.035238	0.433038	0.721147	1.961776	0.791353	0.664706	18340.964537
3	0.400697	4.483709	0.486604	0.282474	0.612249	0.721305	1.961497	0.790718	0.663976	18340.965653
1	0.250388	0.933745	0.381047	0.123940	1.702059	0.717236	1.953896	0.786673	0.667389	18340.972028
9	1.163790	2.959924	0.753380	0.124571	0.155288	0.719228	1.967604	0.794908	0.669281	18341.008850
7	0.331430	2.693409	0.905852	0.078054	0.398861	0.743397	2.058666	0.871767	0.663042	18345.753837
0	0.258239	2.546149	2.410812	0.024995	0.808097	0.658928	1.875663	0.777684	0.783119	18374.981997
2	1.546874	0.697055	2.829271	0.246889	0.420687	1.453423	4.604630	3.820350	0.406251	18687.709775
5	1.561103	0.681461	2.626801	0.037716	0.265306	0.460973	1.135742	0.295629	0.584516	19282.370153

	nu1	nu2	T	m1	m2
sym_mig	0.994185	1.766127	0.922632	0.250688	0.250688
asym_mig	0.920911	2.238617	1.244681	0.084147	0.494918

	s_0	nu1_0	nu2_0	T_0	m1_0	m2_0	s_opt	nu1_opt	nu2_opt	T_opt	m1_opt	m2_opt	-logL
23	0.287604	0.584816	2.491513	0.577716	0.154881	0.912029	0.394336	1.028757	3.117404	3.130620	0.077685	0.576641	18814.402521
0	0.463402	0.738050	3.905796	0.739800	0.058625	0.441791	0.406139	1.080665	3.270704	3.494185	0.074393	0.547156	18814.762357
34	0.538469	1.255813	1.702693	0.635345	0.112638	0.259371	0.388115	1.063798	3.199012	3.234170	0.078406	0.557619	18814.877903
11	0.345561	0.585417	1.192358	0.584863	0.042089	0.583673	0.409575	1.037432	3.163748	3.314343	0.075790	0.574198	18814.902965
9	0.691487	0.915053	3.337032	0.675904	0.163826	0.807328	0.401345	1.127900	3.433437	3.748322	0.069455	0.527414	18815.275911
19	0.417002	0.379670	3.335798	1.655678	0.049658	0.456647	0.379121	1.082723	3.258400	3.404997	0.074637	0.545341	18815.329404
36	0.370682	0.321735	1.056795	0.583936	0.095256	0.255517	0.380064	1.084249	3.291107	3.415423	0.071002	0.551352	18815.491487
2	0.349065	1.580840	0.816700	0.551687	0.134324	0.540489	0.413618	0.989663	3.030744	3.048123	0.078635	0.599699	18815.663484
5	0.231016	1.236144	1.875334	1.265815	0.065793	0.251310	0.389511	0.968089	2.940081	2.803636	0.082062	0.612002	18815.733030
32	0.222590	0.812110	1.404866	1.332359	0.031135	1.195093	0.424820	0.974611	3.066101	3.252257	0.074601	0.609123	18818.545393

	nu1_0	nu2_0	Tc_0	m1_0	m2_0	Ti_0	nu1_opt	nu2_opt	Tc_opt	m1_opt	m2_opt	Ti_opt	-logL
38	3.122675	6.565590	4.556457	0.025178	0.150502	0.193779	2.884250	7.039549	6.939774	0.039132	0.237419	0.173159	17582.884413
15	2.457166	7.524601	5.393413	0.108182	0.288238	0.119209	2.439955	5.984046	5.708981	0.045731	0.279332	0.146457	17583.559438
6	3.803361	5.073436	4.055461	0.089821	0.147407	0.074291	2.507140	6.203819	6.007240	0.042590	0.272537	0.148739	17584.306651
10	2.988730	4.738118	4.034295	0.054839	0.356824	0.157766	2.092882	5.117888	4.727453	0.053452	0.326719	0.125626	17584.511983
48	2.942805	4.976983	2.734166	0.069029	0.212524	0.093116	2.033029	4.975807	4.562200	0.054811	0.336354	0.122034	17584.788795
37	1.698863	3.380579	5.956428	0.041756	0.236200	0.099436	2.012841	4.921590	4.465928	0.055141	0.338755	0.120221	17585.405692
9	3.311864	5.144676	3.960000	0.109423	0.383802	0.129523	1.798721	4.397479	3.879859	0.061955	0.379276	0.107981	17585.882553
25	3.772272	5.313562	3.820628	0.075429	0.294183	0.246429	2.276085	5.525343	5.185373	0.051899	0.292315	0.134893	17585.939796
34	2.097129	5.327970	5.500053	0.083428	0.525671	0.092544	1.842790	4.510400	4.019504	0.060138	0.371075	0.110615	17586.022421
46	1.642916	6.103883	3.960000	0.054931	0.349428	0.058007	1.781764	4.362300	3.835582	0.062103	0.383396	0.106964	17586.026218

	nu1_0	nu2_0	Tc_0	m1_0	m2_0	Ti_0	Tsc_0	nu1_opt	nu2_opt	Tc_opt	m1_opt	m2_opt	Ti_opt	Tsc_opt	-logL
10	3.393662	10.566381	7.920000	0.027990	0.147762	0.138175	0.000030	2.976386	7.290183	7.236861	0.037756	0.229864	0.178659	0.000019	17582.754156
11	3.738671	6.291449	7.920000	0.067358	0.181861	0.141373	0.000033	2.933155	7.168308	7.104183	0.038650	0.233178	0.176059	0.000012	17582.768290
18	1.310714	3.824504	4.369948	0.049779	0.307464	0.237146	0.033322	2.362064	5.842982	5.547352	0.046664	0.290175	0.140767	0.000027	17583.780419
22	1.837307	10.757118	7.645129	0.059177	0.229560	0.213559	0.000032	2.242446	5.484794	5.156093	0.050021	0.304977	0.134595	0.000019	17583.969711
15	2.096425	4.660374	4.265668	0.045547	0.244521	0.129254	0.000019	2.172535	5.310612	4.954453	0.051699	0.314579	0.130400	0.000015	17584.202249
29	2.330807	4.339575	7.920000	0.064959	0.400673	0.087552	0.000033	2.156040	5.304803	4.934450	0.050758	0.318995	0.129395	0.000015	17584.406282
14	1.779585	5.280102	4.381530	0.024307	0.165301	0.198832	0.000030	2.036089	4.973092	4.563480	0.054334	0.337488	0.123699	0.000012	17585.299903
37	1.305890	3.807041	4.293586	0.049922	0.207666	0.180538	0.000050	1.968934	4.846930	4.420515	0.056884	0.346840	0.118008	0.000100	17585.995715
35	1.749685	2.948985	7.920000	0.023613	0.395618	0.196917	0.000033	1.739157	4.251127	3.725114	0.063843	0.392371	0.104367	0.000066	17587.372433
20	2.313656	3.765053	5.049405	0.046481	0.379796	0.276645	0.000047	1.671541	4.090976	3.534206	0.065836	0.409312	0.100302	0.000093	17588.436814

Table of Contents

plot the data

divergence in isolation model

Interpretation of parameters

compare marginal spectra

plot residuals

adding gene flow

plot residuals

Interpretation of parameters

Translation of m

LRT

Recent secondary contact model

plot residuals

Interpretation of parameters

Compare best fit IM with RSC model spectrum

adding asymmetric gene flow

plot residuals

Interpretation of parameters

LRT

Is the migration rate from ery into par (m1) statistically significant?

compare symmetric with asymmetric gene flow

adding gradual size changes

plot residuals

compare asym_mig with IM model

Ancient migration model

plot residuals

parameter interpretation

comparison with continuous gene flow model

ancient migration, isolation and secondary contact

Conclusion

compare `asym_mig` with `IM` model