In [1]:
# load dadi module
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [2]:
%ll
In [3]:
%ll dadiExercises
In [2]:
# import 1D spectrum of ery
fs_ery = dadi.Spectrum.from_file('dadiExercises/ERY.FOLDED.sfs.dadi_format')
# import 1D spectrum of par
fs_par = dadi.Spectrum.from_file('dadiExercises/PAR.FOLDED.sfs.dadi_format')
In [3]:
import pylab
In [4]:
%matplotlib inline
In [5]:
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
In [6]:
pylab.plot(fs_par, 'go-', label='par')
pylab.plot(fs_ery, 'rs-', label='ery')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[6]:
Does estimating an unfolded spectrum with ANGSD and then folding yield a sensible folded SFS when the sites are not polarised with respect to an ancestral allele but with respect to the reference allele? Matteo Fumagalli thinks that this is sensible.
In [9]:
! cat ERY.unfolded.sfs
In [10]:
! cat PAR.unfolded.sfs
In [2]:
# import 1D unfolded spectrum of ery
fs_ery_unfolded = dadi.Spectrum.from_file('ERY.unfolded.sfs')
# import 1D unfolded spectrum of par
fs_par_unfolded = dadi.Spectrum.from_file('PAR.unfolded.sfs')
In [12]:
pylab.plot(fs_ery, 'ro-', label='folded by ANGSD')
pylab.plot(fs_ery_unfolded.fold(), 'bs-', label='folded by ' + r'$\delta$a$\delta$i')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[12]:
In [13]:
pylab.plot(fs_par, 'go-', label='folded by ANGSD')
pylab.plot(fs_par_unfolded.fold(), 'bs-', label='folded by ' + r'$\delta$a$\delta$i')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[13]:
For parallelus, the spectrum folded by dadi looks better than the spectrum folded by ANGSD. I am therefore going to use spectra folded in dadi.
In [3]:
fs_ery = fs_ery_unfolded.fold()
fs_par = fs_par_unfolded.fold()
In [15]:
# create link to built-in model function
func = dadi.Demographics1D.snm
# make the extrapolating version of the demographic model function
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [16]:
# setting the smallest grid size slightly larger than the largest population sample size
pts_l = [40, 50, 60]
ns = fs_ery.sample_sizes
ns
Out[16]:
In [17]:
# calculate unfolded AFS under standard neutral model (up to a scaling factor theta)
neutral_model = func_ex(0, ns, pts_l)
neutral_model
Out[17]:
In [18]:
theta_ery = dadi.Inference.optimal_sfs_scaling(neutral_model, fs_ery)
theta_ery
Out[18]:
In [19]:
theta_par = dadi.Inference.optimal_sfs_scaling(neutral_model, fs_par)
theta_par
Out[19]:
What effective ancestral population size would that imply?
According to section 5.2 in the dadi manual:
$$ \theta = 4 N_{ref}\, \mu_{L} \qquad \text{L: sequence length} $$

Then

$$ \mu_{L} = \mu_{site} \times L $$

So

$$ \theta = 4 N_{ref}\, \mu_{site} \times L $$

and

$$ N_{ref} = \frac{\theta}{4 \mu_{site} L} $$

Let's assume the mutation rate per nucleotide site per generation is $3\times 10^{-9}$ (see e. g. Liu2017).
In [20]:
mu = 3e-9
L = fs_ery.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The total sequence length for the ery spectrum is {0:,}.".format(int(L))
N_ref_ery = theta_ery/L/mu/4
print "The effective ancestral population size of ery (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref_ery))
In [21]:
mu = 3e-9
L = fs_par.data.sum() # this sums over all entries in the spectrum, including masked ones, i. e. also contains invariable sites
print "The total sequence length for par spectrum is {0:,}.".format(int(L))
N_ref_par = theta_par/L/mu/4
print "The effective ancestral population size of par (in number of diploid individuals) implied by this theta is: {0:,}.".format(int(N_ref_par))
These effective population sizes are consistent with those reported for other insect species (Lynch2016, fig. 3b).
In dadi, times are given in units of $2N_{ref}$ generations.
In [22]:
# compare neutral model prediction with ery spectrum
dadi.Plotting.plot_1d_comp_multinom(neutral_model.fold()[:19], fs_ery[:19], residual='linear')
In [23]:
# compare neutral model prediction with par spectrum
dadi.Plotting.plot_1d_comp_multinom(neutral_model.fold()[:19], fs_par[:19], residual='linear')
The lower plot (green line) shows the scaled Poisson residuals:

$$ residuals = \frac{model - data}{\sqrt{model}} $$

The model gives the expected counts in each frequency class. If these counts are Poisson distributed, then their variance is equal to their expectation. The differences between model and data are therefore scaled by the expected standard deviation of the model counts.
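As a minimal sketch of this calculation, assuming the model spectrum is first scaled by the optimal $\theta$ (as in the multinomial comparison above):

```python
import numpy as np

# minimal sketch: linearly scaled Poisson residuals of the neutral model fit;
# the model spectrum is first scaled by the optimal theta
scaled_model = dadi.Inference.optimally_scaled_sfs(neutral_model.fold(), fs_ery)
residuals = (scaled_model - fs_ery) / np.sqrt(scaled_model)
```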
A standard neutral model is of course unrealistic for either population, but the observed counts deviate from it by up to 90 standard deviations!
What could be done about this?
The greatest deviations are seen for the first two frequency classes, the ones that should provide the greatest amount of information for theta (Fu1994) and therefore probably also other parameters. I think masking out just one of the first two frequency classes will lead to highly biased inferences. Masking both frequency classes will reduce a lot of the power to infer the demographic history. I therefore think that masking will most likely not lead to better estimates.
Toni has suggested that the doubleton class is inflated due to "miscalling" heterozygotes as homozygotes. A heterozygote containing a singleton will then be "called" as a homozygote and thereby contribute to the doubleton count. This is aggravated by the fact that the sequenced individuals are all male and therefore possess only one X chromosome. The X chromosome is the fourth largest of the 9 chromosomes of these grasshoppers (8 autosomes + X) (see Gosalvez1988, fig. 2). That is, about 1/9th of the sequenced RAD loci are haploid, but ANGSD assumes all loci to be diploid. The genotype likelihoods it calculates all refer to diploid genotypes.
I think one potential reason for the extreme deviations is that the genotype likelihoods are generally biased toward homozygote genotypes (i. e. also for autosomal loci) due to PCR duplicates (see eq. 1 in Nielsen2012). So, one potential improvement would be to remove PCR duplicates.
Another potential improvement could come from subsampling 8/9th to 8/10th of the contigs in the SAF files and estimating an SFS from each subsample. Given enough subsamples, one should eventually be found that maximally excludes loci from the X chromosome. This subsample is expected to produce the smallest squared deviations from an expected SFS under the standard neutral model. However, one could argue that this attempt to exclude problematic loci could also inadvertently remove loci that strongly deviate from neutral expectations due to non-neutral evolution, again reducing the power to detect deviations from the standard neutral model. Alternatively, one could only require that the second MAF class be lower than the first and keep all contig subsamples and SFS's that fulfill that criterion, since that should hold under any demographic scenario. This approach, however, is compute intensive and difficult to implement.
Ludovic has suggested optimising for the fraction $p$ of all SNP's that lie on the X chromosome. SNP's on the X chromosome are counted twice in the SFS. This fraction $p$ should therefore be subtracted from even frequency classes and added to the frequency class containing SNP's that are half as frequent, e. g. from class 2 to class 1 or from class 8 to class 4. One could then optimise $p$ for the minimum deviation from a neutral spectrum.
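A minimal sketch of what this correction could look like (a hypothetical helper, not code that was actually run):

```python
def correct_X_inflation(fs, p):
    """Hypothetical sketch: move a fraction p of the SNP's in each even
    minor allele frequency class i to class i/2, undoing the count
    doubling expected for X-linked (haploid) loci genotyped as diploid."""
    corrected = fs.copy()
    n = int(fs.sample_sizes[0]) / 2  # highest minor allele frequency class
    for i in range(2, n+1, 2):
        moved = p * fs[i]
        corrected[i] -= moved
        corrected[i/2] += moved
    return corrected

# p could then be optimised for minimal deviation from a neutral spectrum
```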
Finally, it may be possible to acquire the Locusta genome contigs scaffolded or annotated by linkage group (see Wang2014). One could then blast-annotate the RAD contigs with the homologous Locusta linkage group and create subsets of RAD contigs that each exclude one linkage group. There may be a subset (presumably the one that excludes RAD contigs from the X chromosome) with markedly reduced residuals when compared to a neutral spectrum.
This model assumes that exponential growth (or decline) started some time $T$ in the past and that the current effective population size is a multiple $\nu$ of the ancient population size, i. e. the size before exponential growth began. So this model estimates just two parameters. If $\nu$ is estimated to be 1, this indicates that the population size hasn't changed (although see Myers2008). If it is below 1, this indicates exponential population decline (how realistic is that?). If it is above 1, this indicates exponential population growth.
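The optimisation results are loaded from file below. For reference, a minimal sketch of what one such optimisation run could have looked like (the perturbation fold and bounds shown here are illustrative assumptions, not necessarily the values actually used):

```python
# minimal sketch of a single Nelder-Mead optimisation run for the built-in
# exponential growth model; fold and bounds are illustrative assumptions
func = dadi.Demographics1D.growth  # params: (nu, T)
func_ex = dadi.Numerics.make_extrap_log_func(func)
p0 = dadi.Misc.perturb_params([1, 1], fold=2)  # perturb neutral start values
out = dadi.Inference.optimize_log_fmin(p0, fs_ery, func_ex, pts_l,
                                       lower_bound=[1e-4, 1e-4],
                                       upper_bound=[1e4, 5],
                                       verbose=0, full_output=True)
```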
In [24]:
import dill
# loading optimisation results from previous analysis
growth_ery = dill.load(open('OUT_exp_growth_model/ERY_perturb_ar_ery.dill'))
In [25]:
def flatten(array):
    """
    Returns a list of the flattened elements of all inner lists (or tuples).
    ****RECURSIVE****
    """
    import numpy
    res = []
    for el in array:
        if isinstance(el, (list, tuple, numpy.ndarray)):
            res.extend(flatten(el))
            continue
        res.append(el)
    return list(res)
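For example:

```python
flatten([1, (2, 3), [4, [5, 6]]])  # returns [1, 2, 3, 4, 5, 6]
```

This is used below to turn dadi's nested optimisation output into flat rows for a pandas DataFrame.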
In [26]:
import pandas as pd
In [27]:
i = 4 # where to find flag, 6 for BFGS, 4 for Nelder-Mead
successfull_popt_ery = [flatten(out)[:5] for out in growth_ery if out[1][i] == 0]
df = pd.DataFrame(data=successfull_popt_ery, \
columns=['nu_0', 'T_0', 'nu_opt', 'T_opt', '-logL'])
df.sort_values(by='-logL', ascending=True) # smaller is better
Out[27]:
In [28]:
print "The most likely parameter combination for this model suggests that the ery population size started to exponentially shrink at {0:,} generations ago to a current population size of {1:,}.".format(int(0.00769*2*N_ref_ery), int(0.14929*N_ref_ery))
This optimal parameter set was difficult to find: it is only recovered from starting values that are already quite close to the optimal values. The model enforces exponential population decline, which may not be very realistic.
A sweep through the parameter space did not lead to convergence of parameters for the parallelus spectrum. Optimisations starting from neutral parameter values (1, 1) randomly perturbed by up to 6-fold also did not converge. The exponential growth/decline model could therefore not be fit to the par spectrum.
The built-in `two_epoch` model defines a piecewise history: at some time $T$ in the past, the ancestral population instantaneously changed to the contemporary population size, which has a ratio of $\nu$ to the ancestral population size.
The parameter values with the highest likelihood ($\nu=0.0001$ and $T=0.000002$) hit the lower bounds of the parameter space that I had set. The time parameter $T$ that dadi infers with the highest likelihood corresponds to 3 generations and therefore does not make sense. The two epoch model could therefore not be fit to the erythropus spectrum.
In [29]:
ar_par_te = dill.load(open("OUT_two_epoch/PAR_perturb_ar_par.dill"))
In [30]:
i = 4 # index of flag with NM algorithm
successfull_popt = [flatten(out)[:5] for out in ar_par_te if out[1][i] == 0]
df = pd.DataFrame(data=successfull_popt, \
columns=['nu_0','T_0', 'nu_opt', 'T_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[30]:
Very different population size histories have the same likelihood. The most likely parameter combinations hit the upper bound that I set for the time parameter (6).
In [31]:
print "A T of 6 corresponds to {0:,} generations.".format(6*2*N_ref_par)
From the dadi manual:
If your fits often push the bounds of your parameter space (i.e., results are often at the bounds of one or more parameters), this indicates a problem. It may be that your bounds are too conservative, so try widening them. It may also be that your model is misspecified or that there are unaccounted biases in your data.
Ryan Gutenkunst in a dadi forum thread on parameters hitting the boundary:
This indicates that dadi is having trouble fitting your data. One possibility is that the history of the population includes important events that aren’t in your models. Another possibility is that your data is biased in ways that aren’t in your models. For example, maybe your missing calls for rare alleles.
I am not sure whether an even higher $T$ can reasonably be assumed. The expected time to the most recent common ancestor (MRCA) in a neutral genealogy is:

$$ E\big[T_{MRCA}\big] = 2 \Big(1-\frac{1}{n}\Big) $$

measured in $N_e$ generations (for the coalescent time unit see p. 66, 92 in Wakeley2009). Note that $T_{MRCA}$ is close to its large sample size limit of 2 already for moderate sample sizes. In order to put dadi's time parameter on the coalescent time scale, it needs to be multiplied by 2. A $T$ of 6 from dadi therefore corresponds to a $T$ of 12 on the coalescent time scale, which should be completely beyond the possible $T_{MRCA}$ of any genealogy.
See figure 3.4, p. 79, in Wakeley2009 for the distribution of the $T_{MRCA}$. For $n=36$ it has a mode close to 1.2 and an expected value of 1.94. Values for $T_{MRCA}$ greater than 4 are very unlikely given a standard coalescent model, but may be more likely under models including population expansion or gene flow from another population.
I think a two epoch model cannot be fit to the parallelus spectrum.
The built-in `bottlegrowth` model specifies an instantaneous size change $T \times 2N_{ref}$ generations in the past, immediately followed by the start of exponential growth (or decline) toward the contemporary population size. The model has three parameters: $\nu_B$, the ratio of the population size directly after the instantaneous change to the ancient population size; $\nu_F$, the ratio of the contemporary to the ancient population size; and $T$, the time of the instantaneous size change (in units of $2N_{ref}$ generations).
In [32]:
ar_ery = dill.load(open("OUT_bottlegrowth/ERY_perturb_ar_ery.dill"))
In [33]:
successfull_popt = [flatten(out)[:7] for out in ar_ery if out[1][4] == 0]
In [34]:
df = pd.DataFrame(data=successfull_popt, columns=['nuB_0', 'nuF_0', 'T_0', 'nuB_opt', 'nuF_opt', 'T_opt', '-logL'])
In [35]:
df.sort_values(by='-logL', ascending=True) # smaller is better
Out[35]:
These optimal parameter values were found by a broad sweep across the parameter space (data not shown).
The optimal parameters suggest the following:
In [36]:
print "At time {0:,} generations ago, the ery population size instantaneously increased by almost 40-fold (to {1:,}).".format(int(0.36794*2*N_ref_ery), int(39.2*N_ref_ery))
Can a short-term effective population size of 21 million reasonably be assumed?
In [37]:
print "This was followed by exponential decline towards a current population size of {0:,}.".format(int(0.4576*N_ref_ery))
There was no convergence on a set of optimal parameter values.
The built-in `three_epoch` model specifies a piecewise history (with only instantaneous population size changes instead of gradual changes). At time $TB+TF$ in the past, the ancient population underwent a size change, stayed at this size ($\nu_B \times N_{ref}$) for $TB \times 2N_{ref}$ generations and then underwent a second size change $TF \times 2N_{ref}$ generations in the past, to the contemporary population size ($\nu_F \times N_{ref}$). The model therefore has two population size parameters, $\nu_B$ and $\nu_F$, as well as two time parameters, $TB$ and $TF$.
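As a minimal sketch (the parameter values here are purely illustrative), a single model spectrum and its log likelihood can be computed with the built-in function like this:

```python
# minimal sketch: evaluate the built-in three_epoch model for one purely
# illustrative parameter combination (nuB, nuF, TB, TF)
func_ex = dadi.Numerics.make_extrap_log_func(dadi.Demographics1D.three_epoch)
model = func_ex([1.5, 0.25, 0.5, 0.3], fs_ery.sample_sizes, pts_l)
ll = dadi.Inference.ll_multinom(model.fold(), fs_ery)
```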
With 4 parameters, this model is already so complex that a sweep through the parameter space (i. e. starting optimisations from many different positions) is prohibitively time-consuming. I have therefore started optimisations with initial parameter values that were generated by randomly perturbing a neutral parameter combination, $\nu_B = \nu_F = TB = TF = 1$, by up to 4-fold. I can therefore not be certain whether the optimal parameter combinations found this way correspond to the global optimum or just to a local optimum of the likelihood function.
In [38]:
# load optimisation results from file
ar_ery = []
import glob
for filename in glob.glob("OUT_three_epoch/ERY_perturb_ar_ery*.dill"):
ar_ery.extend(dill.load(open(filename)))
In [39]:
# extract successful (i. e. converged) optimisations
successfull_popt = [flatten(out)[:9] for out in ar_ery if out[1][4] == 0]
# create data frame
df = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', '-logL'])
# sort data frame by negative log likelihood
df.sort_values(by='-logL', ascending=True) # smaller is better
Out[39]:
Note that all but the last parameter combination have the same likelihood. Fairly different parameter combinations have practically identical likelihood. A reduction of the contemporary population size to 1/4 of the ancient population size is quite consistently inferred ($\nu_F$). The ancient population size change ($\nu_B$) is not inferred consistently: it ranges from a 186-fold increase to a reduction to 1/3 of the ancient population size.
In [40]:
# remove one parameter combination with slightly lower logL than the others
df = df.sort_values(by='-logL').head(-1)
In [41]:
# the time of the ancient pop size change is TB+TF
df['TB+TF'] = pd.Series(df['TB_opt']+df['TF_opt'])
In [42]:
# extract columns from table
nuB = df.loc[:,'nuB_opt']
nuF = df.loc[:, 'nuF_opt']
Tb_Tf = df.loc[:, 'TB+TF']
TF = df.loc[:, 'TF_opt']
In [43]:
# turn nu (a ratio) into absolute Ne and T into generations
nuB_n = nuB*N_ref_ery
nuF_n = nuF*N_ref_ery
Tb_Tf_g = Tb_Tf*2*N_ref_ery
TF_g = TF*2*N_ref_ery
In [44]:
anc = [N_ref_ery] * len(nuB) # ancestral pop size
pres = [1] * len(nuB) # 1 generation in the past
past = [max(Tb_Tf_g)+1000] * len(nuB) # furthest time point in the past
In [45]:
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
pylab.rcParams['font.size'] = 12.0
# plot the best 21 parameter combinations
for x, y in zip(zip(past, Tb_Tf_g, Tb_Tf_g, TF_g, TF_g, pres), zip(anc, anc, nuB_n, nuB_n, nuF_n, nuF_n)):
pylab.semilogy(x, y)
pylab.xlabel('generations in the past')
pylab.ylabel('effective population size')
Out[45]:
This plot visualises the range of stepwise population size histories that the above parameter combinations imply (all with a negative log likelihood of 2168.11186). Most parameter combinations infer an ancient population size increase followed by a drastic population size collapse to less than 1/3 of the ancient population size that happened more than 400,000 generations ago.
Are the strength of the population size expansion, $(\nu_B)^{TB}$, and the strength of the population size reduction, $(\frac{1}{\nu_F})^{TF}$, correlated with each other?
In [46]:
pylab.plot(df['nuB_opt']**df['TB_opt'], (1.0/df['nuF_opt'])**df['TF_opt'], 'bo')
pylab.xlabel('strength of first size change')
pylab.ylabel('strength of second size change')
Out[46]:
If a long period of increased population size were correlated with a long period of decreased population size, this plot would show a positive correlation. That does not seem to be the case.
In [47]:
import dill
ar_par = dill.load(open("OUT_three_epoch/PAR_perturb_ar_par.dill"))
ar_par_extreme = dill.load(open("OUT_three_epoch/PAR_perturb_extreme_ar_par.dill"))
In [48]:
# add new output to previous output
successfull_popt = [flatten(out)[:9] for out in ar_par if out[1][4] == 0]
successfull_popt.extend([flatten(out)[:9] for out in ar_par_extreme if out[1][4] == 0])
# create data frame
df = pd.DataFrame(data=successfull_popt, \
columns=['nuB_0', 'nuF_0', 'TB_0', 'TF_0', 'nuB_opt', 'nuF_opt', 'TB_opt', 'TF_opt', '-logL'])
# sort data frame by optimal nuB to show the range of inferred histories
df.sort_values(by='nuB_opt', ascending=True)
Out[48]:
As can be seen, extremely different population size histories have the same likelihood.
In [49]:
# add time for ancient size change
df['TB+TF'] = pd.Series(df['TB_opt']+df['TF_opt'])
# extract columns from table
nuB = df.loc[:,'nuB_opt']
nuF = df.loc[:, 'nuF_opt']
Tb_Tf = df.loc[:, 'TB+TF']
TF = df.loc[:, 'TF_opt']
# turn nu into absolute Ne and T into generations
nuB_n = nuB*N_ref_par
nuF_n = nuF*N_ref_par
Tb_Tf_g = Tb_Tf*2*N_ref_par
TF_g = TF*2*N_ref_par
# auxiliary
anc = [N_ref_par] * len(nuB)
pres = [1] * len(nuB)
past = [max(Tb_Tf_g)+1000] * len(nuB)
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
pylab.rcParams['font.size'] = 12.0
for x, y in zip(zip(past, Tb_Tf_g, Tb_Tf_g, TF_g, TF_g, pres), zip(anc, anc, nuB_n, nuB_n, nuF_n, nuF_n)):
pylab.semilogy(x, y)
pylab.xlabel('generations in the past')
pylab.ylabel('effective population size')
Out[49]:
All parameter combinations consistently show a reduction in the contemporary population size with respect to the ancient population size ($\nu_F$). The ancient population size change is much less clear: it ranges from a 300-fold increase to a 100-fold reduction. This event is also inferred to have occurred quite far in the past, generally much more than 1 million generations ago.
This model (exponential growth followed by a bottleneck) specifies exponential growth/decline toward $\nu_B \times N_{ref}$ during a time period of $TB \times 2N_{ref}$ generations, after which (at time $TF \times 2N_{ref}$ generations in the past) the population size undergoes an instantaneous change to the contemporary size ($\nu_F \times N_{ref}$).
In [50]:
ar_ery = dill.load(open("OUT_expGrowth_bottleneck/ERY_perturb_ar_ery.dill"))
In [51]:
success = [flatten(out)[:9] for out in ar_ery if out[1][4] == 0]
df = pd.DataFrame(data=success, \
columns=['nuB_0','TB_0', 'nuF_0', 'TF_0', 'nuB_opt', 'TB_opt', 'nuF_opt', 'TF_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[51]:
Except for the last two, all optimal parameter combinations have the same likelihood and they do not deviate much from their respective initial values. In all successful optimisations above, $\nu_F$, the ratio of contemporary population size to ancient population size, converges to a value below 1/3. I think an ancient size increase or decrease cannot be inferred.
In [52]:
F = df[df['-logL'] < 2172].loc[:,['nuF_opt', 'TF_opt']]
pylab.plot(F['TF_opt']*2*N_ref_ery, F['nuF_opt']*N_ref_ery, 'bo')
pylab.xlabel('generations')
pylab.ylabel('contemporary pop. size')
Out[52]:
There is a clear linear correlation between the inferred contemporary population size ($\nu_F \times N_{ref}$) and the time ($TF$) for which the population has been at this size.
In [155]:
ar_par = dill.load(open("OUT_expGrowth_bottleneck/PAR_perturb_ar_ery.dill"))
In [156]:
success = [flatten(out)[:9] for out in ar_par if out[1][4] == 0]
df = pd.DataFrame(data=success, \
columns=['nuB_0','TB_0', 'nuF_0', 'TF_0', 'nuB_opt', 'TB_opt', 'nuF_opt', 'TF_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[156]:
Often, but not always, the optimal parameter values are close to their respective initial values. All optimal parameter combinations have the same likelihood. Only a recent (stepwise) population size decrease is consistently inferred ($\nu_F$).
In [161]:
F = df[df['TF_opt'] < 3].loc[:,['nuF_opt', 'TF_opt']]
pylab.plot(F['TF_opt']*2*N_ref_par, F['nuF_opt']*N_ref_par, 'bo')
pylab.xlabel('TF generations')
pylab.ylabel('contemporary pop. size')
Out[161]:
There could be a linear correlation between the duration of population size reduction and the size of the reduction, but it's weaker than for erythropus.
Not much can be inferred from the 1D spectra of either erythropus or parallelus. However, the dadi optimisations have consistently suggested a population size reduction, more than 400,000 generations ago, to less than 1/3 of the ancient population size. The reduction seems to have been much stronger in parallelus than in erythropus.
The fitting of 2D models was substantially easier than fitting 1D models and led to consistent and robust parameter estimates.
Could it be that the inability to model gene flow from an external source is responsible for the issues with fitting 1D models?
I have saved to files the results of each optimisation run (i. e. for each combination of starting parameter values).
In [4]:
# load results of fitting isolation with migration model to 2D spectrum
import dill, glob
split_mig_res = []
for filename in glob.glob("OUT_2D_models/split_mig*dill"):
split_mig_res.append(dill.load(open(filename)))
In [5]:
from utility_functions import *
In [6]:
import pandas as pd
In [8]:
success = [flatten(out)[:9] for out in split_mig_res if out[1][4] == 0]
df = pd.DataFrame(data=success, \
columns=['nu1_0','nu2_0', 'T_0', 'm_0', 'nu1_opt', 'nu2_opt', 'T_opt', 'm_opt', '-logL'])
df.sort_values(by='-logL', ascending=True)
Out[8]:
These are the successful parameter optimisations I have for the model that specifies a split at time $T$ in the past into two daughter populations with population size ratios (relative to $N_{ref}$) of $\nu_1$ for ery and $\nu_2$ for par. $m$ is the migration rate, which is assumed to be symmetrical; it is $2N_{ref}$ times the per-generation fraction of immigrants from the other population, i. e. the number of immigrant alleles per generation.
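As a minimal sketch (the perturbation fold and bounds are illustrative assumptions), one of these optimisation runs could have looked like this, with `sfs2d` being the folded 2D spectrum that is loaded further below:

```python
# minimal sketch of one optimisation run for the built-in split_mig model;
# params are (nu1, nu2, T, m); fold and bounds are illustrative assumptions
func = dadi.Demographics2D.split_mig
func_ex = dadi.Numerics.make_extrap_log_func(func)
p0 = dadi.Misc.perturb_params([1, 1, 1, 1], fold=2)
out = dadi.Inference.optimize_log_fmin(p0, sfs2d, func_ex, [40, 50, 60],
                                       lower_bound=[1e-4, 1e-4, 0, 0],
                                       upper_bound=[1e4, 1e4, 5, 10],
                                       verbose=0, full_output=True)
```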
The best parameter combination that I got from these optimisation runs is:
In [9]:
popt = df.sort_values(by='-logL', ascending=True).iloc[0, 4:8]
popt
Out[9]:
In [10]:
import numpy as np
In [11]:
popt = np.array(popt)
In [12]:
func = dadi.Demographics2D.split_mig # divergence-with-gene-flow model function (built-in)
func_ex = dadi.Numerics.make_extrap_log_func(func)
In [14]:
ns = np.array([36, 36])
pts_l = [40, 50, 60]
In [15]:
# get optimal model spectrum from the 2D 'divergence-with-migration' model
split_mig_2D_best_fit_model_spectrum = func_ex(popt, ns, pts_l)
In [19]:
split_mig_2D_best_fit_model_spectrum
Out[19]:
In [22]:
# import 2D unfolded spectrum
sfs2d_unfolded = dadi.Spectrum.from_file('dadiExercises/EryPar.unfolded.2dsfs.dadi_format')
# add population labels
sfs2d_unfolded.pop_ids = ["ery", "par"]
# fold the spectrum
sfs2d = sfs2d_unfolded.fold()
In [23]:
# need to scale model spectrum by optimal theta, which depends on the number of sites in the data
model_spectrum = dadi.Inference.optimally_scaled_sfs(split_mig_2D_best_fit_model_spectrum, sfs2d)
model_spectrum
Out[23]:
In [25]:
import pylab
%matplotlib inline
pylab.rcParams['figure.figsize'] = [12, 10]
pylab.rcParams['font.size'] = 14
model_spectrum.pop_ids = sfs2d.pop_ids
dadi.Plotting.plot_single_2d_sfs(model_spectrum.fold(), vmin=1, cmap=pylab.cm.jet)
Out[25]:
In [26]:
# get the marginal model spectra for ery and par from the optimal 2D model spectrum
fs_ery_model = model_spectrum.marginalize([1])
fs_par_model = model_spectrum.marginalize([0])
In [27]:
pylab.plot(fs_par_model.fold(), 'go-', label='par')
pylab.plot(fs_ery_model.fold(), 'rs-', label='ery')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[27]:
In [29]:
# import 1D unfolded spectrum of ery
fs_ery_unfolded = dadi.Spectrum.from_file('ERY.unfolded.sfs')
# import 1D unfolded spectrum of par
fs_par_unfolded = dadi.Spectrum.from_file('PAR.unfolded.sfs')
In [31]:
fs_ery = fs_ery_unfolded.fold()
fs_par = fs_par_unfolded.fold()
In [34]:
dadi.Plotting.plot_1d_comp_multinom(fs_ery_model.fold()[:19], fs_ery[:19])
In [35]:
dadi.Plotting.plot_1d_comp_multinom(fs_par_model.fold()[:19], fs_par[:19])
In [36]:
# log likelihoods
ll_model_ery = dadi.Inference.ll_multinom(fs_ery_model.fold(), fs_ery)
ll_model_ery
Out[36]:
In [37]:
ll_model_par = dadi.Inference.ll_multinom(fs_par_model.fold(), fs_par)
ll_model_par
Out[37]:
model | ERY | PAR |
---|---|---|
exponential growth | 1833.6 | ? |
two epoch | 5502.3 | 5502.3 |
bottlegrowth | 1649.9 | ? |
three epoch | 2168.1 | 6925.6 |
exp growth+bottleneck | 2167.7 | 6929.4 |
divergence-with-migration | 2243.4 | 5605.2 |
The table above shows the negative log likelihoods for the best parameter combination found for five 1D models (see the sections above), as well as the likelihoods of the 1D model spectra derived from a 2D model of divergence with migration.
The simple divergence-with-symmetrical-gene-flow model from which the marginal 1D best-fit model spectra were derived is most similar to the `two_epoch` model among the 1D models used for fitting: the population split coincides with an instantaneous change in population size, which then stays the same until the present. When viewed from one of the two populations, the divergence-with-gene-flow model is therefore identical to the `two_epoch` model with the addition of gene flow from the other diverging population.
For ery, the addition of gene flow provides a dramatic improvement compared to the `two_epoch` model. For par, the fit is even worse. So the lack of gene flow in the 1D models is at best only one of several factors that could explain the problems with fitting 1D models to 1D spectra.
It should be noted that the three 1D models that include two population size changes, `bottlegrowth`, `three_epoch` and `exp growth+bottleneck`, all achieve better fits to the data than the marginal 1D model spectra from the best-fit divergence-with-gene-flow model.