1D spectra


In [19]:
# load dadi module

import sys

import numpy as np

sys.path.insert(0, '/home/claudius/Downloads/dadi')

import dadi

In [2]:
%less PAR.unfolded.sfs

In [2]:
# load 1D spectra from file

fs_ery = dadi.Spectrum.from_file("ERY.unfolded.sfs")

fs_par = dadi.Spectrum.from_file("PAR.unfolded.sfs")

See lines 1429-1435 and 1661-1664 in assembly.sh for the command lines that created these 1D unfolded spectra. Only sites with data from at least 9 individuals were included in the analysis.

plot raw spectra


In [23]:
%matplotlib inline

from matplotlib import pyplot as plt

plt.rcParams['figure.figsize'] = [14, 12]
plt.rcParams['font.size'] = 14

In [4]:
plt.grid()
plt.plot(fs_ery.fold(), 'ro-', label='ery', linewidth=2.5, markersize=12)
plt.plot(fs_par.fold(), 'g^-', label='par', linewidth=2.5, markersize=12)
plt.xlabel("minor allele frequency")
plt.ylabel("SNP count")
plt.legend()


Out[4]:
<matplotlib.legend.Legend at 0x7fec7598f9d0>

Under the standard neutral model the site frequency spectrum is roughly proportional to $1/p$, $p$ being the minor allele frequency, and it should remain roughly monotonically decreasing in $p$ under most demographic histories or selection regimes. The first and second frequency classes ($p=1$ and $p=2$) in particular deviate from this expectation.
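
As a quick illustration of this expectation, here is a minimal sketch of the folded neutral spectrum up to the scaling factor theta, assuming 36 sampled chromosomes (matching the 37 entries, 0..36, of these spectra):

# sketch: expected folded neutral SFS (arbitrary scale) for n = 36 sampled chromosomes
n = 36
expected = [1.0/k + (1.0/(n-k) if k != n-k else 0.0) for k in range(1, n//2 + 1)]
plt.plot(range(1, n//2 + 1), expected, 'ko-')
plt.xlabel("minor allele frequency")
plt.ylabel("expected SNP count (arbitrary scale)")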


In [3]:
# fold spectra
fs_ery = fs_ery.fold()
fs_par = fs_par.fold()

plot residuals


In [6]:
# make the extrapolating version of the standard neutral model function
func_ex = dadi.Numerics.make_extrap_log_func(dadi.Demographics1D.snm)

# setting the smallest grid size slightly larger than the largest population sample size
pts_l = [40, 50, 60]

ns = fs_ery.sample_sizes

# calculate unfolded AFS under standard neutral model (up to a scaling factor theta)
snm = func_ex(0, ns, pts_l)
snm = snm.fold()

# scaled snm spectrum for ERY
snm_ery = dadi.Inference.optimally_scaled_sfs(snm, fs_ery)

# scaled snm spectrum for PAR
snm_par = dadi.Inference.optimally_scaled_sfs(snm, fs_par)

In [7]:
dadi.Plotting.plot_1d_comp_multinom(snm_ery[:19], fs_ery[:19], residual='linear')



In [8]:
dadi.Plotting.plot_1d_comp_multinom(snm_par[:19], fs_par[:19], residual='linear')


The lower plot (green line) shows the scaled Poisson residuals:

$$ \text{residuals} = \frac{\text{model} - \text{data}}{\sqrt{\text{model}}} $$

The model gives the expected count in each frequency class. If these counts are Poisson distributed, their variance equals their expectation, so the differences between model and data are scaled by the expected standard deviation of the model counts.
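
As a sketch, the same scaled residuals can also be computed directly from the spectra defined above, following this formula:

# scaled (linear) Poisson residuals computed by hand for both populations
resid_ery = (snm_ery - fs_ery) / np.sqrt(snm_ery)
resid_par = (snm_par - fs_par) / np.sqrt(snm_par)

plt.plot(resid_ery, 'ro-', label='ery')
plt.plot(resid_par, 'g^-', label='par')
plt.axhline(0, linestyle='--', color='k')
plt.xlabel("minor allele frequency")
plt.ylabel("scaled Poisson residual")
plt.legend()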

Ludovic has suggested optimising for the fraction $p$ of all SNP's that lie on the X chromosome. SNP's on the X chromosome are counted at twice their true allele count in the SFS, because all sequenced individuals were male and C. parallelus males have no Y chromosome homologous to the X, so each hemizygous X-linked allele is called as a homozygous diploid genotype. A fraction $p$ of the count should therefore be subtracted from each even frequency class and added to the frequency class with half the allele count, e.g. from class 2 to class 1 or from class 8 to class 4. One could then optimise $p$ to minimise the deviation from a neutral spectrum.


In [4]:
# number of segregating sites in ERY spectrum
fs_ery.S()


Out[4]:
43905.909479000002

In [5]:
# total number of sites in ERY spectrum 
fs_ery.data.sum()


Out[5]:
1638467.9999999998

In [6]:
# number of segregating sites in PAR spectrum
fs_par.S()


Out[6]:
43969.638017000005

In [7]:
# total number of sites in PAR spectrum
fs_par.data.sum()


Out[7]:
1214938.9999990002

PAR has about the same number of SNP's as ERY, but these are distributed over about one quarter fewer sites than in ERY.

Since the ERY and PAR spectra have almost the same number of SNP's, the scaled standard neutral spectra are also almost the same.

Deviation from these standard neutral model spectra shall be minimised.
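
A small sketch to quantify this, using the numbers above: the proportion of segregating sites differs noticeably between the two spectra.

# proportion of sites that are segregating (SNP's) in each spectrum
print(fs_ery.S() / fs_ery.data.sum())   # ~0.027 for ERY
print(fs_par.S() / fs_par.data.sum())   # ~0.036 for PAR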

cost function

Now I need to write a cost function: it should take p (the fraction of SNP's on the X chromosome), modify the observed SFS and return the sum of squared deviations between the modified SFS and the standard neutral model SFS.


In [8]:
# turn on floating point division by default, old behaviour via '//'
from __future__ import division

In [9]:
p = 1/2
p


Out[9]:
0.5

In [10]:
# the spectrum object is a masked numpy array
fs_par.mask


Out[10]:
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [11]:
# make copy of SFS, we don't want to modify the original in place
sfs = fs_par.copy()

In [12]:
sfs[1] = 0
sfs


Out[12]:
Spectrum([-- 0.0 12178.820193000001 5406.5333949999995 3011.110214 2913.832499
 1636.870339 482.55990800000006 2585.787032 854.9897370000001 595.999852
 605.580566 1642.878191 674.760989 501.638502 244.455727 1157.254931
 967.4315409999999 31.079364 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- --], folded=True, pop_ids=None)

In [13]:
fs_par


Out[13]:
Spectrum([-- 8478.055037 12178.820193000001 5406.5333949999995 3011.110214
 2913.832499 1636.870339 482.55990800000006 2585.787032 854.9897370000001
 595.999852 605.580566 1642.878191 674.760989 501.638502 244.455727
 1157.254931 967.4315409999999 31.079364 -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- --], folded=True, pop_ids=None)

In [14]:
sfs = fs_par.copy()

In [15]:
# test
for i in range(len(sfs)):
    if not i%2:
        print i, i//2


0 0
2 1
4 2
6 3
8 4
10 5
12 6
14 7
16 8
18 9
20 10
22 11
24 12
26 13
28 14
30 15
32 16
34 17
36 18

In [16]:
# modifies sfs
for i in range(len(sfs)): # from 0..36 in this case
    if not sfs.mask[i]: # if count category is not masked (i.e. index 1..18)
        if not i%2: # if count class is a multiple of 2
            #print i//2
            deduct = sfs[i] * p # get the proportion p of MAF counts in this class
            sfs[i] -= deduct # remove this count from the frequency class
            j = i//2 # get the frequency class corresponding to half the allele count
            sfs[j] += deduct # add the deducted count to this frequency class

In [17]:
sfs


Out[17]:
Spectrum([-- 14567.465133500002 7594.965203500001 6224.968564499999 2798.448623
 3211.832425 1639.874265 733.3791590000001 1871.5209815 870.5294190000001
 297.999926 605.580566 821.4390955 674.760989 250.819251 244.455727
 578.6274655 967.4315409999999 15.539682 -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- --], folded=True, pop_ids=None)

In [20]:
sfs.data.sum()


Out[20]:
1214938.9999990002

In [21]:
fs_par.data.sum()


Out[21]:
1214938.9999990002

In [24]:
plt.plot(sfs, 'bo-', markersize=12, label="modified")
plt.plot(fs_par, 'g^-', markersize=12, label='raw')
plt.legend()


Out[24]:
<matplotlib.legend.Legend at 0x7ffaa0d47450>

The modification with an arbitrary p of 0.5 seems to work.


In [22]:
# define function that modifies a SFS with a given p
def modify(p, sfs):
    """
    modifies SFS with given p
    subtracts proportion p from even count classes and adds to corresponding 
    haploid count class, e. g. from 2 onto 1 or from 16 onto 8
    returns modified sfs
    """
    fs = sfs.copy() # it is necessary to make sure that a copy of the observed SFS is
    # made here for each function evaluation
    
    # modify observed SFS with given p:
    for i in range(len(fs)):
        if not fs.mask[i]:
            if not i%2:
                deduct = fs[i] * p
                fs[i] -= deduct
                j = i//2
                fs[j] += deduct
    return fs

In [4]:
import numpy as np

In [24]:
# define cost function used for optimisation
def f(p, snm, sfs):
    """
    p: proportion of all SNP's on the X chromosome [float, 0<p<1]
    snm: standard neutral model spectrum (optimally scaled)
    sfs: observed SFS 
    """
    
    # modify sfs
    fs = modify(p, sfs)
            
    # return sum of squared deviations of modified SFS with snm spectrum:
    return np.sum( (fs - snm)**2 )  # note, due to the mask it is necessary to use the numpy sum function

In [25]:
fs_par


Out[25]:
Spectrum([-- 8478.055037 12178.820193000001 5406.5333949999995 3011.110214
 2913.832499 1636.870339 482.55990800000006 2585.787032 854.9897370000001
 595.999852 605.580566 1642.878191 674.760989 501.638502 244.455727
 1157.254931 967.4315409999999 31.079364 -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- --], folded=True, pop_ids=None)

In [26]:
# test cost function
f(0.5, snm_par, fs_par)


Out[26]:
31032447.579653159

In [27]:
# get cost when modifying snm spectrum with p of 0
# the cost should be 0
f(0.0, snm_par, snm_par)


Out[27]:
0.0

In [28]:
# get cost when modifying snm with p of 0.001
# the cost should be slightly greater than 0
f(0.001, snm_par, snm_par.copy())


Out[28]:
56.680295037473329

The cost function seems to be correct.

plot cost function


In [29]:
x = np.linspace(0, 1, 100)
y = map(lambda x: f(x, snm_par, fs_par), x)

In [30]:
plt.plot(x, y)
plt.grid()
plt.xlabel("p")
plt.ylabel("cost")
plt.title("cost function for PAR spectrum")


Out[30]:
<matplotlib.text.Text at 0x7fec6c2e89d0>

In [31]:
x = np.linspace(0, 1, 100)
y = map(lambda x: f(x, snm_par, snm_par), x)
plt.plot(x, y)
plt.grid()
plt.xlabel("p")
plt.ylabel("cost")
plt.title("cost function for SNM spectrum")


Out[31]:
<matplotlib.text.Text at 0x7fec6c2a2990>

As expected, the optimal value of $p$ when modifying the SNM spectrum to fit itself is 0.

optimise p for PAR


In [32]:
from scipy.optimize import minimize_scalar

In [38]:
minimize_scalar?

In [33]:
# test optimisation
# the function to optimise (cost function) 'f' gets the optimally scaled standard neutral SFS for PAR
# and the standard neutral SFS for PAR to modify
# I want to see an optimal value for p of zero as shown in the plot of the cost function above.
res_test = minimize_scalar(f, method = 'bounded', bounds = (-1.0, 1.0), args = (snm_par, snm_par))

In [34]:
res_test


Out[34]:
     fun: 0.0
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 5.5511151231257827e-17

This makes sense: the optimal value is practically zero.


In [35]:
# check that original SFS has not been modified
fs_par


Out[35]:
Spectrum([-- 8478.055037 12178.820193000001 5406.5333949999995 3011.110214
 2913.832499 1636.870339 482.55990800000006 2585.787032 854.9897370000001
 595.999852 605.580566 1642.878191 674.760989 501.638502 244.455727
 1157.254931 967.4315409999999 31.079364 -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- --], folded=True, pop_ids=None)

In [36]:
# run optimisation
# the function to optimise (cost function) 'f' gets the optimally scaled standard neutral SFS for PAR
# and the observed SFS of PAR. The optimal value should be somewhat below 0.4 as shown in the 
# plot of the cost function above.
res_par = minimize_scalar(f, method = 'bounded', bounds = (-1.0, 1.0), args = (snm_par, fs_par))

In [37]:
res_par


Out[37]:
     fun: 26568859.884093955
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 0.36433639778394755

This makes sense. The optimal value of $p$ is 0.36, which would mean that 36% of all SNP's lie on the X chromosome or are affected by "calling" of heterozygotes as homozygotes due to PCR drift or binomial allele sampling error.
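
As a rough sanity check (a sketch, under the interpretation that $p$ is the fraction of all SNP's affected), the implied absolute number of affected SNP's in the PAR spectrum is:

# implied number of X-linked or mis-called SNP's in PAR
print(res_par.x * fs_par.S())   # roughly 0.36 * 44,000 = ~16,000 SNP's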


In [38]:
# get modified SFS with optimal p
fs_par_m = modify(res_par.x, fs_par)
fs_par_m


Out[38]:
Spectrum([-- 12915.242515376021 8838.689763323191 6002.9048379506485
 2856.1494979761137 3130.976938157446 1639.0592181560992 665.3250728084156
 2065.320792202926 866.3130805251762 378.8554128425542 605.580566
 1044.317868893252 674.760989 318.8733371915845 244.455727
 735.6248381217492 967.4315409999999 19.756020474823902 -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- --], folded=True, pop_ids=None)

In [40]:
plt.plot(fs_par_m, 'bo-', markersize=12, label="modified")
plt.plot(fs_par, 'g^-', markersize=12, label='raw')
plt.title("PAR - comparison of optimally modified with observed SFS")
plt.legend()


Out[40]:
<matplotlib.legend.Legend at 0x7fec6c195550>

This modification looks good to me. It does not seem to change the general right-skew of the observed SFS, i.e. the excess of low-frequency variants, and it stays fairly close to the observed SFS for frequency classes greater than 2.


In [43]:
plt.plot(snm_par, 'bo-', markersize=12, label="SNM")
plt.plot(fs_par_m, 'g^-', markersize=12, label='modified')
plt.plot(fs_par, 'r>--', markersize=12, label='observed spectrum')
plt.title("PAR - comparison of observed, SNM and modified spectra")
plt.legend()


Out[43]:
<matplotlib.legend.Legend at 0x7fec66e9df50>

The modification does not seem to introduce an obvious bias towards the standard neutral model spectrum (SNM).


In [81]:
dadi.Spectrum.tofile?

optimise p for ERY


In [45]:
# run optimisation
# the function to optimise (objective function) 'f' gets the optimally scaled standard neutral SFS for ERY
# and a copy of the observed SFS of ERY
res_ery = minimize_scalar(f, method = 'bounded', bounds = (-1.0, 1.0), args = (snm_ery, fs_ery))

In [46]:
res_ery


Out[46]:
     fun: 4973116.0582852894
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 0.34106859148429103

The optimal value of $p$ is close to the value estimated for PAR, which is what I would expect.


In [47]:
# get modified SFS with optimal p
fs_ery_m = modify(res_ery.x, fs_ery)

In [48]:
plt.plot(snm_ery, 'bo-', markersize=12, label="SNM")
plt.plot(fs_ery_m, 'g^-', markersize=12, label="modified")
plt.plot(fs_ery, 'r>--', markersize=12, label='raw')
plt.title("ERY - comparison of optimally modified with observed SFS")
plt.legend()


Out[48]:
<matplotlib.legend.Legend at 0x7fec745bf490>

The modification does not seem to introduce a bias toward an SNM spectrum.

This seems to be a reasonable modification.

2D spectra


In [3]:
sfs2d = dadi.Spectrum.from_file("dadiExercises/EryPar.unfolded.2dsfs.dadi_format")

In [4]:
sfs2d.pop_ids = ['ery', 'par']

In [39]:
%matplotlib inline

import pylab
import numpy as np

plt.rcParams['figure.figsize'] = [12, 10]
plt.rcParams['font.size'] = 14

In [11]:
# plot unfolded spectrum
dadi.Plotting.plot_single_2d_sfs(sfs2d, vmin=1, cmap=pylab.cm.jet)


Out[11]:
<matplotlib.colorbar.Colorbar at 0x7fa1d7769990>

In [12]:
sfs2d = sfs2d.fold()

In [13]:
# plot folded spectrum
dadi.Plotting.plot_single_2d_sfs(sfs2d, vmin=1, cmap=pylab.cm.jet)


Out[13]:
<matplotlib.colorbar.Colorbar at 0x7fa1d7612f10>

In [14]:
sfs2d


Out[14]:
Spectrum([[-- 6651.315570000001 9534.20133 ..., 59.216701 19.075761999999997
  208.9562905]
 [4407.352191 453.932731 409.549382 ..., 3.9669600000000003
  4.851331999999999 --]
 [3746.5561820000003 295.73506799999996 261.799583 ..., 1e-06 -- --]
 ..., 
 [26.699661 1.593931 1e-06 ..., -- -- --]
 [78.253372 4.851332 -- ..., -- -- --]
 [208.9562905 -- -- ..., -- -- --]], folded=True, pop_ids=['ery', 'par'])

In [15]:
sfs2d.mask


Out[15]:
array([[ True, False, False, ..., False, False, False],
       [False, False, False, ..., False, False,  True],
       [False, False, False, ..., False,  True,  True],
       ..., 
       [False, False, False, ...,  True,  True,  True],
       [False, False,  True, ...,  True,  True,  True],
       [False,  True,  True, ...,  True,  True,  True]], dtype=bool)

I would now like to apply the same modification to the 2D spectrum.


In [16]:
if not 3%2:
    print 3//2

In [17]:
for i in range(len(sfs2d)):
    for j in range(len(sfs2d[i])):
        if not sfs2d.mask[i][j]:
            if not i%2 or not j%2: # if either index is even
                print i, j,
                print "-->",
                if not i%2: # if the first (row) index is even
                    k = i//2 # get half of index
                else:
                    k = i
                if not j%2: # if the second (column) index is even
                    l = j//2 # get half of index
                else:
                    l = j
                print k, l


0 1 --> 0 1
0 2 --> 0 1
0 3 --> 0 3
0 4 --> 0 2
0 5 --> 0 5
0 6 --> 0 3
0 7 --> 0 7
0 8 --> 0 4
0 9 --> 0 9
0 10 --> 0 5
0 11 --> 0 11
0 12 --> 0 6
0 13 --> 0 13
0 14 --> 0 7
0 15 --> 0 15
0 16 --> 0 8
0 17 --> 0 17
0 18 --> 0 9
0 19 --> 0 19
0 20 --> 0 10
0 21 --> 0 21
0 22 --> 0 11
0 23 --> 0 23
0 24 --> 0 12
0 25 --> 0 25
0 26 --> 0 13
0 27 --> 0 27
0 28 --> 0 14
0 29 --> 0 29
0 30 --> 0 15
0 31 --> 0 31
0 32 --> 0 16
0 33 --> 0 33
0 34 --> 0 17
0 35 --> 0 35
0 36 --> 0 18
1 0 --> 1 0
1 2 --> 1 1
1 4 --> 1 2
1 6 --> 1 3
1 8 --> 1 4
1 10 --> 1 5
1 12 --> 1 6
1 14 --> 1 7
1 16 --> 1 8
1 18 --> 1 9
1 20 --> 1 10
1 22 --> 1 11
1 24 --> 1 12
1 26 --> 1 13
1 28 --> 1 14
1 30 --> 1 15
1 32 --> 1 16
1 34 --> 1 17
2 0 --> 1 0
2 1 --> 1 1
2 2 --> 1 1
2 3 --> 1 3
2 4 --> 1 2
2 5 --> 1 5
2 6 --> 1 3
2 7 --> 1 7
2 8 --> 1 4
2 9 --> 1 9
2 10 --> 1 5
2 11 --> 1 11
2 12 --> 1 6
2 13 --> 1 13
2 14 --> 1 7
2 15 --> 1 15
2 16 --> 1 8
2 17 --> 1 17
2 18 --> 1 9
2 19 --> 1 19
2 20 --> 1 10
2 21 --> 1 21
2 22 --> 1 11
2 23 --> 1 23
2 24 --> 1 12
2 25 --> 1 25
2 26 --> 1 13
2 27 --> 1 27
2 28 --> 1 14
2 29 --> 1 29
2 30 --> 1 15
2 31 --> 1 31
2 32 --> 1 16
2 33 --> 1 33
2 34 --> 1 17
3 0 --> 3 0
3 2 --> 3 1
3 4 --> 3 2
3 6 --> 3 3
3 8 --> 3 4
3 10 --> 3 5
3 12 --> 3 6
3 14 --> 3 7
3 16 --> 3 8
3 18 --> 3 9
3 20 --> 3 10
3 22 --> 3 11
3 24 --> 3 12
3 26 --> 3 13
3 28 --> 3 14
3 30 --> 3 15
3 32 --> 3 16
4 0 --> 2 0
4 1 --> 2 1
4 2 --> 2 1
4 3 --> 2 3
4 4 --> 2 2
4 5 --> 2 5
4 6 --> 2 3
4 7 --> 2 7
4 8 --> 2 4
4 9 --> 2 9
4 10 --> 2 5
4 11 --> 2 11
4 12 --> 2 6
4 13 --> 2 13
4 14 --> 2 7
4 15 --> 2 15
4 16 --> 2 8
4 17 --> 2 17
4 18 --> 2 9
4 19 --> 2 19
4 20 --> 2 10
4 21 --> 2 21
4 22 --> 2 11
4 23 --> 2 23
4 24 --> 2 12
4 25 --> 2 25
4 26 --> 2 13
4 27 --> 2 27
4 28 --> 2 14
4 29 --> 2 29
4 30 --> 2 15
4 31 --> 2 31
4 32 --> 2 16
5 0 --> 5 0
5 2 --> 5 1
5 4 --> 5 2
5 6 --> 5 3
5 8 --> 5 4
5 10 --> 5 5
5 12 --> 5 6
5 14 --> 5 7
5 16 --> 5 8
5 18 --> 5 9
5 20 --> 5 10
5 22 --> 5 11
5 24 --> 5 12
5 26 --> 5 13
5 28 --> 5 14
5 30 --> 5 15
6 0 --> 3 0
6 1 --> 3 1
6 2 --> 3 1
6 3 --> 3 3
6 4 --> 3 2
6 5 --> 3 5
6 6 --> 3 3
6 7 --> 3 7
6 8 --> 3 4
6 9 --> 3 9
6 10 --> 3 5
6 11 --> 3 11
6 12 --> 3 6
6 13 --> 3 13
6 14 --> 3 7
6 15 --> 3 15
6 16 --> 3 8
6 17 --> 3 17
6 18 --> 3 9
6 19 --> 3 19
6 20 --> 3 10
6 21 --> 3 21
6 22 --> 3 11
6 23 --> 3 23
6 24 --> 3 12
6 25 --> 3 25
6 26 --> 3 13
6 27 --> 3 27
6 28 --> 3 14
6 29 --> 3 29
6 30 --> 3 15
7 0 --> 7 0
7 2 --> 7 1
7 4 --> 7 2
7 6 --> 7 3
7 8 --> 7 4
7 10 --> 7 5
7 12 --> 7 6
7 14 --> 7 7
7 16 --> 7 8
7 18 --> 7 9
7 20 --> 7 10
7 22 --> 7 11
7 24 --> 7 12
7 26 --> 7 13
7 28 --> 7 14
8 0 --> 4 0
8 1 --> 4 1
8 2 --> 4 1
8 3 --> 4 3
8 4 --> 4 2
8 5 --> 4 5
8 6 --> 4 3
8 7 --> 4 7
8 8 --> 4 4
8 9 --> 4 9
8 10 --> 4 5
8 11 --> 4 11
8 12 --> 4 6
8 13 --> 4 13
8 14 --> 4 7
8 15 --> 4 15
8 16 --> 4 8
8 17 --> 4 17
8 18 --> 4 9
8 19 --> 4 19
8 20 --> 4 10
8 21 --> 4 21
8 22 --> 4 11
8 23 --> 4 23
8 24 --> 4 12
8 25 --> 4 25
8 26 --> 4 13
8 27 --> 4 27
8 28 --> 4 14
9 0 --> 9 0
9 2 --> 9 1
9 4 --> 9 2
9 6 --> 9 3
9 8 --> 9 4
9 10 --> 9 5
9 12 --> 9 6
9 14 --> 9 7
9 16 --> 9 8
9 18 --> 9 9
9 20 --> 9 10
9 22 --> 9 11
9 24 --> 9 12
9 26 --> 9 13
10 0 --> 5 0
10 1 --> 5 1
10 2 --> 5 1
10 3 --> 5 3
10 4 --> 5 2
10 5 --> 5 5
10 6 --> 5 3
10 7 --> 5 7
10 8 --> 5 4
10 9 --> 5 9
10 10 --> 5 5
10 11 --> 5 11
10 12 --> 5 6
10 13 --> 5 13
10 14 --> 5 7
10 15 --> 5 15
10 16 --> 5 8
10 17 --> 5 17
10 18 --> 5 9
10 19 --> 5 19
10 20 --> 5 10
10 21 --> 5 21
10 22 --> 5 11
10 23 --> 5 23
10 24 --> 5 12
10 25 --> 5 25
10 26 --> 5 13
11 0 --> 11 0
11 2 --> 11 1
11 4 --> 11 2
11 6 --> 11 3
11 8 --> 11 4
11 10 --> 11 5
11 12 --> 11 6
11 14 --> 11 7
11 16 --> 11 8
11 18 --> 11 9
11 20 --> 11 10
11 22 --> 11 11
11 24 --> 11 12
12 0 --> 6 0
12 1 --> 6 1
12 2 --> 6 1
12 3 --> 6 3
12 4 --> 6 2
12 5 --> 6 5
12 6 --> 6 3
12 7 --> 6 7
12 8 --> 6 4
12 9 --> 6 9
12 10 --> 6 5
12 11 --> 6 11
12 12 --> 6 6
12 13 --> 6 13
12 14 --> 6 7
12 15 --> 6 15
12 16 --> 6 8
12 17 --> 6 17
12 18 --> 6 9
12 19 --> 6 19
12 20 --> 6 10
12 21 --> 6 21
12 22 --> 6 11
12 23 --> 6 23
12 24 --> 6 12
13 0 --> 13 0
13 2 --> 13 1
13 4 --> 13 2
13 6 --> 13 3
13 8 --> 13 4
13 10 --> 13 5
13 12 --> 13 6
13 14 --> 13 7
13 16 --> 13 8
13 18 --> 13 9
13 20 --> 13 10
13 22 --> 13 11
14 0 --> 7 0
14 1 --> 7 1
14 2 --> 7 1
14 3 --> 7 3
14 4 --> 7 2
14 5 --> 7 5
14 6 --> 7 3
14 7 --> 7 7
14 8 --> 7 4
14 9 --> 7 9
14 10 --> 7 5
14 11 --> 7 11
14 12 --> 7 6
14 13 --> 7 13
14 14 --> 7 7
14 15 --> 7 15
14 16 --> 7 8
14 17 --> 7 17
14 18 --> 7 9
14 19 --> 7 19
14 20 --> 7 10
14 21 --> 7 21
14 22 --> 7 11
15 0 --> 15 0
15 2 --> 15 1
15 4 --> 15 2
15 6 --> 15 3
15 8 --> 15 4
15 10 --> 15 5
15 12 --> 15 6
15 14 --> 15 7
15 16 --> 15 8
15 18 --> 15 9
15 20 --> 15 10
16 0 --> 8 0
16 1 --> 8 1
16 2 --> 8 1
16 3 --> 8 3
16 4 --> 8 2
16 5 --> 8 5
16 6 --> 8 3
16 7 --> 8 7
16 8 --> 8 4
16 9 --> 8 9
16 10 --> 8 5
16 11 --> 8 11
16 12 --> 8 6
16 13 --> 8 13
16 14 --> 8 7
16 15 --> 8 15
16 16 --> 8 8
16 17 --> 8 17
16 18 --> 8 9
16 19 --> 8 19
16 20 --> 8 10
17 0 --> 17 0
17 2 --> 17 1
17 4 --> 17 2
17 6 --> 17 3
17 8 --> 17 4
17 10 --> 17 5
17 12 --> 17 6
17 14 --> 17 7
17 16 --> 17 8
17 18 --> 17 9
18 0 --> 9 0
18 1 --> 9 1
18 2 --> 9 1
18 3 --> 9 3
18 4 --> 9 2
18 5 --> 9 5
18 6 --> 9 3
18 7 --> 9 7
18 8 --> 9 4
18 9 --> 9 9
18 10 --> 9 5
18 11 --> 9 11
18 12 --> 9 6
18 13 --> 9 13
18 14 --> 9 7
18 15 --> 9 15
18 16 --> 9 8
18 17 --> 9 17
18 18 --> 9 9
19 0 --> 19 0
19 2 --> 19 1
19 4 --> 19 2
19 6 --> 19 3
19 8 --> 19 4
19 10 --> 19 5
19 12 --> 19 6
19 14 --> 19 7
19 16 --> 19 8
20 0 --> 10 0
20 1 --> 10 1
20 2 --> 10 1
20 3 --> 10 3
20 4 --> 10 2
20 5 --> 10 5
20 6 --> 10 3
20 7 --> 10 7
20 8 --> 10 4
20 9 --> 10 9
20 10 --> 10 5
20 11 --> 10 11
20 12 --> 10 6
20 13 --> 10 13
20 14 --> 10 7
20 15 --> 10 15
20 16 --> 10 8
21 0 --> 21 0
21 2 --> 21 1
21 4 --> 21 2
21 6 --> 21 3
21 8 --> 21 4
21 10 --> 21 5
21 12 --> 21 6
21 14 --> 21 7
22 0 --> 11 0
22 1 --> 11 1
22 2 --> 11 1
22 3 --> 11 3
22 4 --> 11 2
22 5 --> 11 5
22 6 --> 11 3
22 7 --> 11 7
22 8 --> 11 4
22 9 --> 11 9
22 10 --> 11 5
22 11 --> 11 11
22 12 --> 11 6
22 13 --> 11 13
22 14 --> 11 7
23 0 --> 23 0
23 2 --> 23 1
23 4 --> 23 2
23 6 --> 23 3
23 8 --> 23 4
23 10 --> 23 5
23 12 --> 23 6
24 0 --> 12 0
24 1 --> 12 1
24 2 --> 12 1
24 3 --> 12 3
24 4 --> 12 2
24 5 --> 12 5
24 6 --> 12 3
24 7 --> 12 7
24 8 --> 12 4
24 9 --> 12 9
24 10 --> 12 5
24 11 --> 12 11
24 12 --> 12 6
25 0 --> 25 0
25 2 --> 25 1
25 4 --> 25 2
25 6 --> 25 3
25 8 --> 25 4
25 10 --> 25 5
26 0 --> 13 0
26 1 --> 13 1
26 2 --> 13 1
26 3 --> 13 3
26 4 --> 13 2
26 5 --> 13 5
26 6 --> 13 3
26 7 --> 13 7
26 8 --> 13 4
26 9 --> 13 9
26 10 --> 13 5
27 0 --> 27 0
27 2 --> 27 1
27 4 --> 27 2
27 6 --> 27 3
27 8 --> 27 4
28 0 --> 14 0
28 1 --> 14 1
28 2 --> 14 1
28 3 --> 14 3
28 4 --> 14 2
28 5 --> 14 5
28 6 --> 14 3
28 7 --> 14 7
28 8 --> 14 4
29 0 --> 29 0
29 2 --> 29 1
29 4 --> 29 2
29 6 --> 29 3
30 0 --> 15 0
30 1 --> 15 1
30 2 --> 15 1
30 3 --> 15 3
30 4 --> 15 2
30 5 --> 15 5
30 6 --> 15 3
31 0 --> 31 0
31 2 --> 31 1
31 4 --> 31 2
32 0 --> 16 0
32 1 --> 16 1
32 2 --> 16 1
32 3 --> 16 3
32 4 --> 16 2
33 0 --> 33 0
33 2 --> 33 1
34 0 --> 17 0
34 1 --> 17 1
34 2 --> 17 1
35 0 --> 35 0
36 0 --> 18 0

This shows from which cell to which cell the SNP count is transferred. A proportion $p$ of the SNP count in a source cell is deducted and added to the cell that corresponds to half the allele count in PAR, ERY or both. If the allele count (equal to the index) in the source cell is odd for one of the populations, that index is left unchanged when locating the target cell; only the index of the other population, which must be even for a transfer to happen at all, is halved.
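
The index rule can be factored into a small helper function (a sketch; the loops below inline the same logic rather than calling a helper like this):

def half_if_even(i):
    """Hypothetical helper: even allele counts are halved to find the target
    cell, odd allele counts are left unchanged."""
    return i//2 if i % 2 == 0 else i

# e.g. the count in cell (8, 5) donates a fraction p to cell (4, 5), as in the listing above
print(half_if_even(8))   # -> 4
print(half_if_even(5))   # -> 5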


In [18]:
x = sfs2d.data.copy()
x


Out[18]:
array([[  1.07020142e+06,   6.65131557e+03,   9.53420133e+03, ...,
          5.92167010e+01,   1.90757620e+01,   2.08956290e+02],
       [  4.40735219e+03,   4.53932731e+02,   4.09549382e+02, ...,
          3.96696000e+00,   4.85133200e+00,   0.00000000e+00],
       [  3.74655618e+03,   2.95735068e+02,   2.61799583e+02, ...,
          1.00000000e-06,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  2.66996610e+01,   1.59393100e+00,   1.00000000e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  7.82533720e+01,   4.85133200e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.08956290e+02,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])

In [19]:
x[0][0] = 1

In [20]:
x


Out[20]:
array([[  1.00000000e+00,   6.65131557e+03,   9.53420133e+03, ...,
          5.92167010e+01,   1.90757620e+01,   2.08956290e+02],
       [  4.40735219e+03,   4.53932731e+02,   4.09549382e+02, ...,
          3.96696000e+00,   4.85133200e+00,   0.00000000e+00],
       [  3.74655618e+03,   2.95735068e+02,   2.61799583e+02, ...,
          1.00000000e-06,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  2.66996610e+01,   1.59393100e+00,   1.00000000e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  7.82533720e+01,   4.85133200e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.08956290e+02,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])

In [21]:
# check that original sfs has not changed
sfs2d.data


Out[21]:
array([[  1.07020142e+06,   6.65131557e+03,   9.53420133e+03, ...,
          5.92167010e+01,   1.90757620e+01,   2.08956290e+02],
       [  4.40735219e+03,   4.53932731e+02,   4.09549382e+02, ...,
          3.96696000e+00,   4.85133200e+00,   0.00000000e+00],
       [  3.74655618e+03,   2.95735068e+02,   2.61799583e+02, ...,
          1.00000000e-06,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  2.66996610e+01,   1.59393100e+00,   1.00000000e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  7.82533720e+01,   4.85133200e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.08956290e+02,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])

This seems to work without raising a warning.


In [22]:
sfs = sfs2d.data.copy()

In [23]:
p = 0.5

In [24]:
for i in range(len(sfs)):
    for j in range(len(sfs[i])):
        if not sfs2d.mask[i][j]:
            if not i%2 or not j%2:
                deduct = sfs[i][j] * p
                sfs[i][j] -= deduct
                if not i%2:
                    k = i//2
                else:
                    k = i
                if not j%2:
                    l = j//2
                else:
                    l = j
                sfs[k][l] += deduct

In [25]:
# recreate dadi Spectrum object
sfs2d_m = dadi.Spectrum(data=sfs, mask=sfs2d.mask, data_folded=True, pop_ids=['ery', 'par'])

In [26]:
dadi.Plotting.plot_single_2d_sfs(sfs2d_m, vmin=1, cmap=pylab.cm.jet)


Out[26]:
<matplotlib.colorbar.Colorbar at 0x7fa1d7478c10>

This does look pretty good to me.


In [27]:
dadi.Plotting.plot_2d_comp_multinom(data=sfs2d, model=sfs2d_m, vmin=1, title=['original', 'modified'])


I think the modification does a good job.

cost function


In [28]:
def modify2D(p, sfs):
    """
    modifies an sfs with given p
    returns modified sfs
    leaves original sfs unchanged
    """
    # make copy of 2D array for modification
    fs = sfs.data.copy()
    
    for i in range(len(fs)):
        for j in range(len(fs[i])):
            if not sfs.mask[i][j]:
                if not i%2 or not j%2:
                    deduct = fs[i][j] * p
                    fs[i][j] -= deduct
                    if not i%2:
                        k = i//2
                    else:
                        k = i
                    if not j%2:
                        l = j//2
                    else:
                        l = j
                    fs[k][l] += deduct

    # recreate a dadi Spectrum object                
    fs_m = dadi.Spectrum(data=fs, mask=sfs.mask, data_folded=True, pop_ids=['ery', 'par'])
                
    return fs_m

In [29]:
# define cost function used for optimisation
def f_2D(p, snm, sfs):
    """
    p: proportion of all SNP's on the X chromosome [float, 0<p<1]
    snm: standard neutral model spectrum (optimally scaled)
    sfs: observed SFS 
    """
    
    # modify sfs
    fs = modify2D(p, sfs)
            
    # return sum of squared deviations of modified SFS with snm spectrum:
    return np.sum( (fs - snm)**2 )  # note, due to the mask it is necessary to use the numpy sum function

get neutral model spectra


In [30]:
dadi.Demographics2D.snm?

In [31]:
# make the extrapolating version of the demographic model function
func_ex = dadi.Numerics.make_extrap_log_func(dadi.Demographics2D.snm)

# setting the smallest grid size slightly larger than the largest population sample size

pts_l = [40, 50, 60]

ns = sfs2d.sample_sizes

# calculate unfolded AFS under standard neutral model (up to a scaling factor theta)

snm = func_ex(0, ns, pts_l)
snm = snm.fold()

# scaled snm spectrum
snm = dadi.Inference.optimally_scaled_sfs(snm, sfs2d)

In [32]:
dadi.Plotting.plot_single_2d_sfs(snm, vmin=1, cmap=pylab.cm.jet)


Out[32]:
<matplotlib.colorbar.Colorbar at 0x7fa1d6e17910>

I don't think that this would be a good model spectrum for fitting. We know that ERY and PAR are highly diverged and certainly not a single panmictic population.

I think the most parsimonious model would be a simple divergence in isolation model.


In [33]:
def split_nomig(params, ns, pts):
    """
    params = (nu1,nu2,T)
    ns = (n1,n2)

    Split into two populations of specified size, no migration.

    nu1: Size of population 1 after split.
    nu2: Size of population 2 after split.
    T: Time in the past of split (in units of 2*Na generations) 
    n1,n2: Sample sizes of resulting Spectrum
    pts: Number of grid points to use in integration.
    """
    nu1,nu2,T = params

    xx = dadi.Numerics.default_grid(pts)

    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)

    phi = dadi.Integration.two_pops(phi, xx, T, nu1, nu2, m12=0, m21=0)

    fs = dadi.Spectrum.from_phi(phi, ns, (xx,xx))
    return fs

This model takes three parameters. I am going to assume that $\nu_1=\nu_2=1$ and that $T=1$.


In [34]:
# make the extrapolating version of the demographic model function
func_ex = dadi.Numerics.make_extrap_log_func(split_nomig)

In [35]:
# calculate unfolded AFS under divergence in isolation model (dim) (up to a scaling factor theta)
params = (1, 1, 1)

ns = sfs2d.sample_sizes
pts_l = [40, 50, 60]

dim = func_ex(params, ns, pts_l)
dim = dim.fold()

# optimally scale divergence in isolation model spectrum
dim = dadi.Inference.optimally_scaled_sfs(dim, sfs2d)

In [36]:
dadi.Plotting.plot_single_2d_sfs(dim, vmin=1, cmap=pylab.cm.jet)


Out[36]:
<matplotlib.colorbar.Colorbar at 0x7fa1d6df5e90>

This will be the model against which the modification of the observed spectrum is optimised.

optimise p


In [37]:
from scipy.optimize import minimize_scalar

In [40]:
# test optimisation
# the function to optimise (cost function) 'f_2D' gets the optimally scaled divergence in isolation model
# to fit to itself.
# I want to see an optimal value for p of zero.
res_test = minimize_scalar(f_2D, method = 'bounded', bounds = (-1.0, 1.0), args = (dim, dim))

In [41]:
res_test


Out[41]:
     fun: 0.0
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 0.0

That is as expected.


In [42]:
res = minimize_scalar(f_2D, method='bounded', bounds=(-1.0, 1.0), args=(dim, sfs2d))

In [43]:
res


Out[43]:
     fun: 30634672.960740261
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 0.35319246572685198

This is remarkably close to the values of $p$ inferred from the 1D spectra, which is a good sign.
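
For a direct comparison (a small sketch, assuming the 1D optimisation results res_ery and res_par are still in memory from the session above):

# optimal p estimated from the 1D and 2D spectra
print(res_ery.x)   # ~0.34 (1D ERY)
print(res_par.x)   # ~0.36 (1D PAR)
print(res.x)       # ~0.35 (2D ERY/PAR)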


In [44]:
sfs2d_m = modify2D(res.x, sfs2d)

In [45]:
dadi.Plotting.plot_single_2d_sfs(sfs2d_m, vmin=1, cmap=pylab.cm.jet)


Out[45]:
<matplotlib.colorbar.Colorbar at 0x7fa1d6b18a10>

This looks reasonable.


In [46]:
dadi.Plotting.plot_2d_comp_multinom(data=sfs2d, model=sfs2d_m, vmin=1, title=['observed', 'modified'])


This looks good to me.

sensitivity of p to the "neutral" model

I need to know whether I introduce a bias by choosing a particular null model for the fitting. I therefore want to do two more fits: one using a null model with a longer divergence time and one with a shorter divergence time.
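
The two fits below could also be consolidated into a small loop (a sketch of the procedure that follows, assuming func_ex is still the extrapolated split_nomig function and ns, pts_l are set as above):

# refit p under divergence-in-isolation null models with different divergence times T
for T in (0.5, 1.0, 2.0):
    dim_T = func_ex((1, 1, T), ns, pts_l).fold()
    dim_T = dadi.Inference.optimally_scaled_sfs(dim_T, sfs2d)
    res_T = minimize_scalar(f_2D, method='bounded', bounds=(-1.0, 1.0), args=(dim_T, sfs2d))
    print("T = %s --> p = %.3f" % (T, res_T.x))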

doubling divergence time


In [47]:
# calculate unfolded AFS under divergence in isolation model (dim) (up to a scaling factor theta)
params = (1, 1, 2) # note the doubling of divergence time

ns = sfs2d.sample_sizes
pts_l = [40, 50, 60]

dim = func_ex(params, ns, pts_l)
dim = dim.fold()

# optimally scale divergence in isolation model spectrum
dim = dadi.Inference.optimally_scaled_sfs(dim, sfs2d)

In [48]:
# plot null model
dadi.Plotting.plot_single_2d_sfs(dim, vmin=1, cmap=pylab.cm.jet)


Out[48]:
<matplotlib.colorbar.Colorbar at 0x7fa1d7212910>

In [49]:
res_a = minimize_scalar(f_2D, method='bounded', bounds=(-1.0, 1.0), args=(dim, sfs2d))

In [50]:
res_a


Out[50]:
     fun: 58350581.920640469
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 0.30448684311763069

p has decreased by 5 percentage points.

halving divergence time


In [51]:
# calculate unfolded AFS under divergence in isolation model (dim) (up to a scaling factor theta)
params = (1, 1, 0.5) # note the halving of divergence time

ns = sfs2d.sample_sizes
pts_l = [40, 50, 60]

dim = func_ex(params, ns, pts_l)
dim = dim.fold()

# optimally scale divergence in isolation model spectrum
dim = dadi.Inference.optimally_scaled_sfs(dim, sfs2d)

In [52]:
# plot null model
dadi.Plotting.plot_single_2d_sfs(dim, vmin=1, cmap=pylab.cm.jet)


Out[52]:
<matplotlib.colorbar.Colorbar at 0x7fa1d68fa550>

In [53]:
res_b = minimize_scalar(f_2D, method='bounded', bounds=(-1.0, 1.0), args=(dim, sfs2d))

In [54]:
res_b


Out[54]:
     fun: 25419201.221575346
 message: 'Solution found.'
    nfev: 6
  status: 0
 success: True
       x: 0.39334370731547508

p has increased by 4 percentage points.

How different are the corresponding corrected spectra?


In [55]:
# get corrected spectra

sfs2d_m_a = modify2D(res_a.x, sfs2d) # p=0.30
sfs2d_m_b = modify2D(res_b.x, sfs2d) # p=0.39

In [56]:
# plot corrected spectra and residuals
dadi.Plotting.plot_2d_comp_multinom(data=sfs2d_m_a, model=sfs2d_m_b, vmin=1, title=['T=2', 'T=0.5'])


The difference looks small, but I need to use the differently corrected spectra for model fitting to see whether the degree of correction has a strong influence on parameter inference.
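
A crude way to quantify how different the two corrected spectra are (a sketch), compared with how far either is from the uncorrected spectrum:

# sum of squared differences between the two corrected spectra ...
print(np.sum((sfs2d_m_a - sfs2d_m_b)**2))
# ... compared with the squared deviation of one of them from the uncorrected spectrum
print(np.sum((sfs2d_m_a - sfs2d)**2))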

marginal spectra


In [57]:
ery_marg = sfs2d.marginalize([1])
par_marg = sfs2d.marginalize([0])

ery_p030_marg = sfs2d_m_a.marginalize([1])
par_p030_marg = sfs2d_m_a.marginalize([0])

ery_p039_marg = sfs2d_m_b.marginalize([1])
par_p039_marg = sfs2d_m_b.marginalize([0])

In [64]:
pylab.rcParams['figure.figsize'] = [18, 10]

In [67]:
pylab.subplot(121)
pylab.grid()
pylab.plot(ery_marg, 'ro--', label='ery uncorrected', linewidth=1.0, markersize=4)
pylab.plot(par_marg, 'g^--', label='par uncorrected', linewidth=1.0, markersize=4)
pylab.plot(ery_p030_marg, 'ro-', label='ery, p=0.30', linewidth=2.5, markersize=12)
pylab.plot(par_p030_marg, 'g^-', label='par, p=0.30', linewidth=2.5, markersize=12)
pylab.xlabel("minor allele frequency")
pylab.ylabel("SNP count")
pylab.ylim(0, 14000)
pylab.legend()

pylab.subplot(122)
pylab.grid()
pylab.plot(ery_marg, 'ro--', label='ery uncorrected', linewidth=1.0, markersize=4)
pylab.plot(par_marg, 'g^--', label='par uncorrected', linewidth=1.0, markersize=4)
pylab.plot(ery_p039_marg, 'ro-', label='ery, p=0.39', linewidth=2.5, markersize=12)
pylab.plot(par_p039_marg, 'g^-', label='par, p=0.39', linewidth=2.5, markersize=12)
pylab.xlabel("minor allele frequency")
pylab.ylabel("SNP count")
pylab.ylim(0, 14000)
pylab.legend()


Out[67]:
<matplotlib.legend.Legend at 0x7fa1d4621890>

In [ ]:

save modified spectra


In [84]:
# save PAR modified spectrum to file
dadi.Spectrum.tofile(fs_par_m, fid="PAR_modified.sfs", comment_lines=['Ludovics correction applied'])

In [25]:
# load saved spectrum from file
fs = dadi.Spectrum.from_file("PAR_modified.sfs")

In [26]:
fs


Out[26]:
Spectrum([-- 12915.24251537602 8838.689763323191 6002.904837950648 2856.149497976114
 3130.976938157446 1639.059218156099 665.3250728084156 2065.320792202926
 866.3130805251762 378.8554128425542 605.580566 1044.317868893252
 674.760989 318.8733371915845 244.455727 735.6248381217492
 967.4315409999999 19.7560204748239 -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- --], folded=True, pop_ids=None)

In [27]:
plt.plot(fs, 'g^-', label='par', linewidth=2.5, markersize=12)


Out[27]:
[<matplotlib.lines.Line2D at 0x7ffa97c02f50>]

In [28]:
fs.data.sum()


Out[28]:
1214938.9999989998

In [93]:
# save modified ERY spectrum to file
dadi.Spectrum.tofile(fs_ery_m, fid="ERY_modified.sfs", comment_lines=['Ludovics correction applied'])

In [29]:
# load saved spectrum from file
fs = dadi.Spectrum.from_file("ERY_modified.sfs")

fs.data.sum()


Out[29]:
1638468.0000000002

In [94]:
# save modified 2D spectrum to file 
dadi.Spectrum.tofile(sfs2d_m, fid="EryPar_modified.2dsfs", comment_lines=['Ludovics correction applied'])

In [98]:
# load saved 2D spectrum from file
fs2d = dadi.Spectrum.from_file("EryPar_modified.2dsfs")

In [99]:
dadi.Plotting.plot_single_2d_sfs(fs2d, vmin=1, cmap=pylab.cm.jet)


Out[99]:
<matplotlib.colorbar.Colorbar at 0x7fec65f445d0>

In [49]:
# save modified 2D spectrum to file 
dadi.Spectrum.tofile(sfs2d_m, fid="EryPar_modified.2dsfs", comment_lines=['Ludovics correction applied, p=0.35'])

In [50]:
# save modified 2D spectrum to file 
dadi.Spectrum.tofile(sfs2d_m_a, fid="EryPar_modified_a.2dsfs", comment_lines=['Ludovics correction applied, p=0.30'])

In [51]:
# save modified 2D spectrum to file 
dadi.Spectrum.tofile(sfs2d_m_b, fid="EryPar_modified_b.2dsfs", comment_lines=['Ludovics correction applied, p=0.39'])

In [ ]: