In [1]:
import sys
In [2]:
sys.path
Out[2]:
In [3]:
import os
In [4]:
os.getcwd()
Out[4]:
I have cloned the $\delta$a$\delta$i repository into '/home/claudius/Downloads/dadi' and have compiled the code. Now I need to add that directory to the PYTHONPATH variable:
In [5]:
sys.path.insert(0, '/home/claudius/Downloads/dadi')
In [6]:
sys.path
Out[6]:
Now, I should be able to import $\delta$a$\delta$i
In [7]:
import dadi
In [8]:
dir(dadi)
Out[8]:
In [9]:
import pylab
In [10]:
%matplotlib inline
In [11]:
x = pylab.linspace(0, 4*pylab.pi, 1000)
In [12]:
pylab.plot(x, pylab.sin(x), '-r')
Out[12]:
In [13]:
%%sh
# this allows me to execute a shell command
ls
I have turned the 1D folded SFS's from realSFS
into $\delta$d$\delta$i format by hand according to the description in section 3.1 of the manual. I have left out the masking line from the input file.
In [14]:
fs_ery = dadi.Spectrum.from_file('ERY.FOLDED.sfs.dadi_format')
In [15]:
fs_ery
Out[15]:
$\delta$a$\delta$i is detecting that the spectrum is folded (as given in the input file), but it is also automatically masking the 0th and 18th count category. This is a not a good behaviour.
In [16]:
# number of segregating sites
fs_ery.data[1:].sum()
Out[16]:
In [17]:
fs_ery.pi()
Out[17]:
I have next added a masking line to the input file, setting it to '1' for the first position, i. e. the 0-count category.
In [18]:
fs_ery = dadi.Spectrum.from_file('ERY.FOLDED.sfs.dadi_format', mask_corners=False)
$\delta$a$\delta$i is issuing the following message when executing the above command:
WARNING:Spectrum_mod:Creating Spectrum with data_folded = True, but mask is not True for all entries which are nonsensical for a folded Spectrum.
In [19]:
fs_ery
Out[19]:
I do not understand this warning from $\delta$a$\delta$i. The 18-count category is sensical for a folded spectrum with even sample size, so should not be masked. Anyway, I do not understand why $\delta$a$\delta$i is so reluctant to keep all positions, including the non-variable one.
In [20]:
fs_ery.pi()
Out[20]:
The function that returns $\pi$ produces the same output with or without the last count category masked ?! I think that is because even if the last count class (966.62...) is masked, it is still included in the calculation of $\pi$. However, there is no obvious unmasking in the pi
function. Strange!
There are (at least) two formulas that allow the calculation of $\pi$ from a folded sample allele frequency spectrum. One is given in Wakeley2009, p.16, equation (1.4): $$ \pi = \frac{1}{n \choose 2} \sum_{i=1}^{n/2} i(n-i)\eta_{i} $$ Here, $n$ is the number of sequences and $\eta_{i}$ is the SNP count in the i'th minor sample allele frequency class.
The other formula is on p. 45 in Gillespie "Population Genetics - A concise guide":
$$
\hat{\pi} = \frac{n}{n-1} \sum_{i=1}^{S_{n}} 2 \hat{p_{i}}(1-\hat{p_{i}})
$$
This is the formula that $\delta$a$\delta$i's pi
function uses, with the modification that it multiplies each $\hat{p_{i}}$ by the count in the i'th class of the SFS, i. e. the sum is not over all SNP's but over all SNP frequency classes.
In [21]:
# Calcualting pi with the formula from Wakeley2009
n = 36 # 36 sequences sampled from 18 diploid individuals
pi_Wakeley = (sum( [i*(n-i)*fs_ery[i] for i in range(1, n/2+1)] ) * 2.0 / (n*(n-1)))/pylab.sum(fs_ery.data)
# note fs_ery.data gets the whole fs_ery list, including masked entries
pi_Wakeley
Out[21]:
This is the value of $\pi_{site}$ that I calculated previously and included in the first draft of the thesis.
In [22]:
fs_ery.mask
Out[22]:
In [23]:
fs_ery.data # gets all data, including the masked one
Out[23]:
In [24]:
# Calculating pi with the formula from Gillespie:
n = 18
p = pylab.arange(0, n+1)/float(n)
p
Out[24]:
In [25]:
# Calculating pi with the formula from Gillespie:
n / (n-1.0) * 2 * pylab.sum(fs_ery * p*(1-p))
Out[25]:
This is the same as the output of dadi's pi
function on the same SFS.
In [26]:
# the sample size (n) that dadi stores in this spectrum object and uses as n in the pi function
fs_ery.sample_sizes[0]
Out[26]:
In [27]:
# what is the total number of sites in the spectrum
pylab.sum(fs_ery.data)
Out[27]:
So, 1.6 million sites went into the ery spectrum.
In [28]:
# pi per site
n / (n-1.0) * 2 * pylab.sum(fs_ery * p*(1-p)) / pylab.sum(fs_ery.data)
Out[28]:
Apart from the incorrect small sample size correction by $\delta$a$\delta$i in case of folded spectra ($n$ refers to sampled sequences, not individuals), Gillespie's formula leads to a much higher estimate of $\pi_{site}$ than Wakeley's. Why is that?
In [29]:
# with correct small sample size correction
2 * n / (2* n-1.0) * 2 * pylab.sum(fs_ery * p*(1-p)) / pylab.sum(fs_ery.data)
Out[29]:
In [30]:
# Calculating pi with the formula from Gillespie:
n = 18
p = pylab.arange(0, n+1)/float(n)
p = p/2 # with a folded spectrum, we are summing over minor allele freqs only
pi_Gillespie = 2*n / (2*n-1.0) * 2 * pylab.sum(fs_ery * p*(1-p)) / pylab.sum(fs_ery.data)
pi_Gillespie
Out[30]:
In [31]:
pi_Wakeley - pi_Gillespie
Out[31]:
As can be seen from the insignificant difference (must be due to numerical inaccuracies) between the $\pi_{Wakeley}$ and the $\pi_{Gillespie}$ estimates, they are equivalent with the calculation for folded spectra given above as well as the correct small sample size correction. Beware: $\delta$a$\delta$i does not handle folded spectra correctly.
It should be a relatively easy to fix the pi
function to work correctly with folded spectra. Care should be taken to also correctly handle uneven sample sizes.
In [32]:
fs_ery.folded
Out[32]:
I think for now it would be best to import unfolded spectra from realSFS
and fold them if necessary in dadi.
In [33]:
fs_par = dadi.Spectrum.from_file('PAR.FOLDED.sfs.dadi_format')
In [34]:
pylab.plot(fs_ery, 'r', label='ery')
pylab.plot(fs_par, 'g', label='par')
pylab.legend()
Out[34]:
I am trying to fit eq. 4.21 of Wakeley2009 to the oberseved 1D folded spectra.
Each frequency class, $\eta_i$, provides an estimate of $\theta$. However, I would like to find the value of $\theta$ that minimizes the deviation of the above equation from all observed counts $\eta_i$.
I am following the example given here: https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#example-of-solving-a-fitting-problem
I have just one parameter to optimize.
In [35]:
from scipy.optimize import least_squares
In [36]:
def model(theta, eta, n):
"""
theta: scaled population mutation rate parameter [scalar]
eta: the folded 1D spectrum, including 0-count cat. [list]
n: number of sampled gene copies, i. e. 2*num_ind [scalar]
returns a numpy array
"""
i = pylab.arange(1, eta.size)
delta = pylab.where(i == n-i, 1, 0)
return theta * 1/i + 1/(n-i) / (1 + delta)
In [37]:
?pylab.where
In [38]:
# test
i = pylab.arange(1, 19)
n = 36
print i == n-i
#
print pylab.where(i == n-i, 1, 0)
# get a theta estimate from pi:
theta = pi_Wakeley * fs_ery.data.sum()
print theta
#
print len(fs_ery)
#
model(theta, fs_ery, 36)
Out[38]:
In [39]:
def fun(theta, eta, n):
"""
return residuals between model and data
"""
return model(theta, eta, n) - eta[1:]
In [40]:
def jac(theta, eta, n, test=False):
"""
creates a Jacobian matrix
"""
J = pylab.empty((eta.size-1, theta.size))
i = pylab.arange(1, eta.size, dtype=float)
delta = pylab.where(i == n-i, 1, 0)
num = 1/i + 1/(n-i)
den = 1 + delta
if test:
print i
print num
print den
J[:,0] = num / den
return J
In [41]:
# test
jac(theta, fs_ery, 36, test=True)
Out[41]:
In [42]:
# starting value
theta0 = theta # pi_Wakeley from above
In [43]:
# sum over unmasked entries, i. e. without 0-count category, i. e. returns number of variable sites
fs_ery.sum()
Out[43]:
In [44]:
# optimize
res = least_squares(fun, x0=theta0, jac=jac, bounds=(0,fs_ery.sum()),
kwargs={'eta': fs_ery, 'n': 36}, verbose=1)
In [45]:
res.success
Out[45]:
In [46]:
?least_squares
In [47]:
print res.x
print theta
In [48]:
pylab.rcParams['figure.figsize'] = [12.0, 8.0]
In [49]:
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14.0
i = range(1, len(fs_ery))
eta_model = model(res.x, eta=fs_ery, n=36) # get predicted values with optimal theta
plt.plot(i, fs_ery[1:], "bo", label="data from ery") # plot observed spectrum
ymax = max( fs_ery[1:].max(), eta_model.max() )
plt.axis([0, 19, 0, ymax*1.1]) # set axis range
plt.xlabel("minor allele frequency (i)")
plt.ylabel(r'$\eta_i$', fontsize='large', rotation='horizontal')
plt.title("folded SFS of ery")
plt.plot(i, eta_model, "go-",
label="\nneutral model"
+ "\n"
+ r'$\theta_{opt} = $' + str(round(res.x, 1))
) # plot model prediction with optimal theta
plt.legend()
Out[49]:
The counts in each frequency class should be Poisson distributed with rate equal to $E[\eta_i]$ as given above. The lowest frequency class has the highest rate and therefore also the highest variance
In [50]:
#?plt.ylabel
In [51]:
#print plt.rcParams
In [52]:
fs_ery[1:].max()
Out[52]:
In [53]:
#?pylab
In [54]:
os.getcwd()
Out[54]:
In [55]:
%%sh
ls
The following function will take the file name of a file containing the flat 1D folded frequency spectrum of one population and plots it together with the best fitting neutral expectation.
In [56]:
def plot_folded_sfs(filename, n, pop = ''):
# read in spectrum from file
data = open(filename, 'r')
sfs = pylab.array( data.readline().split(), dtype=float )
data.close() # should close connection to file
#return sfs
# get starting value for theta from Watterson's theta
S = sfs[1:].sum()
T_total = sum([1.0/i for i in range(1, n)]) # onhe half the expected total length of the genealogy
theta0 = S / T_total # see eq. 4.7 in Wakeley2009
# optimize
res = least_squares(fun, x0=theta0, jac=jac, bounds=(0, sfs.sum()),
kwargs={'eta': sfs, 'n': 36}, verbose=1)
#print "Optimal theta per site is {0:.4f}".format(res.x[0]/sfs.sum())
#print res.x[0]/sfs.sum()
#return theta0, res
# plot
plt.rcParams['font.size'] = 14.0
i = range(1, len(sfs))
eta_model = model(res.x, eta=sfs, n=36) # get predicted values with optimal theta
plt.plot(i, sfs[1:], "rs", label="data of " + pop) # plot observed spectrum
ymax = max( sfs[1:].max(), eta_model.max() )
plt.axis([0, 19, 0, ymax*1.1]) # set axis range
plt.xlabel("minor allele frequency (i)")
plt.ylabel(r'$\eta_i$', fontsize='large', rotation='horizontal')
plt.title("folded SFS")
plt.text(5, 10000,
r"Optimal neutral $\theta$ per site is {0:.4f}".format(res.x[0]/sfs.sum()))
plt.plot(i, eta_model, "go-",
label="\nneutral model"
+ "\n"
+ r'$\theta_{opt} = $' + str(round(res.x, 1))
) # plot model prediction with optimal theta
plt.legend()
In [57]:
plot_folded_sfs('PAR.FOLDED.sfs', n=36, pop='par')
In [58]:
plot_folded_sfs('ERY.FOLDED.sfs', n=36, pop='ery')
Since I only have one value to optimize, I can use a slightly simpler approach than used above:
In [59]:
from scipy.optimize import minimize_scalar
In [60]:
?minimize_scalar
In [61]:
# define cost function
def f(theta, eta, n):
"""
return sum of squared deviations between model and data
"""
return sum( (model(theta, eta, n) - eta[1:])**2 ) # see above for definition of the 'model' function
It would be interesting to know whether the cost function is convex or not.
In [62]:
theta = pylab.arange(0, fs_ery.data[1:].sum()) # specify range of theta
cost = [f(t, fs_ery.data, 36) for t in theta]
plt.plot(theta, cost, 'b-', label='ery')
plt.xlabel(r'$\theta$')
plt.ylabel('cost')
plt.title("cost function for ery")
plt.legend(loc='best')
Out[62]:
In [63]:
?plt.legend
Within the specified bounds (the observed $\theta$, i. e. derived from the data, cannot lie outside these bounds), the cost function is convex. This is therefore an easy optimisation problem. See here for more details.
In [64]:
res = minimize_scalar(f, bounds = (0, fs_ery.data[1:].sum()), method = 'bounded', args = (fs_ery.data, 36))
In [65]:
res
Out[65]:
In [66]:
# number of segregating sites
fs_par.data[1:].sum()
Out[66]:
In [67]:
res = minimize_scalar(f, bounds = (0, fs_par.data[1:].sum()), method = 'bounded', args = (fs_par.data, 36))
In [68]:
res
Out[68]:
The fitted values of $\theta$ are similar to the ones obtained above with the least_squares
function. The estimates for ery deviate more than for par.
In [69]:
from sympy import *
In [70]:
x0 , x1 = symbols('x0 x1')
In [71]:
init_printing(use_unicode=True)
In [72]:
diff(0.5*(1-x0)**2 + (x1-x0**2)**2, x0)
Out[72]:
In [73]:
diff(0.5*(1-x0)**2 + (x1-x0**2)**2, x1)
Out[73]:
Wow! Sympy is a replacement for Mathematica. There is also Sage, which may include even more functionality.
In [74]:
from scipy.optimize import curve_fit
Curve_fit
is another function that can be used for optimization.
In [75]:
?curve_fit
In [76]:
def model(i, theta):
"""
i: indpendent variable, here minor SNP frequency classes
theta: scaled population mutation rate parameter [scalar]
returns a numpy array
"""
n = len(i)
delta = pylab.where(i == n-i, 1, 0)
return theta * 1/i + 1/(n-i) / (1 + delta)
In [77]:
i = pylab.arange(1, fs_ery.size)
popt, pcov = curve_fit(model, i, fs_ery.data[1:])
In [78]:
# optimal theta
print popt
In [79]:
perr = pylab.sqrt(pcov)
perr
Out[79]:
In [80]:
print str(int(popt[0] - 1.96*perr[0])) + ' < ' + str(int(popt[0])) + ' < ' + str(int(popt[0] + 1.96*perr[0]))
In [81]:
popt, pcov = curve_fit(model, i, fs_par.data[1:])
perr = pylab.sqrt(pcov)
print str(int(popt[0] - 1.96*perr[0])) + ' < ' + str(int(popt[0])) + ' < ' + str(int(popt[0] + 1.96*perr[0]))
I am not sure whether these standard errors (perr) are correct. It may be that it is assumed that errors are normally distributed, which they are not exactly in this case. They should be close to Poisson distributed (see Fu1995), which should be fairly similar to normal with such high expected values as here.
If the standard errors are correct, then the large overlap of the 95% confidence intervals would indicate that the data do not provide significant support for a difference in $\theta$ between par and ery.
In [82]:
%pwd
Out[82]:
In [83]:
% ll
In [84]:
! cat ERY.FOLDED.sfs.dadi_format
In [85]:
fs_ery = dadi.Spectrum.from_file('ERY.FOLDED.sfs.dadi_format', mask_corners=False)
In [86]:
fs_ery
Out[86]:
In [87]:
fs_ery.pop_ids = ['ery']
In [88]:
# get a Poisson sample from the observed spectrum
fs_ery_param_boot = fs_ery.sample()
In [89]:
fs_ery_param_boot
Out[89]:
In [90]:
fs_ery_param_boot.data
Out[90]:
In [91]:
%psource fs_ery.sample
There must be a way to get more than one bootstrap sample per call.
In [92]:
fs_ery_param_boot = pylab.array([fs_ery.sample() for i in range(100)])
In [93]:
# get the first 3 boostrap samples from the doubleton class
fs_ery_param_boot[:3, 2]
Out[93]:
It would be good to get the 5% and 95% quantiles from the bootstrap samples of each frequency class and add those intervals to the plot of the observed frequency spectrum and the fitted neutral spectrum. This would require to find a quantile function and to find out how to add lines to a plot with matplotlib.
It would also be good to use the predicted counts from the neutral model above with the fitted $\theta$ as parameters for the bootstrap with sample()
and add 95% confidence intervals to the predicted neutral SFS. I have done this in R instead (see /data3/claudius/Big_Data/ANGSD/SFS/SFS.Rmd
)
I edited the 2D SFS created for estimating $F_{ST}$ by realSFS
. I have convinced myself that realSFS
outputs a flattened 2D matrix as expected by $\delta$a$\delta$i's Spectrum.from_file
function (see section 3.1 of the manual with my comments). Note, that in the manual, "samples" stands for number of allele copies, so that the correct specification of dimensions for this 2D unfolded SFS of 18 diploid individuals in each of 2 populations is 37 x 37.
In [94]:
# read in the flattened 2D SFS
EryPar_unfolded_2dsfs = dadi.Spectrum.from_file('EryPar.unfolded.2dsfs.dadi_format', mask_corners=True)
In [95]:
# check dimension
len(EryPar_unfolded_2dsfs[0,])
Out[95]:
In [96]:
EryPar_unfolded_2dsfs.sample_sizes
Out[96]:
In [97]:
# add population labels
EryPar_unfolded_2dsfs.pop_ids = ["ery", "par"]
In [98]:
EryPar_unfolded_2dsfs.pop_ids
Out[98]:
$\delta$a$\delta$i offers a function to get the marginal spectra from multidimensional spectra. Note, that this marginalisation is nothing fancy. In R
it would be taking either the rowSums
or the colSums
of the matrix.
In [99]:
# marginalise over par to get 1D SFS for ery
fs_ery = EryPar_unfolded_2dsfs.marginalize([1])
# note the argument is an array with dimensions, one can marginalise over more than one dimension at the same time,
# but that is only interesting for 3-dimensional spectra, which I don't have here
In [100]:
fs_ery
Out[100]:
In [101]:
# marginalise over ery to get 1D SFS for par
fs_par = EryPar_unfolded_2dsfs.marginalize([0])
In [102]:
fs_par
Out[102]:
Note, that these marginalised 1D SFS's are not identical to the 1D SFS estimated directly with realSFS
. This is because, for the estimation of the 2D SFS, realSFS
has only taken sites that had data from at least 9 individuals in each population (see assembly.sh
, lines 1423 onwards).
The SFS's of par and ery had conspicuous shape differences. It would therefore be good to plot them to see, whether the above commands have done the correct thing.
In [103]:
# plot 1D spectra for each population
pylab.plot(fs_par, 'g', label="par")
pylab.plot(fs_ery, 'r', label="ery")
pylab.legend()
Out[103]:
These marginal unfolded spectra look similar in shape to the 1D folded spectra of each subspecies (see above).
In [104]:
fs_ery.pi() / pylab.sum(fs_ery.data)
Out[104]:
In [105]:
fs_ery.data
Out[105]:
In [106]:
n = 36 # 36 sequences sampled from 18 diploid individuals
pi_Wakeley = (sum( [i*(n-i)*fs_ery[i] for i in range(1, n)] ) * 2.0 / (n*(n-1)))
pi_Wakeley = pi_Wakeley / pylab.sum(fs_ery.data)
pi_Wakeley
Out[106]:
$\delta$a$\delta$i's pi
function seems to calculate the correct value of $\pi$ for this unfolded spectrum. However, it is worrying that $\pi$ from this marginal spectrum is about 20 times larger than the one calculated from the directly estimated 1D folded spectrum (see above the $\pi$ calculated from the folded 1D spectrum).
In [107]:
fs_par.pi() / pylab.sum(fs_par.data)
Out[107]:
In [108]:
pylab.sum(fs_par.data)
Out[108]:
In [109]:
pylab.sum(EryPar_unfolded_2dsfs.data)
Out[109]:
The sum over the marginalised 1D spectra should be the same as the sum over the 2D spectrum !
In [110]:
# from dadi's marginalise function:
fs_ery.data
Out[110]:
In [111]:
sfs2d = EryPar_unfolded_2dsfs.copy()
In [112]:
# this should get the marginal spectrum for ery
ery_mar = [pylab.sum(sfs2d.data[i]) for i in range(0, len(sfs2d))]
ery_mar
Out[112]:
In [113]:
# this should get the marginal spectrum for ery and then take the sum over it
sum([pylab.sum(sfs2d.data[i]) for i in range(0, len(sfs2d))])
Out[113]:
In [114]:
# look what happens if I include masking
sum([pylab.sum(sfs2d[i]) for i in range(0, len(sfs2d))])
Out[114]:
In [115]:
fs_ery.data - ery_mar
Out[115]:
So, during the marginalisation the masking of data in the fixed categories (0, 36) is the problem, producing incorrectly marginalised counts in those masked categories. This is shown in the following:
In [116]:
sfs2d[0]
Out[116]:
In [117]:
pylab.sum(sfs2d[0])
Out[117]:
In [118]:
# from dadi's marginalise function:
fs_ery.data
Out[118]:
In [119]:
# dividing by the correct number of sites to get pi per site:
fs_ery.pi() / pylab.sum(sfs2d.data)
Out[119]:
This is very close to the estimate of $\pi$ derived from the folded 1D spectrum of ery! (see above)
In [120]:
fs_par.pi() / pylab.sum(sfs2d.data)
Out[120]:
This is also nicely close to the estimate of $\pi_{site}$ of par from its folded 1D spectrum.
In [121]:
fs_ery.Watterson_theta() / pylab.sum(sfs2d.data)
Out[121]:
In [122]:
fs_ery.Tajima_D()
Out[122]:
In [123]:
fs_par.Tajima_D()
Out[123]:
Now, I am calculating Tajima's D from the ery marginal spectrum by hand in order to check whether $\delta$a$\delta$i is doing the right thing.
In [124]:
n = 36
pi_Wakeley = (sum( [i*(n-i)*fs_ery.data[i] for i in range(1, n+1)] )
* 2.0 / (n*(n-1)))
#/ pylab.sum(sfs2d.data)
pi_Wakeley
Out[124]:
In [125]:
# number of segregating sites
# this sums over all unmasked positions in the array
pylab.sum(fs_ery)
Out[125]:
In [126]:
fs_ery.S()
Out[126]:
In [127]:
S = pylab.sum(fs_ery)
theta_Watterson = S / pylab.sum(1.0 / (pylab.arange(1, n)))
theta_Watterson
Out[127]:
In [128]:
# normalizing constant, see page 45 in Gillespie
a1 = pylab.sum(1.0 / pylab.arange(1, n))
#print a1
a2 = pylab.sum(1.0 / pylab.arange(1, n)**2.0)
#print a2
b1 = (n+1.0)/(3.0*(n-1))
#print b1
b2 = 2.0*(n**2 + n + 3)/(9.0*n*(n-1))
#print b2
c1 = b1 - (1.0/a1)
#print c1
c2 = b2 - (n+2.0)/(a1*n) + a2/a1**2
#print c2
C = ((c1/a1)*S + (c2/(a1**2.0 + a2))*S*(S-1))
C = C**(1/2.0)
In [129]:
ery_Tajimas_D = (pi_Wakeley - theta_Watterson) / C
print '{0:.6f}'.format(ery_Tajimas_D)
In [130]:
ery_Tajimas_D - fs_ery.Tajima_D()
Out[130]:
$\delta$a$\delta$i seems to do the right thing. Note, that the estimate of Tajima's D from this marginal spectrum of ery is slightly different from the estimate derived from the folded 1D spectrum of ery (see /data3/claudius/Big_Data/ANGSD/SFS/SFS.Rmd). The folded 1D spectrum resulted in a Tajima's D estimate of $\sim$0.05, i. e. a difference of almost 0.1. Again, the 2D spectrum is based on only those sites for which there were at least 9 individiuals with data in both populations, whereas the 1D folded spectrum of ery included all sites for which there were 9 ery individuals with data (see line 1571 onwards in assembly.sh
).
In [131]:
fs_par.Tajima_D()
Out[131]:
My estimate from the folded 1D spectrum of par was -0.6142268 (see /data3/claudius/Big_Data/ANGSD/SFS/SFS.Rmd).
In [134]:
EryPar_unfolded_2dsfs.S()
Out[134]:
The 2D spectrum contains counts from 60k sites that are variable in par or ery or both.
In [136]:
EryPar_unfolded_2dsfs.Fst()
Out[136]:
This estimate of $F_{ST}$ according to Weir and Cockerham (1984) is well below the estimate of $\sim$0.3 from ANGSD according to Bhatia/Hudson (2013). Note, however, that this estimate showed a positive bias of around 0.025 in 100 permutations of population labels of individuals. Taking the positive bias into account, both estimates of $F_{ST}$ are quite similar.
The following function scramble_pop_ids
should generate a 2D SFS with counts as if individuals were assigned to populations randomly. Theoretically, the $F_{ST}$ calculated from this SFS should be 0.
In [145]:
%psource EryPar_unfolded_2dsfs.scramble_pop_ids
In [153]:
# plot the scrambled 2D SFS
dadi.Plotting.plot_single_2d_sfs(EryPar_unfolded_2dsfs.scramble_pop_ids(), vmin=1)
Out[153]:
So, this is how the 2D SFS would look like if ery and par were not genetically differentiated.
In [154]:
# get Fst for scrambled SFS
EryPar_unfolded_2dsfs.scramble_pop_ids().Fst()
Out[154]:
The $F_{ST}$ from the scrambled SFS is much lower than the $F_{ST}$ of the observed SFS. That should mean that there is significant population structure. However, the $F_{ST}$ from the scrambled SFS is not 0. I don't know why that is.
In [ ]:
In [138]:
# folding
EryPar_folded_2dsfs = EryPar_unfolded_2dsfs.fold()
In [139]:
EryPar_folded_2dsfs
Out[139]:
In [140]:
EryPar_folded_2dsfs.mask
Out[140]:
In [141]:
dadi.Plotting.plot_single_2d_sfs(EryPar_unfolded_2dsfs, vmin=1)
Out[141]:
In [142]:
dadi.Plotting.plot_single_2d_sfs(EryPar_folded_2dsfs, vmin=1)
Out[142]:
The folded 2D spectrum is not a minor allele frequency spectrum as are the 1D folded spectra of ery and par. This is because an allele that is minor in one population can be the major allele in the other. What is not counted are the alleles that are major in both populations, i. e. the upper right corner.
For the 2D spectrum to make sense it is crucial that allele frequencies are polarised the same way in both populations, either with an outgroup sequence or arbitrarily with respect to the reference sequence (as I did here).
In [143]:
# unfolded spectrum from marginalisation of 2D unfolded spectrum
fs_ery
Out[143]:
In [144]:
len(fs_ery)
Out[144]:
In [145]:
fs_ery.fold()
Out[145]:
Let's use the formula (1.2) from Wakeley2009 to fold the 1D spectrum manually: $$ \eta_{i} = \frac{\zeta_{i} + \zeta_{n-i}}{1 + \delta_{i, n-i}} \qquad 1 \le i \le [n/2] $$ $n$ is the number of gene copies sampled, i. e. haploid sample size. $[n/2]$ is the largest integer less than or equal to n/2 (to handle uneven sample sizes). $\zeta_{i}$ are the unfolded frequencies and $\delta_{i, n-i}$ is Kronecker's $\delta$ which is 1 if $i = n-i$ and zero otherwise (to avoid counting the unfolded n/2 frequency class twice with even sample sizes).
In [146]:
fs_ery_folded = fs_ery.copy() # make a copy of the UNfolded spectrum
n = len(fs_ery)-1
for i in range(len(fs_ery)):
fs_ery_folded[i] += fs_ery[n-i]
if i == n/2.0:
fs_ery_folded[i] /= 2
fs_ery_folded[0:19]
Out[146]:
In [147]:
isinstance(fs_ery_folded, pylab.ndarray)
Out[147]:
In [148]:
mask = [True]
mask.extend([False] * 18)
mask.extend([True] * 18)
print mask
print sum(mask)
In [149]:
mask = [True] * 37
for i in range(len(mask)):
if i > 0 and i < 19:
mask[i] = False
print mask
print sum(mask)
Here is how to flatten an array of arrays with list comprehension:
In [150]:
mask = [[True], [False] * 18, [True] * 18]
print mask
In [151]:
print [elem for a in mask for elem in a]
Set new mask for the folded spectrum:
In [152]:
fs_ery_folded.mask = mask
In [153]:
fs_ery_folded.folded = True
In [154]:
fs_ery_folded - fs_ery.fold()
Out[154]:
The fold()
function works correctly for 1D spectra, at least. How about 2D spectra?
In [155]:
EryPar_unfolded_2dsfs.sample_sizes
Out[155]:
In [156]:
EryPar_unfolded_2dsfs._total_per_entry()
Out[156]:
In [157]:
# copy the unfolded 2D spectrum for folding
import copy
sfs2d_folded = copy.deepcopy(EryPar_unfolded_2dsfs)
In [158]:
n = len(sfs2d_folded)-1
m = len(sfs2d_folded[0])-1
for i in range(n+1):
for j in range(m+1):
sfs2d_folded[i,j] += sfs2d_folded[n-i, m-j]
if i == n/2.0 and j == m/2.0:
sfs2d_folded[i,j] /= 2
In [159]:
mask = sfs2d_folded._total_per_entry() > (n+m)/2
mask
Out[159]:
In [160]:
sfs2d_folded.mask = mask
sfs2d_folded.fold = True
In [161]:
dadi.Plotting.plot_single_2d_sfs(sfs2d_folded, vmin=1)
Out[161]:
I am going to go through every step in the fold
function of dadi:
In [162]:
# copy the unfolded 2D spectrum for folding
import copy
sfs2d_unfolded = copy.deepcopy(EryPar_unfolded_2dsfs)
In [163]:
total_samples = pylab.sum(sfs2d_unfolded.sample_sizes)
total_samples
Out[163]:
In [164]:
total_per_entry = dadi.Spectrum(sfs2d_unfolded._total_per_entry(), pop_ids=['ery', 'par'])
#total_per_entry.pop_ids = ['ery', 'par']
dadi.Plotting.plot_single_2d_sfs(total_per_entry, vmin=1)
Out[164]:
In [165]:
total_per_entry = sfs2d_unfolded._total_per_entry()
total_per_entry
Out[165]:
In [166]:
where_folded_out = total_per_entry > total_samples/2
where_folded_out
Out[166]:
In [167]:
original_mask = sfs2d_unfolded.mask
original_mask
Out[167]:
In [168]:
pylab.logical_or([True, False, True], [False, False, True])
Out[168]:
In [169]:
# get the number of elements along each axis
sfs2d_unfolded.shape
Out[169]:
In [170]:
[slice(None, None, -1) for i in sfs2d_unfolded.shape]
Out[170]:
In [171]:
matrix = pylab.array([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]
])
reverse_slice = [slice(None, None, -1) for i in matrix.shape]
reverse_slice
Out[171]:
In [172]:
matrix[reverse_slice]
Out[172]:
In [173]:
matrix[::-1,::-1]
Out[173]:
With the variable length list of slice objects, one can generalise the reverse of arrays with any dimensions.
In [174]:
final_mask = pylab.logical_or(original_mask, dadi.Numerics.reverse_array(original_mask))
final_mask
Out[174]:
Here, folding doesn't mask new cells.
In [175]:
?pylab.where
In [176]:
pylab.where(matrix < 6, matrix, 0)
Out[176]:
In [177]:
# this takes the part of the spectrum that is non-sensical if the derived allele is not known
# and sets the rest to 0
print pylab.where(where_folded_out, sfs2d_unfolded, 0)
In [178]:
# let's plot the bit of the spectrum that we are going to fold onto the rest:
dadi.Plotting.plot_single_2d_sfs(dadi.Spectrum(pylab.where(where_folded_out, sfs2d_unfolded, 0)), vmin=1)
Out[178]:
In [179]:
# now let's reverse this 2D array, i. e. last row first and last element of each row first:
_reversed = dadi.Numerics.reverse_array(pylab.where(where_folded_out, sfs2d_unfolded, 0))
_reversed
Out[179]:
In [180]:
dadi.Plotting.plot_single_2d_sfs(dadi.Spectrum(_reversed), vmin=1)
Out[180]:
The transformation we have done with the upper-right diagonal 2D array above should be identical to projecting it across a vertical center line (creating an upper left triangular matrix) and then projecting it across a horizontal center line (creating the final lower left triangular matrix). Note, that this is not like mirroring the upper-right triangular 2D array across the 36-36 diagonal!
In [181]:
# This shall now be added to the original unfolded 2D spectrum.
sfs2d_folded = pylab.ma.masked_array(sfs2d_unfolded.data + _reversed)
In [182]:
dadi.Plotting.plot_single_2d_sfs(dadi.Spectrum(sfs2d_folded), vmin=1)
Out[182]:
In [183]:
sfs2d_folded.data
Out[183]:
In [184]:
sfs2d_folded.data[where_folded_out] = 0
sfs2d_folded.data
Out[184]:
In [185]:
dadi.Plotting.plot_single_2d_sfs(dadi.Spectrum(sfs2d_folded), vmin=1)
Out[185]:
In [186]:
sfs2d_folded.shape
Out[186]:
In [187]:
where_ambiguous = (total_per_entry == total_samples/2.0)
where_ambiguous
Out[187]:
SNP's with joint frequencies in the True cells are counted twice at the moment due to the folding and the fact that the sample sizes are even.
In [188]:
# this extracts the diagonal values from the UNfolded spectrum and sets the rest to 0
ambiguous = pylab.where(where_ambiguous, sfs2d_unfolded, 0)
dadi.Plotting.plot_single_2d_sfs(dadi.Spectrum(ambiguous), vmin=1)
Out[188]:
These are the values in the diagonal before folding.
In [189]:
reversed_ambiguous = dadi.Numerics.reverse_array(ambiguous)
dadi.Plotting.plot_single_2d_sfs(dadi.Spectrum(reversed_ambiguous), vmin=1)
Out[189]:
These are the values that got added to the diagonal during folding. Comparing with the previous plot, one can see for instance that the value in the (0, 36) class got added to the value in the (36, 0) class and vice versa. The two frequency classes are equivalent, since it is arbitrary which allele we call minor in the total sample (of 72 gene copies). These SNP's are therefore counted twice.
In [190]:
a = -1.0*ambiguous + 0.5*ambiguous + 0.5*reversed_ambiguous
b = -0.5*ambiguous + 0.5*reversed_ambiguous
a == b
Out[190]:
In [191]:
sfs2d_folded += -0.5*ambiguous + 0.5*reversed_ambiguous
In [192]:
final_mask = pylab.logical_or(final_mask, where_folded_out)
final_mask
Out[192]:
In [193]:
sfs2d_folded = dadi.Spectrum(sfs2d_folded, mask=final_mask, data_folded=True, pop_ids=['ery', 'par'])
In [194]:
pylab.rcParams['figure.figsize'] = [12.0, 8.0]
In [195]:
dadi.Plotting.plot_single_2d_sfs(sfs2d_folded, vmin=1)
Out[195]:
In [ ]:
In [ ]:
In [ ]: