In the bootstrap, you divide your sequence data into regions that can (ideally) be considered independent. Then you resample from those regions and generate new frequency spectra, using all the SNPs from each sampled region.
Have you run any power tests? Running fits on simulated data from a plausible demographic history will give you insight as to whether it's even possible to reconstruct the history with the sample size you have.
In principle, I could simulate data sets under an RSC model with ms and see whether I can infer a Ti > 0.
When doing the non-parametric bootstrapping, you want to bootstrap over regions of the genome, not just SNPs. This is because we want our resampling to capture the correlations between the SNPs caused by linkage. So instead of resampling from the SNPs, you'd resample from regions you've sequenced. For example, you could break the genome up into 1 Mb chunks and resample over those. The exact region breakdown isn't critical, as long as the regions are large compared to the typical scale of LD in your organism.
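A minimal sketch of this kind of block bootstrap, assuming the SNPs have already been grouped into chunks (the chunk and SNP labels below are made up):
In [ ]:
import numpy as np

# hypothetical input: each 1 Mb chunk ID maps to the list of SNPs it contains
snps_by_chunk = {'chunk_1': ['snp_a', 'snp_b'], 'chunk_2': ['snp_c'], 'chunk_3': ['snp_d', 'snp_e']}
chunks = snps_by_chunk.keys()
# resample chunks with replacement; every SNP of a sampled chunk enters the bootstrap data set
boot_chunks = np.random.choice(chunks, size=len(chunks), replace=True)
boot_snps = [snp for chunk in boot_chunks for snp in snps_by_chunk[chunk]]
print boot_snps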
We typically run several optimizations on each bootstrap data set, to help ensure we find the true best fit. (Often we run optimizations until the best-3 results are all within some delta of each other.)
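A minimal sketch of such a stopping rule (the log-likelihood values below are made up; in practice each value would come from one optimisation run on the bootstrap data set):
In [ ]:
def best_three_within(lls, delta=0.5):
    # True once the three best log-likelihoods differ by less than delta
    if len(lls) < 3:
        return False
    best3 = sorted(lls, reverse=True)[:3]
    return best3[0] - best3[2] < delta

print best_three_within([-1205.3, -1201.8, -1202.1, -1201.9])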
If your bootstraps aren't normally distributed, it's better to report confidence intervals as quantiles, e.g. the 2.5% and 97.5% quantiles. You can also try a log-transform.
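A minimal sketch of such percentile confidence intervals with numpy (boot_T is a stand-in for an array holding one parameter re-estimated from each bootstrapped SFS):
In [ ]:
import numpy as np

# stand-in values for a parameter estimated from 200 bootstrap replicates
boot_T = np.random.lognormal(mean=0.0, sigma=0.5, size=200)
# 95% percentile confidence interval
print np.percentile(boot_T, [2.5, 97.5])
# the same on a log scale, back-transformed (useful if the bootstrap distribution is strongly skewed)
print np.exp(np.percentile(np.log(boot_T), [2.5, 97.5]))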
By default dadi masks entries for which it is having computational problems, and this can cause LLs to look too good.
I need to look out for this. dadi should emit a warning when it does this.
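A quick check for this, since a dadi Spectrum is a numpy masked array (a minimal sketch; fs here stands for any Spectrum object loaded further below):
In [ ]:
def n_masked(fs):
    # count masked entries in a dadi Spectrum;
    # with the default corner masking, 2 means only the two corners are masked
    return fs.mask.sum()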
In [1]:
# which *sites
% ll ../*sites
The sites file to use is ../ParEry.noSEgt2.nogtQ99Cov.noDUST.3.15.noTGCAGG.ANGSD_combinedNegFisFiltered.noGtQ99GlobalCov.sorted.sites.
In [1]:
from collections import defaultdict
In [2]:
# open sites file and read into dictionary
# contigs as keys, with sites as values collected in array
SITES = defaultdict(list)
with open("../ParEry.noSEgt2.nogtQ99Cov.noDUST.3.15.noTGCAGG.ANGSD_combinedNegFisFiltered.noGtQ99GlobalCov.sorted.sites") as sites_file:
    for line in sites_file:
        contig, site = line.strip().split()
        SITES[contig].append(site)
In [3]:
print SITES['Contig_16']
In [4]:
import numpy as np
In [5]:
# efficient bootstrapping of keys with numpy
# np.random.choice
n_contigs = len(SITES.keys())
boot_contigs = np.random.choice(SITES.keys(), size=n_contigs, replace=True)
In [6]:
boot_contigs[:10]
Out[6]:
In [7]:
print n_contigs
print len(boot_contigs)
In [8]:
# print the number of different contigs in the bootstrap resample
print len(set(boot_contigs))
In [15]:
i = 0
for contig in sorted(boot_contigs):
    i += 1
    if not i > 10:
        print contig
In [23]:
i = 0
for contig in sorted(boot_contigs, key=lambda x: int(x.replace("Contig_", ""))):
    i += 1
    if not i > 10:
        print contig
The sites file needs to be sorted on column 1 (contig ids) for indexing. Lexicographical sorting seems to be enough, but I like to have the contigs sorted numerically by id, hence the fancy key function.
In [30]:
# write out bootstrapped *sites file
with open("0_boot.sites", "w") as out:
    for contig in sorted(boot_contigs, key=lambda x: int(x.replace("Contig_", ""))):
        for site in range(len(SITES[contig])):
            out.write(contig + "\t" + SITES[contig][site] + "\n")
Since each bootstrapped sites file is 28 MB in size, I cannot create too many of them at one time without using up a lot of storage. I will therefore try to create them as they are needed for SAF calculation and then erase them again to free up storage.
In [12]:
import subprocess32 as sp
In [31]:
# index sites file
sp.call("angsd sites index 0_boot.sites 2>/dev/null", shell=True)
Out[31]:
In [32]:
# create regions file
sp.call("cut -f 1 0_boot.sites | uniq > 0_boot.rf", shell=True)
Out[32]:
In [40]:
cmd = "angsd -bam {0}.slim.bamfile.list -ref Big_Data_ref.fa -anc Big_Data_ref.fa -out SAF/{0}/{0}.unfolded.{1} -fold 0 \
-sites {1}.sites -rf {1}.rf -only_proper_pairs 0 -baq 1 -minMapQ 5 -minInd 9 -GL 1 -doSaf 1 -nThreads 1 2>/dev/null"
for POP in ['ERY', 'PAR']:
    print cmd.format(POP, "0_boot")
In [41]:
# create SAF files for ERY and PAR, for several bootstraps at the same time
running = []
cmd = "angsd -bam {0}.slim.bamfile.list -ref Big_Data_ref.fa -anc Big_Data_ref.fa -out SAF/{0}/{0}.unfolded.{1} -fold 0 \
-sites {1}.sites -rf {1}.rf -only_proper_pairs 0 -baq 1 -minMapQ 5 -minInd 9 -GL 1 -doSaf 1 -nThreads 1 2>/dev/null"
for POP in ['ERY', 'PAR']:
    #print cmd % (POP, "0_boot.sites", "0_boot.rf")
    p = sp.Popen(cmd.format(POP, "0_boot"), shell=True)
    running.append(p)
This only takes 6 minutes to finish.
In [33]:
# estimate 2D unfolded SFS with realSFS allowing enough iterations to reach convergence, one bootstrap at a time
cmd = "realSFS -P 12 -maxIter 50000 -tole 1e-6 -m 0 SAF/ERY/ERY.unfolded.{}.saf.idx SAF/PAR/PAR.unfolded.{}.saf.idx 2>/dev/null >SFS/{}.2dsfs"
In [34]:
# clean up sites file, regions file, split regions files and SAF files
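# a sketch of that clean-up, assuming the file names produced by the commands above:
sp.call("rm -f 0_boot.sites* 0_boot.rf", shell=True)  # bootstrapped sites file (plus its ANGSD index) and regions file
sp.call("rm -f SPLIT_RF/*", shell=True)  # split regions files
sp.call("rm -f SAF/ERY/ERY.unfolded.0_boot.* SAF/PAR/PAR.unfolded.0_boot.*", shell=True)  # SAF files of this bootstrap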
In [2]:
% ll SFS
In [40]:
import numpy
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [4]:
boot_2dsfs = dadi.Spectrum.from_file('SFS/0_boot.2dsfs.dadi')
In [5]:
boot_2dsfs.sample_sizes
Out[5]:
In [6]:
boot_2dsfs.pop_ids = ['ery', 'par']
In [7]:
boot_2dsfs.S()
Out[7]:
In [49]:
%matplotlib inline
import pylab
pylab.rcParams['font.size'] = 14.0
pylab.rcParams['figure.figsize'] = [12.0, 10.0]
In [8]:
dadi.Plotting.plot_single_2d_sfs(boot_2dsfs, vmin=1, cmap=pylab.cm.jet)
Out[8]:
In [10]:
!pwd
In [11]:
% ll ../../DADI/dadiExercises/
In [12]:
sfs2d = dadi.Spectrum.from_file('../../DADI/dadiExercises/EryPar.unfolded.2dsfs.dadi_format')
In [13]:
sfs2d.sample_sizes
Out[13]:
In [14]:
sfs2d.pop_ids = ['ery', 'par']
In [15]:
sfs2d.S()
Out[15]:
Practically the same number of SNPs as in the bootstrapped SFS! That's not a good sign.
In [17]:
(sfs2d - boot_2dsfs).sum()
Out[17]:
The bootstrapped SFS is practically identical to the data SFS?!
So the sites option of angsd/realSFS cannot be used for bootstrapping (see ANGSD issue #81): sites that occur more than once in the sites file are apparently counted only once.
However, since the contigs are the independent units I want to resample over, it may be possible to use a bootstrapped regions file.
In [11]:
# write out bootstrapped *rf file
with open("0_boot_regions.rf", "w") as out:
    for contig in sorted(boot_contigs, key=lambda x: int(x.replace("Contig_", ""))):
        out.write(contig + "\n")
In [16]:
% ll
In [ ]:
# split regions file
sp.call("split -l 500 0_boot_regions.rf SPLIT_RF/", shell=True)
In [14]:
% ll SAF
I have run angsd with GNU parallel on the split regions files and the data for ERY, then merged the SAF files with realSFS cat. Printing the merged SAF file with realSFS print then emits the following error message:
Problem with chr: Contig_20977, key already exists, saffile needs to be sorted. (sort your -rf that you used for input)
So it looks as though realSFS is really picky about contigs occurring more than once in the SAF file, which precludes bootstrapping.
./realSFS cat
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
I am going to try and create SAF files which do not overlap in the contigs that they contain. For that I need to split the regions file in a way that repeated contig names (they are sorted) do not end up in different split region files.
In [9]:
from collections import Counter
In [10]:
# turn array of contigs into a hash with counts
boot_contigs_dict = Counter(boot_contigs)
In [11]:
i = 0
for k, v in boot_contigs_dict.items():
    print k, v
    i += 1
    if i > 10:
        break
In [13]:
len(boot_contigs_dict.keys())
Out[13]:
In [16]:
i = 0
for k, v in sorted(boot_contigs_dict.items(), key=lambda x: int(x[0].replace("Contig_", ""))):
    for _ in range(v):
        print k
        i += 1
    if i > 10:
        break
Now let's introduce a split about every 10 contigs.
In [18]:
i = 0
c = 0
for k, v in sorted(boot_contigs_dict.items(), key=lambda x: int(x[0].replace("Contig_", ""))):
    c += v
    if c > 10:
        print "***************\n"
        c = v
    for _ in range(v):
        print k
        i += 1
    if i > 20:
        break
This looks good. No repetitions of contigs are spread over splits.
In [22]:
import string
string.letters
Out[22]:
In [24]:
from itertools import product
In [37]:
for n in product(string.lowercase, repeat=2):
    print n[0] + n[1] + " ",
In [38]:
len([n for n in product(string.lowercase, repeat=2)])
Out[38]:
That is plenty of file names to choose from.
In [39]:
# write out bootstrapped *rf file
import string
c = 0 # initialise contig count
# create iterator over filenames
fnames = product(string.lowercase, repeat=2)
# get next file name from iterator
fn = fnames.next()
# open new file for writing and get filehandle
out = open("split_rf/" + fn[0] + fn[1], "w")
# iterate over Counter dict of bootstrapped contigs, key=contig name, value=count (rep)
for contig,rep in sorted(boot_contigs_dict.items(), key=lambda x: int(x[0].replace("Contig_", ""))):
    c += rep
    if c > 500: # write up to 500 contigs to each split rf file
        out.close() # close current rf file
        fn = fnames.next() # get next file name from iterator
        out = open("split_rf/" + fn[0] + fn[1], "w") # open new rf file for writing
        c = rep
    for _ in range(rep): # write contig name to rf file as often as it occurs in the bootstrap resample
        out.write(contig + "\n")
This seems to do the right thing.
In [41]:
# read in bootstrapped spectrum
fs_ery_boot = dadi.Spectrum.from_file("SAF/with_nonoverlap_boot.rf/ERY/ERY.unfolded.sfs.boot.dadi")
In [42]:
fs_ery_boot
Out[42]:
In [46]:
% ll ../SFS/ERY/
In [47]:
# read in original spectrum
fs_ery = dadi.Spectrum.from_file("../SFS/ERY/ERY.unfolded.sfs.dadi")
In [48]:
fs_ery
Out[48]:
In [51]:
pylab.plot(fs_ery.fold(), "ro-", label="original")
pylab.plot(fs_ery_boot.fold(), 'bs-', label="bootstrapped")
pylab.legend()
Out[51]:
In [52]:
# get number of segregating sites for original spectrum
fs_ery.S()
Out[52]:
In [53]:
# get number of segregating sites for bootstrapped spectrum
fs_ery_boot.S()
Out[53]:
In [62]:
# get total number of sites in the original spectrum
fs_ery.data.sum()
Out[62]:
In [63]:
# get total number of sites in the bootstrapped spectrum
fs_ery_boot.data.sum()
Out[63]:
That looks good: similar, but not near-identical, values.
This really does look like a proper bootstrap!
I need to generate a second bootstrapped SFS to convince myself that this is actually working.
In [54]:
% ll
In [55]:
# open sites file and read into dictionary
# contigs as keys, with sites as values collected in array
SITES = defaultdict(list)
with open("all.sites") as sites_file:
    for line in sites_file:
        contig, site = line.strip().split()
        SITES[contig].append(site)
In [56]:
# efficient bootstrapping of keys with numpy
# np.random.choice
n_contigs = len(SITES.keys())
boot_contigs = np.random.choice(SITES.keys(), size=n_contigs, replace=True)
In [57]:
# turn array of contigs into a hash with counts
boot_contigs_dict = Counter(boot_contigs)
In [58]:
len(boot_contigs_dict)
Out[58]:
In [59]:
# write out bootstrapped *rf file
import string
c = 0 # initialise contig count
# create iterator over filenames
fnames = product(string.lowercase, repeat=2)
# get next file name from iterator
fn = fnames.next()
# open new file for writing and get filehandle
out = open("split_rf/" + fn[0] + fn[1], "w")
# iterate over Counter dict of bootstrapped contigs, key=contig name, value=count (rep)
for contig,rep in sorted(boot_contigs_dict.items(), key=lambda x: int(x[0].replace("Contig_", ""))):
    c += rep
    if c > 500: # write up to 500 contigs to each split rf file
        out.close() # close current rf file
        fn = fnames.next() # get next file name from iterator
        out = open("split_rf/" + fn[0] + fn[1], "w") # open new rf file for writing
        c = rep
    for _ in range(rep): # write contig name to rf file as often as it occurs in the bootstrap resample
        out.write(contig + "\n")
In [61]:
boot_contigs_dict['Contig_10071']
Out[61]:
In [64]:
fs_ery_boot_1 = dadi.Spectrum.from_file("SAF/with_nonoverlap_boot.rf/ERY/ERY.unfolded.sfs.boot_1.dadi")
In [65]:
pylab.plot(fs_ery.fold(), "ro-", label="original")
pylab.plot(fs_ery_boot_1.fold(), 'bs-', label="bootstrapped")
pylab.legend()
Out[65]:
In [66]:
fs_ery_boot_1.data.sum()
Out[66]:
This looks really good. I am convinced that this bootstrapping works!
Now I need to create a programme that does this over and over again.
In [2]:
import numpy as np
In [4]:
# open regions file and read into numpy array
REGIONS = []
with open("all.rf") as regions_file:
    for line in regions_file:
        contig = line.strip()
        REGIONS.append(contig)
In [5]:
REGIONS[:10]
Out[5]:
In [6]:
len(REGIONS)
Out[6]:
In [7]:
# efficient bootstrapping of contigs with numpy
# np.random.choice
n_contigs = len(REGIONS)
boot_contigs = np.random.choice(REGIONS, size=n_contigs, replace=True)
In [8]:
from collections import Counter
In [9]:
boot_contigs_dict = Counter(boot_contigs)
In [10]:
len(boot_contigs_dict)
Out[10]:
In [11]:
% ll
In [12]:
% ll split_rf/
In [13]:
import subprocess32 as sp
In [14]:
# remove split regions files from previous bootstrap
sp.call("rm -f split_rf/*", shell=True)
Out[14]:
In [15]:
% ll split_rf/
In [16]:
% ll SAF/ERY
In [17]:
sp.call("rm -f SAF/ERY/*", shell=True)
sp.call("rm -f SAF/PAR/*", shell=True)
Out[17]:
In [19]:
# write out bootstrapped *rf files
import string
from itertools import product
c = 0 # initialise contig count
# create iterator over filenames
fnames = product(string.lowercase, repeat=2)
# get next file name from iterator
fn = fnames.next()
# open new file for writing and get filehandle
out = open("split_rf/" + fn[0] + fn[1], "w")
# iterate over Counter dict of bootstrapped contigs, key=contig name, value=count (rep)
for contig,rep in sorted(boot_contigs_dict.items(), key=lambda x: int(x[0].replace("Contig_", ""))):
    c += rep
    if c > 500: # write up to 500 contigs to each split rf file
        out.close() # close current rf file
        fn = fnames.next() # get next file name from iterator
        out = open("split_rf/" + fn[0] + fn[1], "w") # open new rf file for writing
        c = rep
    for _ in range(rep): # write contig name to rf file as often as it occurs in the bootstrap resample
        out.write(contig + "\n")
In [20]:
cmd = 'ls split_rf/* | parallel -j 12 "angsd -bam PAR.slim.bamfile.list -ref Big_Data_ref.fa -anc Big_Data_ref.fa -out SAF/PAR/PAR.unfolded.{/} -fold 0 \
-sites all.sites -rf {} -only_proper_pairs 0 -baq 1 -minMapQ 5 -minInd 9 -GL 1 -doSaf 1 -nThreads 1 2>/dev/null"'
In [21]:
cmd
Out[21]:
In [22]:
sp.call(cmd, shell=True)
Out[22]:
In [25]:
cmd = cmd.replace("PAR", "ERY")
cmd
Out[25]:
In [26]:
sp.call(cmd, shell=True)
Out[26]:
In [27]:
cmd = "realSFS cat -outnames SAF/ERY/ERY.unfolded.merged SAF/ERY/*saf.idx 2>/dev/null"
cmd
Out[27]:
In [28]:
sp.call(cmd, shell=True)
Out[28]:
In [30]:
cmd = cmd.replace("ERY", "PAR")
In [31]:
sp.call(cmd, shell=True)
Out[31]:
In [37]:
cmd = "realSFS SAF/ERY/ERY.unfolded.merged.saf.idx -P 20 2>/dev/null > SFS/1D/ERY/ERY.unfolded.{0:04d}.sfs".format(1)
cmd
Out[37]:
In [38]:
p = sp.Popen(cmd, shell=True)
In [40]:
cmd = cmd.replace("ERY", "PAR")
cmd
Out[40]:
In [41]:
p.poll()
Out[41]:
In [42]:
p = sp.Popen(cmd, shell=True)
In [44]:
p.poll()
Out[44]:
In [46]:
cmd = "realSFS SAF/ERY/ERY.unfolded.merged.saf.idx SAF/PAR/PAR.unfolded.merged.saf.idx -P 20 2>/dev/null > SFS/2D/EryPar.unfolded.{0:04d}.sfs2d".format(1)
cmd
Out[46]:
In [48]:
% ll SAF/ERY/ERY.unfolded.merged.saf.idx SAF/PAR/PAR.unfolded.merged.saf.idx
In [47]:
sp.call(cmd, shell=True)
Out[47]:
This stops with an error. Apparently, finding the overlap in sites between two bootstrapped SAF files is not straightforward. See ANGSD issue #86.
I would like to create a sites file that only contains sites for which there are at least 9 individuals in each population with read data. I have previously created SAF files for each population including only sites with read data from at least 9 individuals, and I have extracted those sites from the SAF files. While doing that, I also sorted on contig number, then on position within contig. This is required by the following algorithm for finding the intersection between these two sites files.
In [4]:
% ll ../SAFs/ERY/*sites
In [5]:
% ll ../SAFs/PAR/*sites
In [6]:
ery_sites = open("../SAFs/ERY/ERY.sites", "r")
par_sites = open("../SAFs/PAR/PAR.sites", "r")
In [7]:
from itertools import izip
In [8]:
i = 0
for ery, par in izip(ery_sites, par_sites):
    i += 1
    print ery.rstrip(), par.rstrip()
    if i > 10: break
In [12]:
ery_sites.seek(0)
par_sites.seek(0)
i = 0
ery_forward = True
par_forward = True
while 1:
    # stop loop when reaching end of file
    if ery_forward:
        try:
            ery = ery_sites.next()
        except:
            break
    if par_forward:
        try:
            par = par_sites.next()
        except:
            break
    # extract contig ID and position within contig
    e = ery.replace("Contig_", "").rstrip().split()
    p = par.replace("Contig_", "").rstrip().split()
    # if contig ID of ery lower than of par
    if int(e[0]) < int(p[0]):
        ery_forward = True
        par_forward = False
    elif int(e[0]) > int(p[0]):
        ery_forward = False
        par_forward = True
    # if position within contig of ery lower than of par
    elif int(e[1]) < int(p[1]):
        ery_forward = True
        par_forward = False
    elif int(e[1]) > int(p[1]):
        ery_forward = False
        par_forward = True
    else:
        print ery,
        ery_forward = True
        par_forward = True
        i += 1
        if i > 200: break
That seems to work correctly.
In [15]:
def printIntersection(fh_1, fh_2, out_fh):
    ery_forward = True
    par_forward = True
    while 1:
        # stop loop when reaching end of file
        if ery_forward:
            try:
                ery = fh_1.next()
            except:
                break
        if par_forward:
            try:
                par = fh_2.next()
            except:
                break
        # extract contig ID and position within contig
        e = ery.replace("Contig_", "").rstrip().split()
        p = par.replace("Contig_", "").rstrip().split()
        # if contig ID of ery lower than of par
        if int(e[0]) < int(p[0]):
            ery_forward = True
            par_forward = False
        elif int(e[0]) > int(p[0]):
            ery_forward = False
            par_forward = True
        # if position within contig of ery lower than in par
        elif int(e[1]) < int(p[1]):
            ery_forward = True
            par_forward = False
        elif int(e[1]) > int(p[1]):
            ery_forward = False
            par_forward = True
        else:
            out_fh.write(ery)
            ery_forward = True
            par_forward = True
In [13]:
?int
In [16]:
ery_sites = open("../SAFs/ERY/ERY.sites", "r")
par_sites = open("../SAFs/PAR/PAR.sites", "r")
In [17]:
ery_sites.seek(0)
par_sites.seek(0)
In [18]:
out = open("from_SAFs_minInd9_overlapping.sites", "w")
In [19]:
printIntersection(ery_sites, par_sites, out)
In [20]:
% ll
In [20]:
! wc -l from_SAFs_minInd9_overlapping.sites
The unfolded 2D SFS estimated with these two SAF files and realSFS contains 1.13 million sites (summing over the SFS; see line 1449 in assembly.sh). So this seems to have worked correctly.
Another way to get to the overlapping sites is directly from the samtools depth file. See line 1320 in assembly.sh for the command line that created a depth file for the final filtered sites that I used for the ANGSD analyses.
Counting sites with at least 9 individuals with read data in ERY and in PAR in the depth file confirms that the number of overlapping sites must be around 1.13 million (see assembly.sh, line 2541 onwards). However, the numbers counted from the depth file are slightly higher than the number of sites in the SAF files, probably due to some additional filtering by the angsd -doSaf commands (see line 1426 onwards in assembly.sh). It can really only be the mapping quality filter: the site coverages in the depth file were taken from reads with MQ of at least 5, and it could be that -Q 5 in the samtools depth command means >=5, whereas -minMapQ 5 in ANGSD means >5.
In [28]:
% ll ../Quality_Control/
It is this file: ParEry.noSEgt2.nogtQ99Cov.noDUST.3.15.noTGCAGG.ANGSD_combinedNegFisFiltered.noGtQ99GlobalCov.sorted.depths.gz.
The sites in this depth file have at least 15 individuals with each at least 3x coverage. The first two columns in this file are contig ID and position. The remaining 36 columns contain the coverage for each individual. I believe columns 3 to 20 belong to ERY individuals and columns 21 to 38 to PAR individuals. I used gawk commands to extract those sites (see line 2544 onwards in assembly.sh). The number of overlapping sites counted from the depth file is 1,143,453, slightly higher than the overlap between the two SAF files. I am therefore going to use the sites file created from the overlap between the SAF files.
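A sketch of that count in Python rather than gawk, assuming the depth file sits in ../Quality_Control/ as listed above and that columns 3 to 20 are ERY and columns 21 to 38 are PAR:
In [ ]:
import gzip

n_overlap = 0
with gzip.open("../Quality_Control/ParEry.noSEgt2.nogtQ99Cov.noDUST.3.15.noTGCAGG.ANGSD_combinedNegFisFiltered.noGtQ99GlobalCov.sorted.depths.gz") as depth_fh:
    for line in depth_fh:
        fields = line.split()
        ery_covs = [int(x) for x in fields[2:20]]   # columns 3-20: ERY individuals
        par_covs = [int(x) for x in fields[20:38]]  # columns 21-38: PAR individuals
        # count a site if at least 9 individuals have read data in each population
        if sum(c > 0 for c in ery_covs) >= 9 and sum(c > 0 for c in par_covs) >= 9:
            n_overlap += 1
print n_overlap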
I have estimated 1D SFS for ERY and PAR using this sites file containing only overlapping sites. See assembly.sh, line 2576 onwards, for the details.
In [1]:
import numpy as np
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [4]:
% ll minInd9_overlapping/SFS/ERY
In [37]:
fs_ery = dadi.Spectrum.from_file("minInd9_overlapping/SFS/original/ERY/ERY.unfolded.sfs.dadi")
In [7]:
fs_ery.pop_ids = ['ery']
In [38]:
fs_par = dadi.Spectrum.from_file("minInd9_overlapping/SFS/original/PAR/PAR.unfolded.sfs.dadi")
In [9]:
fs_par.pop_ids = ['par']
In [5]:
import matplotlib.pyplot as plt
In [6]:
%matplotlib inline
In [7]:
plt.rcParams['figure.figsize'] = [6, 5]
plt.rcParams['font.size'] = 10
In [14]:
Out[14]:
I would have expected these 1D SFS to look more like the marginal spectra from the 2D spectrum (see 05_2D_models.ipynb). For the estimation of these 1D SFS, I ran realSFS without the accelerated mode and specified convergence parameters that allowed it to find a unique optimum (as I had observed with the previous 1D SFS that also included non-overlapping sites). The 2D SFS, however, I had estimated with the accelerated algorithm (see line 1439 in assembly.sh), so maybe it is not fully converged.
In [15]:
print fs_ery.data.sum()
print fs_par.data.sum()
This is the same number of sites as in the 2D SFS.
In [16]:
fs_ery.S()
Out[16]:
The marginal spectrum for ERY had 31,162 segregating sites.
In [17]:
fs_par.S()
Out[17]:
The marginal spectrum for PAR had 41,419 segregating sites.
In [18]:
fs_ery.pi()/fs_ery.data.sum()
Out[18]:
The $\pi_{site}$ for ERY from the marginal spectrum was 0.006530.
In [19]:
fs_par.pi()/fs_par.data.sum()
Out[19]:
The $\pi_{site}$ for PAR from the marginal spectrum was 0.007313.
I have estimated a 2D SFS from the SAF files created with the overlapping sites file from above, from_SAFs_minInd9_overlapping.sites (see line 2622 in assembly.sh). I have done an exhaustive optimisation, which took almost 6 hours on all 24 virtual cores of huluvu.
In [2]:
# get 2D SFS estimated from SAF files containing only overlapping sites
sfs2d_unfolded = dadi.Spectrum.from_file("minInd9_overlapping/SFS/original/EryPar.unfolded.sfs.dadi")
In [3]:
sfs2d_unfolded.pop_ids = ['ery', 'par']
In [8]:
dadi.Plotting.plot_single_2d_sfs(sfs2d_unfolded, vmin=1, cmap='jet')
Out[8]:
To estimate this 2D SFS, I (inadvertently) ran realSFS specifying the PAR SAF file before the ERY SAF file. So the axis labelling needs to be switched or the matrix transposed.
In [9]:
sfs2d_unfolded.pop_ids = ['par', 'ery']
In [10]:
sfs2d = sfs2d_unfolded.fold()
In [11]:
dadi.Plotting.plot_single_2d_sfs(sfs2d, vmin=1, cmap='jet')
Out[11]:
In [12]:
# transpose the matrix
sfs2d = dadi.Spectrum.transpose(sfs2d)
In [13]:
sfs2d.pop_ids.reverse()
In [14]:
dadi.Plotting.plot_single_2d_sfs(sfs2d, vmin=1, cmap='jet')
Out[14]:
I think it would be good to plot the residuals between this and my previous 2D SFS, which I estimated without exhaustive search and have used for $F_{ST}$ estimation as well as for all 2D model fitting so far.
I have previously estimated a 2D SFS from SAF files that contained all sites with at least 9 individuals with read data in the respective population. realSFS had automatically determined the intersection of sites between ERY and PAR when estimating that 2D SFS. I estimated this previous 2D SFS without exhaustive optimisation.
In [15]:
# get previous 2D SFS
prev_sfs2d_unfolded = dadi.Spectrum.from_file("../../DADI/dadiExercises/EryPar.unfolded.2dsfs.dadi_format")
In [16]:
prev_sfs2d_unfolded.pop_ids = ['ery', 'par']
In [17]:
prev_sfs2d = prev_sfs2d_unfolded.fold()
In [61]:
# get residuals between previous and new 2D SFS
resid = prev_sfs2d - sfs2d
In [62]:
# plot absolute residuals
plt.figure(figsize=(6,5))
dadi.Plotting.plot_2d_resid(resid)
Apparently, there are large absolute differences in the [0, 1] and [1, 0] frequency classes, as might be expected, since these have the highest expected count. The new 2D SFS (with exhaustive optimisation) has many more SNPs in those two frequency classes.
In [20]:
# plot Poisson residuals, saturating at 20 standard deviations in the heatmap of residuals
dadi.Plotting.plot_2d_comp_Poisson(data=prev_sfs2d, model=sfs2d, vmin=1, resid_range=20, title=['previous 2D SFS', 'new 2D SFS'])
The new 2D SFS seems to have fewer middle- and high-frequency SNPs in PAR that are absent from ERY (lowest row) and more low-frequency variants, shared and unshared (lower left corner), than the previous 2D SFS.
In [59]:
fs_ery_m = sfs2d.marginalize([1])
fs_par_m = sfs2d.marginalize([0])
max(fs_ery_m.data[1:]), max(fs_ery.fold().data[1:]), max(fs_par_m.data[1:]), max(fs_par.fold().data[1:])
plt.figure(figsize=(14, 7))
m = max(fs_ery_m.data[1:]), max(fs_ery.fold().data[1:]), max(fs_par_m.data[1:]), max(fs_par.fold().data[1:])
plt.subplot(121)
plt.plot(fs_ery_m, 'ro-', label='ery')
plt.plot(fs_par_m, 'g^-', label = 'par')
plt.ylim(0, max(m)*1.1)
plt.grid()
plt.legend()
plt.xlabel('minor allele frequency')
plt.ylabel('SNP count')
plt.title('marginal 1D SFS')
plt.subplot(122)
plt.plot(fs_ery.fold(), 'ro-', label='ery')
plt.plot(fs_par.fold(), 'g^-', label = 'par')
plt.ylim(0, max(m)*1.1)
plt.grid()
plt.legend()
plt.xlabel('minor allele frequency')
plt.ylabel('SNP count')
plt.title('1D SFS')
Out[59]:
The marginal 1D spectra from the 2D spectrum look very similar to the directly estimated 1D SFS's. This is as expected, since both are estimated from the same sites: those that have at least 9 individuals with read data in both ERY and PAR.
I have now re-estimated the 2D SFS from the SAF files that also contain non-overlapping sites, using exhaustive search parameters (see line 2629 in assembly.sh).
In [21]:
# get re-estimated 2D SFS from non-fully-overlapping SAF files
prev_sfs2d_unfolded_exhaust = dadi.Spectrum.from_file("../FST/EryPar.unfolded.exhaustive.2dsfs.dadi")
In [22]:
prev_sfs2d_unfolded_exhaust.pop_ids = ['ery', 'par']
In [23]:
prev_sfs2d_exhaust = prev_sfs2d_unfolded_exhaust.fold()
In [63]:
# get residuals between non-exhaust and exhaust 2D SFS,
# both estimated from non-fully-overlapping SAF files
resid = prev_sfs2d - prev_sfs2d_exhaust
In [64]:
# plot absolute residuals
plt.figure(figsize=(6,5))
dadi.Plotting.plot_2d_resid(resid)
These are small residuals.
In [65]:
# plot Poisson residuals, saturating at 20 standard deviations in the heatmap of residuals
dadi.Plotting.plot_2d_comp_Poisson(data=prev_sfs2d, model=prev_sfs2d_exhaust, vmin=1, \
resid_range=20, title=['non-exhaustive', 'exhaustive'])
In [66]:
fs_ery = prev_sfs2d_exhaust.marginalize([1])
fs_par = prev_sfs2d_exhaust.marginalize([0])
In [67]:
plt.figure(figsize=(6,5))
plt.plot(fs_ery, 'ro-', label='ery')
plt.plot(fs_par, 'g^-', label = 'par')
plt.grid()
plt.legend()
plt.xlabel('minor allele frequency')
plt.ylabel('SNP count')
plt.title('marginal 1D SFS')
Out[67]:
Unsurprisingly, the marginal spectra also look very similar to the marginal spectra taken from the 2D SFS optimised without exhaustive search. So, the difference between the spectra estimated from full SAF files and the spectra estimated from fully overlapping SAF files is not due to differences in the degree of optimisation of the SFS.
In [27]:
# total number of sites in previous non-exhaust 2D SFS
prev_sfs2d.data.sum()
Out[27]:
In [28]:
# total number of sites in previous exhaust 2D SFS
prev_sfs2d_exhaust.data.sum()
Out[28]:
In [29]:
# total number of sites in new exhaust 2D SFS (from fully overlapping SAF files)
sfs2d.data.sum()
Out[29]:
All three 2D spectra contain the same number of sites, indicating that they were indeed estimated from the same sites. Then why are the 2D spectra so different when estimated from SAF files containing only overlapping sites versus SAF files that also contain non-overlapping sites? This does not make sense to me.
The regions files are really small, so I am first going to create 200 bootstrapped regions files.
Then I will do the SAF estimation for these bootstrapped regions files. The SAF files are about 100 MB each, so keeping all of them (200 bootstraps x 2 populations) will require roughly 40 GB of storage, but I have that on /data3, where there is currently 1.7 TB of storage available.
Finally, I am going to estimate 1D SFS from these SAF files.
In [22]:
% ll *rf
In [25]:
regions = []
with open("from_SAFs_minInd9_overlapping.rf", "r") as rf:
    for r in rf:
        regions.append(r.rstrip())
In [26]:
regions[:10]
Out[26]:
In [28]:
len(regions)
Out[28]:
In [31]:
len(set(regions))
Out[31]:
In [27]:
import numpy as np
In [29]:
# efficient bootstrapping of array with numpy
# np.random.choice
n_contigs = len(regions)
boot_contigs = np.random.choice(regions, size=n_contigs, replace=True)
In [30]:
len(boot_contigs)
Out[30]:
In [32]:
len(set(boot_contigs))
Out[32]:
In [33]:
# write out bootstrapped *regions file
with open("minInd9_overlapping/BOOT_RF/0_boot.rf", "w") as out:
    for contig in sorted(boot_contigs, key=lambda x: int(x.replace("Contig_", ""))):
        out.write(contig + "\n")
Now, let's redo this 199 times.
In [44]:
def bootstrapContigs(regions, index):
    """
    takes an array of contig ID's and an index for the bootstrap repetition
    """
    import numpy as np
    # get number of contigs to resample
    n_contigs = len(regions)
    # resample contigs with replacement (bootstrap)
    boot_contigs = np.random.choice(regions, size=n_contigs, replace=True)
    # write out bootstrapped *regions file
    with open("minInd9_overlapping/BOOT_RF/{:03d}_boot.rf".format(index), "w") as out:
        for contig in sorted(boot_contigs, key=lambda x: int(x.replace("Contig_", ""))):
            out.write(contig + "\n")
In [47]:
for index in range(1, 200):
    bootstrapContigs(regions, index)
I need to read in the bootstrapped regions files again and split them for parallelisation of SAF creation, but in a way that allows concatenation of the split SAF files. This means repeated contigs must not be spread over different split regions files; otherwise realSFS cat throws an error message.
In [83]:
from glob import glob
In [88]:
sorted(glob("minInd9_overlapping/BOOT_RF/*"))[:10]
Out[88]:
I would like to extract the bootstrap index from the regions file name.
In [86]:
import re
In [87]:
p = re.compile(r'\d+')
In [93]:
for rf in sorted(glob("minInd9_overlapping/BOOT_RF/*"))[:20]:
    #print rf
    m = p.findall(rf)
    print m[1]
In [95]:
# the same as above without a pre-compiled RE object
for rf in sorted(glob("minInd9_overlapping/BOOT_RF/*"))[:20]:
    print re.findall(r'\d+', rf)[-1]
In [40]:
# read in one bootstrapped regions file
boot_contigs = []
with open("minInd9_overlapping/BOOT_RF/000_boot.rf", "r") as boot_rf:
    for contig in boot_rf:
        boot_contigs.append(contig.rstrip())
In [41]:
from collections import Counter
In [42]:
# turn array of contigs into a hash with counts
boot_contigs_dict = Counter(boot_contigs)
In [44]:
i = 0
for contig, count in boot_contigs_dict.items():
    print contig, "\t", count
    i += 1
    if i > 10: break
Clear the directory that takes the split regions files, so it can receive new ones:
In [46]:
% ll split_rf/
In [47]:
import subprocess32 as sp
In [48]:
sp.call("rm -f split_rf/*", shell=True)
Out[48]:
In [49]:
% ll split_rf/
In [50]:
import string
from itertools import product
# create iterator over filenames
fnames = product(string.lowercase, repeat=2)
In [52]:
def split_regions_file(boot_contigs_dict, fnames, size):
    """
    takes Counter dictionary of bootstrapped contigs
    and an iterator over filenames to choose;
    writes out split regions files with repetitions of contigs
    NOT spread over different split regions files
    """
    c = 0 # initialise contig count
    # get next file name from iterator
    fn = fnames.next()
    # open new file for writing and get filehandle
    out = open("split_rf/" + fn[0] + fn[1], "w")
    # iterate over Counter dict of bootstrapped contigs, key=contig name, value=count (rep)
    for contig, rep in sorted(boot_contigs_dict.items(), key=lambda x: int(x[0].replace("Contig_", ""))):
        c += rep
        if c > size: # write up to 'size' contigs to each split rf file
            out.close() # close current rf file
            fn = fnames.next() # get next file name from iterator
            out = open("split_rf/" + fn[0] + fn[1], "w") # open new rf file for writing
            c = rep
        for _ in range(rep): # write contig name to rf file as often as it occurs in the bootstrap resample
            out.write(contig + "\n")
In [53]:
split_regions_file(boot_contigs_dict, fnames, 400)
In [56]:
! wc -l split_rf/*
Estimate SAF files in parallel:
In [59]:
cmd = 'ls split_rf/* | parallel -j 12 "angsd -bam PAR.slim.bamfile.list -ref Big_Data_ref.fa \
-anc Big_Data_ref.fa -out minInd9_overlapping/SAF/bootstrap/PAR/{/}.unfolded -fold 0 \
-sites from_SAFs_minInd9_overlapping.sites -rf {} -only_proper_pairs 0 -baq 1 -minMapQ 5 -minInd 9 -GL 1 -doSaf 1 -nThreads 1 2>/dev/null"'
In [60]:
cmd
Out[60]:
In [61]:
sp.Popen(cmd, shell=True)
Out[61]:
In [62]:
% ll minInd9_overlapping/SAF/bootstrap/PAR/
Concatenate split SAF files:
In [97]:
index = '000'
In [98]:
cmd = "realSFS cat -outnames minInd9_overlapping/SAF/bootstrap/PAR/{}.unfolded minInd9_overlapping/SAF/bootstrap/PAR/*saf.idx 2>/dev/null".format(index)
cmd
Out[98]:
Note that I need to give the concatenated SAF file the index of the bootstrap repetition.
In [71]:
sp.call(cmd, shell=True)
Out[71]:
In [76]:
# concatenated SAF file begins with "PAR"
# remove all split SAF files, they are not needed anymore
sp.call("rm -f minInd9_overlapping/SAF/bootstrap/PAR/[a-z]*", shell=True)
Out[76]:
In [80]:
% ll minInd9_overlapping/SAF/bootstrap/PAR/[0-9]*
In [81]:
% ll minInd9_overlapping/SAF/bootstrap/PAR/[!0-9]*
I think it would be best to put the steps of creating SAF files into a separate Python script that can run in the background. I have put the code into the script estimate_SAFs.py.
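I am not reproducing estimate_SAFs.py here, but in outline it chains the steps from above for every bootstrap index, roughly like this (a sketch, not the actual script; it reuses the split_regions_file function defined above and the angsd/realSFS command lines from the cells above):
In [ ]:
import re, string
import subprocess32 as sp
from glob import glob
from collections import Counter
from itertools import product

angsd_cmd = 'ls split_rf/* | parallel -j 12 "angsd -bam {0}.slim.bamfile.list -ref Big_Data_ref.fa \
-anc Big_Data_ref.fa -out minInd9_overlapping/SAF/bootstrap/{0}/{{/}}.unfolded -fold 0 \
-sites from_SAFs_minInd9_overlapping.sites -rf {{}} -only_proper_pairs 0 -baq 1 -minMapQ 5 -minInd 9 -GL 1 -doSaf 1 -nThreads 1 2>/dev/null"'

for rf in sorted(glob("minInd9_overlapping/BOOT_RF/*")):
    index = re.findall(r'\d+', rf)[-1]  # bootstrap index from the regions file name
    boot_contigs_dict = Counter(line.rstrip() for line in open(rf))
    sp.call("rm -f split_rf/*", shell=True)  # clear split regions files from the previous bootstrap
    split_regions_file(boot_contigs_dict, product(string.lowercase, repeat=2), 400)
    for pop in ['ERY', 'PAR']:
        sp.call(angsd_cmd.format(pop), shell=True)  # estimate split SAF files in parallel
        sp.call("realSFS cat -outnames minInd9_overlapping/SAF/bootstrap/{0}/{1}.unfolded \
minInd9_overlapping/SAF/bootstrap/{0}/[a-z]*saf.idx 2>/dev/null".format(pop, index), shell=True)
        sp.call("rm -f minInd9_overlapping/SAF/bootstrap/{}/[a-z]*".format(pop), shell=True)  # remove split SAF files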
In [19]:
a = ['par', 'ery']
a.reverse()
a
Out[19]:
In [43]:
"{:03d}".format(50)
Out[43]:
In [7]:
ery_sites.seek(0)
par_sites.seek(0)
ery = ery_sites.next()
par = par_sites.next()
e = ery.replace("Contig_", "").rstrip().split()
p = par.replace("Contig_", "").rstrip().split()
print e
print p
In [ ]:
In [25]:
i = 0
while 1:
    i += 1
    print i
    if i > 9: break
In [29]:
s = 'abcdefg'
it = iter(s)
it
Out[29]:
In [30]:
while 1:
    try:
        print it.next()
    except:
        break
In [ ]:
In [7]:
str.replace?
In [10]:
str.strip?
In [14]:
print ery_sites.next()
In [15]:
print ery_sites.next()
In [16]:
ery_sites.seek(0)
In [17]:
print ery_sites.next()
In [18]:
s = 'abcdefg'
it = iter(s)
it
Out[18]:
In [19]:
print it.next()
In [20]:
for l in it:
    print l
In [28]:
it
In [3]:
arr = ['aa', 'bb']
"".join(arr)
Out[3]:
In [2]:
str.join?
In [ ]: