In [2]:
# load dadi module
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [3]:
% ll
In [4]:
! cat ERY.unfolded.sfs.dadi
This doesn't look good!
In [10]:
# import 1D spectrum of ery
fs_ery = dadi.Spectrum.from_file('ERY.unfolded.sfs.dadi')
fs_par = dadi.Spectrum.from_file('PAR.unfolded.sfs.dadi')
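For reference, dadi's spectrum files are plain text: a header line giving the number of entries (2N+1, optionally followed by 'folded'/'unfolded'), a line of expected counts, and a line of 0/1 mask flags (1 = masked). A minimal reader sketch using only numpy — the layout and the toy values here are illustrative assumptions, not taken from the files above:

```python
import io
import numpy as np

def read_sfs(handle):
    """Parse a simple 1D dadi-style spectrum file.

    Assumes three non-empty lines: header (entry count, optional
    'folded'/'unfolded' flag), expected counts, and mask (1 = masked).
    """
    lines = [l for l in handle.read().splitlines() if l.strip()]
    n_entries = int(lines[0].split()[0])
    data = np.array([float(x) for x in lines[1].split()])
    mask = np.array([int(x) for x in lines[2].split()], dtype=bool)
    assert len(data) == n_entries == len(mask)
    return data, mask

# toy example: 2N = 4, so 5 entries; the two fixed classes are masked,
# as in dadi's default mask for an unfolded spectrum
toy = "5 unfolded\n100.0 7.5 3.2 1.1 0.4\n1 0 0 0 1\n"
data, mask = read_sfs(io.StringIO(toy))
```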
In [6]:
import pylab
In [7]:
%matplotlib inline
In [8]:
pylab.rcParams['figure.figsize'] = [14, 12]
In [11]:
pylab.plot(fs_par.fold(), 'go-', label='par')
pylab.plot(fs_ery.fold(), 'rs-', label='ery')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[11]:
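The `.fold()` calls above collapse the unfolded spectrum onto minor allele frequencies. As a sketch of what folding does (plain numpy, not dadi's implementation): entry i and entry 2N-i are summed, and the central class at frequency 0.5 is left as-is:

```python
import numpy as np

def fold_sfs(unfolded):
    """Fold a 1D unfolded SFS of length 2N+1 onto minor allele counts.

    folded[i] = unfolded[i] + unfolded[2N - i] for i < N;
    the central class (i == N) is not doubled.
    """
    s = np.asarray(unfolded, dtype=float)
    n = len(s) - 1            # 2N, the number of sampled chromosomes
    half = n // 2             # N
    folded = s[:half + 1].copy()
    folded[:half] += s[::-1][:half]   # add the mirrored high-frequency bins
    return folded

# toy example: 2N = 4
fold_sfs([10, 5, 3, 2, 1])
```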
What is going on here?
I have done new excess- and minimum-coverage filtering; see excess_coverage_filter.py
and minimum_coverage_filter.py
. This retained many more positions, contigs and reads, i.e. more information from which to determine the SFS.
I have created SAF's requiring a minimum of 9 individuals with read data and used the accelerated version of the EM algorithm in realSFS
.
In [1]:
# load dadi module
import sys
sys.path.insert(0, '/home/claudius/Downloads/dadi')
import dadi
In [12]:
# import 1D spectrum of ery
fs_ery = dadi.Spectrum.from_file('Ery.unfolded.sfs.dadi')
fs_par = dadi.Spectrum.from_file('Par.unfolded.sfs.dadi')
In [13]:
fs_ery.S()
Out[13]:
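`S()` returns the number of segregating sites, i.e. the spectrum summed over the polymorphic classes (everything except the two fixed classes of an unfolded spectrum). A small sketch of that count, plus Watterson's estimator built from it — plain numpy with a toy spectrum, not dadi's own methods:

```python
import numpy as np

def seg_sites(unfolded):
    """Number of segregating sites: sum over the polymorphic bins,
    excluding the fixed classes at frequency 0 and 2N."""
    s = np.asarray(unfolded, dtype=float)
    return s[1:-1].sum()

def watterson_theta(unfolded):
    """Watterson's estimator: S divided by the harmonic number
    a = sum_{i=1}^{2N-1} 1/i, where 2N is the sample size in chromosomes."""
    s = np.asarray(unfolded, dtype=float)
    n_chrom = len(s) - 1
    a = sum(1.0 / i for i in range(1, n_chrom))
    return seg_sites(s) / a

# toy unfolded spectrum, 2N = 4
sfs = [100, 4, 2, 1, 50]
seg_sites(sfs), watterson_theta(sfs)
```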
In [8]:
import pylab
In [9]:
%matplotlib inline
In [10]:
pylab.rcParams['figure.figsize'] = [12, 10]
In [8]:
pylab.plot(fs_par.fold(), 'go-', label='par')
pylab.plot(fs_ery.fold(), 'rs-', label='ery')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[8]:
Unfortunately, this is still completely unusable: only a few sites are inferred to be variable.
I think the coverage is too low for ANGSD to distinguish true alleles from sequencing errors, so the ML SFS is probably very unstable.
I have applied slightly more stringent minimum-coverage filtering: instead of at least 1x coverage in 15 individuals, I now require at least 3x in 10 individuals. This retained 368,764 sites on 9,119 contigs. Other than that, I have applied exactly the same filtering as above. Before HWE filtering there were 414,122 sites out of originally 85,488,084 sites (0.48%). Filtering for 3x coverage in at least 15 individuals would have left only 125,875 sites.
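As a quick sanity check on the retention figure quoted above:

```python
# sites retained before HWE filtering vs. originally available sites
retained = 414_122
total = 85_488_084
pct = 100 * retained / total
print(f"{pct:.2f}%")   # 0.48%, matching the figure above
```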
In [6]:
% ll ../../ANGSD/DEDUPLICATED/afs/ery
In [15]:
# import 1D spectrum of ery
fs_ery = dadi.Spectrum.from_file('../../ANGSD/DEDUPLICATED/afs/ery/ery.unfolded.sfs.dadi')
fs_par = dadi.Spectrum.from_file('../../ANGSD/DEDUPLICATED/afs/par/par.unfolded.sfs.dadi')
In [16]:
fs_ery.S()
Out[16]:
In [11]:
pylab.plot(fs_par.fold(), 'go-', label='par')
pylab.plot(fs_ery.fold(), 'rs-', label='ery')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[11]:
Unfortunately, the Big Data set from standard RAD sequencing does not provide enough information in the genotype likelihoods to estimate allele frequency spectra. It may be that the coverage filtering simply leaves too few sites with enough signal to overcome the noise in the data.
Using the sites that had at least 3x coverage in at least 10 individuals across the two populations, I have run the SAF calculation again with -minInd 15
, that is, requiring at least 15 (of 18) individuals with read data (i.e. at least 1x coverage).
In [18]:
# import 1D spectrum of ery
fs_ery = dadi.Spectrum.from_file('../../ANGSD/DEDUPLICATED/afs/ery/ery.unfolded.15.sfs.dadi')
fs_par = dadi.Spectrum.from_file('../../ANGSD/DEDUPLICATED/afs/par/par.unfolded.15.sfs.dadi')
In [19]:
fs_ery.S()
Out[19]:
In [20]:
pylab.plot(fs_par.fold(), 'go-', label='par')
pylab.plot(fs_ery.fold(), 'rs-', label='ery')
pylab.xlabel('minor allele frequency')
pylab.ylabel('SNP count')
pylab.legend()
Out[20]:
In [ ]: