I have created a depth file with the samtools depth command.
In [1]:
% ll
In [3]:
! zcat ParEry.noSEgt2.noDUST.noTGCAGG.depth.gz | wc -l
The depth file contains the coverage information for 85,488,084 sites.
In [4]:
! zcat ParEry.noSEgt2.noDUST.noTGCAGG.depth.gz | head
The first column contains the contig id, the second the position in the contig (1-based coordinates). The remaining 36 columns contain the coverage information for each of the 36 individuals.
I would like to get the 99th percentile of the coverage distribution of each individual from this as well as the 99th percentile of the global coverage distribution.
During a first pass through the depth file I want to compress the coverage information into counting dictionaries and use them to determine percentiles of the coverage distribution. Reading all columns of the depth file into very long lists would make it straightforward to get the percentiles, but it would require a substantial amount of memory, not least for sorting such large lists. I therefore consider this bad programming.
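As a toy illustration of the counting-dictionary idea (hypothetical numbers, not taken from the depth file), a high-coverage threshold can be read off by accumulating counts from the upper tail instead of sorting all values:

# toy counting dictionary: depth -> number of sites with that depth (made-up numbers)
cov_counts = {0: 50, 1: 30, 2: 15, 5: 4, 20: 1}
total = sum(cov_counts.values())               # 100 sites in total
cum = 0.0
for depth in sorted(cov_counts, reverse=True): # walk down from the upper tail
    cum += cov_counts[depth]
    if cum / total * 100 >= (100 - 99):        # at least 1% of sites have this depth or more
        print(depth)                           # prints 20
        break

This returns the highest depth that still has at least 1% of all sites at or above it, which is the kind of threshold used below for discarding high-coverage sites.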
In [5]:
import gzip
# get open file handle for gzipped depth file
depth_fh = gzip.open("ParEry.noSEgt2.noDUST.noTGCAGG.depth.gz")
In [6]:
from collections import defaultdict
In [7]:
# initialise global coverage count dictionary
global_cov_dist = defaultdict(int)
In [8]:
depth_fh.seek(0)
i = 0
for line in depth_fh:
    i += 1
    if i > 1000: break
    fields = line.rstrip().split('\t')
    depths = fields[2:]
    across_sample_depth = sum([int(d) for d in depths])
    global_cov_dist[across_sample_depth] += 1
I would like to get the coverage that lies at or just below the 99th percentile of the coverage distribution. Later I would like to discard all sites with coverage greater than this coverage.
In [30]:
def compute_percentile_from_dict(my_dict, percentile):
    """
    my_dict : dictionary containing depths as keys and their counts as values
    percentile : float in the open interval (0, 100), the percentile to compute
    returns : the key in my_dict that lies at or just below the specified percentile
              of the distribution of keys (the values are counts)
    """
    assert 0 < percentile < 100, "percentile must be a number between 0 and 100 (exclusive)"
    # get sum of counts
    total_cnt = sum(my_dict.values())
    perc = float()     # percentile of the coverage distribution
    cum_cnt = float()  # cumulative count
    for d in sorted(my_dict.keys(), reverse=True):  # search from the upper tail of the distribution
        cum_cnt += my_dict[d]
        perc = cum_cnt / total_cnt * 100
        if perc < (100 - percentile):
            continue
        else:
            return d
In [29]:
print "coverage \t count \t cumsum_cnt \t percentile"
cumsum_cnt = int()
total_cnt = float( sum(global_cov_dist.values()) )
percentile = float()
for k in sorted(global_cov_dist.keys(), reverse=True):
cumsum_cnt += global_cov_dist[k]
percentile = cumsum_cnt / total_cnt * 100
print "{0:5d} \t {1:10d} \t {2:5d} \t {3:>15.2f}".format(k, global_cov_dist[k], cumsum_cnt, percentile)
In [31]:
compute_percentile_from_dict(global_cov_dist, 90)
Out[31]:
In [32]:
compute_percentile_from_dict(global_cov_dist, 99)
Out[32]:
In [33]:
compute_percentile_from_dict(global_cov_dist, 50)
Out[33]:
The function seems to work.
In [38]:
coverage_threshold = compute_percentile_from_dict(global_cov_dist, 50)
coverage_threshold
Out[38]:
In [40]:
depth_fh.seek(0)  # restart reading file from beginning
i = 0
for line in depth_fh:
    i += 1
    if i > 100: break
    fields = line.rstrip().split('\t')
    depths = fields[2:]
    across_sample_depth = sum([int(d) for d in depths])
    print across_sample_depth
    if not across_sample_depth > coverage_threshold:
        print '\t'.join(fields)
So this is how I would filter for sites with depth below or equal to a certain global coverage percentile.
But I also want to filter for sites that have individual depths that are below or equal to their respective individual coverage percentiles.
In [41]:
# initialise global coverage count dictionary
global_cov_dist = defaultdict(int)
depth_fh.seek(0)  # go back to beginning of file
# initialise individual coverage count dictionaries
fields = depth_fh.readline().rstrip().split('\t')
ind_cov_dists = dict([(i, defaultdict(int)) for i in range(0, len(fields)-2)])
depth_fh.seek(0)  # seek back to beginning of file
i = 0
for line in depth_fh:
    i += 1
    if i > 1000: break
    fields = line.rstrip().split('\t')
    depths = fields[2:]
    # get global coverage
    across_sample_depth = sum([int(d) for d in depths])
    # count global coverages in dictionary
    global_cov_dist[across_sample_depth] += 1
    # count individual coverages in dictionaries
    for idx, d in enumerate(depths):
        ind_cov_dists[idx][int(d)] += 1
In [42]:
for ind in ind_cov_dists:
    print "showing coverage distribution for individual {}:".format(ind)
    for de, c in ind_cov_dists[ind].items():
        print "{0} : {1}".format(de, c)
The individual coverage count dictionaries seem to work.
Now, let's test whether we can get from these dictionaries global and individual coverage percentiles and use them to filter sites.
In [56]:
# set global coverage threshold to 90th percentile
global_coverage_threshold = compute_percentile_from_dict(global_cov_dist, 90)
global_coverage_threshold
Out[56]:
In [55]:
# set individual coverage thresholds to 90th percentile
ind_coverage_thresholds = [compute_percentile_from_dict(ind_cov_dists[i], 90) for i in range(36)]
print ind_coverage_thresholds
Now, let's read in a few lines from the depth file and filter for sites that neither have excess global nor excess individual coverage.
In [65]:
depth_fh.seek(0)  # restart reading file from beginning
i = 0
for line in depth_fh:
    i += 1
    if i > 100: break
    fields = line.rstrip().split('\t')
    depths = fields[2:]
    across_sample_depth = sum([int(d) for d in depths])
    if across_sample_depth > global_coverage_threshold:
        print "{0}: global depth {1} > {2}".format(i, across_sample_depth, global_coverage_threshold)
        continue
    excess_depth = False
    for idx, d in enumerate(depths):
        if int(d) > ind_coverage_thresholds[idx]:
            print "{0}: individual depth {1} > {2}".format(i, d, ind_coverage_thresholds[idx])
            excess_depth = True
            break
    if not excess_depth:
        print '\t'.join(fields)
The filtering seems to work.
Finally, I would also like to filter for sites that have enough coverage for downstream analysis.
I think at least 1x depth for each of at least 15 individuals would be reasonable.
In [68]:
depth_fh.seek(0)  # restart reading file from beginning
i = 0
for line in depth_fh:
    i += 1
    if i > 100: break
    fields = line.rstrip().split('\t')
    depths = fields[2:]
    across_sample_depth = sum([int(d) for d in depths])
    if across_sample_depth > global_coverage_threshold:
        print "{0}: global depth {1} > {2}".format(i, across_sample_depth, global_coverage_threshold)
        continue
    excess_depth = False
    sufficient_depth = 0
    for idx, d in enumerate(depths):
        if int(d) > ind_coverage_thresholds[idx]:
            print "{0}: individual depth {1} > {2}".format(i, d, ind_coverage_thresholds[idx])
            excess_depth = True
            break
        elif int(d) > 0:
            sufficient_depth += 1
    if sufficient_depth < 15:
        print "{0}: only {1} individuals with read data".format(i, sufficient_depth)
    if not excess_depth and sufficient_depth >= 15:
        print '\t'.join(fields)
In [70]:
depth_fh.close()
In [71]:
% ll
I have put the above commands in a separate Python file.
In [74]:
! ./coverage_filter.py -h
coverage_filter.py takes 3h 30min to filter the samtools depth file. I think this is quite a long time. I am going to try and speed this up a little bit.
Read filtered sites into a pandas data frame:
In [2]:
% ll
In [8]:
import pandas as pd
In [12]:
pd.read_table?
In [16]:
# read compressed TAB-delimited file directly into pandas data frame
filtered_sites_df = pd.read_table("ParEry.noSEgt2.noDUST.noTGCAGG.covFiltered.gz", header=None)
In [31]:
filtered_sites_df.head(6)
Out[31]:
In [52]:
# get basic statistics from the coverage table
filtered_sites_df.iloc[:, 2:].describe()
Out[52]:
As can be seen from this table, the maximum coverage any individual gets for a site is 4x.
In [19]:
filtered_sites_df.dtypes
Out[19]:
In [23]:
%matplotlib inline
In [25]:
import matplotlib.pyplot as plt
In [26]:
import matplotlib
matplotlib.style.use('ggplot')
In [27]:
import numpy as np
In [32]:
filtered_sites_df[2].plot.hist()
Out[32]:
In [39]:
# get global coverage for each site
filtered_sites_df['rowsums'] = pd.Series(filtered_sites_df.iloc[:, 2:].sum(axis=1)) # axis=1 means sum over columns
In [40]:
filtered_sites_df.head()
Out[40]:
In [57]:
filtered_sites_df['rowsums'].plot.hist(figsize=(14,12), bins=100, color='k', fontsize=14)
plt.xlim(0, 60)
plt.xlabel("global coverage")
plt.ylabel("count of sites")
plt.title("global coverage distribution")
plt.savefig("global_coverage_dist_covFilteredSites.pdf")
Restart kernel now!
In [1]:
%who
Define percentile function:
In [2]:
def compute_percentile_from_dict(my_dict, percentile):
    """
    my_dict : dictionary containing depths as keys and their counts as values
    percentile : float in the open interval (0, 100), the percentile to compute
    returns : the key in my_dict that lies at or just below the specified percentile
              of the distribution of keys (the values are counts)
    """
    assert 0 < percentile < 100, "percentile must be a number between 0 and 100 (exclusive)"
    # get sum of counts
    total_cnt = sum(my_dict.values())
    perc = float()     # percentile of the coverage distribution
    cum_cnt = float()  # cumulative count
    for d in sorted(my_dict.keys(), reverse=True):  # search from the upper tail of the distribution
        cum_cnt += my_dict[d]
        perc = cum_cnt / total_cnt * 100
        if perc < (100 - percentile):
            continue
        else:
            return d
Open smaller test file:
In [3]:
import gzip
# get open file handle for gzipped depth file
depth_fh = gzip.open("samtools_depth_testFile.gz")
In [18]:
from collections import defaultdict
First read through file:
In [22]:
def first_run():
    # initialise global coverage count dictionary
    global global_cov_dist
    global_cov_dist = defaultdict(int)
    depth_fh.seek(0)  # go back to beginning of file
    # initialise individual coverage count dictionaries
    fields = depth_fh.readline().rstrip().split('\t')
    num_inds = len(fields[2:])
    global ind_cov_dists
    ind_cov_dists = dict([(ind, defaultdict(int)) for ind in range(num_inds)])
    depth_fh.seek(0)  # seek back to beginning of file
    for line in depth_fh:
        fields = line.rstrip().split('\t')
        depths = fields[2:]
        # get global coverage
        across_sample_depth = sum([int(d) for d in depths])
        # count global coverages in dictionary
        global_cov_dist[across_sample_depth] += 1
        # count individual coverages in dictionaries
        for ind, d in enumerate(depths):
            ind_cov_dists[ind][int(d)] += 1
    # set global coverage threshold to 90th percentile
    global global_coverage_threshold
    global_coverage_threshold = compute_percentile_from_dict(global_cov_dist, 90)
    # set individual coverage thresholds to 90th percentile
    global ind_coverage_thresholds
    ind_coverage_thresholds = [compute_percentile_from_dict(ind_cov_dists[ind], 90) for ind in range(num_inds)]
Second read through file:
In [23]:
def second_run():
    global sites_kept
    sites_kept = 0
    depth_fh.seek(0)  # restart reading file from beginning
    for line in depth_fh:
        fields = line.rstrip().split('\t')
        depths = fields[2:]
        across_sample_depth = sum([int(d) for d in depths])
        if across_sample_depth > global_coverage_threshold:
            continue
        excess_depth = False
        sufficient_depth = 0
        for ind, d in enumerate(depths):
            if int(d) > ind_coverage_thresholds[ind]:
                excess_depth = True
                break
            elif int(d) > 0:
                sufficient_depth += 1
        if not excess_depth and sufficient_depth >= 15:
            sites_kept += 1
In [24]:
%%time
first_run()
second_run()
print sites_kept
This is the time of the unoptimised code.
Reduce the number of for loops:
In [25]:
def first_run():
    """
    first run through depth file
    """
    # initialise global coverage count dictionary
    global global_cov_dist
    global_cov_dist = defaultdict(int)
    depth_fh.seek(0)  # go back to beginning of file
    # initialise individual coverage count dictionaries
    fields = depth_fh.readline().rstrip().split('\t')
    num_inds = len(fields[2:])
    global ind_cov_dists
    ind_cov_dists = dict([(ind, defaultdict(int)) for ind in range(num_inds)])
    depth_fh.seek(0)  # seek back to beginning of file
    for line in depth_fh:
        fields = line.rstrip().split('\t')
        depths = fields[2:]
        # get global coverage
        across_sample_depth = 0
        # count individual coverages in dictionaries
        for ind, d in enumerate(depths):
            ind_cov_dists[ind][int(d)] += 1
            across_sample_depth += int(d)
        # count global coverages in dictionary
        global_cov_dist[across_sample_depth] += 1
    # set global coverage threshold to 90th percentile
    global global_coverage_threshold
    global_coverage_threshold = compute_percentile_from_dict(global_cov_dist, 90)
    # set individual coverage thresholds to 90th percentile
    global ind_coverage_thresholds
    ind_coverage_thresholds = [compute_percentile_from_dict(ind_cov_dists[ind], 90) for ind in range(num_inds)]
In [26]:
def second_run():
    """
    second run through depth file
    """
    global sites_kept
    sites_kept = 0
    depth_fh.seek(0)  # restart reading file from beginning
    for line in depth_fh:
        fields = line.rstrip().split('\t')
        depths = fields[2:]
        excess_depth = False
        sufficient_depth = 0
        across_sample_depth = 0
        for ind, d in enumerate(depths):
            if int(d) > ind_coverage_thresholds[ind]:
                excess_depth = True
                break
            elif int(d) > 0:
                sufficient_depth += 1
            across_sample_depth += int(d)
        if not excess_depth and not across_sample_depth > global_coverage_threshold and sufficient_depth >= 15:
            sites_kept += 1
In [27]:
%%time
first_run()
second_run()
print sites_kept
This is only 17 seconds faster than the unoptimised code.
In [28]:
import numpy as np
from numba.decorators import jit
In [49]:
@jit
def count_depths(depths):
    across_sample_depth = 0
    for ind, d in enumerate(depths):
        ind_cov_dists[ind][int(d)] += 1
        across_sample_depth += int(d)
    return across_sample_depth
In [50]:
def first_run():
    """
    first run through depth file
    """
    # initialise global coverage count dictionary
    global global_cov_dist
    global_cov_dist = defaultdict(int)
    depth_fh.seek(0)  # go back to beginning of file
    # initialise individual coverage count dictionaries
    fields = depth_fh.readline().rstrip().split('\t')
    num_inds = len(fields[2:])
    global ind_cov_dists
    ind_cov_dists = dict([(ind, defaultdict(int)) for ind in range(num_inds)])
    depth_fh.seek(0)  # seek back to beginning of file
    for line in depth_fh:
        fields = line.rstrip().split('\t')
        depths = np.array(fields[2:])  # create numpy array
        # count depths
        across_sample_depth = count_depths(depths)  # numba function
        # count global coverages in dictionary
        global_cov_dist[across_sample_depth] += 1
    # set global coverage threshold to 90th percentile
    global global_coverage_threshold
    global_coverage_threshold = compute_percentile_from_dict(global_cov_dist, 90)
    # set individual coverage thresholds to 90th percentile
    global ind_coverage_thresholds
    ind_coverage_thresholds = [compute_percentile_from_dict(ind_cov_dists[ind], 90) for ind in range(num_inds)]
In [54]:
@jit
def check_depths(depths):
    across_sample_depth = 0
    excess_depth = False
    sufficient_depth = 0
    for ind, d in enumerate(depths):
        if int(d) > ind_coverage_thresholds[ind]:
            excess_depth = True
            return across_sample_depth, excess_depth, sufficient_depth
        elif int(d) > 0:
            sufficient_depth += 1
        across_sample_depth += int(d)
    return across_sample_depth, excess_depth, sufficient_depth
In [55]:
def second_run():
    """
    second run through depth file
    """
    global sites_kept
    sites_kept = 0
    depth_fh.seek(0)  # restart reading file from beginning
    for line in depth_fh:
        fields = line.rstrip().split('\t')
        depths = np.array(fields[2:])
        # check depths
        across_sample_depth, excess_depth, sufficient_depth = check_depths(depths)  # Numba function
        if not excess_depth and not across_sample_depth > global_coverage_threshold and sufficient_depth >= 15:
            sites_kept += 1
In [56]:
%%time
first_run()
second_run()
print sites_kept
OK, this even took 3 times as long as the unoptimised version.
This may be because Numba cannot compile these functions and falls back on ordinary Python code (object mode). With the following function decorators I am forcing Numba to compile in nopython mode or throw an exception.
In [57]:
@jit(nopython=True)
def count_depths(depths):
    across_sample_depth = 0
    for ind, d in enumerate(depths):
        ind_cov_dists[ind][int(d)] += 1
        across_sample_depth += int(d)
    return across_sample_depth
In [58]:
@jit(nopython=True)
def check_depths(depths):
    across_sample_depth = 0
    excess_depth = False
    sufficient_depth = 0
    for ind, d in enumerate(depths):
        if int(d) > ind_coverage_thresholds[ind]:
            excess_depth = True
            return across_sample_depth, excess_depth, sufficient_depth
        elif int(d) > 0:
            sufficient_depth += 1
        across_sample_depth += int(d)
    return across_sample_depth, excess_depth, sufficient_depth
In [59]:
%%time
first_run()
second_run()
print sites_kept
I want to change excess and minimum coverage filtering from how it is done above.
First, I want to separate the excess coverage filtering from the minimum coverage filtering.
The excess coverage filtering should be based on the SE read count per contig and should filter contigs, not sites. The coverage distribution (i.e. SE reads per contig) should not contain the category 0; it should start from 1.
I can derive such individual coverage percentiles from the individual bam files, which could be read in parallel. I could also get an across-sample coverage distribution in the process.
Once I have the individual and across-sample coverage percentiles in a dictionary, I read in the same bam files again and record contigs with excess coverage.
The result of this filtering should be a list of contigs to exclude. These contigs should be removed from a bed or sites file.
In [1]:
import glob
print glob.glob('*sorted.bam')
In [2]:
import subprocess32 as sp  # allows calling external programmes and capturing their output
In [3]:
import numpy as np
In [4]:
# command line for testing
cmd = "samtools view -f 64 par_34-9_dedup.sorted.bam | gawk '$4==2' | cut -f 3 | uniq -c | gawk '{print $1}'"
The command line takes the SE reads from the bam file, keeps only those that map to position 2 on the contig (i.e. with the correct mapping position), then extracts the contig name (column 3) from the SAM record. Since the BAM file is already sorted I can run uniq straight away while keeping the count per contig name (i.e. the number of SE reads), and finally I extract only that count. The output of the command line is the SE read count for each contig, one per line.
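For comparison only, the same per-contig SE read counting could be done directly in Python with pysam (which is not used in this notebook); a minimal sketch, assuming the same test BAM file as above:

import pysam
from collections import Counter

# count SE reads (first in pair, flag 64) that map to position 2 of their contig,
# i.e. the same records selected by: samtools view -f 64 <bam> | gawk '$4==2'
se_read_count = Counter()
bam = pysam.AlignmentFile("par_34-9_dedup.sorted.bam", "rb")
for read in bam.fetch(until_eof=True):
    if read.is_read1 and not read.is_unmapped and read.reference_start == 1:  # 0-based, SAM position 2
        se_read_count[read.reference_name] += 1
bam.close()

print(se_read_count.most_common(5))  # five contigs with the highest SE read count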
In [5]:
# read SE read count per contig into numpy array
a = np.fromstring( sp.check_output(cmd, shell=True), sep='\n', dtype=int )
In [56]:
np.fromstring?
In [6]:
a[:10]
Out[6]:
In [7]:
# get 99th percentile of SE read coverage distribution
np.percentile(a, 99)
Out[7]:
In [8]:
%%time
# read in all sorted BAM files and get their SE read coverage 99th percentile,
# store them in a dictionary with BAM file name as key
Q99_ind = {}
for file in glob.glob('*sorted.bam'):
    cmd = "samtools view -f 64 %s | gawk '$4==2' | cut -f 3 | uniq -c | gawk '{print $1}'" % (file)
    #print cmd
    a = np.fromstring( sp.check_output(cmd, shell=True), sep='\n', dtype=int )
    Q99_ind[file] = np.percentile(a, 99)
This is so fast that I don't care about speeding it up. It's many orders of magnitude faster than parsing the samtools depth file.
In [52]:
for ind, Q in Q99_ind.items():
    print ind, Q
I now have individual 99th coverage percentiles. Remember that they do not include contigs with 0 coverage. If I had included the 0 count category, the 99th percentiles would obviously be much lower.
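A quick toy demonstration (hypothetical counts) of how much the zero category matters: if the vast majority of contigs had zero SE reads, the 99th percentile would collapse towards zero.

import numpy as np

# hypothetical SE read counts: 990 contigs without reads, 10 contigs with 10-100 reads
with_zeros = np.concatenate([np.zeros(990, dtype=int), np.arange(10, 101, 10)])
without_zeros = np.arange(10, 101, 10)

print(np.percentile(with_zeros, 99))     # ~0.1  -> threshold collapses towards zero
print(np.percentile(without_zeros, 99))  # ~99.1 -> a much more useful threshold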
Next, I want to read in contig id and count for each bam file, check whether the count is above Q99 and, if so, record the contig in a separate list.
In [9]:
file = "par_34-9_dedup.sorted.bam"
cmd = "samtools view -f 64 %s | gawk '$4==2' | cut -f 3 | uniq -c" % (file)
print cmd
This command line returns SE read count and contig name, one contig per line.
In [16]:
# open a connection (filehandle) to the standard output of the above command line
pipe = sp.Popen(cmd, shell=True, stdout=sp.PIPE).stdout
I can now iterate over that pipe as with a filehandle.
In [17]:
i = 0
for line in pipe:
    c, id = line.strip().split()
    print c, id
    i += 1
    if i > 10: break
pipe.close()  # need to close the filehandle again
In [41]:
pipe = sp.Popen(cmd, shell=True, stdout=sp.PIPE).stdout
In [42]:
excess_ind_cov = set()  # set of ids of contigs with excess coverage
keep = set()            # contigs to keep
for line in pipe:
    c, id = line.strip().split()
    if int(c) >= Q99_ind[file]:
        excess_ind_cov.add(id)
    else:
        keep.add(id)
pipe.close()
In [43]:
Q99_ind[file]
Out[43]:
In [44]:
len(excess_ind_cov)
Out[44]:
There are 851 contigs with SE read coverage above the 99th coverage percentile of this individual.
In [27]:
from collections import Counter
In [34]:
c = Counter(excess_ind_cov)
In [35]:
c.most_common(1)
Out[35]:
In [36]:
2 in c.values()
Out[36]:
In [45]:
len(keep)
Out[45]:
Now, let's collect all contig id's whose SE read coverage is above the 99th percentile in any of the 36 individuals here.
In [46]:
%%time
excess_ind_cov = set()  # set of contig ids from contigs with excess coverage
for file in glob.glob('*sorted.bam'):
    cmd = "samtools view -f 64 %s | gawk '$4==2' | cut -f 3 | uniq -c" % (file)
    #print cmd
    pipe = sp.Popen(cmd, shell=True, stdout=sp.PIPE).stdout
    for line in pipe:
        c, id = line.strip().split()
        if int(c) > Q99_ind[file]:
            excess_ind_cov.add(id)
    pipe.close()
In [47]:
list(excess_ind_cov)[:10]
Out[47]:
In [48]:
len(excess_ind_cov)
Out[48]:
With the R quantile function and print_Q99_exCov_contigs.pl I had found 2192 contigs. This is close enough.
While reading in the SE read coverage from all contigs for each individual, I could also at the same time determine the across sample coverage for each contig.
In [49]:
from collections import defaultdict
In [59]:
%%time
excess_ind_cov = set()     # set of contig ids from contigs with excess individual coverage
excess_global_cov = set()  # excess across sample coverage
across_sample_cov = defaultdict(int)
keep = set()               # set of contig ids to keep for downstream analysis
for file in glob.glob('*sorted.bam'):
    cmd = "samtools view -f 64 %s | gawk '$4==2' | cut -f 3 | uniq -c" % (file)
    #print cmd
    pipe = sp.Popen(cmd, shell=True, stdout=sp.PIPE).stdout
    for line in pipe:
        cov, contig = line.strip().split()
        if int(cov) > Q99_ind[file]:
            excess_ind_cov.add(contig)
        else:
            keep.add(contig)
        across_sample_cov[contig] += int(cov)
    pipe.close()
print "{0:d} contigs passed individual coverage filtering".format(len(keep))
Q99_across_sample = np.percentile(across_sample_cov.values(), 99.0)
for contig, cov in across_sample_cov.items():
    if cov > Q99_across_sample:
        excess_global_cov.add(contig)
# Return a new set with elements in the set 'keep' that are not in the set 'excess_global_cov':
keep_contigs = keep.difference(excess_global_cov)
print "{0:d} contigs passed individual and across sample coverage filtering".format(len(keep_contigs))
In [54]:
Q99_across_sample
Out[54]:
The across sample 99th coverage percentile looks reasonable.
In [60]:
len(excess_ind_cov)
Out[60]:
In [61]:
len(excess_global_cov)
Out[61]:
In [62]:
# number of contigs that passed individual coverage filtering
# but not global coverage filtering
len(excess_global_cov.difference(excess_ind_cov))
Out[62]:
In [63]:
# number of contigs that did not pass individual coverage filtering
# or global coverage filtering or both
len(excess_ind_cov.union(excess_global_cov))
Out[63]:
In [ ]:
# get all contig ids from contigs that failed one of the
# coverage filters
excess_cov_contigs = excess_ind_cov.union(excess_global_cov)
This would be a list that I could use to filter bed, depth or sites files.
In [53]:
np.percentile?
I don't like the default interpolation (='linear', which interpolates between the two nearest observed values and can therefore return a value that is not in the data). I think 'lower' should be ok, since it always returns an observed value.
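A quick toy comparison of the default and the 'lower' interpolation (made-up counts):

import numpy as np

counts = np.array([1, 2, 3, 4, 10])
print(np.percentile(counts, 90))                          # 7.6, interpolated, not an observed value
print(np.percentile(counts, 90, interpolation='lower'))   # 4, the nearest observed value below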
I am going to try some code that only needs to read in the BAM files once.
In [1]:
from glob import glob
In [2]:
import subprocess32 as sp
In [3]:
import numpy as np
In [4]:
from collections import defaultdict
In [8]:
%%time
#
# initialise data structures
#
# set of contig ids from contigs with excess individual coverage:
excess_ind_cov = set()
# stores contig ids with excess across sample coverage:
excess_global_cov = set()
# stores individual coverage distributions:
ind_sample_cov = dict([(file, defaultdict(int)) for file in glob('*sorted.bam')])
# stores across sample coverage distribution:
across_sample_cov = defaultdict(int)
# stores 99th coverage percentile for each individual:
Q99_ind = {}
# set of contigs with SE read mappings from any individual:
all_contigs = set()  # contigs with only PE read mappings are completely ignored
# get coverage distributions
for file in glob('*sorted.bam'):
    cmd = "samtools view -f 64 %s | gawk '$4==2' | cut -f 3 | uniq -c" % (file)
    pipe = sp.Popen(cmd, shell=True, stdout=sp.PIPE).stdout
    for line in pipe:
        cov, contig = line.strip().split()
        ind_sample_cov[file][contig] += int(cov)
        Q99_ind[file] = 0  # initialise dict
        all_contigs.add(contig)
        across_sample_cov[contig] += int(cov)
    pipe.close()
# get 99th percentiles of individual coverage distributions
for file in glob('*sorted.bam'):
    Q99_ind[file] = np.percentile(ind_sample_cov[file].values(), 99.0, interpolation='lower')
# get 99th percentile of across sample coverage distribution
Q99_across_sample = np.percentile(across_sample_cov.values(), 99.0, interpolation='lower')
# check contigs for excess global, then individual coverage
for contig in all_contigs:
    if across_sample_cov[contig] > Q99_across_sample:
        excess_global_cov.add(contig)
        continue
    for file in glob('*sorted.bam'):
        if ind_sample_cov[file][contig] > Q99_ind[file]:
            excess_ind_cov.add(contig)
            break
# Return a new set with elements of 'all_contigs' that are in neither 'excess_global_cov' nor 'excess_ind_cov':
keep_contigs = all_contigs.difference(excess_global_cov, excess_ind_cov)
print "{0:d} contigs passed individual and across sample coverage filtering".format(len(keep_contigs))
I would like to time individual parts of the previous code cell.
In [9]:
%%time
#
# initialise data structures
#
# set of contig ids from contigs with excess individual coverage:
excess_ind_cov = set()
# stores contig ids with excess across sample coverage:
excess_global_cov = set()
# stores individual coverage distributions:
ind_sample_cov = dict([(file, defaultdict(int)) for file in glob('*sorted.bam')])
# stores across sample coverage distribution:
across_sample_cov = defaultdict(int)
# stores 99th coverage percentile for each individual:
Q99_ind = {}
# set of contigs with SE read mappings from any individual:
all_contigs = set()  # contigs with only PE read mappings are completely ignored
# get coverage distributions
for file in glob('*sorted.bam'):
    cmd = "samtools view -f 64 %s | gawk '$4==2' | cut -f 3 | uniq -c" % (file)
    pipe = sp.Popen(cmd, shell=True, stdout=sp.PIPE).stdout
    for line in pipe:
        cov, contig = line.strip().split()
        ind_sample_cov[file][contig] += int(cov)
        Q99_ind[file] = 0  # initialise dict
        all_contigs.add(contig)
        across_sample_cov[contig] += int(cov)
    pipe.close()
In [10]:
%%time
# get 99th percentiles of individual coverage distributions
for file in glob('*sorted.bam'):
    Q99_ind[file] = np.percentile(ind_sample_cov[file].values(), 99.0, interpolation='lower')
# get 99th percentile of across sample coverage distribution
Q99_across_sample = np.percentile(across_sample_cov.values(), 99.0, interpolation='lower')
In [11]:
%%time
# check contigs for excess global, then individual coverage
for contig in all_contigs:
    if across_sample_cov[contig] > Q99_across_sample:
        excess_global_cov.add(contig)
        continue
    for file in glob('*sorted.bam'):
        if ind_sample_cov[file][contig] > Q99_ind[file]:
            excess_ind_cov.add(contig)
            break
Clearly, that is the "slow" part.
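Part of the cost is probably that glob('*sorted.bam') is called again for every single contig and that every check goes through two dictionary lookups. A minimal sketch of the same check with the file list and per-file data hoisted out of the loop (same logic as above, untested on the full data):

# precompute the list of (per-file coverage dict, per-file threshold) pairs once,
# instead of calling glob() inside the contig loop
per_file = [(ind_sample_cov[f], Q99_ind[f]) for f in glob('*sorted.bam')]

for contig in all_contigs:
    if across_sample_cov[contig] > Q99_across_sample:
        excess_global_cov.add(contig)
        continue
    for cov_dict, q99 in per_file:
        if cov_dict[contig] > q99:
            excess_ind_cov.add(contig)
            break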
In [12]:
len(all_contigs)
Out[12]:
In [13]:
%%time
# Return a new set with elements in the set 'keep' that are not in the set 'excess_cov':
keep_contigs = all_contigs.difference(excess_global_cov, excess_ind_cov)
print "{0:d} contigs passed individual and across sample coverage filtering".format(len(keep_contigs))
Briefly try numba.
In [19]:
a = np.array(all_contigs, dtype='str')
In [20]:
a.dtype
Out[20]:
In [21]:
from numba import jit
In [23]:
@jit(nopython=True)
def check_excess_cov(all_contigs):
    for contig in all_contigs:
        if across_sample_cov[contig] > Q99_across_sample:
            excess_global_cov.add(contig)
            continue
        for file in glob('*sorted.bam'):
            if ind_sample_cov[file][contig] > Q99_ind[file]:
                excess_ind_cov.add(contig)
                break
    return excess_global_cov, excess_ind_cov
In [24]:
b, c = check_excess_cov(a)
In [26]:
bed = "Big_Data_Contigs.noSEgt2.noDUST.bed"
In [27]:
bed.replace(".bed", ".COVfiltered.bed")
Out[27]:
In [28]:
bed_in = open("Big_Data_Contigs.noSEgt2.noDUST.bed")
In [29]:
bed_out = open("Big_Data_Contigs.noSEgt2.noDUST.bed".replace(".bed", ".COVfiltered.bed"), "w")
In [30]:
for line in bed_in:
    contig, _, _ = line.strip().split()
    if contig in keep_contigs:
        bed_out.write(line)
In [31]:
bed_in.close()
bed_out.close()
This would require a samtools depth file. It should do the same thing as even_depth.pl.
In [43]:
# open gzipped depth file
import gzip
depth_fh = gzip.open("ParEry.noSEgt2.noDUST.COVfiltered.noTGCAGG.depth.gz")
In [37]:
# iterate over lines of the file and split each into a list of fields
i = 0
for line in depth_fh:
    i += 1
    fields = line.strip().split()
    if i > 10: break
depth_fh.close()
In [41]:
# count number of individuals with > x coverage
import numpy as np
i = 0
for line in depth_fh:
    i += 1
    fields = line.strip().split()
    depths = np.array(fields[2:], dtype='int8')
    print depths
    if i > 10: break
depth_fh.close()
In [44]:
# count number of individuals with > x coverage
import numpy as np
i = 0
for line in depth_fh:
    i += 1
    fields = line.strip().split()
    depths = np.array(fields[2:], dtype='int8')
    print depths >= 1
    if i > 10: break
depth_fh.close()
In [45]:
# count number of individuals with > x coverage
depth_fh = gzip.open("ParEry.noSEgt2.noDUST.COVfiltered.noTGCAGG.depth.gz")
import numpy as np
i = 0
for line in depth_fh:
    i += 1
    fields = line.strip().split()
    depths = np.array(fields[2:], dtype='int8')
    print sum( depths >= 1 )
    if i > 10: break
depth_fh.close()
In [51]:
# if enough individuals have > x coverage, print out the line, otherwise skip that site
depth_fh = gzip.open("ParEry.noSEgt2.noDUST.COVfiltered.noTGCAGG.depth.gz")
import numpy as np
i = 0
for line in depth_fh:
    i += 1
    fields = line.strip().split()
    depths = np.array(fields[2:], dtype='int8')
    if sum( depths >= 1 ) >= 15:
        print line
    if i > 600: break
depth_fh.close()
OK, this seems to work.
I have put the above code in a script called minimum_coverage_filter.py.
Minimum coverage filtering on a samtools depth file is very slow. It took 2h 35 minutes to complete.
In [1]:
import pandas as pd
In [2]:
% ll *sites.gz
Unfortunately, I have let minimum_coverage_filter.py output only contig id and position, since that is all I need for the ANGSD analysis. However, that way I cannot get coverage distributions from the filtered sites. I have therefore changed minimum_coverage_filter.py to also print out the depth information for retained sites. The file is not very large anyway when gzipped.
In order to get coverage distributions over these filtered sites, I can turn the *sites.gz file into a BED file with my script sites2bed.pl. I can then use this for another samtools depth command, which should be fairly quick.
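For illustration, the conversion that sites2bed.pl performs could look roughly like this in Python (a sketch assuming the sites file is TAB-delimited with contig id and 1-based position in the first two columns; the file names are only examples):

import gzip

# turn a 1-based sites file into 0-based, half-open BED intervals, one interval per site
sites = gzip.open("example.sites.gz")
bed = open("example.bed", "w")
for line in sites:
    fields = line.strip().split('\t')
    contig, pos = fields[0], int(fields[1])
    bed.write("{0}\t{1}\t{2}\n".format(contig, pos - 1, pos))
sites.close()
bed.close()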
In [3]:
% ll *depth.gz
In [4]:
# read compressed TAB-delimited file directly into pandas data frame
filtered_sites_df = pd.read_table("ParEry.noSEgt2.noDUST.COVfiltered.noTGCAGG.1.15.depth.gz", header=None)
In [5]:
filtered_sites_df.head(6)
Out[5]:
In [6]:
# get basic statistics from the coverage table
filtered_sites_df.iloc[:, 2:].describe()
Out[6]:
In [7]:
import matplotlib.pyplot as plt
In [8]:
%matplotlib inline
In [9]:
import matplotlib
matplotlib.style.use('ggplot')
In [10]:
# get global coverage for each site
filtered_sites_df['rowsums'] = pd.Series(filtered_sites_df.iloc[:, 2:].sum(axis=1))
# axis=1 means sum over columns
In [20]:
filtered_sites_df['rowsums'].plot.hist(figsize=(14,12), bins=100, color='k', fontsize=14)
#plt.xlim(0, 200)
plt.xlabel("global coverage")
plt.ylabel("count of sites")
plt.title("global coverage distribution")
plt.savefig("ParEry.noSEgt2.noDUST.COVfiltered.noTGCAGG.1.15.GlobalCovDist.pdf")
In [14]:
pd.DataFrame.plot.hist?
The bins are difficult to set. I am going to create the histogram with R; matplotlib seems to be incapable of producing a full-resolution histogram without artefacts.
Apart from the artefacts, the across-sample coverage distribution of the excess-coverage-filtered and minimum-coverage-filtered sites looks good. Much better than the previous coverage filtering (see above)! There are negligibly few sites (148) with coverage above the 99th percentile of the across-sample SE read coverage.
In [ ]: