In [9]:
! wget https://github.com/sequana/resources/raw/master/coverage/FN433596.bed.bz2 1>out 2>err
! bunzip2 FN433596.bed.bz2 1>out 2>err
If you want to reproduce this BED file, please see the bacteria notebook.
In [3]:
%pylab inline
matplotlib.rcParams['figure.figsize'] = [10,7]
In [5]:
! sequana_coverage --download-reference FN433596
! sequana_coverage --download-genbank FN433596
In [16]:
from sequana import GenomeCov, logger
logger.level = "ERROR"
# load the BED and genbank file
b = GenomeCov("FN433596.bed", "FN433596.gbk")
b.compute_gc_content("FN433596.fa")
In [17]:
# load the first chromosome and compute running median and get z-scores
chromosome = b.chr_list[0]
chromosome.thresholds.low = -4
chromosome.thresholds.high = 4
chromosome.run(20001, circular=True)
Out[17]:
In [18]:
rois = chromosome.get_rois()
rois.df
Out[18]:
In [14]:
#
chromosome.plot_coverage()
The following code defines a function that performs the analysis shown above (running median, z-score computation, identification of the ROIs) and some plotting, then returns the ROIs. It allows one to focus on a specific genomic range and to highlight the ROIs that have been detected.
One of its parameters is the window size of the running median step, which is the parameter investigated in this notebook.
In [2]:
def identify_rois_and_plot_variant(x1=526000, x2=538000, W=20001):
    clf()
    from sequana import GenomeCov
    # load the BED and genbank files
    b = GenomeCov("FN433596.bed", "FN433596.gbk")
    # restrict the data to the genomic range of interest
    b.chr_list[0]._df = b.chr_list[0].df.iloc[x1:x2]
    #b.compute_gc_content("FN433596.fasta")
    chromosome = b.chr_list[0]
    chromosome.run(W, circular=True)
    chromosome.plot_coverage()
    roi = chromosome.get_rois()
    low = roi.get_low_rois()
    high = roi.get_high_rois()
    chrom = chromosome
    # rebuild the threshold curves from the fitted Gaussian and the running median
    highT = (chrom.thresholds.high * chrom.best_gaussian["sigma"] +
             chrom.best_gaussian["mu"]) * chrom.df["rm"]
    lowT = (chrom.thresholds.low * chrom.best_gaussian["sigma"] +
            chrom.best_gaussian["mu"]) * chrom.df["rm"]
    # shade the low ROIs (between the x-axis and the lower threshold)
    if len(low):
        for k, v in low.iterrows():
            Y1 = chrom.df['cov'].loc[v.start:v.end]
            Y2 = lowT.loc[v.start:v.end]
            Y1 = Y1.combine(Y2, max) * 0
            if v.start > x1 and v.end < x2:
                try:
                    fill_between(range(v.start, v.end + 1), Y1, Y2, alpha=0.6, color="blue")
                except Exception:
                    pass
    # shade the high ROIs (between the upper threshold and the coverage)
    if len(high):
        for k, v in high.iterrows():
            Y1 = highT.loc[v.start:v.end]
            Y2 = chrom.df['cov'].loc[v.start:v.end]
            Y2 = Y2.combine(Y1, max)
            if v.start > x1 and v.end < x2:
                try:
                    fill_between(range(v.start, v.end + 1), y1=Y1, y2=Y2, alpha=0.6, color="orange")
                except Exception:
                    pass
    xlim([x1, x2])
    return roi
First, let us use a small window size of only 2,000 bases. If we focus on the area containing a deleted region, we see that the thresholds (implied by the running median) are affected by the deletion. In other words, even the running median did not ignore the deleted region. For the deletion to be ignored, the window parameter should be at least twice as large as the deleted region itself.
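This rule of thumb can be illustrated on a synthetic signal (not the FN433596 data): flat 100x coverage with a 1100-base deletion, using pandas' rolling median as a stand-in for sequana's running median. Positions and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic coverage: 100x everywhere, with a 1100-base deletion
# starting at position 5000 (values and positions are illustrative).
cov = np.full(12000, 100.0)
cov[5000:6100] = 0
s = pd.Series(cov)

# Running medians with a window shorter, then longer, than twice
# the deletion length (2 x 1100 = 2200 bases).
rm_small = s.rolling(1001, center=True, min_periods=1).median()
rm_large = s.rolling(2301, center=True, min_periods=1).median()

# Inside the deletion, the small window sees only zeros, so the
# running median follows the deletion and the event is not flagged.
print(rm_small[5550])   # 0.0
# The large window still sees a majority of 100x bases, so the
# median stays at 100 and the deletion stands out below it.
print(rm_large[5550])   # 100.0
```

With the small window, the deleted region drags the running median (and hence the thresholds) down to zero, which is exactly the effect visible in the plots of this notebook.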
In [8]:
# Here we emphasize the fact that you will miss simple CNV-like events
# if the window parameter is too short. For instance, this 1100-base-long
# event is missed because the window itself is shorter than 2 times 1100 bases.
roi2000 = identify_rois_and_plot_variant(W=2001, x1=526000, x2=538000)
ylim([0,800])
Out[8]:
In [9]:
# To detect the 1100-base-long deleted region, a window size of 2200 should be enough,
# although a slightly larger one (e.g. 2300) would give smoother and therefore less
# biased results.
# Moreover, as can be seen in this example, the event detected around
# position 527,000 is still detected.
roi2200 = identify_rois_and_plot_variant(W=2200, x1=526000, x2=538000)
ylim([0,800])
Out[9]:
We have run sequana_coverage to store the ROIs for different window sizes. There is no gold standard but we can compare the different results and show that the impact of the window parameter is small.
In [4]:
import pandas as pd
In [11]:
bins = range(0, 2500, 200)
rois1000 = pd.read_csv("rois_1000.csv")
rois2200 = pd.read_csv("rois_2200.csv")
rois4000 = pd.read_csv("rois_4000.csv")
rois32000 = pd.read_csv("rois_32000.csv")
rois64000 = pd.read_csv("rois_64000.csv")
_ = hist([rois1000['size'], rois2200['size'], rois4000['size'], rois32000['size'], rois64000['size']],
         bins=bins, log=True)
legend(["1000", "2200", "4000", "32000", "64000"])
grid()
_ = xlabel("length of the detected ROIs")
_ = ylabel("Number of ROIs (log scale)")
In the plot above, we see that short ROIs are detected whatever the window length. However, for longer ROIs (e.g. above 500 bases), the window set to 1000 (blue bars) already misses some events. Conversely, the largest windows (32000, 64000) detect all events, short or long. The large windows do not seem to add false detections.
In [14]:
_ = identify_rois_and_plot_variant(x1=2000000, x2=2500000, W=160001)
Impact of using a larger window: the running median may no longer follow the trend closely, which may be an issue for coverage signals with lots of fluctuations.
For instance, here below, a few green circles lie just next to the threshold line (red line); they are detected here but would not be with the larger window.
In [21]:
chromosome.run(128000, circular=True)
chromosome.plot_coverage(x1=540000, x2=610000, sample=False)
marker = "oo"
color = "gb"
ms = [12, 6]
for i, this in enumerate([1000, 128000]):
    df = pd.read_csv("rois_{}.csv".format(this))
    for _, d in df.iterrows():
        plot([d['start'], d['end']], [d["mean_cov"], d["mean_cov"]], marker[i], color=color[i],
             markersize=ms[i], alpha=0.6, lw=1, label=this)
ylim([0, 1200])
legend()
Out[21]:
Same as above but here the running median plot uses a window set to 1000. Green and blue dots are the same as above.
In [23]:
chromosome.run(1000, circular=True)
chromosome.plot_coverage(x1=540000, x2=610000, sample=False)
marker = "oo"
color = "gb"
ms = [12, 6]
for i, this in enumerate([1000, 128000]):
    df = pd.read_csv("rois_{}.csv".format(this))
    for _, d in df.iterrows():
        plot([d['start'], d['end']], [d["mean_cov"], d["mean_cov"]], marker[i], color=color[i],
             markersize=ms[i], alpha=0.6, lw=1, label=this)
ylim([0,1200])
legend()
Out[23]:
Using the previous code, you can simply compare any two of the ROI sets available in the directory:
In [24]:
import glob
glob.glob("rois*csv")
Out[24]:
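A minimal sketch of such a comparison, based on interval overlaps. The `overlap_fraction` helper is ours, not part of sequana; it only assumes the `start` and `end` columns present in the ROI CSV files.

```python
import pandas as pd

def overlap_fraction(df1, df2):
    """Fraction of ROIs in df1 that overlap at least one ROI in df2."""
    hits = sum(
        ((df2['start'] <= roi['end']) & (df2['end'] >= roi['start'])).any()
        for _, roi in df1.iterrows()
    )
    return hits / len(df1)

# Tiny synthetic check: only the first of the two intervals in A
# overlaps an interval of B.
A = pd.DataFrame({'start': [10, 100], 'end': [20, 110]})
B = pd.DataFrame({'start': [15], 'end': [18]})
print(overlap_fraction(A, B))  # 0.5
```

Applied to e.g. `pd.read_csv("rois_20000.csv")` and `pd.read_csv("rois_32000.csv")`, this gives the proportion of ROIs found with one window size that are recovered with another.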
There seems to be an increase in false alarms for small window sizes (their z-scores remain small). For instance, if we plot the number of ROIs for window sizes from 1,000 to 512,000, as shown below, there seems to be a plateau after 20,000, where the number of events converges to about 650.
Choosing 20,000 (in this case) seems to be a good choice since it reduces the false alarms. Larger windows would be useful only to detect deleted or duplicated regions, with the drawbacks of large windows discussed above.
In [6]:
L = []
vals = [1000, 2200, 4000, 8000, 16000, 20000, 24000, 32000, 48000, 64000, 72000,
        104000, 128000, 176000, 232000, 256000, 300000, 400000, 512000]
for this in vals:
    df = pd.read_csv("rois_{}.csv".format(this))
    L.append(len(df))
plot(vals, L, "o-")
ylim([600, 800])
# mean number of detected ROIs for window sizes above 20,000
M = mean([l for l, v in zip(L, vals) if v > 20000])
axhline(M, color="r", ls="--")
axvline(20000, color="r", ls="--")
xlabel("Window size")
ylabel("Number of detected ROIs")
Out[6]:
In [ ]: