In [17]:
%pylab inline
matplotlib.rcParams['figure.figsize'] = [10,7]
Here below, we provide the results of the sequana_coverage and CNVNator analysis as files within this directory itself. Nevertheless, if you need to rerun the analysis yourself, you can still do it but you need to generate the BAM file for CNVnator (see virus notebook). For sequana_coverage, you could use the BED file directly, which is also used here below for the plotting. This BED file can simply be downloaded as follows.
In [2]:
!wget https://github.com/sequana/resources/raw/master/coverage/JB409847.bed.bz2
!bunzip2 JB409847.bed.bz2
Let us have a quick look at the coverage itself.
In [14]:
from sequana import GenomeCov
b = GenomeCov("JB409847.bed")
chromosome = b.chr_list[0]
chromosome.run(4001, 2, circular=True)
Out[14]:
In [18]:
chromosome.plot_coverage()
_ = ylim([0,1500])
What you see are two long deleted regions of about 700 bases (one fully deleted) and two 3-bases deleted regions on the right hand side. Our goal is to detect those 4 events automatically.
We ran the analysis with circular chromosome set on and a window parameter of
For example:
sequana_coverage -o -input JB409847.bed --no-html --no-multiqc -w 3000
Results are available in 4 files for your convenience: e.g. rois_1000.csv, rois_4000.csv
Note that the window size cannot be larger than a fifth of the genome length from the command line. Would you need to force it, you would need to use the library itself.
cnvnator -root out1.root -tree JB409847.bed
cnvnator -root out1.root -his 1
cnvnator -root out1.root -stat 1
cnvnator -root out1.root -partition 1 -ngc
cnvnator -root out1.root -call 1 -ngc > events_bin1.txt
The default parameter of 100 gives no detection. We then used bin=1, 5, 10, 20 and stored the results in:
In [8]:
from sequana.cnv import CNVnator
In [9]:
cnv1 = CNVnator("events_bin1.txt").df
cnv5 = CNVnator("events_bin5.txt").df
cnv10 = CNVnator("events_bin10.txt").df
cnv20 = CNVnator("events_bin20.txt").df
In [10]:
def plot_rois_cnv(cnv):
chromosome.plot_coverage()
for _, this in cnv.iterrows():
type_ = this['type']
positions = [this.start, this.end]
if type_ == "deletion":
fill_between(positions, 0, 1000)
else:
fill_between(positions, 1000,2000)
ylim([0,2000])
In [11]:
plot_rois_cnv(cnv10)
Here, the two main events on the left hand side (deleted regions of several hundreds of bases) are detected. However, the short deleted regions on the right hand side (a few bases) are not.
In [12]:
plot_rois_cnv(cnv5)
Here again, the two main deleted regions are detected but the two short events are not. There is also a false detection at position 0 and around 4000
In [13]:
plot_rois_cnv(cnv1)
binning of 1 is too small: the low deleted regions are detected but then, the small deleted regions are not and there are lots of upper regions classified as duplicated CNVs.
In [28]:
# W can be 1000,2000,3000,4000
import pandas as pd
def plot_rois_sequana(W):
assert W in [2000, 3000, 4000, 1000,5000, "4000_t3"]
if W == "4000_t3":
chromosome.run(4001, circular=True)
else:
chromosome.run(W+1, circular=True)
chromosome.plot_coverage(sample=False)
rois = pd.read_csv("rois_{}.csv".format(W))
for _, this in rois.iterrows():
#if this.max_zscore >-4.5 and this.max_zscore<4.5 and this.max_cov!=0:
# continue
positions = [this.start, this.end]
if abs(this.mean_zscore) >12: col="red"
elif abs(this.mean_zscore) >8: col="orange"
elif abs(this.mean_zscore) >4: col="yellow"
if this.mean_zscore > 0:
fill_between(positions, 1000, 2000, color=col)
else:
fill_between(positions, 0, 1000, color=col)
if this.end-this.start < 100:
axvline(this.start, ymax=0.5, lw=3, color="r")
ylim([0,2000])
In [29]:
plot_rois_sequana(5000)
In [71]:
plot_rois_sequana(4000)
In [40]:
plot_rois_sequana(3000)
Using W=2000, 3000, 4000 the results are robust and consistent: the same ROIs are detected. In particular, the 2 long and 2 short deleted regions are detected;
In addition, some short events not fully deleted are also systematically detected. The detection seems coherent visually except maybe for the duplicated events at position 5000, which could be considered a false detection. Note, however, that the zscore associated with this events could be used to discard the events (mean zscore of 5)
In [26]:
rois = pd.read_csv("rois_4000.csv")
rois[["start", "end", "size", "mean_rm", "max_zscore", "log2_ratio"]]
Out[26]:
Now, if we decrease the W parameter to 1000, we may miss the large deleted regions, which length is 711 and 782 (to detect those events we recommend to use W as twice this length so about 1500)
In [82]:
plot_rois_sequana(1000)
So, here we still detect the two deleted regions but we see that the running median fits the data too closely. Consequently, we have some false detections at position 0, 6600, 17923 to cite a few examples.
Running CNOGpro manually using window length of 100 and 10, we get these results
In [58]:
CNOGpro_10 = [[3521, 4220, 0], [4221, 4230, 5],
[4241,4250,0], [4251,4260,3],
[5681,5740,0], [19771,19795,0]]
CNOGpro_100 = [[3601,4200,0], [4201,4300,3], [4301,4400,2]]
In [75]:
plot_rois_sequana(4000)
for this in CNOGpro_10:
plot(this[0:2], [1000, 1000], "ob-")
In [76]:
plot_rois_sequana(4000)
for this in CNOGpro_100:
axhline(1000, this[0], 1000, color="r", lw=20)
plot(this[0:2], [1000, 1000], "ob-")
On a viral genome (length 18000), we use sequana_coverage and cnvnator to detect events of interests that can be seen by eyes on the coverage signals that is one deleted region of about 700 bases, one depleted region of about 700 and two short deleted regions of a few bases long.
Sequana_coverage detects those 4 events with the default parameter (window length set to a fifth of the full genome length). It is also robust with respect to the parameter (2000, 3000, 4000 that give the same results). Note, however, that a window of 1000 is too short and lead to possibly false detections. Those false detections could be easily removed using the mean zscore associated with these events.
CNVnator detects the two long deleted events. However, the two short deleted ones are systematically missed, which is not surprising since CNVnator is designed to detect long regions. Depending of the bin parameter, there are a few false detections. Decreasing to a bin of 1, the results are difficult to assess since most of the genome is classified as a mix of regions of interests.
CNOGpro detects the two long deleted events. However, their exact duration is not correct. Short events are missed using bins of 5, 10, 100, 200.
computation time: