Sequana_coverage versus CNVnator (viral genome)

This notebook compares CNVnator, CNOGpro and sequana_coverage behaviour on a viral genome instance (same as in the virus notebook).

Versions used:

  • sequana 0.7.0

In [17]:
%pylab inline
matplotlib.rcParams['figure.figsize'] = [10,7]


Populating the interactive namespace from numpy and matplotlib

Here below, we provide the results of the sequana_coverage and CNVNator analysis as files within this directory itself. Nevertheless, if you need to rerun the analysis yourself, you can still do it but you need to generate the BAM file for CNVnator (see virus notebook). For sequana_coverage, you could use the BED file directly, which is also used here below for the plotting. This BED file can simply be downloaded as follows.


In [2]:
!wget https://github.com/sequana/resources/raw/master/coverage/JB409847.bed.bz2
!bunzip2 JB409847.bed.bz2


--2018-07-23 22:54:23--  https://github.com/sequana/resources/raw/master/coverage/JB409847.bed.bz2
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sequana/resources/master/coverage/JB409847.bed.bz2 [following]
--2018-07-23 22:54:24--  https://raw.githubusercontent.com/sequana/resources/master/coverage/JB409847.bed.bz2
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.120.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.120.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63485 (62K) [application/octet-stream]
Saving to: ‘JB409847.bed.bz2’

JB409847.bed.bz2    100%[===================>]  62.00K  --.-KB/s    in 0.02s   

2018-07-23 22:54:24 (3.91 MB/s) - ‘JB409847.bed.bz2’ saved [63485/63485]

The coverage signal

Let us have a quick look at the coverage itself.


In [14]:
from sequana import GenomeCov
b = GenomeCov("JB409847.bed")
chromosome = b.chr_list[0]
chromosome.run(4001, 2, circular=True)


Out[14]:
<sequana.bedtools.ChromosomeCovMultiChunk at 0x7f4c6105e4e0>

In [18]:
chromosome.plot_coverage()
_ = ylim([0,1500])


What you see are two long deleted regions of about 700 bases (one fully deleted) and two 3-bases deleted regions on the right hand side. Our goal is to detect those 4 events automatically.

Get ROIs using sequana_coverage

We ran the analysis with circular chromosome set on and a window parameter of

  • 1000
  • 2000
  • 3000
  • 4000

For example:

sequana_coverage -o -input JB409847.bed --no-html --no-multiqc -w 3000

Results are available in 4 files for your convenience: e.g. rois_1000.csv, rois_4000.csv

Note that the window size cannot be larger than a fifth of the genome length from the command line. Would you need to force it, you would need to use the library itself.

Get ROIs using CNVnator

cnvnator -root out1.root -tree JB409847.bed 
cnvnator -root out1.root -his 1
cnvnator -root out1.root -stat 1
cnvnator -root out1.root -partition 1 -ngc
cnvnator -root out1.root -call 1 -ngc > events_bin1.txt

The default parameter of 100 gives no detection. We then used bin=1, 5, 10, 20 and stored the results in:

  • events_bin1.txt
  • events_bin5.txt
  • events_bin10.txt
  • events_bin20.txt

How events are detected using CNVnator?


In [8]:
from sequana.cnv import CNVnator

In [9]:
cnv1 = CNVnator("events_bin1.txt").df
cnv5 = CNVnator("events_bin5.txt").df
cnv10 = CNVnator("events_bin10.txt").df
cnv20 = CNVnator("events_bin20.txt").df

In [10]:
def plot_rois_cnv(cnv):
    chromosome.plot_coverage()
    for _, this in cnv.iterrows():
        type_ = this['type']
        positions = [this.start, this.end]
        if type_ == "deletion":
            fill_between(positions, 0, 1000)
        else:
            fill_between(positions, 1000,2000)
    ylim([0,2000])

binning of 20 and 10


In [11]:
plot_rois_cnv(cnv10)


Here, the two main events on the left hand side (deleted regions of several hundreds of bases) are detected. However, the short deleted regions on the right hand side (a few bases) are not.

Binning of 5


In [12]:
plot_rois_cnv(cnv5)


Here again, the two main deleted regions are detected but the two short events are not. There is also a false detection at position 0 and around 4000

Binning of 1


In [13]:
plot_rois_cnv(cnv1)


binning of 1 is too small: the low deleted regions are detected but then, the small deleted regions are not and there are lots of upper regions classified as duplicated CNVs.

How events are detected using sequana_coverage ?


In [28]:
# W can be 1000,2000,3000,4000
import pandas as pd
def plot_rois_sequana(W):
    assert W in [2000, 3000, 4000, 1000,5000, "4000_t3"]
    if W == "4000_t3":
        chromosome.run(4001, circular=True)
    else:
        chromosome.run(W+1, circular=True)
    chromosome.plot_coverage(sample=False)
    rois = pd.read_csv("rois_{}.csv".format(W))
    for _, this in rois.iterrows():
        #if this.max_zscore >-4.5 and this.max_zscore<4.5 and this.max_cov!=0:
        #    continue    
        positions = [this.start, this.end]
            
        if abs(this.mean_zscore) >12: col="red"
        elif abs(this.mean_zscore) >8: col="orange"
        elif abs(this.mean_zscore) >4: col="yellow"
            
        if this.mean_zscore > 0:
            fill_between(positions, 1000, 2000, color=col)        
        else:
            fill_between(positions, 0, 1000, color=col)
        if this.end-this.start < 100:
            axvline(this.start, ymax=0.5, lw=3, color="r")
    ylim([0,2000])

In [29]:
plot_rois_sequana(5000)



In [71]:
plot_rois_sequana(4000)



In [40]:
plot_rois_sequana(3000)


Using W=2000, 3000, 4000 the results are robust and consistent: the same ROIs are detected. In particular, the 2 long and 2 short deleted regions are detected;

In addition, some short events not fully deleted are also systematically detected. The detection seems coherent visually except maybe for the duplicated events at position 5000, which could be considered a false detection. Note, however, that the zscore associated with this events could be used to discard the events (mean zscore of 5)


In [26]:
rois = pd.read_csv("rois_4000.csv")
rois[["start", "end", "size", "mean_rm", "max_zscore", "log2_ratio"]]


Out[26]:
start end size mean_rm max_zscore log2_ratio
0 3553 4252 699 919.410587 -16.204462 -5.224348
1 4756 5051 295 821.044068 4.864558 0.369192
2 5399 6212 813 877.333333 -12.933157 -1.128289
3 9344 9561 217 1043.811060 -4.815085 -0.445117
4 17045 17048 3 1005.000000 -16.204462 -9.972980
5 17052 17053 1 1005.000000 -16.140656 -7.972980
6 17186 17525 339 987.203540 -6.962564 -0.544632
7 19257 19258 1 962.000000 -15.987824 -6.209453
8 19554 19722 168 963.000000 -5.483700 -0.492862

Now, if we decrease the W parameter to 1000, we may miss the large deleted regions, which length is 711 and 782 (to detect those events we recommend to use W as twice this length so about 1500)


In [82]:
plot_rois_sequana(1000)


So, here we still detect the two deleted regions but we see that the running median fits the data too closely. Consequently, we have some false detections at position 0, 6600, 17923 to cite a few examples.

CNOGpro detection

Running CNOGpro manually using window length of 100 and 10, we get these results


In [58]:
CNOGpro_10 = [[3521, 4220, 0], [4221, 4230, 5], 
               [4241,4250,0], [4251,4260,3],
              [5681,5740,0], [19771,19795,0]]
CNOGpro_100 = [[3601,4200,0], [4201,4300,3], [4301,4400,2]]

In [75]:
plot_rois_sequana(4000)
for this in CNOGpro_10:
    plot(this[0:2], [1000, 1000], "ob-")



In [76]:
plot_rois_sequana(4000)
for this in CNOGpro_100:
    axhline(1000, this[0], 1000, color="r", lw=20)
    plot(this[0:2], [1000, 1000], "ob-")


Conclusions

On a viral genome (length 18000), we use sequana_coverage and cnvnator to detect events of interests that can be seen by eyes on the coverage signals that is one deleted region of about 700 bases, one depleted region of about 700 and two short deleted regions of a few bases long.

Sequana_coverage detects those 4 events with the default parameter (window length set to a fifth of the full genome length). It is also robust with respect to the parameter (2000, 3000, 4000 that give the same results). Note, however, that a window of 1000 is too short and lead to possibly false detections. Those false detections could be easily removed using the mean zscore associated with these events.

CNVnator detects the two long deleted events. However, the two short deleted ones are systematically missed, which is not surprising since CNVnator is designed to detect long regions. Depending of the bin parameter, there are a few false detections. Decreasing to a bin of 1, the results are difficult to assess since most of the genome is classified as a mix of regions of interests.

CNOGpro detects the two long deleted events. However, their exact duration is not correct. Short events are missed using bins of 5, 10, 100, 200.

computation time:

  • sequana_coverage: (with --no-html and --no-multiqc options), sequana_coverage takes 2.5 seconds irrespective of the window parameter
  • cnvnator: 8.5 s for bin=5 or 10, 18s for bins=1