Reuse the FN433596 bam file. Need to name the reference.fasta to ENA|FN433596|FN433596.1.fa

https://github.com/abyzovlab/CNVnator/wiki/Calling-CNVs

cnvnator -root out.root -tree FN433596.fasta.sorted.bam 

# GENERATING HISTOGRAM
cnvnator -root out.root -his 100 

#CALCULATING STATISTICS 
cnvnator -root out.root -stat 100

#RD SIGNAL PARTITIONING
cnvnator -root out.root -partition 100 -ngc

# CNV calling
cnvnator -root out.root -call 100 -ngc > events_bin100.txt

Took 40 seconds

Setting the binning to 1, the all analysis takes about 10 minutes.

cnvnator -root out.root -tree FN433596.fasta.sorted.bam 
cnvnator -root out.root -his 1
cnvnator -root out.root -stat 1
cnvnator -root out.root -partition 1 -ngc
cnvnator -root out.root -call 1 -ngc > events_bin1.txt

Took 24 minutes

Finally, we also run CNVnator with a bin of 6. According to the paper, optimal bin size, and thus breakpoint resolution accuracy scales roughtly inversely with the coverage. 100 for coverage of 20-30X, 30 for coverage og 100X, so we extrapolate to a bin of 6 for 500X. The analysis took about 5 minutes.

sequana coverage W=20000 takes 2 minutes with reports and conversion BAM to BED and about 1 minute if BED is already present.



In [1]:

    
from sequana import GenomeCov
%pylab inline
rcParams['figure.figsize'] = (10,7)
import pandas as pd









    



Populating the interactive namespace from numpy and matplotlib



In [10]:

    
from sequana.cnv import CNVnator
df1 = CNVnator("events_bin1.txt").df
df6 = CNVnator("events_bin6.txt").df
df100 = CNVnator("events_bin100.txt").df



In [11]:

    
len(df1), len(df6), len(df100)









    Out[11]:





(207, 72, 13)



In [12]:

    
# load the BED and genbank file
b = GenomeCov("FN433596.bed")
chromosome = b.chr_list[0]
chromosome.run(100001, circular=True, cnv_delta=10)
chromosome.rois.df.to_csv("rois_100000_10.csv")    
    
def plot_variant(chrom, x1=526000, x2=538000):
    clf()    
    chrom.plot_rois(x1, x2)
    roi = chrom.rois
    low = roi.get_low_rois()
    high = roi.get_high_rois()
    chr = chrom
    highT = (chr.thresholds.high * chr.best_gaussian["sigma"] +
         chr.best_gaussian["mu"]) * chr.df["rm"]
    lowT = (chr.thresholds.low * chr.best_gaussian["sigma"] +
        chr.best_gaussian["mu"]) * chr.df["rm"]

    if len(low):
        for k,v in low.iterrows():        
            Y1 = chr.df['cov'].loc[v.start:v.end]
            Y2 = lowT.loc[v.start:v.end] 
            Y1 = Y1.combine(Y2, max) *0
            if v.start > x1 and v.end<x2:
                try:fill_between(range(v.start, v.end+1), Y1, Y2, alpha=0.6, color="blue")
                except:pass
    
    if len(high):
        for k,v in high.iterrows():
            Y1 = highT.loc[v.start:v.end]
            Y2 = chr.df['cov'].loc[v.start:v.end]
            Y2 = Y2.combine(Y1,max)
            if v.start > x1 and v.end<x2:
                try:fill_between(range(v.start, v.end+1), y1=Y1,y2=Y2 ,alpha=0.6, color="orange")
                except:pass

    xlim([x1,x2])

    return chromosome



In [13]:

    
from easydev import Progress
cnvnator = df6
N = len(cnvnator)
pb = Progress(N)
for i in range(N):
    this = cnvnator.loc[i]
    try:
        # develop branhc
        chro = plot_variant(chromosome, max(this.start-10000, 0), 
                    this.end+10000, color_low="b")
    except:
        chro = plot_variant(chromosome, max(this.start-10000, 0), 
                    this.end+10000)
        
    x1, x2 = xlim()
    L = x2-x1
    color = "yellow"
    # this[3] currently holds the ratio between CNV and mean
    axhline(chro.df['cov'].mean() * this[3], (this.start-x1)/L, 
            (this.end-x1)/L, lw=8, color=color, alpha=0.5)
    savefig("comp_cnv_bin6_{}.png".format(i+1), dpi=200)
    pb.animate(i+1)
close()

conclusion

Using a bin of 100, the 13 events detected by CNVnator are also detected with sequana_coverage
Using a bin of 6, the 72 events detected by CNVnator are also detected with sequana_coverage
Using a bin of 1, amongst the 200 events many false positives are reported either short (few bases) or long (kb).
sequana_coverage detects additional events (see images below for a few examples)
- very short (See e.G. image comp_cnv_2.png, position 380000)
- medium size (see e.g. image comp_cnv_2.png, position 395000)
- very short and strong events (e.g. comp_cnv_3, position 540000)
- long (e.g. comp_cnv_12.png, position 2444242, event of 2163 bases)
CNVnator is better at detected long events. For instance, in comp_cnv_10.png, the large gap is detected. Here with sequana_coverage we also detect it but if the window paramter were too short, we could have missed it.



In [6]:

    
from IPython.display import Image, HTML

Event detected in CNVnator (red thick line) and sequana_coverage (green line). Note that there are other events detected with sequana_coverage: short (red and green dots), medium size event also at position 395000



In [7]:

    
Image("comp_cnv_2.png")









    Out[7]:

Strong and short events (right hand side) are detected by sequana_coverage (here as a single events because we clustered events with distances below 1000 bases)



In [51]:

    
Image("comp_cnv_3.png")









    Out[51]:

Complex pattern such as the following ones are handled correctly in sequana_coverage: we detect 4 distincts long events. CNVnator itself, find the first and last events only;



In [53]:

    
Image("comp_cnv_12.png")









    Out[53]:

Event that could have been missed by sequana_coverage if $W$ were shorter. Although, special case of deleted regions are systematically reported.



In [54]:

    
Image("comp_cnv_10.png")









    Out[54]:

Example of false positives found by CNVnator with bin=1



In [14]:

    
Image("comp_cnv_bin1_1.png")









    Out[14]:



In [16]:

    
Image("comp_cnv_bin1_17.png")









    Out[16]:

See other files calld comp_cnv_bin1_*.png in this directory



In [ ]: