sequana_coverage test case example (Bacteria Staphylococcus aureus)

This notebook illustrates the impact of the window parameter on the results.

Here, we can download a BED file generated for you corresponding to:


In [9]:
! wget https://github.com/sequana/resources/raw/master/coverage/FN433596.bed.bz2 1>out 2>err
! bunzip2 FN433596.bed.bz2 1>out 2>err

If you want to reproduce this BED file, please see the bacteria notebook.


In [3]:
%pylab inline
matplotlib.rcParams['figure.figsize'] = [10,7]


Populating the interactive namespace from numpy and matplotlib

Download reference and genbank (optional)


In [5]:
! sequana_coverage --download-reference FN433596
! sequana_coverage --download-genbank FN433596


INFO    [sequana]:  Downloading reference FN433596 from ENA

INFO    [sequana]:  Downloading genbank FN433596 from ENA

Sequana coverage (library)

Here we will execute the coverage code to search for ROIs but looking at the impact of the window size parameter. Instead of using the standalone code we will the library to be able to loop through the window parameter.


In [16]:
from sequana import GenomeCov, logger
logger.level = "ERROR"
# load the BED and genbank file
b = GenomeCov("FN433596.bed", "FN433596.gbk")
b.compute_gc_content("FN433596.fa")

In [17]:
# load the first chromosome and compute running median and get z-scores
chromosome = b.chr_list[0]
chromosome.thresholds.low = -4
chromosome.thresholds.high = 4
chromosome.run(20001, circular=True)


 
Out[17]:
<sequana.bedtools.ChromosomeCovMultiChunk at 0x7f7c348366d8>

In [18]:
rois = chromosome.get_rois()
rois.df


Out[18]:
chr start end size mean_cov max_cov mean_rm mean_zscore max_zscore log2_ratio
0 ENA|FN433596|FN433596.1 1 19 18 311.666667 367 521.000000 -5.500030 -7.142028 -0.741280
1 ENA|FN433596|FN433596.1 5375 5401 26 354.846154 365 521.000000 -4.368637 -4.941052 -0.554090
2 ENA|FN433596|FN433596.1 6202 6239 37 322.108108 368 520.000000 -5.210212 -6.210645 -0.690967
3 ENA|FN433596|FN433596.1 10419 10442 23 333.913043 363 520.000000 -4.900303 -6.000625 -0.639039
4 ENA|FN433596|FN433596.1 24149 24251 102 356.833333 400 518.000000 -4.262420 -5.470304 -0.537702
5 ENA|FN433596|FN433596.1 26042 26046 4 357.000000 365 523.000000 -4.347973 -4.504585 -0.550887
6 ENA|FN433596|FN433596.1 31011 31070 59 356.169492 373 524.000000 -4.387393 -4.834694 -0.557003
7 ENA|FN433596|FN433596.1 36531 36532 1 67.000000 67 521.000000 -11.910810 -11.910810 -2.959050
8 ENA|FN433596|FN433596.1 48293 48373 80 755.087500 841 508.425000 6.608366 8.933525 0.570609
9 ENA|FN433596|FN433596.1 50916 50954 38 301.157895 351 504.000000 -5.509219 -6.515674 -0.742904
10 ENA|FN433596|FN433596.1 53785 53787 2 343.500000 346 498.000000 -4.250248 -4.318779 -0.535836
11 ENA|FN433596|FN433596.1 53795 53801 6 328.333333 351 497.166667 -4.650814 -5.041590 -0.598568
12 ENA|FN433596|FN433596.1 54431 54432 1 62.000000 62 495.000000 -11.956492 -11.956492 -2.997088
13 ENA|FN433596|FN433596.1 58025 58039 14 346.857143 353 492.000000 -4.042269 -4.204784 -0.504317
14 ENA|FN433596|FN433596.1 59852 59853 1 53.000000 53 488.000000 -12.183731 -12.183731 -3.202817
15 ENA|FN433596|FN433596.1 60495 60505 10 332.100000 338 486.000000 -4.337967 -4.537400 -0.549339
16 ENA|FN433596|FN433596.1 60790 60794 4 337.500000 341 485.000000 -4.166740 -4.237107 -0.523097
17 ENA|FN433596|FN433596.1 61955 61975 20 266.950000 311 480.000000 -6.074241 -7.125106 -0.846465
18 ENA|FN433596|FN433596.1 62192 62219 27 291.555556 333 478.592593 -5.349583 -6.170964 -0.715027
19 ENA|FN433596|FN433596.1 65030 65353 323 325.944272 381 455.278638 -3.892755 -7.065726 -0.482124
20 ENA|FN433596|FN433596.1 66245 66370 125 129.456000 366 455.000000 -9.782303 -13.666350 -1.813405
21 ENA|FN433596|FN433596.1 67874 67875 1 594.000000 594 457.000000 4.077342 4.077342 0.378269
22 ENA|FN433596|FN433596.1 68690 73864 5174 32.052184 390 468.490143 -12.750211 -13.666350 -3.869524
23 ENA|FN433596|FN433596.1 74099 74136 37 346.270270 363 491.000000 -4.038985 -4.380135 -0.503824
24 ENA|FN433596|FN433596.1 74738 74794 56 291.535714 351 496.589286 -5.652315 -7.914092 -0.768381
25 ENA|FN433596|FN433596.1 74876 75122 246 196.044715 351 498.760163 -8.302775 -13.611526 -1.347163
26 ENA|FN433596|FN433596.1 87846 87860 14 321.500000 350 513.000000 -5.111008 -5.975856 -0.674140
27 ENA|FN433596|FN433596.1 88604 88621 17 366.352941 373 513.000000 -3.917440 -4.086496 -0.485725
28 ENA|FN433596|FN433596.1 88899 88909 10 343.200000 357 513.000000 -4.533556 -4.964650 -0.579909
29 ENA|FN433596|FN433596.1 93391 93406 15 340.800000 359 509.000000 -4.526153 -5.030366 -0.578740
... ... ... ... ... ... ... ... ... ... ...
649 ENA|FN433596|FN433596.1 2848995 2849013 18 333.222222 358 506.000000 -4.676402 -5.356868 -0.602653
650 ENA|FN433596|FN433596.1 2850248 2850276 28 327.928571 360 509.750000 -4.884436 -5.555877 -0.636408
651 ENA|FN433596|FN433596.1 2851822 2851828 6 339.333333 355 509.000000 -4.565489 -4.842627 -0.584963
652 ENA|FN433596|FN433596.1 2852520 2852539 19 354.947368 362 508.000000 -4.127984 -4.395296 -0.517223
653 ENA|FN433596|FN433596.1 2869377 2869378 1 32.000000 32 513.000000 -12.814808 -12.814808 -4.002815
654 ENA|FN433596|FN433596.1 2880411 2880416 5 28.200000 69 513.000000 -12.915928 -13.240579 -4.185192
655 ENA|FN433596|FN433596.1 2880745 2880773 28 743.821429 757 514.000000 6.088754 6.438763 0.533188
656 ENA|FN433596|FN433596.1 2882717 2882736 19 357.368421 365 515.000000 -4.193456 -4.600832 -0.527160
657 ENA|FN433596|FN433596.1 2883446 2883461 15 328.200000 364 516.066667 -4.984551 -5.623729 -0.652982
658 ENA|FN433596|FN433596.1 2884127 2884234 107 354.327103 392 517.000000 -4.310407 -5.375237 -0.545082
659 ENA|FN433596|FN433596.1 2892275 2892276 1 66.000000 66 517.000000 -11.923632 -11.923632 -2.969626
660 ENA|FN433596|FN433596.1 2900787 2900802 15 348.066667 358 515.000000 -4.440021 -4.706862 -0.565209
661 ENA|FN433596|FN433596.1 2905671 2905679 8 339.500000 362 513.000000 -4.632016 -5.071092 -0.595547
662 ENA|FN433596|FN433596.1 2914047 2914051 4 76.750000 88 513.000000 -11.623979 -11.883433 -2.740720
663 ENA|FN433596|FN433596.1 2922294 2922319 25 332.000000 366 518.000000 -4.916874 -6.208212 -0.641769
664 ENA|FN433596|FN433596.1 2929439 2929447 8 352.625000 368 522.000000 -4.444536 -4.696248 -0.565915
665 ENA|FN433596|FN433596.1 2935798 2935819 21 333.238095 369 524.000000 -4.984803 -5.590204 -0.653013
666 ENA|FN433596|FN433596.1 2943741 2943743 2 368.500000 370 524.000000 -4.066157 -4.105235 -0.507902
667 ENA|FN433596|FN433596.1 2957573 2957639 66 361.348485 390 522.000000 -4.216401 -5.245438 -0.530659
668 ENA|FN433596|FN433596.1 2958224 2958241 17 335.176471 364 523.000000 -4.917609 -5.653068 -0.641890
669 ENA|FN433596|FN433596.1 2971883 2971911 28 328.107143 364 518.000000 -5.019465 -6.155504 -0.658785
670 ENA|FN433596|FN433596.1 2973580 2973598 18 355.444444 367 519.000000 -4.317071 -4.697004 -0.546110
671 ENA|FN433596|FN433596.1 2982946 2982990 44 362.909091 372 525.000000 -4.229821 -4.461479 -0.532709
672 ENA|FN433596|FN433596.1 2994132 2994160 28 347.178571 371 526.000000 -4.656016 -5.179711 -0.599385
673 ENA|FN433596|FN433596.1 2996247 2996251 4 175.250000 181 524.000000 -9.100723 -9.237496 -1.580152
674 ENA|FN433596|FN433596.1 3009891 3009904 13 347.769231 361 512.000000 -4.393891 -4.841003 -0.558014
675 ENA|FN433596|FN433596.1 3025953 3025973 20 347.550000 366 519.000000 -4.524719 -5.091550 -0.578514
676 ENA|FN433596|FN433596.1 3032369 3032384 15 355.066667 367 519.000000 -4.327008 -4.775913 -0.547645
677 ENA|FN433596|FN433596.1 3036405 3036408 3 370.666667 371 524.000000 -4.009711 -4.027079 -0.499444
678 ENA|FN433596|FN433596.1 3043203 3043211 8 339.875000 368 521.875000 -4.776028 -5.543701 -0.618700

679 rows × 10 columns


In [14]:
# 
chromosome.plot_coverage()


The following code is a function that does the analysis shown above (running median, zscore computation, identification of the ROI, and some plotting). The function then returns the ROIs. It allows to focus on a specific genome range and emphasize the ROIs that have been detected.

One of the parameter is the window parameter of the running median step, which is investigated in this notebook.


In [2]:
def identify_rois_and_plot_variant( x1=526000, x2=538000, W=20001):
    clf()    
    from sequana import GenomeCov
    # load the BED and genbank file
    b = GenomeCov("FN433596.bed", "FN433596.gbk")
    b.chr_list[0]._df = b.chr_list[0].df.iloc[x1:x2]
    #b.compute_gc_content("FN433596.fasta")
    chromosome = b.chr_list[0]
    chromosome.run(W, circular=True)
    chromosome.plot_coverage()
        
    roi = chromosome.get_rois()
    low = roi.get_low_rois()
    high = roi.get_high_rois()
    
    chr = chromosome
    highT = (chr.thresholds.high * chr.best_gaussian["sigma"] +
         chr.best_gaussian["mu"]) * chr.df["rm"]
    lowT = (chr.thresholds.low * chr.best_gaussian["sigma"] +
        chr.best_gaussian["mu"]) * chr.df["rm"]

    if len(low):
        for k,v in low.iterrows():        
            Y1 = chr.df['cov'].loc[v.start:v.end]
            Y2 = lowT.loc[v.start:v.end] 
            Y1 = Y1.combine(Y2, max) *0
            if v.start > x1 and v.end<x2:
                try:fill_between(range(v.start, v.end+1), Y1, Y2, alpha=0.6, color="blue")
                except:pass
    
    if len(high):
        for k,v in high.iterrows():
            Y1 = highT.loc[v.start:v.end]
            Y2 = chr.df['cov'].loc[v.start:v.end]
            Y2 = Y2.combine(Y1,max)
            if v.start > x1 and v.end<x2:
                try:fill_between(range(v.start, v.end+1), y1=Y1,y2=Y2 ,alpha=0.6, color="orange")
                except:pass

    xlim([x1,x2])

    return roi

First, let us use a small window size of only 2000 bases. If we focus on the area where there is a deleted regions, we see that the thresholds (implied by the running median) are affected by the deleted regions. This means that even the running median did not ignore the deleted regions. To be ignored, the window parameter should be twice as much as the deleted regions itself


In [8]:
# Here we emphasize the fact that you will miss simple CNV-like event
# if the window parameter is too short. For instance this 1100-long 
# event is missed because the window itself is less than 2 times 1100 bases
roi2000 = identify_rois_and_plot_variant(W=2001, x1=526000, x2=538000)
ylim([0,800])


 
Out[8]:
(0, 800)

In [9]:
# To detect the 1100-long deleted regions, a window size of 2200 should be enough. 
# even though, slightly larger (e.g. 2300) would give a smoother and therefore less
# biases results

# Moreover, as can be seen in this example, the event detected at position
# around 527,000 is still detected.
roi2200 = identify_rois_and_plot_variant(W=2200, x1=526000, x2=538000)
ylim([0,800])


 
Out[9]:
(0, 800)

We have run sequana_coverage to store the ROIs for different window sizes. There is no gold standard but we can compare the different results and show that the impact of the window parameter is small.


In [4]:
import pandas as pd

In [11]:
bins = range(0, 2500, 200)
rois1000 = pd.read_csv("rois_1000.csv")
rois2200 = pd.read_csv("rois_2200.csv")
rois4000 = pd.read_csv("rois_4000.csv")
rois32000 = pd.read_csv("rois_32000.csv")
rois64000 = pd.read_csv("rois_64000.csv")


_ = hist([rois1000['size'], rois2200['size'], rois4000['size'], rois32000['size'], rois64000['size']], 
         bins=bins, log=True)
legend(["1000", "2200", "4000", "32000", "64000"])
grid()
_ = xlabel("length of the ROIs detected ")
_ = ylabel("PDF")


/home/cokelaer/miniconda3/envs/py3/lib/python3.5/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  return getattr(obj, method)(*args, **kwds)

In the plot above, we see that short ROIs are detected whatever is the length of the window. However, with increasing ROIs length (e.g. 500), the parameter set to 1000 (blue bars) already misses some events. Conversely the largest windows (32000, 64000) detect all events (short or long)). The large windows do not seem to add false detections.


In [14]:
_ = identify_rois_and_plot_variant(x1=2000000, x2=2500000, W=160001)


 

Impact of using a larger window: You may not follow the trend closely anymore, which may be an isssue with coverage signal with lots of fluctuations.

For instance, here below, we have a few green circles just next to the threshold line (red line), which are detected but would not with the larger window.


In [21]:
chromosome.run(128000, circular=True)
chromosome.plot_coverage(x1=540000, x2=610000, sample=False)
for i,this in enumerate([1000, 128000]):
    marker = "oo" 
    color = "gb"
    ms=[12,6]
    df = pd.read_csv("rois_{}.csv".format(this))    
    for _, d in df.iterrows():
        plot([d['start'], d['end']], [d["mean_cov"], d["mean_cov"]], marker[i], color=color[i], 
            markersize=ms[i], alpha=0.6, lw=1, label=this)
ylim([0, 1200])
legend()


 
Out[21]:
<matplotlib.legend.Legend at 0x7f7c3279a0f0>

Same as above but here the running median plot uses a window set to 1000. Green and blue dots are the same as above.


In [23]:
chromosome.run(1000, circular=True)
chromosome.plot_coverage(x1=540000, x2=610000, sample=False)
for i,this in enumerate([1000,128000]):
    marker = "oo" 
    color = "gb"
    ms=[12,6]
    df = pd.read_csv("rois_{}.csv".format(this))    
    for _, d in df.iterrows():
        plot([d['start'], d['end']], [d["mean_cov"], d["mean_cov"]], marker[i], color=color[i], 
            markersize=ms[i], alpha=0.6, lw=1, label=1000)
ylim([0,1200])
legend()


 
Out[23]:
<matplotlib.legend.Legend at 0x7f7c36a00f60>

Using the previous code, you can simply compare two sets of ROIs available in the directory


In [24]:
import glob
glob.glob("rois*csv")


Out[24]:
['rois_64000.csv',
 'rois_72000.csv',
 'rois_1000.csv',
 'rois_20000.csv',
 'rois_232000.csv',
 'rois_104000.csv',
 'rois_48000.csv',
 'rois_128000.csv',
 'rois_24000.csv',
 'rois_256000.csv',
 'rois_32000.csv',
 'rois_2200.csv',
 'rois_4000.csv',
 'rois_16000.csv',
 'rois_8000.csv',
 'rois_176000.csv']

Criteria to select the window size of 20,000

There seem to be a false alarm increase for low window size (zscore remains small). For instance, if we plot the number of ROIs for window size from 1000 to 256,000 as show below, there seem to be a plateau after 20,000 where the number of events converges to about 650.

Choosing 20,000 (in this case) seems to be a good choice since it reducs the false alarm. Larger window, would be useful only to detect deleted or duplicated regions.

drawbacks of large windows:

  • time computation may be a bit slower
  • you may miss fluctuations.

In [6]:
L = []
vals = [1000,2200,4000,8000,16000,20000,24000, 32000,48000, 64000,72000,
        104000, 128000,176000, 232000, 256000, 300000,400000,512000]
for this in vals:
    df = pd.read_csv("rois_{}.csv".format(this))
    L.append(len(df))
plot(vals, L, "o-")
ylim([600,800])
M = mean( [l for l,v in zip(L, vals) if v>20000])
axhline(M, color="r", ls="--")
axvline(20000, color="r", ls="--")
xlabel("Window size")
ylabel("Number of detected ROIs")


Out[6]:
Text(0,0.5,'Number of detected ROIs')

In [ ]: