Comparison sequana_coverage vs CNVnator (human case)4

Author: Thomas Cokelaer, 2018

We will need to download this alignment data (CRAM format), which is quite large (18Gb).

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/alignment/

We then convert cram to bam for cnvnator (the BAM is 24 Gb).

Conversion CRAM to BAM (using samtools 1.6):

samtools view -T reference.fa -b -o out.bam  in.cram   


Direct conversion of the BAM to BED containing one chromosome only (you still need to know the names of the chromosomes):

bamtools index -in out.bam   # about 


with samtools 1.6 for these commands work but could be simplified:

samtools view -bh alldata.bam chr11 > out.bam
samtools view out.bam out.sam
# edit SAM file to remove other chromosomes.... surely there is a simpler solution.
samtools depth -aa out.sam > out.bed

We now have the 24 BED files for sequana_coverage.


In [3]:
import pandas as pd
%pylab inline
rcParams['figure.figsize'] = (10,8)


Populating the interactive namespace from numpy and matplotlib

sequana_coverage and CNVnator analysis

sequana_coverage

Because the data input files are huge and the analysis takes a while for this 3.5Gb genome, we designed a snakemake pipeline available in sequana (http://sequana.readthedocs.io/en/master/index.html) and in the directory ./coverage/. It can be run using snakemake. For instance on a slurm scheduler, we used:

srun -c 1 snakemake -s coverage.rules -p --cluster "sbatch --mem 8000 " -j 30

The config file is also available in ./coverage and looks like

input_directory: ''
input_readtag: _R[12]_
input_extension: ''
input_pattern: /pasteur/homes/tcokelae/TEMP/coverage/NA12878.bed
input_samples:
    file1: ''
    file2: ''
coverage:
    circular: true
    double_threshold: 0.5
    genbank_file: ''
    high_threshold: 6.0
    low_threshold: -4.0
    mixture_models: 2
    reference_file: ''
    window: 250001
    chunksize: 10000000
    binning: 100
    cnv_delta: 1000

One can run the analyse of each chromososme one by one using the standalone, but here we can take advantage of a cluster and parallelise each chromosome analysis. This took less than 1 hour to analyse the 24 chromosomes. We provide the final rois.csv in this directory together with the multiqc_report.html file that was generated by sequana_coverage. Note that we do not provide the individual HTML reports (too large) but only the ROIs files and the summary files (json format).

Note that we provided an input file called NA.bed.chrom_names.txt that contains the list of 24 chromosomes of interests and represents of 90% of the data. We ignored the other contigs for convenience. CNVnator analysis (here below) takes into account all chromosomes.

Performances

  • cnvnator used about 6.5Gb and finished in 5.5 hours using bin parameter of 100 and one core.
  • sequana_coverage uses 2Gb at max and took about 1 hour with each chromosome being analysed sequentially on one core. The binning parameter was set to 100 (or 1000).
  • Using a snakemake pipeline to analyse the chromosomes in parallel (one core per chromosome), the analysis with sequana_coverage took about 7 minutes. Actually, it is limited by the largest chromosome.

Here below are the actual time it took for each chromosome on a HPC cluster on one core.


In [33]:
# For each chromosome, (23 is chrX and 24 is chrY), we store the chromosome length and time is seconds it took to be analysed
# on one core.

res = {1: [248956422, 4*60+53], 2: [242193529,4*60+50], 3: [198295559,4*60+10], 
       4: [190214555, 3*60+58], 5: [181538259,3*60+47], 6: [170805979,3*60+44], 
       7: [159345973,3*60+25], 8: [145138636,3*60+8], 9: [138394717, 60*3+1], 
       10: [133797422, 2*60+57], 11: [135086622, 2*60+55], 12: [133275309, 2*60+57],
       13:[114364328, 60*2+31],  14: [107043718, 60*2+25], 15: [101991189, 2*60+16],  # chr14 is empty
       16:[90338345, 2*60+7],  17: [83257441,2*60+15], 18: [80373285, 60+59], 
       19: [58617616,60+28], 20: [64444167, 60+40], 21: [46709983, 60+20], 
       22: [50818468, 60+24], 24:[57227415, 60+37], 23:[156040895, 60*3+21],
      }
L = [res[i][0] for i in range(1,25) if i in res]
T = [res[i][1] for i in range(1,25) if i in res]
plot(L, T, "o")
a=1.077e-6, 
b=31.6
plot(L, a*np.array(L)+b, "r-")  
xlabel("chromosome size")
ylabel("Time of computation (seconds)")
# command used 
# sequana_coverage --input data.bed --binning 1000 --no-html -w 250000 -H 6 --cnv-clustering 1000


Out[33]:
Text(0,0.5,'Time of computation (seconds)')

In [36]:
# Estimated time for full genome in minutes
(1.077e-6*3.5e9+31.6)/60*60/80.


Out[36]:
47.51375

comparison on chromosome chr21

sequana ROIs


In [4]:
rois = pd.read_csv("coverage/coverage_reports/chr21/rois.csv")
rois = rois.drop("Unnamed: 0", axis=1)
print("Sequana identified {} ROIS amongst which {} with length >= 800".format(len(rois), len(rois.query("size > 800"))))


Sequana identified 1071 ROIS amongst which 226 with length >= 800

In [10]:
len(rois.query("size>100 and abs(mean_zscore)>5"))


Out[10]:
327

cnvnator ROIs


In [36]:
from sequana.cnv import CNVnator
cnvnator = CNVnator("events_bin100.txt").df.query("name=='chr21'")
print("cnvnator identified {} ROIs with size>800".format(len(cnvnator)))


cnvnator identified 127 ROIs with size>800

It is not easy to validate one or the other set of ROIs. Some events clearly overlap. Some others are seen by one tool and not the other, and vice versa.

One general trend is that sequana_coverage detects many small events. Some are just noise, some are real events.

The best is to give a set of representative examples. Interested reader can then explore farther what comparison could be done.


In [100]:
from sequana import GenomeCov
b = GenomeCov("NA12878_chr21.bed")
c = b.chr_list[0]
c.thresholds.high = 6
c.run(250000, 2, binning=100, cnv_delta=1000)    
c.rois.df = c.rois.df.query("size>15")


 

 


In [103]:
def plot_rois(x1, x2, chrom, cnvnator):
    chrom.plot_rois(x1,x2)
    for _, item in cnvnator.query("size<1000000").iterrows():
        if item['type'] == "deletion":
            fill_between([item.start, item.end], [600,600], color="green", alpha=0.5)
        else:
             fill_between([item.start, item.end], [600,600], color="red", alpha=0.5)
        ylim([0,600])

5Mb to 5.5Mb


In [104]:
plot_rois(5000000, 5500000, c, cnvnator)
ylim([-10, 100])


Out[104]:
(-10, 100)
  • red and green areas correpond to identified regions in CNVnator.
  • red and green dots (or horizontal lines) are those identified by Sequana_coverage
  • we see that sequana_coverage has many short detections. Those events can be clustered or filtered out depending on the needs.

  • We see also that the first red area or the second red areas are classified similarly by CNVnator and sequana_coverage.

  • The event on the right hand side at position 5,450,000 is detected by sequana as a CNV but missed somehow by CNVnator that may be misled by the two empty areas around it.

Let us now focus on 5,200,000 to 5,400,000

5.2Mb to 5.4Mb


In [105]:
plot_rois(5200000, 5400000, c, cnvnator)
ylim([-10, 100])


Out[105]:
(-10, 100)

as mentionned above, the two red regions detected are compatible with the outcome of sequana_coverage. However, concerning the green areas, this could be questionned. Let us look at the first one in 5.25 to 5.3Mb. Here there is an alternance of deleted regions and outliers (red dots). By eyes there is no clear pattern except that this area is surrounded by two red regions (excess of DOC). However, the green area itself has a DOC of around 5, which is consistent with the overall DOC.

5.6Mb to 6.3Mb


In [106]:
plot_rois(5600000,6220000,c, cnvnator)
ylim((-1,15))


Out[106]:
(-1, 15)

Here, there is an alternance of deleted regions. The large deleted are detected by CNVnator and sequana_coverage.

More interestingly, let us look at the green regions where the DOC is not zero:

- in sequana, again, there are lots of small detected events (red dots). Most of them could be ignored except 
  maybe the ones at position 5,900,000 that could be considered as a CNV with a CN of 1.25 (average DOC is about 4.9)
- in cnvnator, the entire region from 5900000 to 6200000 is green meaning there is no difference here in the deleted regions and the one in the middle, which could be questionned. 
- In the green region in the middle (position 5900000), cnvnator identifies a depleted region surrounded by two normal (white regions) and this seems a sensible choice. Sequana does not classify this region. 

6.5Mb to 7.4Mb


In [107]:
plot_rois(6500000,7400000, c, cnvnator)
_ = ylim([-1,50])


  • again, alternance of CNVs. At position 6.75Mb to 7Mb, the deleted region reported by CNVnator englobes 2 deleted regions and the region in the middle and classify this as one event
  • CNVnator identifies a CNV like event (duplication) in the right hand side. Indeed, the CN is about 2 here. With sequana_coverage, this is not clear due to the fact that the running median follows the data. One would need to increaese the window parameter and find clustering more suitable. Note however that the red dots here have large zscore.
  • an event detected in sequana is not seen in cnvnator (left hand side at position 6550000). This is a very short but string event.

12.5Mb to 15Mb


In [108]:
plot_rois(12500000, 15000000, c, cnvnator)
ylim([-1,40])


Out[108]:
(-1, 40)

The green area is missed by sequana. This could be fixed by changing the low threshold to -2 instead of -4. Additionally, one can play with the

c.rois.apply_threshold_after_merging to False

15Mb to 20Mb


In [110]:
plot_rois(15000000,20000000, c, cnvnator)
ylim(-1,20)


Out[110]:
(-1, 20)

No detection in sequana_coverage in this clean data set match with cnvnator. As for the 4 events detected by cnvnator, there are short but of interest. Note however, that sequana_coverage allows the detection of shorter events (filtered out for convenience in the plots).


In [111]:
plot_rois(30000000,35000000, c, cnvnator)
ylim([0,120])


Out[111]:
(0, 120)

Here, the event of the right hand side (125,000 bases long) is missed by sequana_coverage. We recommend to use W = 2 times the CNV length to be detected. This is due to the low depth of coverage. Future versions could use a more appropriate clustering and detection procedure for low coverage. Note here, that sequana_coverage detects 2 additional strong but short events .

Conclusion

  • CNVnator is a robust tool, fast to run and is designed to analyse whome genome.
  • Sequana_coverage for the analysis of human genome is less mature and there still place for improvements in terms performance. However, it is fast enough that it can be used to analyse whole genome as well
  • in this notebook we looked at the chr21 chromosome to compare the two tools. We believe that the two tools could be used to confirm events and to complement themselves.

In [ ]: