In [ ]:
# Sensitivity of sequana_coverage in detecting CNVs
Author: Thomas Cokelaer
Jan 2018
Local time execution: about 10 minutes
In this notebook, we will simulate FastQ reads and inject CNVs. We will then look at the sensitivity (proportion of true positives among all actual positives, i.e. TP / (TP + FN)) of sequana_coverage.
We use the data and strategy described in section 3.2 of "CNOGpro: detection and quantification of CNVs in prokaryotic whole-genome sequencing data", Bioinformatics 31(11), 2015 (Brynildsrud et al.).
Here, we will use the same reference as in the paper above: FN433596 (Staphylococcus aureus), which is also used in the manuscript.
The main goal is to generate simulated data, inject various CNVs, and check that the sensitivity is high while the false-positive rate remains low.
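As a reminder, the sensitivity used throughout this notebook can be computed as follows (a minimal sketch; the counts in the example are hypothetical):

```python
def sensitivity(tp, fn):
    """Sensitivity (recall): proportion of injected CNVs that are detected."""
    return tp / (tp + fn)

# e.g. if 79 of 80 injected CNVs were detected
sensitivity(79, 1)  # -> 0.9875
```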
There are many ways to download the reference (FN433596). Below we use the sequana_coverage tool, but you can of course use your own tool, or simply go to http://github.com/sequana/resources/coverage (look for FN433596.fasta.bz2).
In [1]:
!sequana_coverage --download-reference FN433596
Installation: conda install art
Simulation of data at 100X coverage:
-l: length of the reads
-f: fold coverage
-m: mean size of the fragments
-s: standard deviation of the fragment size
-ss: sequencing system (HS20 stands for HiSeq 2000)
This takes a few minutes to run
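To fix orders of magnitude, the number of read pairs required for a given fold coverage can be estimated as coverage × genome_length / (2 × read_length). A back-of-the-envelope sketch (the genome length below is the approximate size of FN433596):

```python
genome_length = 3_043_210  # FN433596, about 3 Mbp
read_length = 100          # -l
coverage = 100             # -f
# paired-end reads: each pair contributes 2 * read_length bases
n_pairs = coverage * genome_length // (2 * read_length)
print(n_pairs)  # about 1.5 million pairs
```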
In [16]:
! art_illumina -sam -i FN433596.fa -p -l 100 -ss HS20 -m 500 -s 40 -f 100 -o paired_dat
In [17]:
# no need for the *aln and *sam, let us remove them to save space
!rm -f paired*.aln paired_dat.sam
!sequana_mapping --reference FN433596.fa --file1 paired_dat1.fq --file2 paired_dat2.fq 1>out 2>err
This uses bwa and samtools behind the scenes. Then, we convert the resulting BAM file (FN433596.fa.sorted.bam) into a BED file once and for all. To do so, we can use bioconvert (http://bioconvert.readthedocs.io), which uses bedtools behind the scenes:
In [18]:
# bioconvert FN433596.fa.sorted.bam simulated.bed -f
# or use e.g. bedtools:
!bedtools genomecov -d -ibam FN433596.fa.sorted.bam > simulated.bed
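The resulting BED file (from `bedtools genomecov -d`) has three columns: chromosome name, 1-based position, and per-base depth of coverage. A minimal sketch of how such a file is read with pandas, here on a tiny inline stand-in for simulated.bed:

```python
import io
import pandas as pd

# three lines standing in for the ~3 million lines of simulated.bed
bed = io.StringIO("FN433596\t1\t98\nFN433596\t2\t101\nFN433596\t3\t99\n")
df = pd.read_csv(bed, sep="\t", header=None)
mean_depth = df[2].mean()  # should be close to 100 (the -f value)
```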
We execute sequana_coverage to find the ROIs (regions of interest). We expect a few detections (depending on the threshold and on the length of the genome, of course).
Later, we will inject events as long as 8,000 bases, so the window parameter should span at least 16,000 bases. As shown in the window_impact notebook, a window of 20,000 bases is a good choice to keep the false-detection rate low.
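The intuition behind the factor of two: the coverage trend is estimated with a running median, and a median window longer than twice the event leaves the trend unaffected, so the event stands out. A scaled-down sketch (1/10 scale, using scipy's `medfilt`, which merely illustrates the window constraint and is not sequana_coverage's actual implementation):

```python
import numpy as np
from scipy.signal import medfilt

# flat 100X coverage with one 800-base deletion (1/10 scale of the 8000-base case)
signal = np.full(6001, 100.0)
signal[2000:2800] = 0.0

# window > 2x the event: the running median ignores the deletion
trend_wide = medfilt(signal, kernel_size=2001)
# window < the event: the running median follows the deletion, masking it
trend_narrow = medfilt(signal, kernel_size=501)

print(trend_wide[2400], trend_narrow[2400])  # 100.0 0.0
```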
In [24]:
!sequana_coverage --input simulated.bed --reference FN433596.fa -w 20001 -o --level WARNING -C .5
!cp report/*/*/rois.csv rois_noise_20001.csv
In [45]:
# An instance of coverage signal (yours may be slightly different)
from IPython.display import Image
Image("coverage.png")
Out[45]:
In [21]:
%pylab inline
In [68]:
# Here is a convenient function to plot the ROIs in terms of sizes
# and max zscore
def plot_results(file_roi, choice="max"):
    import pandas as pd
    roi = pd.read_csv(file_roi)
    # ignore the edges (origin of replication)
    roi = roi.query("start>100 and end<3043210")
    plot(roi["size"], roi["{}_zscore".format(choice)], "or", label="candidate ROIs")
    for this in [3, 4, 5, -3, -4, -5]:
        if this == 3:
            label = "thresholds"
        else:
            label = "_nolegend_"
        axhline(this, ls="--", label=label)
    print("{} ROIs found".format(len(roi)))
    xlabel("length of the ROIs")
    ylabel("z-scores")
    legend()
    return roi
In [125]:
roi = plot_results("rois_noise_20001.csv", "max")
Most of the detected events have a z-score close to the chosen thresholds (-4 and 4). Moreover, most events are shorter than 50 bases.
So for the detection of CNVs longer than, say, 2,000 bases, there are no false positives (FP = 0).
More simulations would be required to get a more precise idea of the FP rate for short CNVs, but it would remain small. For instance, on this example, FP=1 for CNV sizes above 100 and FP=2 for CNV sizes above 40, which remains pretty small given the length of the genome (3 Mbp).
In [48]:
import random
import pandas as pd

def create_deletion():
    df = pd.read_csv("simulated.bed", sep="\t", header=None)
    positions = []
    sizes = []
    for i in range(80):
        # the +4000 and -4000 shifts guarantee that consecutive CNVs
        # do not overlap, since a CNV can be as long as 8000 bases
        pos = random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000)
        size = random.randint(1, 8) * 1000
        positions.append(pos)
        df.loc[pos:pos + size, 2] = 0  # deletion
        sizes.append(size)
    df.to_csv("cnv_deletion.bed", sep="\t", header=None, index=None)
    return positions, sizes

def create_duplicated():
    df = pd.read_csv("simulated.bed", sep="\t", header=None)
    positions = []
    sizes = []
    for i in range(80):
        pos = random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000)
        size = random.randint(1, 8) * 1000
        positions.append(pos)
        df.loc[pos:pos + size, 2] += 100  # duplication
        sizes.append(size)
    df.to_csv("cnv_duplicated.bed", sep="\t", header=None, index=None)
    return positions, sizes

def create_cnvs_mixed():
    # we place 80 CNVs of sizes from 1000 to 8000 (about 10% of the
    # genome), half deletions (CN 0.5) and half duplications (CN 1.5)
    df = pd.read_csv("simulated.bed", sep="\t", header=None)
    positions = []
    sizes = []
    for i in range(80):
        pos = random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000)
        size = random.randint(1, 8) * 1000
        positions.append(pos)
        status = random.randint(0, 1)
        if status == 0:
            df.loc[pos:pos + size, 2] -= 50
        elif status == 1:
            df.loc[pos:pos + size, 2] += 50
        sizes.append(size)
    df.to_csv("cnv_mixed.bed", sep="\t", header=None, index=None)
    return positions, sizes
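A quick sanity check that the +4000/-4000 shifts indeed prevent consecutive CNVs from overlapping (the seed is arbitrary; the bound holds by construction for any seed):

```python
import random

random.seed(0)
positions, sizes = [], []
for i in range(80):
    positions.append(random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000))
    sizes.append(random.randint(1, 8) * 1000)

# each CNV ends before (or exactly where) the next one can start
no_overlap = all(positions[i] + sizes[i] <= positions[i + 1] for i in range(79))
print(no_overlap)  # True
```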
In [134]:
def check_found(positions, sizes, roi, precision=200, min_size=150):
    """Check that the injected CNVs (given their positions and sizes)
    are detected among the ROIs.

    We check that the start or end of at least one ROI coincides
    (within *precision* bases) with an injected CNV, and that this ROI
    is longer than *min_size*. Indeed, injections are at least 1000
    bases long while noise events are generally shorter than 100 bases,
    as shown above.
    """
    found = [False] * len(positions)
    zscores = []
    for i, (position, size) in enumerate(zip(positions, sizes)):
        for _, this in roi.iterrows():
            if (abs(this.start - position) < precision or
                    abs(this.end - position - size) < precision) and this["size"] > min_size:
                found[i] = True
                zscores.append(this.mean_zscore)
                break  # stop at the first matching ROI
        if found[i] is False:
            print("position not found {} size={}".format(position, size))
    print("Found {}".format(sum(found)))
    return zscores
In [50]:
# call this only once !!!!
positions_deletion, sizes_deletion = create_deletion()
!sequana_coverage --input cnv_deletion.bed -o -w 20001 --level WARNING
!cp report/*/*/rois.csv rois_cnv_deleted.csv
In [126]:
rois_deletion = plot_results("rois_cnv_deleted.csv")
In [140]:
# detection is as precise as a couple of bases, but for safety we use a
# precision of 5 bases; we can check that the detection rate is 100%
zscores = check_found(positions_deletion, sizes_deletion, rois_deletion,
precision=5)
In [54]:
positions_duplicated, sizes_duplicated = create_duplicated()
!sequana_coverage --input cnv_duplicated.bed -o -w 40001 --level ERROR -C .3 --no-html --no-multiqc
!cp report/*/*/rois.csv rois_cnv_duplicated_40001.csv
In [78]:
rois_duplicated = plot_results("rois_cnv_duplicated_40001.csv", choice="max")
The results are the same with W=20,000, 40,000, 60,000 and 100,000, but the recovered copy number is more accurate with larger W.
In [76]:
rois_duplicated = plot_results("rois_cnv_duplicated_20000.csv", choice="max")
Note that you may see events with a negative z-score. Those are false detections due to the presence of two CNVs close to each other. This can be avoided by increasing the window size, e.g. to 40,000.
In [62]:
check_found(positions_duplicated, sizes_duplicated, rois_duplicated,
precision=5)
In [67]:
positions_mix, sizes_mix = create_cnvs_mixed()
!sequana_coverage --input cnv_mixed.bed -o -w 40001 --level ERROR --no-multiqc --no-html --cnv-clustering 1000
!cp report/*/*/rois.csv rois_cnv_mixed.csv
In [19]:
Image("coverage_with_cnvs.png")
Out[19]:
In [74]:
rois_mixed = plot_results("rois_cnv_mixed.csv", choice="max")
In [69]:
# note that here we relax the precision to 20 bases: the detected positions
# are not as precise as in the pure duplication or deletion cases
check_found(positions_mix, sizes_mix, rois_mixed, precision=20)
Some events (about 1%) may be labelled as not found, but visual inspection shows that they are actually detected. This is due to the starting position being offset by noise that interferes with the injected CNVs.
So, for these simulated data and types of CNV injection (CN 0, 2, 0.5, 1.5), we get a sensitivity close to 100%.
When we applied sequana_coverage to the simulated mapped reads to estimate the false-positive rate, we got about 20 ROIs whose (max) z-score is below 5 and whose length is up to 50 bases.
One question we had is:
Should ROIs shorter than 50 bp and with z-scores below 5 be ignored in CNV or other analyses, or do these bases identify genuine features in the genome (such as unmappable sequence)?
In brief, we think that those events are part of the background noise, so they should not be interpreted as genuine features. Here is why.
In [28]:
roi = plot_results("rois_noise_20001.csv")
In [ ]:
What is happening here is that we detect many events close to the threshold.
For instance, all the short events on the left-hand side have a z-score close to 4,
which is our threshold.
By pure chance, we also get longer events of 40 or 50 bp. This is quite surprising, and we wanted to know whether those
are real false positives or due to a genuine feature in the genome (e.g. repeated regions that prevent a good mapping).
What is not shown in this plot is the position of the events. We can simulate the same data again (with a different seed).
If those long events appear at the same place, they can be considered genuine; otherwise, they should be considered
as background appearing just by chance.
So, we generated 50 simulated data sets and reproduced the image above. We store the aggregated ROIs in 50_simulated_rois.csv.
In [32]:
from easydev import execute as shell

def create_data(start=0, end=10):
    for i in range(start, end):
        print("---------------- {}".format(i))
        cmd = "art_illumina -sam -i FN433596.fa -p -l 100 -ss HS20 -m 500 -s 40 -f 100 -o paired_dat"
        shell(cmd)
        cmd = "rm -f paired*.aln paired_dat.sam"
        shell(cmd)
        cmd = "sequana_mapping --reference FN433596.fa --file1 paired_dat1.fq --file2 paired_dat2.fq 1>out 2>err"
        shell(cmd)
        cmd = "bedtools genomecov -d -ibam FN433596.fa.sorted.bam > simulated.bed"
        shell(cmd)
        cmd = "sequana_coverage --input simulated.bed --reference FN433596.fa -w 20001 -o --no-html --no-multiqc"
        shell(cmd)
        cmd = "cp report/*/*/rois.csv rois_{}.csv".format(i)
        shell(cmd)

#create_data(0, 50)
import pandas as pd
rois = pd.read_csv("50_simulated_rois.csv")
rois = rois.query("start>100 and end<3043210")
In [69]:
roi = plot_results("50_simulated_rois.csv", choice="max")
With 50 simulations, we get 826 events (about 100 are removed because they lie on the edges, near the origin of replication), which means about 16 events per simulation. The maximum event length is 90 bases.
None of the long events (above 50 bases) appear at the same position (they are separated by at least 500 bases), so the long events are genuine false positives.
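The recurrence criterion can be sketched as follows (the function name and the positions below are hypothetical; the real check was done on the 50 rois_*.csv files):

```python
def recurrent_events(positions_a, positions_b, tol=500):
    """Events from one simulation that reappear (within tol bases) in another.

    Recurrent events would point at a genuine genomic feature; events that
    never recur are background noise appearing by chance.
    """
    return [p for p in positions_a
            if any(abs(p - q) <= tol for q in positions_b)]

# hypothetical event positions from two independent simulations
sim1 = [12040, 250300, 1800450]
sim2 = [98020, 250700, 2600000]
recurrent_events(sim1, sim2)  # -> [250300]
```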
In [120]:
roi = plot_results("100_simulated_rois.csv", choice="mean")
In [122]:
roi = plot_results("100_simulated_rois.csv", choice="max")
In [ ]: