In [ ]:
# Sensitivity of sequana_coverage in detecting CNVs
Author: Thomas Cokelaer
Jan 2018
Local time execution: about 10 minutes
In this notebook, we will simulate FastQ reads and inject CNVs. We will then look at the sensitivity (proportion of true positives among all actual positives, i.e. TP / (TP + FN)) of sequana_coverage.
We use the data and strategy described in section 3.2 of "CNOGpro: detection and quantification of CNVs in prokaryotic whole-genome sequencing data", Bioinformatics 31(11), 2015 (Brynildsrud et al.).
Here, we will use the same reference as in the paper above: FN433596 (Staphylococcus aureus), which is also used in the manuscript.
The main goal is to generate simulated data, inject various CNVs, and check that the sensitivity is high while the false-positive rate remains low.
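As a reminder, the sensitivity used throughout this notebook can be computed as follows (a minimal sketch; the counts in the example are hypothetical):

```python
def sensitivity(tp, fn):
    """Sensitivity (recall): proportion of injected CNVs that are detected."""
    return tp / (tp + fn)

# e.g. if 79 of 80 injected CNVs were detected
sensitivity(79, 1)  # -> 0.9875
```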
There are many ways to download the reference (FN433596). Below we use the sequana_coverage tool, but you can of course use your own tool, or simply go to http://github.com/sequana/resources/coverage (look for FN433596.fasta.bz2).
In [1]:
!sequana_coverage --download-reference FN433596
Installation: conda install art
Simulation of data at 100X coverage:
-l: length of the reads
-f: fold coverage
-m: mean size of the fragments
-s: standard deviation of the fragment size
-ss: sequencing system (HS20 stands for HiSeq 2000)
This takes a few minutes to run
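To fix orders of magnitude, the number of read pairs required for a given fold coverage can be estimated as coverage × genome_length / (2 × read_length). A back-of-the-envelope sketch (the genome length below is the approximate size of FN433596):

```python
genome_length = 3_043_210  # FN433596, about 3 Mbp
read_length = 100          # -l
coverage = 100             # -f
# paired-end reads: each pair contributes 2 * read_length bases
n_pairs = coverage * genome_length // (2 * read_length)
print(n_pairs)  # about 1.5 million pairs
```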
In [16]:
! art_illumina -sam -i FN433596.fa -p -l 100 -ss HS20 -m 500 -s 40 -f 100 -o paired_dat
In [17]:
# no need for the *aln and *sam, let us remove them to save space
!rm -f paired*.aln paired_dat.sam
!sequana_mapping --reference FN433596.fa --file1 paired_dat1.fq --file2 paired_dat2.fq 1>out 2>err
This uses bwa and samtools behind the scenes. Then, we convert the resulting BAM file (FN433596.fa.sorted.bam) into a BED file once and for all. To do so, we can use bioconvert (http://bioconvert.readthedocs.io), which uses bedtools behind the scenes:
In [18]:
# bioconvert FN433596.fa.sorted.bam simulated.bed -f
# or use e.g. bedtools:
!bedtools genomecov -d -ibam FN433596.fa.sorted.bam > simulated.bed
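The resulting BED file (from `bedtools genomecov -d`) has three columns: chromosome name, 1-based position, and per-base depth of coverage. A minimal sketch of how such a file is read with pandas, here on a tiny inline stand-in for simulated.bed:

```python
import io
import pandas as pd

# three lines standing in for the ~3 million lines of simulated.bed
bed = io.StringIO("FN433596\t1\t98\nFN433596\t2\t101\nFN433596\t3\t99\n")
df = pd.read_csv(bed, sep="\t", header=None)
mean_depth = df[2].mean()  # should be close to 100 (the -f value)
```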
We execute sequana_coverage to find the ROIs (regions of interest). We expect a few detections (depending on the threshold and on the length of the genome, of course).
Later, we will inject events as long as 8,000 bases, so the window parameter should span at least 16,000 bases. As shown in the window_impact notebook, a window of 20,000 bases is a good choice to keep the false-detection rate low.
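The intuition behind the factor of two: the coverage trend is estimated with a running median, and a median window longer than twice the event leaves the trend unaffected, so the event stands out. A scaled-down sketch (1/10 scale, using scipy's `medfilt`, which merely illustrates the window constraint and is not sequana_coverage's actual implementation):

```python
import numpy as np
from scipy.signal import medfilt

# flat 100X coverage with one 800-base deletion (1/10 scale of the 8000-base case)
signal = np.full(6001, 100.0)
signal[2000:2800] = 0.0

# window > 2x the event: the running median ignores the deletion
trend_wide = medfilt(signal, kernel_size=2001)
# window < the event: the running median follows the deletion, masking it
trend_narrow = medfilt(signal, kernel_size=501)

print(trend_wide[2400], trend_narrow[2400])  # 100.0 0.0
```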
In [24]:
!sequana_coverage --input simulated.bed --reference FN433596.fa -w 20001 -o --level WARNING -C .5
!cp report/*/*/rois.csv rois_noise_20001.csv
In [45]:
# An instance of coverage signal (yours may be slightly different)
from IPython.display import Image
Image("coverage.png")
Out[45]:
In [21]:
%pylab inline
In [68]:
# Here is a convenient function to plot the ROIs in terms of sizes
# and max zscore
def plot_results(file_roi, choice="max"):
    import pandas as pd
    roi = pd.read_csv(file_roi)
    # ignore the edges (origin of replication)
    roi = roi.query("start>100 and end<3043210")
    plot(roi["size"], roi["{}_zscore".format(choice)], "or", label="candidate ROIs")
    for this in [3, 4, 5, -3, -4, -5]:
        if this == 3:
            label = "thresholds"
        else:
            label = "_nolegend_"
        axhline(this, ls="--", label=label)
    print("{} ROIs found".format(len(roi)))
    xlabel("length of the ROIs")
    ylabel("z-scores")
    legend()
    return roi
In [125]:
roi = plot_results("rois_noise_20001.csv", "max")
Most of the detected events have a z-score close to the chosen thresholds (-4 and 4). Moreover, most events are shorter than 50 bases.
So for the detection of CNVs longer than, say, 2,000 bases, there are no false positives (FP = 0).
More simulations would be required to get a more precise idea of the FP rate for short CNVs, but it would remain small. For instance, on this example, FP=1 for CNV sizes above 100 and FP=2 for CNV sizes above 40, which remains pretty small given the length of the genome (3 Mbp).
In [48]:
import random
import pandas as pd

def create_deletion():
    df = pd.read_csv("simulated.bed", sep="\t", header=None)
    positions = []
    sizes = []
    for i in range(80):
        # the +4000 and -4000 shifts guarantee that consecutive CNVs
        # do not overlap, since a CNV can be as long as 8000 bases
        pos = random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000)
        size = random.randint(1, 8) * 1000
        positions.append(pos)
        df.loc[pos:pos + size, 2] = 0  # deletion
        sizes.append(size)
    df.to_csv("cnv_deletion.bed", sep="\t", header=None, index=None)
    return positions, sizes

def create_duplicated():
    df = pd.read_csv("simulated.bed", sep="\t", header=None)
    positions = []
    sizes = []
    for i in range(80):
        pos = random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000)
        size = random.randint(1, 8) * 1000
        positions.append(pos)
        df.loc[pos:pos + size, 2] += 100  # duplication
        sizes.append(size)
    df.to_csv("cnv_duplicated.bed", sep="\t", header=None, index=None)
    return positions, sizes

def create_cnvs_mixed():
    # we place 80 CNVs of sizes from 1000 to 8000 (about 10% of the
    # genome), half deletions (CN 0.5) and half duplications (CN 1.5)
    df = pd.read_csv("simulated.bed", sep="\t", header=None)
    positions = []
    sizes = []
    for i in range(80):
        pos = random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000)
        size = random.randint(1, 8) * 1000
        positions.append(pos)
        status = random.randint(0, 1)
        if status == 0:
            df.loc[pos:pos + size, 2] -= 50
        elif status == 1:
            df.loc[pos:pos + size, 2] += 50
        sizes.append(size)
    df.to_csv("cnv_mixed.bed", sep="\t", header=None, index=None)
    return positions, sizes
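A quick sanity check that the +4000/-4000 shifts indeed prevent consecutive CNVs from overlapping (the seed is arbitrary; the bound holds by construction for any seed):

```python
import random

random.seed(0)
positions, sizes = [], []
for i in range(80):
    positions.append(random.randint(37000 * i + 4000, 37000 * (i + 1) - 4000))
    sizes.append(random.randint(1, 8) * 1000)

# each CNV ends before (or exactly where) the next one can start
no_overlap = all(positions[i] + sizes[i] <= positions[i + 1] for i in range(79))
print(no_overlap)  # True
```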
In [134]:
def check_found(positions, sizes, roi, precision=200, min_size=150):
    """Check that the injected CNVs (given their positions and sizes)
    are detected among the ROIs.

    We check that the start or end of at least one ROI coincides
    (within *precision* bases) with an injected CNV, and that this ROI
    is longer than *min_size*. Indeed, injections are at least 1000
    bases long while noise events are generally shorter than 100 bases,
    as shown above.
    """
    found = [False] * len(positions)
    zscores = []
    for i, (position, size) in enumerate(zip(positions, sizes)):
        for _, this in roi.iterrows():
            if (abs(this.start - position) < precision or
                    abs(this.end - position - size) < precision) and this["size"] > min_size:
                found[i] = True
                zscores.append(this.mean_zscore)
                break  # stop at the first matching ROI
        if found[i] is False:
            print("position not found {} size={}".format(position, size))
    print("Found {}".format(sum(found)))
    return zscores
In [50]:
# call this only once !!!!
positions_deletion, sizes_deletion = create_deletion()
!sequana_coverage --input cnv_deletion.bed -o -w 20001 --level WARNING
!cp report/*/*/rois.csv rois_cnv_deleted.csv
In [126]:
rois_deletion = plot_results("rois_cnv_deleted.csv")
In [140]:
# detection is as precise as a couple of bases, but for safety we use a
# precision of 5 bases; we can check that the detection rate is 100%
zscores = check_found(positions_deletion, sizes_deletion, rois_deletion,
precision=5)
In [54]:
positions_duplicated, sizes_duplicated = create_duplicated()
!sequana_coverage --input cnv_duplicated.bed -o -w 40001 --level ERROR -C .3 --no-html --no-multiqc
!cp report/*/*/rois.csv rois_cnv_duplicated_40001.csv
In [78]:
rois_duplicated = plot_results("rois_cnv_duplicated_40001.csv", choice="max")
The results are the same with W=20,000, 40,000, 60,000 and 100,000, but the recovered copy number is more accurate with larger W.
In [76]:
rois_duplicated = plot_results("rois_cnv_duplicated_20000.csv", choice="max")
Note that you may see events with a negative z-score. Those are false detections due to the presence of two CNVs close to each other. This can be avoided by increasing the window size, e.g. to 40,000.
In [62]:
check_found(positions_duplicated, sizes_duplicated, rois_duplicated,
precision=5)
In [67]:
positions_mix, sizes_mix = create_cnvs_mixed()
!sequana_coverage --input cnv_mixed.bed -o -w 40001 --level ERROR --no-multiqc --no-html --cnv-clustering 1000
!cp report/*/*/rois.csv rois_cnv_mixed.csv
In [19]:
Image("coverage_with_cnvs.png")
Out[19]:
In [74]:
rois_mixed = plot_results("rois_cnv_mixed.csv", choice="max")
In [69]:
# note that here we relax the precision to 20 bases: the detected positions
# are not as precise as in the pure duplication or deletion cases
check_found(positions_mix, sizes_mix, rois_mixed, precision=20)
Some events (about 1%) may be labelled as not found, but visual inspection shows that they are actually detected. This is due to the starting position being offset by noise that interferes with the injected CNVs.
So, for these simulated data and types of CNV injection (CN 0, 2, 0.5, 1.5), we get a sensitivity close to 100%.
When we applied sequana_coverage to the simulated mapped reads to estimate the false-positive rate, we got about 20 ROIs whose (max) z-score is below 5 and whose length is up to 50 bases.
One question we had is:
Should ROIs shorter than 50 bp and with z-scores below 5 be ignored in CNV or other analyses, or do these bases identify genuine features in the genome (such as unmappable sequence)?
In brief, we think that those events are part of the background noise, so they should not be interpreted as genuine features. Here is why.
In [28]:
roi = plot_results("rois_noise_20001.csv")
In [ ]:
What is happening here is that we detect many events close to the threshold.
For instance, all the short events on the left-hand side have a z-score close to 4,
which is our threshold.
By pure chance, we also get longer events of 40 or 50 bp. This is quite surprising, and we wanted to know whether those
are real false positives or due to a genuine feature in the genome (e.g. repeated regions that prevent a good mapping).
What is not shown in this plot is the position of the events. We can simulate the same data again (with a different seed).
If those long events appear at the same place, they can be considered genuine; otherwise, they should be considered
as background appearing just by chance.
So, we generated 50 simulated data sets and reproduced the image above. We store the aggregated ROIs in 50_simulated_rois.csv.
In [32]:
from easydev import execute as shell

def create_data(start=0, end=10):
    for i in range(start, end):
        print("---------------- {}".format(i))
        cmd = "art_illumina -sam -i FN433596.fa -p -l 100 -ss HS20 -m 500 -s 40 -f 100 -o paired_dat"
        shell(cmd)
        cmd = "rm -f paired*.aln paired_dat.sam"
        shell(cmd)
        cmd = "sequana_mapping --reference FN433596.fa --file1 paired_dat1.fq --file2 paired_dat2.fq 1>out 2>err"
        shell(cmd)
        cmd = "bedtools genomecov -d -ibam FN433596.fa.sorted.bam > simulated.bed"
        shell(cmd)
        cmd = "sequana_coverage --input simulated.bed --reference FN433596.fa -w 20001 -o --no-html --no-multiqc"
        shell(cmd)
        cmd = "cp report/*/*/rois.csv rois_{}.csv".format(i)
        shell(cmd)

#create_data(0, 50)
import pandas as pd
rois = pd.read_csv("50_simulated_rois.csv")
rois = rois.query("start>100 and end<3043210")
In [69]:
roi = plot_results("50_simulated_rois.csv", choice="max")
With 50 simulations, we get 826 events (about 100 are removed because they lie on the edges, near the origin of replication), which means about 16 events per simulation. The maximum event length is 90 bases.
None of the long events (above 50 bases) appear at the same position (they are separated by at least 500 bases), so the long events are genuine false positives.
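The recurrence criterion can be sketched as follows (the function name and the positions below are hypothetical; the real check was done on the 50 rois_*.csv files):

```python
def recurrent_events(positions_a, positions_b, tol=500):
    """Events from one simulation that reappear (within tol bases) in another.

    Recurrent events would point at a genuine genomic feature; events that
    never recur are background noise appearing by chance.
    """
    return [p for p in positions_a
            if any(abs(p - q) <= tol for q in positions_b)]

# hypothetical event positions from two independent simulations
sim1 = [12040, 250300, 1800450]
sim2 = [98020, 250700, 2600000]
recurrent_events(sim1, sim2)  # -> [250300]
```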
In [120]:
roi = plot_results("100_simulated_rois.csv", choice="mean")
In [122]:
roi = plot_results("100_simulated_rois.csv", choice="max")
In [ ]: