We will need to download this alignment data (CRAM format), which is quite large (18Gb).
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/alignment/
We then convert cram to bam for cnvnator (the BAM is 24 Gb).
Conversion CRAM to BAM (using samtools 1.6):
samtools view -T reference.fa -b -o out.bam in.cram
Direct conversion of the BAM to BED containing one chromosome only (you still need to know the names of the chromosomes):
bamtools index -in out.bam # about
with samtools 1.6 for these commands work but could be simplified:
samtools view -bh alldata.bam chr11 > out.bam
samtools view out.bam out.sam
# edit SAM file to remove other chromosomes.... surely there is a simpler solution.
samtools depth -aa out.sam > out.bed
We now have the 24 BED files for sequana_coverage.
In [3]:
import pandas as pd
%pylab inline
rcParams['figure.figsize'] = (10,8)
Because the data input files are huge and the analysis takes a while for this 3.5Gb genome, we designed a snakemake pipeline available in sequana (http://sequana.readthedocs.io/en/master/index.html) and in the directory ./coverage/. It can be run using snakemake. For instance on a slurm scheduler, we used:
srun -c 1 snakemake -s coverage.rules -p --cluster "sbatch --mem 8000 " -j 30
The config file is also available in ./coverage and looks like
input_directory: ''
input_readtag: _R[12]_
input_extension: ''
input_pattern: /pasteur/homes/tcokelae/TEMP/coverage/NA12878.bed
input_samples:
file1: ''
file2: ''
coverage:
circular: true
double_threshold: 0.5
genbank_file: ''
high_threshold: 6.0
low_threshold: -4.0
mixture_models: 2
reference_file: ''
window: 250001
chunksize: 10000000
binning: 100
cnv_delta: 1000
One can run the analyse of each chromososme one by one using the standalone, but here we can take advantage of a cluster and parallelise each chromosome analysis. This took less than 1 hour to analyse the 24 chromosomes. We provide the final rois.csv in this directory together with the multiqc_report.html file that was generated by sequana_coverage. Note that we do not provide the individual HTML reports (too large) but only the ROIs files and the summary files (json format).
Note that we provided an input file called NA.bed.chrom_names.txt that contains the list of 24 chromosomes of interests and represents of 90% of the data. We ignored the other contigs for convenience. CNVnator analysis (here below) takes into account all chromosomes.
Here below are the actual time it took for each chromosome on a HPC cluster on one core.
In [33]:
# For each chromosome, (23 is chrX and 24 is chrY), we store the chromosome length and time is seconds it took to be analysed
# on one core.
res = {1: [248956422, 4*60+53], 2: [242193529,4*60+50], 3: [198295559,4*60+10],
4: [190214555, 3*60+58], 5: [181538259,3*60+47], 6: [170805979,3*60+44],
7: [159345973,3*60+25], 8: [145138636,3*60+8], 9: [138394717, 60*3+1],
10: [133797422, 2*60+57], 11: [135086622, 2*60+55], 12: [133275309, 2*60+57],
13:[114364328, 60*2+31], 14: [107043718, 60*2+25], 15: [101991189, 2*60+16], # chr14 is empty
16:[90338345, 2*60+7], 17: [83257441,2*60+15], 18: [80373285, 60+59],
19: [58617616,60+28], 20: [64444167, 60+40], 21: [46709983, 60+20],
22: [50818468, 60+24], 24:[57227415, 60+37], 23:[156040895, 60*3+21],
}
L = [res[i][0] for i in range(1,25) if i in res]
T = [res[i][1] for i in range(1,25) if i in res]
plot(L, T, "o")
a=1.077e-6,
b=31.6
plot(L, a*np.array(L)+b, "r-")
xlabel("chromosome size")
ylabel("Time of computation (seconds)")
# command used
# sequana_coverage --input data.bed --binning 1000 --no-html -w 250000 -H 6 --cnv-clustering 1000
Out[33]:
In [36]:
# Estimated time for full genome in minutes
(1.077e-6*3.5e9+31.6)/60*60/80.
Out[36]:
In [4]:
rois = pd.read_csv("coverage/coverage_reports/chr21/rois.csv")
rois = rois.drop("Unnamed: 0", axis=1)
print("Sequana identified {} ROIS amongst which {} with length >= 800".format(len(rois), len(rois.query("size > 800"))))
In [10]:
len(rois.query("size>100 and abs(mean_zscore)>5"))
Out[10]:
In [36]:
from sequana.cnv import CNVnator
cnvnator = CNVnator("events_bin100.txt").df.query("name=='chr21'")
print("cnvnator identified {} ROIs with size>800".format(len(cnvnator)))
It is not easy to validate one or the other set of ROIs. Some events clearly overlap. Some others are seen by one tool and not the other, and vice versa.
One general trend is that sequana_coverage detects many small events. Some are just noise, some are real events.
The best is to give a set of representative examples. Interested reader can then explore farther what comparison could be done.
In [100]:
from sequana import GenomeCov
b = GenomeCov("NA12878_chr21.bed")
c = b.chr_list[0]
c.thresholds.high = 6
c.run(250000, 2, binning=100, cnv_delta=1000)
c.rois.df = c.rois.df.query("size>15")
In [103]:
def plot_rois(x1, x2, chrom, cnvnator):
chrom.plot_rois(x1,x2)
for _, item in cnvnator.query("size<1000000").iterrows():
if item['type'] == "deletion":
fill_between([item.start, item.end], [600,600], color="green", alpha=0.5)
else:
fill_between([item.start, item.end], [600,600], color="red", alpha=0.5)
ylim([0,600])
In [104]:
plot_rois(5000000, 5500000, c, cnvnator)
ylim([-10, 100])
Out[104]:
we see that sequana_coverage has many short detections. Those events can be clustered or filtered out depending on the needs.
We see also that the first red area or the second red areas are classified similarly by CNVnator and sequana_coverage.
The event on the right hand side at position 5,450,000 is detected by sequana as a CNV but missed somehow by CNVnator that may be misled by the two empty areas around it.
Let us now focus on 5,200,000 to 5,400,000
In [105]:
plot_rois(5200000, 5400000, c, cnvnator)
ylim([-10, 100])
Out[105]:
as mentionned above, the two red regions detected are compatible with the outcome of sequana_coverage. However, concerning the green areas, this could be questionned. Let us look at the first one in 5.25 to 5.3Mb. Here there is an alternance of deleted regions and outliers (red dots). By eyes there is no clear pattern except that this area is surrounded by two red regions (excess of DOC). However, the green area itself has a DOC of around 5, which is consistent with the overall DOC.
In [106]:
plot_rois(5600000,6220000,c, cnvnator)
ylim((-1,15))
Out[106]:
Here, there is an alternance of deleted regions. The large deleted are detected by CNVnator and sequana_coverage.
More interestingly, let us look at the green regions where the DOC is not zero:
- in sequana, again, there are lots of small detected events (red dots). Most of them could be ignored except
maybe the ones at position 5,900,000 that could be considered as a CNV with a CN of 1.25 (average DOC is about 4.9)
- in cnvnator, the entire region from 5900000 to 6200000 is green meaning there is no difference here in the deleted regions and the one in the middle, which could be questionned.
- In the green region in the middle (position 5900000), cnvnator identifies a depleted region surrounded by two normal (white regions) and this seems a sensible choice. Sequana does not classify this region.
In [107]:
plot_rois(6500000,7400000, c, cnvnator)
_ = ylim([-1,50])
In [108]:
plot_rois(12500000, 15000000, c, cnvnator)
ylim([-1,40])
Out[108]:
The green area is missed by sequana. This could be fixed by changing the low threshold to -2 instead of -4. Additionally, one can play with the
c.rois.apply_threshold_after_merging to False
In [110]:
plot_rois(15000000,20000000, c, cnvnator)
ylim(-1,20)
Out[110]:
No detection in sequana_coverage in this clean data set match with cnvnator. As for the 4 events detected by cnvnator, there are short but of interest. Note however, that sequana_coverage allows the detection of shorter events (filtered out for convenience in the plots).
In [111]:
plot_rois(30000000,35000000, c, cnvnator)
ylim([0,120])
Out[111]:
Here, the event of the right hand side (125,000 bases long) is missed by sequana_coverage. We recommend to use W = 2 times the CNV length to be detected. This is due to the low depth of coverage. Future versions could use a more appropriate clustering and detection procedure for low coverage. Note here, that sequana_coverage detects 2 additional strong but short events .
In [ ]: