Each read in the file occupies 4 lines:
1 - The header of the DNA sequence, with the read ID (the read length is optional)
2 - The DNA sequence
3 - The header of the sequence quality (either a repetition of line 1 or empty)
4 - The sequence quality (not human readable; it is encoded as PHRED scores. See https://en.wikipedia.org/wiki/Phred_quality_score for more details)
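To make the 4-line layout concrete, here is a minimal sketch that splits one record into its parts and decodes the quality string into numeric PHRED scores. The function name `parse_fastq_record` and the example read are made up for illustration, and the decoding assumes the Phred+33 (Sanger / Illumina 1.8+) encoding, which may not match every FASTQ file.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into its parts.

    Hypothetical helper for illustration; not part of TADbit.
    """
    header, sequence, plus, quality = [line.rstrip("\n") for line in lines]
    # Line 1 starts with '@' and line 3 starts with '+'.
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    # There must be exactly one quality character per base.
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(char) - 33 for char in quality]
    return {"id": header[1:].split()[0], "sequence": sequence, "scores": scores}


# Made-up record, only to show the decoding.
example = ["@read_1 length=8\n", "GATCAAGC\n", "+\n", "IIII##5?\n"]
record = parse_fastq_record(example)
print(record["id"], record["scores"])
```

A score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a score of 40 means roughly one error in 10,000 calls.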
In [1]:
for renz in ['HindIII', 'MboI']:
    print(renz)
    ! head -n 4 /media/storage/FASTQs/K562_"$renz"_1.fastq
    print('')
Count the number of lines in the file (4 times the number of reads):
In [2]:
! wc -l /media/storage/FASTQs/K562_HindIII_1.fastq
There are 40 M lines in the file, which means 10 M reads in total.
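The same arithmetic can be done from Python. This is a small sketch: `count_fastq_reads` is a hypothetical helper, not a TADbit function, and it assumes a plain, uncompressed FASTQ with exactly 4 lines per read.

```python
def count_fastq_reads(lines):
    """Return the number of reads in an iterable of FASTQ lines.

    Hypothetical helper; assumes strictly 4 lines per read.
    """
    n_lines = sum(1 for _ in lines)
    if n_lines % 4:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


# The arithmetic from the text: 40 M lines correspond to 10 M reads.
n_reads_from_text = 40000000 // 4

# Tiny made-up example with two reads.
demo_lines = ["@r1\n", "GATC\n", "+\n", "IIII\n",
              "@r2\n", "AAGC\n", "+\n", "IIII\n"]
n_demo_reads = count_fastq_reads(demo_lines)
print(n_demo_reads)
```

An open file handle can be passed in place of the list, for example `with open(path) as handle: count_fastq_reads(handle)`, which reads the file line by line without loading it into memory.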
In [3]:
from pytadbit.utils.fastq_utils import quality_plot
In [4]:
for r_enz in ['HindIII', 'MboI']:
    quality_plot('/media/storage/FASTQs/K562_{0}_1.fastq'.format(r_enz), r_enz=r_enz,
                 nreads=1000000, paired=False)
These plots provide a quick overview of the sequencing quality, as well as a rough estimate of the efficiency of the digestion and ligation steps of the Hi-C experiment.
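One rough way to think about the ligation estimate is to count how many reads contain a ligation junction. The sketch below is not how `quality_plot` is implemented; `ligation_fraction` and the example reads are hypothetical. It also assumes the junction sequences expected after end fill-in and blunt-end ligation (AAGCTAGCTT for HindIII, GATCGATC for MboI), which depends on the protocol used.

```python
# Assumption: junctions produced by fill-in followed by blunt-end ligation.
LIGATION_JUNCTIONS = {"HindIII": "AAGCTAGCTT", "MboI": "GATCGATC"}


def ligation_fraction(sequences, r_enz):
    """Return the fraction of reads that contain the ligation junction.

    Hypothetical helper for illustration; not part of TADbit.
    """
    junction = LIGATION_JUNCTIONS[r_enz]
    total = 0
    ligated = 0
    for seq in sequences:
        total += 1
        if junction in seq.upper():
            ligated += 1
    return float(ligated) / total if total else 0.0


# Made-up reads: two contain the HindIII junction, two do not.
reads = ["TTAAGCTAGCTTCC", "ACGTACGTACGT", "GGAAGCTTGG", "aagctagcttAA"]
fraction = ligation_fraction(reads, "HindIII")
print(fraction)
```

The third read contains an intact HindIII site (AAGCTT) but no junction, so it is not counted as ligated.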