FASTQ format

The file is organized in 4 lines per read:

1 - The header of the DNA sequence, which starts with @ and holds the read id (the read length is optional)
2 - The DNA sequence
3 - The header of the sequence quality, which starts with + (the rest of the line is either a repetition of line 1 or empty)
4 - The sequence quality (it is not human readable: each character encodes a PHRED score. Check https://en.wikipedia.org/wiki/Phred_quality_score for more details)


In [1]:
# Show the first read (4 lines) of each FASTQ file
for renz in ['HindIII', 'MboI']:
    print(renz)
    ! head -n 4 /media/storage/FASTQs/K562_"$renz"_1.fastq
    print('')


HindIII
@NS500645:59:HCL32BGXY:1:11101:14163:1054 1:N:0:GCCAAT
CATTCNTAAAGAAAAGAATTTTCAACNCAGAATTTCATATCCAGCCAACTAAGCTAGCTTCAAGGAAATACATTT
+
AAAAA#EEAEEEEEEEEEEAE/EEEE#EEEEEEEEEEEEEEEEAE</EE/EEEEAEEEEEEEEEEEAAEE</EEE

MboI
@NS500645:59:HCL32BGXY:1:11101:4463:1054 1:N:0:CAGATC
CATAGNCCCAAGTGGCTATATCTTCCNCAGAAGTGTGACATATGAGGAGGAAGGATTTTAAGCCCAGATTGACCT
+
AAAAA#EEEEEEEEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE
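
The four-line layout can be read back programmatically. The following is a minimal sketch, not part of TADbit: read_fastq and phred_scores are hypothetical helper names, the parser assumes exactly 4 lines per read (no wrapped sequences), and the decoding assumes the Sanger / Illumina 1.8+ encoding, where the PHRED score is the ASCII code of the character minus 33. The record used is the first HindIII read shown above.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes exactly 4 lines per read (sequences are not wrapped).
    """
    lines = iter(handle)
    while True:
        record = [line.rstrip('\n') for line in islice(lines, 4)]
        if len(record) < 4:
            break  # end of input (a trailing partial record is ignored)
        header, seq, plus, qual = record
        if not header.startswith('@') or not plus.startswith('+'):
            raise ValueError('malformed FASTQ record: %r' % header)
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Decode a quality string, assuming an ASCII offset of 33."""
    return [ord(char) - offset for char in quality]


# First HindIII read from the output above, as a list of lines.
example = [
    '@NS500645:59:HCL32BGXY:1:11101:14163:1054 1:N:0:GCCAAT\n',
    'CATTCNTAAAGAAAAGAATTTTCAACNCAGAATTTCATATCCAGCCAACTAAGCTAGCTTCAAGGAAATACATTT\n',
    '+\n',
    'AAAAA#EEAEEEEEEEEEEAE/EEEE#EEEEEEEEEEEEEEEEAE</EE/EEEEAEEEEEEEEEEEAAEE</EEE\n',
]

for read_id, seq, qual in read_fastq(example):
    scores = phred_scores(qual)
    print(read_id)
    print(scores[:6])  # [32, 32, 32, 32, 32, 2]
```

In this read the two # characters (score 2, the lowest value present) sit at the same positions as the two N bases of the sequence, which is how the sequencer flags bases it could not call.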

Count the number of lines in the file (4 times the number of reads)


In [2]:
! wc -l /media/storage/FASTQs/K562_HindIII_1.fastq


100000000 /media/storage/FASTQs/K562_HindIII_1.fastq

There are 100 M lines in the file, which means 25 M reads in total.
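
The same line-count arithmetic can be done from Python. This is a minimal sketch under stated assumptions: count_reads is a hypothetical helper, the file is uncompressed, and every read takes exactly 4 lines. The toy file below holds two made-up reads and only serves to exercise the function.

```python
import os
import tempfile


def count_reads(path):
    """Return the number of reads, assuming 4 lines per read."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError('%d lines is not a multiple of 4' % n_lines)
    return n_lines // 4


# Toy file with two made-up reads, only to exercise the function.
toy = '@read1\nACGT\n+\nEEEE\n@read2\nTTGA\n+\nAAAA\n'
with tempfile.NamedTemporaryFile('w', suffix='.fastq', delete=False) as tmp:
    tmp.write(toy)

n_reads = count_reads(tmp.name)
os.remove(tmp.name)
print(n_reads)  # 2
```

Raising an error when the line count is not a multiple of 4 is a cheap way to detect a truncated file before mapping.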

Quality check before mapping


In [3]:
from pytadbit.utils.fastq_utils import quality_plot

In [4]:
for r_enz in ['HindIII', 'MboI']:
    quality_plot('/media/storage/FASTQs/K562_{0}_1.fastq'.format(r_enz), r_enz=r_enz, 
                 nreads=1000000, paired=False)


These plots provide a quick overview of the sequencing quality, as well as a rough estimate of the efficiency of the digestion and ligation steps of the Hi-C experiment.
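
To give an intuition for how ligation efficiency can be estimated from raw reads, the sketch below counts the fraction of reads that contain a ligation junction. It is an illustration only and does not claim to reproduce what quality_plot does internally. The function name ligation_fraction and the LIGATION_SITES table are hypothetical, and the junction sequences assume the usual Hi-C protocol in which the digested ends are filled in and blunt-end ligated (AAGCTT becomes AAGCTAGCTT for HindIII, GATC becomes GATCGATC for MboI).

```python
# Assumed junction sequences after fill-in and blunt-end ligation.
LIGATION_SITES = {
    'HindIII': 'AAGCTAGCTT',  # A^AGCTT digested, filled in, re-ligated
    'MboI': 'GATCGATC',       # ^GATC digested, filled in, re-ligated
}


def ligation_fraction(sequences, r_enz):
    """Return the fraction of sequences containing the ligation junction."""
    site = LIGATION_SITES[r_enz]
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if site in seq:
            hits += 1
    return hits / float(total) if total else 0.0


# The two reads displayed at the top of this section.
hindiii_read = ('CATTCNTAAAGAAAAGAATTTTCAACNCAGAATTTCATATCCAGCCAACT'
                'AAGCTAGCTTCAAGGAAATACATTT')
mboi_read = ('CATAGNCCCAAGTGGCTATATCTTCCNCAGAAGTGTGACATATGAGGAGG'
             'AAGGATTTTAAGCCCAGATTGACCT')

print(ligation_fraction([hindiii_read], 'HindIII'))  # 1.0
print(ligation_fraction([mboi_read], 'MboI'))        # 0.0
```

The first HindIII read happens to span a ligation junction while the first MboI read does not. A single read says nothing about efficiency; the fraction only becomes meaningful over a large sample of reads, such as the nreads=1000000 used above.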