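The file is organized in 4 lines per read:
1. starting with @, the header of the DNA sequence with the read id (plus optional fields)
2. the DNA sequence
3. starting with +, the header of the sequence quality (this line could be either a repetition of the first line or empty)
4. the sequence quality, encoded as one character per base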
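To make the four-line layout concrete, here is a minimal sketch of a reader for uncompressed FASTQ text. The helper name `read_fastq` is hypothetical (it is not part of TADbit), and the two example reads are invented for illustration.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from an uncompressed FASTQ handle.

    Hypothetical helper for illustration only; not part of TADbit.
    """
    while True:
        # Consume exactly four lines: one read.
        record = [line.rstrip('\n') for line in islice(handle, 4)]
        if not record:
            break  # clean end of file
        if len(record) < 4:
            raise ValueError('truncated FASTQ record: %r' % record)
        header, seq, plus, qual = record
        if not header.startswith('@') or not plus.startswith('+'):
            raise ValueError('malformed FASTQ record: %r' % header)
        # Drop the leading "@" so only the read id (and optional fields) remains.
        yield header[1:], seq, qual


# Two invented reads; the "+" line is empty in the first and repeats the id in the second.
example = io.StringIO(
    u"@read1 1:N:0\nGATCNA\n+\nIIII#I\n"
    u"@read2\nACGTAC\n+read2\nIIIIII\n"
)
records = list(read_fastq(example))
```

Reading in fixed groups of four lines, rather than looking for lines that start with `@`, matters because `@` is also a valid quality character.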
In [1]:
%%bash
dsrc d -s FASTQs/mouse_B_rep1_1.fastq.dsrc | head -n 8
Count the number of lines in the file (4 times the number of reads)
In [2]:
%%bash
dsrc d -s FASTQs/mouse_B_rep1_1.fastq.dsrc | wc -l
There are 400M lines in the file, which means 100M reads in total (400M / 4).
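The conversion from line count to read count can be sketched as follows. The helper `reads_from_line_count` is a hypothetical name, and the check on divisibility by 4 is an added sanity test, not something the `wc -l` command performs.

```python
def reads_from_line_count(n_lines):
    """Convert a FASTQ line count into a read count (4 lines per read)."""
    if n_lines % 4 != 0:
        # A remainder suggests a truncated or malformed file.
        raise ValueError('line count %d is not a multiple of 4' % n_lines)
    return n_lines // 4


n_reads = reads_from_line_count(400000000)  # the 400M lines reported above
```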
The most important piece of information for analyzing a Hi-C dataset is the restriction enzyme used in the experiment. TADbit provides a simple function to identify it:
In [3]:
from pytadbit.mapping.restriction_enzymes import identify_re
In [4]:
pat, enz, pv = identify_re('FASTQs/mouse_B_rep1_1.fastq.dsrc')
print('- Most probable pattern: %s, matching enzymes: %s' % (pat, ','.join(enz)))
To quickly assess the quality of the Hi-C experiment before mapping, and given that we know the restriction enzyme used, we can check the proportion of reads containing ligation sites as well as the number of reads starting with a cut site.
These numbers give a first hint of the efficiency of two critical steps in the Hi-C experiment: digestion and ligation.
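As a rough illustration of what these two proportions mean, the sketch below counts them on a handful of invented reads. It assumes MboI, whose recognition site is GATC, so that a filled-in and religated junction reads GATCGATC. The helper `digestion_ligation_stats` is hypothetical and is not how TADbit computes its statistics.

```python
def digestion_ligation_stats(sequences, cut_site='GATC', ligation_site='GATCGATC'):
    """Return (fraction with a ligation site, fraction starting with a cut site).

    Hypothetical helper; the default sites assume MboI.
    """
    total = with_ligation = starts_with_cut = 0
    for seq in sequences:
        total += 1
        if ligation_site in seq:
            with_ligation += 1  # read spans a ligation junction
        if seq.startswith(cut_site):
            starts_with_cut += 1  # read begins exactly at a restriction site
    if total == 0:
        raise ValueError('no reads given')
    return with_ligation / float(total), starts_with_cut / float(total)


# Invented reads, for illustration only.
reads = ['GATCAAGT', 'ACGATCGATCTT', 'TTTTACGT', 'GATCGATCAA']
frac_ligation, frac_cut_start = digestion_ligation_stats(reads)
```

In practice the sequences would come from a subsample of the FASTQ file rather than from a hard-coded list.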
In [2]:
from pytadbit.utils.fastq_utils import quality_plot
In [3]:
r_enz = 'MboI'
In [4]:
cell = 'B'
repl = 'rep1'
The top plot shows the typical per-nucleotide quality profile of NGS reads together with the proportion of N (undetermined bases) found at each position.
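The quantities in this first plot can be sketched in a few lines. The example below assumes Phred+33 quality encoding (an assumption; the encoding is not stated here) and uses a hypothetical helper `per_position_profile` on two invented reads. It is not the code behind `quality_plot`.

```python
def per_position_profile(records, offset=33):
    """Return (mean Phred quality, proportion of N) for each read position.

    records: iterable of (sequence, quality) string pairs.
    offset:  ASCII offset of the quality encoding (33 is assumed here).
    """
    qual_sum, n_count, depth = [], [], []
    for seq, qual in records:
        if len(seq) != len(qual):
            raise ValueError('sequence and quality lengths differ')
        # Grow the accumulators if this read is longer than any seen so far.
        while len(depth) < len(seq):
            qual_sum.append(0)
            n_count.append(0)
            depth.append(0)
        for i, (base, q) in enumerate(zip(seq, qual)):
            qual_sum[i] += ord(q) - offset
            n_count[i] += 1 if base == 'N' else 0
            depth[i] += 1
    mean_quality = [s / float(d) for s, d in zip(qual_sum, depth)]
    prop_n = [n / float(d) for n, d in zip(n_count, depth)]
    return mean_quality, prop_n


# Two invented reads: 'I' encodes quality 40 and '#' encodes quality 2 under Phred+33.
mean_quality, prop_n = per_position_profile([('ACGN', 'III#'), ('ANGT', 'I#II')])
```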
The second plot is specific to Hi-C experiments. Given a restriction enzyme, the function searches for ligation sites and for undigested restriction enzyme sites. Depending on the enzyme used, the function can differentiate between dangling ends and undigested sites.
From these proportions some quality statistics can be inferred before mapping:
In [7]:
quality_plot('FASTQs/mouse_{0}_{1}_1.fastq.dsrc'.format(cell, repl), r_enz=r_enz, nreads=1000000)
Out[7]:
Note: this plot is compatible with the use of multiple restriction enzymes, which is why the ligation site is labeled MboI-MboI.