To model chromatin structure, we first need to make sure that the data are clean enough. The first step is therefore to plot the distribution of the sum of interactions per row/column of the Hi-C matrix. Based on this distribution, we may remove columns that show a suspiciously low interaction count.
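As a rough illustration of what this distribution is, the sketch below computes per-column totals and a coarse histogram for a small toy matrix. It uses plain Python only; the helper names (`column_sums`, `histogram`), the toy values, and the bin width are assumptions made for this example and are not part of TADbit.

```python
from collections import Counter


def column_sums(matrix):
    """Return the sum of interaction counts in each column of a square matrix."""
    size = len(matrix)
    return [sum(matrix[i][j] for i in range(size)) for j in range(size)]


def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical symmetric contact matrix; the third column is almost empty.
toy_matrix = [
    [50, 30, 0, 10],
    [30, 60, 1, 20],
    [0, 1, 0, 0],
    [10, 20, 0, 40],
]

sums = column_sums(toy_matrix)
hist = histogram(sums, bin_width=50)
print(sums)
print(hist)
```

In this toy case the nearly empty column ends up alone in the lowest bin, which is the kind of outlier the filtering step is meant to catch.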
Here is an example, where "exp" is a preloaded Experiment corresponding to human chromosome 19:
In [1]:
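from pytadbit import Chromosome

# Create the chromosome and attach the Hi-C experiment
# (the resolution matches the 100000 bp bin size given in the file name)
my_chrom = Chromosome('19')
my_chrom.add_experiment('gm', resolution=100000,
                        hic_data='../../scripts/sample_data/HIC_gm06690_chr19_chr19_100000_obs.txt')
exp = my_chrom.experiments[0]

# Flag low-count columns and draw the histogram of column sums
zeroes = exp.filter_columns(draw_hist=True)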
Note that the columns cited in the warning correspond to the columns to the left of the dotted vertical red line.
Then, according to the fit shown above, we would discard all columns in the raw Hi-C data whose cumulative interaction count is below the dashed red line in the graph (~67). These columns will be removed from the modeling, and their associated particles will have no experimental data.
This step is done automatically within TADbit each time an experiment is loaded. To make sure that only outlier columns are removed, TADbit checks that this root (the cutoff obtained from the fit) corresponds to a concave-down region and that it lies between zero and the median of the overall distribution. The resulting set of "bad" columns is stored in the variable Experiment._zeros, which holds the columns to be skipped in subsequent steps.
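The median check and the selection of low-count columns can be sketched as follows. This is a simplified illustration and not TADbit's implementation: the concave-down test on the fit is left out, and the function names, the per-column totals, and the cutoff value are all hypothetical.

```python
from statistics import median


def cutoff_is_plausible(cutoff, sums):
    """Accept a cutoff only if it lies strictly between zero and the median."""
    return 0 < cutoff < median(sums)


def low_count_columns(sums, cutoff):
    """Return the indices of columns whose total count is below the cutoff."""
    return [idx for idx, total in enumerate(sums) if total < cutoff]


sums = [90, 111, 1, 70, 85, 3, 120]  # hypothetical per-column totals
cutoff = 67                          # hypothetical value, as if read from the plot

if cutoff_is_plausible(cutoff, sums):
    skipped = low_count_columns(sums, cutoff)
else:
    skipped = []                     # keep every column if the cutoff looks wrong
print(skipped)
```

Bounding the cutoff by the median is a safeguard: a cutoff above the median would discard more than half of the columns, which is unlikely to reflect genuine outliers.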
Although it is not recommended, column filtering can be skipped by passing the filter_columns=False parameter when loading or creating a pytadbit.experiment.Experiment.
If TADbit finds a null value on the diagonal of the Hi-C data matrix (where the highest values are expected), it assumes that this observation is artifactual and removes the whole row and column passing through this bin.
Any row or column that contains a NaN value will be removed from further steps.
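The two rules above (a zero on the diagonal, or a NaN anywhere in a row or column) can be illustrated with a short plain-Python sketch. The function name `columns_to_drop` and the toy matrix are assumptions for this example; TADbit's own code may proceed differently.

```python
import math


def columns_to_drop(matrix):
    """Return bin indices with a zero diagonal cell or a NaN in their row/column."""
    size = len(matrix)
    drop = set()
    for i in range(size):
        # A zero on the diagonal is treated as an artifact for bin i.
        if matrix[i][i] == 0:
            drop.add(i)
        for j in range(size):
            # Row i and column j both contain this NaN, so flag both bins.
            if math.isnan(matrix[i][j]):
                drop.update((i, j))
    return sorted(drop)


nan = float('nan')
# Hypothetical symmetric matrix: bin 1 has a zero diagonal, bins 3 and 4 share a NaN.
toy_matrix = [
    [10, 2, 3, 1, 2],
    [2, 0, 4, 1, 1],
    [3, 4, 12, 5, 2],
    [1, 1, 5, 9, nan],
    [2, 1, 2, nan, 8],
]

dropped = columns_to_drop(toy_matrix)
print(dropped)
```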