One-off script to process PRC2 annotations into a unified track mapping CpG probes to whether or not they are annotated to a PRC2 binding site. This reads in BED files from five different cell lines and calls a site as bound if the majority of the cell lines agree (3+/5). Takes a while to run, but really only needs to be done once.
In [1]:
import NotebookImport
from Imports import *
In [2]:
print 'Reading in PRC2 probes'
In [3]:
probe_annotations = pd.read_hdf(DATA_STORE, 'probe_annotations')
In [4]:
path = './Data/PRC2_Binding/'
In [ ]:
r = {}
for f in os.listdir(path):
df = pd.read_table(path + f, header=None,
names=['chr','start','end','name','score','sig'])
df['chr'] = df['chr'].str.replace('chr','')
res = {}
for c in df['chr'].unique():
df_c = probe_annotations[probe_annotations.Chromosome == c]
coord = df_c.Genomic_Coordinate.astype(float)
test_p = lambda p: (coord >= p.start) & (coord <= p.end)
res[c] = df[df.chr == c].apply(test_p, 1).any()
res = pd.concat(res).groupby(level=1).sum()
r[f] = res
r = pd.concat(r, 1)
In [ ]:
r.to_csv(path + 'mapped_to_methylation_probes.csv')
I'm using a cutoff of 3/5 datasets to annotate a binding site.
In [ ]:
prc2_probes = r.sum(1)>2