Read PRC2 Probes

One-off script to process PRC2 annotations into a unified track mapping CpG probes to whether or not they are annotated to a PRC2 binding site. This reads in BED files from five different cell lines and calls a site as bound if the majority of the cell lines agree (3+/5). Takes a while to run, but really only needs to be done once.


In [1]:
import NotebookImport
from Imports import *


importing IPython notebook from Imports

In [2]:
print 'Reading in PRC2 probes'


Reading in PRC2 probes

In [3]:
probe_annotations = pd.read_hdf(DATA_STORE, 'probe_annotations')

In [4]:
path = './Data/PRC2_Binding/'

In [ ]:
r = {}
for f in os.listdir(path):
    df = pd.read_table(path + f, header=None, 
                       names=['chr','start','end','name','score','sig'])
    df['chr'] = df['chr'].str.replace('chr','')

    res = {}
    for c in df['chr'].unique():
        df_c = probe_annotations[probe_annotations.Chromosome == c]
        coord = df_c.Genomic_Coordinate.astype(float)
        test_p = lambda p: (coord >= p.start) & (coord <= p.end)
        res[c] = df[df.chr == c].apply(test_p, 1).any()
    res = pd.concat(res).groupby(level=1).sum()
    r[f] = res
r = pd.concat(r, 1)

In [ ]:
r.to_csv(path + 'mapped_to_methylation_probes.csv')

I'm using a cutoff of 3/5 datasets to annotate a binding site.


In [ ]:
prc2_probes = r.sum(1)>2