In [2]:
# pycl imports
from pycl import *
# Std lib imports
import datetime
from glob import glob
from pprint import pprint as pp
from os import listdir, remove, rename
from os.path import abspath, basename, isdir
from collections import OrderedDict
# Third party imports
import numpy as np
import scipy.stats as stats
import pylab as pl
from Bio import Entrez
# Jupyter specific imports
from IPython.core.display import display, HTML, Markdown, Image
# Pyplot tweaking
%matplotlib inline
pl.rcParams['figure.figsize'] = 30, 10 # that's default image size for this interactive session
# Jupyter display tweaking
toogle_code()
larger_display(75)
# Simplify warning reporting to lighten the notebook style
import warnings
warnings.formatwarning=(lambda message, *args: "{}\n".format(message))
# Allow to use R directly
%load_ext rpy2.ipython
In [8]:
# Specific helper functions
def generate_header (PMID, cell, modification, method):
    h = "# Data cleaned, converted to BED6, coordinates converted to hg38 using liftOver\n"
    h+= "# Maurits Evers (maurits.evers@anu.edu.au)\n"
    h+= "# Data cleaned and standardized. {}\n".format(str(datetime.datetime.today()))
    h+= "# Adrien Leger (aleg@ebi.ac.uk)\n"
    h+= "# RNA_modification={}|Cell_type={}|Analysis_method={}|Pubmed_ID={}\n".format(modification, cell, method, PMID)
    h+= "# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
    return h

def file_summary(file, separator=["\t", "|"], max_items=10):
    n_line = fastcount(file)
    print("Filename:\t{}".format(file))
    print("Total lines:\t{}\n".format(n_line))
    # Print the first 10 and the last 5 lines of the file
    linerange(file, range_list=[[0,9],[n_line-5, n_line-1]])
    print(colsum(file, header=False, ignore_hashtag_line=True, separator=separator, max_items=max_items))

def distrib_peak_len (file, range=None, bins=50, normed=True):
    # Collect the peak lengths (|end - start|) and plot their distribution
    h = []
    for line in open(file):
        if line[0] != "#":
            ls = line.split("\t")
            delta = abs(int(ls[1])-int(ls[2]))
            h.append(delta)
    h.sort()
    pl.hist(h, normed=normed, range=range, bins=bins)
    pl.show()

def pubmed_fetch(pmid):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed', id=pmid, retmode='xml')
    return Entrez.read(handle)[0]

def pmid_to_info(pmid):
    results = pubmed_fetch(pmid)
    try:
        title = results['MedlineCitation']['Article']['ArticleTitle']
    except (KeyError, IndexError):
        title = "NA"
    try:
        # Last name of the first author
        first_name = results['MedlineCitation']['Article']['AuthorList'][0]['LastName']
    except (KeyError, IndexError):
        first_name = "NA"
    try:
        Year = results['MedlineCitation']['Article']['ArticleDate'][0]['Year']
    except (KeyError, IndexError):
        Year = "NA"
    try:
        Month = results['MedlineCitation']['Article']['ArticleDate'][0]['Month']
    except (KeyError, IndexError):
        Month = "NA"
    try:
        Day = results['MedlineCitation']['Article']['ArticleDate'][0]['Day']
    except (KeyError, IndexError):
        Day = "NA"
    return {"title":title, "first_name":first_name, "Year":Year, "Month":Month, "Day":Day}
I collected datasets from different sources: the two big databases of inosine editing sites, RADAR and DARNED, as well as all the datasets cited in the recent review about lncRNAs and epitranscriptomics from Shafik et al. In addition, I also have two recent datasets for m1A and m6A/m6Am. All the datasets need to be carefully reviewed, reformatted to the BED6 format, converted to the hg38 reference genome and reannotated with a recent GENCODE annotation. See the overview of the datasets in the table below.
Modification | Article | Initial number of peaks (Shafik et al) | Final number of peaks | Number of peaks in lncRNA (Shafik et al) | Number of unique lncRNAs (Shafik et al)
---|---|---|---|---|---
Inosine | Z Peng et al. Nat Biotechnol 30, 253-60 (2012) | 22686 (22686) | 21,111 | 3382 (4425) | 505 (846) |
Inosine | M Sakurai et al. Genome Res 24, 522-34 (2014) | 20482 (20482) | 20,482 | 2550 (3050) | 319 (400) |
m5C | S Hussain et al. Cell Rep 4, 255-61 (2013) | 1084 (1084) | 1,084 | 107 (110) | 39 (41) |
m5C | V Khoddami et al. Nat Biotechnol 31, 458-64 (2013) | 20553 (20533) | 20,553 | 1523 (1580) | 39 (38) |
m5C | JE Squires et al. Nucleic Acids Res 40, 5023-33 (2012) | 10490 (10490) | 10,490 | 281 (1544) | 112 (711) |
m6A | D Dominissini et al. Nature 485, 201-6 (2012) | 25918 (25776) | 2,894 | 115 (7397) | 84 (6165) |
m6A | KD Meyer et al. Cell 149, 1635-46 (2012) | 4341 (4341) | 4,341 | 48 (57) | 16 (20) |
m6A and m6Am | B Linder et al. Nat Methods 12, 767-72 (2015) | 15167 (NA) | 15,167 | 385 (NA) | 168 (NA) |
m1A | D Dominissini et al. Nature 530, 441-6 (2016) | 32136 (NA) | 19552 (HeLa:8873, HepG2:8550, HEK293:2129) | 606 (NA) | 338 (NA) |
pseudouridylation | TM Carlile et al. Nature 515, 143-6 (2014) | 8 (8) | 8 | 4 (4) | 3 (3) |
pseudouridylation | X Li et al. Nat Chem Biol 11, 592-7 (2015) | 1489 (1489) | 1,489 | 48 (58) | 44 (54) |
pseudouridylation | S Schwartz et al. Cell 159, 148-62 (2014) | 402 (396) | 402 | 14 (15) | 10 (11) |
Inosine | DARNED database | 333216 (259654) | 290,002 | 24152 (23574) | 1300 (1833) |
Inosine | RADAR database | 2576460 (2576289) | 1,342,374 | 97118 (218793) | 3343 (6376) |
In [62]:
file_summary("./PTM_Original_Datasets/DARNED_human_hg19_all_sites.txt")
DARNED is a little messy and hard to convert, since positions reported by several PMIDs and/or sample types were fused into a single site, which makes the file difficult to parse. I think it would be better to duplicate such sites, one line per PMID and sample type. I just need to verify that the number of fields in the PMID and cell type columns is the same and that they correspond to each other, i.e. the first cell type matches the first PMID.
In [55]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
header = "# DARNED database Human all sites hg19 coordinates\n"
header+= "# Data cleaned, filtered for Inosine editing, standardized and converted to BED6 format\n"
header+= "# Adrien Leger (aleg@ebi.ac.uk) - {}\n".format(str (datetime.datetime.today()))
header+= "# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
reformat_table(
input_file = "./PTM_Original_Datasets/DARNED_human_hg19_all_sites.txt",
output_file = "./PTM_Original_Datasets/DARNED_human_hg19_inosine.bed",
init_template = [0,"\t",1,"\t",2,"\t",3,"\t",4,"\t",5,"\t",6,"\t",7,"\t",8,"\t",9],
final_template = ["chr",0,"\t",1,"\t",1,"\t",3,">",4,"|",8,"|-|",9,"|",5,"\t0\t",2],
keep_original_header = False,
header = header,
replace_internal_space = '_',
replace_null_val = "-",
filter_dict = {0:["-"],1:["-"],2:["?"],3:["T","G","C"],4:["C","T","A","U"],8:["-"],9:["-"]},
subst_dict = {4:{"G":"I"}}
)
file_summary("./PTM_Original_Datasets/DARNED_human_hg19_inosine.bed")
I reformatted and filtered the database with reformat_table to make it BED6 compatible. I removed the sites lacking chromosome, position, tissue or PMID, as well as those with an unknown strand. In addition, I retained only the A>G transitions, which correspond to A>I editing. This filtering step eliminated 47,965 sites.
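As a quick sanity check, the number of eliminated sites can be recomputed from the line counts. A minimal sketch using pycl's simplecount, assuming the raw DARNED dump has a single non-commented column-header line (subtracted manually):
# Sanity check (sketch): recount the sites dropped by the filtering step
# Assumes the raw DARNED file has one column-header line
n_before = simplecount("./PTM_Original_Datasets/DARNED_human_hg19_all_sites.txt") - 1
n_after = simplecount("./PTM_Original_Datasets/DARNED_human_hg19_inosine.bed", ignore_hashtag_line=True)
print("Before: {}\tAfter: {}\tDropped: {}".format(n_before, n_after, n_before - n_after))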
In [56]:
d = {}
with open("./PTM_Original_Datasets/DARNED_human_hg19_inosine.bed", "r") as f:
    for line in f:
        if line[0] != "#":
            ls = supersplit(line, ["\t","|"])
            n_tissue = len(ls[4].split(","))
            n_PMID = len(ls[6].split(","))
            key = "{}:{}".format(n_tissue, n_PMID)
            if key not in d:
                d[key] = 0
            d[key] += 1
print(d)
I checked whether the PMID and tissue fields always have the same number of entries, so that I could demultiplex the fused positions. The answer is no: the numbers of PMIDs and tissues can differ. However, 272,551 positions have only 1 tissue and 1 PMID. These are maybe not the most reliable positions, but they are the easiest to interpret. The sites with only 1 PMID but several tissues can also be used, and the same goes for sites with several PMIDs and 1 tissue: I will demultiplex them so as to have only 1 PMID and 1 tissue per site. As for the sites with both several PMIDs and several tissues, I will extract them into a separate backup file.
In [57]:
infile = "./PTM_Original_Datasets/DARNED_human_hg19_inosine.bed"
outclean = "./PTM_Original_Datasets/DARNED_human_hg19_inosine_cleaned.bed"
outunclean = "./PTM_Original_Datasets/DARNED_human_hg19_inosine_unclean.bed"
with open(infile, "r") as inf, open(outclean, "w") as outf_clean, open(outunclean, "w") as outf_unclean:
    init_sites = uniq = several_tissue = several_pmid = several_all = final_sites = 0
    for line in inf:
        if line[0] == "#":
            outf_clean.write(line)
            outf_unclean.write(line)
        else:
            init_sites += 1
            ls = supersplit(line, ["\t","|"])
            tissue_list = ls[4].split(",")
            PMID_list = ls[6].split(",")
            n_tissue = len(tissue_list)
            n_PMID = len(PMID_list)
            if n_tissue == 1:
                # 1 PMID, 1 tissue = no problem
                if n_PMID == 1:
                    uniq += 1
                    final_sites += 1
                    outf_clean.write(line)
                # Several PMID, 1 tissue = demultiplex PMID lines
                else:
                    several_pmid += 1
                    for PMID in PMID_list:
                        final_sites += 1
                        outf_clean.write("{0}\t{1}\t{2}\t{3}|{4}|{5}|{6}|{7}\t{8}\t{9}".format(
                            ls[0],ls[1],ls[2],ls[3],ls[4],ls[5],PMID.strip(),ls[7],ls[8],ls[9]))
            else:
                # 1 PMID, several tissues = demultiplex tissue lines
                if n_PMID == 1:
                    several_tissue += 1
                    for tissue in tissue_list:
                        final_sites += 1
                        outf_clean.write("{0}\t{1}\t{2}\t{3}|{4}|{5}|{6}|{7}\t{8}\t{9}".format(
                            ls[0],ls[1],ls[2],ls[3],tissue.strip().strip("_").strip("."),ls[5],ls[6],ls[7],ls[8],ls[9]))
                # Several PMID, several tissues = extract the line into the uncleanable dataset
                else:
                    several_all += 1
                    outf_unclean.write(line)
print("Initial sites: ", init_sites)
print("Final clean sites: ", final_sites)
print("1 PMID 1 tissue: ", uniq)
print("1 PMID ++ tissue: ", several_tissue)
print("++ PMID 1 tissue: ", several_pmid)
print("++ PMID ++ tissue: ", several_all)
file_summary(outclean)
file_summary(outunclean)
I only lost the 4,270 sites that had both several PMIDs and several tissues. Some sites were demultiplexed, and I now have 290,018 sites with 1 PMID and 1 tissue each.
In [58]:
# Conversion to hg38 with Crossmap/liftover
lifover_chainfile = "../LiftOver_chain_files/hg19ToHg38.over.chain.gz"
input_bed = "./PTM_Original_Datasets/DARNED_human_hg19_inosine_cleaned.bed"
temp_bed = "./PTM_Original_Datasets/DARNED_human_hg38_inosine_temp.bed"
cmd = "CrossMap.py bed {} {} {}".format(lifover_chainfile, input_bed, temp_bed)
bash(cmd)
# Rewriting and updating of the header removed by Crossmap
final_bed = "./PTM_Original_Datasets/DARNED_human_hg38_inosine_cleaned.bed"
header = "# DARNED database Human all sites hg38 coordinates\n"
header+= "# Data cleaned, filtered for Inosine editing, standardized, converted to BED6 format and updated to hg38 coordinates\n"
header+= "# Adrien Leger (aleg@ebi.ac.uk) - {}\n".format(str (datetime.datetime.today()))
header+= "# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
with open(temp_bed, "r") as infile, open(final_bed, "w") as outfile:
    outfile.write(header)
    for line in infile:
        outfile.write(line)
file_summary(final_bed)
The conversion resulted in the loss of only 16 sites, which is negligible compared with the 290,002 sites in the database.
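The lost sites can be double-checked from CrossMap's companion output: in bed mode, intervals that fail to lift are written to a '.unmap' file next to the output (assumed naming). A minimal sketch:
# Count the sites CrossMap failed to lift over (sketch)
# Assumes CrossMap's bed mode writes failed intervals to '<output>.unmap'
print("Unmapped sites: {}".format(simplecount("./PTM_Original_Datasets/DARNED_human_hg38_inosine_temp.bed.unmap")))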
The file downloaded from RADAR is not a BED file, so I need to modify it to comply with the standard format. The format of the main file Human_AG_all_hg19_v2.txt is the following:
In [45]:
file_summary("./PTM_Original_Datasets/RADAR_human_hg19_v2_primary.txt", separator=["\t"])
I am not interested in the conservation fields, the annotation or the repetitive nature, but I will keep chromosome, position, gene and strand. Additional information is also available in a secondary database file that I found hidden on the RADAR website. It includes the original publication and the source of the biological sample, but it is only available for around half of the sites, coming from 4 publications; I will only retain these richly annotated sites. The same site can have multiple entries in the secondary file, since it could have been reported by several papers in several tissues. I will keep one line for each independently discovered site. The coverage and editing level could be interesting to save too.
In [46]:
file_summary("./PTM_Original_Datasets/RADAR_human_hg19_v2_secondary.txt")
The operation is quite complex since I will have to fuse the 2 files and extract only specific values, so I need to code a specific parser. The main RADAR file will be parsed and organised as a simple nested dict. The secondary file is the more important one, since I will start parsing from it and look up the complementary information in the main database file. Each site will be added to a list of Site objects, which will subsequently be iterated over and combined with the main database before writing a new BED-formatted file.
In [108]:
# Create a structured dict of dict to parse the main database file
from collections import OrderedDict

def parse_RADAR_main (file):
    # Define the top level access dict
    radar_dict = OrderedDict()
    for line in open(file, "r"):
        if line[0] != "#":
            sl = line.split("\t")
            assert len(sl) == 11
            chromosome, position, gene, strand = sl[0].strip(), int(sl[1].strip()), sl[2].strip(), sl[3].strip()
            if chromosome not in radar_dict:
                radar_dict[chromosome] = OrderedDict()
            # There should be only one line per position
            assert position not in radar_dict[chromosome]
            radar_dict[chromosome][position] = {"gene":gene, "strand":strand}
    return radar_dict
In [109]:
# Create a class to store a line of the additional file.
from collections import OrderedDict

class Site (object):
    #~~~~~~~CLASS FIELDS~~~~~~~#
    # Table of correspondence reference => PMID
    TITLE_TO_PMID = {
        "Peng et al 2012":"22327324",
        "Bahn et al 2012":"21960545",
        "Ramaswami et al 2012":"22484847",
        "Ramaswami et al 2013":"23291724",
    }
    # Table of correspondence tissue => sample name
    TISSUE_TO_SAMPLE = {
        "Brain":"Brain",
        "Illumina Bodymap":"Illumina_Bodymap",
        "Lymphoblastoid cell line":"YH",
        "U87 cell line":"U87MG"
    }
    #~~~~~~~FUNDAMENTAL METHODS~~~~~~~#
    # Parse a line of the additional information file
    def __init__(self, line):
        sl = line.strip().split("\t")
        self.chromosome = sl[0].split(":")[0].strip()
        self.position = int(sl[0].split(":")[1].strip())
        self.PMID = self.TITLE_TO_PMID[sl[1].strip()]
        self.tissue = self.TISSUE_TO_SAMPLE[sl[2].strip()]
        self.coverage = sl[3].strip()
        self.editing_level = sl[4].strip()
    # Fundamental class methods str and repr
    def __repr__(self):
        msg = "SITE CLASS\n"
        # List all values in the object dict in alphabetical order
        for key in sorted(self.__dict__.keys()):
            msg += "\t{}\t{}\n".format(key, self.__dict__[key])
        return msg
    def __str__(self):
        return self.__repr__()
In [110]:
a = Site("chr1:1037916\tPeng et al 2012\tLymphoblastoid cell line\t9\t66.67")
print(a)
In [111]:
# Create a sorted list of Site objects to parse the secondary database file
def parse_RADAR_secondary (file):
    # Define a list to store Site objects (not a dict because of redundancy)
    radar_list = []
    for line in open(file, "r"):
        if line[0] != "#":
            radar_list.append(Site(line))
    # Return the list sorted by chromosome and position
    return sorted(radar_list, key=lambda site: (site.chromosome, site.position))
Read the original file, reformat the fields and write a new BED6-compliant file. Example:
chrom chromstart chromend name score strand
chr4 774138 774138 A>I|LOC100129917|LUNG:LYMPHOBLASTOID_CELL_LINE|15342557:15258596:22327324 0 -
In [112]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
def reformat_RADAR (main_file, secondary_file, outfile, errfile, header):
    # Read and structure the 2 database files
    print("Parse the main database file")
    main = parse_RADAR_main(main_file)
    print("Parse the secondary database file")
    secondary = parse_RADAR_secondary(secondary_file)
    print("Combine the data together in a new bed formatted file")
    with open(outfile, "w+") as csvout, open(errfile, "w+") as errout:
        # Rewrite header
        csvout.write(header)
        total = fail = success = 0
        # Start counting at 1 so that 'total' ends up equal to the number of sites
        for total, site in enumerate(secondary, 1):
            try:
                line = "{0}\t{1}\t{1}\t{2}|{3}|{4}|{5}|{6}\t{7}\t{8}\n".format(
                    site.chromosome,
                    site.position,
                    "A>I",
                    site.tissue,
                    "-",
                    site.PMID,
                    main[site.chromosome][site.position]["gene"],
                    site.editing_level,
                    main[site.chromosome][site.position]["strand"],
                )
                csvout.write(line)
                success += 1
            # Sites absent from the main database file go to the error file
            except KeyError:
                line = "{0}\t{1}\t{2}\t{3}\t{4}\n".format(
                    site.chromosome,
                    site.position,
                    site.tissue,
                    site.PMID,
                    site.editing_level)
                errout.write(line)
                fail += 1
    print("{} Sites processed\t{} Sites pass\t{} Sites fail".format(total, success, fail))
In [113]:
header = "# RADAR database Human v2 all sites hg19 coordinates\n"
header+= "# Data cleaned, standardized and converted to BED6 format\n"
header+= "# Adrien Leger (aleg@ebi.ac.uk) - {}\n".format(str (datetime.datetime.today()))
header+= "# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
reformat_RADAR(
main_file = "./PTM_Original_Datasets/RADAR_human_hg19_v2_primary.txt",
secondary_file = "./PTM_Original_Datasets/RADAR_human_hg19_v2_secondary.txt",
outfile = "./PTM_Original_Datasets/RADAR_Human_hg19_inosine_cleaned.bed",
errfile = "./PTM_Original_Datasets/RADAR_Human_hg19_inosine_orphan.bed",
header = header)
file_summary("./PTM_Original_Datasets/RADAR_Human_hg19_inosine_cleaned.bed")
After combining the information, out of the 1,343,463 sites in the RADAR secondary file and the 2,576,460 sites in the RADAR primary file, 1,342,423 consistent sites were found in both files, i.e. about half of the database sites were filtered out because they were not present in both files.
In [114]:
# Conversion to hg38 with Crossmap/liftover
lifover_chainfile = "../LiftOver_chain_files/hg19ToHg38.over.chain.gz"
input_bed = "./PTM_Original_Datasets/RADAR_Human_hg19_inosine_cleaned.bed"
temp_bed = "./PTM_Original_Datasets/RADAR_Human_hg38_inosine_temp.bed"
cmd = "CrossMap.py bed {} {} {}".format(lifover_chainfile, input_bed, temp_bed)
bash(cmd)
# Rewriting and updating of the header removed by Crossmap
final_bed = "./PTM_Original_Datasets/RADAR_Human_hg38_inosine_cleaned.bed"
header = "# RADAR database Human v2 all sites hg38 coordinates\n"
header+= "# Data cleaned, standardized, converted to BED6 format and updated to hg38 coordinates\n"
header+= "# Adrien Leger (aleg@ebi.ac.uk) - {}\n".format(str (datetime.datetime.today()))
header+= "# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
with open(temp_bed, "r") as infile, open(final_bed, "w") as outfile:
    outfile.write(header)
    for line in infile:
        outfile.write(line)
file_summary(final_bed)
Around 10,000 additional sites were lost during the conversion from hg19 to hg38.
The following datasets are pretty clean and already converted to hg38 coordinates. However, they are not standardized and their syntax can differ, since the bioinformatician in charge of the analysis tried to keep as much information as possible from the original datasets. I will standardize the file names and the BED name fields, and reformat the file headers. In addition, I will explore the datasets to check that they are consistent and decide whether or not to use them.
In [35]:
listdir("./PTM_Original_Datasets/")
Out[35]:
In [5]:
infile="./PTM_Original_Datasets/editing_Peng_hg38.bed"
PMID = "22327324"
cell = "YH"
modification = "A>I"
method = "A_to_I_editing"
author = "Peng"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [5]:
print(colsum(infile, colrange=[9], header=False, ignore_hashtag_line=True, separator=["\t", "|"], max_items=20, ret_type="report"))
Column 9 contains more than just inosine transitions (A>G); it also lists all the other editing sites they found. Here I will focus on inosine only, so I need to filter out all the other values.
In [61]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"|",7,"|",8,"|",9,"|",10,"%|",11,"|",12,"|",13,"|",14,"|",15,"|",16,"|",17,"|",18,"\t",19,"\t",20]
final_template=[0,"\t",1,"\t",2,"\t",9,"|",cell,"|",method,"|",PMID,"|",18,"\t",10,"\t",20]
# filter out all but A>G transition which are Inosine transition
filter_dict={9:["T->C","G->A","C->T","T->G","C->G","G->C","A->C","T->A","C->A","G->T","A->T"]}
# Reformat the field value A->G to A>I for standardization
subst_dict={9:{"A->G":"A>I"}}
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-",
subst_dict = subst_dict,
filter_dict = filter_dict )
file_summary(outfile)
In [17]:
distrib_peak_len(outfile)
The file contains a lot of fields, some of which I cannot even guess the content of. The dataset was not filtered and contains more than just the A>G (A>I) transitions: out of 22,686 sites in total, only 21,111 are A>G transitions. I filtered out all the other modifications and retained only the A>G transitions.
I am not sure that I should use this dataset, especially since it is already included in the RADAR database.
In [18]:
infile = "./PTM_Original_Datasets/editing_Sakurai_hg38.bed"
PMID = "24407955"
cell = "Brain"
modification = "A>I"
method = "ICE_seq"
author = "Sakurai"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [7]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"\t",7,"\t",8]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t",7,"\t",8]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [19]:
distrib_peak_len(outfile)
No problem with this dataset. I kept the gene locus name for future comparison after reannotation with GENCODE.
In [20]:
infile = "./PTM_Original_Datasets/m5C_Hussain_hg38.bed"
PMID = "23871666"
cell = "HEK293"
modification = "m5C"
method = "miCLIP"
author = "Hussain"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [9]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"\t",5,"\t",6]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|-\t",5,"\t",6]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [13]:
distrib_peak_len(outfile)
No problem with this dataset. Since there is no gene locus, I just filled the field with a dash to indicate that it is empty.
In [21]:
infile="./PTM_Original_Datasets/m5C_Khoddami_hg38.bed"
PMID = "23604283"
cell = "MEF"
modification = "m5C"
method = "AzaIP"
author = "Khoddami"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [11]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"\t",5,"\t",6]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t",5,"\t",6]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [22]:
distrib_peak_len(outfile)
No problem with this dataset. It seems to be focused on tRNA genes, which are clearly over-represented in the gene list.
In [23]:
infile="./PTM_Original_Datasets/m5C_Squires_hg38.bed"
PMID = "22344696"
cell = "HeLa"
modification = "m5C"
method = "bisulfite_seq"
author = "Squires"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [13]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"\t",5,"\t",6]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|-\t",5,"\t",6]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [24]:
distrib_peak_len(outfile)
1 nt wide peaks = no problem with this dataset. Since there is no gene locus, I just filled the field with a dash to indicate that it is empty.
In [51]:
infile="./PTM_Original_Datasets/m6A_Dominissini_hg38.bed"
PMID = "22575960"
cell = "HepG2"
modification = "m6A"
method = "M6A_seq"
author = "Dominissini"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [4]:
distrib_peak_len(infile, normed=False, bins=500)
Apparently there is a problem in the data, since some of the peaks are up to 3,500,000 bp long, which is much longer than initially described in the paper. I downloaded the original peak-calling file from the supplementary data of the paper to compare it with this datafile.
In [30]:
infile="./PTM_Original_Datasets/m6A_Dominissini_hg19_original_table.csv"
file_summary(infile)
In [26]:
distrib_peak_len(infile, normed=False, bins=500)
In [27]:
distrib_peak_len(infile, normed=False, range=[1,2000], bins=500)
In [28]:
distrib_peak_len(infile, normed=False, range=[1,200], bins=200)
The same problem is also found in the original data, though the values only go up to 2,500,000... I think the long peaks were improperly called. Looking at the data in detail, most of the peaks fall in the 1 to 130 bp range, with a second, smaller mode around 1,000 bp. According to the original article, the mapping was done on the human transcriptome rather than the genome, and the coordinates were apparently converted to genome coordinates afterwards, which might explain this discrepancy. I have 2 options: starting from scratch with a recent genome build, but the dataset seems quite tricky and I am not sure I could do it as well as the original authors; or being restrictive and keeping only the small peaks, i.e. ≤ 1,000 bp. I will start with this second alternative and go back to the raw data again if needed. To be sure of the quality of the data, I will start from the original table and do the liftOver conversion myself.
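To quantify the impact of the 1,000 bp cutoff before filtering, a quick count over the original table; a minimal sketch, assuming tab-separated columns with start and end in columns 2 and 3, as used by distrib_peak_len above:
# Sketch: count peaks retained/discarded by the 1,000 bp cutoff
n_kept = n_dropped = 0
for line in open("./PTM_Original_Datasets/m6A_Dominissini_hg19_original_table.csv"):
    if line[0] != "#":
        ls = line.split("\t")
        if abs(int(ls[2]) - int(ls[1])) <= 1000:
            n_kept += 1
        else:
            n_dropped += 1
print("Peaks <= 1000 bp: {}\tPeaks > 1000 bp: {}".format(n_kept, n_dropped))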
In [52]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
infile="./PTM_Original_Datasets/m6A_Dominissini_hg19_original_table.csv"
outfile = "./PTM_Clean_Datasets/Dominissini_m6A_HepG2_hg19_cleaned.bed"
init_template=[0,"\t",1,"\t",2,"\t",3,"\t",4]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t-\t",3]
# Predicate function to filter out large peaks
predicate = lambda val_list: abs(int(val_list[1])-int(val_list[2])) <= 1000
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-",
predicate = predicate)
file_summary(outfile)
The filtering on peak length removes a lot of peaks: nearly 90% of the dataset.
In [53]:
# Conversion to hg38 with Crossmap/liftover
lifover_chainfile = "../LiftOver_chain_files/hg19ToHg38.over.chain.gz"
input_bed = "./PTM_Clean_Datasets/Dominissini_m6A_HepG2_hg19_cleaned.bed"
temp_bed = "./PTM_Clean_Datasets/Dominissini_m6A_HepG2_hg38_temp.bed"
cmd = "CrossMap.py bed {} {} {}".format(lifover_chainfile, input_bed, temp_bed)
bash(cmd)
# Rewriting and updating of the header removed by Crossmap
final_bed = "./PTM_Clean_Datasets/Dominissini_m6A_HepG2_hg38_cleaned.bed"
header = generate_header(PMID, cell, modification, method)
with open(temp_bed, "r") as infile, open(final_bed, "w") as outfile:
    outfile.write(header)
    for line in infile:
        outfile.write(line)
file_summary(final_bed)
In [55]:
distrib_peak_len(final_bed, normed=False, bins=200)
It is much better, but I only have 2,894 sites out of the ~25,000 initially in the dataset. For this first exploratory study it should be OK, but I will probably have to go back to the original data again later.
In [14]:
infile="./PTM_Original_Datasets/m6A_Meyer_hg38.bed"
PMID = "22608085"
cell = "HEK293"
modification = "m6A"
method = "MeRIP_Seq"
author = "Meyer"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [16]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"\t",6,"\t",7]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t",6,"\t",7]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [17]:
distrib_peak_len(outfile, bins = 100)
The original dataset was OK and clean. Contrary to the previous dataset, the peak widths are between 100 and 220 bp, which is clearly better. It is however interesting to note that the peak lengths are not randomly distributed.
In [27]:
infile="./PTM_Original_Datasets/miCLIP_m6A_Linder2015_hg38.bed"
PMID = "26121403"
cell = "HEK293"
modification = "m6A:m6Am"
method = "miCLIP"
author = "Linder"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [22]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"\t",4,"\t",5]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|-\t",4,"\t",5]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [28]:
distrib_peak_len(outfile)
The name field is unusable since it contains a variable number of subfields, so I cannot parse it easily. That is not a big problem since I will reannotate the data based on the latest GENCODE release, so I did not save any of the information contained in the original name field. As expected with miCLIP data, the peaks are 1 nt wide.
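To document the issue, one can tally how many subfields the original name column contains per line; a sketch, assuming a tab-separated BED with the name in column 4 and '|'-separated subfields (adjust the separator if the original uses another one):
# Sketch: tally the number of subfields in the original name column
subfield_counts = {}
for line in open("./PTM_Original_Datasets/miCLIP_m6A_Linder2015_hg38.bed"):
    if line[0] != "#":
        n = len(line.split("\t")[3].split("|"))
        if n not in subfield_counts:
            subfield_counts[n] = 0
        subfield_counts[n] += 1
print(subfield_counts)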
In [29]:
infile="./PTM_Original_Datasets/MeRIPseq_m1A_Dominissini2016_hg38.bed"
PMID = "26863196"
cell = "HeLa:HEK293:HepG2"
modification = "m1A"
method = "M1A_seq"
author = "Dominissini"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
I don't understand some of the categories for the HepG2 cells.
In [37]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"\t",6,"\t",7]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",5,"|",method,"|",PMID,"|",3,"\t",6,"\t",7]
# Filter out the HepG2 stress-condition and total RNA categories
filter_dict={5:["HEPG2_heat_shock_4h","HEPG2_Glucose_starv_4h","HEPG2_common_total_RNA"]}
# Standardize the name of the remaining HepG2 category
subst_dict={5:{"HEPG2_common_mRNA":"HepG2"}}
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-",
subst_dict = subst_dict,
filter_dict = filter_dict
)
file_summary(outfile)
In [30]:
distrib_peak_len(outfile)
In [31]:
infile="./PTM_Original_Datasets/pseudoU_Carlile_hg38.bed"
PMID = "25192136"
cell = "HeLa"
modification = "Y"
method = "Pseudo_seq"
author = "Carlile"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [32]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"\t",5,"\t",6]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t",5,"\t",6]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [32]:
distrib_peak_len(outfile)
The dataset is OK, but it contains only the 8 peaks identified in lncRNAs, and 1 of the peaks is really wide (40,000 bp): the coordinates seem to correspond to the gene coordinates rather than to the peaks themselves. Should I use the dataset? Is it really worth remapping everything for such a low number of eventual peaks?
In [33]:
infile="./PTM_Original_Datasets/pseudoU_Li_hg38.bed"
PMID = "26075521"
cell = "HEK293"
modification = "Y"
method = "CeU_Seq"
author = "Li"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [35]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"\t",7,"\t",8]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t",7,"\t",8]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [34]:
distrib_peak_len(outfile)
No problem with this dataset
In [35]:
infile="./PTM_Original_Datasets/pseudoU_Schwartz_hg38.bed"
PMID = "25219674"
cell = "HEK293:Fibroblast"
modification = "Y"
method = "Psi-seq"
author = "Schwartz"
outfile = "./PTM_Clean_Datasets/{}_{}_{}_hg38_cleaned.bed".format(author, modification, cell, method)
file_summary(infile)
In [39]:
# chrom chromstart chromend modif|cell_type|method|PMID|loci score strand\n"
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"\t",5,"\t",6]
final_template=[0,"\t",1,"\t",2,"\t",modification,"|",cell,"|",method,"|",PMID,"|",4,"\t",5,"\t",6]
reformat_table(
input_file=infile,
output_file=outfile,
init_template=init_template,
final_template=final_template,
keep_original_header = False,
header = generate_header(PMID, cell, modification, method),
replace_internal_space='_',
replace_null_val="-")
file_summary(outfile)
In [36]:
distrib_peak_len(outfile)
No problem with this dataset
In [57]:
for f in sorted(glob("./PTM_Clean_Datasets/*.bed")):
    print(f)
    linerange(f, [[10,12]])
OK for all the datasets
In [58]:
for f in sorted(glob("./PTM_Clean_Datasets/*.bed")):
    print("\n", "-"*100)
    print("Dataset Name\t{}".format(basename(f)))
    print("Number sites\t{}".format(simplecount(f, ignore_hashtag_line=True)))
    a = colsum(
        f,
        colrange = [3,4,5,6],
        header=False,
        ignore_hashtag_line=True,
        separator=["\t", "|"],
        max_items=20,
        ret_type="dict"
    )
    # Get more info via pubmed
    print("PMID")
    for pmid, count in a[6].items():
        pubmed_info = pmid_to_info(pmid)
        print("\t*{}\t{}\n\t {}. et al, {}\{}\{}\n\t {}".format(
            pmid, count,
            pubmed_info["first_name"],
            pubmed_info["Year"],
            pubmed_info["Month"],
            pubmed_info["Day"],
            pubmed_info["title"]))
    # Simple listing for the other fields
    for title, col in [["RNA PTM",3],["Tissue/cell",4],["Method",5]]:
        print(title)
        print(dict_to_report(a[col], ntab=1, max_items=10, tab="\t", sep="\t"))
The original annotations might not be optimal, and they were probably not made from a single reference annotation file. I will reannotate all the datasets with the latest version of GENCODE (gencode.v24.long_noncoding_RNAs.gff3). I split the file in 3 to retain only genes, transcripts and exons, as sketched below. I also got the general GENCODE file containing all the annotated genes in the primary assembly.
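For the record, the split by feature type can be reproduced with a few lines; a minimal sketch, assuming an uncompressed GFF3 with the feature type in column 3 (the output file names are illustrative):
# Sketch: split a GFF3 annotation into one file per feature type of interest
in_gff = "../../Reference_Annotation/gencode_v24_lncRNAs.gff3"
for feature in ["gene", "transcript", "exon"]:
    out_gff = in_gff.replace(".gff3", "_{}.gff3".format(feature))
    with open(in_gff, "r") as inf, open(out_gff, "w") as outf:
        for line in inf:
            # Keep header lines and rows whose feature type (column 3) matches
            if line.startswith("#") or line.split("\t")[2] == feature:
                outf.write(line)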
In [120]:
# New dir to create annotated files
mkdir("PTM_Annotated_Datasets")
mkdir("Test")
Out[120]:
I found pybedtools, a Python wrapper package for BEDTools, to manipulate BED files. I will use it to intersect my BED files containing the positions of the PTMs (or peaks) with the GFF3 annotation files. This will allow me to get gene names for each position.
In [17]:
help(reformat_table)
In [23]:
head("../../Reference_Annotation/gencode_v24.gff3.gz")
In [29]:
import pybedtools
annotation_file = "../../Reference_Annotation/gencode_v24.gff3"
peak_file = "./PTM_Clean_Datasets/Dominissini_m6A_HepG2_hg38_cleaned.bed"
output_file = "./test.bed"
peak = pybedtools.BedTool(peak_file)
annotation = pybedtools.BedTool(annotation_file)
intersection = peak.intersect(annotation, wo=True, s=True)
# Reformat the file generated by pybedtools to a simple Bed format
init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"|",7,"\t",8,"\t",9,"\t",10,"\t",11,"\t",12,"\t",13,
"\t",14,"\t",15,"\t",16,"\t",17,"\tID=",18,";gene_id=",19,";gene_type=",20,";gene_status=",21,
";gene_name=",22,";level=",23,";havana_gene=",24]
final_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"|",18,"|",20,"|",22,"\t",8,"\t",9]
# BedTool.head() prints the first lines itself and returns None
intersection.head()
print("Post processing results")
reformat_table(
input_file=intersection.fn,
output_file=output_file,
init_template=init_template,
final_template=final_template,
replace_internal_space='_',
replace_null_val="-",
keep_original_header = False,
predicate = lambda v: v[12] == "gene"
)
head(output_file)
In [36]:
import pybedtools

def intersect_extract_genecodeID (annotation_file, peak_file, outdir):
    output_file = "{}/{}_{}.bed".format(outdir, file_basename(peak_file), file_basename(annotation_file))
    genecount_file = "{}/{}_{}_uniq-gene.csv".format(outdir, file_basename(peak_file), file_basename(annotation_file))
    site_file = "{}/{}_{}_uniq-sites.csv".format(outdir, file_basename(peak_file), file_basename(annotation_file))
    peak = pybedtools.BedTool(peak_file)
    annotation = pybedtools.BedTool(annotation_file)
    # Intersect the 2 files with pybedtools
    print("Intersecting {} with {}".format(file_basename(peak_file), file_basename(annotation_file)))
    intersection = peak.intersect(annotation, wo=True, s=True)
    # Reformat the file generated by pybedtools to a simple Bed format
    init_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"|",7,"\t",8,"\t",9,"\t",10,"\t",11,"\t",12,"\t",13,
        "\t",14,"\t",15,"\t",16,"\t",17,"\tID=",18,";gene_id=",19,";gene_type=",20,";gene_status=",21,
        ";gene_name=",22,";level=",23,";havana_gene=",24]
    final_template=[0,"\t",1,"\t",2,"\t",3,"|",4,"|",5,"|",6,"|",18,"|",20,"|",22,"\t",8,"\t",9]
    h = "# Data cleaned, converted to BED6, standardized and coordinates converted to hg38 using liftOver\n"
    h+= "# Overlapping gene annotation with gencode v24\n"
    h+= "# Adrien Leger (aleg@ebi.ac.uk) {}\n".format(str(datetime.datetime.today()))
    h+= "# chrom\tchromstart\tchromend\tmodif|cell_type|method|PMID|ensembl_id|gene_type|gene_name\tscore\tstrand\n"
    print("Post processing results")
    reformat_table(
        input_file=intersection.fn,
        output_file=output_file,
        init_template=init_template,
        final_template=final_template,
        replace_internal_space='_',
        replace_null_val="-",
        header = h,
        keep_original_header = False,
        predicate = lambda v: v[12] == "gene"
    )
    # Count the number of lines in the initial and final peak files
    i, j = simplecount(peak_file, ignore_hashtag_line=True), simplecount(output_file, ignore_hashtag_line=True)
    print("Total initial positions: {}\tTotal final positions: {}".format(i, j))
    # Count uniq gene ids and uniq positions found in the dataset
    geneid_dict = OrderedDict()
    coord_dict = OrderedDict()
    with open(output_file, "r") as fp:
        for line in fp:
            if line[0] != "#":
                sl = supersplit(line, separator=["\t", "|"])
                # Gene id, gene type and gene name
                gene_id = "{}\t{}\t{}".format(sl[7], sl[9], sl[8])
                if gene_id not in geneid_dict:
                    geneid_dict[gene_id] = 0
                geneid_dict[gene_id] += 1
                coord = "{}:{}-{}".format(sl[0], sl[1], sl[2])
                if coord not in coord_dict:
                    coord_dict[coord] = 0
                coord_dict[coord] += 1
    print("Uniq genes found\t{}\nUniq positions found\t{}\n".format(len(geneid_dict), len(coord_dict)))
    # Write each gene id found with its number of occurrences
    with open(genecount_file, "w") as fp:
        fp.write(dict_to_report(geneid_dict, max_items=0, sep="\t"))
    # Write each position found with its number of occurrences
    with open(site_file, "w") as fp:
        fp.write(dict_to_report(coord_dict, max_items=0, sep="\t"))
Add the gene name and Ensembl gene ID to the BED name field.
Test with one file:
In [44]:
annotation_file = "../../Reference_Annotation/gencode_v24.gff3"
peak_file = "./PTM_Clean_Datasets/DARNED_human_hg38_inosine_cleaned.bed"
outdir = "./"
output_file = "./DARNED_human_hg38_inosine_cleaned_gencode_v24.bed"
genecount_file = "./DARNED_human_hg38_inosine_cleaned_gencode_v24_uniq-gene.csv"
site_file = "./DARNED_human_hg38_inosine_cleaned_gencode_v24_uniq-sites.csv"
intersect_extract_genecodeID (annotation_file, peak_file, outdir)
file_summary(output_file, separator=["\t","|"])
head (genecount_file)
head (site_file)
remove(output_file)
remove(genecount_file)
remove(site_file)
It is working OK, so I can now loop over all the cleaned PTM files.
In [45]:
# Annotation vs gencode v24 lncRNA genes
annotation_file = '/home/aleg/Data/Reference_Annotation/gencode_v24_lncRNAs.gff3'
for peak_file in sorted(glob("./PTM_Clean_Datasets/*.bed")):
    outdir = "./PTM_Annotated_Datasets"
    intersect_extract_genecodeID(annotation_file, peak_file, outdir)
Between 1 and 16% of the peaks (excluding the weird Carlile dataset) fall in lncRNAs annotated in GENCODE v24. Now I iterate through the datasets to count the number of unique genes found.
In [46]:
# Annotation vs gencode v24 all genes
annotation_file = "/home/aleg/Data/Reference_Annotation/gencode_v24.gff3"
for peak_file in sorted(glob("./PTM_Clean_Datasets/*.bed")):
    outdir = "./PTM_Annotated_Datasets"
    intersect_extract_genecodeID(annotation_file, peak_file, outdir)
Article | Modification | Total initial sites (Shafik) | Total final sites | All sites in RNA | Unique sites in RNA | Unique RNAs with sites | All sites in lncRNA (Shafik) | Unique sites in lncRNA | Unique lncRNAs with sites (Shafik)
---|---|---|---|---|---|---|---|---|---
Peng et al | Inosine | 22686 (22686) | 21,111 | 18,827 | 17,081 | 4,034 | 3382 (4425) | 3259 | 505 (846) |
Sakurai et al | Inosine | 20482 (20482) | 20,482 | 19,906 | 17,230 | 2,648 | 2550 (3050) | 2330 | 319 (400) |
Hussain et al | m5C | 1084 (1084) | 1,084 | 332 | 296 | 158 | 107 (110) | 106 | 39 (41) |
Khoddami et al | m5C | 20553 (20533) | 20,553 | 14,234 | 12,011 | 321 | 1523 (1580) | 1523 | 36 (38) |
Squires et al. | m5C | 10490 (10490) | 10,490 | 11,038 | 10,169 | 4,659 | 281 (1544) | 272 | 112 (711) |
Dominissini et al. | m6A | 25918 (25776) | 2,894 | 1,080 | 1,019 | 466 | 115 (7397) | 114 | 84 (6165) |
Meyer et al | m6A | 4341 (4341) | 4,341 | 4,564 | 4,312 | 1,525 | 48 (57) | 48 | 16 (20) |
Linder et al | m6A and m6Am | 15167 (NA) | 15,167 | 15,945 | 14,851 | 6,191 | 385 (NA) | 375 | 168 (NA) |
Dominissini et al. | m1A | 32136 (NA) | 19552 (HeLa:8873, HepG2:8550, HEK293:2129) | 20,786 | 19,063 | 8,516 | 606 (NA) | 578 | 338 (NA) |
Carlile et al | pseudouridylation | 8 (8) | 8 | 25 | 7 | 22 | 4 (4) | 3 | 3 (3) |
Li et al | pseudouridylation | 1489 (1489) | 1,489 | 1,579 | 1,471 | 1,425 | 48 (58) | 47 | 44 (54) |
Schwartz et al | pseudouridylation | 402 (396) | 402 | 421 | 372 | 378 | 14 (15) | 14 | 10 (11) |
DARNED | Inosine | 333216 (259654) | 290,002 | 294,441 | 254,964 | 10,558 | 24152 (23574) | 19814 | 1300 (1833) |
RADAR | Inosine | 2576460 (2576289) | 1,342,374 | 1,372,609 | 918,049 | 17,532 | 97118 (218793) | 68396 | 3343 (6376) |
I obtained fewer peaks and fewer unique genes than Shafik et al., probably because I used more stringent filtering and cleaning criteria (especially for the Dominissini 2012 paper). Overall the results are consistent, though for later analyses I will probably have to go back to the original data for several papers.
In [ ]: