In [1]:
from cruzdb import Genome

We will mirror a set of tables from the UCSC mysql to a local sqlite database and load the file we've just downloaded


In [2]:
local_url = "sqlite:///hg18.db"
tables = ("refGene", "cpgIslandExt")


#Genome('hg18').mirror(tables, local_url)
#Genome(local_url).load_file('wgEncodeRegTfbsClustered.bed')

Now we have a copy of those tables which we can access locally


In [3]:
hg18 = Genome(local_url)

We can use this local copy to annotate a pair of local files. The data is from Hansen et. al: http://www.pnas.org/content/107/1/139.full

It contains intervals that are consistently replicated either early or late.

The annotate command takes a file name (here the early or late intervals) and a list of tables to annotate with.

The annotation will add distance and feature-type columns. For gene-like tables, the feature-type

columns will list the features that overlap the interval: exons, introns, utrs, etc.


In [4]:
for f in ("data_c_constant_early.bed", "data_e_constant_late.bed"):
    with open("supplement/Additional-File-" + ("4-early.anno.bed" if "early" in f else "5-late.anno.bed"), "w") as fh:
        hg18.annotate(f, list(tables[:2]) , feature_strand=True, out=fh, parallel=True)

Here is what the start of one of the files looks like:


In [5]:
! head supplement/Additional-File-4-early.anno.bed; grep "TSS" supplement/Additional-File-4-early.anno.bed | head


chrom	start	end	refGene_name	refGene_distance	refGene_feature	cpgIslandExt_name	cpgIslandExt_distance	cpgIslandExt_feature
chr13	19132500	19133500	MPHOSPH8	0	intron	CpG: 71	26248	
chr13	19140500	19142500	MPHOSPH8	0	intron+cds	CpG: 71	34248	
chr13	19143500	19156500	PSPC1;MPHOSPH8	0;0	nc_intron+nc_exon;intron+utr3+cds	CpG: 71	37248	
chr13	19442500	19459500	ZMYM2	0	intron+utr5	CpG: 159	10870	
chr13	19489500	19490500	ZMYM2	0	intron	CpG: 159	57870	
chr13	19579500	19582500	ZMYM2	15532	intergenic	CpG: 24	-8024	
chr13	19585500	19586500	ZMYM2	21532	intergenic	CpG: 24	-4024	
chr13	19587500	19589500	GJA3	-20894	intergenic	CpG: 24	-1024	shore
chr13	19590500	19606500	GJA3	-3894	intergenic	CpG: 78;CpG: 24	0;0	island;island
chr13	20166500	20179500	IL17D	0	TSS+intron+utr5+cds	CpG: 113	0	island
chr13	20496500	20548500	LATS2	0	TSS+intron+utr5+cds	CpG: 167;CpG: 28	0;0	island;island
chr13	27440500	27528500	PRHOXNB;FLT3;CDX2	0;0;0	gene;intron+utr3+cds;TSS+intron+utr5+cds	CpG: 54;CpG: 227;CpG: 143;CpG: 42	0;0;0;0	island;island;island;island
chr13	27899500	28060500	FLT1	0	TSS+intron+utr5+cds	CpG: 78;CpG: 23;CpG: 200	0;0;0	island;island;island
chr13	40782500	40819500	NAA16	0	TSS+intron+utr5+cds	CpG: 101	0	island
chr13	41451500	41529500	DGKH;VWA8-AS1	0;0	TSS+intron+utr5+cds;nc_intron+nc_exon	CpG: 105;CpG: 120;CpG: 26	0;0;0	island;island;island
chr13	41554500	41650500	DGKH;DGKH	0;0	TSS+intron+utr5+cds;intron+cds	CpG: 26	32940	
chr13	44451500	44472500	NUFIP1;KIAA1704	0;0	TSS+intron+utr5+cds;TSS+intron+utr5+cds	CpG: 64	0	island
chr13	44584500	44636500	GTF2F2	0	TSS+intron+utr5+cds	CpG: 41	0	island
chr13	44874500	44943500	SLC25A30;COG3	0;0	TSS+intron+utr5+cds;TSS+intron+utr5+cds	CpG: 57;CpG: 56	0;0	island;island
grep: write error

We define a helper function to count the number of feature-types (TSS, intron, etc) for one of these files:


In [4]:
from collections import Counter
from toolshed import reader

def count_features(fname):
    feature_counts = Counter()
    feature_counts_unique = Counter()
    gene_counts = 0.0 # number of times a region overlaps a gene

    for d in reader(fname):
        feats = d['refGene_feature'].split(";")
        flat = []
        if feats[0] != "intergenic":
            gene_counts += 1
            
        for f in feats:
            flat.extend(f.split("+"))
        feature_counts.update(flat)
        if len(flat) == 1:
            feature_counts_unique.update(flat)
    return feature_counts, feature_counts_unique, gene_counts

def show_enrichment(early_f, late_f):
    early, uearly, early_gene = count_features(early_f)
    late, ulate, late_gene = count_features(late_f)
    
    early_intron_freq, late_intron_freq = uearly['intron'] / early_gene, ulate['intron'] / late_gene
    return "%.2f%%" % (100. * late_intron_freq / early_intron_freq)

The function records any feature-type as it occurs, and whether it was the only feature. For example, this would differentiate a replicating region that overlapped an intron and an exon from one that overlapped only an exon.

Below, we first note that there is a greater number of overlaps with introns in the early-replicating regions.


In [5]:
early, uearly, early_gene = count_features('supplement/Additional-File-4-early.anno.bed')
late, ulate, late_gene = count_features('supplement/Additional-File-5-late.anno.bed')

early['intron'], late['intron']


Out[5]:
(9369, 4956)

However, if we look at only the cases where the early or late overlap only an intron, and not with other features


In [6]:
uearly['intron'], ulate['intron']


Out[6]:
(2733, 2492)

we see that the difference is much smaller. So, now we wish to ask: given that a region occurs anywhere in a gene, how often is it in the intron only.


In [7]:
uearly['intron'] / early_gene, ulate['intron'] / late_gene


Out[7]:
(0.3074932493249325, 0.4666666666666667)

So, 46.7% of late regions that fall within a gene, only touch an intron.

and 30.7% of early regions that fall within a gene only touch an intron.

We can guage this enrichment of late-replicating regions falling entirely within an intron


In [8]:
early_intron_freq, late_intron_freq = uearly['intron'] / early_gene, ulate['intron'] / late_gene

"%.2f%%" % (100. * late_intron_freq / early_intron_freq)


Out[8]:
'151.76%'

So we see an 152% enrichment for late-replicating regions appearing entirely within introns relative to early-replicating regions when we consider only those regions that occur somehwere in a gene.

Below we repeat the analysis, using only genes with at least 2 exons, since we won't see the effect about without an intron. We encapulsated the above sequence to calculate enrichment into a single function.


In [9]:
refGene = hg18.refGene
from sqlalchemy import and_
q = refGene.filter(refGene.exonStarts > 1).filter(refGene.cdsStart != refGene.cdsEnd)
for f in ("data_c_constant_early.bed", "data_e_constant_late.bed"):
    with open("supplement/Additional-File-" + ("6_early.introns.anno.bed" if "early" in f else "7_late.introns.anno.bed"), "w") as fh:
        hg18.annotate(f, [q] , feature_strand=True, out=fh)

In [12]:
show_enrichment("supplement/Additional-File-6_early.introns.anno.bed", "supplement/Additional-File-7_late.introns.anno.bed")


Out[12]:
'159.39%'

In [ ]:
import cruzdb
cruzdb

In [ ]: