In [1]:
from cruzdb import Genome
We will mirror a set of tables from the UCSC mysql to a local sqlite database and load the file we've just downloaded
In [2]:
local_url = "sqlite:///hg18.db"
tables = ("refGene", "cpgIslandExt")
#Genome('hg18').mirror(tables, local_url)
#Genome(local_url).load_file('wgEncodeRegTfbsClustered.bed')
Now we have a copy of those tables which we can access locally
In [3]:
hg18 = Genome(local_url)
We can use this local copy to annotate a pair of local files. The data is from Hansen et. al: http://www.pnas.org/content/107/1/139.full
It contains intervals that are consistently replicated either early or late.
The annotate command takes a file name (here the early or late intervals) and a list of tables to annotate with.
The annotation will add distance and feature-type columns. For gene-like tables, the feature-type
columns will list the features that overlap the interval: exons, introns, utrs, etc.
In [4]:
for f in ("data_c_constant_early.bed", "data_e_constant_late.bed"):
with open("supplement/Additional-File-" + ("4-early.anno.bed" if "early" in f else "5-late.anno.bed"), "w") as fh:
hg18.annotate(f, list(tables[:2]) , feature_strand=True, out=fh, parallel=True)
Here is what the start of one of the files looks like:
In [5]:
! head supplement/Additional-File-4-early.anno.bed; grep "TSS" supplement/Additional-File-4-early.anno.bed | head
We define a helper function to count the number of feature-types (TSS, intron, etc) for one of these files:
In [4]:
from collections import Counter
from toolshed import reader
def count_features(fname):
feature_counts = Counter()
feature_counts_unique = Counter()
gene_counts = 0.0 # number of times a region overlaps a gene
for d in reader(fname):
feats = d['refGene_feature'].split(";")
flat = []
if feats[0] != "intergenic":
gene_counts += 1
for f in feats:
flat.extend(f.split("+"))
feature_counts.update(flat)
if len(flat) == 1:
feature_counts_unique.update(flat)
return feature_counts, feature_counts_unique, gene_counts
def show_enrichment(early_f, late_f):
early, uearly, early_gene = count_features(early_f)
late, ulate, late_gene = count_features(late_f)
early_intron_freq, late_intron_freq = uearly['intron'] / early_gene, ulate['intron'] / late_gene
return "%.2f%%" % (100. * late_intron_freq / early_intron_freq)
The function records any feature-type as it occurs, and whether it was the only feature. For example, this would differentiate a replicating region that overlapped an intron and an exon from one that overlapped only an exon.
Below, we first note that there is a greater number of overlaps with introns in the early-replicating regions.
In [5]:
early, uearly, early_gene = count_features('supplement/Additional-File-4-early.anno.bed')
late, ulate, late_gene = count_features('supplement/Additional-File-5-late.anno.bed')
early['intron'], late['intron']
Out[5]:
However, if we look at only the cases where the early or late overlap only an intron, and not with other features
In [6]:
uearly['intron'], ulate['intron']
Out[6]:
we see that the difference is much smaller. So, now we wish to ask: given that a region occurs anywhere in a gene, how often is it in the intron only.
In [7]:
uearly['intron'] / early_gene, ulate['intron'] / late_gene
Out[7]:
So, 46.7% of late regions that fall within a gene, only touch an intron.
and 30.7% of early regions that fall within a gene only touch an intron.
We can guage this enrichment of late-replicating regions falling entirely within an intron
In [8]:
early_intron_freq, late_intron_freq = uearly['intron'] / early_gene, ulate['intron'] / late_gene
"%.2f%%" % (100. * late_intron_freq / early_intron_freq)
Out[8]:
So we see an 152% enrichment for late-replicating regions appearing entirely within introns relative to early-replicating regions when we consider only those regions that occur somehwere in a gene.
Below we repeat the analysis, using only genes with at least 2 exons, since we won't see the effect about without an intron. We encapulsated the above sequence to calculate enrichment into a single function.
In [9]:
refGene = hg18.refGene
from sqlalchemy import and_
q = refGene.filter(refGene.exonStarts > 1).filter(refGene.cdsStart != refGene.cdsEnd)
for f in ("data_c_constant_early.bed", "data_e_constant_late.bed"):
with open("supplement/Additional-File-" + ("6_early.introns.anno.bed" if "early" in f else "7_late.introns.anno.bed"), "w") as fh:
hg18.annotate(f, [q] , feature_strand=True, out=fh)
In [12]:
show_enrichment("supplement/Additional-File-6_early.introns.anno.bed", "supplement/Additional-File-7_late.introns.anno.bed")
Out[12]:
In [ ]:
import cruzdb
cruzdb
In [ ]: