This notebook is to get myself to be familair with the pybedtools

import the pybedtools module


In [4]:
import pybedtools
import sys
import os

get the working directory and you can change to the directory you want by os.chdir(path) list all the files in the directory os.listdir(path)


In [6]:
os.getcwd()
# use a pre-shipped bed file as an example
a = pybedtools.example_bedtool('a.bed')

a is a bedtool object, one can acess the object by index


In [15]:
print a[0]
print a[1]
feature = a[1]


chr1	1	100	feature1	0	+

chr1	100	200	feature2	0	+

see what type of the interval is by Interval.file_type All features, no matter what the file type, have chrom, start, stop, name, score, and strand attributes. Note that start and stop are long integers, while everything else (including score) is a string.


In [22]:
print feature.file_type
print feature
print feature.chrom
print feature.start
print feature.stop
print feature.name
print feature.score
print feature.strand
print feature.fields


bed
chr1	100	200	feature2	0	+

chr1
100
200
feature2
0
+
['chr1', '100', '200', 'feature2', '0', '+']

interval can also be accessed by index or like a dictionary


In [25]:
print feature[0]
print feature["chrom"]
print feature[1]
print feature["start"]


chr1
chr1
100
100

In [13]:
print a[1:4]


<itertools.islice object at 0x10d42b418>

slice get an itertools object that can be interated


In [14]:
for interval in a[1:4]:
    print interval


chr1	100	200	feature2	0	+

chr1	150	500	feature3	0	-

chr1	900	950	feature4	0	+

for each interval, one can access the chr,star, end by


In [20]:
for interval in a[1:4]:
    print interval.chrom


chr1
chr1
chr1

Let's do some intersection for 2 bed files


In [34]:
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')

In [46]:
print a.head() # print out only the first 10 lines if you have big bed file
print b.head()


chr1	1	100	feature1	0	+
 chr1	100	200	feature2	0	+
 chr1	150	500	feature3	0	-
 chr1	900	950	feature4	0	+
 None
chr1	155	200	feature5	0	-
 chr1	800	901	feature6	0	+
 None

use the Bedtools.intersect() method


In [51]:
a_and_b = a.intersect(b)

In [43]:
a_and_b.head()


chr1	155	200	feature2	0	+
 chr1	155	200	feature3	0	-
 chr1	900	901	feature4	0	+

one can add flags to the intersect call just as the command line intersectbed

-wa Write the original entry in A for each overlap. may have duplicated entries from A

-u Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B.
The following toy example returns the same result for -u and -wa flag.

a.intersect(b, wa=True).head()


In [49]:
a.intersect(b, u=True).head()


chr1	100	200	feature2	0	+
 chr1	150	500	feature3	0	-
 chr1	900	950	feature4	0	+

saving files

save the Bedtool object to a file, you can add a trackline.


In [53]:
c = a_and_b.saveas('intersection-of-a-and-b.bed', trackline='track name="a and b"')

In [58]:
os.listdir(".")
print c


track name="a and b"
chr1	155	200	feature2	0	+
chr1	155	200	feature3	0	-
chr1	900	901	feature4	0	+

one can chain the methods of pybedtools just like the pipe in the command line.
The following intersect a with b first, and save the intersection in a file. because the intersect() method returns a Bedtool object, it can be chained using .merge() method and finally saved the mearged bed file


In [59]:
x4 = a\
    .intersect(b, u=True)\
    .saveas('a-with-b.bed')\
    .merge()\
    .saveas('a-with-b-merged.bed')

demonstrate the filter method

grep out only the intervals with length bigger than 100


In [64]:
print a
for interval in a:
    print len(interval)


chr1	1	100	feature1	0	+
chr1	100	200	feature2	0	+
chr1	150	500	feature3	0	-
chr1	900	950	feature4	0	+

99
100
350
50

Let's use filter to extract the intervals with length >100


In [63]:
print a.filter(lambda x: len(x) >100)


chr1	150	500	feature3	0	-

Or, use a more generic function


In [65]:
def len_filter(feature, L):
    return len(feature) > L

Then call this function inside the filter method:


In [67]:
print a.filter(len_filter, 100)


chr1	150	500	feature3	0	-

we got the same results as using the lambda. Using len_filter function is more flexiabl, as you can supply any length that you want to filter.

demonstrate the each() method

each() method can apply a function for each interval in the BedTool object.
It is similar to the apply functions in R Let's add counts of how many hits in b intersect a


In [70]:
with_count = a.intersect(b, c=True)
print with_count


chr1	1	100	feature1	0	+	0
chr1	100	200	feature2	0	+	1
chr1	150	500	feature3	0	-	1
chr1	900	950	feature4	0	+	1

Normalize the counts by dividing the length of the interval.use a scalar of 0.001 to normalize it to counts per 1kb


In [74]:
def normalize_count(feature, scalar=0.001):
    count = float(feature[-1])
    normalized_count = count/len(feature) * scalar
    ## write the score back, need to turn it to a string first
    feature.score = str(normalized_count)
    return feature

In [75]:
print with_count.each(normalize_count)


chr1	1	100	feature1	0.0	+	0
chr1	100	200	feature2	1e-05	+	1
chr1	150	500	feature3	2.85714285714e-06	-	1
chr1	900	950	feature4	2e-05	+	1


In [ ]: