In [4]:
import pybedtools
import sys
import os
get the working directory and you can change to the directory you want by os.chdir(path) list all the files in the directory os.listdir(path)
In [6]:
os.getcwd()
# use a pre-shipped bed file as an example
a = pybedtools.example_bedtool('a.bed')
a is a bedtool object, one can acess the object by index
In [15]:
print a[0]
print a[1]
feature = a[1]
see what type of the interval is by Interval.file_type All features, no matter what the file type, have chrom, start, stop, name, score, and strand attributes. Note that start and stop are long integers, while everything else (including score) is a string.
In [22]:
print feature.file_type
print feature
print feature.chrom
print feature.start
print feature.stop
print feature.name
print feature.score
print feature.strand
print feature.fields
interval can also be accessed by index or like a dictionary
In [25]:
print feature[0]
print feature["chrom"]
print feature[1]
print feature["start"]
In [13]:
print a[1:4]
slice get an itertools object that can be interated
In [14]:
for interval in a[1:4]:
print interval
for each interval, one can access the chr,star, end by
In [20]:
for interval in a[1:4]:
print interval.chrom
Let's do some intersection for 2 bed files
In [34]:
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
In [46]:
print a.head() # print out only the first 10 lines if you have big bed file
print b.head()
use the Bedtools.intersect() method
In [51]:
a_and_b = a.intersect(b)
In [43]:
a_and_b.head()
one can add flags to the intersect call just as the command line intersectbed
-wa Write the original entry in A for each overlap. may have duplicated entries from A
-u Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B.
The following toy example returns the same result for -u and -wa flag.
a.intersect(b, wa=True).head()
In [49]:
a.intersect(b, u=True).head()
In [53]:
c = a_and_b.saveas('intersection-of-a-and-b.bed', trackline='track name="a and b"')
In [58]:
os.listdir(".")
print c
one can chain the methods of pybedtools just like the pipe in the command line.
The following intersect a with b first, and save the intersection in a file.
because the intersect() method returns a Bedtool object, it can be chained using .merge()
method and finally saved the mearged bed file
In [59]:
x4 = a\
.intersect(b, u=True)\
.saveas('a-with-b.bed')\
.merge()\
.saveas('a-with-b-merged.bed')
In [64]:
print a
for interval in a:
print len(interval)
Let's use filter to extract the intervals with length >100
In [63]:
print a.filter(lambda x: len(x) >100)
Or, use a more generic function
In [65]:
def len_filter(feature, L):
return len(feature) > L
Then call this function inside the filter method:
In [67]:
print a.filter(len_filter, 100)
we got the same results as using the lambda. Using len_filter function is more flexiabl, as you can supply any length that you want to filter.
In [70]:
with_count = a.intersect(b, c=True)
print with_count
Normalize the counts by dividing the length of the interval.use a scalar of 0.001 to normalize it to counts per 1kb
In [74]:
def normalize_count(feature, scalar=0.001):
count = float(feature[-1])
normalized_count = count/len(feature) * scalar
## write the score back, need to turn it to a string first
feature.score = str(normalized_count)
return feature
In [75]:
print with_count.each(normalize_count)
In [ ]: