Purpose = Develop a Python3 program to parse a BAM/SAM file(s) and extract reads overlapping given genomic coordinates + generating a report
Using pysam 0.8.1 as the only dependencie
In [2]:
import pysam, csv
In [4]:
with open("../test/ITR_coordinates.txt", newline='') as csvfile:
reader = csv.reader(csvfile, delimiter='\t')
for row in reader:
print (row)
Draft a small object class to store basic interval information
In [11]:
class interval(object):
def __init__ (self, ref_name, start, end):
self.ref_name = ref_name
self.start = start
self.end = end
def __repr__(self):
return "{} [{}:{}]".format(self.ref_name, self.start, self.end)
In [13]:
interval_list = []
with open("../test/ITR_coordinates.txt", newline='') as csvfile:
reader = csv.reader(csvfile, delimiter='\t')
for row in reader:
interval_list.append(interval(row[0], row[1], row[2]))
for i in interval_list:
print (i)
Hard code a more robust class in scr. A class Instance list track its own instance and class methods allow a global manipulation of Instances
In [3]:
from Interval import Interval
In [31]:
Interval.resetInstances()
with open("../test/ITR_coordinates.txt", newline='') as csvfile:
reader = csv.reader(csvfile, delimiter='\t')
for row in reader:
Interval(row[0], row[1], row[2])
Interval.printInstances()
Interval.countInstances()
Out[31]:
Everything works as expected so far.
Now write a new function saying if an interval is overlapping the coordinate store in the instance = is overlapping
In [12]:
for i in Interval.Instances:
print(i.is_overlapping("SSV9K2-CMV-GFP", 500, 1000))
In [12]:
for i in Interval.Instances:
print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 500, 1000))
In [9]:
for i in Interval.Instances:
print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 200, 4200))
In [10]:
for i in Interval.Instances:
print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 1000, 200))
In [13]:
for i in Interval.Instances:
print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 6000, 500))
The function is working independant of the order of the values. Let's try to use a function to add a read if it is overlapping the interval
In [6]:
for i in Interval.Instances:
if i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 4200, 200):
i.add_read("NOT A REAL READ")
print ("Adding read to {}".format(repr(i)))
Interval.printInstances()
Now We'll start to handle sam files using pysam.
In [44]:
i
Out[44]:
In [8]:
sam = pysam.AlignmentFile("../test/S1_10.sam", "rb")
for a in sam:
if not a.is_unmapped:
print(a.query_name, sam.getrname(a.reference_id), a.reference_start, a.reference_end, a.is_reverse, a.mate_is_reverse)
else:
print("Unmapped")
In [35]:
sam = pysam.AlignmentFile("../test/S1_10.sam", "rb")
for a in sam:
print (a.query_name)
for i in Interval.Instances:
if not a.is_unmapped:
read_name = sam.getrname(a.reference_id)
# Try to see if the read overlap the coordinates
if i.is_overlapping(read_name, a.reference_start, a.reference_end):
i.add_read(a)
print ("READ OVERLAP")
# Else try to see if the pair overlap the coordinates
elif not a.mate_is_unmapped:
mate_name = sam.getrname(a.next_reference_id)
# Pairs concordantly mapped ############# TO WRAP
if 0 < abs(a.template_length) <= 1000:
if read_name == mate_name:
if a.is_reverse != a.mate_is_reverse:
#
if i.is_overlapping(read_name, a.reference_start, a.next_reference_start):
i.add_read(a)
print ("PAIR OVERLAP")
else:
print ("No pair overlap")
else:
print("invalid orientation")
else:
print("Mate mapped on a different reference")
else:
print("template exceding size limits")
else:
print("Mate is unmapped")
else:
print("Unmapped")
In [33]:
Interval.printInstances()
In [48]:
for i in Interval.Instances[1].read_list:
print (i.query_name)
Write a more advanced csv file parser for interval in Find_overlap_read main.
The main entry should be directly in the shell since a parsing and verification of argument is done but for the sake of development under interactive IDE is is also possible to import the class and to instanciate with argument interval_file and align_files
In [1]:
from Find_overlap_read import Main
Example with a file containing valid interval rows
In [2]:
a = Main(interval_file="../test/Good.txt", align_files=["../test/S1_10.sam"])
Example with a file containing no valid interval row
In [3]:
a = Main(interval_file="../test/Bad.txt", align_files=["../test/S1_10.sam"])
Continue development in source files