Find_overlap_read Development Notebook

Creation : 2014 / 12 / 05

Adrien Leger adrien.leger@gmail.com

Purpose = Develop a Python3 program to parse a BAM/SAM file(s) and extract reads overlapping given genomic coordinates + generating a report

Using pysam 0.8.1 as the only dependencie


In [2]:
import pysam, csv

In [4]:
with open("../test/ITR_coordinates.txt", newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        print (row)


['SSV9K2-CMV-GFP-HygroTK-bGHpA', '500', '600']
['SSV9K2-CMV-GFP-HygroTK-bGHpA', '4000', '4100']
['SSV9K2-CMV-GFP-HygroTK-bGHpA', '4800', '4900']

Draft a small object class to store basic interval information


In [11]:
class interval(object):
    def __init__ (self, ref_name, start, end):
        self.ref_name = ref_name
        self.start = start
        self.end = end
    def __repr__(self):
        return "{} [{}:{}]".format(self.ref_name, self.start, self.end)

In [13]:
interval_list = []
with open("../test/ITR_coordinates.txt", newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        interval_list.append(interval(row[0], row[1], row[2]))
for i in interval_list:
    print (i)


SSV9K2-CMV-GFP-HygroTK-bGHpA [500:600]
SSV9K2-CMV-GFP-HygroTK-bGHpA [4000:4100]
SSV9K2-CMV-GFP-HygroTK-bGHpA [4800:4900]

Hard code a more robust class in scr. A class Instance list track its own instance and class methods allow a global manipulation of Instances


In [3]:
from Interval import Interval

In [31]:
Interval.resetInstances()

with open("../test/ITR_coordinates.txt", newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        Interval(row[0], row[1], row[2])

Interval.printInstances()
Interval.countInstances()


Clearing Reference instances list
SSV9K2-CMV-GFP-HygroTK-bGHpA [200:600] = 0 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [4050:4100] = 0 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [4800:4850] = 0 overlapping read pairs
Out[31]:
3

Everything works as expected so far.

Now write a new function saying if an interval is overlapping the coordinate store in the instance = is overlapping


In [12]:
for i in Interval.Instances:
    print(i.is_overlapping("SSV9K2-CMV-GFP", 500, 1000))


False
False
False

In [12]:
for i in Interval.Instances:
    print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 500, 1000))


True
False
False

In [9]:
for i in Interval.Instances:
    print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 200, 4200))


True
True
False

In [10]:
for i in Interval.Instances:
    print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 1000, 200))


True
False
False

In [13]:
for i in Interval.Instances:
    print(i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 6000, 500))


True
True
True

The function is working independant of the order of the values. Let's try to use a function to add a read if it is overlapping the interval


In [6]:
for i in Interval.Instances:
    if i.is_overlapping("SSV9K2-CMV-GFP-HygroTK-bGHpA", 4200, 200):
        i.add_read("NOT A REAL READ")
        print ("Adding read to {}".format(repr(i)))
Interval.printInstances()


Adding read to SSV9K2-CMV-GFP-HygroTK-bGHpA [200:600] = 1 overlapping read pairs
Adding read to SSV9K2-CMV-GFP-HygroTK-bGHpA [4050:4100] = 1 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [200:600] = 1 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [4050:4100] = 1 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [4800:4850] = 0 overlapping read pairs

Now We'll start to handle sam files using pysam.


In [44]:
i


Out[44]:
SSV9K2-CMV-GFP-HygroTK-bGHpA [4800:4900] = 0 overlapping read pairs

In [8]:
sam = pysam.AlignmentFile("../test/S1_10.sam", "rb")
for a in sam:
    if not a.is_unmapped:
        print(a.query_name, sam.getrname(a.reference_id), a.reference_start, a.reference_end, a.is_reverse, a.mate_is_reverse)
    else:
        print("Unmapped")


HWI-1KL149:87:HA58EADXX:1:1101:1531:2163 SSV9K2-CMV-GFP-HygroTK-bGHpA 4949 5050 True False
HWI-1KL149:87:HA58EADXX:1:1101:1531:2163 SSV9K2-CMV-GFP-HygroTK-bGHpA 4861 4962 False True
Unmapped
Unmapped
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189 SSV9K2-CMV-GFP-HygroTK-bGHpA 4284 4385 True False
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189 SSV9K2-CMV-GFP-HygroTK-bGHpA 4103 4204 False True
HWI-1KL149:87:HA58EADXX:1:1101:1752:2238 SSV9K2-CMV-GFP-HygroTK-bGHpA 3031 3132 True False
HWI-1KL149:87:HA58EADXX:1:1101:1752:2238 SSV9K2-CMV-GFP-HygroTK-bGHpA 2808 2909 False True
HWI-1KL149:87:HA58EADXX:1:1101:1813:2228 SSV9K2-CMV-GFP-HygroTK-bGHpA 4789 4855 True False
HWI-1KL149:87:HA58EADXX:1:1101:1813:2228 SSV9K2-CMV-GFP-HygroTK-bGHpA 5465 5504 False False

In [35]:
sam = pysam.AlignmentFile("../test/S1_10.sam", "rb")
for a in sam:
    print (a.query_name)
    for i in Interval.Instances:
        if not a.is_unmapped:   
            read_name = sam.getrname(a.reference_id)
            
            # Try to see if the read overlap the coordinates
            if i.is_overlapping(read_name, a.reference_start, a.reference_end):
                i.add_read(a)
                print ("READ OVERLAP")
            
            # Else try to see if the pair overlap the coordinates
            elif not a.mate_is_unmapped:
                mate_name = sam.getrname(a.next_reference_id)
                
                # Pairs concordantly mapped ############# TO WRAP
                if 0 < abs(a.template_length) <= 1000:
                    if read_name == mate_name:
                        if a.is_reverse != a.mate_is_reverse:
                            #
                            if i.is_overlapping(read_name, a.reference_start, a.next_reference_start):
                                i.add_read(a)
                                print ("PAIR OVERLAP")
                            else:
                                print ("No pair overlap")
                        else:
                            print("invalid orientation")
                    else:
                        print("Mate mapped on a different reference")
                else:
                    print("template exceding size limits")
            else:
                print("Mate is unmapped")
        
        else:
            print("Unmapped")


HWI-1KL149:87:HA58EADXX:1:1101:1531:2163
No pair overlap
No pair overlap
No pair overlap
HWI-1KL149:87:HA58EADXX:1:1101:1531:2163
No pair overlap
No pair overlap
No pair overlap
HWI-1KL149:87:HA58EADXX:1:1101:1744:2169
Unmapped
Unmapped
Unmapped
HWI-1KL149:87:HA58EADXX:1:1101:1744:2169
Unmapped
Unmapped
Unmapped
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189
No pair overlap
PAIR OVERLAP
No pair overlap
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189
No pair overlap
PAIR OVERLAP
No pair overlap
HWI-1KL149:87:HA58EADXX:1:1101:1752:2238
template exceding size limits
template exceding size limits
template exceding size limits
HWI-1KL149:87:HA58EADXX:1:1101:1752:2238
template exceding size limits
template exceding size limits
template exceding size limits
HWI-1KL149:87:HA58EADXX:1:1101:1813:2228
Mate mapped on a different reference
Mate mapped on a different reference
READ OVERLAP
HWI-1KL149:87:HA58EADXX:1:1101:1813:2228
invalid orientation
invalid orientation
invalid orientation

In [33]:
Interval.printInstances()


SSV9K2-CMV-GFP-HygroTK-bGHpA [200:600] = 0 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [4050:4100] = 2 overlapping read pairs
SSV9K2-CMV-GFP-HygroTK-bGHpA [4800:4850] = 1 overlapping read pairs

In [48]:
for i in Interval.Instances[1].read_list:
    print (i.query_name)


HWI-1KL149:87:HA58EADXX:1:1101:1606:2189
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189
HWI-1KL149:87:HA58EADXX:1:1101:1606:2189

Write a more advanced csv file parser for interval in Find_overlap_read main.

The main entry should be directly in the shell since a parsing and verification of argument is done but for the sake of development under interactive IDE is is also possible to import the class and to instanciate with argument interval_file and align_files


In [1]:
from Find_overlap_read import Main

Example with a file containing valid interval rows


In [2]:
a = Main(interval_file="../test/Good.txt", align_files=["../test/S1_10.sam"])


interval_file = ../test/Good.txt
bam_output = True
output_prefix = out
align_files = ['../test/S1_10.sam']
report_output = True
Parsing the CSV csv file containing interval coordinates
3 valid interval(s) found in ../test/Good.txt
Clearing Reference instances list

Example with a file containing no valid interval row


In [3]:
a = Main(interval_file="../test/Bad.txt", align_files=["../test/S1_10.sam"])


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-3d9839474f10> in <module>()
----> 1 a = Main(interval_file="../test/Bad.txt", align_files=["../test/S1_10.sam"])

/media/analyse/Pharmaco_AAV/documents_pharmaco_AAV/Programming/Python/Find_overlap_reads/src/Find_overlap_read.py in __init__(self, interval_file, align_files, output_prefix, bam_output, report_output)
     87 
     88         if Interval.countInstances() == 0:
---> 89             raise ValueError ("No valid row found in the interval file. Please see readme file")
     90 
     91         #Interval.printInstances()

ValueError: No valid row found in the interval file. Please see readme file
interval_file = ../test/Bad.txt
bam_output = True
output_prefix = out
align_files = ['../test/S1_10.sam']
report_output = True
Parsing the CSV csv file containing interval coordinates
Not enough values in the row 	Skiping row
Not enough values in the row 	Skiping row
Not enough values in the row 	Skiping row
invalid literal for int() with base 10: 'a' 	Skiping row
Not enough values in the row 	Skiping row

Continue development in source files