Python

Dynamically typed language created by Guido van Rossum (BDFL) in mid-90.

code blocks are defined by INDENTATION
everything is an object
everything is a reference
first class functions
garbage collected
batteries included

OBJECT:

A data structure that holds (encapsulates) both data and actions that can be performed on these data

REFERENCE:

A pointer to specific memory location that holds the actual value (object)

FIRST-CLASS FUNCTIONS:

Functions (reusable blocks of code that compute something) can be manipulated (assigned and passed around) as any other object

BATTERIES:

Python's standard library (things that come with language) and docs are tremendous resources

To save later aggravation, set your editors to use 4 spaces indents and NEVER EVER mix tabs and spaces.



In [1]:

    
import this









    



The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

The takehome message

Python was designed to be concise, high-level multi-paradigm language. You can write Java or C++ like code in python but it'll look ugly. There is a pythonic way of doing things. Often it is the best way, but not always.

The tradeoffs:

speed
lack of easy(true) concurrency

Code is written once but read many times

You're collaborating with yourself in the future

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.

John F Woods

Let's dive right in

Python is interpreted language. When you launch it you find yourself in a python shell which operates the same way as bash.

Types

Numerical types

`int` - integer



In [2]:

    
a = 1
a









    Out[2]:





1

Integer division gotcha (python 2 specific)



In [1]:

    
from __future__ import division, print_function



In [10]:

    
b = 2

a / b









    Out[10]:





0.5



In [4]:

    
a / float(b)









    Out[4]:





0.5



In [5]:

    
float(a) / b









    Out[5]:





0.5

`float` - floating point (real) number

In Python this means 64-bit float:

1-bit fraction + 11-bit exponent + 53-bit fraction



In [6]:

    
# Float can be written in two ways
f = 1.23456
g = 1e-8
f + g









    Out[6]:





1.23456001

`complex` - complex number



In [7]:

    
c = 1j # same as (0 + 1j)
i = 1.092 - 2.98003j
c - i









    Out[7]:





(-1.092+3.98003j)

Sequence (iterable) types

`str` - string



In [2]:

    
s = 'I am a Python string. I can be "manipulated" in many ways.'
s









    Out[2]:





'I am a Python string. I can be "manipulated" in many ways.'



In [3]:

    
seq = '>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'
seq









    Out[3]:





'>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'



In [4]:

    
print(seq)









    



>RNA	AGUUCGAGGCUUAAACGGGCCUUAU



In [5]:

    
len(seq)









    Out[5]:





30



In [6]:

    
print(s[0])
print(s[5])

I
a



In [7]:

    
# Strings are immutable!!!
s[3] = 'n'









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-2151f380161c> in <module>()
      1 # Strings are immutable!!!
----> 2 s[3] = 'n'

TypeError: 'str' object does not support item assignment



In [8]:

    
# Let's have a look inside:
print(dir(s))









    



['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']



In [9]:

    
s.split()









    Out[9]:





['I',
 'am',
 'a',
 'Python',
 'string.',
 'I',
 'can',
 'be',
 '"manipulated"',
 'in',
 'many',
 'ways.']



In [10]:

    
s.split('p')









    Out[10]:





['I am a Python string. I can be "mani', 'ulated" in many ways.']



In [11]:

    
s.upper()









    Out[11]:





'I AM A PYTHON STRING. I CAN BE "MANIPULATED" IN MANY WAYS.'



In [12]:

    
seq.split()









    Out[12]:





['>RNA', 'AGUUCGAGGCUUAAACGGGCCUUAU']



In [13]:

    
seq









    Out[13]:





'>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'



In [14]:

    
# Unpacking
name, seq = seq.split()
print('Name: {}\tSequence: {}'.format(name,seq))









    



Name: >RNA	Sequence: AGUUCGAGGCUUAAACGGGCCUUAU



In [15]:

    
s = '               i am  a really    stupid    long string   '.strip()
s.strip('i')









    Out[15]:





' am  a really    stupid    long string'



In [16]:

    
sl = s.split()
print(' '.join(sl))









    



i am a really stupid long string

`list` - (ordered) collection of elements



In [17]:

    
sl, len(sl)









    Out[17]:





(['i', 'am', 'a', 'really', 'stupid', 'long', 'string'], 7)



In [21]:

    
sl[2:]









    Out[21]:





['a', 'really', 'stupid', 'long', 'string']



In [22]:

    
print(dir(sl))









    



['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']



In [24]:

    
sl.insert(2, 'another')
sl









    Out[24]:





['i', 'am', 'another', 'a', 'really', 'stupid', 'long', 'string', 'another']



In [25]:

    
sl.extend(['another', 'item'])
sl









    Out[25]:





['i',
 'am',
 'another',
 'a',
 'really',
 'stupid',
 'long',
 'string',
 'another',
 'another',
 'item']



In [32]:

    
# Lists are mutable!!!
lst = ['item', 'another item']
lst1 = [item.upper() for item in lst]
print(lst1)

lst.append('add to lst')
print(lst1)









    



['ITEM', 'ANOTHER ITEM']
['ITEM', 'ANOTHER ITEM']



In [30]:

    
lst.append('YAY')
print(lst)
print(lst1)









    



['item', 1, 2, 'another item', 'add to lst', 'YAY']
['item', 1, 2, 'another item']



In [28]:

    
# Lists are ordered and iterable:
lst1 = []
for item in lst:
    lst1.append(item)
    
print(lst1)









    



['item', 1, 2, 'another item', 'add to lst']



In [34]:

    
for c in seq:
    print(c)









    



A
G
U
U
C
G
A
G
G
C
U
U
A
A
A
C
G
G
G
C
C
U
U
A
U



In [33]:

    
# Don't do this:
for i in range(len(lst)):
    print(lst[i])
    
# If you need index of each element:
for ind,item in enumerate(lst):
    print(ind, item)









    



item
another item
add to lst
0 item
1 another item
2 add to lst



In [35]:

    
# To check that an element is in list:

print('item' in lst)
print('Item' in lst)

if 'item' in lst:
    print('Item in list!')
else:
    print('Not in list')









    



True
False
Item in list!

`tuple` is a immutable list



In [36]:

    
t = ('name', 'position')
name, position = t
print(name, position)









    



name position



In [38]:

    
# From the earlier example:
r1 = range(11)
r2 = range(0, 100, 10)
for i1,i2 in zip(r1,r2):
    print(i1,i2)

DIFFERENCE FROM ARRAYS:

Elements of list do not need to be of the same type

`sets` - unordered collections of unique elements



In [39]:

    
sl









    Out[39]:





['i',
 'am',
 'another',
 'a',
 'really',
 'stupid',
 'long',
 'string',
 'another',
 'another',
 'item']



In [40]:

    
set(sl)









    Out[40]:





{'a', 'am', 'another', 'i', 'item', 'long', 'really', 'string', 'stupid'}



In [41]:

    
# Nice optimization trick:
huge_list = range(1000000)
huge_set = set(huge_list)

def check_list(elem, lst=None):
    if lst:
        # DO smth
    return elem in huge_list

def check_set(elem):
    return elem in huge_set

# And let's refactor this into one function and talk about variable scopes



In [44]:

    
%time check_list(10000)









    



CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 276 µs






    Out[44]:





True



In [45]:

    
%time check_set(10000)









    



CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 14.1 µs






    Out[45]:





True

`dictionary` - key/value pairs

AKA hashtable, associative array. In general, mapping type.



In [48]:

    
d = {'key1': 1, 'key2': 20, 'key3': 345}
d









    Out[48]:





{'key1': 1, 'key2': 20, 'key3': 345}

key can be any hashable type. What it means in practical terms is that it must be immutable.



In [47]:

    
print(dir(d))









    



['__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values', 'viewitems', 'viewkeys', 'viewvalues']

`dict` is mutable!



In [49]:

    
d.pop('key1')
print(d)
# Note also that dict is unordered









    



{'key3': 345, 'key2': 20}



In [50]:

    
# Assignment
d['key1'] = 101
d['key2'] = 202
d









    Out[50]:





{'key1': 101, 'key2': 202, 'key3': 345}



In [51]:

    
# Membership test operates on keys
print('key3' in d)
print('key4' in d)









    



True
False



In [52]:

    
# Looping over key:value pairs

for key,val in d.items():
    print(key,val)









    



key3 345
key2 202
key1 101

Working with files.



In [ ]:

    
# This works but is not good

fh = open('myfile.txt', 'r')
for line in fh:
    # Do something
fh.close()



In [ ]:

    
# Better way (context manager)
import os.path
directory = 'dir1'
filename = 'file.txt'

with open(os.path.join(directory, filename), 'r') as fh:
    for line in fh:
        # Do something
# Do something else. At this point fh.close() will be called automatically



In [ ]:

    
# We can also work with gzipped files
import gzip

with gzip.open('myfile.gz', 'rb') as fh: # Note 'rb' as file opening mode
    for line in fh:
        # Do something

Standard library

Python has expansive standard library and comprehensive, accessible documentation complete with examples. Some of the must-know modules:

collections
itertools
os and particularly os.path
re

Comparisons

In comparison operations each operand is first cast to boolean.

The boolean type is a subclass of int (False ~ 0, True ~ 1) mainly for historical reasons.



In [55]:

    
a = 1
b = 0
if 0 < a < 2:
    print('a is True!')
if not b:
    print('b is False!')









    



a is True!
b is False!



In [56]:

    
s1 = 'Fox'
s2 = 'Dog'

# Equality check
if s1 != s2:
    print('s1 and s2 are different')
    
# Identity check
if s1 is s2:
    pass
else:
    print('s1 and s2 are not the same thing!')









    



s1 and s2 are different
s1 and s2 are not the same thing!



In [57]:

    
s2 = s1

# Identity check
if s1 is s2:
    print('s1 and s2 POINT to the same thing')
else:
    print('s1 and s2 are not the same thing!')









    



s1 and s2 POINT to the same thing



In [58]:

    
# This however doesn't work for integers
b = 1
if a == b:
    print('a and b are equal!')
    
if a is b:
    print('a and b are not the same thing!')
    
# Implementation detail: integers from -5 to ~256 are cached (singletons)









    



a and b are equal!
a and b are not the same thing!



In [59]:

    
# Empty sequence types evaluate to False, non-empty ones to True
lst1 = []
lst2 = ['a']

if lst1 and lst2:
    print('Both lists are non-empty.')
else:
    print('One of the lists is empty.')









    



One of the lists is empty.



In [61]:

    
if lst1:
    pass
else:
    print('Empty')









    



Empty



In [ ]:

    
# Another singleton
a = None

Example



In [62]:

    
!ls -lah ../data









    



total 191M
drwxrwxr-x 2 ilya ilya 4.0K Sep 30 18:05 .
drwxrwxr-x 5 ilya ilya 4.0K Sep 16 13:23 ..
-rw-rw-r-- 1 ilya ilya  24M Sep 30 08:54 7E2.pileup.gz
-rw-rw-r-- 1 ilya ilya  21M Sep 30 15:49 AAGCTA_R2.pileup.gz
-rw-rw-r-- 1 ilya ilya 146M Sep 29 12:04 BJ-HSR1_R1.fastq.gz
-rw-rw-r-- 1 ilya ilya 500K Sep 16 09:03 contigs.fasta
-rw-r----- 1 ilya ilya  644 Sep 16 09:00 dHSR1.fa
-rw-rw-r-- 1 ilya ilya 5.4K Sep 23 15:58 gradtimes.txt
-rw-rw-r-- 1 ilya ilya  445 Sep 16 09:00 hHSR-435.fa
-rw-rw-r-- 1 ilya ilya  611 Sep 16 09:00 hHSR.fa
-rw-rw-r-- 1 ilya ilya  126 Sep 16 09:04 rose.fa



In [70]:

    
import os
import sys
import csv
import gzip
import pandas as pd
csv.field_size_limit(sys.maxsize)

def parse_pileup(barcode, dirname='../results', track='minus', sample_id='minus'):
    filename = os.path.join(dirname, '{0}_{1}.pileup.gz'.format(barcode, sample_id))
    print(filename)
    with gzip.open(filename, 'rb') as pileup:
        reader = csv.DictReader(pileup,
                        delimiter='\t',
                        fieldnames=['seqname', 'pos', 'base', 'coverage', 'details', 'qual'])
        data = []
        for rec in reader:
            pos = int(rec['pos'])
            last = rec
            if pos == 1:
                data.append({'pos': 0, 'base': '*',
                             track: rec['details'].count('^')})
            else:
                data.append({'pos': pos-1, 'base': last['base'],
                             track: rec['details'].count('^')})
    return pd.DataFrame.from_records(data)



In [71]:

    
df = parse_pileup('AAGCTA', dirname='../data', sample_id='R2')









    



../data/AAGCTA_R2.pileup.gz



In [72]:

    
df









    Out[72]:






  
    
      
      base
      minus
      pos
    
  
  
    
      0
      *
      25
      0
    
    
      1
      C
      1
      1
    
    
      2
      T
      5
      2
    
    
      3
      C
      2
      3
    
    
      4
      G
      1
      4
    
    
      5
      C
      1
      5
    
    
      6
      T
      0
      6
    
    
      7
      A
      21
      7
    
    
      8
      T
      2
      8
    
    
      9
      G
      5
      9
    
    
      10
      C
      1
      10
    
    
      11
      G
      0
      11
    
    
      12
      T
      24
      12
    
    
      13
      C
      2
      13
    
    
      14
      A
      32
      14
    
    
      15
      C
      1
      15
    
    
      16
      G
      2
      16
    
    
      17
      T
      25
      17
    
    
      18
      T
      15
      18
    
    
      19
      A
      18
      19
    
    
      20
      A
      14
      20
    
    
      21
      A
      8
      21
    
    
      22
      A
      1
      22
    
    
      23
      A
      39
      23
    
    
      24
      T
      29
      24
    
    
      25
      A
      27
      25
    
    
      26
      A
      34
      26
    
    
      27
      T
      10
      27
    
    
      28
      G
      2
      28
    
    
      29
      A
      12
      29
    
    
      ...
      ...
      ...
      ...
    
    
      655
      C
      0
      655
    
    
      656
      C
      0
      656
    
    
      657
      T
      0
      657
    
    
      658
      A
      0
      658
    
    
      659
      G
      0
      659
    
    
      660
      C
      0
      660
    
    
      661
      T
      0
      661
    
    
      662
      T
      0
      662
    
    
      663
      A
      0
      663
    
    
      664
      A
      0
      664
    
    
      665
      A
      0
      665
    
    
      666
      T
      0
      666
    
    
      667
      C
      0
      667
    
    
      668
      G
      0
      668
    
    
      669
      G
      0
      669
    
    
      670
      G
      0
      670
    
    
      671
      C
      0
      671
    
    
      672
      T
      0
      672
    
    
      673
      T
      0
      673
    
    
      674
      C
      0
      674
    
    
      675
      G
      0
      675
    
    
      676
      G
      0
      676
    
    
      677
      T
      0
      677
    
    
      678
      C
      0
      678
    
    
      679
      C
      0
      679
    
    
      680
      G
      0
      680
    
    
      681
      G
      0
      681
    
    
      682
      T
      0
      682
    
    
      683
      T
      0
      683
    
    
      684
      C
      0
      684
    
  

685 rows × 3 columns

Assignment

Do the summary of E.coli .gff file in python

EXTRA CREDIT:

Find all occurences of the sequence GGGGCGGGGG in E.coli genome

EXTRA EXTRA CREDIT:

Find all almost exact occurences of the same sequence in E.coli genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz)



In [1]:

    
import gzip
from collections import Counter

foo = 'bar'

def column_stat(filename, col=2, sep='\t'):
    counter = Counter()
    foo = 'BAZ'
    print(foo)
    with gzip.open(filename) as fi:
        for line in fi:
            line_norm = line.strip()
            if line and (not line_norm.startswith('#')):
                fields = line_norm.split(sep)
                counter[fields[col]] += 1
    return counter
                         
def column_stat1(filename, **kwargs):
    counter = Counter()
    col = kwargs['col']
    sep = kwargs['sep']
    with gzip.open(filename) as fi:
        for line in fi:
            line_norm = line.strip()
            if line and (not line_norm.startswith('#')):
                fields = line_norm.split(sep)
                counter[fields[col]] += 1
    return counter



In [2]:

    
#column_stat('../data/GCF_000005845.2_ASM584v2_genomic.gff.gz', col=3, sep=',')
kwargs = {'col': 2, 'sep': '\t'}
column_stat('../data/GCF_000005845.2_ASM584v2_genomic.gff.gz', **kwargs)









    



BAZ






    Out[2]:





Counter({'gene': 4518, 'CDS': 4386, 'repeat_region': 355, 'exon': 178, 'tRNA': 89, 'ncRNA': 65, 'region': 62, 'rRNA': 22, 'tmRNA': 2})



In [3]:

    
!ls -lah ../data









    



total 192M
drwxrwxr-x 2 ilya ilya 4.0K Oct 14 14:41 .
drwxrwxr-x 5 ilya ilya 4.0K Oct 14 11:21 ..
-rw-rw-r-- 1 ilya ilya  24M Sep 30 08:54 7E2.pileup.gz
-rw-rw-r-- 1 ilya ilya  21M Sep 30 15:49 AAGCTA_R2.pileup.gz
-rw-rw-r-- 1 ilya ilya 146M Sep 29 12:04 BJ-HSR1_R1.fastq.gz
-rw-rw-r-- 1 ilya ilya 500K Sep 16 09:03 contigs.fasta
-rw-r----- 1 ilya ilya  644 Sep 16 09:00 dHSR1.fa
-rw-r----- 1 ilya ilya 1.4M Oct 14 14:41 GCF_000005845.2_ASM584v2_genomic.fna.gz
-rw-rw-r-- 1 ilya ilya 440K Sep 10 13:08 GCF_000005845.2_ASM584v2_genomic.gff.gz
-rw-rw-r-- 1 ilya ilya 5.4K Sep 23 15:58 gradtimes.txt
-rw-rw-r-- 1 ilya ilya  445 Sep 16 09:00 hHSR-435.fa
-rw-rw-r-- 1 ilya ilya  611 Sep 16 09:00 hHSR.fa
-rw-rw-r-- 1 ilya ilya  11K Oct 14 09:18 ROSE1_dp.ps
-rw-rw-r-- 1 ilya ilya  126 Sep 16 09:04 rose.fa



In [4]:

    
pattern = 'GGGGCGGGGG'
nuc = {'A':'T', 'T' : 'A', 'C' : 'G',  'G' : 'C'}

def rc(seq):
    return ''.join([nuc[i] for i in seq])[::-1]

with gzip.open('../data/GCF_000005845.2_ASM584v2_genomic.fna.gz') as fi:
    glines = []
    for line in fi:
        line_norm = line.strip().upper()
        if not line_norm.startswith('>'):
            glines.append(line_norm)

seq = ''.join(glines)

def find_kmer(seq, pattern):
    seq_map = {}
    k = len(pattern)
    for kmer, pos in [(seq[i:i+k], i) for i in xrange(0, len(seq)-k)]:
        if kmer in seq_map:
            seq_map[kmer].append(pos)
        else:
            seq_map[kmer] = [pos,]
    return seq_map
    
forward = find_kmer(seq, pattern)
print(' '.join([str(x) for x in forward[pattern]]))
reverse = find_kmer(seq, rc(pattern))
print(' '.join([str(x) for x in reverse[rc(pattern)]]))



In [5]:

    
nuc = {'A':'T', 'T' : 'A', 'C' : 'G',  'G' : 'C'}
#take the patter            , search it based on dic
res = []
for i in pattern :
     res.append(nuc[i])
print(''.join(res)[::-1])









    



CCCCCGCCCC



In [6]:

    
def rc(seq):
    return ''.join([nuc[i] for i in seq])[::-1]



In [7]:

    
D = 1 # number of mismatches

def hamming(s1, s2):
    '''
    Computes Hamming distance between s1 and s2.
    '''
    if len(s1) != len(s2):
        raise ValueError('s1 and s2 must be the same length to compute Hamming distance!')
    return sum(ch1 != ch2 for ch1,ch2 in zip(s1, s2))

seq_map = {}
k = len(pattern)
for kmer, pos in [(seq[i:i+k], i) for i in xrange(0, len(seq)-k+1)]:
    if hamming(kmer, pattern) <= D:
        if kmer in seq_map:
            seq_map[kmer].append(pos)
        else:
            seq_map[kmer] = [pos,]
    
res = ' '.join([' '.join([str(x) for x in v]) for key,v in seq_map.items()])

print(' '.join([str(x) for x in sorted([int(x) for x in res.split()])]))









    



2860 45677 50605 86354 97470 101235 248083 276891 286522 300115 367904 380012 405528 473473 501186 512656 524363 534546 535907 545473 555202 628914 632077 658626 691553 709660 808205 808724 813079 836575 836581 874687 884335 890106 903868 907568 915200 1004148 1074191 1080153 1096948 1113652 1138944 1247359 1250157 1265701 1276830 1308740 1341172 1351229 1419600 1453475 1462855 1462856 1483410 1499251 1507397 1513904 1527347 1618065 1674763 1727483 1782055 1806481 1824177 2007998 2074125 2136200 2181890 2326438 2557226 2604532 2673735 2715650 2719810 2723457 2842019 2877653 2885610 2906533 2979722 3003468 3046077 3091234 3239688 3240972 3390519 3497375 3504882 3559071 3641398 3662701 3732140 3740898 3852203 4026938 4052372 4082697 4101779 4156859 4205221 4239235 4258086 4341719 4396994 4492704 4549873 4596496



In [8]:

    
s1 = 'AAATTTTAAA'
s2 = 'AAATTATAAT'
for c1,c2 in zip(s1,s2):
    print(c1,c2)









    



('A', 'A')
('A', 'A')
('A', 'A')
('T', 'T')
('T', 'T')
('T', 'A')
('T', 'T')
('A', 'A')
('A', 'A')
('A', 'T')



In [9]:

    
def hamming(s1, s2):
    '''
    Computes Hamming distance between s1 and s2.
    '''
    if len(s1) != len(s2):
        raise ValueError('s1 and s2 must be the same length to compute Hamming distance!')
    return sum(ch1 != ch2 for ch1,ch2 in zip(s1, s2))

def find_with_mismatch(seq, pattern, D=1):
    seq_map = {}
    k = len(pattern)
    for kmer, pos in [(seq[i:i+k], i) for i in xrange(0, len(seq)-k+1)]:
        if hamming(kmer, pattern) <= D:
            if kmer in seq_map:
                seq_map[kmer].append(pos)
            else:
                seq_map[kmer] = [pos,]
    return seq_map
        
print(find_with_mismatch(seq, pattern, D=0))









    



{'GGGGCGGGGG': [300115]}



In [10]:

    
print(find_with_mismatch(seq, pattern, D=1))









    



{'GGGGCGGCGG': [2860, 836581, 915200, 1250157, 1341172, 1806481, 2719810, 3497375, 3852203, 4052372], 'GGGGCGAGGG': [632077, 1351229, 3046077, 3239688, 3390519, 3504882, 4341719, 4549873], 'GGGGCGGGGT': [534546, 1419600, 3641398], 'GGGGCCGGGG': [1462855, 3091234], 'GGGGAGGGGG': [405528, 907568], 'GGGGCGGGGC': [628914, 4596496], 'CGGGCGGGGG': [276891, 808205, 1074191, 2979722], 'GGGGCGGGAG': [367904, 808724, 2885610], 'GGGGCGGTGG': [286522, 473473, 709660, 903868, 1080153, 1824177, 2136200], 'GGGGCGGGGG': [300115], 'GTGGCGGGGG': [884335, 2326438, 3240972], 'GGCGCGGGGG': [545473, 2007998, 2557226, 4082697, 4205221], 'GGGGCGGGCG': [2181890, 4026938, 4156859], 'GGGGCGGGTG': [524363, 1265701, 1507397, 1618065, 3662701, 4396994], 'GGGGTGGGGG': [1308740], 'TGGGCGGGGG': [248083, 890106, 1096948, 1727483, 1782055, 2715650], 'GCGGCGGGGG': [97470, 101235, 535907, 658626, 813079, 2906533, 3003468, 4492704], 'GGGGCTGGGG': [45677, 86354, 1513904, 2074125, 2604532, 2723457, 2842019, 4258086], 'GGGGCGCGGG': [555202, 1276830, 1483410, 3559071], 'GGGGCAGGGG': [50605, 1113652, 1138944, 1499251, 1527347], 'AGGGCGGGGG': [1453475], 'GGTGCGGGGG': [1247359, 4239235], 'GGGCCGGGGG': [1004148, 1462856, 2673735], 'GGGGCGGGGA': [512656, 4101779], 'GGGGGGGGGG': [380012], 'GAGGCGGGGG': [836575, 2877653, 3740898], 'GGAGCGGGGG': [501186, 691553, 874687, 3732140], 'GGGGCGTGGG': [1674763]}



In [ ]:

	base	minus	pos
0	*	25	0
1	C	1	1
2	T	5	2
3	C	2	3
4	G	1	4
5	C	1	5
6	T	0	6
7	A	21	7
8	T	2	8
9	G	5	9
10	C	1	10
11	G	0	11
12	T	24	12
13	C	2	13
14	A	32	14
15	C	1	15
16	G	2	16
17	T	25	17
18	T	15	18
19	A	18	19
20	A	14	20
21	A	8	21
22	A	1	22
23	A	39	23
24	T	29	24
25	A	27	25
26	A	34	26
27	T	10	27
28	G	2	28
29	A	12	29
...	...	...	...
655	C	0	655
656	C	0	656
657	T	0	657
658	A	0	658
659	G	0	659
660	C	0	660
661	T	0	661
662	T	0	662
663	A	0	663
664	A	0	664
665	A	0	665
666	T	0	666
667	C	0	667
668	G	0	668
669	G	0	669
670	G	0	670
671	C	0	671
672	T	0	672
673	T	0	673
674	C	0	674
675	G	0	675
676	G	0	676
677	T	0	677
678	C	0	678
679	C	0	679
680	G	0	680
681	G	0	681
682	T	0	682
683	T	0	683
684	C	0	684

	base	minus	pos
0	*	25	0
1	C	1	1
2	T	5	2
3	C	2	3
4	G	1	4
5	C	1	5
6	T	0	6
7	A	21	7
8	T	2	8
9	G	5	9
10	C	1	10
11	G	0	11
12	T	24	12
13	C	2	13
14	A	32	14
15	C	1	15
16	G	2	16
17	T	25	17
18	T	15	18
19	A	18	19
20	A	14	20
21	A	8	21
22	A	1	22
23	A	39	23
24	T	29	24
25	A	27	25
26	A	34	26
27	T	10	27
28	G	2	28
29	A	12	29
...	...	...	...
655	C	0	655
656	C	0	656
657	T	0	657
658	A	0	658
659	G	0	659
660	C	0	660
661	T	0	661
662	T	0	662
663	A	0	663
664	A	0	664
665	A	0	665
666	T	0	666
667	C	0	667
668	G	0	668
669	G	0	669
670	G	0	670
671	C	0	671
672	T	0	672
673	T	0	673
674	C	0	674
675	G	0	675
676	G	0	676
677	T	0	677
678	C	0	678
679	C	0	679
680	G	0	680
681	G	0	681
682	T	0	682
683	T	0	683
684	C	0	684