Alignment-based sequence methods

This notebook introduces the alignment-based sequence methods (operationalized by the Optimal Matching (OM) algorithm), which was originally developed for matching protein and DNA sequences in biology and used extensively for analyzing strings in computer science and recently widely applied to explore the neighborhood change.

It generally works by finding the minimum cost for aligning one sequence to match another using a combination of operations including substitution, insertion, deletion and transposition. The cost of each operation can be parameterized diferently and may be theory-driven or data-driven. The minimum cost is considered as the distance between the two sequences.

The sequence module in giddy provides a suite of alignment-based sequence methods.

Author: Wei Kang weikang9009@gmail.com



In [1]:

    
import numpy as np
import pandas as pd



In [2]:

    
import libpysal
import mapclassify as mc
f = libpysal.io.open(libpysal.examples.get_path("usjoin.csv"))
pci = np.array([f.by_col[str(y)] for y in range(1929,2010)])
q5 = np.array([mc.Quantiles(y,k=5).yb for y in pci]).transpose()
q5









    



/Users/weikang/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval






    Out[2]:





array([[0, 0, 0, ..., 0, 0, 0],
       [2, 2, 2, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [3, 3, 2, ..., 2, 2, 2],
       [3, 3, 3, ..., 4, 4, 4]])



In [3]:

    
q5.shape









    Out[3]:





(48, 81)

Import Sequence class from giddy.sequence:



In [4]:

    
from giddy.sequence import Sequence

"hamming"

substitution cost = 1
insertion/deletion cost = $\infty$



In [5]:

    
seq_hamming = Sequence(q5, dist_type="hamming")
seq_hamming









    Out[5]:





<giddy.sequence.Sequence at 0x7ff498df1160>



In [6]:

    
seq_hamming.seq_dis_mat #pairwise sequence distance matrix









    Out[6]:





array([[ 0., 75.,  7., ..., 21., 81., 78.],
       [75.,  0., 80., ..., 79., 57., 73.],
       [ 7., 80.,  0., ..., 14., 81., 81.],
       ...,
       [21., 79., 14., ...,  0., 81., 81.],
       [81., 57., 81., ..., 81.,  0., 51.],
       [78., 73., 81., ..., 81., 51.,  0.]])

"interval"

Assuming there are $k$ states in the sequences and they are ordinal/continuous.

substitution cost = differences between states
insertion/deletion cost = $k-1$



In [7]:

    
seq_interval = Sequence(q5, dist_type="interval")
seq_interval









    Out[7]:





<giddy.sequence.Sequence at 0x7ff451d8d160>



In [8]:

    
seq_interval.seq_dis_mat









    Out[8]:





array([[  0., 123.,   7., ...,  21., 190., 225.],
       [123.,   0., 130., ..., 116.,  69., 108.],
       [  7., 130.,   0., ...,  14., 197., 232.],
       ...,
       [ 21., 116.,  14., ...,   0., 183., 218.],
       [190.,  69., 197., ..., 183.,   0.,  61.],
       [225., 108., 232., ..., 218.,  61.,   0.]])

"arbitrary"

substitution cost = 0.5
insertion/deletion cost = 1



In [9]:

    
seq_arbitrary = Sequence(q5, dist_type="arbitrary")
seq_arbitrary









    Out[9]:





<giddy.sequence.Sequence at 0x7ff451d8dc18>



In [10]:

    
seq_arbitrary.seq_dis_mat









    Out[10]:





array([[ 0. , 37.5,  3.5, ..., 10.5, 40.5, 39. ],
       [37.5,  0. , 40. , ..., 39.5, 28.5, 36.5],
       [ 3.5, 40. ,  0. , ...,  7. , 40.5, 40.5],
       ...,
       [10.5, 39.5,  7. , ...,  0. , 40.5, 40.5],
       [40.5, 28.5, 40.5, ..., 40.5,  0. , 25.5],
       [39. , 36.5, 40.5, ..., 40.5, 25.5,  0. ]])

"markov"

substitution cost = $1-\frac{p_{ij}+p_{ji}}{2}$ where $p_{ij}$ is the empirical rate of transitioning from state $i$ to $j$
insertion/deletion cost = 1



In [11]:

    
seq_markov = Sequence(q5, dist_type="markov")
seq_markov









    Out[11]:





<giddy.sequence.Sequence at 0x7ff451d8df28>



In [12]:

    
seq_markov.seq_dis_mat









    Out[12]:





array([[ 0.        , 72.31052406,  6.34073233, ..., 19.02219698,
        80.2334688 , 77.48002783],
       [72.31052406,  0.        , 77.05042347, ..., 74.77437281,
        50.75696949, 65.9128181 ],
       [ 6.34073233, 77.05042347,  0.        , ..., 12.68146465,
        80.97128589, 80.51785856],
       ...,
       [19.02219698, 74.77437281, 12.68146465, ...,  0.        ,
        80.10306616, 80.46369148],
       [80.2334688 , 50.75696949, 80.97128589, ..., 80.10306616,
         0.        , 41.57088046],
       [77.48002783, 65.9128181 , 80.51785856, ..., 80.46369148,
        41.57088046,  0.        ]])

"tran"

Biemann, T. (2011). A Transition-Oriented Approach to Optimal Matching. Sociological Methodology, 41(1), 195–221. https://doi.org/10.1111/j.1467-9531.2011.01235.x



In [13]:

    
seq_tran = Sequence(q5, dist_type="tran")
seq_tran









    Out[13]:





<giddy.sequence.Sequence at 0x7ff451d8d588>



In [14]:

    
seq_tran.seq_dis_mat









    Out[14]:





array([[ 0., 23.,  8., ..., 12., 24., 21.],
       [23.,  0., 17., ..., 16., 28., 22.],
       [ 8., 17.,  0., ...,  4., 18., 16.],
       ...,
       [12., 16.,  4., ...,  0., 21., 15.],
       [24., 28., 18., ..., 21.,  0., 23.],
       [21., 22., 16., ..., 15., 23.,  0.]])



In [21]:

    
seq_tran.seq_dis_mat









    Out[21]:





array([[  0., 220.,  25., ...,  55., 220., 220.],
       [220.,   0., 241., ..., 199.,  93., 123.],
       [ 25., 241.,   0., ...,  44., 241., 241.],
       ...,
       [ 55., 199.,  44., ...,   0., 207., 220.],
       [220.,  93., 241., ..., 207.,   0.,  84.],
       [220., 123., 241., ..., 220.,  84.,   0.]])



In [ ]: