This notebook introduces alignment-based sequence methods, operationalized by the Optimal Matching (OM) algorithm. OM was originally developed for matching protein and DNA sequences in biology, has been used extensively for analyzing strings in computer science, and has recently been widely applied to explore neighborhood change.
It generally works by finding the minimum cost of aligning one sequence to match another using a combination of operations, including substitution, insertion, deletion, and transposition. The cost of each operation can be parameterized differently and may be theory-driven or data-driven. The minimum cost is taken as the distance between the two sequences.
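The minimum-cost alignment described above can be sketched with a small dynamic program. This is an illustrative sketch only, not giddy's implementation: the function name `om_distance` and its parameters `sub_cost` and `indel` are hypothetical, transposition is left out for brevity, and the toy sequences are made up.

```python
def om_distance(a, b, sub_cost=1.0, indel=1.0):
    """Minimum total cost of editing sequence a into sequence b.

    sub_cost is either a constant or a function (x, y) -> cost;
    indel is the cost of one insertion or one deletion.
    """
    cost = sub_cost if callable(sub_cost) else (lambda x, y: sub_cost)
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest way to turn a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel  # delete every element of a[:i]
    for j in range(1, m + 1):
        d[0][j] = j * indel  # insert every element of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(a[i - 1], b[j - 1])
            d[i][j] = min(
                d[i - 1][j - 1] + sub,  # match or substitute
                d[i - 1][j] + indel,    # delete a[i - 1]
                d[i][j - 1] + indel,    # insert b[j - 1]
            )
    return d[n][m]


# Two made-up sequences of quintile classes
s1 = [0, 1, 2, 2, 3]
s2 = [1, 2, 2, 3, 3]

# Unit costs: one deletion plus one insertion is cheaper than three substitutions
unit = om_distance(s1, s2)

# Ordinal substitution costs with an expensive indel: substitutions win
interval = om_distance(s1, s2, sub_cost=lambda x, y: abs(x - y), indel=4)
```

Changing the cost parameters changes which alignment is cheapest, which is why the choice of substitution and indel costs matters for the resulting distances.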
The sequence module in giddy provides a suite of alignment-based sequence methods.
Author: Wei Kang weikang9009@gmail.com
In [1]:
    
import numpy as np
import pandas as pd
    
In [2]:
    
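import libpysal
import mapclassify as mc
# Per capita incomes of the 48 contiguous US states, 1929-2009
f = libpysal.io.open(libpysal.examples.get_path("usjoin.csv"))
pci = np.array([f.by_col[str(y)] for y in range(1929, 2010)])
# Classify each year's incomes into quintiles, then transpose
# so that each row is one state's sequence of quintile classes
q5 = np.array([mc.Quantiles(y, k=5).yb for y in pci]).transpose()
q5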
    
    
    Out[2]:
In [3]:
    
q5.shape
    
    Out[3]:
Import the Sequence class from giddy.sequence:
In [4]:
    
from giddy.sequence import Sequence
    
In [5]:
    
seq_hamming = Sequence(q5, dist_type="hamming")
seq_hamming
    
    Out[5]:
In [6]:
    
seq_hamming.seq_dis_mat  # pairwise sequence distance matrix
    
    Out[6]:
In [7]:
    
seq_interval = Sequence(q5, dist_type="interval")
seq_interval
    
    Out[7]:
In [8]:
    
seq_interval.seq_dis_mat
    
    Out[8]:
In [9]:
    
seq_arbitrary = Sequence(q5, dist_type="arbitrary")
seq_arbitrary
    
    Out[9]:
In [10]:
    
seq_arbitrary.seq_dis_mat
    
    Out[10]:
In [11]:
    
seq_markov = Sequence(q5, dist_type="markov")
seq_markov
    
    Out[11]:
In [12]:
    
seq_markov.seq_dis_mat
    
    Out[12]:
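As an illustration of what a data-driven cost can look like, the sketch below derives substitution costs from empirical transition probabilities. It is a hedged example, not a description of what giddy computes for dist_type="markov": the helper names `transition_probs` and `substitution_costs` are hypothetical, the toy sequences are made up, and the formula 2 - p(i→j) - p(j→i) is one form used in the sequence analysis literature, chosen here as an assumption.

```python
from collections import Counter


def transition_probs(sequences, k):
    """Row-normalized first-order transition probabilities between k states."""
    counts = Counter()
    for seq in sequences:
        # Count every observed move from one period's state to the next
        for x, y in zip(seq[:-1], seq[1:]):
            counts[(x, y)] += 1
    probs = [[0.0] * k for _ in range(k)]
    for i in range(k):
        row_total = sum(counts[(i, j)] for j in range(k))
        for j in range(k):
            if row_total:
                probs[i][j] = counts[(i, j)] / row_total
    return probs


def substitution_costs(probs):
    """Assumed cost rule: frequent transitions make substitution cheaper."""
    k = len(probs)
    return [
        [0.0 if i == j else 2.0 - probs[i][j] - probs[j][i] for j in range(k)]
        for i in range(k)
    ]


# Made-up sequences over two states
toy = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
p = transition_probs(toy, k=2)
costs = substitution_costs(p)
```

The same helpers could be applied to the quintile sequences, for example with `transition_probs(q5.tolist(), k=5)`; states that often follow one another then become cheap to substitute.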
Biemann, T. (2011). A Transition-Oriented Approach to Optimal Matching. Sociological Methodology, 41(1), 195–221. https://doi.org/10.1111/j.1467-9531.2011.01235.x
In [13]:
    
seq_tran = Sequence(q5, dist_type="tran")
seq_tran
    
    Out[13]:
In [14]:
    
seq_tran.seq_dis_mat
    
    Out[14]: