Day 21 - Burrows–Wheeler transform


Definition(s)

The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2.

When a character string is transformed by the BWT, the transformation permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.

function BWT (string s)
   create a table, rows are all possible rotations of s
   sort rows alphabetically
   return (last column of the table)
function inverseBWT (string s)
   create empty table
   repeat length(s) times
       // first insert creates first column
       insert s as a column of table before first column of the table
       sort rows of the table alphabetically
   return (row that ends with the 'EOF' character)

Algorithm(s)


In [48]:
END_CHAR = '$'

def rotate(s, n):
    return s[n:] + s[:n]

In [43]:
def bwt(text):
    return ''.join(s[-1] for s in sorted([rotate(text, i) for i in range(len(text))]))

In [71]:
def ibwt(text):
    table = [''] * len(text)
    for _ in range(len(text)):
        table = sorted([c + s for c, s in zip(text, table)])
        
    return next(filter(lambda s: s.endswith(END_CHAR), table)).strip(END_CHAR)

Run(s)


In [69]:
bwt('alex' + END_CHAR)


Out[69]:
'x$lae'

In [72]:
ibwt('x$lae')


Out[72]:
'alex'

In [75]:
bwt('abababababaababababaaabababbababbabcabababbcaccacabcabcbabab' + END_CHAR)


Out[75]:
'bbbabbbbbbbbab$acbbbbbbbccccaaaaaacaaaaaabaaaabaaaaabababcbba'

In [76]:
ibwt('bbbabbbbbbbbab$acbbbbbbbccccaaaaaacaaaaaabaaaabaaaaabababcbba')


Out[76]:
'abababababaababababaaabababbababbabcabababbcaccacabcabcbabab'