Unicode and ASCII are fixed-length codes: they map characters to bits, but every character (letter, digit, punctuation mark) maps to the same number of bits (7 for ASCII).
The Huffman code is a variable-length code. The code length (in bits) of each datum (data point) is inversely related to its frequency in the data: frequent characters get short codewords, rare ones get long codewords. This is a lossless compression scheme. The frequencies are calculated from a corpus - a large, representative data set.
To preserve the reversibility of the code (it must be one-to-one) we use prefix-free codes - no codeword is a prefix of another codeword, i.e. if 0110 is a codeword, then no other codeword begins with 0110.
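A quick way to sanity-check the prefix-free property is to sort the codewords: a prefix, if one exists, sorts immediately before one of its extensions. This helper is a sketch, not part of the notebook's code:

```python
def is_prefix_free(codes):
    """Return True if no codeword in the collection is a prefix of another."""
    codes = sorted(codes)
    # after sorting, any prefix relation shows up between adjacent entries
    return all(not b.startswith(a) for a, b in zip(codes, codes[1:]))

print(is_prefix_free(['0', '10', '110', '111']))   # True: a valid prefix-free code
print(is_prefix_free(['0', '01', '110']))          # False: '0' is a prefix of '01'
```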
We start with a simple code for creating a histogram of the data:
In [1]:
def char_count(text):
    histogram = {}
    for ch in text:
        histogram[ch] = histogram.get(ch, 0) + 1
    return histogram
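The standard library already provides this histogram as `collections.Counter`; an equivalent one-liner (a side note, not used in the cells below):

```python
from collections import Counter

def char_count_stdlib(text):
    """Equivalent histogram using the standard library's Counter."""
    return dict(Counter(text))
```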
In [2]:
letter_count = char_count("live and let live")
letter_count
Out[2]:
Next we need to build the Huffman tree.
In [3]:
def by_value(tup):
    return tup[1]

def build_huffman_tree(letter_count):
    """ receives a dictionary with char:count entries
    generates a nested tuple of letters representing the binary Huffman encoding tree"""
    queue = list(letter_count.items())
    while len(queue) >= 2:
        queue.sort(key=by_value)  # sort by the count
        # combine the two least frequent elements
        elem1, freq1 = queue.pop(0)
        elem2, freq2 = queue.pop(0)
        elems = (elem1, elem2)
        freqs = freq1 + freq2
        queue.append((elems, freqs))
    return queue[0][0]
In [4]:
tree = build_huffman_tree(letter_count)
tree
Out[4]:
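Re-sorting the queue on every iteration makes `build_huffman_tree` quadratic. For larger alphabets the same merging can be done in O(n log n) with a heap; a sketch using `heapq` (the running counter is only a tie-breaker, so `heapq` never has to compare the elements themselves, which become unorderable nested tuples):

```python
import heapq

def build_huffman_tree_heap(letter_count):
    """Same merging as build_huffman_tree, using a min-heap keyed on counts."""
    # entries are (count, tie_breaker, element)
    heap = [(cnt, i, ch) for i, (ch, cnt) in enumerate(letter_count.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) >= 2:
        freq1, _, elem1 = heapq.heappop(heap)
        freq2, _, elem2 = heapq.heappop(heap)
        heapq.heappush(heap, (freq1 + freq2, counter, (elem1, elem2)))
        counter += 1
    return heap[0][2]
```

Up to tie-breaking among equal counts it produces the same tree shape, hence the same codeword lengths.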
In [5]:
def build_codebook(huff_tree, prefix=''):
    """ receives a Huffman tree with embedded encoding and a prefix of encodings.
    returns a dictionary where characters are keys and associated binary strings are values."""
    if isinstance(huff_tree, str):  # a leaf
        return {huff_tree: prefix}
    else:
        left, right = huff_tree
        codebook = {}
        codebook.update(build_codebook(left, prefix + '0'))
        codebook.update(build_codebook(right, prefix + '1'))
        return codebook
In [6]:
codebook = build_codebook(tree)
for ch, freq in sorted(letter_count.items(), key=by_value, reverse=True):
    print(ch, freq, codebook[ch])
In [7]:
def compress(text, codebook):
    ''' compress text using the codebook dictionary '''
    return ''.join(codebook[ch] for ch in text if ord(ch) < 128)
In [13]:
ch, ord(ch), '{:07b}'.format(ord(ch))  # ch still holds the last letter from the loop above
Out[13]:
In [21]:
bits_ascii = ''.join(['{:07b}'.format(ord(ch)) for ch in "live and let live" if ord(ch) < 128])
huff_bits = compress("live and let live", codebook)
In [22]:
print(len(bits_ascii),bits_ascii)
print(len(huff_bits),huff_bits)
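For comparison, the Shannon entropy of the character distribution is a lower bound on the average number of bits per character that any prefix-free code can achieve, and the Huffman code stays within one bit of it. A self-contained sketch (the entropy formula is standard information theory, not part of the notebook's code):

```python
from math import log2

def entropy_bits(text):
    """Shannon entropy, in bits per character, of the text's character distribution."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(text)
    return -sum(c / n * log2(c / n) for c in counts.values())

h = entropy_bits("live and let live")
print(h, '<= average Huffman bits per character <', h + 1)
```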
In [12]:
def build_decoding_dict(codebook):
    """build the "reverse" of the encoding dictionary"""
    return {v: k for k, v in codebook.items()}
In [13]:
decodebook = build_decoding_dict(codebook)
decodebook
Out[13]:
In [14]:
def decompress(bits, decodebook):
    prefix = ''
    result = []
    for bit in bits:
        prefix += bit
        if prefix not in decodebook:
            continue
        result.append(decodebook[prefix])
        prefix = ''
    assert prefix == ''  # must finish on a complete codeword
    return ''.join(result)  # convert the list of chars to a string
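The decoding dictionary works because the code is prefix-free: the first codeword that matches is the only one that can. An equivalent approach walks the Huffman tree bit by bit and avoids building the reverse dictionary altogether; a sketch, assuming a tree of nested 2-tuples with string leaves as built above:

```python
def decompress_tree(bits, huff_tree):
    """Decode by walking the tree: '0' goes left, '1' goes right."""
    result = []
    node = huff_tree
    for bit in bits:
        node = node[0] if bit == '0' else node[1]
        if isinstance(node, str):   # reached a leaf: emit the char, restart at the root
            result.append(node)
            node = huff_tree
    assert node is huff_tree        # must finish on a complete codeword
    return ''.join(result)
```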
In [15]:
decompress(huff_bits, decodebook)
Out[15]:
In [16]:
def huffman_code(corpus):
    counts = char_count(corpus)
    tree = build_huffman_tree(counts)
    return build_codebook(tree)
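Putting all the pieces together, encoding and then decoding must reproduce the original text. A compact, self-contained round-trip sketch (it re-implements the notebook's steps inline and assumes the text has at least two distinct characters):

```python
def huffman_roundtrip(text):
    """Encode and decode text with a Huffman code built from the text itself."""
    # frequency count
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # build the tree by repeatedly merging the two least frequent entries
    queue = list(counts.items())
    while len(queue) >= 2:
        queue.sort(key=lambda t: t[1])
        (e1, f1), (e2, f2) = queue.pop(0), queue.pop(0)
        queue.append(((e1, e2), f1 + f2))
    tree = queue[0][0]
    # derive the codebook from the tree
    def codes(node, prefix=''):
        if isinstance(node, str):
            return {node: prefix}
        book = codes(node[0], prefix + '0')
        book.update(codes(node[1], prefix + '1'))
        return book
    book = codes(tree)
    bits = ''.join(book[ch] for ch in text)
    # decode with the reverse dictionary
    rev = {v: k for k, v in book.items()}
    out, prefix = [], ''
    for bit in bits:
        prefix += bit
        if prefix in rev:
            out.append(rev[prefix])
            prefix = ''
    return ''.join(out)
```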
There is only one way to create the codebook for this corpus:
In [17]:
huffman_code('aaabbc')
Out[17]:
There are different ways to create a codebook for this corpus - b and c are interchangeable:
In [18]:
huffman_code('aaabbcc')
Out[18]:
Here the frequency of all characters is equal, so the code is nearly a fixed-length code - most characters get 5-bit codewords and the rest get 4-bit codewords:
In [19]:
codebook = huffman_code('qwertuioplkjhgfdsazxcvbnm')
codebook
Out[19]:
Average length of 4.72, most codes are of length 5:
In [20]:
from statistics import mean
mean([len(v) for v in codebook.values()])
Out[20]:
In the next corpus every letter appears a different number of times:
In [21]:
corpus = ''.join([ch * (i + 1) for i, ch in enumerate('qwertuioplkjhgfdsazxcvbnm')])
corpus
Out[21]:
In [22]:
huffman_code(corpus)
Out[22]:
What do we do if some characters in the text did not appear in the corpus?
In [23]:
print(''.join([chr(c) for c in range(128)]))
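One common answer (an assumption here, not a solution the notebook commits to) is add-one smoothing: give every character that could ever appear a baseline count of 1, so each one gets a codeword even when it is absent from the corpus. The counts can then be fed to `build_huffman_tree` as before; the hypothetical helper below sketches the idea over the 7-bit ASCII range:

```python
def char_count_smoothed(text, alphabet=None):
    """Counts with add-one smoothing over a fixed alphabet, so every
    character in the alphabet gets a codeword even if absent from text."""
    if alphabet is None:
        alphabet = [chr(c) for c in range(128)]  # the whole 7-bit ASCII range
    histogram = {ch: 1 for ch in alphabet}       # baseline count of 1 for everyone
    for ch in text:
        histogram[ch] = histogram.get(ch, 0) + 1
    return histogram

print(len(char_count_smoothed("live and let live")))  # 128 keys, one per ASCII char
```

The cost is that the rare, never-seen characters inflate the tree and slightly lengthen the common codewords.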