The original article (A Method for the Construction of Minimum-Redundancy Codes) is easily reachable by typing "Huffman 1952" in your favorite search engine.
Huffman coding is a lossless data compression algorithm.
It can be applied to chains of symbols, such as text.
It is optimal if the probability mass function of the symbols are known and they are independent and identically distributed.
In [1]:
#first, we some string of symbols to encode
text = "ABCABAACABBC"
In [31]:
#let's find out how many of each symbol we saw with this function
#we will use the defaultdict, for higher performance
from collections import defaultdict
def get_occurence_dict(text):
"""get the mapping {symbol : number of occurences in the text}"""
sym_to_occurrence = defaultdict(int) # char c -> int occurrences
for sym in text:
sym_to_occurrence[sym] = sym_to_occurrence.get(sym, 0) + 1
return sym_to_occurrence
occurrences = get_occurence_dict(text)
print(sorted(occurrences.items()))
In [32]:
#from the number of times we saw each symbol, let's calculate their respective frequency
def get_frequency_table(occurence_dict, text_length):
"""convert occurrences to frequency"""
frequency_table = [] # (char c, frequency (sums to 1))
for k, v in occurence_dict.items():
frequency_table.append((v / text_length, [k]))
return frequency_table
frequencies = get_frequency_table(occurrences, len(text))
print(sorted(frequencies))
In [33]:
#to make the procedure efficient, we will use a heap (min-heap)
from heapq import heapify, heappop, heappush
def generate_binary_code(frequency_table):
"""generates binary code for each sym
use the exact same encoding as in the article"""
heapify(frequency_table)
code = defaultdict(list)
while len(frequency_table) > 1:
one = heappop(frequency_table)
zero = heappop(frequency_table)
for sym in one[1]:
code[sym].append(1)
for sym in zero[1]:
code[sym].append(0)
heappush(frequency_table, (one[0] + zero[0], one[1] + zero[1]))
return code
code = generate_binary_code(frequencies)
In [34]:
for (sym, binary) in sorted(code.items()):
print("{0} {1}".format(sym, binary))
In [35]:
encoded_text = []
for sym in text:
encoded_text += code[sym]
print(encoded_text)
So the encoded message is actually bigger than the original one...
Python doesn't allow low-level control of memory, so we will use C code to represent a bit array data structure.
In [ ]: