Messing around with Huffman coding in python3 and C

The original article (A Method for the Construction of Minimum-Redundancy Codes) is easily reachable by typing "Huffman 1952" in your favorite search engine.

What is huffman coding?

Huffman coding is a lossless data compression algorithm.

It can be applied to chains of symbols, such as text.

It is optimal if the probability mass function of the symbols are known and they are independent and identically distributed.

Give me an example


In [1]:
#first, we some string of symbols to encode
text = "ABCABAACABBC"

Step 1: figuring out the probability of each symbol


In [31]:
#let's find out how many of each symbol we saw with this function
#we will use the defaultdict, for higher performance
from collections import defaultdict


def get_occurence_dict(text):
    """get the mapping {symbol : number of occurences in the text}"""
    sym_to_occurrence = defaultdict(int)  # char c -> int occurrences
    for sym in text:
        sym_to_occurrence[sym] = sym_to_occurrence.get(sym, 0) + 1
    return sym_to_occurrence


occurrences = get_occurence_dict(text)

print(sorted(occurrences.items()))


[('A', 5), ('B', 4), ('C', 3)]

In [32]:
#from the number of times we saw each symbol, let's calculate their respective frequency
def get_frequency_table(occurence_dict, text_length):
    """convert occurrences to frequency"""
    frequency_table = []  # (char c, frequency (sums to 1))
    for k, v in occurence_dict.items():
        frequency_table.append((v / text_length, [k]))
    return frequency_table


frequencies = get_frequency_table(occurrences, len(text))
    

print(sorted(frequencies))


[(0.25, ['C']), (0.3333333333333333, ['B']), (0.4166666666666667, ['A'])]

Step 2: from the probabilities, generate the binary code for each symbol


In [33]:
#to make the procedure efficient, we will use a heap (min-heap)
from heapq import heapify, heappop, heappush

def generate_binary_code(frequency_table):
    """generates binary code for each sym
    use the exact same encoding as in the article"""
    heapify(frequency_table)
    code = defaultdict(list)

    while len(frequency_table) > 1:
        one  = heappop(frequency_table)
        zero = heappop(frequency_table)
        for sym in one[1]:
            code[sym].append(1)
        for sym in zero[1]:
            code[sym].append(0)
        heappush(frequency_table, (one[0] + zero[0], one[1] + zero[1]))
    return code

code = generate_binary_code(frequencies)

In [34]:
for (sym, binary) in sorted(code.items()):
    print("{0}    {1}".format(sym, binary))


A    [1]
B    [0, 0]
C    [1, 0]

Step 3. Encode the text


In [35]:
encoded_text = []


for sym in text:
    encoded_text += code[sym]
    
    
print(encoded_text)


[1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0]

Step 4. Transform the encoding into bits

So the encoded message is actually bigger than the original one...

Python doesn't allow low-level control of memory, so we will use C code to represent a bit array data structure.


In [ ]: