The principle behind this technique is to design a binary code for the character set, using a variable number of bits per character so that frequent characters receive short codewords, thereby reducing the total length of the encoded document.
To allow easy, unambiguous decoding, we require the code to be a prefix code: no codeword is a prefix of another.
For example, if the codewords are {0, 01, 11, 001}, decoding the string 001 is ambiguous: it can be read as 0-01 (two characters) or as 001 (a single character).
Solution: insist on the prefix-free property, i.e., no codeword may be a prefix of another codeword, where a codeword is the string of bits that represents a character in the encoding.
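The prefix-free property can be tested directly. Below is a minimal sketch; `is_prefix_free` is a hypothetical helper written for illustration, not part of the notes.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of another."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False  # a is a prefix of b: decoding can be ambiguous
    return True

print(is_prefix_free(["0", "01", "11", "001"]))   # False: "0" is a prefix of "01" and "001"
print(is_prefix_free(["10", "01", "11", "000"]))  # True: every string decodes unambiguously
```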
Details
Given a character set C and the frequency f(c) of each character c ∈ C in a document A, the problem is to find a prefix code that minimizes the total length of the encoded document.
The tree representing such an optimal code is called a Huffman tree T . We aim to minimize
B(T, A) = Σ_{c ∈ C} f(c) · d(c), where
- d(c) is the depth of character c in the tree T, i.e., the number of bits used to encode c (the length of its codeword);
- B(T, A) is the total number of bits needed to encode document A using the Huffman tree T.
The problem is to design a tree T to minimize B(T,A).
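As a concrete check of the cost function, the sketch below evaluates B(T, A) for a hypothetical set of frequencies and depths; the numbers are illustrative assumptions, not taken from the notes.

```python
# B(T, A) = sum over all characters c of f(c) * d(c)
freq  = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}   # f(c), assumed
depth = {"a": 1,  "b": 3,  "c": 3,  "d": 3,  "e": 4,  "f": 4}  # d(c) in some prefix-code tree
B = sum(freq[c] * depth[c] for c in freq)
print(B)  # 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4 = 224 bits
```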
- There are no nodes with only a single child (such a node could be replaced by its child, shortening the codewords beneath it).
- Rearranging characters within the same level of the tree T does not change B(T, A).
- If a more-frequent character sits at a deeper level than a less-frequent one, swapping the two reduces B(T, A), where the level of the tree root is the lowest level. Hence, in an optimal tree, more frequent characters are never deeper than less frequent ones.
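The swap in the last property can be verified arithmetically: if character x has frequency fx at depth dx and character y has frequency fy at depth dy, with fx > fy and dx > dy, exchanging their positions saves (fx - fy)(dx - dy) bits. A tiny sketch with assumed numbers:

```python
fx, dx = 10, 3  # more frequent character, deeper in the tree (assumed values)
fy, dy = 4, 1   # less frequent character, closer to the root (assumed values)
before = fx * dx + fy * dy  # these two characters' contribution to B(T, A)
after  = fx * dy + fy * dx  # contribution after swapping their positions
print(before - after)       # (fx - fy) * (dx - dy) = 6 * 2 = 12 bits saved
```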
We have a greedy strategy for the Huffman coding problem: start with each character as a single-node tree, then repeatedly extract the two trees with the smallest total frequencies and merge them into one tree (whose frequency is the sum of the two), until a single tree remains.
For efficient implementation of the Huffman tree construction, we can use a priority queue data structure, implemented as a minimum heap keyed on frequency.
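The greedy strategy with a min-heap can be sketched as follows using Python's `heapq`; the tuple-based tree representation and the tiebreaker counter are implementation choices of this sketch, not prescribed by the notes.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Greedy Huffman coding: repeatedly merge the two least-frequent
    subtrees until one remains, then read codewords off the tree."""
    tie = count()  # unique tiebreaker so heapq never compares subtrees
    # Heap entry: (subtree frequency, tiebreaker, subtree).
    # A subtree is either a character (leaf) or a (left, right) pair.
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                  # O(n)
    while len(heap) > 1:                 # n - 1 merges, O(log n) each
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    # Walk the tree: going left appends '0', going right appends '1'.
    codes = {}
    def walk(node, word):
        if isinstance(node, tuple):
            walk(node[0], word + "0")
            walk(node[1], word + "1")
        else:
            codes[node] = word or "0"    # a lone character still needs one bit
    walk(heap[0][2], "")
    return codes

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman_codes(freq)
print(sum(freq[ch] * len(w) for ch, w in codes.items()))  # B(T, A) = 224
```

The resulting code table is prefix-free by construction, since every character sits at a leaf of the tree.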
The running time is O(n) + O(n) + O(n log n) = O(n log n): O(n) to create the n single-node trees, O(n) to build the heap, and O(n log n) for the n − 1 merges, each performing a constant number of O(log n) heap operations.