Click to see the rest of the Data Representation section :
Huffman encoding is a form of lossless compression allowing files to be made smaller by reducing the number of bits needed for each character. Before learning about Huffman Encoding, it’s recommended that you are familiar with character sets and binary.
Huffman encoding uses a technique called Binary Trees to create a unique bit pattern for each character in the file based on the frequency with which the character appears.
To understand how this technique saves space, let’s use an example:
T | 0 | E | A | C | H | A | L | L | A | B | O | U | T | I | T |
If we were to represent this using standard ASCII, we would require 7 bits for each character and multiply this by the number of characters in the string. So in this case:
7 * 15 = 105 bits 105 / 8 = 14 bytes (rounded up)
But the text doesn’t actually use all of the characters in the ASCII character set, so we could create a reduced character set by just using the characters from the string and assigning each one a different binary number in its smallest format. In this case this would be:
T | 0 |
E | 1 |
A | 10 |
C | 11 |
H | 100 |
L | 101 |
B | 110 |
O | 111 |
U | 1000 |
I | 1001 |
Using this reduced character set, the string “TEACHALLABOUTIT” can be reduced as follows:
T = 0 : 1 bit
E = 1 : 1 bit
A = 10 : 2 bits
C = 11 : 2 bits
H = 100 : 3 bits
A = 10 : 2 bits
L = 101 : 3 bits
L = 101 : 3 bits
A = 10 : 2 bits
B = 110 : 3 bits
O = 111 : 3 bits
U = 1000 : 4 bits
T = 0 : 1 bit
I = 1001 : 4 bits
T = 0 : 1 bit
Total= 35 bits
35 / 8 = 5 Bytes (rounded up)
This is a saving of 9 bytes just in this small piece of text!
However, the problem with doing this is ambiguity. If a program receives the bit pattern 1001, it’s impossible to tell whether this means: I, or HE, or ATE!
Using huffman encoding takes away the ambiguity by making a set of unique bit patterns that can’t be mistaken for another character, and assigning the smallest bit patterns to the most frequently used characters making the most efficient use of the character set.
The process of huffman encoding starts off by identifying the frequency with which each letter appears in the string, putting the least frequent at the top. So for our string “TEACHALLABOUTIT”:
E - 1
C - 1
H - 1
B - 1
O - 1
U - 1
I - 1
L - 2
A - 3
T - 3
When ordering characters that have the same frequency, we can either put them in order of how they appear in the string (shown here) or in alphabetical order.
After this, we can start to build the binary tree from the bottom by placing the top two least frequent numbers into a “node”, then combining them:
After combining the nodes, the frequency table is updated placing the combined frequency under any single characters with the same frequency:
H - 1
B - 1
O - 1
U - 1
I - 1
L - 2
EC - 2
A - 3
T - 3
Once updated, a second root node is created with the next two least frequent characters:
This forms the bottom of the tree, and allows us to build up to the next level by adding the least frequent letter to the roots:
O - 1
U - 1
I - 1
L - 2
EC - 2
HB - 2
A - 3
T - 3
Once the next level is added, further characters can be added until all single characters have been added to the tree. Characters with lower frequency are placed to the left of the tree:
I - 1
L - 2
A - 3
T - 3
OEC - 3
UHB - 3
A - 3
T - 3
IOEC - 4
LUHB - 5
Finally, the two sides of the tree are connected together (to check that the tree has been formed correctly, the top root node should have a frequency that is the same as the length of the string:
Now that the tree has been formed, we can use it to create unique bit patterns for each character. We do this by assigning a zero each time we move left down the tree, and one when we move right. For example, to create the letter “O” we move LEFT, RIGHT, RIGHT, LEFT, creating 0110:
Using this method, because no other character starts with 011 it can’t be mistaken for another character.
Finally, having worked out the bit patterns for each character, we can calculate the size of the file:
A = 00 : 2 bits (*3)
T = 10 : 2 bits (*3)
I = 010 : 3 bits (*1)
L = 110 : 3 bits (*2)
O = 0110 : 4 bits (*1)
U = 1110 : 4 bits (*1)
E = 01110 : 5 bits (*1)
C = 01111 : 5 bits (*1)
H = 11110 : 5 bits (*1)
B = 11111 : 5 bits (*1)
Total = 49 bits
49 / 8 = 7 Bytes (rounded up)
Find this page helpful? Share the love on your social media mentioning @TeachAllAboutIT and we’ll enter you in our monthly draw to win a gift voucher for any product on the site!
More For Members
Not a member yet? Sign Up Here