Huffman Encoding

Huffman encoding is a form of lossless compression allowing files to be made smaller by reducing the number of bits needed for each character. Before learning about Huffman Encoding, it’s recommended that you are familiar with character sets and binary.

Huffman encoding uses a technique called Binary Trees to create a unique bit pattern for each character in the file based on the frequency with which the character appears.

To understand how this technique saves space, let’s use an example:

T0EACHALLABOUTIT

If we were to represent this using standard ASCII, we would require 7 bits for each character and multiply this by the number of characters in the string. So in this case:

7 * 15 = 105 bits
105 / 8 = 14 bytes (rounded up)

But the text doesn’t actually use all of the characters in the ASCII character set, so we could create a reduced character set by just using the characters from the string and assigning each one a different binary number in its smallest format. In this case this would be:

T0
E1
A10
C11
H100
L101
B110
O111
U1000
I1001

Using this reduced character set, the string “TEACHALLABOUTIT” can be reduced as follows:

T = 0 : 1 bit 
E = 1 : 1 bit
A = 10 : 2 bits
C = 11 : 2 bits
H = 100 : 3 bits
A = 10 : 2 bits
L = 101 : 3 bits
L = 101 : 3 bits
A = 10 : 2 bits
B = 110 : 3 bits
O = 111 : 3 bits
U = 1000 : 4 bits
T = 0 : 1 bit
I = 1001 : 4 bits
T = 0 : 1 bit

Total= 35 bits
35 / 8 = 5 Bytes (rounded up)

This is a saving of 9 bytes just in this small piece of text!

However, the problem with doing this is ambiguity. If a program receives the bit pattern 1001, it’s impossible to tell whether this means: I, or HE, or ATE!

Using huffman encoding takes away the ambiguity by making a set of unique bit patterns that can’t be mistaken for another character, and assigning the smallest bit patterns to the most frequently used characters making the most efficient use of the character set.

The process of huffman encoding starts off by identifying the frequency with which each letter appears in the string, putting the least frequent at the top. So for our string “TEACHALLABOUTIT”:

E - 1
C - 1
H - 1
B - 1
O - 1
U - 1
I - 1
L - 2
A - 3
T - 3

When ordering characters that have the same frequency, we can either put them in order of how they appear in the string (shown here) or in alphabetical order.

After this, we can start to build the binary tree from the bottom by placing the top two least frequent numbers into a “node”, then combining them:

Huffman Encoding Step 1 - GCSE Computer science

After combining the nodes, the frequency table is updated placing the combined frequency under any single characters with the same frequency:

H - 1
B - 1
O - 1
U - 1
I - 1
L - 2
EC - 2
A - 3
T - 3

Once updated, a second root node is created with the next two least frequent characters:

Huffman Encoding Step 2 - GCSE Computer science

This forms the bottom of the tree, and allows us to build up to the next level by adding the least frequent letter to the roots:

O - 1
U - 1
I - 1
L - 2
EC - 2
HB - 2
A - 3
T - 3

Huffman Encoding Part 3 - GCSE Computer science

Once the next level is added, further characters can be added until all single characters have been added to the tree. Characters with lower frequency are placed to the left of the tree:

I - 1
L - 2
A - 3
T - 3
OEC - 3
UHB - 3

Huffman Encoding - GCSE Computer science

A - 3
T - 3
IOEC - 4
LUHB - 5

Huffman Encoding Step 5 - GCSE Computer science

Finally, the two sides of the tree are connected together (to check that the tree has been formed correctly, the top root node should have a frequency that is the same as the length of the string:

Huffman Encoding Step 6 - GCSE Computer science

Now that the tree has been formed, we can use it to create unique bit patterns for each character. We do this by assigning a zero each time we move left down the tree, and one when we move right. For example, to create the letter “O” we move LEFT, RIGHT, RIGHT, LEFT, creating 0110:

Huffman Encoding Part 7 - GCSE Computer science

Using this method, because no other character starts with 011 it can’t be mistaken for another character.

Finally, having worked out the bit patterns for each character, we can calculate the size of the file:

A = 00 : 2 bits (*3) 
T = 10 : 2 bits  (*3)
I = 010 : 3 bits  (*1)
L = 110 : 3 bits (*2)
O = 0110 : 4 bits (*1)
U = 1110 : 4 bits (*1)
E = 01110 : 5 bits (*1)
C = 01111 : 5 bits (*1)
H = 11110 : 5 bits (*1)
B = 11111 : 5 bits (*1)

Total = 49 bits
49 / 8 = 7 Bytes (rounded up)

Find this page helpful? Share the love on your social media mentioning @TeachAllAboutIT and we’ll enter you in our monthly draw to win a gift voucher for any product on the site!

More For Members

Lesson Plan

Coming Soon!

Presentation

Coming Soon!

Homework

Click to download

Students

Click to Revise!

Not a member yet? Sign Up Here

Sign Up For Membership Today

Level Price  
Individual Teacher £3.99 per Month. After your initial payment, your first payment is Free. Select
Individual Student £2.50 per Month. After your initial payment, your first payment is Free. Select
Whole School £12.50 per Month. Select
iGCSE Computer Science Distance Learning (Feb Start) £21.00 now and then £21.00 per Month for 15 more Months. Select
iGCSE Computer Science Distance Learning (Sept Start) £21.00 now and then £21.00 per Month for 15 more Months. Select