Understanding Huffman Coding: A Practical Approach
Huffman coding stands as a cornerstone in data compression techniques, offering a powerful method to reduce the size of digital information. This algorithm, which relies on the frequency of characters in a dataset, has a significant impact on the efficiency of data storage and transmission. By assigning shorter codes to more common characters and longer codes to less frequent ones, Huffman coding optimizes the representation of data, leading to substantial space savings.
The process of Huffman coding involves the construction of a binary tree, which forms the basis for encoding and decoding operations. To implement this technique, developers first analyze the frequency of each character in the input data.
They then build a tree structure where each leaf node represents a character, and the path from the root to a leaf determines its unique code. This approach not only enables efficient compression but also ensures lossless data recovery, making it a go-to solution for various applications in computer science and information theory.
Fundamental Principles
The fundamental idea behind Huffman coding is to use a frequency-sorted binary tree to encode symbols. This approach ensures that the code for any particular symbol is never a prefix of the bit string representing any other symbol, making it uniquely decodable.
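To see why this matters, consider a hypothetical code table (not derived from any particular input): a = 0, b = 10, c = 11. Because no code is a prefix of another, a bit string such as 01011 can only be split one way, 0 | 10 | 11, and therefore decodes unambiguously to "abc". If 'a' were instead given the code 1, it would be a prefix of both 10 and 11, and the decoder could not tell where one symbol ends and the next begins.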
Frequency Analysis
The first step in Huffman coding is to keep track of how often each symbol occurs in the data using a Map; the data itself can be any string, such as a URL.
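As a minimal sketch of this step (the countFrequencies name is illustrative and not part of the final solution; it assumes java.util.Map and java.util.HashMap are imported), the frequencies can be collected in a single pass over the input:

// Count how many times each character occurs in the input; one pass over N characters.
static Map<Character, Integer> countFrequencies(String input) {
    Map<Character, Integer> frequencies = new HashMap<>();
    for (int i = 0; i < input.length(); i++) {
        // merge() adds 1 to the existing count, or starts the count at 1 for a new character.
        frequencies.merge(input.charAt(i), 1, Integer::sum);
    }
    return frequencies;
}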
Tree Construction
The Huffman tree is constructed from the bottom up, starting with the least frequent symbols. The process involves the following steps:
1. Sort the symbols by frequency; this can be done efficiently with a min-heap.
2. Take the two elements with the lowest frequencies from the min-heap and create a parent node whose frequency (named cost in our solution below) equals their sum.
3. Remove these two elements from the min-heap and insert the new parent node back, so the frequency order is maintained.
4. Repeat steps 2 and 3 until only one element remains, which becomes the root of the Huffman tree.
This process continues until the complete tree is formed; a small worked example follows below, and we will go into more depth on how the construction works in the next sections.
The final tree structure determines the encoding for each symbol, with left branches represented by '0' and right branches by '1'.
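As a small worked example (the input string here is assumed purely for illustration), take "aaaabbcd", whose frequencies are a:4, b:2, c:1, d:1:
Merge the two cheapest symbols, c (1) and d (1), into an internal node of cost 2.
The two cheapest entries are now b (2) and the c/d node (2); merge them into a node of cost 4.
Merge a (4) with that node (4) to form the root of cost 8.
One possible resulting assignment (the exact 0/1 labels depend on how ties are broken) is a = 0, b = 10, c = 110, d = 111, so the message needs 4×1 + 2×2 + 1×3 + 1×3 = 14 bits instead of 8×8 = 64 bits of plain 8-bit characters.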
Building the Huffman Tree
Now here comes the fun part: to generate a Huffman code, one traverses the tree from the root to the desired symbol, appending '0' for every left branch taken and '1' for every right branch.
You might be wondering why 0 is used for left and 1 for right. Please don't get hung up on this detail; it's simply a convention of the algorithm, and either choice works as long as it is applied consistently.
Node Creation:
public class Node {
    Character data;
    int cost; // frequency
    Node left;
    Node right;

    public Node(Character data, int cost) {
        this.data = data;
        this.cost = cost;
        this.left = null;
        this.right = null;
    }
}
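To illustrate the traversal just described, here is a minimal sketch (the buildCodes name and the codes parameter are illustrative; the complete solution at the end of the article does the same job in its initEncoderDecoder method, and java.util.Map is assumed to be imported):

// Walk the tree, appending '0' for each left edge and '1' for each right edge.
// When a leaf is reached, the accumulated path is that character's Huffman code.
static void buildCodes(Node node, String path, Map<Character, String> codes) {
    if (node == null) {
        return;
    }
    if (node.left == null && node.right == null) {
        codes.put(node.data, path);
        return;
    }
    buildCodes(node.left, path + "0", codes);
    buildCodes(node.right, path + "1", codes);
}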
Priority Queue Implementation:
A priority queue (a min-heap) is used to efficiently manage the nodes during tree construction. The queue is initialized with a node for every unique character, prioritized by ascending frequency. This allows quick access to the least frequent characters, which are merged first and end up deepest in the Huffman tree.
Tree Assembly Process:
The tree assembly process follows these steps:
1. Push all characters with their frequencies into the priority queue.
2. While the queue has more than one node:
a. Remove the two nodes with the lowest frequencies from the queue.
b. Create a new internal node with these two as its children.
c. Set the new node's frequency to the sum of its children's frequencies.
d. Push the new node back into the priority queue.
The last remaining node in the queue becomes the root of the Huffman tree.
This process ensures that characters with lower frequencies are deeper in the tree, resulting in longer codes, while more frequent characters are closer to the root, having shorter codes.
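Here is a minimal sketch of that assembly loop, assuming the Node class shown earlier, a frequency map as built in the Frequency Analysis step, and the usual java.util imports (the buildTree name is illustrative; the complete Codec below inlines the same logic in its encode method):

// Build the Huffman tree bottom-up by repeatedly merging the two cheapest nodes.
static Node buildTree(Map<Character, Integer> frequencies) {
    PriorityQueue<Node> minHeap = new PriorityQueue<>((a, b) -> a.cost - b.cost);
    for (Map.Entry<Character, Integer> entry : frequencies.entrySet()) {
        minHeap.offer(new Node(entry.getKey(), entry.getValue()));
    }
    while (minHeap.size() > 1) {
        Node first = minHeap.poll();  // lowest frequency
        Node second = minHeap.poll(); // second lowest
        Node parent = new Node('\0', first.cost + second.cost); // internal node, no character
        parent.left = first;
        parent.right = second;
        minHeap.offer(parent);
    }
    return minHeap.poll(); // the root of the Huffman tree (null if the input was empty)
}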
Decoding Huffman-Encoded Data:
Decoding Huffman-encoded data is a straightforward process that utilizes the Huffman tree. The steps are as follows:
Start at the root of the Huffman tree.
Read bits from the encoded data sequentially.
For each '0' bit, move to the left child; for each '1' bit, move to the right child.
When a leaf node is reached, output the symbol associated with that node and return to the root for the next symbol. In the solution below, this tree walk is replaced by a lookup in the decoder HashMap<String, Character>, which maps each Huffman code back to its character (the mirror of the encoder HashMap<Character, String>).
This process continues until all bits in the encoded data have been processed. The prefix-free nature of Huffman codes ensures unambiguous decoding.
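For completeness, here is a minimal sketch of the tree-walking decoder described above (the decodeWithTree name is illustrative; it assumes the Node class from earlier, at least two distinct symbols in the tree, and a bit string produced by the matching encoder; the Codec below reaches the same result with its decoder map instead):

// Walk the tree bit by bit; each time a leaf is reached, emit its character and restart at the root.
static String decodeWithTree(String bits, Node root) {
    StringBuilder out = new StringBuilder();
    Node current = root;
    for (int i = 0; i < bits.length(); i++) {
        current = (bits.charAt(i) == '0') ? current.left : current.right;
        if (current.left == null && current.right == null) { // leaf reached
            out.append(current.data);
            current = root; // restart for the next symbol
        }
    }
    return out.toString();
}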
Encoding Data:
Once the Huffman codes have been generated by the tree traversal, the encoding process begins. It replaces each symbol in the original data with the corresponding Huffman code retrieved from the encoder HashMap.
This encoding significantly reduces the data size compared to fixed-length encoding methods. Building the tree and deriving the code for each unique character takes O(n log n) time, where n is the number of unique characters; once the code table exists, encoding each symbol of the message is a constant-time lookup.
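A minimal sketch of that replacement step (the encodeMessage name is illustrative; the Codec below performs the same concatenation inside encode, and a StringBuilder is used here to avoid the cost of repeatedly concatenating strings):

// Replace every character of the message with its Huffman code.
static String encodeMessage(String message, Map<Character, String> codes) {
    StringBuilder bits = new StringBuilder();
    for (int i = 0; i < message.length(); i++) {
        bits.append(codes.get(message.charAt(i)));
    }
    return bits.toString();
}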
Time and Space Complexity Analysis
The efficiency of Huffman coding can be analyzed in terms of its time and space complexity for different operations:
Tree Construction: The time complexity for building the Huffman tree is O(n log n), where n is the number of unique characters in the input. This complexity arises from the need to sort and repeatedly merge nodes.
Encoding: The encoding process has a time complexity of O(N), where N is the length of the input message. This linear time complexity makes encoding efficient for large datasets.
Decoding: The decoding process also has a time complexity of O(N), where N is the number of bits in the encoded message. This ensures that decoding is as efficient as encoding.
In terms of space complexity:
Tree Construction: The auxiliary space required is O(n), where n is the number of unique characters. This space is used to store the tree nodes and the priority queue for merging.
Encoding and Decoding: Both processes require O(1) auxiliary space beyond the output itself, as they only store the encoded or decoded message.
These complexity characteristics make Huffman coding an efficient choice for lossless data compression, offering a good balance between compression ratios and computational requirements.
Inspired by 535. Encode and Decode TinyURL
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public class Codec {

    // A tree node: leaves hold a character, internal nodes hold only a combined cost.
    public class Node implements Comparable<Node> {
        Character data;
        int cost; // frequency
        Node left;
        Node right;

        public Node(Character data, int cost) {
            this.data = data;
            this.cost = cost;
            this.left = null;
            this.right = null;
        }

        @Override
        public int compareTo(Node other) {
            return this.cost - other.cost;
        }
    }

    HashMap<Character, String> encoder; // character -> Huffman code
    HashMap<String, Character> decoder; // Huffman code -> character

    // Traverse the tree; osf ("output so far") is the path of 0s and 1s from the root.
    public void initEncoderDecoder(Node node, String osf) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            // Edge case: with a single unique character the root itself is a leaf and the
            // path is empty, so give that character a one-bit code instead.
            String code = osf.isEmpty() ? "0" : osf;
            this.encoder.put(node.data, code);
            this.decoder.put(code, node.data);
            return;
        }
        initEncoderDecoder(node.left, osf + "0");
        initEncoderDecoder(node.right, osf + "1");
    }

    // Encodes a URL to a shortened (Huffman-compressed) bit string.
    public String encode(String longUrl) {
        // 1. Count how often each character appears.
        HashMap<Character, Integer> fmap = new HashMap<>();
        for (int i = 0; i < longUrl.length(); i++) {
            char cc = longUrl.charAt(i);
            if (fmap.containsKey(cc)) {
                int ov = fmap.get(cc);
                ov += 1;
                fmap.put(cc, ov);
            } else {
                fmap.put(cc, 1);
            }
        }

        // 2. Build the Huffman tree by repeatedly merging the two cheapest nodes.
        PriorityQueue<Node> minHeap = new PriorityQueue<>((a, b) -> a.cost - b.cost);
        Set<Map.Entry<Character, Integer>> entrySet = fmap.entrySet();
        for (Map.Entry<Character, Integer> entry : entrySet) {
            Node node = new Node(entry.getKey(), entry.getValue());
            minHeap.offer(node);
        }
        while (minHeap.size() > 1) {
            Node first = minHeap.poll();
            Node second = minHeap.poll();
            Node newNode = new Node('\0', first.cost + second.cost); // internal node, no character
            newNode.left = first;
            newNode.right = second;
            minHeap.offer(newNode);
        }
        Node ft = minHeap.poll(); // root of the Huffman tree (null for an empty input)

        // 3. Derive the code tables from the tree.
        this.encoder = new HashMap<>();
        this.decoder = new HashMap<>();
        this.initEncoderDecoder(ft, "");

        // 4. The encode part: replace every character with its Huffman code.
        String ans = "";
        // A BitSet could be used instead: like an array but with a bit at each index.
        for (int i = 0; i < longUrl.length(); i++) {
            ans = ans + encoder.get(longUrl.charAt(i));
        }
        return ans;
    }

    // Decodes a shortened URL to its original URL.
    public String decode(String shortUrl) {
        String key = "";
        String ans = "";
        for (int i = 0; i < shortUrl.length(); i++) {
            key = key + shortUrl.charAt(i);
            // Thanks to the prefix-free property, the first matching code is always the right one.
            if (decoder.containsKey(key)) {
                ans = ans + decoder.get(key);
                key = "";
            }
        }
        return ans;
    }
}
// Your Codec object will be instantiated and called as such:
// Codec codec = new Codec();
// codec.decode(codec.encode(url));
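As a quick sanity check of the round trip (the URL below is made up purely for illustration), encoding and then decoding should give back the original string:

class Demo {
    public static void main(String[] args) {
        Codec codec = new Codec();
        // Hypothetical URL, used only to illustrate the round trip.
        String url = "https://www.example.com/articles/huffman-coding";
        String encoded = codec.encode(url); // a string of '0' and '1' characters
        String decoded = codec.decode(encoded);
        System.out.println(decoded.equals(url)); // prints: true
    }
}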
Congrats, you made it this far!
If you found this article on Huffman coding informative and helpful, please share it with others who might benefit from it.
Don't forget to follow me on social media for real-time insights and tips.
Happy Coding ^_^