File Compression: A Practical Approach

MakxMakx
6 min read

Compression Tool 🗜️

A compression tool is a piece of software that is used to store file data in a compressed format. It is useful for space optimization i.e. it helps to store the same amount of information using less memory space. In this application, the Huffman coding algorithm compresses the file content.

Huffman Encoding Algorithm 🚀

The main steps of the algorithm are explained below -

  1. Frequency Count: This algorithm's first and most important step is to count the frequency of each character present in the source file.

  2. Build Huffman Tree: The next step is to build a Huffman tree based on the frequency table.

  3. Encode Tree: We assign the left edge a value of 0 and the right 1 for each node from the root node till we reach a child node. After this process, we can obtain the encoded bit string for each character. Since characters are only present in the child node, the route from the root to a particular child node is the encoded bit string for that particular character.

Let's understand the Algorithm with the help of an example 🧠

  1. Suppose the file content is halamadrid

  2. The frequency table would look like this.

CharacterFrequency
h1
a3
l1
m1
d2
r1
i1
  1. The Huffman Tree (🌲) for the corresponding table would be

  • The following steps are used to build the Huffman Tree from the frequency table 🪜

    1. Initially insert all nodes into a priority queue.

    2. A node has to have at least two fields

      • characters - representing the characters the node holds.

      • frequency count - frequency count of the characters.

    3. Select two characters with the least frequencies, remove those two nodes from the list, add their frequency, and insert the newly created node into the list.

    4. Step (iii) is repeated until the list contains only one character.

  • Once the Huffman Tree is built, it’s time to prepare the tree for encoding. For each non-terminal node, assign the left child edge a value of 0 and the right child edge a value of 1.

  1. After the tree is built and edges are assigned values, the bit encoding of a character is equal to the path from the root node to the child node. The encodings(🪢) are shown below -

  2. So, the encoded bit sequence for halamadrid will be 000111001110111101101000101

CharacterEncoding
h000
a11
l100
m1011
d01
r1010
i001

Store the encodings in a file

Since we can't open a lock without its key 🔑, here also if we only store the encoded bit sequence to a file, we won’t be able to decode the file content later. To be able to decode the content we should either have access to the frequency table or the character encodings table directly. To store this indirect information, we could insert a header section in our target file, which will hold this important information. The file size will increase depending on the size of the table compared to only storing the encoded bit sequence, but these are very important to retrieve the original information.
Here are some variables that I used to identify the Header section in my application.

var (
    HeaderBodySeparator   string = "\n~<=>-\n"
    KeyValuePairSeparator string = "%->"
    KeyValueSeparator     string = "-<>~"
)

The compressed file file content for the text halamadrid is

000-<>~h%->001-<>~i%->01-<>~d%->100-<>~l%->1010-<>~r%->1011-<>~m%->11-<>~a
~<=>-
ïh

Where ïh is the actual encoded content.

Decoding the compressed file 🔐

To decode the compressed file content, we would first need to parse the header part to retrieve the character encodings table. In the above example, the header part contains

000-<>~h%->001-<>~i%->01-<>~d%->100-<>~l%->1010-<>~r%->1011-<>~m%->11-<>~a

In which each key-value pair is separated by %->. Now if we store the k-v pairs in an array, the array would look like

[
    000-<>~h,
    001-<>~i,
    01-<>~d, 
    100-<>~l,
    1010-<>~r, 
    1011-<>~m,
    11-<>~a
]

Now using the key-value separator i.e. -<>~, we can build the character encodings table as follows.

CharacterEncoding
h000
a11
l100
m1011
d01
r1010
i001

Since we have now access to both the encoded bit sequence and the character encodings table, we can now easily retrieve the original file content.

Problems Faced

Since, in most programming languages, a byte i.e. 8 bits is the smallest possible data type, how could we store let’s say 2-bit character encodings effectively?

If we use one byte for a character (for example a → 11), then it would be of the same size since in the original file also a character takes up one byte.

To solve this problem we need to store multiple character encodings in a byte.

/// convert the string representation of encoded bits to actual bits
/// actual compression happends here
func StringToBits(bitString string) []byte {
    var output []byte
    for i := 0; i < len(bitString); i += 8 {
        var curr byte = 0
        for j := i; j < min(len(bitString), i+8); j++ {
            if bitString[j] == '1' {
                curr = (curr << 1) | 1
            } else {
                curr = (curr << 1)
            }
        }
        output = append(output, curr)
    }
    return output
}

In the above function initially, the current byte is 00000000. The function loops through the encoded bits stream, and at a time it processes the next 8 bits from the current position.

If the current bit is 1, it performs a left-wise shift on the current byte and performs AND with 1. Let’s say at any point the current byte is 00110010.

00110010 « 1 → 01100100 | 1 → 01100101

This is how the algorithm appends a bit 1 to the current byte. To append 0, it is very simple, it just doesn’t perform bitwise AND operation with 1 i.e. it skips the second step.

This is the encoding part, similar to this before decoding it converts the actual bit streams to their equivalent bit string. Here is the algo

/// convert []byte(bit sequence) to string representation of bits
/// argument: input - slice of bytes or bit sequence
func BitsToString(input []byte) string {
    var bitString string
    for idx, inputByte := range input {
        var curr string
        if idx == len(input)-1 {
            curr = fmt.Sprintf("%b", inputByte)
        } else {
            curr = fmt.Sprintf("%08b", inputByte)
        }
        bitString = bitString + curr
    }
    return bitString
}

For the encoded bit stream 000111001110111101101000101, which has 27 bits, we need ⌈count/8⌉ = 4 bytes to store the information. So from our observation initially, it took 27 bytes to store the information but now it takes only 4 bytes to store the same information.

Conclusion

Just like all good things, it comes with its caveat. This application also has its caveat. The application only works great with large-size files. The header overhead sometimes becomes much more than the original file content for small-sized files.

0
Subscribe to my newsletter

Read articles from Makx directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Makx
Makx

I’m a Computer Science undergraduate from India, passionate about programming 💻 and exploring the digital world. Besides coding, I enjoy digital media creation, Gym, and football ⚽.