File Compression: A Practical Approach
Compression Tool 🗜️
A compression tool is a piece of software that is used to store file data in a compressed format. It is useful for space optimization i.e. it helps to store the same amount of information using less memory space. In this application, the Huffman coding algorithm compresses the file content.
Huffman Encoding Algorithm 🚀
The main steps of the algorithm are explained below -
Frequency Count: This algorithm's first and most important step is to count the frequency of each character present in the source file.
Build Huffman Tree: The next step is to build a Huffman tree based on the frequency table.
Encode Tree: We assign the left edge a value of
0
and the right1
for each node from the root node till we reach a child node. After this process, we can obtain the encoded bit string for each character. Since characters are only present in the child node, the route from the root to a particular child node is the encoded bit string for that particular character.
Let's understand the Algorithm with the help of an example 🧠
Suppose the file content is
halamadrid
The frequency table would look like this.
Character | Frequency |
h | 1 |
a | 3 |
l | 1 |
m | 1 |
d | 2 |
r | 1 |
i | 1 |
- The Huffman Tree (🌲) for the corresponding table would be
The following steps are used to build the Huffman Tree from the frequency table 🪜
Initially insert all nodes into a
priority queue
.A node has to have at least two fields
characters
- representing the characters the node holds.frequency count
- frequency count of the characters.
Select two characters with the least frequencies, remove those two nodes from the list, add their frequency, and insert the newly created node into the list.
Step (iii) is repeated until the list contains only one character.
Once the
Huffman Tree
is built, it’s time to prepare the tree for encoding. For each non-terminal node, assign the left child edge a value of0
and the right child edge a value of1
.
After the tree is built and edges are assigned values, the bit encoding of a character is equal to the path from the root node to the child node. The encodings(🪢) are shown below -
So, the encoded bit sequence for
halamadrid
will be000111001110111101101000101
Character | Encoding |
h | 000 |
a | 11 |
l | 100 |
m | 1011 |
d | 01 |
r | 1010 |
i | 001 |
Store the encodings in a file
Since we can't open a lock without its key 🔑, here also if we only store the encoded bit sequence to a file, we won’t be able to decode the file content later. To be able to decode the content we should either have access to the frequency table or the character encodings table directly. To store this indirect information, we could insert a header section in our target file, which will hold this important information. The file size will increase depending on the size of the table compared to only storing the encoded bit sequence, but these are very important to retrieve the original information.
Here are some variables that I used to identify the Header section in my application.
var (
HeaderBodySeparator string = "\n~<=>-\n"
KeyValuePairSeparator string = "%->"
KeyValueSeparator string = "-<>~"
)
The compressed file file content for the text halamadrid
is
000-<>~h%->001-<>~i%->01-<>~d%->100-<>~l%->1010-<>~r%->1011-<>~m%->11-<>~a
~<=>-
ïh
Where ïh
is the actual encoded content.
Decoding the compressed file 🔐
To decode the compressed file content, we would first need to parse the header part to retrieve the character encodings table. In the above example, the header part contains
000-<>~h%->001-<>~i%->01-<>~d%->100-<>~l%->1010-<>~r%->1011-<>~m%->11-<>~a
In which each key-value pair is separated by %->
. Now if we store the k-v pairs in an array, the array would look like
[
000-<>~h,
001-<>~i,
01-<>~d,
100-<>~l,
1010-<>~r,
1011-<>~m,
11-<>~a
]
Now using the key-value separator i.e. -<>~
, we can build the character encodings table as follows.
Character | Encoding |
h | 000 |
a | 11 |
l | 100 |
m | 1011 |
d | 01 |
r | 1010 |
i | 001 |
Since we have now access to both the encoded bit sequence and the character encodings table, we can now easily retrieve the original file content.
Problems Faced
Since, in most programming languages, a byte i.e. 8 bits is the smallest possible data type, how could we store let’s say 2-bit character encodings effectively?
If we use one byte for a character (for example a → 11
), then it would be of the same size since in the original file also a character takes up one byte.
To solve this problem we need to store multiple character encodings in a byte.
/// convert the string representation of encoded bits to actual bits
/// actual compression happends here
func StringToBits(bitString string) []byte {
var output []byte
for i := 0; i < len(bitString); i += 8 {
var curr byte = 0
for j := i; j < min(len(bitString), i+8); j++ {
if bitString[j] == '1' {
curr = (curr << 1) | 1
} else {
curr = (curr << 1)
}
}
output = append(output, curr)
}
return output
}
In the above function initially, the current byte is 00000000
. The function loops through the encoded bits stream, and at a time it processes the next 8 bits from the current position.
If the current bit is 1, it performs a left-wise shift on the current byte and performs AND with 1. Let’s say at any point the current byte is 00110010.
00110010 « 1 → 01100100 | 1 → 01100101
This is how the algorithm appends a bit 1
to the current byte. To append 0
, it is very simple, it just doesn’t perform bitwise AND operation with 1
i.e. it skips the second step.
This is the encoding part, similar to this before decoding it converts the actual bit streams to their equivalent bit string. Here is the algo
/// convert []byte(bit sequence) to string representation of bits
/// argument: input - slice of bytes or bit sequence
func BitsToString(input []byte) string {
var bitString string
for idx, inputByte := range input {
var curr string
if idx == len(input)-1 {
curr = fmt.Sprintf("%b", inputByte)
} else {
curr = fmt.Sprintf("%08b", inputByte)
}
bitString = bitString + curr
}
return bitString
}
For the encoded bit stream 000111001110111101101000101, which has 27 bits, we need ⌈count/8⌉ = 4 bytes to store the information. So from our observation initially, it took 27 bytes to store the information but now it takes only 4 bytes to store the same information.
Conclusion
Just like all good things, it comes with its caveat. This application also has its caveat. The application only works great with large-size files. The header overhead sometimes becomes much more than the original file content for small-sized files.
Subscribe to my newsletter
Read articles from Makx directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Makx
Makx
I’m a Computer Science undergraduate from India, passionate about programming 💻 and exploring the digital world. Besides coding, I enjoy digital media creation, Gym, and football ⚽.