Data Compression, Types and Techniques in Big Data


Introduction
This article will discuss compression in the Big Data context, covering the types and methods of compression. I will also highlight why and when each type and method should be used.
Diving in
What is compression?
In everyday English, compression means reducing something so that it occupies a smaller space. In Computer Science, compression is the process of reducing data to a smaller size. Data, in this case, could be text, audio, video files and so on; think of it as anything you store on your computer's hard drive, represented in different formats. To give a more technical definition, compression is the process of encoding data so that it uses fewer bits.
There are multiple reasons to compress data. The most common and intuitive reason is to save storage space. The other reasons follow from the data simply being smaller. The benefits of working with smaller data include:
Quicker Data Transmission Time: Compressed data is smaller in size and takes less time to be transmitted from source to destination.
Reduced bandwidth consumption: This reason is strongly linked to the advantage of quicker data transmission. Compressed data uses less of the network bandwidth, therefore increasing the throughput and reducing the latency.
Improved performance for digital systems that rely heavily on data: Systems that process large volumes of data can leverage compression to improve performance by reducing the volume of data that needs to be processed. Please note that this is system-specific and relies on using an appropriate compression technique; compression techniques are discussed later in this article.
Cost Efficiency: Cloud services charge for data storage. Using less storage translates into cost savings, especially in Big Data systems.
Other reasons for compression depend on the compression technique and format used. Some compression tools and formats also support encryption, which adds a layer of security on top of the benefits discussed above. Additionally, using common compression formats brings compatibility and room for extensibility when integrating with external systems.
It is worth noting that these reasons read like pure benefits. However, compression is not without trade-offs. The most common one is the compute cost of compressing and decompressing data, which can be a concern for resource-constrained systems. Other trade-offs depend on the compression technique and the type of data involved.
Systems in this article refer to digital systems that make use of data and can take advantage of compression techniques. The word is used quite loosely and should be interpreted in the context of the section in which it appears.
Compression Types
To discuss the different techniques used to compress data, I will first split compression into two main categories and then discuss the techniques relevant to each. Compression can be broadly grouped into lossy and lossless compression.
As the names suggest, lossy compression techniques are techniques that do not preserve the full fidelity of the data. Simply put, some data is discarded, but not enough to make what the data represents unrecognisable. Because of this, lossy compression can achieve a much higher level of compression than lossless compression, which will be introduced shortly.
A characteristic of lossy compression is that it is irreversible, i.e. when presented with the compressed file, one cannot restore the raw data at its original fidelity. Certain files and file formats are suitable for lossy compression; it is typically used for images, audio and video. For instance, JPEG images lend themselves well to lossy compression, and when compressing a JPEG image, the creator or editor can choose how much loss to introduce.
On the other hand, lossless compression is reversible: all data is preserved during compression and fully restored during decompression. This makes lossless compression suitable for text-like files, and in the data warehouse and lakehouse world it is the only relevant type to use. Some audio (FLAC and ALAC) and image (GIF, PNG, etc.) formats also work well with this compression type.
Choosing a method
There is no single best compression method. Different factors go into choosing a suitable method on a case-by-case basis. To buttress this with examples, a data engineer in the finance industry working on tabular data would tend to use lossless compression, because missing data would undermine accurate reporting. Alternatively, lossy compression could be the way to go when optimising an image-heavy web page: compressing the images reduces what has to be loaded and makes the website lighter. Therefore, it is crucial to conduct an assessment to determine the most appropriate compression method that aligns with the business requirements.
Compression Techniques
This section covers only the common compression techniques for both lossy and lossless compression. Please note that this is not in any way exhaustive. Furthermore, the techniques discussed have variations, backed by different research, that enhance their performance.
Lossless compression techniques
Three common lossless techniques are Run-Length Encoding (RLE), Huffman Coding and Lempel-Ziv-Welch (LZW) coding.
Run-Length Encoding: RLE encodes data by replacing a sequence of repeating values with a single value and a count of how many times it repeats. It is effective for long runs of repeated data. Datasets whose dimensions (fields) are sorted from low to high cardinality also benefit from RLE.
For example, take a simple string like AAAAABBCDDD. RLE compresses the data to become A(5)B(2)C(1)D(3). For a more practical example, take the table in the image below.
Figure 1 - Before RLE. Note that the cardinality of the fields increases from left to right.
Figure 2 - After RLE
Because RLE depends on runs of repeated values (and, in the second example, on the cardinality and sort order of the data), the Mouse records in the item column cannot be compressed to just Mouse (3), because the preceding column splits those values into IT, Mouse and HR, Mouse. Certain file formats are compatible with RLE, such as bitmap formats like TIFF and BMP. Parquet files also support RLE, making it very useful in modern data lakehouses that use object storage like S3 or GCS.
Huffman Coding: This technique is based on statistical modelling that assigns variable-length codes to values in the raw data based on the frequency at which they occur. The model can be represented as a Huffman tree, which is similar to a binary tree. This tree is then used to create a Huffman code for each value in the raw data. The algorithm prioritises encoding the most frequent values with the fewest possible bits.
Let's take the same data used in the RLE example, AAAAABBCDDD. The corresponding Huffman tree looks like this.
Huffman Tree
From the tree, we can see that the letter A is represented by 0 and, likewise, D is represented by 10. Compared to B: 111 and C: 110, we observe that A and D are represented by fewer bits. This is because they have a higher frequency; the Huffman algorithm represents the most frequent values with fewer bits by design. The resulting compressed data becomes 00000111111110101010.
Huffman Coding uses the prefix rule, which states that the code representing a character must not be a prefix of any other code. For example, a valid Huffman code cannot represent the letters C and D as C: 00 and D: 000, because the code for C is a prefix of the code for D.
To see this in action, the Computer Science Field Guide has a Huffman Tree Generator you could play with.
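Below is a compact sketch of how such a code table could be built in Python using the standard heapq module. It is my own illustration of the idea rather than a production encoder:

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict[str, str]:
    """Build a Huffman code table: frequent symbols get shorter codes.
    Assumes the input contains at least two distinct symbols."""
    # Each heap entry is (frequency, tie_breaker, {symbol: code_so_far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}    # left branch gets a leading 0
        merged.update({s: "1" + c for s, c in right.items()})  # right branch gets a leading 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("AAAAABBCDDD")
print(codes)                                            # {'A': '0', 'D': '10', 'C': '110', 'B': '111'} for this input
print("".join(codes[ch] for ch in "AAAAABBCDDD"))       # 00000111111110101010: 20 bits instead of 88 bits of ASCII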
Lempel–Ziv–Welch Coding: LZW was published by Terry Welch in 1984 as a refinement of earlier algorithms by Abraham Lempel and Jacob Ziv, and it is named after all three creators, obviously 😅. Similar to RLE and Huffman Coding, LZW works well with data that contains lots of repetition. The LZW algorithm is dictionary-based: it builds a dictionary of key-value pairs for patterns commonly seen in the raw data. This dictionary can also be referred to as the code table. To illustrate how the technique works, let's take our raw data to be ABBABABABA. When passed through the algorithm with an initial dictionary of the letters A-Z, the resulting code table looks like this:
LZW Code Table
From the code table above, there is a key-value pair for every letter A-Z, plus key-value pairs for patterns such as AB, BB, BA and ABA. By giving these patterns shorter representations, the LZW algorithm can compress the raw data by encoding it into fewer bits. Using the code table generated from this input, the compressed version is 0 1 1 26 29 28. It is key to notice the spaces in the compressed data: think of them as marking the end of each code, so the decoder does not interpret 1, 0 as 10, since they mean different things.
LZW is general-purpose and widely used today. It is integrated into many Unix/Linux-based operating systems behind the compress shell command. Common file formats compatible with LZW are GIF, TIFF and PDF. Other applications of LZW compression can be seen in the field of Natural Language Processing, as discussed in this paper on tokenization in NLP.
RLE, Huffman Coding and LZW are only common examples; lossless compression techniques go well beyond the three described above. Other techniques include DEFLATE, which combines Huffman Coding with LZ77, an earlier member of the same Lempel-Ziv family that LZW belongs to.
Lossy compression techniques
In this section, we will look at two common lossy compression techniques. Recall that lossy compression introduces a loss in the original data, meaning that not all of it is kept.
Discrete Cosine Transform (DCT): This compression method is used mainly for audio, image and video files and is commonly described as block-based compression. It uses a mathematical function, the cosine function as the name implies, to convert blocks of the original data into frequencies. The blocks are usually small matrices such as 8x8 or 4x4.
The compression happens in how the high frequencies are handled once the raw data has been translated into the frequency domain. The overall process of using DCT for compression is as follows (a short code sketch follows the list):
Break down raw data into chunks. For instance, in image compression, this could be 8x8 pixels.
Apply the mathematical function to convert the chunks of data to frequencies. This will result in some high frequencies and low frequencies.
The high frequencies are then reduced or removed depending on the acceptable degree of loss one is willing to introduce. This is where it really becomes lossy compression.
To convert back to representable data, all the remaining frequencies are passed through an Inverse Discrete Cosine Transform - IDCT - to restore the data from the frequencies.
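To make these steps concrete, here is a small sketch using NumPy and SciPy (both assumed to be installed). It transforms a single 8x8 block, removes the high-frequency coefficients, and reconstructs an approximation with the IDCT:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Step 1: a single 8x8 block (a smooth gradient standing in for real image pixels).
block = np.fromfunction(lambda i, j: 100 + 10 * i + 5 * j, (8, 8))

# Step 2: convert the block into the frequency domain with a 2-D DCT.
coeffs = dctn(block, norm="ortho")

# Step 3: remove the high frequencies by keeping only the low-frequency corner
# of the coefficient matrix; this is the lossy part of the process.
mask = np.zeros_like(coeffs)
mask[:4, :4] = 1
compressed = coeffs * mask  # only 16 of the 64 coefficients survive

# Step 4: reconstruct an approximation of the block with the inverse DCT.
approx = idctn(compressed, norm="ortho")
print(np.abs(block - approx).max())  # small error: smooth blocks carry little high-frequency detail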
DCT is widely used in different fields today, not only in compression but also in signal processing. Common file formats compatible with DCT are JPEG (images), MP3 (audio), and MPEG (video). Additionally, DCT can achieve high compression ratios making it suitable for digital systems with lots of images like web pages on the Internet.
Fractal Compression: A fractal is a pattern that repeats itself at different scales; viewed at any point on the scale, the pattern looks similar. Because the pattern is similar at every scale, fractal compression describes the 'big' copies of a pattern in terms of smaller ones, which reduces the size of the data.
Examples of Fractals
Fractal Compression was introduced by Michael Barnsley in the 1980s. The general idea, using an image as the example, is this: if an image contains several parts that look alike, why store them twice? To achieve this, fractal compression does the following:
Partitions the image into non-overlapping blocks known as range blocks. This could be range blocks of 8x8, 16x16 pixels, etc.
It scans the image for self-repeating patterns (fractal patterns). Using the range blocks, the algorithm finds larger sections of the image that are similar to these range blocks. These larger sections are referred to as the domain blocks.
A transform function is then applied to each domain block to approximate its range block. These transform functions are mathematical operations such as scaling, translation and rotation, also referred to as transformations. In Fractal Compression, these transformations are called fractal codes.
The data is then encoded as those transform functions: instead of storing the pixel-by-pixel data, the transformations are stored. They are the rules that describe how to reconstruct the image from the domain blocks.
With the fractal codes, the image is reconstructed using an iterative process. This process can be computationally expensive, but fractal compression can achieve higher compression ratios than many other techniques. Because it relies on self-repeating patterns, it performs best on data that actually contains such patterns, for example landscape photographs (images of nature) and DNA images.
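The full algorithm is involved, but a toy version helps show the moving parts. The sketch below is my own simplified illustration (grayscale only, fixed block sizes, no rotations or reflections): for each range block it finds the domain block plus contrast and brightness adjustment that approximate it best, then reconstructs the image by iterating those fractal codes.

```python
import numpy as np

R = 4  # range block size; domain blocks are 2R x 2R and spatially contracted 2:1

def downsample(block):
    """Average 2x2 pixels so an 8x8 domain block matches the 4x4 range block size."""
    return block.reshape(R, 2, R, 2).mean(axis=(1, 3))

def encode(img):
    """For each range block, find the domain block and affine adjustment that fit it best."""
    n = img.shape[0]
    domains = [(y, x, downsample(img[y:y + 2 * R, x:x + 2 * R]))
               for y in range(0, n - 2 * R + 1, 2 * R)
               for x in range(0, n - 2 * R + 1, 2 * R)]
    codes = []
    for ry in range(0, n, R):
        for rx in range(0, n, R):
            r = img[ry:ry + R, rx:rx + R].ravel()
            best = None
            for dy, dx, d in domains:
                dv = d.ravel()
                dc = dv - dv.mean()
                denom = (dc ** 2).sum()
                # Least-squares contrast (s) and brightness (o); clamp s so decoding converges.
                s = 0.0 if denom == 0 else float(np.clip((dc * (r - r.mean())).sum() / denom, -0.9, 0.9))
                o = r.mean() - s * dv.mean()
                err = ((s * dv + o - r) ** 2).sum()
                if best is None or err < best[0]:
                    best = (err, dy, dx, s, o)
            codes.append((ry, rx) + best[1:])  # fractal code: where to copy from and how to adjust it
    return codes

def decode(codes, n, iterations=10):
    """Rebuild the image by repeatedly applying the fractal codes to an arbitrary start image."""
    img = np.full((n, n), 128.0)
    for _ in range(iterations):
        new = np.empty_like(img)
        for ry, rx, dy, dx, s, o in codes:
            d = downsample(img[dy:dy + 2 * R, dx:dx + 2 * R])
            new[ry:ry + R, rx:rx + R] = s * d + o
        img = new
    return img

img = np.add.outer(np.arange(32.0), np.arange(32.0)) * 4  # a smooth 32x32 stand-in for an image
approx = decode(encode(img), 32)
print(np.abs(img - approx).mean())  # small reconstruction error; fractal coding is lossy in general
```

Notice that only the codes (one small tuple per range block) would need to be stored, not the pixels themselves, which is where the compression comes from.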
There are other lossy compression techniques, such as the Discrete Wavelet Transform and quantization. These techniques are mostly used for image, audio and video files and are tied to particular file formats, such as JPEG for images and MP3 for audio.
Lossy compression generally achieves higher compression ratios than lossless compression and often expects the user to decide beforehand how much loss to introduce. It is pertinent to emphasise that the choice of compression method and technique depends on several factors; at the core of these factors are the data format and the desired outcome.
TL;DR
Overall, this post discusses compression in the world of data, drawing heavily on the existing body of knowledge in computer science and information theory. To compress means to reduce the volume an entity occupies, and in the field of data, volume refers to storage space. Compression in digital systems has many advantages when done right. The obvious one is that it reduces the space used and leaves room to store more data. Other advantages include quicker transmission, lower bandwidth usage and a general improvement in the efficiency of the system. Remember, this is when it is done right.
To leverage the advantages of compression, it is key to know what type to use. Compression is either lossy or lossless. Lossy compression introduces a loss in the original data that is usually irreversible while lossless compression compresses the data and retains all the information contained in the original data. Furthermore, there is discourse on hybrid compression types but I think a combination of lossy and lossless is just lossy. Let me know what you think in the comments.
Lastly, different techniques were introduced for both lossy and lossless compression. The list of techniques and explanations of these techniques are neither exhaustive nor comprehensive. I consider them only a good start in giving you an idea of how each technique works. To wrap up, I have added additional resources to help you investigate further and read deeper about compression in big data.
Additional Resources
Video: Data Lake fundamentals - RLE encoding with Parquet in practice
Paper: A review of data compression techniques
Paper: lossless compression techniques
Book: A Concise Introduction to Data Compression by David Salomon
Paper: A Study of Various Data Compression Techniques
Blog Post: Compression in open file formats
Written by
Ayokunle Adeniyi
A Data professional with a Master's degree in Big Data and 4 years of experience managing and optimizing large data infrastructure in the financial services and gaming industry. SQL is my playground. I enjoy working with different databases, tools and technologies in data engineering with a keen interest in SRE, Cloud and AI.