Distributed File Systems Explained: Is HDFS or GlusterFS Right for You?


As backend engineers, we often face the challenge of efficiently managing and processing large volumes of data. One of the critical components in this ecosystem is the distributed file system. In this article, I'll explore two popular distributed file systems, Hadoop Distributed File System (HDFS) and GlusterFS, and compare their performance in handling large files.
Introduction
Distributed file systems are the backbone of big data processing and storage. They allow us to store and process data across multiple machines, providing scalability and fault tolerance. Two prominent players in this field are HDFS, which is part of the Apache Hadoop ecosystem, and GlusterFS, an open-source distributed file system.
In this experiment, I set out to answer a crucial question: Which distributed file system works best for large files? To answer this, I conducted a series of tests comparing HDFS and GlusterFS, focusing on their performance in writing and deleting large files.
Experimental Setup
Before diving into the results, let's look at the setup for each file system.
GlusterFS Setup
Disk Preparation: I added 3 disks to each VM and formatted them with the XFS filesystem. I then created directories for the bricks and mounted the disks to these directories.
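As a rough sketch, assuming the three added disks appear as /dev/sdb, /dev/sdc, and /dev/sdd and the bricks live under /data (device names and paths will vary), the preparation on each node looks like this:
sudo mkfs.xfs -f /dev/sdb          # format the first added disk with XFS
sudo mkdir -p /data/brick1         # directory that will hold the first brick
sudo mount /dev/sdb /data/brick1   # mount the disk on the brick directory
# repeat for /dev/sdc -> /data/brick2 and /dev/sdd -> /data/brick3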
GlusterFS Installation: I installed the GlusterFS server package on each node and made sure that the Gluster daemon was running.
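On Ubuntu, that step looks roughly like the following (package and service names can differ slightly between distributions):
sudo apt-get install -y glusterfs-server   # install the GlusterFS server package
sudo systemctl enable --now glusterd       # start the Gluster daemon and enable it at boot
sudo systemctl status glusterd             # confirm it is running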
Cluster Configuration: On the first node, I probed the other nodes to add them to the cluster. I verified the cluster configuration with the gluster pool list command.
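Assuming the other two nodes are reachable as gluster2 and gluster3 (placeholder hostnames), the probing from the first node looks like this:
sudo gluster peer probe gluster2   # add the second node to the trusted pool
sudo gluster peer probe gluster3   # add the third node
sudo gluster pool list             # all three nodes should now be listed as connected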
Volume Creation: I created a replicated Gluster volume across the three nodes using the directories I prepared earlier. I started the volume and checked its information with the gluster volume info command.
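A sketch of the volume creation, with gv0 as an illustrative volume name and the brick paths from the disk-preparation step:
sudo gluster volume create gv0 replica 3 \
    gluster1:/data/brick1/gv0 gluster2:/data/brick1/gv0 gluster3:/data/brick1/gv0
sudo gluster volume start gv0   # bring the volume online
sudo gluster volume info gv0    # check the type, brick list, and status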
Mounting the Volume: Finally, I mounted the Gluster volume to a directory on my host and verified the operation with the df -h command.
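The native FUSE client can mount the volume from any node in the pool; here gluster1 and the mount point /mnt/glusterfs are assumptions:
sudo mkdir -p /mnt/glusterfs
sudo mount -t glusterfs gluster1:/gv0 /mnt/glusterfs   # mount the volume via FUSE
df -h /mnt/glusterfs                                   # verify the mount and its usable size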
HDFS Setup
Update Ubuntu Source List: I began by updating the source list using sudo apt-get update.
Install SSH: I installed SSH with sudo apt-get install ssh and set up passwordless login between the Name Node and all Data Nodes.
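A minimal sketch of the passwordless login setup, run on the Name Node (the data node hostnames are placeholders):
sudo apt-get install -y ssh
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa   # key pair with no passphrase
ssh-copy-id ubuntu@datanode1                       # push the public key to each Data Node
ssh-copy-id ubuntu@datanode2
ssh ubuntu@datanode1 hostname                      # should connect without prompting for a password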
Configure Nodes: I added all nodes to /etc/hosts and installed JDK1.8 on all nodes.
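On each node, the host entries and the Java installation look something like this (the IP addresses are placeholders, and OpenJDK 8 is just one way to get JDK 1.8):
echo "10.0.0.1 namenode"  | sudo tee -a /etc/hosts
echo "10.0.0.2 datanode1" | sudo tee -a /etc/hosts
echo "10.0.0.3 datanode2" | sudo tee -a /etc/hosts
sudo apt-get install -y openjdk-8-jdk   # JDK 1.8 for Hadoop
java -version                           # confirm the installation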
Download & Install Hadoop: I downloaded Hadoop using wget, extracted it, and renamed the folder to hadoop.
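The download and extraction go along these lines; the release number below is only an example, so substitute whichever version you use:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz   # extract the archive
mv hadoop-3.3.6 ~/hadoop       # rename the folder to hadoop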
Set Environment Variables: I updated .bashrc with Hadoop environment variables and reloaded them.
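The entries I'd expect in .bashrc look roughly like this, assuming Hadoop was moved to ~/hadoop as above:
cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME=$HOME/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
source ~/.bashrc   # reload the variables in the current shell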
Configure Hadoop Cluster: I edited hadoop-env.sh to set JAVA_HOME, updated core-site.xml for the file system, and hdfs-site.xml for replication settings.
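For reference, a minimal version of that configuration might look like the following; the namenode hostname, port, and data directories are assumptions based on the rest of this setup, and the JAVA_HOME path matches the OpenJDK 8 package from earlier:
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
cat > $HADOOP_HOME/etc/hadoop/core-site.xml << 'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>
EOF
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << 'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/ubuntu/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/ubuntu/data/dataNode</value>
  </property>
</configuration>
EOF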
Data Folder Creation: I created a data folder and changed its ownership to the login user, which in my case was ubuntu.
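Assuming the data directories from the configuration above, on each node:
sudo mkdir -p /home/ubuntu/data
sudo chown -R ubuntu:ubuntu /home/ubuntu/data   # hand ownership to the login user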
Master and Workers Files Creation: I created a master file that is used by startup scripts to identify the name node and added my name node IP to it. I also created a workers file to identify data nodes and added all my data node IPs to it.
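With the placeholder hostnames from earlier, and assuming a Hadoop 3.x layout where these files live under $HADOOP_HOME/etc/hadoop (exact file names can vary by version), the two files can be written like this:
echo "namenode" > $HADOOP_HOME/etc/hadoop/masters                   # identifies the name node
printf "datanode1\ndatanode2\n" > $HADOOP_HOME/etc/hadoop/workers   # identifies the data nodes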
HDFS Formatting: I formatted HDFS, a necessary step before the file system can be used for the first time. This was done on the Name Node server.
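That formatting step is a single command on the Name Node:
hdfs namenode -format   # initializes the NameNode metadata directory (only for a fresh cluster)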
HDFS Cluster Start: I started the HDFS cluster by running the start-dfs.sh script from the Name Node server. I verified the startup by running the jps command on the Name Node and the Data Nodes.
Upload File to HDFS: Finally, I manually created my home directory and uploaded a file to HDFS using the hdfs dfs command.
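The last step looks something like this; the file name is just an example:
hdfs dfs -mkdir -p /user/ubuntu           # manually create the home directory
hdfs dfs -put testfile.txt /user/ubuntu   # upload a local file into HDFS
hdfs dfs -ls /user/ubuntu                 # confirm the file is there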
Test Methodology
To compare the performance of HDFS and GlusterFS, I focused on two primary operations: writing and deleting files. I used the following command to write files of various sizes (2GB, 4GB, 6GB, 8GB, and 9GB) to both file systems, shown here for the 6GB case:
sudo dd if=/dev/zero of=/mnt/sixgbfile.txt bs=1M count=6144
Each operation (write and delete) was performed three times for each file size, and the average time was calculated to ensure accuracy. I also calculated the margin of error using a 95% confidence interval to validate the reliability of our results.
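As a rough sketch of how one timed write/delete iteration can be captured for each system, assuming the GlusterFS mount point and HDFS home directory from the setup above (for HDFS, piping dd into hdfs dfs -put is one way to reuse the same write command):
# GlusterFS: write and delete through the FUSE mount
time sudo dd if=/dev/zero of=/mnt/glusterfs/sixgbfile.txt bs=1M count=6144
time sudo rm /mnt/glusterfs/sixgbfile.txt
# HDFS: stream the same dd output into HDFS, then time the delete
time dd if=/dev/zero bs=1M count=6144 | hdfs dfs -put - /user/ubuntu/sixgbfile.txt
time hdfs dfs -rm /user/ubuntu/sixgbfile.txt
With three runs per file size, the 95% margin of error I report assumes the usual small-sample t-based interval, roughly 4.30 × s/√3, where s is the standard deviation of the three runs and 4.30 is the t critical value for two degrees of freedom.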
Note
Also, there are a couple of factors to consider. First, the disk sizes were limited by my Google Cloud credits, which might affect the scalability of the experiment. Second, the difference in replication factor between GlusterFS (replication factor of 3) and HDFS (replication factor of 2) could potentially bias the results in favor of one system over the other. I tried to match the replication factors by making my HDFS master node also behave as a worker node, but I then had issues launching the ResourceManager and NameNode daemons, so I reverted to one master node and two worker nodes with a replication factor of 2.
Results and Analysis
Let's dive into the results of our experiments:
Write Performance:
HDFS consistently outperformed GlusterFS in write operations across all file sizes. Here's a comparison of average write times (in seconds):
[Chart: average write times in seconds for HDFS vs. GlusterFS across the 2GB to 9GB files]
As we can see, HDFS maintained relatively consistent write times even as file sizes increased, while GlusterFS showed a more pronounced increase in write times for larger files.
Delete Performance:
Interestingly, GlusterFS showed better performance in delete operations, especially for smaller file sizes:
[Chart: average delete times in seconds for HDFS vs. GlusterFS across the 2GB to 9GB files]
GlusterFS maintained extremely fast delete times up to 6GB files, with a slight increase for larger files. HDFS, on the other hand, showed consistent delete times across all file sizes.
Technical Analysis
- Write Performance:
HDFS's superior write performance can be attributed to its architecture and design principles:
Block-based storage: HDFS splits files into blocks (typically 128MB or 256MB), allowing for parallel write operations across multiple DataNodes.
Write-once, read-many: HDFS is optimized for write-once scenarios, which aligns well with our test case of writing large files.
Data locality: HDFS tries to place data blocks close to the computation, which can significantly improve write performance.
GlusterFS, while designed for scalability, seems to struggle more with large file writes. This could be due to:
Metadata overhead: GlusterFS manages metadata differently, which might introduce additional overhead for large files.
Replication factor: In our setup, GlusterFS had a replication factor of 3, compared to HDFS's 2, which could contribute to longer write times.
- Delete Performance:
GlusterFS's superior delete performance, especially for smaller files, can be explained by:
Distributed metadata: GlusterFS distributes metadata across the cluster, potentially allowing for faster delete operations.
Immediate deletion: GlusterFS might be performing immediate deletions, while HDFS could be using a more conservative approach.
HDFS's consistent delete times across file sizes suggest:
Block-based deletion: HDFS might be deleting files on a block-by-block basis, leading to more consistent times regardless of file size.
NameNode involvement: The HDFS NameNode needs to update metadata for each delete operation, which could introduce a constant overhead.
Implications for Backend Engineering
As backend engineers, these findings have several implications:
Workload Consideration: For write-heavy workloads involving large files, HDFS appears to be the better choice. Its consistent performance across file sizes makes it suitable for big data processing tasks.
Delete Operations: If your application involves frequent deletions of smaller files, GlusterFS might be more efficient. However, for larger files or less frequent delete operations, the difference becomes less significant.
Scalability: Both systems offer scalability, but HDFS seems to handle increasing file sizes more gracefully in terms of write performance.
System Design: When designing systems that deal with large files, consider the trade-offs between write and delete performance. HDFS's write performance might be more crucial for data ingestion and processing pipelines.
Resource Utilization: While not directly measured in this experiment, it's worth noting that HDFS's better write performance might lead to more efficient resource utilization in data-intensive applications.
Conclusion
In our comparison of HDFS and GlusterFS for handling large files, HDFS emerged as the better performer for write operations, maintaining consistent and faster write times across various file sizes. GlusterFS, while slower in writes, showed superior performance in delete operations, especially for smaller files.
As backend engineers, the choice between HDFS and GlusterFS should be guided by the specific requirements of your application. If your workload primarily involves writing and processing large files, HDFS appears to be the more suitable choice. However, if your application requires frequent delete operations on smaller files, GlusterFS might offer advantages.
It's important to note that this experiment has its limitations. Factors such as network configuration, hardware specifications, and specific use cases can influence performance. Always consider conducting benchmarks in your specific environment before making a final decision.
Future Work
To further enhance our understanding of these distributed file systems, future experiments could explore:
Read performance comparisons
Performance under concurrent operations
Scalability tests with larger clusters
Impact of different replication factors
Performance with different file types (e.g., structured vs. unstructured data)
By continually evaluating and understanding the performance characteristics of distributed file systems, we as backend engineers can make informed decisions that lead to more efficient and scalable systems.