Benchmarking IBM Storage Ceph Object Gateway (RGW) - Part 2: Unlocking Predictable Throughput with Horizontal Scaling

Daniel Parkes

Introduction

This post is the second installment in our blog series on benchmarking IBM Storage Ceph Object Gateway (RGW) with All-Flash IBM Storage Ceph Ready Nodes. If you haven’t read Part 1, we recommend starting there. It provides comprehensive details of our testing environment, including hardware and software configurations, network architecture, and benchmarking methodology.

In this installment, we delve into key performance results, focusing on peak throughput, maximum IOPS, and horizontal scalability.

Note on resource collocation: All services — RGW, Monitor, Manager, OSD, and Ingress — were deployed in a collocated configuration across each node. While this reflects a typical real-world deployment pattern, it also means resource sharing could introduce internal contention. These results were gathered with this trade-off in mind, and additional separation of services may yield even higher performance under specific workloads.

Peak Throughput Performance Highlights

In our effort to uncover the upper limits of IBM Storage Ceph Object Gateway (RGW), we tested a twelve-node all-NVMe (288 OSD) cluster under ideal conditions (optimized EC profiles, large object sizes, and high client concurrency) to push the platform to its maximum capabilities.

Maximum GET Throughput

  • 111 GiB/s aggregate GET throughput was achieved using 32MB objects (EC 2+2, RGW SSL enabled, 12 nodes, 1024 client threads).

  • This corresponds to ~9.25 GiB/s per node, pushing the limits of the Ceph node hardware’s physical network capacity.

  • Limiting factor: Network — Each node is equipped with two 100 Gbps NICs bonded via LACP; however, the Intel cards used are limited to a total bandwidth of 100 Gbps per node (12.5 GB/s, roughly 11.6 GiB/s). Given this, the ~111 GiB/s cluster-wide result puts us very close to line-rate saturation once framing overhead is accounted for (see the back-of-the-envelope check after this list).
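As a quick back-of-the-envelope check, using the effective per-node ceiling stated above: 100 Gbps is about 12.5 GB/s, or roughly 11.6 GiB/s of usable bandwidth per node, and 111 GiB/s spread across 12 nodes works out to about 9.25 GiB/s of client-facing GET traffic per node, i.e. roughly 80% of that ceiling. Much of the remaining headroom is consumed by the EC chunk reads each RGW pulls from OSDs on other nodes, which traverse the same bonded links, plus framing and protocol overhead.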

Maximum PUT Throughput

  • 65.8 GiB/s aggregate PUT throughput was reached under similar test conditions.

  • PUT operations inherently require additional backend I/O for data protection: with erasure coding, every object is split into data and coding chunks that must all be written across the cluster, which multiplies data movement. This explains why PUT throughput is lower than GET.

  • Despite this, each node still delivered ~5.5 GiB/s on average, which, when accounting for replication traffic, again places us near the effective NIC limit of our nodes.

CPU and Memory Utilization

  • CPU usage for RGW processes remained within bounds (8–14% total cluster-wide RGW CPU across 768 vCPUs), confirming that neither RGW nor CPU resources were the bottleneck under large object workloads.

  • Latencies remained consistent across the different EC profiles tested (2+2, 4+2, 8+3), further indicating that the network, not the compute resources, was the limiting factor; a quick comparison of the three profiles follows this list.
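For context, the three profiles differ mainly in how many chunks each object is split into and how much raw capacity is consumed per byte of user data, rather than in the client-visible latency we observed:

  • EC 2+2: 4 chunks per object, 2.0× raw capacity per usable byte (50% efficiency), tolerates two failures
  • EC 4+2: 6 chunks per object, 1.5× raw capacity per usable byte (~67% efficiency), tolerates two failures
  • EC 8+3: 11 chunks per object, ~1.38× raw capacity per usable byte (~73% efficiency), tolerates three failures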

Peak IOPS Performance Highlights

While large object tests revealed the cluster’s raw throughput potential, small object workloads (64 KB) helped expose the limits of metadata and IOPS scalability. Using a twelve-node all-NVMe cluster and high client concurrency, we tested both GET and PUT operations with multiple EC configurations to isolate the peak operations per second the platform can sustain with small object sizes.
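The load-generation tooling and its parameters are described in Part 1. Purely as an illustration of the access pattern (many concurrent client threads issuing 64 KB GETs against the RGW S3 endpoint), a minimal sketch with boto3 could look like the following; the endpoint, credentials, bucket name, and object naming are placeholder assumptions, not our actual benchmark harness:

```python
# Illustrative only: a minimal concurrent small-object GET loop against the RGW S3 API.
# The endpoint, credentials, bucket, and key naming are placeholders, not the benchmark
# harness actually used for the results in this post (see Part 1).
import time
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

THREADS = 64              # per-client concurrency; the tests above used 1024 threads in total
OPS_PER_THREAD = 1000
BUCKET = "bench"          # assumed to be pre-filled with 64 KB objects named obj-0000 ... obj-9999

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",    # RGW S3 endpoint (placeholder)
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(max_pool_connections=THREADS),
)

def worker(thread_id):
    """Issue GETs sequentially and record per-request latency in seconds."""
    latencies = []
    for i in range(OPS_PER_THREAD):
        key = f"obj-{(thread_id * OPS_PER_THREAD + i) % 10000:04d}"
        t0 = time.perf_counter()
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        latencies.append(time.perf_counter() - t0)
    return latencies

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    per_thread = list(pool.map(worker, range(THREADS)))
elapsed = time.perf_counter() - start

lats = [lat for thread in per_thread for lat in thread]
print(f"GET IOPS: {len(lats) / elapsed:,.0f}")
print(f"mean GET latency: {1000 * sum(lats) / len(lats):.2f} ms")
```

In practice, a dedicated S3 benchmark tool spreads this kind of load across several client hosts, as in these tests, where eight clients drove 1,024 threads in total.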

Maximum GET IOPS

  • ~391K GET IOPS were achieved using 64 KB objects (EC 2+2, no SSL, 12 nodes, 1024 client threads). This corresponds to a total GET throughput of ~24.4 GB/s and an average latency of ~2 ms.

  • Limiting factor: CPU usage on RGWs began to rise (~9.8% across 768 vCPUs), but the system still had headroom. Low latency and consistent scaling across client threads suggest that the cluster could push higher IOPS with additional client-side resources/concurrency.

Maximum PUT IOPS

  • ~86.6K PUT IOPS were recorded under a similar configuration (64 KB objects, EC 2+2, RGW with SSL, 12 nodes, 1024 client threads). PUT throughput peaked at ~5.4 GiB/s with an average latency of ~8 ms at full concurrency.

  • Limiting factor: PUT operations involve more backend coordination due to erasure-coded writes. While RGW CPU usage remained modest (~2.9%), the OSD layer and I/O complexity likely capped further gains. The gap between GET and PUT IOPS reflects this inherent asymmetry in workload cost; a rough write-amplification estimate follows this list.
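As a rough estimate of what those ~86.6K PUT/s imply on the backend (assuming the EC 2+2 layout described above): 86,600 ops/s × 64 KiB is about 5.3 GiB/s of client data, in line with the ~5.4 GiB/s measured. With EC 2+2, each object is stored as two data chunks plus two coding chunks, so the cluster writes roughly twice the client bytes (~10.6 GiB/s of chunk writes), before counting metadata traffic and the bucket-index update every PUT performs. That backend amplification is a large part of why PUT IOPS trail GET IOPS so markedly at this object size.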

CPU and Memory Utilization

For small object workloads (64 KB), RGW CPU usage showed a clear distinction between GET and PUT operations. GET workloads consumed significantly more CPU per RGW daemon, scaling from ~3 cores at low concurrency to nearly 10 cores per RGW with eight clients (1,024 threads). In contrast, PUT workloads remained consistently lighter, peaking at just under three cores per RGW even at maximum concurrency.

This behavior is attributed to the high volume of GET requests, over 390K IOPS at peak (as shown in the following graph), which, despite being individually lightweight, accumulate more CPU cycles due to their sheer frequency. PUT requests, by contrast, involve a heavier I/O path but occur at a much lower rate (~87K IOPS) and benefit from write-path optimizations.

Note: Object storage workloads in production environments tend to be read-heavy (GET-dominant). This pattern aligns with common use cases like analytics pipelines, data lakes, and media delivery systems.

Memory consumption across RGW daemons remained stable throughout the test, with no signs of pressure or leaks. Each RGW daemon consumed between 170 and 260 MiB of memory during PUTs and between 205 and 260 MiB for GETs, showing a gradual increase with added concurrency.

These results demonstrate that CPU availability becomes the dominant performance factor for small object workloads, particularly under high GET request rates. As IOPS scale into the hundreds of thousands, provisioning sufficient CPU resources is critical to maintaining low latency and high throughput.

Horizontal Scalability with EC 2+2 (4MB Objects)

To evaluate the impact of scaling out the IBM Storage Ceph Object Gateway (RGW), we conducted a controlled test using an erasure coding 2+2 profile, 4 MB object sizes, and SSL enabled at the RGW layer. This specific EC profile was chosen because its four chunks (k+m = 4) can be placed even on a four-node cluster, making it valid across 4-, 8-, and 12-node deployments and allowing a fair apples-to-apples comparison.

By incrementally increasing the number of nodes and OSDs, we observed how the system responds in terms of throughput, latency, and resource consumption under identical client concurrency and request sizes.

Analysis

  • Predictable, Linear Scaling:
    As the cluster expanded from four to twelve nodes, GET throughput nearly tripled, growing from ~39 GiB/s to ~113 GiB/s. PUT throughput similarly increased by over 3×, rising from ~15.5 GiB/s to just over 50 GiB/s. This linear gain reinforces the effectiveness of IBM Storage Ceph Object Gateway’s horizontal scalability for both read and write operations.

  • Latency Stability and Improvement:
    GET latency remained stable and improved as the cluster scaled. The twelve-node deployment exhibited lower latency (36 ms) compared to the eight-node deployment (52 ms), despite achieving significantly higher throughput. This indicates reduced contention and greater parallelism at scale.

  • CPU and Memory Resources:
    One of the key advantages of Ceph’s distributed architecture is its ability to amortize per-service resources as cluster size increases. With 4 MB objects and 1024 concurrent threads, our testing showed that RGW CPU and memory usage per daemon decreased as we added more nodes, even as throughput more than tripled. PUT operations, which are typically more resource-intensive due to erasure coding computation, saw RGW CPU usage decrease from ~9.2 cores per daemon at four nodes to just 5.7 cores at twelve nodes. Similarly, RGW memory usage decreased from ~1035 MiB to ~698 MiB over the same scale. GET operations showed comparable trends: memory declined by ~35% per RGW, and CPU remained consistently low.

This test demonstrates that adding nodes to an IBM Storage Ceph Cluster not only increases raw storage capacity but also proportionally enhances throughput and efficiency. For large-object workloads, including backups, media pipelines, or AI training set staging, this horizontal growth pattern is essential. With consistent configuration and moderate tuning, the IBM Storage Ceph Object Gateway (RGW) is capable of scaling linearly and predictably to meet growing performance demands.

Replica 3 vs EC 2+2: Object Size Impact on Performance

To better understand how replication and erasure coding compare across various object sizes, we tested IBM Storage Ceph RGW with both Replica 3 and EC 2+2 profiles on a four-node cluster, with SSL enabled at the gateway level. Ideally we would extend this comparison to twelve nodes, but because of time constraints, Replica 3 was only tested on four nodes. We plan to revisit these benchmarks with larger deployments once IBM Storage Ceph 9.0 becomes available, which introduces the Fast EC improvements.

Each scenario was tested using 512 client threads to avoid saturating the smaller, four-node cluster. Object sizes ranged from 64 KiB to 256 MiB to observe how replication and erasure coding scale across different I/O profiles.
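Again purely as an illustration (the actual harness and parameters are covered in Part 1), a PUT-side size sweep can be sketched as follows; the endpoint, credentials, and bucket are placeholder assumptions, and the bucket is presumed to sit on the data pool under test (Replica 3 or EC 2+2):

```python
# Illustrative only: sweep PUT throughput across object sizes against the RGW S3 API.
# Endpoint, credentials, and bucket are placeholders (see Part 1 for the real setup).
import os
import time
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

THREADS = 32                        # the tests above used 512 client threads in total
OBJECTS_PER_THREAD = 20
SIZES = {
    "64KiB": 64 * 1024,
    "4MiB": 4 * 2**20,
    "64MiB": 64 * 2**20,
    "256MiB": 256 * 2**20,
}

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",   # RGW S3 endpoint (placeholder)
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(max_pool_connections=THREADS),
)
BUCKET = "bench-put"   # created on the placement target / data pool under test

def put_worker(payload, prefix, thread_id):
    """Write OBJECTS_PER_THREAD copies of the payload under a per-thread prefix."""
    for i in range(OBJECTS_PER_THREAD):
        s3.put_object(Bucket=BUCKET, Key=f"{prefix}/t{thread_id}-{i}", Body=payload)

for name, size in SIZES.items():
    payload = os.urandom(size)                # incompressible data, generated once per size
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        futures = [pool.submit(put_worker, payload, name, t) for t in range(THREADS)]
        for f in futures:
            f.result()                        # propagate any exceptions
    elapsed = time.perf_counter() - start
    written_gib = THREADS * OBJECTS_PER_THREAD * size / 2**30
    print(f"{name}: {written_gib / elapsed:.2f} GiB/s PUT")
```

The GET side can be measured the same way by reading the uploaded keys back and timing each request, as in the earlier small-object sketch.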

Table 1. Relative Difference in Throughput and Latency: Replica 3 vs EC 2+2

Positive = Replica 3 performs better (higher throughput, lower latency)

| Object Size | PUT Throughput Increase | GET Throughput Increase | PUT Latency Improvement | GET Latency Improvement |
|-------------|-------------------------|-------------------------|-------------------------|-------------------------|
| 64 KiB      | +18%                    | +37%                    | +18%                    | +33%                    |
| 4 MiB       | +6%                     | +22%                    | +6%                     | +16%                    |
| 64 MiB      | +1%                     | +21%                    | +2%                     | +16%                    |
| 256 MiB     | +1%                     | +21%                    | +2%                     | +17%                    |

Analysis: Throughput and Latency Comparison

  • With smaller objects like 64 KiB, Replica 3 significantly outperformed EC 2+2, delivering 18% more PUT and 37% more GET throughput. This behavior aligns with expectations: encoding/decoding overhead in EC adds latency and computational cost, making Replica 3 better suited for high-ops, small-object workloads.

  • As object sizes grow, the throughput advantage of Replica 3 narrows. At 4 MiB, the differences remain notable but smaller, with Replica 3 delivering 22% higher GET throughput and 6% higher PUT throughput (see Table 1).

  • However, at very large sizes (64 MiB and 256 MiB), both schemes approach the network saturation point of a four-node cluster, causing throughput to plateau. As a result, the difference in performance becomes marginal, and neither scheme can fully exploit its architectural advantages. This makes the large-object results less reliable for comparing the efficiency of the two data protection schemes.

  • Latency trends mirror throughput. Replica 3 maintains lower PUT and GET latency across all sizes, but the relative improvement decreases with object size, from 33% faster GETs at 64 KiB to only ~16% at 64 MiB.

Looking Ahead: EC Performance Enhancements

The performance gap at small object sizes is expected to narrow in future releases. IBM Storage Ceph 9.0 (based on upstream Ceph Tentacle) introduces Fast EC, which addresses erasure coding padding inefficiencies and boosts small-object performance. These advances make erasure coding a more attractive option even for high-frequency, small-object workloads.

You can check all the details about the Fast EC improvements in the Ceph Day London Cephalocon talk slides or video. I’m adding a graph from the initial Fast EC benchmark results to whet your appetite:

Coming Up Next

In the next post, we will explore the trade-offs introduced by different security configurations — including TLS/SSL, SSE-S3, and msgr v2 — and how they affect Ceph Object Gateway (RGW) performance across small and large object workloads. We’ll also begin analyzing the optimal Ceph Object Gateway-to-CPU ratio as we delve deeper into real-world benchmarking results.

Stay tuned for Part 3.
