🚀 Scaling a DAG-based Workflow Engine - A Journey in Backend Optimization and Scalability!


In recent weeks, my team and I uncovered a performance problem when running complex workflows in API Builder, our DAG-based workflow engine within APIwiz. Workflows were taking 9-11 seconds end-to-end, far too slow for real-time needs. Over a six-week sprint, we tackled these pain points and achieved a ~90% reduction in workflow latency and a 9x increase in Transactions Per Second (TPS), all while maintaining error-free performance. Here’s how we did it.
🔧 What is API Builder?
API Builder is a workflow engine built on Directed Acyclic Graphs (DAG) that lets users create powerful workflows with pluggable nodes - API callouts (REST, GraphQL, SOAP), scripting, DB access, assertions, conditions, and more. Workflows can be triggered via REST, GraphQL, Kafka, or even scheduled jobs. It’s backed by Java Spring Boot + MongoDB, and includes a rich events section for real-time tracking.
A Directed Acyclic Graph (DAG) is a finite graph with directed edges and no cycles, meaning it flows in one direction without looping back. In the context of workflow engines, DAGs are used to model processes where each node represents a task, and the edges define the order of execution.
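To make that concrete, here is a toy sketch of a workflow DAG executed in dependency order (Kahn's topological sort). The node names and structure are purely illustrative - this is not API Builder's actual engine or data model.
//Illustrative DAG of workflow tasks, executed in topological order (node names are hypothetical)
import java.util.*;

public class DagSketch {
    public static void main(String[] args) {
        // Each task maps to the tasks that depend on it (the directed edges)
        Map<String, List<String>> edges = Map.of(
                "fetchInput", List.of("transform"),
                "transform", List.of("callRestApi", "runScript"),
                "callRestApi", List.of("assertResult"),
                "runScript", List.of("assertResult"),
                "assertResult", List.of()
        );

        // Count incoming edges per node, then repeatedly run nodes whose dependencies are done
        Map<String, Integer> inDegree = new HashMap<>();
        edges.keySet().forEach(n -> inDegree.putIfAbsent(n, 0));
        edges.values().forEach(targets -> targets.forEach(t -> inDegree.merge(t, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((node, deg) -> { if (deg == 0) ready.add(node); });

        while (!ready.isEmpty()) {
            String node = ready.poll();
            System.out.println("Executing task: " + node);
            for (String next : edges.getOrDefault(node, List.of())) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
    }
}
Because the graph has no cycles, this always terminates and every task runs only after the tasks it depends on have finished.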
About APIwiz
APIwiz is a low‑code, vendor‑agnostic APIOps and API management platform that centralizes every stage of the API lifecycle—from collaborative design and automated testing to real‑time security linting, governance, and built‑in monetization—across any gateway or cloud environment. Learn more at apiwiz.com.
📉 The Problem
Initially built as a Proof of Concept (POC), things were smooth… until we tried scaling. On high-load stress tests with complex workflows (involving REST calls, scripting, and data transformations), latency shot up to 9-11 seconds, and TPS (Transactions Per Second) tanked. Our target? Bring it down to ~1s latency and a healthy TPS ramp.
🔍 Our Optimization Journey:
Streamlining Data Access
We realized that each workflow execution made multiple DB reads across different collections - and every DB round trip is expensive.
So, we redesigned the data model to consolidate required data into a single document. Now, only one DB call is needed to fetch the entire execution context.
Impact: This drastically cut down I/O overhead and MongoDB latency.
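As a rough illustration of the pattern - the document shape, field names, and repository below are hypothetical, not API Builder's actual schema - a consolidated execution context in Spring Data MongoDB might look like this:
//Illustrative consolidated execution-context document (fields and names are hypothetical)
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;
import java.util.List;
import java.util.Map;

@Document("workflow_execution_context")
public class WorkflowExecutionContext {
    @Id
    private String workflowId;
    private List<Map<String, Object>> nodes;        // node definitions, previously read separately
    private Map<String, Object> connectionConfigs;  // REST/DB endpoint configs, previously read separately
    private Map<String, Object> environmentVariables;
    // getters and setters omitted for brevity
}

interface WorkflowExecutionContextRepository extends MongoRepository<WorkflowExecutionContext, String> {
    // A single findById(workflowId) now returns everything the engine needs to run the workflow
}
With this shape, one findById(workflowId) call loads the whole execution context, where previously each of those sections required its own query.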
Embracing Java 21 Virtual Threads for Enhanced Parallel Processing
Workflow nodes often run in parallel. Previously, we relied on platform threads, which exhausted quickly and suffered from context switching overhead.
Switching to virtual threads was a game-changer: the JVM manages them efficiently, unmounting them from their carrier threads during blocking IO (like DB or HTTP calls) so that other tasks can keep processing.
Impact: Reduced thread contention and CPU usage, and also reduced context switching overhead.
Yes, it increased memory usage due to per-thread stacks, but it was well worth the tradeoff for better scalability.
💡 Pro Tip: Virtual threads excel in IO-heavy workloads, scaling to millions of concurrent tasks with minimal OS‐thread overhead.
//Sample Virtual thread execution
import java.util.ArrayList;
import java.util.List;

public class VirtualThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("Starting virtual threads...");
        List<Thread> threads = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            int taskId = i;
            threads.add(Thread.startVirtualThread(() -> {
                System.out.println("Task " + taskId + " running in " + Thread.currentThread());
                try {
                    Thread.sleep(500); // Simulates blocking IO; the carrier thread is released while sleeping
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("Task " + taskId + " done");
            }));
        }
        // Virtual threads are daemon threads, so wait for them before main exits
        for (Thread t : threads) {
            t.join();
        }
    }
}
//Output
Starting virtual threads...
Task 1 running in VirtualThread[#21]/runnable@...
Task 2 running in VirtualThread[#22]/runnable@...
Task 3 running in VirtualThread[#23]/runnable@...
Task 1 done
Task 2 done
Task 3 done
Boosting HTTP Performance: Connection Pooling for HTTP Calls
Previously, every REST call opened a fresh connection - costly!
We added an HTTP connection pool - reusing connections instead of re-initializing them. No repeated TCP/SSL handshakes = faster executions.
Connection pooling in the context of HTTP calls refers to the practice of managing and reusing HTTP connections to improve performance and resource utilization. Instead of establishing a new connection for each request, a pool of pre-established connections is maintained and reused.
Impact: The time taken for each HTTP request is reduced, and CPU usage is decreased.
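As a sketch of the idea - assuming Apache HttpClient 5 with Spring's RestTemplate, and with illustrative pool sizes rather than our production values - a pooled client can be wired up like this:
//Illustrative HTTP connection pooling (Apache HttpClient 5 + Spring RestTemplate; pool sizes are examples)
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class PooledRestClientConfig {

    public RestTemplate pooledRestTemplate() {
        // Keep a pool of reusable connections instead of opening a new one per request
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(200);           // example: total connections across all hosts
        pool.setDefaultMaxPerRoute(50);  // example: connections per target host

        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(pool)
                .build();

        // Requests made through this RestTemplate reuse pooled connections,
        // avoiding repeated TCP and TLS handshakes
        return new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));
    }
}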
Smarter Logging: Asynchronous and Message Queue Integration
Earlier, we offloaded logs via Spring's @Async, but that work still ran on the same app node and hit MongoDB directly - expensive, memory-heavy, and eventually a bottleneck. So we offloaded logging to a dedicated microservice via a message queue. Now the main app just pushes log events, and a lightweight consumer handles the DB writes. Cleaner, faster, and scalable.
Impact: The main pipeline remains efficient and lean, ensuring fast processing.
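A minimal sketch of the publishing side, assuming Kafka as the message queue and Spring's KafkaTemplate - the topic name and payload shape are hypothetical, and any broker would serve the same purpose:
//Illustrative log offloading via a message queue (Kafka assumed; topic and payload are hypothetical)
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class WorkflowLogPublisher {

    private static final String LOG_TOPIC = "workflow-execution-logs"; // hypothetical topic name

    private final KafkaTemplate<String, String> kafkaTemplate;

    public WorkflowLogPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // The workflow engine only publishes the event; a separate consumer
    // microservice reads from the topic and performs the MongoDB writes.
    public void publish(String executionId, String logJson) {
        kafkaTemplate.send(LOG_TOPIC, executionId, logJson);
    }
}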
Managing Virtual Threads: Preventing Memory Hogging/Explosion
With all optimizations, we hit a new issue: virtual threads weren’t getting cleaned up after IO stalls.
Fix: Introduced an async web task layer with a 60s timeout, ensuring stale threads don’t hog memory.
Impact: Memory usage optimized, preventing stale threads.
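For illustration, here is one way such a timeout layer can be expressed in Spring MVC using a WebAsyncTask; the endpoint, request shape, and handler below are hypothetical, not API Builder's actual controller:
//Illustrative async request layer with a 60s timeout (endpoint and handler are hypothetical)
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.context.request.async.WebAsyncTask;

@RestController
public class WorkflowExecutionController {

    private static final long TIMEOUT_MS = 60_000L; // stale executions are cut off after 60 seconds

    @PostMapping("/workflows/execute")
    public WebAsyncTask<String> execute(@RequestBody String workflowRequest) {
        WebAsyncTask<String> task = new WebAsyncTask<>(TIMEOUT_MS,
                () -> runWorkflow(workflowRequest)); // runs on the configured async executor

        // When the timeout fires, the container completes the request and the
        // underlying thread can be reclaimed instead of hogging memory.
        task.onTimeout(() -> "Workflow execution timed out");
        return task;
    }

    private String runWorkflow(String workflowRequest) {
        // placeholder for the actual DAG execution
        return "done";
    }
}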
📈 The Result?
~1 second latency under the same stress conditions.
Increased Transactions Per Second (TPS) from 85 TPS to 761 TPS (a 9x increase).
Stable performance across long test durations under sustained load.
Improved stability & resource utilization.
Before optimization (JMeter):
Latency: Average 11,562 ms | Median 6,376 ms | 90th percentile 24,919 ms | 99th percentile 25,749 ms | 0% errors
Throughput: 85.6/sec
Note: These numbers are from a warmed‑up setup running 85 TPS sustained over 100,000 requests.
After optimization (JMeter):
Latency: Average 1,295 ms | Median 683 ms | 90th percentile 3,098 ms | 99th percentile 4,741 ms | 0% errors
Throughput: 761.5/sec
Note: These results are from a fully warmed-up setup running at a constant 760 TPS over 100,000 requests. Under such extreme parallel load, it’s normal to see a small number of spikes in response times - our 90th, 95th, and 99th percentiles stretch into the 3-5 s range. Crucially, however, the average response time stays at 1.3 s and the median at 0.7 s, meaning the typical request completes in under a second.
💡 Key Takeaways
This project reinforced some crucial engineering principles:
Regular performance reviews catch regressions early.
Data access patterns are often the first place to look for optimizations.
Consolidated data models reduce I/O overhead.
New language features (like virtual threads) can provide significant improvements.
Consider the entire system - ancillary components like logging and monitoring can create unexpected bottlenecks.
Conclusion
Our efforts to optimize and scale API Builder, our DAG-based workflow engine, delivered a ~90% reduction in end-to-end execution time and a 9x increase in throughput, all without a single error under sustained load.
By consolidating data access, leveraging Java 21’s virtual threads, implementing connection pooling, offloading logs, and making strategic design and architecture changes, we not only achieved our target metrics but also enhanced the platform's resilience and maintainability.
This journey was a significant win in terms of performance, scalability, and backend design maturity, showcasing the importance of architectural thinking, JVM tuning, and systems design.
Curious about how APIwiz can streamline your workflows and learn more about APIs? Head over to apiwiz.com.
TL;DR
What & Why: API Builder is a DAG‑based workflow engine (Java Spring Boot + MongoDB) for creating pluggable-task pipelines; the initial POC design suffered high latency (9–11 s) and low TPS under load.
Key Optimizations:
Data Access: Consolidated execution context into a single MongoDB document to eliminate multiple reads.
Concurrency: Migrated to Java 21 virtual threads to cut context‑switch overhead in IO‑heavy parallel tasks.
HTTP Calls: Introduced connection pooling to avoid repeated SSL handshakes.
Logging: Offloaded log writes to a separate microservice via message queue instead of in‑app MongoDB writes.
Thread Management: Added a 60s timeout layer to clean up stale virtual threads.
Results: Achieved ~1s latency, 9× TPS increase, and stable long‑run performance.
Takeaways: Regular performance reviews, optimized data patterns, new language features (virtual threads), and mindful design of ancillary systems (logging/monitoring) are crucial for scalable backends.
💬 Would love to hear from anyone who's tackled similar backend scale issues. Let's talk architecture, Java, and optimization!
#backend #performanceengineering #java21 #virtualthreads #systemdesign #springboot #mongodb #architecture #developers #scalability #workflowengine #APIBuilder #Apiwiz