Ensuring stable uptime and handling traffic spikes gracefully requires a firm understanding of the behaviour profile of your traffic. How your users behave directly correlates with how your traffic profile is shaped and subsequently how your application needs to scale to react to changes in that traffic profile.

Are you an e-commerce site ahead of Black Friday? It might be time to scale up. Are you a local meal delivery service and lunch time is approaching? It might be time to scale up. Are you a ride-hailing service and the major sports game is about to finish at the local arena? It might be time to scale up. Are you a tool used exclusively during business hours and the end of the day in your primary market has arrived? It might be time to scale down.

It’s necessary to understand the magnitude and velocity of the traffic changes your application will encounter. To an extent, you can predict this just by knowing the market well. However, it’s best to always have monitoring in place to verify such predictions and back them with evidence.

There are three major considerations when building for reliability: stability, efficiency, and adaptability. In this blog series, we’ll take a look at each of these considerations. In today’s blog, we’ll start by homing in on stability.

On the stability front, you need to be sure your application behaves in a predictable way, keeping deviations like errors as infrequent as possible. Stability is what makes your business success measurable and repeatable. If costs are inconsistent, then measuring them won’t provide you with much useful insight.

Everyone wants the peace of mind that comes from knowing their application will run consistently. This sort of stability requires a focus on determinism and consistency of application behaviour. A deep understanding of what your application does and why goes a long way here. You want to not just know that a feature exists but also to understand what is involved in delivering that feature: what it costs in CPU time, network traffic, memory, and data storage.

Don’t fail to account for failures

Errors, exceptions, and any other sort of failure that can surface in your application should be treated as an expected possibility. As such, they should be planned for, with their impact on performance factored into the cost of running your application.

Failure rates should be measured and known. They should also be aggressively reduced by tracking and fixing failures wherever possible. An exception is by definition an exceptional case which prevents determinism in your application behaviour. Eliminating these inconsistencies is a foundational part of achieving a consistent performance profile.
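
As a rough illustration of what "measured and known" can look like in code, here is a minimal in-memory sketch. The `FailureRateTracker` class and the `checkout` operation name are hypothetical, and a real deployment would more likely export these counts to a metrics system than keep them in process memory.

```javascript
// Hypothetical helper: counts outcomes per operation so a failure rate can be reported.
class FailureRateTracker {
  constructor() {
    this.counts = new Map(); // operation name -> { total, failed }
  }

  // Record one finished call and whether it succeeded.
  record(operation, succeeded) {
    const entry = this.counts.get(operation) ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (!succeeded) entry.failed += 1;
    this.counts.set(operation, entry);
  }

  // Fraction of recorded calls that failed (0 when nothing was recorded).
  rate(operation) {
    const entry = this.counts.get(operation);
    return entry ? entry.failed / entry.total : 0;
  }
}

const tracker = new FailureRateTracker();
tracker.record('checkout', true);
tracker.record('checkout', true);
tracker.record('checkout', false);
tracker.record('checkout', true);
console.log(tracker.rate('checkout')); // 0.25
```

Tracking the rate per operation rather than globally makes it easier to see which feature is responsible when the number moves.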

Make sure your errors are logged with an appropriate level using a logging library such as pino, preferably in an easily consumable format such as JSON.
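
To make "an appropriate level" and "an easily consumable format" concrete, the sketch below builds a single JSON log line by hand. It is not pino and does not use pino's API. The numeric levels follow the convention pino uses (info 30, warn 40, error 50), while the `formatLog` and `serializeError` helpers and the `orderId` field are assumptions made for illustration. In a real application you would let the logging library do this work.

```javascript
// Numeric levels, following the convention pino uses.
const LEVELS = { info: 30, warn: 40, error: 50 };

// Turn an Error into a plain object so JSON.stringify keeps its details
// (on its own, a plain Error serializes to "{}").
function serializeError(err) {
  return { type: err.name, message: err.message, stack: err.stack };
}

// Build one JSON log line. Returns a string so the caller decides where it goes.
function formatLog(level, msg, fields = {}) {
  const { err, ...rest } = fields;
  return JSON.stringify({
    level: LEVELS[level],
    time: Date.now(),
    ...rest,
    ...(err ? { err: serializeError(err) } : {}),
    msg,
  });
}

const line = formatLog('error', 'payment failed', {
  orderId: 'o-123', // hypothetical context field
  err: new Error('card declined'),
});
console.log(line);
```

Because every line is a single JSON object with a numeric level, downstream tools can filter on `level >= 50` and group by `err.type` without parsing free text.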

💡 With appropriately categorized and easily consumable logs, it becomes much easier to keep track of your errors with tools such as our Command Center via the App Inspect tool.

Understand the access patterns of your data

A deep understanding of the patterns of how your application accesses your data has tremendous benefits when deciding how to optimize for performance. Is your application write-heavy, read-heavy, or a balance of the two?

A write-heavy application such as an event log tracker may choose to write data directly to the database with little to no enrichment or restructuring ahead of time, trading slower reads for faster writes.
A read-heavy application may choose to do more work upfront, such as denormalizing data into multiple records to avoid needing joins when reading the data.

If you understand the access patterns of your application well, you won’t be caught off-guard by application downtime caused by an interaction that turned out to be a lot more expensive than you thought.

Many application tracing tools such as Grafana Tempo have a Slow Queries tool to see at a glance which queries could benefit most from optimization. They typically also include histogram data per query, so you can see which queries have the greatest variance and could be tightened up to make performance more consistent.
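
If you only have raw timings to work with, a similar ranking can be approximated by hand. The sketch below is one possible approach, not how any particular tracing tool computes its histograms: it uses nearest-rank percentiles and sorts queries by the gap between p99 and p50. The query names and durations are made-up sample data.

```javascript
// Nearest-rank percentile over an ascending array of durations (ms).
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Rank queries by how far their slow tail (p99) sits from their median (p50).
function rankByVariance(samplesByQuery) {
  return Object.entries(samplesByQuery)
    .map(([query, durations]) => {
      const sorted = [...durations].sort((a, b) => a - b);
      const p50 = percentile(sorted, 50);
      const p99 = percentile(sorted, 99);
      return { query, p50, p99, spread: p99 - p50 };
    })
    .sort((a, b) => b.spread - a.spread);
}

// Made-up sample data: durations in milliseconds per named query.
const ranked = rankByVariance({
  getUser: [4, 5, 5, 6, 5],
  searchOrders: [10, 12, 400, 11, 15],
});
console.log(ranked);
```

Here `searchOrders` comes out on top: its median is close to `getUser`'s, but a single 400 ms outlier gives it a far wider spread, which is the kind of query worth investigating first.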

If a query has a lot of variance, it’s likely that either the inputs or the outputs vary a lot in size. Try using the SQL LIMIT and OFFSET keywords to paginate query output, and avoid queries that let unbounded user input add an arbitrary number of WHERE conditions.
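
One way to bound both sides is to clamp pagination values and accept filters only from a fixed set of columns. The sketch below assumes a hypothetical `orders` table, PostgreSQL-style `$1` placeholders, and arbitrary caps of 100 rows per page and an offset of 10,000. All of these are illustrative choices to adjust for your own schema and driver.

```javascript
const MAX_PAGE_SIZE = 100; // assumed cap on rows per page
const MAX_OFFSET = 10000; // assumed cap on how deep pagination may go
const ALLOWED_FILTERS = new Set(['status', 'customer_id']); // hypothetical columns

// Coerce untrusted input to an integer within [min, max], or use the fallback.
function clampInt(value, min, max, fallback) {
  const n = Math.trunc(Number(value));
  if (!Number.isFinite(n)) return fallback;
  return Math.min(Math.max(n, min), max);
}

// Build a parameterized query whose shape and result size are both bounded.
function buildOrdersQuery({ filters = {}, limit = 20, offset = 0 } = {}) {
  const where = [];
  const values = [];
  for (const [column, value] of Object.entries(filters)) {
    if (!ALLOWED_FILTERS.has(column)) {
      throw new Error(`Unsupported filter: ${column}`);
    }
    values.push(value);
    where.push(`${column} = $${values.length}`);
  }
  values.push(clampInt(limit, 1, MAX_PAGE_SIZE, 20));
  values.push(clampInt(offset, 0, MAX_OFFSET, 0));
  const text =
    'SELECT id, status, customer_id FROM orders' +
    (where.length ? ` WHERE ${where.join(' AND ')}` : '') +
    ` ORDER BY id LIMIT $${values.length - 1} OFFSET $${values.length}`;
  return { text, values };
}

console.log(buildOrdersQuery({ filters: { status: 'paid' }, limit: 5000, offset: -3 }));
```

Only allowlisted column names are ever interpolated into the SQL text, and every user-supplied value travels as a parameter, so a request for 5,000 rows is quietly reduced to 100.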

Know your query complexity

The way an application queries data can have a substantial impact on performance. If the application is allowed to function with unbounded query complexity, it could explode in cost in the wrong conditions. This is a well-known challenge with scaling GraphQL applications, which is why there are tools for limiting query complexity, graph depths, merging queries per level, and providing cursor-based pagination.
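
To show the idea behind a depth limit, the sketch below measures the nesting of a selection modelled as plain nested objects. This is a simplification chosen for illustration: a real GraphQL server would inspect the parsed query AST, usually through a limit offered by the server library, and the `selectionDepth` and `assertWithinDepth` names are hypothetical.

```javascript
// Depth of a selection modelled as nested objects, for example
// { user: { posts: { comments: { id: true } } } }. Leaf fields are `true`.
function selectionDepth(selection) {
  if (selection === null || typeof selection !== 'object') return 0;
  let deepest = 0;
  for (const child of Object.values(selection)) {
    deepest = Math.max(deepest, selectionDepth(child));
  }
  return 1 + deepest;
}

// Reject a selection before executing it if it nests deeper than allowed.
function assertWithinDepth(selection, maxDepth) {
  const depth = selectionDepth(selection);
  if (depth > maxDepth) {
    throw new Error(`Query depth ${depth} exceeds limit of ${maxDepth}`);
  }
  return depth;
}

const shallow = { user: { name: true } };
const deep = { user: { posts: { comments: { author: { name: true } } } } };

console.log(assertWithinDepth(shallow, 3)); // 2
try {
  assertWithinDepth(deep, 3);
} catch (err) {
  console.log(err.message); // Query depth 5 exceeds limit of 3
}
```

The check runs before any resolver does, so an overly deep query costs only a traversal of the request rather than a cascade of database calls.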

Mercurius solves a lot of the common problems associated with GraphQL queries. It’s a great option for building GraphQL APIs with Fastify. It also has a very helpful caching plugin which can help improve GraphQL query performance further.

Even in applications without GraphQL, it is still important to be aware of not just expected query complexity but also how requests could be crafted in a way which could magnify query cost. How your applications query data is a common attack surface, typically for data exposure but also often for denial of service attacks. Hardening your queries to reduce potential for unbounded variance benefits both security and stability.

SQL queries should avoid building JOIN clauses dynamically from user input. Such joins can create costly transformations and can result in full table scans when they touch columns your indexes were not designed for.
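
A common alternative is to map a small set of named options to JOIN fragments written and reviewed ahead of time. The sketch below assumes hypothetical `orders`, `customers`, and `shipments` tables and PostgreSQL-style placeholders. The point is that user input selects from a fixed list and never contributes SQL text of its own.

```javascript
// Hypothetical allowlist: each supported "include" maps to a JOIN written in advance.
const JOINS = new Map([
  ['customer', 'JOIN customers c ON c.id = o.customer_id'],
  ['shipment', 'JOIN shipments s ON s.order_id = o.id'],
]);

function buildOrderLookup(includes = []) {
  const requested = new Set(includes);
  for (const name of requested) {
    if (!JOINS.has(name)) throw new Error(`Unsupported include: ${name}`);
  }
  // Iterate the allowlist rather than the input, so the same set of includes
  // always produces the same SQL text regardless of order or duplicates.
  const joins = [...JOINS]
    .filter(([name]) => requested.has(name))
    .map(([, sql]) => sql);
  return ['SELECT * FROM orders o', ...joins, 'WHERE o.id = $1'].join(' ');
}

console.log(buildOrderLookup(['shipment', 'customer', 'shipment']));
```

With two allowed includes there are only four possible query shapes, which is few enough to check each one against your indexes.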

Index effectively

A lack of indexes can also contribute to query cost explosions.

Depending on the structure of the query, it could result in a full table scan which could have significant consequences on the responsiveness of your database. Scaling your application is a challenge of its own, but when you’re saturating the CPU time of your database, you could have even greater scaling challenges ahead. Make sure to not leave any performance on the table when crafting your queries and indexes.

Take a close look at your queries to identify which fields of a table are commonly queried together and consider if indexes should be created for those field sets. Indexes are a trade-off between performance and storage, so you don’t want to index excessively either. If a field combination is not used often or if the table is not large, it may be fine to leave it without an index.

Pool your connections

On the application side, you should take advantage of connection pools to maintain a consistent load level. If every request opens its own connection to the database, you can quickly mount a denial-of-service attack on your own database, taking down your whole application with it.

By keeping a pool of connections to the database and having your requests take turns using them, you can balance query load across requests. This keeps latencies consistent and keeps everything moving efficiently, without query processing starving the rest of the application of CPU time.
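
The "take turns" behaviour can be sketched in a few lines. Database drivers typically ship their own pool, which is what you should use in practice. The hypothetical `Pool` class below exists only to show the mechanism: a fixed set of resources, plus a queue of waiters that are served in order as resources are released.

```javascript
// Illustrative fixed-size pool. `resources` stands in for open connections.
class Pool {
  constructor(resources) {
    this.idle = [...resources]; // resources nobody is using right now
    this.waiters = []; // resolve callbacks of callers waiting for a turn
  }

  // Resolve immediately if a resource is idle, otherwise wait in line.
  acquire() {
    if (this.idle.length > 0) return Promise.resolve(this.idle.pop());
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  // Hand the resource to the longest-waiting caller, or park it as idle.
  release(resource) {
    const next = this.waiters.shift();
    if (next) next(resource);
    else this.idle.push(resource);
  }

  // Run `fn` with a resource and always give it back, even if `fn` throws.
  async use(fn) {
    const resource = await this.acquire();
    try {
      return await fn(resource);
    } finally {
      this.release(resource);
    }
  }
}

// Two "connections" shared by five concurrent tasks: at most two run at once.
const pool = new Pool(['conn-1', 'conn-2']);
for (let i = 1; i <= 5; i += 1) {
  pool.use(async (conn) => `task ${i} ran on ${conn}`).then(console.log);
}
```

However many requests arrive, the database never sees more than two concurrent queries from this process. The cost is that excess requests wait in memory, so production pools usually add a timeout on that wait.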

Pooling is one method for dealing with a common software architecture problem called head-of-line blocking, where later tasks are held up because the resources they need are still busy processing earlier tasks. On a single connection, per-task overhead is easier to control, and commands can also be batched or pipelined to reduce round trips. Pipelining with Redis/Valkey is a great strategy for getting good caching performance.
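
Batching can be sketched without any particular client library. The hypothetical `Batcher` below collects every key requested in the same tick and hands them to a single `sendBatch` call, which stands in for one round trip to a cache or database. It illustrates the batching idea only and is not the pipelining API of any Redis or Valkey client.

```javascript
// Groups individual lookups made in the same tick into one batched call.
class Batcher {
  // `sendBatch(keys)` must return (or resolve to) results in the same order as `keys`.
  constructor(sendBatch) {
    this.sendBatch = sendBatch;
    this.queue = [];
    this.scheduled = false;
  }

  // Queue a lookup; the first call in a tick schedules a flush for the end of it.
  load(key) {
    return new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (!this.scheduled) {
        this.scheduled = true;
        queueMicrotask(() => this.flush());
      }
    });
  }

  // Send everything queued so far in a single call and settle each caller.
  async flush() {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    if (batch.length === 0) return;
    try {
      const results = await this.sendBatch(batch.map((item) => item.key));
      batch.forEach((item, i) => item.resolve(results[i]));
    } catch (err) {
      batch.forEach((item) => item.reject(err));
    }
  }
}

// Stand-in for a cache round trip: records each call so the batching is visible.
const roundTrips = [];
const batcher = new Batcher(async (keys) => {
  roundTrips.push(keys);
  return keys.map((key) => `value-for-${key}`);
});

Promise.all([batcher.load('a'), batcher.load('b'), batcher.load('c')]).then((values) => {
  console.log(values, `round trips: ${roundTrips.length}`);
});
```

Three lookups become one round trip, so the per-request network overhead is paid once rather than three times.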

Wrapping up

By prioritizing stability, you can lay a strong foundation for your Node.js application. This involves understanding your traffic patterns, proactively handling errors, and optimizing database queries.

In the next part of this series, we'll delve into the importance of efficiency in Node.js applications. We'll explore techniques like caching and sharing session data. Stay tuned!

💡Want to discover how Platformatic can help you auto-scale your Node.js applications, monitor key Node.js metrics, and minimize application downtime risk? Check out our Command Center.

Building a Reliable Node.js Application | Part I