Operational Resilience: Strategies for High Availability in Large-Scale Infrastructure

Egor Karitskiy

In this article, we'll explore the concept of high availability in large-scale infrastructure. The concept itself is straightforward: it means keeping a complex system accessible despite unforeseen events, whether internal breakdowns or external factors. This is what we'll call high availability from here on. It's not just about availability in the narrow sense; it's about maintaining system functionality regardless of potential issues.

If we imagine the data centre organisation metaphorically as a layered "pie", high availability must be ensured at every layer, in both infrastructure and applications.

Data Center Level

High availability at the data centre level is achieved through geographic distribution, where servers are deployed across multiple geographically independent locations. The farther apart these locations are, the better the resilience of the system.

Within a data centre the same logic applies, because co-located servers partially share resources. Placing two servers in a single rack is suboptimal, while placing them in different racks nearby is better. Ideally, servers should be distributed across different rows or halls within the data centre.

Server Level

At the server level, high availability is addressed through disk failure management and redundancy mechanisms such as keeping spare disks in reserve and using RAID configurations. RAID, short for Redundant Array of Independent Disks, is a storage technology that distributes data across multiple drives within a single system. It comes in several configurations denoted by numbers such as RAID 0, RAID 1, or RAID 5, and each level offers its own blend of performance and fault tolerance depending on how it organises and allocates data across the drives. Mirrored and parity-based levels in particular preserve data integrity and availability when a disk fails.
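To make the fault-tolerance idea concrete, here is a minimal Python sketch of the XOR parity principle behind parity-based levels such as RAID 5. It is an illustration of the idea only, not how a real controller works: any lost block can be rebuilt by XOR-ing the surviving blocks with the parity block.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, one byte at a time."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Three data blocks striped across three disks, with parity kept on a fourth disk.
data = [b"disk-0-data!", b"disk-1-data!", b"disk-2-data!"]
parity = xor_blocks(data)

# Simulate losing disk 1 and rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```

A mirrored level such as RAID 1 takes the even simpler route of keeping a full copy of every block on a second drive.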

Network Level

High availability at the network level means stable and uninterrupted communication between servers and data centres. Redundancy measures may comprise additional rack-mounted switches for server-to-server communication and redundant spine and super-spine switches for communication between data centres.

While internal structured cabling networks are typically not redundant due to low risk, external optical communication lines are essential to ensure high availability. These lines, which extend from data centres, require non-intersecting independent routes, with two to three lines deployed to mitigate the risk of damage.

Middleware Level

High availability considerations extend to the application layer, particularly in middleware systems such as databases, cloud platforms, and containerisation platforms. Stateless applications, i.e. those that do not store data between sessions, can achieve high availability through automatic replication.
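As a rough sketch of why statelessness makes this easy (the replica endpoints and the failure simulation below are hypothetical), a client can simply retry the same request against any other replica, because no replica holds session data that the others lack:

```python
import random

# Hypothetical endpoints of a stateless service, all interchangeable.
REPLICAS = ["app-1.internal", "app-2.internal", "app-3.internal"]

def call_replica(host, request):
    """Stand-in for a real RPC; randomly 'fails' to simulate an outage."""
    if random.random() < 0.3:
        raise ConnectionError(f"{host} is unreachable")
    return f"{host} handled {request!r}"

def call_with_failover(request):
    """Because no replica holds session state, any replica can serve the request."""
    for host in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return call_replica(host, request)
        except ConnectionError:
            continue  # just move on to the next replica
    raise RuntimeError("all replicas are down")

print(call_with_failover("GET /menu"))
```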

However, for stateful applications like databases, a different approach is necessary. Ensuring availability involves geographic distribution across halls or data centres, along with redundant backups to protect against incidents. Automatic failover mechanisms and backup restoration processes further enhance availability in middleware systems.

This approach protects us from various calamities. For instance, if one data centre fails and leaves only two copies of the data, we face a dilemma when those copies contain conflicting information. Known as a "split-brain" scenario, this situation lacks a definitive judge to determine which data is correct. To address it, we need a third copy of the same data to act as a tie-breaker and restore data integrity. Similarly, if erroneous data accumulates in the database, even multiple copies offer no solution; in such cases, restoring from earlier backup copies becomes the only way to rectify the errors and return to normal operations.
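A minimal sketch of why that third copy matters: with three replicas, a strict majority can arbitrate between conflicting values, which two replicas alone cannot do. The values below are made up for illustration.

```python
from collections import Counter

def resolve_by_quorum(values):
    """Return the value held by a strict majority of replicas, or None if there is no majority."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None

# Two surviving replicas that disagree: no majority, a split-brain stalemate.
print(resolve_by_quorum(["balance=100", "balance=250"]))                  # -> None

# A third replica breaks the tie and restores a single source of truth.
print(resolve_by_quorum(["balance=100", "balance=250", "balance=100"]))   # -> balance=100
```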

Large-Scale vs. Modest Scale

When we speak about infrastructure management, we should always bear in mind that the scale of operations impacts the approach to high availability. At a modest scale, HA strategies are often implemented manually, following established best practices and operational standards. However, in large-scale environments, automation becomes indispensable.

Consider an example with a single accounting system where deploying three database servers and one backup server suffices. However, in environments with thousands, or even hundreds of thousands of servers, such approaches prove inadequate. Achieving high availability at scale requires fundamentally different methodologies.

While at smaller scales, HA implementation typically relies on manual interventions guided by established practices, large-scale infrastructures demand automation at every level, from fundamental processes to overarching strategies. Concepts like infrastructure as code (IaC) embody this paradigm shift and enable seamless management of vast server fleets. Notably, platform solutions like Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) further facilitate HA implementation in expansive environments.

Industry giants like Microsoft, Google, Amazon, DigitalOcean, and Meta (Facebook) illustrate the monumental scale at which HA solutions are deployed. Beyond tech giants, large banks and manufacturing companies also operate extensive server fleets that must provide high availability.

Business Continuity as the Only Imperative

Business continuity, or uninterrupted operations, is indispensable for all types of businesses. Irrespective of the industry, companies rely on continuous service delivery for their day-to-day activities. In this respect, applications and processes are categorised based on their criticality to business operations, ranging from mission-critical systems to productivity tools. This categorisation dictates the level of attention and resources allocated to ensure uninterrupted service delivery.

Procedures and Rules

Companies establish procedures and rules tailored to different application grades to maintain business continuity. For instance, mission-critical applications undergo stringent scrutiny and receive top-tier HA measures, while less critical applications may warrant more relaxed strategies. By categorising applications based on their criticality, organisations streamline their approach to HA management and provide effective response protocols.

Let us take an office food-ordering app as an example. While a temporary malfunction may inconvenience employees, the core operation of the canteen remains unaffected. On the other hand, if we take a marketplace's website or mobile app, its outage can bring operations to a halt, which emphasises its mission-critical nature. Consequently, less critical systems like the food-ordering app receive less attention and investment; they are kept functional with minimal fuss and resource allocation.

Strategies

Further on, we'll discuss strategies for providing HA in large-scale systems.

Redundancy and the CAP Theorem

When aiming to achieve high availability in a system, redundancy and the CAP theorem can be used as complementary strategies.

Redundancy involves duplicating critical components or data within the system. This can include redundant servers, network links, or data backups. By implementing redundancy, the system can tolerate failures without causing downtime or data loss.

The CAP theorem, also known as Brewer's theorem, can serve as an additional guide for HA design. It states that in a distributed system you can only achieve two out of three guarantees: Consistency, Availability, and Partition Tolerance.

Consistency ensures that readers receive the latest write or an error, while Availability guarantees that every request gets a response, even if it's not the latest data. Partition Tolerance means the system can still operate despite communication failures between its parts. During a network partition, the system must choose between maintaining consistency (by rejecting some operations) and ensuring availability (by proceeding with operations at the risk of inconsistency).

When we understand the trade-offs presented by the CAP theorem, we can use redundancy strategically to enhance availability and soften the impact of failures and network partitions. Redundancy helps us maintain system functionality and data integrity, while the CAP theorem guides the decisions that balance consistency against availability based on specific priorities.
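The toy model below (not any real database, and deliberately simplified) shows what that choice looks like in code: during a simulated partition, a consistency-first write refuses to proceed without a quorum, while an availability-first write accepts the data and risks divergence.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True

CLUSTER = [Replica("a"), Replica("b"), Replica("c")]

def write(key, value, prefer_consistency):
    reachable = [r for r in CLUSTER if r.reachable]
    quorum = len(CLUSTER) // 2 + 1
    if prefer_consistency and len(reachable) < quorum:
        # CP behaviour: better to fail the write than to let replicas diverge.
        raise RuntimeError("no quorum reached, rejecting write to stay consistent")
    # AP behaviour: write to whoever answers and reconcile later.
    for replica in reachable:
        replica.data[key] = value
    return f"written to {len(reachable)} of {len(CLUSTER)} replicas"

# Simulate a partition that cuts off two of the three replicas.
CLUSTER[1].reachable = CLUSTER[2].reachable = False

print(write("order:42", "paid", prefer_consistency=False))   # availability first
try:
    write("order:42", "paid", prefer_consistency=True)        # consistency first
except RuntimeError as err:
    print("CP mode:", err)
```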

Disaster-Recovery Sites

Another strategy is building disaster-recovery sites. It is especially popular among financial institutions and banks and involves establishing remote data centres prepared to take over critical operations in case of a catastrophic event. These backup facilities, though idle under normal circumstances, serve as a safeguard against data loss, which could have severe consequences for the institution and potentially even threaten the stability of a nation's financial system. The American Citibank is one example.

How Many Nines Are Needed

Additionally, the target level of reliability for a data centre is often expressed in terms of "nines". For example, aiming for "five nines" means 99.999% availability, i.e. a fault probability of just 0.001%, which leaves a budget of only a few minutes of downtime per year. However, it's important to understand that these figures are often used as marketing slogans. In reality, the calculation depends on the failure probabilities of the specific components and the context in which the system operates.
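A quick back-of-the-envelope sketch of what each extra nine buys in yearly downtime budget:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(2, 6):
    availability = 1 - 10 ** -nines                  # e.g. 3 nines -> 0.999
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines = {availability:.5%} -> about {downtime:,.1f} minutes of downtime per year")
```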

Furthermore, while reliability is a key metric, it's not the only factor to consider. Availability, the likelihood that a system will be accessible whenever it is needed, is equally or even more important. Since confusion often arises about the relationship between reliability, availability, and fault probability, these notions must be clearly distinguished.

Complex Solutions or Reliability-at-All-Levels Strategy

In practical terms, achieving high levels of reliability and availability requires a comprehensive evaluation of all system elements, including infrastructure, network, and processes impacting critical business functions. This assessment involves calculating probabilities, averaging them, and making informed decisions to enhance system resilience and mitigate risks. Ultimately, the goal is to strike a balance between reliability, availability, and cost-effectiveness to ensure the continuity and stability of critical business operations.

Instead of focusing on individual system components like data centres, we need to broaden the scope to include all elements influencing critical business operations. This comprehensive assessment involves evaluating the network infrastructure, human resources, and processes that impact system functionality. Through a systematic analysis of failures and tailored methodologies, we can derive probabilities and averages to estimate the availability, reliability, and likelihood of failure of each element.
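A minimal sketch of that kind of calculation, with made-up component figures: availabilities of components a request depends on in series multiply together, while a component with redundant copies fails only if every copy fails.

```python
from math import prod

def series(availabilities):
    """Chain of dependencies: the request fails if any single link fails."""
    return prod(availabilities)

def with_redundancy(availability, copies):
    """Redundant copies: the component fails only if every copy fails."""
    return 1 - (1 - availability) ** copies

network, database, application = 0.999, 0.995, 0.998   # made-up figures

print("single database:    ", round(series([network, database, application]), 5))
print("two database copies:", round(series([network, with_redundancy(database, 2), application]), 5))
```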

Subsequently, these findings support informed decision-making and strategic planning. This may involve investments to raise reliability levels or adjustments to reduce overreliance on redundant systems.

Moreover, this approach allows us to set aside superficial metrics, such as marketing slogans touting "5 nines", and instead focus on the comprehensive solutions that robust system resilience actually requires.

"Plywood Houses" Strategy

The "Plywood Houses" strategy gets its name from the plywood homes found in certain flood-prone areas like South America or parts of Africa. In these places, there are annual floods followed by fertile soil deposits. Instead of constantly moving away from their flooded homes, people build temporary shelters that can last for a year before getting washed away. This strategy helps them adapt to the environment without too much effort and resources.

A similar approach can be applied to IT systems by working with failure probabilities rather than against them. Suppose we have two independent systems, System A and System B. The probability of System A failing on its own is 20%, which can be written as P(A) = 0.20, and the probability of System B failing on its own is 10%, written as P(B) = 0.10.

To find the probability of both systems failing simultaneously, we can use the multiplication rule for independent events. This rule states that the probability of both events occurring is the product of their individual probabilities.

So, the probability of both System A and System B failing simultaneously can be calculated as:

P(A∩B)=P(A)×P(B)

Substituting the given probabilities:

P(A∩B)=0.20×0.10

P(A∩B)=0.02

This means there is a 2% chance that both System A and System B will fail at the same time.

Thus, companies adopt a probabilistic approach and strategically increase the redundancy of system components; by doing so, they mitigate the risk of simultaneous failures. This pragmatic strategy is particularly common in industries like cryptocurrency mining.
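Extending the same arithmetic (the 20% per-copy figure is purely illustrative), each additional independent copy multiplies the probability that everything fails at once by the per-copy failure probability:

```python
per_copy_failure = 0.20   # illustrative chance that one cheap "plywood" copy fails

for copies in range(1, 6):
    all_fail = per_copy_failure ** copies
    print(f"copies={copies}: {all_fail:.4%} chance that every copy fails at once")
```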

How to define the right strategy for your company

When crafting a reliability strategy for a company, we first need to recognise that reliability cannot be achieved through a single parameter or layer of the infrastructure. Instead, we need to consider all components of the system, as overall reliability is only as strong as its weakest link. Therefore, it's important to thoroughly analyse business processes and applications, cataloguing them to understand their specific reliability requirements. For each business process, consider the underlying infrastructure, middleware, and data centres, and make sure all of them have the necessary reliability measures in place.

I want to stress that investing in expensive infrastructure may not always be necessary; instead, one can explore alternatives such as cloud providers or virtual machines to achieve reliability at a lower cost. This is a valid option for a startup, for instance. Remember, there's no one-size-fits-all solution when it comes to reliability strategies, so you will need to tailor your approach to your company's unique needs and goals in any case.

Cost estimation and risks

Business processes are the backbone of profitability for any company, and it's essential to ensure that the cost of implementing reliability measures does not outweigh the profits generated by these processes.

The cost of reliability grows with its level, but not linearly; in fact, the cost of achieving higher reliability can grow exponentially. Therefore, you need to assess how much money your company is willing and able to invest in ensuring the reliability of its business processes. This analysis should consider factors such as available funds, cash flow, and anticipated returns on investment.

Based on this evaluation, decisions can be made regarding resource allocation to enhance the reliability and availability of business processes, infrastructure, and applications. In this respect, reliability should be viewed as one piece of the puzzle, with various potential solutions that must be carefully evaluated based on cost-effectiveness and alignment with business objectives.

Do not overcomplicate your approach

Resilience in data centres should not be overcomplicated. No solution is 100% reliable, and striving for absolute perfection can lead to unnecessary complexity. Achieving high reliability can be simple and straightforward, often through redundancy such as increased duplication. For instance, keeping five reserve copies in less reliable storage instead of two can improve reliability without overcomplicating the approach. There are multiple paths to the desired result, and simplicity is often more effective than complexity.

Nothing is 100% reliable and your infrastructure is not either

To conclude, I would like to emphasise that achieving 100% reliability in your infrastructure is unrealistic. Despite your best efforts, no approach guarantees absolute reliability. You can implement various measures like redundant servers and regional data processing centres, but even these won't ensure perfect reliability. Sometimes businesses fail to grasp this reality and demand absolute reliability for critical applications, insisting that they must function flawlessly under any circumstance, even in the event of a nuclear war. It's crucial to educate them that such expectations are unrealistic. Instead, we must stress the importance of evaluating every risk component and acknowledging that a high level of reliability comes at a significant cost. Businesses need to be prepared for the expenses associated with maximising reliability.
