Mastering Disaster Recovery - Part 1: Seven Levels

Srigovind Nayak

When discussing business continuity plans, it's important to understand the concepts of high availability (HA) and disaster recovery (DR). High availability is a system's ability to keep running when an individual component fails, by eliminating single points of failure. However, HA alone is not sufficient. Organisations must also have a robust disaster recovery strategy to quickly restore infrastructure and data with minimal data loss in the event of a disruption.

In this blog, I will provide an overview of disaster recovery and introduce the seven levels of disaster recovery, setting the stage for a deeper exploration in future blogs.

Disaster Recovery

Disaster recovery is the practice of maintaining or re-establishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or a cyberattack. It's essential for keeping the critical parts of a business functioning despite significant disruptive events. Effective disaster recovery requires well-thought-out policies, procedures, and tools to ensure business continuity.

Measuring Data Loss and Recovery Time

In the event of a disaster, an organisation's primary goal is to restore all systems rapidly while minimising data loss. These objectives are quantified as Recovery Time Objective (RTO) and Recovery Point Objective (RPO):

  • Recovery Time Objective (RTO): The maximum acceptable duration for restoring infrastructure and data so that business operations can resume.

  • Recovery Point Objective (RPO): The maximum acceptable amount of data loss, measured as the time between the last recoverable copy of the data and the point of the disaster.

[Figure: schematic representation of the terms RPO and RTO. The example shows longer 'actual' times that do not meet either of the objectives (RPO or RTO).]
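
To make the two terms concrete, here is a minimal Python sketch that compares what actually happened in an incident against the objectives. The function name `evaluate_recovery`, the timestamps, and the 4-hour RPO and 2-hour RTO are all made up for illustration; they are assumptions, not values from any real system.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_good_copy, disaster, service_restored, rpo, rto):
    """Compare the actual data loss and downtime of an incident with the objectives."""
    # Data written after the last recoverable copy is lost.
    data_loss = disaster - last_good_copy
    # The business is down from the disaster until service is restored.
    downtime = service_restored - disaster
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: last backup at 02:00, disaster at 09:30, service back at 15:30.
result = evaluate_recovery(
    last_good_copy=datetime(2024, 1, 10, 2, 0),
    disaster=datetime(2024, 1, 10, 9, 30),
    service_restored=datetime(2024, 1, 10, 15, 30),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(result["data_loss"], result["rpo_met"])  # 7:30:00 False
print(result["downtime"], result["rto_met"])   # 6:00:00 False
```

In this made-up incident both actual values exceed their objectives, which is the situation a disaster recovery plan is meant to prevent.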

The Need for a Secondary Site

A secondary location equipped with comparable infrastructure (computing resources, storage, and networking) is necessary, particularly when the primary site is not immediately recoverable. The data restored at this secondary site is what allows business operations to continue.

States of Infrastructure and Data Layers

Each layer of the secondary site, both the infrastructure and the data, can be either active or passive. For instance, the compute, network, and storage may be active, but if the site lacks the data (or state) it needs to take over from the primary site, that data must be restored first. In this scenario, the data layer is passive, and the restore time adds directly to the recovery time during a disaster.
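
The effect of passive layers on recovery time can be sketched roughly as follows. This is a simplified model, not a sizing method: the layer names, the `ACTIVATION_TIME` figures, and the assumption that passive layers are brought up one after another are all hypothetical.

```python
from datetime import timedelta

# Hypothetical time needed to bring each layer from passive to active.
ACTIVATION_TIME = {
    "compute": timedelta(minutes=30),
    "network": timedelta(minutes=15),
    "storage": timedelta(minutes=20),
    "data": timedelta(hours=5),  # e.g. restoring data from backups
}


def estimated_recovery_time(layer_states):
    """Add up the activation time of every layer that is still passive.

    Assumes the layers are activated sequentially, which is a pessimistic
    simplification; in practice some steps can run in parallel.
    """
    return sum(
        (
            ACTIVATION_TIME[layer]
            for layer, state in layer_states.items()
            if state == "passive"
        ),
        timedelta(),
    )


# Infrastructure is running at the secondary site, but the data is not there yet.
secondary_site = {
    "compute": "active",
    "network": "active",
    "storage": "active",
    "data": "passive",
}
print(estimated_recovery_time(secondary_site))  # 5:00:00
```

Even with every infrastructure layer active, the passive data layer alone accounts for the whole estimated recovery time in this example.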

Considerations for Your Disaster Recovery Plan (DRP)

To effectively establish a DRP, businesses must discuss their domain-specific needs to determine appropriate RPO and RTO requirements. For example, banks typically require very low RPO and RTO, aiming for minimal downtime, whereas a university or research organisation might tolerate some data loss and a longer recovery period.

From Backups to Continuous Data Replication: The 7 Tiers of Disaster Recovery

Achieving the desired RPO and RTO goals involves understanding the different levels of disaster recovery, ranging from Level 0 to Level 6. Each successive level offers better data protection and faster recovery, at increasing cost and complexity.

  1. Level 0 - No Off-Site Data: This basic level involves storing data exclusively on-site, without off-site backups. It's the cheapest option but carries the highest risk of total data loss in case of on-site disasters. Acceptable only for small, non-critical setups.

  2. Level 1 - Backup Tapes Off-Site: Involves backing up data to magnetic tapes stored off-site. It's a more secure option than Level 0 but can be slow in data recovery. Suited for institutions where data recovery speed is not a critical factor.

  3. Level 2 - Disk Backup Off-Site: Faster recovery is possible as data is backed up onto disk-based systems off-site. It’s more expensive than tape backups but allows for more frequent backups. Suitable for medium-sized businesses prioritising recovery speed.

  4. Level 3 - Electronic Vaulting: Data is sent in batches to an off-site location at regular intervals. It strikes a balance between backup frequency and costs, ideal for organisations with moderate data-change rates.

  5. Level 4 - Point-in-Time Copies: Offers frequent snapshots of data, providing multiple recovery points. This level is storage-intensive and ideal for businesses with high transaction rates or those maintaining critical systems.

  6. Level 5 - Transaction Integrity: Ensures all transactions are captured up to the point of failure, offering high data integrity. It's technically complex and ideal for setups where transactional consistency is crucial, like financial institutions.

  7. Level 6 - Zero or Near-Zero RPO: Provides continuous data protection with almost instantaneous recovery and minimal data loss. It's the most sophisticated and costly solution, suitable for large enterprises or critical government systems.
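
For the levels that copy data periodically (backups, batches, or snapshots), the copy schedule puts a floor under the achievable RPO. The sketch below uses a simplified model that is my own assumption: a copy only counts once it has fully arrived off-site, so the worst-case data loss is the interval between copies plus the time a copy takes to transfer. The function names and the example schedules are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval, transfer_time):
    """Worst case: the disaster strikes just before the next copy finishes arriving."""
    return interval + transfer_time


def meets_rpo(interval, transfer_time, rpo):
    """Check whether a periodic copy schedule can satisfy the given RPO."""
    return worst_case_data_loss(interval, transfer_time) <= rpo


rpo = timedelta(hours=4)

# Hypothetical nightly off-site backup that takes 2 hours to transfer.
nightly = worst_case_data_loss(timedelta(hours=24), timedelta(hours=2))
print(nightly.total_seconds() / 3600)  # 26.0 (hours)
print(meets_rpo(timedelta(hours=24), timedelta(hours=2), rpo))  # False

# Hypothetical snapshot every 15 minutes that takes 1 minute to replicate.
print(meets_rpo(timedelta(minutes=15), timedelta(minutes=1), rpo))  # True
```

Under this model, a tighter RPO can only be reached by copying more often or transferring faster, which is why the higher levels cost more.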

Conclusion

In disaster recovery planning, accurately defining the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) is crucial for business resilience. These objectives set the targets for how much data a company can afford to lose and how quickly it must be back in operation after a disruption. Meeting them through an appropriate disaster recovery level involves a careful balance of costs and capabilities. A successful DR plan matches the organisation's risk tolerance and budget, so that the level of investment is proportional to the potential risks and impacts. In essence, a well-crafted DR plan protects critical business functions while staying in line with the organisation's financial strategy.

