Design Resilient Architecture Part-01

Varish Ansari · 6 min read

🧠 Quick Story: What is happening here?

  • You have Hadoop (big, distributed ETL workload).

  • You have LOTS of EC2 instances (50 per AZ!).

  • You want high availability (no single hardware failure should kill too many instances).


🔥 Which Placement Group Should You Pick?

✅ Partition Placement Group

Why?

  • It divides EC2 instances into partitions.

  • Each partition uses different hardware.

  • So, failures are isolated → if one partition fails, the others are safe!

  • Designed for distributed systems like Hadoop, Kafka, Cassandra.
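
To make this concrete, here's a minimal boto3 sketch (the group name, AMI ID, and instance type are placeholder assumptions) that creates a partition placement group and launches the fleet into it:

```python
import boto3

ec2 = boto3.client("ec2")  # assumes credentials and region are configured

# Create a partition placement group with 7 partitions (the per-AZ maximum).
# Each partition is placed on its own set of racks.
ec2.create_placement_group(
    GroupName="hadoop-pg",  # hypothetical name
    Strategy="partition",
    PartitionCount=7,
)

# Launch the Hadoop workers into the group. EC2 spreads the instances
# across the partitions, so a single rack failure takes out at most
# one partition's worth of workers.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m5.xlarge",         # placeholder type
    MinCount=50,
    MaxCount=50,
    Placement={"GroupName": "hadoop-pg"},
)
```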


❌ Wrong Options (and why)

| Option | Why Wrong? |
| --- | --- |
| Cluster Placement Group | For HPC (High-Performance Computing) workloads that need low latency, not for big distributed jobs. |
| Spread Placement Group | Only for a small number of instances (max 7 per AZ), not 50+ instances. |
| Both Spread and Partition | No, Spread does not fit here. Only Partition is correct. |

🎯 Final Shortcut to Remember

Hadoop = Partition Placement Group.

Simple one-line memory rule:

"Distributed Big Data ➑️ Partition Placement Group"


Here's a funny and super easy shortcut for remembering Cluster vs Spread vs Partition:


🎯 1. Cluster Placement Group

"Cluster = Close"

  • All instances packed closely together.

  • For: Super fast, low-latency communication (like HPC apps, tightly-coupled systems).

🧠 Memory trick:

"Cluster means close together like a tight friend group."


🎯 2. Spread Placement Group

"Spread = Separate"

  • Instances are far apart, on different hardware.

  • For: Protecting a small number of critical instances (max 7 per AZ).

🧠 Memory trick:

"Spread means spread out like introverts at a party."


🎯 3. Partition Placement Group

"Partition = Pieces"

  • Big groups divided into pieces (partitions).

  • Each partition is isolated → good for Hadoop, Kafka, and Big Data systems.

🧠 Memory trick:

"Partition means split into pieces for massive distributed workloads."


🔥 Final One-Liner Summary:

Cluster = Close together 🔥
Spread = Separate far away 🏃‍♂️🏃‍♀️
Partition = Pieces for Big Data 🗂️


🎯 What the company needs:

  • Failover to AWS quickly if their on-premises data center fails.

  • Minimal downtime (least waiting time).

  • Same data on-prem and AWS (uniform data).


❌ Why your selected option was wrong:

You selected a solution using AWS CloudFormation triggered by Lambda.
👉 Problem:
CloudFormation takes time to create EC2 instances, an ALB, etc. (could be minutes).
Not instant failover ❌, which increases downtime.


✅ Correct (Best) Solution:

  • Route 53 failover record
    (Automatically detects failure and switches traffic to AWS.)

  • Already running EC2 servers behind an Application Load Balancer in an Auto Scaling group
    (Not created after the failure; they are already up and running!)

  • AWS Storage Gateway
    (Keeps on-prem and AWS data in sync.)

👉 So no provisioning delay; failover is instant ✅.
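
As a rough sketch of the Route 53 piece (the hosted zone ID, domain names, and IP below are placeholders), the failover pair looks something like this in boto3: a health check watches the on-premises endpoint, and a SECONDARY alias record points at the already-running ALB.

```python
import boto3

route53 = boto3.client("route53")

# Health check that watches the on-premises endpoint. When it fails,
# Route 53 stops answering with the PRIMARY record.
hc = route53.create_health_check(
    CallerReference="onprem-hc-001",  # any unique string
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "app.onprem.example.com",  # placeholder
        "Port": 443,
        "ResourcePath": "/health",
    },
)

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000",  # placeholder hosted zone
    ChangeBatch={"Changes": [
        {   # PRIMARY: on-premises data center, guarded by the health check
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": "onprem-primary",
                "Failover": "PRIMARY",
                "TTL": 60,
                "HealthCheckId": hc["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        },
        {   # SECONDARY: alias to the ALB already serving the standby fleet
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": "aws-secondary",
                "Failover": "SECONDARY",
                "AliasTarget": {
                    "HostedZoneId": "Z35SXDOTRQ7X7K",  # ALB zone ID (us-east-1)
                    "DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": True,
                },
            },
        },
    ]},
)
```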


🧠 Easy memory trick:

Failover = Pre-Running + Load Balancer + Storage Sync 🚀
(Never launch servers during a disaster; they must already be running.)


📢 Simple Final Thought:

If the question asks for LEAST downtime, avoid any solution that creates new resources on the fly (such as CloudFormation templates triggered by Lambda).
Pre-built infrastructure wins every time! 🏆

🎯 What’s happening in the question:

  • Videos are saved on local EBS volumes attached to each EC2 instance.

  • When users log in, the Load Balancer sends them to different instances.

  • Problem: Each instance has different videos on its own EBS.
    (That's why users see a random subset of their videos each time.)


✅ Best solution:

You need shared storage that all EC2 instances can access together.
Two best options:

1. Amazon S3 (Best for Object Storage like videos)

  • Upload all videos to S3.

  • Modify app to read/write videos directly from S3.

  • ✅ S3 is scalable, reliable, and all instances can access it anytime.
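
For the S3 route, the only app change is reading and writing objects instead of local files. Here's a minimal sketch (bucket and key names are made up) that hands a user a short-lived link to a video:

```python
import boto3

s3 = boto3.client("s3")

# Any app server can generate this link for the same object, so users
# see their full video library no matter which EC2 instance the load
# balancer routed them to.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "user-videos", "Key": "user-123/vacation.mp4"},  # placeholders
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)
```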

2. Amazon EFS (Shared File System)

  • Mount EFS to all EC2 instances.

  • Migrate existing videos from EBS to EFS.

  • App can access videos like normal files but from shared storage.
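
For the EFS route, provisioning boils down to one file system plus one mount target per AZ. A rough boto3 sketch, assuming placeholder subnet and security group IDs:

```python
import boto3

efs = boto3.client("efs")

# Create the shared file system once.
fs = efs.create_file_system(
    CreationToken="shared-videos",  # idempotency token, hypothetical
    PerformanceMode="generalPurpose",
)

# One mount target per AZ; every instance in that AZ then mounts the
# same file system over NFS (the security group must allow TCP 2049).
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",      # placeholder
    SecurityGroups=["sg-0123456789abcdef0"],  # placeholder
)

# Each EC2 instance then mounts it, e.g.:
#   sudo mount -t nfs4 <file-system-id>.efs.<region>.amazonaws.com:/ /mnt/videos
```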


❌ Why other options are wrong:

| Option | Why Wrong |
| --- | --- |
| S3 Glacier Deep Archive | It's cold storage (very slow retrieval, meant for archival backups), not for actively served videos. ❌ |
| Amazon RDS | Relational database (for structured data like users, orders); not good for big videos. ❌ |
| DynamoDB | NoSQL database (for key-value or document data); not meant for storing big video files. ❌ |

🧠 Easy memory trick:

For storing videos, images, and files ➔ use S3 or EFS, NOT databases.


📢 Simple Final Thought:

Block storage (EBS) is tied to a single server.
Shared storage (S3 or EFS) is visible to all servers.

When users need to access the same content regardless of server, always think S3 or EFS first! 🚀

🏬 What’s happening in the question:

A retail company wants to test a blue-green deployment for its global app within 48 hours, just before a major sales event (Thanksgiving).

💡 Most users access the app via mobile devices, which often cache DNS records, making DNS-based changes slow to propagate.


✅ Best answer:

Use AWS Global Accelerator to distribute a portion of traffic to a particular deployment


✅ Why it's correct:

  • AWS Global Accelerator lets you shift traffic instantly between deployments without waiting for DNS changes to propagate.

  • Perfect for mobile clients, which cache DNS and may not update quickly.

  • Provides global coverage and near-instant routing updates.

  • Ideal for testing a new "green" version while keeping the "blue" live.
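
For a sense of how the traffic shift works in code, here's a minimal boto3 sketch (all ARNs are hypothetical). Adjusting the endpoint weights takes effect in seconds, with no DNS involved, because clients keep connecting to the accelerator's static anycast IPs:

```python
import boto3

# The Global Accelerator API is served from us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Hypothetical ARNs for the endpoint group and the two ALBs behind it.
ENDPOINT_GROUP_ARN = "arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE"
BLUE_ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/blue/EXAMPLE"
GREEN_ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/green/EXAMPLE"

# Shift 10% of traffic to the green deployment; the remaining 90%
# keeps flowing to blue. Reverting is just another weight update.
ga.update_endpoint_group(
    EndpointGroupArn=ENDPOINT_GROUP_ARN,
    EndpointConfigurations=[
        {"EndpointId": BLUE_ALB_ARN, "Weight": 90},
        {"EndpointId": GREEN_ALB_ARN, "Weight": 10},
    ],
)
```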


❌ Why the other options are wrong:

| Option | Why It's Not Ideal |
| --- | --- |
| Amazon Route 53 (DNS weighted routing) | ❌ DNS caching delays updates; users might still hit the old version even after routing changes. |
| Elastic Load Balancer (ELB) | ⚠️ ALBs can do blue-green via weighted target groups, but only within a single Region, so it's not great for global users. |
| AWS CodeDeploy | ❌ Used for application deployment, not for routing user traffic between blue/green environments. |

πŸ” Blue-Green Deployment Options: Visual Comparison

| Feature / Option | 🌐 AWS Global Accelerator | 🔁 Elastic Load Balancer (ELB) | 🌍 Amazon Route 53 (DNS) | 🧩 AWS CodeDeploy |
| --- | --- | --- | --- | --- |
| DNS Caching Impact | ❌ Not affected | ❌ Not affected | ✅ Affected | ❌ Not relevant |
| Traffic Control | ✅ Endpoint weights + dials | ✅ Weighted target groups | ✅ Weighted routing | ❌ Deployment only |
| Multi-Region Support | ✅ Yes (global) | ❌ Region-bound | ✅ Yes | ❌ Not a routing tool |
| Switch Speed | ⚡ Instant | ⚡ Fast | 🕒 Slower due to caching | 🚫 Not applicable |
| Best Use Case | ✅ Global blue-green rollout | 🟡 Good for single Region | ❌ Not reliable for quick switch | ❌ Deploys app code, not routes |