Database sharding

Database Sharding

Database sharding is a horizontal scaling(partitioning) technique that partitions a logical dataset into multiple smaller, independent databases, like customers with ZIP codes less than 50000 are stored in CustomersEast database instance, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest database instance and so on and so fourth.. . This approach spreads both storage and query load across multiple machines, enabling systems to handle larger datasets and higher throughput than a single node.

What happens if a shard fails, then we can introduce Master-Slave architecture, in these thing, whenever there is a write request, its always on the Master, then the slaves are reading whenever there is a write happens in the master, whenever there is a read request that can be distributed across the slaves, if master fails, then slaves choose a Master among them selves, so with this method the performance will be drastically improved, and there will be no single point of failure.

Different Sharding Strategies

Range-Based Sharding
Range‑based sharding divides data by contiguous key intervals (Ex: data ranges or alphabetical ranges). alphabetics with A-G goes to some database instance (shard1), H-M goes to another database instance (shard2) and so on, This makes range queries efficient, as relevant shards can be quickly identified, and we can get the data fast and efficiently, without giving load onto any other shards.
Hash‑Based Sharding
Hash‑based sharding applies a hash function to the shard key (e.g., hash(user_id) mod N) to determine shard placement, ensuring an even distribution of data and load. However, it complicates range queries since hashed values do not preserve sort order.
List‑Based Sharding
List‑based sharding assigns explicit lists of key values to shards (e.g., countries or regions) allowing fine‑grained control over data locations
Directory-Based Sharding
A lookup table (directory) maps each key or range to a specific shard, providing maximum flexibility at the cost of maintaining and querying the directory on every access.

Advantages
1. Linear Scalability: Add more shards to increase capacity and throughput nearly linear.
2. Fault Isolation: Failure of one shard affects only its subset of data, not the entire system.
3. Data Locality: Geographic sharding can place data close to users, reducing latency

Challenges
1. Operational Complexity: Managing shard maps, rebalancing, backups, and monitoring across many nodes increases overhead
2. Schema Changes: Evolving schemas must be applied across all shards in a coordinated manner

Real world Implementations
1. MongoDB: Uses a shard cluster with config servers and mongos routers; supports range and hash sharding with automatic balancer
2. Apache Cassandra: Employs hash‑based partitioning across nodes via consistent hashing; each node owns token ranges.

System Design (Day 3)

Database Sharding

Different Sharding Strategies

Subscribe to my newsletter

Manoj Kumar

Manoj Kumar