Cap Theorom & Db Sharding
CAP THEOROM :
C – Consistency
A – Availability
P – Partitioning
The CAP Theorem is a fundamental concept in distributed system design, helping engineers understand the trade-offs between Consistency, Availability, and Partition Tolerance in any distributed data store.
Here’s a brief explanation of each:
Consistency (C): Every read receives the most recent write or an error. All nodes return the same data at the same time.
Availability (A): Every request (read/write) gets a response, even if it's not the most up-to-date data.
Partition Tolerance (P): The system continues to operate despite network partitions (i.e., communication between nodes may fail).
Key Insight of CAP Theorem:
- You can only guarantee two of the three properties (Consistency, Availability, Partition Tolerance) in a distributed system. This is often summarized as “Choose two out of three.”
For instance:
CA Systems: Prioritize consistency and availability but struggle with partition tolerance. Example: Traditional relational databases.
CP Systems: Focus on consistency and partition tolerance, sacrificing some availability during partition events. Example: HBase.
AP Systems: Ensure availability and partition tolerance but may not always be consistent. Example: Cassandra, DynamoDB.
Real-World Application:
In practice, no system can completely sacrifice partition tolerance (since network failures are inevitable), so the choice typically boils down to balancing consistency and availability.
Understanding the CAP Theorem helps engineers make informed decisions about the right database or system architecture based on application requirements and trade-offs. It's particularly important when scaling systems across multiple regions or designing high-availability architectures.
Distributed System : A System consisting of a group of machines working in coordination so as to appear as a single coherent system to the end-user
Consistency : Any Read that is happening after a latest write , all the nodes should return the latest value of that Write.
Availability : Every Available node in the system should respond in a non error format to any Read request without the guarantee of returning the latest Write
Partition Tolerance : System will be responding to all Read and Write if the communication channel (or middleware) between nodes is broken ( or partitioned)
In CAP THEOROM generally will support any two of them in CAP , like C&A , A&C , A&P , P&C any one of the thing we need to be left , not all the time the Distributed will support C,A,P will accept anyone of the thing we have to left.
DB SHARDING :
Horizontal Partitioning Is same as Sharding used to store the large data with highly available and scaling
Local Sharding – Physical Machine (Physical DB)
Physical Sharding have Multiple Number of Local Sharding and the Data should be Stored in Multiple Physical Machines ( DB)
Advantage of Sharding :
Query Optimization
Better Performance
Reduce Latency
There are two types of Sharding
1.Algorithmic Sharding – App/Client knows which Shard the query has to go that’s Algorithmic Sharding
2.Dynamic Sharding – Other Module / Part which Client has to query in order to findout which Shard the query has to go to its called Dynamic Sharding
Disadvantage of Sharding :
Partition could be proper
Recover the old data , with Sharding to Non Sharding
Key Based Sharding :
Shared Key its should be an Static it won’t be change
We can select Multiple Column as Shared Key
First step here to pickup an Shard Key , Shard Key not equal to Primary Key
A Primary Key of the Table can be an Shard Key but the Vice Versa its Not Possible
Choosing the Shard Key means Pickup an Column on bases of which you are going to divide the data
Ex : UserId your picking up as Shard Key
After selecting the Shard(UserId) key It will use the Hash function so based on the Shard Key it will be give the Shard Value
Its an Algorithmic Sharding
Advantage :
Evenly Distributed Data
Disadvantage :
Adding new Shard because the Hashing function might Change
Range Based Sharding :
You have to store the Data in Shard in basis of Range
Its an Algorithmic Sharding
Ex : You are storing the value in Shard based on Month Range , like Each 2 Months Data should be Stored in One Shard
Advantage :
Same Database Schema for all Logical & Physical Shard
Disadvantage :
Increased Storage on Specific Shard This is called as an HOTSPOT
Data will not be uniformly distributed , like its an Range Based Shard so In Specific Range only it will be Shard to Specific Shard
Directory Based Sharding (Dynamic Sharding) :
Here Dividing your Shard Based on the Specific Column
Ex : Here you are Dividing Shard Based on Zone
Zone : 1 2 3 4 Shard : A B C D
Here Zone 1 —> Shard A , 2 —> B , 3 —> C , 4 —> D
These will be stores in Lookup Table based on that Zone the Shard could be Directed.
Advantage :
We can add Shard without Touching previous Shard
Remove Shard when ever you want
Data Evenly Distributed
Disadvantage :
For Every Read & Write the Application It will be first Connect with Lookup Table then Lookup Table Routs Where Read & Write has to go.
Lookup Table Crashes the Whole System will Crashes
Subscribe to my newsletter
Read articles from OBULIPURUSOTHAMAN K directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
OBULIPURUSOTHAMAN K
OBULIPURUSOTHAMAN K
As a Computer Science and Engineering graduate, I have cultivated a deep understanding of software development principles and technologies. With a strong foundation in Java programming, coupled with expertise in frontend and backend development, I thrive in crafting robust and scalable solutions. Currently, I am leveraging my skills as a Java Full Stack Engineer at Cognizant, where I am involved in designing and implementing end-to-end solutions that meet the complex requirements of our clients. I am passionate about leveraging emerging technologies to drive innovation and deliver tangible business value. My goal is to continually enhance my expertise and contribute to the advancement of software engineering practices in the industry.