System Design For Twitter Hashtag Tracking

Kunal Arora
10 min read

Problem Statement

Design a basic system to count and display the number of times a hashtag is used on a social media platform.

Functional Requirements

  • Users should be able to tweet with hashtags.

    • Assumption: For now, we focus on basic text-only tweets (tags are also plain text).

    • Assumption: The maximum tweet length is 1,000 characters, including tags.

  • Users should be able to view the number of times a hashtag has been used.

Non-Functional Requirements

  • Highly Available

    • Users should always be able to tweet and see hashtag counts (no Single Point of Failure).
  • Low Latency

    • Both reading a hashtag count and posting a tweet should be fast (low latency on both the read and write flow).
  • High Consistency

    • Users should see hashtag totals in as close to real time as possible (minimal lag between adding a hashtag in a tweet and seeing the updated count).
  • Fault Tolerant

    • If writing an entry to the system fails, we should retry and otherwise recover from the error.
  • Monitoring and Alerting

    • The system should capture metrics and raise alerts when something goes wrong.

NOTE: Achieving both low write latency and real-time hashtag counts at the same time is not feasible. Therefore, we will design two variants: one prioritizing high consistency and the other prioritizing low write latency.

Capacity Estimations

As previously stated, let's split it into two sections: one for High Scale and the other for Moderate Scale.

For Moderate Scale System

  • Let's suppose our platform has a total of 1 million users.

  • Assuming that 10% of the total users are Daily Active Users (DAU), which amounts to 100,000 people.

  • It's a read-heavy system, with a read-to-write ratio of 80:20.

  • That works out to approximately 20,000 tweets per day, or around 0.2 tweets per second.

  • With the 80:20 ratio, reads are 4× the writes, so users request hashtag counts roughly 0.9 times per second (about 80,000 reads per day).

For High Scale System

  • Assuming we have 10 million total users on our platform,

    and 10% of them are our Daily Active Users (DAU), which totals to 1 million users.

  • The read-to-write ratio stays at 80:20,

    giving 200,000 tweets per day and approximately 2 tweets per second.

  • With the same ratio, reads are 4× the writes, so users request hashtag counts around 9 times per second (about 800,000 reads per day).
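These estimates can be reproduced with a small back-of-the-envelope script. The 10% DAU ratio and 80:20 read/write split come from the figures above; the one-action-per-DAU-per-day assumption is how those figures fit together:

```python
# Back-of-the-envelope estimates from the assumptions above.
SECONDS_PER_DAY = 86_400

def estimate(total_users: int, dau_ratio: float = 0.10, write_share: float = 0.20) -> dict:
    """Derive daily and per-second write (tweet) and read (count lookup) rates.

    Assumes every DAU performs one action per day, split 80:20 read:write,
    matching the article's numbers.
    """
    dau = int(total_users * dau_ratio)
    writes_per_day = int(dau * write_share)
    reads_per_day = dau - writes_per_day
    return {
        "dau": dau,
        "tweets_per_day": writes_per_day,
        "tweets_per_sec": round(writes_per_day / SECONDS_PER_DAY, 2),
        "reads_per_day": reads_per_day,
        "reads_per_sec": round(reads_per_day / SECONDS_PER_DAY, 2),
    }

print(estimate(1_000_000))   # moderate scale: ~20k tweets/day, ~0.23 tweets/sec
print(estimate(10_000_000))  # high scale: ~200k tweets/day, ~2.31 tweets/sec
```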

High-Level Design

Approach 1 (Well Suited for Moderate Scale)

Summary

Tweet Creation Flow

  • The user posts a tweet; the request reaches the Tweet Ingestion Service (API) through an API gateway.

  • The Tweet Ingestion Service performs two tasks:

    • Stores the tweet in TweetDB.

    • Extracts the hashtags, creates a request, and sends it to the Hashtag Service (API) to sync hashtag counts.

  • Upon receiving the request from the Tweet Ingestion Service, the Hashtag Service updates the count of each hashtag in the database.

Hashtag Count Read Flow

  • The user requests the count of a particular tag (API).

  • The Hashtag Service reads the count from the DB and returns it to the user.

Components of the HLD

Load Balancer

Description: A load balancer distributes incoming traffic evenly across service instances.

Reason To Use: Since one of the Non-Functional Requirements is a highly available system, we use a load balancer to spread the load and route requests only to healthy instances.

TweetDB (RDBMS)

Description: An RDBMS is a relational database that stores structured, relational data.

Reason To Use:

  • The scale of the system is modest (only ~20k tweets per day), so a relational database is a good fit.

  • The data itself is simple (we are only storing text), so an RDBMS works well here.

Hashtag DB (Key-value store like Redis)

Description: Key-value stores like Redis handle high read and write rates with ease.

Reason To Use:

  • As mentioned earlier, this is a read-heavy system, so a key-value store like Redis is a good choice to ensure low read latency.

Benefits

  • Because we update the hashtag count synchronously, the count shown to users is essentially real time, i.e. almost no lag between the write and read flows.

  • Because we read the count from a key-value store, read latency is low, i.e. a low-latency read flow.

Limitations

  • Because we update the hashtag count synchronously, posting a tweet is slower, i.e. a high-latency write flow.
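To make Approach 1 concrete, here is a minimal sketch of the synchronous write and read flows. In-memory dicts stand in for TweetDB and the Redis hashtag store, and all names (`post_tweet`, `tweet_db`, etc.) are illustrative, not a real API:

```python
# Minimal sketch of Approach 1's synchronous flow. In-memory dicts stand in
# for TweetDB (RDBMS) and the Redis key-value store.
import re
import uuid

tweet_db = {}    # stands in for TweetDB
hashtag_db = {}  # stands in for the Redis hashtag store

HASHTAG_RE = re.compile(r"#(\w+)")

def post_tweet(user_id: str, text: str) -> str:
    """Store the tweet, then synchronously bump each hashtag's count."""
    if len(text) > 1000:
        raise ValueError("tweet exceeds the 1k character limit")
    tweet_id = str(uuid.uuid4())
    tweet_db[tweet_id] = {"user_id": user_id, "text": text}
    for tag in HASHTAG_RE.findall(text):
        key = tag.lower()
        hashtag_db[key] = hashtag_db.get(key, 0) + 1  # real Redis would use INCR
    return tweet_id

def get_hashtag_count(tag: str) -> int:
    """Read flow: return the current count for a tag (0 if unseen)."""
    return hashtag_db.get(tag.lower(), 0)

post_tweet("user1", "Shipping the new design! #SystemDesign #Redis")
post_tweet("user2", "Notes on #systemdesign")
print(get_hashtag_count("SystemDesign"))  # -> 2
```

Note the trade-off described above: the user's request does not return until the count update completes, which is exactly what makes the write flow slower.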

Approach 2 (Well Suited for High Scale)

Summary

Tweet Creation Flow

  • The user posts a tweet; the request reaches the Tweet Ingestion Service (API) through an API gateway.

  • The Tweet Ingestion Service performs two tasks:

    • Stores the tweet in TweetDB.

    • Extracts the hashtags from the tweet, creates an event, and publishes it to the queue.

  • The Hashtag Service consumes the event and updates the count of each hashtag in the DB.
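The write path above can be sketched as follows. A `collections.deque` stands in for the broker topic, the event shape follows the Event Contract later in this post, and all names are illustrative:

```python
# Sketch of Approach 2's write path: the ingestion service extracts hashtags
# and publishes an event instead of calling the Hashtag Service directly.
import re
import time
import uuid
from collections import Counter, deque

queue = deque()  # stands in for the broker topic "tweet_hashtags"
HASHTAG_RE = re.compile(r"#(\w+)")

def ingest_tweet(user_id: str, text: str) -> str:
    tweet_id = str(uuid.uuid4())
    # 1. Store the tweet (TweetDB write elided in this sketch).
    # 2. Publish hashtag counts as an event; the request returns immediately,
    #    which is what keeps the write flow low-latency.
    tags = Counter(t.lower() for t in HASHTAG_RE.findall(text))
    if tags:
        queue.append({
            "topic": "tweet_hashtags",
            "key": tweet_id,
            "value": {"hashtags": dict(tags)},
            "timestamp": int(time.time() * 1000),
        })
    return tweet_id

ingest_tweet("user1", "Loving #Kafka for this! #kafka")
print(queue[0]["value"])  # -> {'hashtags': {'kafka': 2}}
```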

Hashtag Count Read Flow

  • The user requests the count of a particular tag (API).

  • The Hashtag Service reads the count from the DB and returns it to the user.

Components of the HLD

Load Balancer

Description: A load balancer distributes incoming traffic evenly across service instances.

Reason To Use: Since one of the Non-Functional Requirements is a highly available system, we use a load balancer to spread the load and route requests only to healthy instances.

TweetDB (Cassandra)

Description: Cassandra is a write-optimized, wide-column database that handles large volumes of data well.

Reason To Use:

  • The scale of the system is high (~200,000 tweets per day), so Cassandra is a good choice here.

Hashtag DB(Key-Value store like Redis)

Description: Key-value store like Redis help us to handle high write and read operations easily

Reason To Use:

  • As we mentioned earlier its Read-heavy system Key-Value store like Redis is a good choice to go with to ensure low read latency

Message Broker

Description: Message brokers help in several ways: decoupling services, absorbing burst load, etc.

Reason To Use:

  • The system's scale is high and we want low latency on the write flow, so we use a message broker here.

  • The message broker also helps fulfill the Fault Tolerant non-functional requirement: we can republish a failed message to the queue up to n times, and if it still fails after n retries, send the event to a Dead Letter Queue (DLQ) to be handled later.
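A minimal sketch of the consumer side with bounded retries and a DLQ; in-memory structures stand in for the broker, the DLQ, and the Redis count store, and the names are illustrative:

```python
# Sketch of the Hashtag Service consumer: apply an event's hashtag deltas,
# retrying up to MAX_RETRIES times before parking the event in a DLQ.
from collections import deque

MAX_RETRIES = 3
dlq = deque()        # DeadLetterQueue for events that keep failing
hashtag_counts = {}  # stands in for the Redis hashtag store

def apply_event(event: dict) -> None:
    """Apply one event's hashtag deltas to the count store.

    A real consumer should make this idempotent (e.g. dedupe on the event
    key), otherwise a retry after a partial failure double-counts.
    """
    for tag, delta in event["value"]["hashtags"].items():
        hashtag_counts[tag] = hashtag_counts.get(tag, 0) + delta

def consume(event: dict) -> None:
    """Try to apply an event; after MAX_RETRIES failures, send it to the DLQ."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            apply_event(event)
            return
        except Exception:
            if attempt == MAX_RETRIES:
                dlq.append(event)  # handled later, e.g. by a replay job

consume({"value": {"hashtags": {"systemdesign": 1}}})
print(hashtag_counts)  # -> {'systemdesign': 1}
```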

Benefits

  • Because we update the hashtag count asynchronously, creating a tweet is fast, i.e. a low-latency write flow.

  • Because we read the count from a key-value store, read latency is low, i.e. a low-latency read flow.

Limitations

  • Because we update the hashtag count asynchronously, there is some lag between posting a tweet and the count reflecting it, i.e. the value a reader sees can briefly trail the true count.

Low-Level Design

Sequence Diagram for Tweet Creation Flow

Summary

  • Tweet System

    • The Tweet Service calls the Hashtag Service to extract the tags from the tweet.

    • The Tweet Service then calls the tweet repository to create a tweet entry.

    • After that, the Tweet Service notifies the Hashtag System, either via API (Approach 1) or via an event (Approach 2).

  • Hashtag System

    • The Hashtag Service calls the Hashtag Repository to update the hashtag count.

Sequence Diagram For HashTag Count Read Flow

Summary

  • Hashtag System

    • The Hashtag Service calls the Hashtag Repository to get the hashtag count and returns it to the user.

DB Design

Tweet Table

{

"id": UUID(base64) (Primary Key),

"user_id": UUID(base64),

"text": text

}

HashTag Table (Key-value store like Redis)

For Moderate Scale

{

"tag_name": text (unique),

"count": bigInt,

"version": varChar(255) // (to support optimistic locking)

}

For High Scale

{

"tag_name": text (unique),

"count": bigInt,

"lockId": UUID(base64) // (to support pessimistic locking)

}

Event Contract

{

"topic": "tweet_hashtags",

"key": "12345",

"value": {"hashtags": {"hashtag1":count1, "hashtag2":count2, "hashtag3":count3}},

"timestamp": 164938201231

}

Monitoring And Alerting

To achieve operational excellence, we need monitoring and alerting in our system.

Basic monitoring and alerting we could start with:

  • Monitor the latency of posting a tweet.

  • Monitor the latency of reading a hashtag count.

  • Monitor retries of the hashtag count sync API/events.

  • Alert when retries of the hashtag count sync API/events within the last X minutes/hours breach a threshold.

  • And so on.

Important Points

1 - How do we handle potential race conditions when multiple posts containing the same hashtags are created simultaneously?

Approach 1 (Optimistic Locking) (For Moderate Scale)

Let's understand this with an example.

Example: Say two users (user1, user2) post tweets containing the same hashtag, Tag1.

As mentioned in the table design, the table has a field called version.

Step 1: Both user1 and user2 fetch the current row for Tag1 from the HashTag table.

Step 2 (read phase): Both user1 and user2 read the same value of the version column for Tag1, say v1.

Step 3 (modify phase): Both user1 and user2 modify the value of count.

Step 4 (validation phase): Before committing, each user reads the version for Tag1 again; if it still matches the value from Step 2, that user proceeds to Step 5.

Step 5 (commit phase): Suppose user1 reaches the commit phase first; it updates the count for Tag1 and bumps the version to v2.

Since user1 already incremented the version to v2 in Step 5, the version check fails for user2 in Step 4 (the validation phase). In that case, user2 simply retries the steps from the beginning.
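The steps above can be sketched against an in-memory row. In a real system the validate-and-commit would be a single atomic statement, e.g. a SQL `UPDATE ... WHERE version = :read_version` or a Redis `WATCH`/`MULTI` block; the names here are illustrative:

```python
# Sketch of optimistic locking: read the version, modify, then commit only
# if the version is unchanged; otherwise retry from the read phase.
hashtag_table = {"Tag1": {"count": 0, "version": 1}}

def increment_with_version(tag: str, max_retries: int = 5) -> bool:
    for _ in range(max_retries):
        row = hashtag_table[tag]
        read_version = row["version"]  # Step 2: read phase
        new_count = row["count"] + 1   # Step 3: modify phase
        # Steps 4-5: validation + commit (atomic in a real DB)
        if hashtag_table[tag]["version"] == read_version:
            hashtag_table[tag] = {"count": new_count, "version": read_version + 1}
            return True
        # The version moved on (another writer won): retry from the read phase.
    return False

increment_with_version("Tag1")
increment_with_version("Tag1")
print(hashtag_table["Tag1"])  # -> {'count': 2, 'version': 3}
```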

Benefits

  • Optimistic locking lets a large number of users read and write concurrently (high concurrency), since we don't take any locks on DB rows.

  • It works well for moderate-scale systems, where the chance of conflicts is low.

Limitations

  • Optimistic locking breaks down at high scale, where the chance of conflicts is high and retries pile up.

Approach 2 (Pessimistic Locking) (For High Scale)

Let's understand this with an example.

Example: Say two users (user1, user2) post tweets containing the same hashtag, Tag1.

As mentioned in the table design:

Step 1: Both user1 and user2 try to read the row and acquire the lock.

Step 2: Only one user, say user1, acquires the lock; the other waits until user1 releases it.

Step 3: user1 updates the count in the table (commits) and unlocks the row.

Step 4: Now user2 can update the count for the same hashtag.

As we can see, user2 has to wait until user1 is done committing and has unlocked the corresponding row.
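A minimal sketch of the same steps, using a per-tag `threading.Lock` to stand in for a row lock (a relational DB would use `SELECT ... FOR UPDATE`):

```python
# Sketch of pessimistic locking: each "row" has a lock; a writer holds it
# for the whole read-modify-write, so concurrent writers serialize.
import threading
from collections import defaultdict

hashtag_counts = defaultdict(int)
row_locks = {"Tag1": threading.Lock()}  # one lock per hashtag "row", created up front

def increment_with_lock(tag: str) -> None:
    with row_locks[tag]:          # Steps 1-2: acquire the row lock; other writers wait
        hashtag_counts[tag] += 1  # Step 3: update the count and "commit"
    # Lock released here; Step 4: the next waiting writer proceeds.

threads = [threading.Thread(target=increment_with_lock, args=("Tag1",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(hashtag_counts["Tag1"])  # -> 10
```

Even in this toy version the cost is visible: the ten writers run one at a time, which is exactly the concurrency we trade away for accuracy.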

Benefits

  • Pessimistic locking works well at high scale, since it handles frequent conflicts by locking rows.

Limitations

  • If a lock is not released properly, it can lead to deadlocks.

  • With pessimistic locking, we sacrifice concurrency for data accuracy.

2 - How do we make the write operation that updates the hashtag count fault tolerant?

Approach 1 (For Moderate Scale System)

Since our moderate-scale system updates the hashtag count synchronously, we can:

  1. Add retries (x times) on the sync-count API so the system can recover counts that failed due to a temporary issue.

  2. If the API still fails after x retries, we can do one of two things:

    1. Fail the entire tweet request, i.e. roll back everything from adding the entry in the Tweet table to updating the hashtag count.

    2. Complete the request partially, return 2xx to the user, and update the hashtag count asynchronously for the failed cases only.

      1. To update the hashtag count for failed cases, we can use either an event-based or a cron-based approach.

Note: We can go with either option 2.1 or 2.2 depending on our requirements.
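Option 2.2 can be sketched like this. `sync_hashtag_counts` is a hypothetical stand-in for the Hashtag Service call, and failed updates are parked for a later event- or cron-based repair job:

```python
# Sketch of option 2.2: retry the synchronous count update a few times; if it
# still fails, complete the tweet request anyway and defer the count update.
from collections import deque

MAX_ATTEMPTS = 3
failed_updates = deque()  # drained later by an event- or cron-based repair job

def update_counts_with_retry(tags: dict, sync_hashtag_counts) -> bool:
    """Return True if the sync call succeeded, False if it was deferred."""
    for _ in range(MAX_ATTEMPTS):
        try:
            sync_hashtag_counts(tags)
            return True
        except Exception:
            continue  # transient failure: retry
    failed_updates.append(tags)  # option 2.2: still return 2xx to the user
    return False

def flaky_call(tags, _state={"fails": 2}):
    """Illustrative service call that fails twice, then succeeds."""
    if _state["fails"] > 0:
        _state["fails"] -= 1
        raise RuntimeError("transient error")

print(update_counts_with_retry({"tag1": 1}, flaky_call))  # -> True (succeeds on attempt 3)
```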

Approach 2 (For High Scale System)

Since our high-scale system updates the hashtag count asynchronously via a message broker:

  1. On failure, we can republish the same event to the queue up to x times.

  2. If the event still fails after x republishes, we add it to the Dead Letter Queue (DLQ) to handle later.
