System Design: How to Scale from Zero to a Million Users

Sahil Bhosale

Scalability plays a major role in building today's software systems. Why, you may ask? The simple answer: to serve a large number of users simultaneously with low latency and fast response times.

Building and managing such systems at scale used to require a high budget and advanced technical skills. However, cloud platforms like AWS, Google Cloud, and Azure are now available, so you no longer need to build your infrastructure from scratch.

If you are a software developer, applying for a software engineering or architecture position at MAANG companies, or simply looking to upskill, knowledge of designing scalable and distributed systems is crucial. In this article, we will explore one such concept. We’ll start from a monolithic system and see how to scale it into a distributed system capable of handling millions of users.

Single Server Architecture

A single server architecture consists of a client and a server. A client is anything that requests data from the server, such as a mobile app or a web application. This type of architecture is also known as monolithic architecture because all the code (frontend, backend, database) lives on a single server.

Monolithic systems have their own advantages. However, in today’s world, performance and efficiency play a major role, and that’s where monolithic systems fall short after a certain point.

The drawback of such systems is that they are not easily scalable. Since all the code runs on a single server, you are bound by that server's hardware limits on RAM, CPU, and storage. Adding more of these resources to the same machine is called vertical scaling, and it only goes so far. Additionally, such a system can handle only a certain amount of traffic at any given time: if 10K requests arrive but the server can handle only 1K, the system will become unresponsive or crash.

This is the type of system we are starting with. There are various steps that we will follow throughout this article to scale it. These are general steps for building scalable systems; real-life systems are more complex, and their architecture and components will vary from this example based on their requirements.

Let’s now see how we can start scaling this system.

  1. Application and DB separation

The first step we can take is to separate the database from the application code (both frontend and backend). This separation keeps the business logic and the data store distinct, and it is a first step toward horizontal scaling: running different components on separate servers.

By doing this, we reduce the load on the application server, improving efficiency while also enhancing data security.

  2. Adding multiple application servers

The next step is to separate the application code based on its functionality. For example, if you're building an e-commerce site, you could place the frontend code, payment processing code, user management code, and so on, onto individual servers.

This approach isolates the code into multiple application servers (AS), making the system more modular, loosely coupled, and secure. The main advantage of dividing the application across multiple servers is that if one server goes down, it won't break the entire application; other parts will continue to function. However, if one server makes an API call to another server that has gone down, that specific call will fail.

It’s important to note that dividing the application into multiple servers does not necessarily mean adopting a microservices architecture.
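To make that failure mode concrete, here is a sketch in Python of one application server calling another and degrading gracefully when it is down. The internal URL, endpoint, and use of the `requests` library are assumptions for illustration, not part of the article's architecture:

```python
import requests  # third-party HTTP client

def charge_order(order_id: str, amount_cents: int) -> dict | None:
    """Hypothetical call from the frontend service to the payment service."""
    try:
        resp = requests.post(
            "http://payment-service.internal/api/charges",  # hypothetical internal URL
            json={"order_id": order_id, "amount_cents": amount_cents},
            timeout=2.0,  # fail fast instead of hanging when the server is down
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # The payment server is unreachable. The rest of the app keeps working,
        # but this specific call fails and the caller must handle it.
        return None
```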

  3. Load balancers

The architecture we’ve built so far has become quite efficient. However, since we’ve added multiple application servers, a new challenge arises: how do we determine which server should handle each request from the client? There needs to be a component that manages client requests (load/traffic) and directs them to the appropriate application server based on specific logic. This is where load balancers come into play.

For example: If we have four application servers, we could use the round-robin load balancing algorithm to distribute the load. This algorithm sends one request to each application server (AS) in sequence, ensuring the load is evenly distributed across all servers.

At this stage, various load balancing algorithms can be implemented, including:

  1. Round Robin

  2. Weighted Round Robin

  3. IP Hash

  4. Least Connection

  5. Weighted Least Connection

  6. Least Response Time

Having a good understanding of these algorithms and knowing when to use each one can be highly beneficial in practice.
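As a concrete illustration, here is a minimal round-robin balancer sketched in Python. The server names are placeholders; in practice a dedicated load balancer such as NGINX or HAProxy implements this for you:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hands out servers one after another, wrapping around at the end."""

    def __init__(self, servers):
        self._servers = cycle(servers)  # endless iterator over the server list

    def next_server(self):
        return next(self._servers)

lb = RoundRobinBalancer(["as-1", "as-2", "as-3", "as-4"])
for request_id in range(6):
    print(f"request {request_id} -> {lb.next_server()}")
# request 0 -> as-1, request 1 -> as-2, ..., request 4 -> as-1 (wraps around)
```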


  4. Database replication (master / slave concept)

In our current architecture, we have only one database, which handles both read and write operations. Under load, this single database becomes a bottleneck and increases the overall latency of the system. It is also a single point of failure: if it goes down, our application won't be able to retrieve any data. To address this, we implement database replication.

Database replication involves adding multiple slave databases and copying all data from the current database (which acts as the master) to these slave databases. The slave databases are used solely for reading data, while the master database is responsible for writing data. Whenever we write data to the master, it is replicated across all slave databases to ensure consistency.
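A minimal sketch of how an application might split traffic in this setup, assuming `master` and `replicas` are pre-configured database connections (stand-ins, not a specific driver's API):

```python
import random

class ReplicatedDB:
    """Routes writes to the master and spreads reads across the slaves."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def write(self, sql, params=()):
        # All writes go to the master; replication copies them to the slaves.
        return self.master.execute(sql, params)

    def read(self, sql, params=()):
        # Reads are distributed across the slave databases.
        replica = random.choice(self.replicas)
        return replica.execute(sql, params)
```

Keep in mind that replication is usually asynchronous, so a read from a slave can briefly return slightly stale data.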

  5. Cache

To further enhance the performance of our system, we can implement a cache server. A cache server helps retrieve data quickly without needing to query the database or recompute the data. Typically, we cache data that doesn’t change frequently and is often requested by the frontend or client.

The cache server is located alongside the application server. When a request reaches the application server, it first checks if the data is available in the cache. If the data is present, it’s sent directly from the cache to the client. If not, the application server (AS) will query the database to retrieve the data.

We should always aim to minimize database calls by storing and retrieving data from the cache whenever possible, as database queries are costly and significantly contribute to increased latency.

When cached data becomes outdated due to updates in the database, we refer to it as stale data, meaning the cache needs to be refreshed. There are various cache invalidation techniques for this, such as setting expiry times (TTLs) or explicitly clearing entries when the underlying data changes.
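Here is a sketch of this cache-aside pattern in Python, assuming a Redis cache accessed via the redis-py client; `query_database` is a hypothetical helper standing in for the real database call:

```python
import json
import redis  # assumes a Redis cache server and the redis-py client

cache = redis.Redis(host="localhost", port=6379)

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no database call
    product = query_database(product_id)   # cache miss: hypothetical DB helper
    cache.setex(key, 300, json.dumps(product))  # expire after 5 minutes to limit staleness
    return product
```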

  6. CDN

CDN stands for Content Delivery Network. At this stage, we will separate all static files from the application server and store them on a CDN server. The CDN is designed to store static files like images, fonts, assets, and other resources that do not change frequently.

Clients can make direct requests for these files to the CDN server instead of routing them through the application server.
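In practice, this often just means generating asset URLs that point at the CDN's domain instead of the application server's. A tiny illustrative helper (the domain is hypothetical):

```python
CDN_BASE = "https://cdn.example.com"  # hypothetical CDN domain

def asset_url(path: str) -> str:
    """Rewrite a static asset path so the browser fetches it from the CDN."""
    return f"{CDN_BASE}/{path.lstrip('/')}"

print(asset_url("/images/logo.png"))  # https://cdn.example.com/images/logo.png
```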

  7. Data Centers

The system architecture we have designed so far is hosted in a single location. If we replicate this architecture across multiple regions or countries, we get multiple data centers, each handling requests from the users geographically closest to it.
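Routing users to the nearest data center is usually handled by GeoDNS or anycast rather than application code, but the idea can be sketched as a simple region-to-endpoint mapping (all values hypothetical):

```python
DATA_CENTERS = {
    "us": "https://us.example.com",
    "eu": "https://eu.example.com",
    "in": "https://in.example.com",
}

def route(user_region: str) -> str:
    # Fall back to a default region when the user's region has no data center.
    return DATA_CENTERS.get(user_region, DATA_CENTERS["us"])
```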

  8. Database scaling

As discussed, we can leverage database replication to help our system handle millions of users. Another technique to further boost database performance is database scaling, which can be implemented at the database level. There are two types of database scaling:

  1. Vertical Scaling: In this approach, we increase system resources, such as RAM or CPU, for our database.

  2. Horizontal Scaling: Here, we add multiple database nodes, typically using a method called sharding. Sharding can also be categorized into two types:

  • Vertical Sharding: This involves splitting a table by columns, placing different columns into separate tables (for example, moving large, rarely used columns out of a frequently accessed table).

  • Horizontal Sharding: This approach splits a table by rows, distributing the rows across multiple tables or nodes, and is the more common method. Since joins across shards are expensive, data is often denormalized so that queries can be answered from a single shard.

Consistent hashing is commonly used to decide which shard a given key belongs to, so that when shards are added or removed, only a small fraction of the data has to move, as sketched below.
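Here is a minimal consistent-hash ring in Python. This is a sketch: the shard names, the use of MD5, and the virtual-node count are illustrative choices, not prescribed by the article:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys onto a ring of shards so resharding moves little data."""

    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted list of (hash, shard) points on the ring
        for shard in shards:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3"])
print(ring.shard_for("user:42"))  # deterministic shard for this key
```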

In addition to all these strategies, you can also incorporate a message queue into your architecture, such as RabbitMQ or Kafka, to handle work asynchronously (for example, sending emails outside the request path).
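As a sketch, here is how a producer might enqueue such work with RabbitMQ via the pika client library (assuming a broker on localhost; the queue name and payload are illustrative):

```python
import pika  # assumes a local RabbitMQ broker and the pika client

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="emails", durable=True)  # survive broker restarts

# The web request returns immediately; a separate worker consumes this later.
channel.basic_publish(
    exchange="",
    routing_key="emails",
    body='{"order_id": 42, "template": "confirmation"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```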

By progressing step by step, we can achieve scalability, enabling our systems to handle millions of users while maintaining performance efficiency.

I hope this article has helped you learn some new concepts related to building scalable distributed systems and that you found it worthwhile. Thank you for reading! Please make sure to like and share this article on your social media platforms.
