How We Scaled Our Email System From 10K to 700K Emails Without Crashing the Backend

Hey folks, last week my team and I scaled an application from 10K emails to 700K emails. Sounds cool, right? Today I'm going to share the story and the lessons from that experience.
The Story
So one of my company's clients recently built an email marketing platform. Initially, they wanted to send emails to at most 10K customers, so we built and optimized the application for 10K customers. Everything was working fine until last week, when the client suddenly asked, "Is it possible to send 700K emails?" Our minds were blown. That's a huge number, and we couldn't just do it with a single click. So we started researching and testing. We found that the whole operation was running on the main backend server, and there was no way that could keep up with such a huge number of emails, so we decided to use Amazon services. In the end, we prepared a high-level implementation diagram and scheduled the feature for next week.
There was still one issue: if we ran this huge operation on the backend as it was, the backend server would crash, as expected. Strictly speaking that wasn't our problem yet, but we took on the challenge of scaling the backend to at least handle this load, because we'll have to do the same thing once we move everything into AWS. Surprisingly, we pulled it off in one working day: we scaled our backend to handle the 700K email sending operation.
How did we scale?
So now for the main part: how did we do it? Here are the things we did to achieve it.
1. Batching is important
First, we started from the basics. We all know that batching matters when you're handling a heavy load: doing the work in batches reduces memory pressure. But batching has a flaw. A batched operation takes longer than a normal one, and when the dataset is this big, it's really hard to rely on batching alone. So, to cut the initial operation time, we applied batching at two levels.
1.1 Grouping Operations
Since we needed to reduce memory load, we grouped our operations by index: indexes 0-1000 form one group, 1001-2000 the next, and so on. Each group then performs its operations independently.
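Here's a minimal sketch of what this grouping step can look like. I'm assuming a TypeScript backend here, and the group size and the Recipient shape are just for illustration:

```typescript
// Split the full recipient list into fixed-size index groups (0-999, 1000-1999, ...).
// Each group is processed independently, so memory usage stays bounded.
// GROUP_SIZE and the Recipient shape are illustrative assumptions.
interface Recipient {
  id: string;
  email: string;
}

const GROUP_SIZE = 1000;

function groupRecipients(recipients: Recipient[]): Recipient[][] {
  const groups: Recipient[][] = [];
  for (let start = 0; start < recipients.length; start += GROUP_SIZE) {
    groups.push(recipients.slice(start, start + GROUP_SIZE));
  }
  return groups;
}

// 700,000 recipients -> 700 independent groups.
```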
1.2 Batch Operations
Besides memory load, we also needed to reduce processing time, so we pushed the second level of batch operations into background queue jobs. This is where the actual work happens: we validate data, prepare the email payloads, and do a lot of other stuff. But because it runs in the background, it doesn't add to the initial processing time.
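To make that concrete, here's a rough sketch of the "enqueue one job per group, do the heavy work in the background" pattern, assuming a Node/TypeScript stack with BullMQ on top of Redis; the queue name `prepare-emails` and the job shape are hypothetical, not our actual code:

```typescript
// A minimal sketch of pushing the second-level batch work onto a background queue.
// Assumes BullMQ + Redis; queue and job names are illustrative.
import { Queue, Worker } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 };

// Producer: the API request only enqueues one job per group, then returns immediately.
const prepareQueue = new Queue("prepare-emails", { connection });

export async function enqueueGroups(
  groups: { id: string; email: string }[][],
  campaignId: string
) {
  await Promise.all(
    groups.map((group, index) =>
      prepareQueue.add("prepare", { campaignId, groupIndex: index, recipients: group })
    )
  );
}

// Consumer: validation and payload preparation happen in the background,
// so the initial API call never waits on the heavy work.
new Worker(
  "prepare-emails",
  async (job) => {
    const { recipients } = job.data;
    for (const recipient of recipients) {
      // validate `recipient`, build its email payload, and hand it to the email queue
    }
  },
  { connection }
);
```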
2. Queue matters
I already mentioned the queue above, but it's worth saying plainly: queue jobs matter. When an operation is huge and time-consuming, and you don't actually need the result instantly, a queue makes much more sense than doing that job on the spot.
For example, before we put the second batch operation on a queue, it worked fine with a small dataset, but once the dataset grew to 700K we never saw a success response from the API, only timeout errors.
We also used a queue for sending the emails (we didn't build it from scratch; it was already there). Since the second batch operation is only responsible for preparing data for the emails, we kept a separate email queue whose only job is sending them. This also makes it easy to rate-limit operations per second, because when you use third-party services, you have to respect their API call limits.
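As a sketch of that separation, here's what a dedicated send queue with a per-second limiter could look like, again assuming BullMQ; the limit of 14 per second matches the SES cap I mention later, and `sendViaSes` is a hypothetical wrapper, not the real client code:

```typescript
// Sketch of a dedicated email-sending queue with a per-second rate limit (BullMQ assumed).
import { Queue, Worker } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 };

export const emailQueue = new Queue("send-email", { connection });

// The worker only sends mail; the limiter keeps us under the provider's per-second quota.
new Worker(
  "send-email",
  async (job) => {
    const { to, subject, body } = job.data;
    await sendViaSes(to, subject, body); // hypothetical wrapper around the SES client
  },
  {
    connection,
    limiter: { max: 14, duration: 1000 }, // at most 14 jobs per 1,000 ms
  }
);

// Placeholder for the actual SES call; not part of the original setup described here.
async function sendViaSes(to: string, subject: string, body: string): Promise<void> {
  /* ... */
}
```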
3. Caching helps
When you send the same mail to all users, the most common shared piece is the email template. There is plenty of other shared data too, but because we sliced the full email-sending operation into independent slices, accessing that common data became a problem. The easiest option is to pass the data from one slice to the next, but that can create huge memory usage. So we cached the common data in Redis and read it whenever we needed it. It may sound very simple, but it helped us a lot: batching and the queue solved half of the problem, and the remaining issue of heavy memory usage went away once caching was in place. And because the data sits in Redis, we lose practically nothing in performance.
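Here's a small sketch of that caching step, assuming ioredis; the key names and TTL are illustrative:

```typescript
// Cache the shared email template in Redis so every group/job can read it
// without passing it around. Assumes ioredis; keys and TTL are illustrative.
import Redis from "ioredis";

const redis = new Redis();

const TEMPLATE_TTL_SECONDS = 60 * 60; // keep it around for the duration of the campaign

export async function cacheTemplate(campaignId: string, templateHtml: string): Promise<void> {
  await redis.set(`campaign:${campaignId}:template`, templateHtml, "EX", TEMPLATE_TTL_SECONDS);
}

export async function getTemplate(campaignId: string): Promise<string | null> {
  // Each batch job reads the template from Redis instead of carrying it in its payload.
  return redis.get(`campaign:${campaignId}:template`);
}
```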
4. Keep only stuff that is needed
After all of this was done, we were close to stable, but we found that Redis memory was having a very bad time. Whenever the mailing service started, Redis memory usage climbed to 7GB, which is really bad. So we started removing unnecessary data. After stripping out everything we didn't need, we managed to run the same mailing service under 200MB of Redis usage, which is a great improvement.
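There's no single snippet for this step, but the idea boils down to keeping job payloads lean and cleaning up campaign-scoped keys once you're done. The field names below are assumptions, just to show the shape of it:

```typescript
// Illustrative sketch: queue jobs carry only the fields the sender needs,
// and campaign keys are deleted once the campaign finishes. Assumes ioredis.
import Redis from "ioredis";

const redis = new Redis();

interface FullRecipient {
  id: string;
  email: string;
  name: string;
  profile: Record<string, unknown>; // large object we do NOT want sitting in Redis
}

// Keep only what the email worker actually uses.
function toJobPayload(recipient: FullRecipient): { id: string; email: string; name: string } {
  const { id, email, name } = recipient;
  return { id, email, name };
}

// Delete campaign-scoped keys once the last group has been sent.
export async function cleanupCampaign(campaignId: string): Promise<void> {
  await redis.del(`campaign:${campaignId}:template`);
}
```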
What is the impact of this scaling?
Our main goal was simply to stop the server from crashing, and we achieved that. But scaling the backend alone isn't enough, because if you use a third-party service, your throughput is also bounded by that service. In our case, that's Amazon SES (Simple Email Service), which limits us to 14 emails per second. So even if we scale the backend 100x, we still have to cap sending at 14 emails per second; at that rate, 700K emails take roughly 700,000 / 14 ≈ 50,000 seconds, or about 14 hours. We can request a higher limit, of course, and that's what we're going to do next week.
Conclusion
During this process I got stuck a lot and researched a lot, but eventually I also learned a lot. I realized you don't need to scale your application to millions on day one. Start small, then scale when you need to. That keeps the business more profitable and takes pressure off the developers.
With the current setup, the app can keep going just by raising the SES limit, but it will be better to move this work into a separate service.
That's it for today. I hope you found this article helpful. Don't forget to share your thoughts in the comments.
Written by

Tonmoy Deb
Web app developer who loves clean, organized code. Passionate about building projects, solving problems, and sharing knowledge through insightful coding blogs.