Identity Validation at Scale
Summary
The journey began in September 2023, when our tech team learned that we had to provide a user identity validation solution for the company's most important product. That product is Big Brother Brazil, the local edition of a famous reality TV show that runs in several countries; in Brazil the program is broadcast by Globo.
The program was scheduled to start in January 2024, so we had roughly three months to rework our system architecture to handle one of the programs with the largest audiences here at Globo.
This paper focuses not only on the technology we used to achieve this goal, but also on the project management practices that led us to deliver a successful result to the company.
A quick review of Big Brother
To understand the challenge, it helps to know how much Big Brother Brazil matters in my country. The show has been on the air since 2002 and has set many national TV audience records over the years. Its voting mechanism lets the audience vote to eliminate participants during the show. That mechanism has evolved over time, and nowadays all voting runs entirely on our servers, producing our largest API throughput peaks.
The challenge
It is fair to say that back in September 2023 we already had an API to validate user identity, but it was neither reliable nor resilient, and it certainly would not have been able to handle the estimated request throughput. The final solution required a major refactor of this API. A complicating factor is that only a few companies in Brazil are able to validate user identity. Every citizen in my country has their own CPF (Cadastro de Pessoa Física), a consolidated identification number held by the government, similar to a Social Security Number in the United States. Globo therefore has to consume one of the available external APIs to perform this validation, and those APIs cannot cope with the throughput generated by the Big Brother audience. We needed an additional architectural solution to handle this issue.
Time is against us, let’s talk about the solution
In the tech industry, challenging project deadlines are not unusual, and that was exactly what we were dealing with. Beyond the technical difficulties themselves, we had to prioritize what was essential and decide what we could compromise on in case something went wrong. We ended up splitting the project into three chapters. Each chapter was delivered separately from the others and could, in theory, go to production on its own and handle Big Brother's voting.
The three chapters were: the user identity validation core, a virtual waiting room solution and an additional cellphone validation step.
Given how big this project was for the company, we could not fail to deliver a steady and resilient software solution, so the risk management aspect of the project turned out to be as important as the technical solution itself. Splitting the delivery into three pieces, even with a total deadline of only three months, proved to be the best way for my team to guarantee that we knew what we had to do and also had backup plans in case something did not go as planned.
We basically had to deliver each chapter in one month, and now I am going to walk you through those chapters.
Chapter 1 – The User Identity Validation Core
As I mentioned before, my team already had an API for user identity validation, but it was not robust enough for Big Brother's requirements. The objective of the first chapter was to deliver a core API that we could trust, and we refactored the API with the following enhancements.
The core API was already integrated with one external partner for CPF validation, but we needed to be sure that downtime on an external server would not hurt us. We added a second partner capable of validating CPFs, both to act as a fallback API and to let us load balance requests between the two external APIs.
As for the load balancing mechanism, we realized that the two partner APIs could handle different request throughputs, so the best solution was to balance the load according to each partner's capacity and make full use of both APIs.
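A minimal sketch of capacity-weighted balancing, written under assumptions: the partner names and the capacity figures are hypothetical, and a weighted random choice is only one of several ways to split traffic in proportion to capacity.

```python
import random
from typing import Dict, Optional

# Hypothetical capacities (requests per second); not real partner figures.
PARTNER_CAPACITY: Dict[str, int] = {"partner_a": 300, "partner_b": 100}

_default_rng = random.Random()


def pick_partner(capacities: Dict[str, int],
                 rng: Optional[random.Random] = None) -> str:
    """Pick a partner with probability proportional to its capacity."""
    rng = rng if rng is not None else _default_rng
    names = list(capacities)
    weights = [capacities[name] for name in names]
    # random.choices performs a weighted draw; k=1 returns a one-item list.
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    counts = {name: 0 for name in PARTNER_CAPACITY}
    for _ in range(10_000):
        counts[pick_partner(PARTNER_CAPACITY)] += 1
    print(counts)  # partner_a should receive roughly three times the traffic
```

With these example weights, about three out of four requests go to the partner with the larger capacity, so neither partner is pushed beyond its share.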
The fallback solution was important to us. Whenever our API encountered some type of error from one partner it could handle the error and, in most cases, retry the CPF validation on the other partner. This strategy greatly raised the success rate of the user identity validation.
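The fallback idea can be sketched roughly as follows. `PartnerError` and the validator callables are hypothetical stand-ins for the real partner clients, which are not described in this article.

```python
from typing import Callable, Optional, Sequence


class PartnerError(Exception):
    """Hypothetical error raised when a partner cannot answer (timeout, 5xx, ...)."""


# A validator takes a CPF string and returns True if the partner confirms it.
Validator = Callable[[str], bool]


def validate_with_fallback(cpf: str, partners: Sequence[Validator]) -> bool:
    """Try each partner in order and return the first answer obtained."""
    last_error: Optional[PartnerError] = None
    for validate in partners:
        try:
            return validate(cpf)
        except PartnerError as exc:
            # This partner failed; remember why and try the next one.
            last_error = exc
    raise PartnerError("all partners failed") from last_error
```

Note that only partner errors trigger the fallback. A definitive "invalid CPF" answer is returned as is, since asking the second partner would not change it.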
Another good practice we implemented in this API is the circuit breaker. A circuit breaker helps the API cut traffic to a dependency whenever there is a sign of system degradation. We have a separate circuit breaker for each partner, which helps redirect traffic to the other partner when needed.
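A simplified sketch of a per-partner circuit breaker is shown below. The threshold, the timeout and the class itself are illustrative assumptions, not the production implementation, and the half-open state is reduced to "allow requests again once the timeout has elapsed".

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures and allows trials after a timeout."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open circuit: let trial requests through once the timeout has passed.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the timeout.
            self._opened_at = self._clock()


# One breaker per partner (names are hypothetical).
breakers = {"partner_a": CircuitBreaker(), "partner_b": CircuitBreaker()}
```

Before calling a partner, the caller would check `breakers[name].allow_request()` and route the request to the other partner when it returns `False`.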
We also stored each success and each failure in CPF validation. This way we avoided calling the partners for repeated requests, which increased the core API's efficiency and reduced the cost of calling the partners.
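As an illustration of storing validation results, here is a small sketch that uses an in-memory dictionary. The article does not say where results were stored, so the dictionary and the `CachedValidator` name are assumptions; a real deployment would need a shared, persistent store.

```python
from typing import Callable, Dict


class CachedValidator:
    """Remembers past validation results to avoid repeated partner calls."""

    def __init__(self, validate_remote: Callable[[str], bool]) -> None:
        self._validate_remote = validate_remote
        self._results: Dict[str, bool] = {}  # CPF -> valid or invalid
        self.remote_calls = 0                # number of calls sent to the partner

    def validate(self, cpf: str) -> bool:
        if cpf in self._results:
            return self._results[cpf]
        self.remote_calls += 1
        # If the partner call raises, nothing is stored and the error propagates.
        result = self._validate_remote(cpf)
        self._results[cpf] = result
        return result
```

Both positive and negative answers are stored, so a repeated request for the same CPF never reaches the partner again.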
This final solution for the core API reached the desired level of trust we lacked before, and so this first chapter ended.
Chapter 2 – The Virtual Waiting Room Solution
A virtual waiting room is not breaking news in the tech industry; it is a well-known solution that has already proven its value. Nonetheless, it was exactly what we needed, and we had an estimated one month to build it.
There are, however, some tricky points when you implement a waiting room. One is caring for the user experience through the entire flow. Another is making sure your API is correctly tuned to respond as fast as possible, reducing the waiting time. A third is collecting as many metrics from the entire ecosystem as you can, so that you can continuously improve the flow and the user experience.
As the core waiting mechanism we chose Google Cloud Platform's PubSub as our messaging tool. PubSub is a great messaging tool: it has many native metrics built in and provides a very efficient and resilient system. The built-in metrics in particular helped us make the most of the few weeks we had, because PubSub already offered almost everything we needed.
To implement the polling mechanism of the waiting room, we chose Redis as a caching tool to handle the high volume of requests coming from users. We implemented the logic as shown in the following diagram. Redis relieves the request pressure that comes from the audience, and the logic queues users into PubSub.
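One possible reading of this polling logic is sketched here. It is an assumption about the flow, not a reproduction of the real service: a plain dictionary stands in for Redis, a list stands in for the PubSub topic, and the status names are hypothetical.

```python
from typing import Dict, List

WAITING, APPROVED, REJECTED = "waiting", "approved", "rejected"


class WaitingRoom:
    """Answers polls from a cache and enqueues each user only once."""

    def __init__(self, status_cache: Dict[str, str], queue: List[str]) -> None:
        self._status = status_cache  # stand-in for Redis
        self._queue = queue          # stand-in for a PubSub topic

    def poll(self, user_id: str) -> str:
        status = self._status.get(user_id)
        if status is None:
            # First request from this user: mark as waiting and enqueue once.
            self._status[user_id] = WAITING
            self._queue.append(user_id)
            return WAITING
        # Every later poll is answered from the cache alone.
        return status

    def complete(self, user_id: str, approved: bool) -> None:
        """Called by the consumer side once validation has finished."""
        self._status[user_id] = APPROVED if approved else REJECTED
```

However many times a client polls, the user is queued a single time, which is what keeps the audience's request volume away from the validation core.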
The waiting room API has a complementary part that pulls users from PubSub and validates them on the user identity validation core API, respecting the throughput both partners can handle.
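The consumer side can be sketched as a rate-limited drain loop. This is an illustration only: a `deque` stands in for the PubSub subscription, and `max_per_second` is a hypothetical limit that would be derived from the combined capacity of both partners.

```python
import time
from collections import deque
from typing import Callable, Deque


def drain_at_rate(queue: Deque[str],
                  handle: Callable[[str], None],
                  max_per_second: int,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Process queued users in batches of at most max_per_second per second."""
    while queue:
        window_start = clock()
        for _ in range(min(max_per_second, len(queue))):
            handle(queue.popleft())  # e.g. call the identity validation core
        if queue:
            # Wait out the rest of the one-second window before the next batch.
            elapsed = clock() - window_start
            if elapsed < 1.0:
                sleep(1.0 - elapsed)
```

Users are handled in arrival order, and the loop never sends more than the configured number of validations in any one-second window.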
The described process can be seen on the following web pages:
Chapter 3 – Cellphone validation
In the last chapter of our journey, we had to deal with an additional concern: fraud. In addition to validating user identity with trusted partners, our team decided to implement an auxiliary step to mitigate fraud in the Big Brother Brazil voting mechanism. We also require the user to enter a valid cellphone number, and we validate it by sending a code over SMS.
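A minimal sketch of an SMS code check, under stated assumptions: `send_sms` is a hypothetical callable standing in for an SMS provider, the six-digit code length and five-minute expiry are illustrative choices, and pending codes live in memory instead of a shared store.

```python
import hmac
import secrets
import time
from typing import Callable, Dict, Tuple


class SmsCodeVerifier:
    """Issues one-time numeric codes and checks them before they expire."""

    def __init__(self, send_sms: Callable[[str, str], None],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send_sms = send_sms
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending: Dict[str, Tuple[str, float]] = {}  # phone -> (code, expiry)

    def start(self, phone: str) -> None:
        # secrets gives a cryptographically strong random six-digit code.
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[phone] = (code, self._clock() + self._ttl)
        self._send_sms(phone, code)

    def confirm(self, phone: str, submitted: str) -> bool:
        entry = self._pending.get(phone)
        if entry is None:
            return False
        code, expires_at = entry
        if self._clock() > expires_at:
            del self._pending[phone]
            return False
        # Constant-time comparison; bytes avoid errors on non-ASCII input.
        if hmac.compare_digest(code.encode(), submitted.encode()):
            del self._pending[phone]  # each code can be used only once
            return True
        return False
```

A wrong code leaves the pending entry in place so the user can retry until it expires, while a correct code is consumed immediately.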
Although that last step was not mandatory, we concluded that it would help protect us from malicious users trying to defraud the voting process. With that, our final page flow looked like this:
The last mile
It may not surprise some of you that the last mile is the hardest part of any project. We managed to finish all three chapters within two months, with almost 30 people involved, including developers, DevOps engineers and UX designers. The last month served us very well for reviewing all the work we had done, exercising every test scenario, implementing as much monitoring and as many metrics as possible, and running load tests on the main APIs.
That final month was as crucial as the first two for the result, and it reinforced the importance of reserving some extra time in every deadline for a final review of the project. In the spirit of the Pareto principle, it may take half of the time to get 80% of the work done and the other half to reach 100%.
Load Test can be your guarantee
Whenever your project requires high API throughput, the load testing phase becomes your best ally, because it verifies that all APIs and your entire ecosystem run as they should and deliver the expected results. Tuning a virtual waiting room is not a trivial task, so this process was crucial for our team to trust the implementation. We also gained many insights from this step and greatly increased our API capacity by applying the right tunings.
Conclusion
I hope the story I wrote about our journey has been somewhat useful for the reader. It reviews not only the practical application of some techniques such as messaging, polling, circuit breaker, load balancing and fallback, but also brings up issues about project management in general. I highlight the following points from this paper:
Risk management can be as important as the technology you choose to deliver a project. It can contribute a lot to the result, especially when there is a challenging deadline and a vast scope to satisfy.
Even in the worst deadline scenario, monitoring and metrics are key to success. They tell you if the project succeeds or fails. Invest some time in this topic.
Know the challenge you are facing, know the risks and use that knowledge wisely in your favor. Prioritize what makes a difference.
Do not forget about the last mile, as you will need extra time to travel this final stretch of your project. Otherwise, you risk all the work you have done.
Cover image taken by Edmond Dantès from Pexels
Written by
Vitor Silva Costa
Used to work with CIAM applications that need to be resilient and scalable. My focus is on backend development and system architecture, always working with my teams to deliver great value.