Google Cloud outage on June 12


The outage on June 12, 2025, which affected Google Cloud and subsequently caused widespread internet service disruptions, was the result of multiple "small errors" accumulating across the development, testing, deployment, and operations phases.
Here are the key points that caused this outage:
- First Small Error: Lack of Error Handling in New Code (Development Phase)
On May 29, 2025, the Service Control team, which maintains Google's internal API management system, upgraded its quota policy function.
The newly introduced code failed to include basic error handling.
This meant that if a new policy containing a null value was written into the Service Control system, it would trigger a null pointer exception and crash the entire system (the first sketch after this list illustrates both the bug and the missing guard).
- Second Small Error: Skipping Staging Environment Testing (Testing Phase)
Testing the new Service Control function in a shared staging environment could have impacted the daily work of 76 other Google Cloud product development teams that relied on the system.
To avoid this inconvenience, the Service Control team chose to bypass staging testing, relying instead on local unit tests.
Consequently, the null pointer exception bug went undiscovered and made its way into the production environment (the second sketch after this list shows a targeted test that exercises exactly this case).
- Third Small Error: Real-Time Global Deployment of Policies (Deployment Phase)
Google's product managers prioritized real-time synchronization of services over system safety.
The Service Control system used a distributed database for policy distribution.
When a new policy containing a null value was written to the database in one Google Cloud partition on June 12, 2025, all 42 global partitions simultaneously synchronized this flawed policy.
This triggered the null pointer exception in every partition at once, causing widespread system crashes (a staged-rollout alternative is sketched after this list).
- Fourth Small Error: Flawed Recovery Mechanism (Operations Phase)
Although Google's operations team quickly identified the problem and deployed a patch within 40 minutes, the Service Control system's retry mechanism was inadequate.
It lacked both randomized jitter and exponentially increasing delays (exponential backoff).
As a result, when the system recovered, large partitions such as us-central1 experienced an overwhelming surge of accumulated tasks all attempting to retry simultaneously. This "stampede" of retries crashed the system again, causing further outages in entire partitions (the backoff-with-jitter sketch after this list shows the missing behaviour).
This required the operations team to manually throttle tasks and redirect traffic, taking nearly three hours to clear the backlog.
Overall, Google Cloud's 76 products across 42 global partitions were affected for a total of 6 hours and 41 minutes.
- Cascading Failure: Cloudflare's Critical Dependency on Google Cloud
The outage was exacerbated by Cloudflare's reliance on Google Cloud.
Cloudflare, a key internet infrastructure provider, used its core key-value store product, Workers KV, as the primary database for critical functions like API caching, system configuration, and user authentication.
Crucially, Workers KV was deployed on Google Cloud's servers, not Cloudflare's own.
Furthermore, Cloudflare relied solely on Google Cloud for this store, with no redundant deployment on another platform (the final sketch after this list shows a simple fallback-read pattern).
This meant that Google Cloud's failure not only brought down internet applications directly hosted on it but also took out Cloudflare, an underlying infrastructure service. Cloudflare's subsequent outage then led to the collapse of even more platform services and internet applications, taking significant portions of the internet down with it.
The source highlights that each of these contributing factors represented "entry-level software engineering errors" and points to systemic organizational or industry-wide deficiencies, rather than isolated major faults.
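A minimal Go sketch of the first error and its fix, assuming hypothetical names (QuotaPolicy, Limits, applyPolicySafe) that stand in for Google's actual code: the point is only that a blank/null field must be rejected with an error rather than dereferenced.
```go
// Package quota is an illustrative sketch, not Google's Service Control code.
package quota

import "errors"

// Limits holds the quota values a policy sets.
type Limits struct {
	RequestsPerMinute int
}

// QuotaPolicy models a policy record; Limits may be nil when the record has blank fields.
type QuotaPolicy struct {
	Limits *Limits
}

// applyPolicyUnsafe mirrors the reported failure mode: it dereferences the field
// without a nil check, so a policy with blank fields crashes the process
// (a nil pointer panic, Go's analogue of a null pointer exception).
func applyPolicyUnsafe(p QuotaPolicy) int {
	return p.Limits.RequestsPerMinute
}

// applyPolicySafe adds the missing guard: a malformed policy is rejected with an
// error and the server keeps running.
func applyPolicySafe(p QuotaPolicy) (int, error) {
	if p.Limits == nil {
		return 0, errors.New("malformed policy: missing limits")
	}
	return p.Limits.RequestsPerMinute, nil
}
```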
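The second sketch is the kind of table-driven unit test that exercises the blank-field case directly, with no shared staging environment involved. It reuses the hypothetical QuotaPolicy and applyPolicySafe names from the error-handling sketch and illustrates the missed test case, not the team's actual test suite.
```go
package quota

import "testing"

// TestApplyPolicyHandlesBlankFields covers both a normal policy and the
// blank-field policy that reached production.
func TestApplyPolicyHandlesBlankFields(t *testing.T) {
	cases := []struct {
		name    string
		policy  QuotaPolicy
		wantErr bool
	}{
		{"normal policy", QuotaPolicy{Limits: &Limits{RequestsPerMinute: 100}}, false},
		{"policy with blank fields", QuotaPolicy{}, true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			_, err := applyPolicySafe(tc.policy)
			if (err != nil) != tc.wantErr {
				t.Fatalf("got err = %v, wantErr = %v", err, tc.wantErr)
			}
		})
	}
}
```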
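For the third error, a common alternative to instant global replication is a staged rollout: push the change to a small canary wave, let it bake, check health, and only then widen. The sketch below assumes hypothetical pushPolicy and regionHealthy helpers; it shows the pattern, not how Service Control actually distributes policies.
```go
package main

import (
	"fmt"
	"time"
)

const bakeTime = 2 * time.Second // illustrative; in practice minutes to hours per wave

// Waves widen the blast radius gradually instead of hitting every partition at once.
var waves = [][]string{
	{"canary-region"},
	{"region-a", "region-b"},
	{"all-remaining-regions"},
}

// pushPolicy stands in for the real replication write.
func pushPolicy(region string) { fmt.Println("replicating policy to", region) }

// regionHealthy stands in for crash-loop and error-rate checks.
func regionHealthy(region string) bool { return true }

func main() {
	for _, wave := range waves {
		for _, region := range wave {
			pushPolicy(region)
		}
		time.Sleep(bakeTime) // let the change bake before widening
		for _, region := range wave {
			if !regionHealthy(region) {
				fmt.Println("halting rollout:", region, "degraded after the policy push")
				return
			}
		}
	}
	fmt.Println("rollout complete")
}
```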
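For the fourth error, this is a minimal sketch of the retry behaviour the system lacked: exponentially increasing delays with a cap, plus full random jitter so that recovering servers are not hit by every queued task at the same instant. retryWithBackoff is an illustrative helper, not Google's implementation.
```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with capped exponential backoff and full jitter.
func retryWithBackoff(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		// Exponential backoff: 100ms, 200ms, 400ms, ... capped at 10s.
		delay := base * time.Duration(1<<attempt)
		if delay > 10*time.Second {
			delay = 10 * time.Second
		}
		// Full jitter: sleep a random duration in [0, delay) so clients desynchronize.
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
	}
	return errors.New("giving up after max attempts")
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("service still recovering") // simulate a server that needs a few tries
		}
		return nil
	}, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```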
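Finally, on the Cloudflare dependency: one simple hedge against a single-provider failure is to keep a replica of the store on an independent platform and fall back to it when the primary read fails. The Store interface and mapStore below are hypothetical; this sketches the redundancy pattern, not Cloudflare's actual architecture.
```go
package main

import (
	"errors"
	"fmt"
)

// Store is a minimal key-value read interface.
type Store interface {
	Get(key string) (string, error)
}

// mapStore is an in-memory stand-in; down simulates a provider outage.
type mapStore struct {
	data map[string]string
	down bool
}

func (m mapStore) Get(key string) (string, error) {
	if m.down {
		return "", errors.New("provider unavailable")
	}
	v, ok := m.data[key]
	if !ok {
		return "", errors.New("key not found")
	}
	return v, nil
}

// redundantGet reads from the primary provider and falls back to a replica
// hosted on an independent platform when the primary fails.
func redundantGet(primary, fallback Store, key string) (string, error) {
	if v, err := primary.Get(key); err == nil {
		return v, nil
	}
	return fallback.Get(key)
}

func main() {
	primary := mapStore{down: true}                                // e.g. the store hosted on the failing provider
	fallback := mapStore{data: map[string]string{"config": "v42"}} // replica elsewhere
	v, err := redundantGet(primary, fallback, "config")
	fmt.Println(v, err)
}
```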