Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorizes every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.
When Service Control crashed on June 12, 2025, it didn't just take down one service. It took down the authorization layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. None of it mattered — because without Service Control, nothing could be authorized, and nothing unauthorized can proceed in a correctly secured cloud platform.
WHAT SERVICE CONTROL ACTUALLY DOES
Google Cloud's Service Control system performs three functions on every API request:
authentication (is this requester who they claim to be?),
authorization (are they allowed to perform this operation?), and
quota enforcement (have they exceeded their usage limits?). It processes these checks at massive scale across every region — billions of API calls per day — using policy metadata stored in and synchronized across Spanner, Google's globally distributed database. The May 29 code change was adding more sophisticated quota checking logic to this pipeline. The change worked correctly in every scenario that was tested. The scenario that wasn't tested was the one that appeared on June 12.
Problem
May 29: Code Deployed — Bug Present, But Invisible
Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by a specific type of policy input — blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.
Cause
June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger
An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. The policy update contained unintended blank fields — values that should have been populated but weren't. Because quota management is global, the Spanner replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.
Solution
The SRE Response: Diagnosis in 10 Minutes, Red Button in 40
Google's Site Reliability Engineering team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours. But us-central1, Google's largest region, hit a second problem: the recovery itself.
Result
The Herd Effect: When Recovery Made Things Worse
As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at the same time, with no randomization in their startup sequence. The Spanner database — which had been handling steady-state read traffic fine — was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, which meant it couldn't restart properly, which meant it kept trying, which kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilized. Full resolution across all services wasn't complete until 18:18 PDT — more than seven hours after the incident began.
The blast radius of the June 12 outage had three concentric rings. The innermost ring was Google's own services: Google Cloud Platform APIs, Google Workspace (Gmail, Calendar, Drive, Docs, Meet), IAM, Cloud Storage, Compute Engine, BigQuery, Cloud SQL, Cloud Spanner, Vertex AI, Cloud Monitoring. The second ring was companies building directly on GCP — Spotify, Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor — whose applications were unable to authorize any backend calls. The third ring was the one that made this incident uniquely instructive: companies that depend on Cloudflare, which depends on Google Cloud. Cloudflare — itself one of the internet's core infrastructure providers — uses Google Cloud for some of its backend operations. When Google Cloud's Service Control failed, Cloudflare experienced partial degradation, which in turn degraded Discord, Twitch, and other services that had built on top of Cloudflare. This is third-order cascading failure: Google fails → Cloudflare degrades → Discord goes down. Discord's users had no idea their outage had anything to do with a null pointer exception in a Google quota management system.
The three-ring cascade of the June 12, 2025 Google Cloud outage — and the dependency chain that connected them
10 min
Time for Google's SRE team to identify the root cause — null pointer exception in the new quota checking code path — from the first alert at 10:49 AM PDT on June 12
40 min
Time to deploy the 'red button' kill switch that disabled the problematic Service Control serving path and allowed most regions to begin recovery
7+ hrs
Total outage duration — most regions recovered within 2 hours, but the herd effect in us-central1 extended full resolution to 18:18 PDT
50+
Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services including Vertex AI