Google Cloud June 2025 Outage: Service Control Null Pointer

The Story

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

— Google Cloud — Official Incident Report, June 14, 2025

Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorizes every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.

When Service Control crashed on June 12, 2025, it didn't just take down one service. It took down the authorization layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. None of it mattered — because without Service Control, nothing could be authorized, and nothing unauthorized can proceed in a correctly secured cloud platform.

WHAT SERVICE CONTROL ACTUALLY DOES

Google Cloud's Service Control system performs three functions on every API request: authentication (is this requester who they claim to be?), authorization (are they allowed to perform this operation?), and quota enforcement (have they exceeded their usage limits?). It processes these checks at massive scale across every region — billions of API calls per day — using policy metadata stored in and synchronized across Spanner, Google's globally distributed database. The May 29 code change was adding more sophisticated quota checking logic to this pipeline. The change worked correctly in every scenario that was tested. The scenario that wasn't tested was the one that appeared on June 12.

Problem

May 29: Code Deployed — Bug Present, But Invisible

Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by a specific type of policy input — blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.

Cause

June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger

An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. The policy update contained unintended blank fields — values that should have been populated but weren't. Because quota management is global, the Spanner replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.
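The crash mechanism described above can be sketched in a few lines. This is a minimal illustration, not Service Control's code: `QuotaPolicy`, its fields, and both check functions are hypothetical names, and the choice to fall back to a default decision on a malformed policy is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy row; any field may arrive blank (None)."""
    project: Optional[str]
    limit: Optional[int]


def check_quota_unsafe(policy: QuotaPolicy, used: int) -> bool:
    # Mirrors the failure mode: a blank `limit` raises TypeError here,
    # and with no handler the exception propagates to the caller.
    return used < policy.limit


def check_quota_safe(policy: Optional[QuotaPolicy], used: int,
                     allow_on_bad_policy: bool = True) -> bool:
    # Guard every field the new code path reads before using it.
    if policy is None or policy.project is None or policy.limit is None:
        # Assumption: a malformed policy yields a configured default
        # decision instead of an unhandled exception.
        return allow_on_bad_policy
    return used < policy.limit
```

With a blank `limit`, `check_quota_unsafe` raises while `check_quota_safe` returns the configured default, so one bad row degrades a single decision rather than the whole process.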

Solution

The SRE Response: Diagnosis in 10 Minutes, Red Button in 40

Google's Site Reliability Engineering team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours. But us-central1, Google's largest region, hit a second problem: the recovery itself.

Result

The Herd Effect: When Recovery Made Things Worse

As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at the same time, with no randomization in their startup sequence. The Spanner database — which had been handling steady-state read traffic fine — was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, which meant it couldn't restart properly, which meant it kept trying, which kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilized. Full resolution across all services wasn't complete until 18:18 PDT — more than seven hours after the incident began.

Google's own Cloud Service Health dashboard went down during the incident — the monitoring system that customers rely on to track outage status was itself affected by the outage it was supposed to report. Engineers and customers trying to understand what was happening couldn't access the standard communication channel. A status page that goes down during the incident it's supposed to report is a monitoring anti-pattern at its most consequential.

What Went Dark: The Third-Order Cascade

The blast radius of the June 12 outage had three concentric rings. The innermost ring was Google's own services: Google Cloud Platform APIs, Google Workspace (Gmail, Calendar, Drive, Docs, Meet), IAM, Cloud Storage, Compute Engine, BigQuery, Cloud SQL, Cloud Spanner, Vertex AI, Cloud Monitoring. The second ring was companies building directly on GCP — Spotify, Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor — whose applications were unable to authorize any backend calls. The third ring was the one that made this incident uniquely instructive: companies that depend on Cloudflare, which depends on Google Cloud. Cloudflare — itself one of the internet's core infrastructure providers — uses Google Cloud for some of its backend operations. When Google Cloud's Service Control failed, Cloudflare experienced partial degradation, which in turn degraded Discord, Twitch, and other services that had built on top of Cloudflare. This is third-order cascading failure: Google fails → Cloudflare degrades → Discord goes down. Discord's users had no idea their outage had anything to do with a null pointer exception in a Google quota management system.

The three-ring cascade of the June 12, 2025 Google Cloud outage — and the dependency chain that connected them

Failure Ring	What Failed	Why
First: Google's own infrastructure	Cloud IAM, Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Cloud Spanner, Vertex AI, Cloud Monitoring, Google Workspace (Gmail, Calendar, Drive, Docs, Meet)	Service Control — the authorization gateway — crashed globally, blocking all API requests across GCP and Workspace
Second: Direct GCP customers	Spotify (~46K outage reports), Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor, Perplexity AI	Applications built on GCP couldn't authorize any backend calls — services appeared down to users even though underlying compute was running
Third: Cloudflare and its customers	Cloudflare (partial degradation), Discord, Twitch	Cloudflare uses Google Cloud for certain backend operations; when those degraded, Cloudflare's services partially degraded, cascading to Cloudflare's own customers

The Dormant Trap: Why Staged Rollouts Didn't Catch This

Google's staged, region-by-region rollout is exactly the right practice for catching bugs introduced by new deployments. It worked correctly for 14 days — no failures appeared during the May 29 rollout because the failure condition required specific policy data (blank fields) that hadn't yet been inserted. The bug was a dormant trap: present in production, but invisible until the exact trigger arrived. This is a class of bug that staged rollouts are structurally unable to catch — because the rollout environment and the trigger environment are separated by two weeks and an automated policy change that nobody controlled. The only defenses against dormant traps are error handling (so the crash doesn't happen when the trigger arrives) and feature flags (so the code path can be disabled immediately when the trigger produces unexpected behavior). The May 29 change had neither.

10 min

Time for Google's SRE team to identify the root cause — null pointer exception in the new quota checking code path — from the first alert at 10:49 AM PDT on June 12

40 min

Time to deploy the 'red button' kill switch that disabled the problematic Service Control serving path and allowed most regions to begin recovery

7+ hrs

Total outage duration — most regions recovered within 2 hours, but the herd effect in us-central1 extended full resolution to 18:18 PDT

50+

Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services including Vertex AI

The Fix

Google's Response: Five Commitments After the Outage

Google's incident report, published June 14, 2025, outlined specific remediation steps across five categories. Each addresses a distinct failure mode that either caused the outage or made it worse than it needed to be.

Google's five-category post-incident remediation plan, derived from the official June 14, 2025 incident report

Failure Mode	What Happened	Google's Fix
Missing error handling	The new quota checking code had no null-safety — when blank fields appeared, a null pointer exception crashed the binary	Mandatory null-safe code patterns for all Service Control code paths, with additional static analysis to catch null pointer vulnerabilities before deployment
No feature flag	Without a feature flag, the new code path could not be disabled without a full binary redeployment — adding 30+ minutes to initial response time	Feature flag protection required for all new code paths in Service Control — a flag would have allowed the problematic path to be disabled within seconds, before most regions crashed
Herd effect during recovery	Hundreds of Service Control instances restarting simultaneously all hit Spanner at the same time, overwhelming it and prolonging the us-central1 outage by 2+ hours	Randomized exponential backoff on Service Control startup — instances restart with jittered delays so Spanner load is distributed over time rather than concentrated in a burst
Status page availability	Google's Cloud Service Health dashboard went down during the outage, removing the primary customer communication channel	Decoupled status infrastructure — the status page must be architecturally independent of the services it monitors, with its own Service Control dependency removed
Service Control architecture	Service Control is a monolithic regional binary — a crash in Service Control takes down all API authorization for the entire region simultaneously	Modularize Service Control's architecture — isolate the quota checking component so a crash in quota logic does not crash the authentication and authorization components

The Feature Flag That Would Have Saved Seven Hours

The most consequential missing safeguard in the June 12 outage was the absence of a feature flag on the new quota checking code path. A feature flag, a configuration switch that enables or disables a code path without a redeployment, would have changed the incident timeline dramatically. When the null pointer exceptions began firing at 10:49 AM PDT, engineers with a feature flag could have disabled the new code path across all regions within seconds or minutes. Without a feature flag, the only option was a red-button kill switch whose rollout took 40 minutes and still left the herd effect problem during restart: 40 minutes of global Service Control outage versus seconds for a feature flag toggle. Google's own incident report acknowledges the gap directly: 'If this had been flag protected, the issue would have been caught in staging.' The cost of not having a feature flag was measured in hundreds of millions of users unable to access their services for up to seven hours.
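A feature flag guard of the kind discussed here can be sketched as follows. Everything in the block is illustrative: `FlagStore`, the flag name `new_quota_policy_check`, and `check_request` are hypothetical, and a production flag system would read from replicated configuration rather than an in-memory dictionary.

```python
from typing import Callable, Dict


class FlagStore:
    """Hypothetical in-memory flag store; real systems read from config."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code ships dark.
        return self._flags.get(name, False)


def check_request(flags: FlagStore,
                  legacy_check: Callable[[dict], bool],
                  new_quota_check: Callable[[dict], bool],
                  request: dict) -> bool:
    allowed = legacy_check(request)
    # The new path runs only while its flag is on; turning the flag off
    # removes it from the request path without a redeploy.
    if allowed and flags.is_enabled("new_quota_policy_check"):
        allowed = new_quota_check(request)
    return allowed
```

Defaulting unknown flags to off lets the new path be enabled for a small slice of traffic first, and gives responders a switch that does not depend on shipping a binary.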

The Herd Effect: A Recovery Anti-Pattern With a Known Fix

The herd effect that prolonged the us-central1 outage is not a new problem. It has been documented since the earliest days of distributed systems: when many clients restart simultaneously after a shared dependency recovers, they all connect simultaneously and overwhelm the dependency, preventing it from returning to steady state. The canonical solution — randomized exponential backoff — is equally well-documented and simple: when restarting, add a random delay so clients stagger their reconnection attempts over a time window rather than clustering them at a single instant. Every Service Control instance waiting exactly zero milliseconds before hitting Spanner on restart is the problem. Service Control instances waiting a random delay between 0 and 30 seconds before hitting Spanner on restart is the solution. Google committed to implementing this. The fact that it required an outage to prompt the implementation is a reminder that known fixes for known problems often go unimplemented until the cost of not implementing them is paid in production.
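The fix described above can be sketched as a startup jitter followed by capped exponential backoff with full jitter. This is an assumption-laden illustration, not Google's implementation: the function names, the 30-second window (taken from the example in the text), the base delay, and the attempt limit are all placeholders.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    uniform = rng.uniform if rng is not None else random.uniform
    return uniform(0.0, min(cap, base * (2 ** attempt)))


def load_with_backoff(load: Callable[[], T], max_attempts: int = 6,
                      startup_jitter: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Call `load` after a random startup delay, backing off on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    uniform = rng.uniform if rng is not None else random.uniform
    # Spread the first read so restarting instances do not align at T+0.
    sleep(uniform(0.0, startup_jitter))
    for attempt in range(max_attempts):
        try:
            return load()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(backoff_delay(attempt, rng=rng))
    raise RuntimeError("unreachable")
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting; a caller would normally rely on the defaults.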

MTTD VS DURATION: WHAT THE NUMBERS ACTUALLY TELL YOU

Google's SRE team began triaging within two minutes of the first alert. This is elite incident response. The MTTD (Mean Time to Detect) was near-instantaneous, and the root cause diagnosis took 10 minutes. These are remarkable numbers for a global infrastructure failure of this complexity. The lesson is not that Google's response was slow — it was fast. The lesson is that even elite incident response cannot compensate for missing preventative safeguards. Feature flags, error handling, and randomized backoff would have prevented or dramatically shortened the outage before any human had time to respond. The SRE team's quality is demonstrated by the 10-minute diagnosis. The systems quality gap is demonstrated by the 7-hour duration.

The Broader Lesson: Cleanup Operations Are the Hardest Code to Get Right

The June 12 outage shares an important structural pattern with the October 2025 AWS DynamoDB incident: the trigger was not a complex new feature or an ambitious architectural change. It was a routine operation — in AWS's case, a cleanup job that deleted stale DNS plans; in Google's case, an automated policy update that inserted routine quota metadata. Routine operations are the hardest to protect against because they're not treated with the same scrutiny as new features. A new feature gets code review, testing, staged rollout, and monitoring. A routine automated policy update is assumed to be safe because it's been running correctly for months. But when the underlying system has changed — when new code is now in the critical path that couldn't handle what the routine operation produces — the routine operation becomes the trigger for a catastrophic failure. The implication is uncomfortable: every automated operation that modifies system-critical metadata must be treated as a potential trigger for any latent bugs in the code that consumes that metadata.
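One way to act on that implication is to validate automated metadata updates before they are written. The sketch below assumes a hypothetical record schema (`policy_id`, `project`, `quota_limit`); the real policy tables are not described in the incident report, so the field names are placeholders.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: the fields a consumer is assumed to dereference.
REQUIRED_FIELDS = ("policy_id", "project", "quota_limit")

Record = Dict[str, Optional[object]]


def find_blank_fields(record: Record) -> List[str]:
    """Return required fields that are missing, None, or empty strings."""
    blanks = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            blanks.append(field)
    return blanks


def validate_batch(
    records: List[Record],
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split a batch into (accepted, rejected) before anything is written."""
    accepted, rejected = [], []
    for record in records:
        blanks = find_blank_fields(record)
        if blanks:
            rejected.append((record, blanks))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejected records carry the names of their blank fields, so the producing job can alert on them instead of silently propagating the update.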

Architecture

Service Control sits at the intersection of every API request Google Cloud processes. Understanding how it failed — and why the failure spread so quickly and recovered so slowly — requires understanding three things: the role of Spanner as the global policy data store, the absence of safe failure handling in the new code path, and the herd effect as a predictable consequence of synchronized restart under load.

Normal Flow vs. June 12 Failure: What Service Control Does on Every Request

flowchart TD
    subgraph normal["Normal API Request Flow"]
        req["API Request"] --> sc_auth["Service Control\nAuthenticate"]
        sc_auth --> sc_quota["Service Control\nQuota Check"]
        sc_quota --> sc_policy["Read Policy from Spanner"]
        sc_policy --> allow["Request Authorized ✓"]
        allow --> service["GCP Service Executes"]
    end
    subgraph failure["June 12: Failure Flow"]
        req2["API Request"] --> sc_crash["Service Control\nLoads quota policy"]
        sc_crash --> null_ptr["Blank field in policy →\nNull Pointer Exception 💥"]
        null_ptr --> crash["Binary Crashes"]
        crash --> err503["503 Error returned\nto caller"]
        err503 --> down["All GCP APIs\ninaccessible"]
    end
    style null_ptr fill:#ef4444,color:#ffffff
    style crash fill:#ef4444,color:#ffffff
    style down fill:#ef4444,color:#ffffff

The Herd Effect: Why us-central1 Recovery Took 2+ Extra Hours

sequenceDiagram
    participant SC1 as Service Control Instance 1
    participant SC2 as Service Control Instance 2
    participant SCN as Service Control Instance N
    participant Spanner as Regional Spanner DB
    Note over SC1,SCN: Red button deployed, all instances restart
    SC1->>Spanner: Load policy metadata (T+0s)
    SC2->>Spanner: Load policy metadata (T+0s)
    SCN->>Spanner: Load policy metadata (T+0s)
    Note over Spanner: Overwhelmed, hundreds of simultaneous reads
    Spanner-->>SC1: Timeout ✗
    Spanner-->>SC2: Timeout ✗
    Spanner-->>SCN: Timeout ✗
    Note over SC1,SCN: Instances retry, still all simultaneously
    SC1->>Spanner: Retry (T+5s)
    SC2->>Spanner: Retry (T+5s)
    SCN->>Spanner: Retry (T+5s)
    Note over Spanner: Still overwhelmed, herd effect continues
    Note over SC1,SCN: Fix: randomized exponential backoff\n(T+random[0..30s]) staggers load
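The difference between synchronized and staggered restarts in the diagram can be checked with a small simulation. The instance count of 500 is a stand-in for "hundreds", the 30-second window comes from the diagram's example, and the function names are hypothetical; the model only counts first reads per second and ignores retries.

```python
import random
from collections import Counter
from typing import Iterable


def peak_reads_per_second(start_times: Iterable[float]) -> int:
    """Largest number of instances whose first read lands in one second."""
    buckets = Counter(int(t) for t in start_times)
    return max(buckets.values())


def simulate(instances: int = 500, jitter_window: float = 0.0,
             seed: int = 42) -> int:
    rng = random.Random(seed)
    # Each instance issues its first policy read after a random delay
    # in [0, jitter_window]; a window of 0 means everyone reads at T+0.
    starts = [rng.uniform(0.0, jitter_window) for _ in range(instances)]
    return peak_reads_per_second(starts)
```

With `jitter_window=0.0` every read lands in the same second, so the peak equals the instance count; with a 30-second window the same reads spread across roughly 30 one-second buckets, which cuts the peak by more than an order of magnitude in this toy model.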

THE GLOBAL SPANNER REPLICATION TRAP

The reason the June 12 failure was global rather than regional was Spanner's design strength working against Google in this case. Spanner is Google's globally distributed database, engineered to replicate data to all regions in real time — typically within seconds. This replication is what makes Spanner so powerful for global consistency. On June 12, it was what made the failure instantaneous and universal. When the automated system inserted the policy update with blank fields into the regional Spanner tables, Spanner replicated that policy data to every region within seconds. Every Service Control instance in every region hit the null pointer at essentially the same moment. There was no regional staging, no propagation delay, no opportunity for an alert to fire in one region before the failure had spread to all others. The same architecture that gives Spanner its global consistency guarantee gave this bug its global blast radius.

The Status Page That Went Dark

Google's Cloud Service Health dashboard — the system that customers rely on to understand what Google services are experiencing — went offline during the June 12 outage. This happened because the status infrastructure shared a dependency on the same Google Cloud services that were failing. A status page that fails during a widespread outage is not just unhelpful — it is actively harmful. Customers experiencing failures couldn't access the standard channel to confirm they weren't the source of the problem, couldn't track recovery progress, and couldn't communicate accurate information to their own stakeholders. The status page being down created a second outage: an outage of information. Google's commitment to decoupling status infrastructure from the services it monitors is not a nice-to-have. It is the baseline requirement for maintaining communication with customers during incidents.

Lessons

The June 12, 2025 Google Cloud outage carries lessons that apply to every engineering team — from startups deploying to a single cloud region to hyperscalers managing global infrastructure. The failure modes are not exotic. They are the canonical patterns of distributed systems engineering: missing error handling, absent feature flags, synchronized recovery storms. The scale is Google's. The lessons belong to everyone.

Error handling is not optional for code that runs in the critical path of a globally distributed system. The null pointer exception that crashed Service Control was caused by a missing two-line null check. Any code path that processes external data — data that arrives from an automated system and could contain unexpected values — must explicitly handle the unexpected cases. The failure condition was not unpredictable. Blank fields in policy metadata is a predictable input variation. The code should have anticipated it.

Feature flags on infrastructure code are not optional; they are the minimum viable safety mechanism for any code that processes global-scale policy data. The difference between 'feature flag enabled, issue caught in staging' and 'no feature flag, 7-hour global outage' is one line of configuration. Every new code path in a globally deployed binary should require a feature flag as a deployment prerequisite, not a nice-to-have.

The herd effect, the thundering herd problem in its classic form, is a known failure mode with a known fix. Randomized exponential backoff on service restart is the standard solution, and it has been documented for decades. The fact that Service Control lacked it is a reminder that well-known fixes go unimplemented until the cost of not implementing them becomes acute. Build randomized backoff into any service that has a shared dependency it needs to reconnect to after a failure.

Your monitoring infrastructure must be architecturally independent of the services it monitors. A status page that goes down during an outage is a second outage layered on top of the first. This means no shared dependencies between the monitoring stack and the application stack, separate cloud regions or providers for status infrastructure, and tested independence — verifying that a full outage of the primary platform does not affect the observability layer. This is not easy to build, but it is essential. The moment customers need status information most is exactly the moment a shared-dependency status page is most likely to be unavailable.

Third-order cascade failures are invisible until they happen. Discord's users had no idea their outage originated in a null pointer exception in Google's quota management code. The dependency chain was opaque: Discord → Cloudflare → Google Cloud → Service Control → policy metadata blank fields. Every engineering team should map their dependency chain at least two levels deep — not just 'we use Cloudflare' but 'Cloudflare uses Google Cloud, and a Google Cloud outage of sufficient scope will reach us through Cloudflare.' This mapping informs both architectural decisions and incident response communication.
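Mapping a dependency chain to a chosen depth is a short graph walk. The sketch below uses only the edges named in the chain above; `GRAPH` and `dependency_paths` are illustrative names, and a real inventory would be assembled from vendor documentation and would be much larger.

```python
from collections import deque
from typing import Dict, List


def dependency_paths(graph: Dict[str, List[str]], service: str,
                     max_depth: int = 3) -> Dict[str, List[str]]:
    """Breadth-first walk returning each reachable dependency with its path."""
    paths: Dict[str, List[str]] = {}
    queue = deque([(service, [service])])
    while queue:
        node, path = queue.popleft()
        if len(path) - 1 >= max_depth:
            continue  # Do not expand beyond the requested depth.
        for dep in graph.get(node, []):
            if dep not in paths and dep != service:
                paths[dep] = path + [dep]
                queue.append((dep, paths[dep]))
    return paths


# Edges taken from the chain described above; a real map would be larger.
GRAPH = {
    "Discord": ["Cloudflare"],
    "Cloudflare": ["Google Cloud"],
    "Google Cloud": ["Service Control"],
}
```

Calling `dependency_paths(GRAPH, "Discord", max_depth=2)` surfaces Cloudflare and Google Cloud, which is the two-level view the paragraph recommends; raising `max_depth` to 3 also reaches Service Control.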

The Modularization Commitment

Google's most architecturally significant post-incident commitment was to modularize Service Control — isolating the quota checking component so that a failure in quota logic cannot crash the authentication and authorization components. Currently, Service Control is a single binary: a null pointer in any component crashes everything. In a modularized architecture, the quota checking subsystem can crash or be restarted without affecting authentication. This is the same architectural move GitHub is making with service isolation — and it is the right call for the same reason: blast radius containment. The cost of modularizing a critical monolithic binary is high. The cost of not doing it is measured in seven-hour global outages.
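The containment idea can be illustrated in-process, with the caveat that this is only an analogy: the commitment described above concerns separating components so one can crash or restart independently, whereas the sketch below simply puts an exception boundary around a quota step. `Decision`, `authorize`, and the `quota_default` behaviour are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    allowed: bool
    quota_degraded: bool = False


def authorize(request: dict,
              authenticate: Callable[[dict], bool],
              check_policy: Callable[[dict], bool],
              check_quota: Callable[[dict], bool],
              quota_default: bool = True) -> Decision:
    # Authentication and authorization failures still deny the request.
    if not authenticate(request) or not check_policy(request):
        return Decision(allowed=False)
    try:
        return Decision(allowed=check_quota(request))
    except Exception:
        # Boundary around the quota component: its failure is contained
        # and flagged, and the other two checks keep working.
        # Assumption: `quota_default` decides what a degraded quota means.
        return Decision(allowed=quota_default, quota_degraded=True)
```

The `quota_degraded` marker matters as much as the boundary itself: a contained failure that nobody can see becomes the next dormant trap.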

THE IRONY OF THE RECOVERY MAKING THINGS WORSE

The most memorable aspect of the June 12 outage is the herd effect: the act of fixing the crash created a new problem that extended the outage by hours. This pattern — where the recovery mechanism amplifies the damage — appears across some of the most instructive engineering post-mortems in the industry. Netflix built Chaos Monkey partly because they discovered that their graceful degradation paths, when triggered simultaneously by a real failure, could overload the systems they were supposed to protect. AWS's October 2025 DynamoDB outage extended 12 extra hours because the automatic failover mechanisms were making the DNS inconsistency worse. And Google's June 2025 Service Control outage extended beyond the crash itself because hundreds of instances restarting simultaneously overwhelmed the Spanner database they all depended on. The lesson that runs through all three: design recovery paths with the same care you design primary paths — because recovery paths are the code that runs when everything is already wrong.

Google deployed a null pointer exception on May 29, and it sat patiently in production for two weeks waiting for exactly the wrong policy update to arrive, like a trapdoor that looks like a floor until someone with the right keycard walks over it. Then it took down Spotify, Discord, Snapchat, Cloudflare, and Google's own status page simultaneously, and when engineers hit the kill switch to fix it, the restart surge broke the database they were restarting into. There is a version of this story where the lesson is 'write a null check.' There is a more useful version where the lesson is: the most dangerous code in your system is the code that runs perfectly for two weeks before it fails catastrophically, because you will have completely forgotten it's there by the time it does.

TechLogStack: built at scale, broken in public, rebuilt by engineers