The Story
On February 2, 2026, between 18:35 UTC and 23:10 UTC, GitHub Codespaces stopped launching. Existing sessions held on, but no new environment could spin up. Copilot Coding Agent, CodeQL, Dependabot, GitHub Enterprise Importer, and GitHub Pages all went down with it. The error offered no useful signal: 'failed to provision VM.' What actually happened was that GitHub's security automation decided every VM on the platform was untrusted.
GitHub's compute infrastructure uses a telemetry pipeline to pass health and identity signals about active VMs to the security layer. When that pipeline dropped — due to an unrelated infrastructure change — the security system didn't see an uncertain situation. It saw silence, and it treated silence as a confirmed threat. Security policies designed to protect backend storage accounts activated automatically, blocking access to VM metadata. Codespaces depends on that metadata service to initialize every new environment. With it blocked, all provisioning failed in every region.
Problem
A telemetry gap caused security policies to activate against backend storage accounts. This blocked VM metadata access, preventing all Codespaces provisioning and halting Actions hosted runners across all regions and runner types.
Problem
Telemetry pipeline drops
At 18:35 UTC, compute telemetry stops flowing. Security policies interpret the silence as a breach signal and activate against backend storage accounts.
Cause
VM metadata access blocked
New Codespaces VMs cannot access their own metadata service — a prerequisite for initialization. All provisioning attempts fail across all regions.
Solution
Policy rollback and telemetry restore
Engineers identify the telemetry gap and roll back the security policy activation. Telemetry pipeline is restored.
Result
Full recovery by February 3
Standard runners recover at 23:10 UTC. Larger runners recover at February 3, 00:30 UTC. Codespaces fully restored at 00:15 UTC. Self-hosted runners on other providers were unaffected throughout.
The Fix
GitHub separated the telemetry-dependent security policies from the provisioning critical path. New VMs now receive a time-limited provisioning trust token that allows metadata access even when telemetry has not yet fully initialized. The security policies that triggered were moved to a higher activation threshold — multiple corroborating signals are now required before policies can block infrastructure access. A single missing telemetry stream is no longer sufficient to trigger a lockout.
Solution
Provisioning trust token added with time-limited metadata access. Multi-signal threshold required before security policies activate. Telemetry loss alone can no longer trigger a storage lockout.
Lessons
What to remember
- Systems that act on the absence of a signal are dangerous. Silence is not evidence of a threat. It is evidence of a monitoring gap.
- Security automation that can block production infrastructure needs the same change-control rigor as a production deploy — not just a security review.
- If your security posture instantly hardens when telemetry drops, you've built a system that becomes less available the less observable it is.
- Verify trust before provisioning starts, not during it. The provisioning path is not the right place to enforce complex security policies mid-operation.