To understand why a script deleting 883 sites took two weeks to reverse, you need to understand what an Atlassian site actually is. A site — for example yourcompany.atlassian.net — is not a row in a database. It is a logical container distributed across dozens of services, each maintaining their own slice of state. Identity data (users, groups, permissions) lives in one service. Product databases for Jira, Confluence, and Opsgenie live in others. Media attachments, feature flags, licensing metadata, third-party app configurations — each of these occupies a separate data store. All of it is hosted on AWS and orchestrated through , Atlassian's internal PaaS. The site deletion did not touch a single database — it sent deletion events through the standard provisioning workflow, and every downstream service dutifully removed its copy.
How a site deletion propagated through Atlassian's distributed architecture
flowchart TD
script["Cleanup Script\n(site IDs passed instead of app IDs)"] -->|"permanent delete event"| provisioner["Cloud Provisioner\n(AWS SWF orchestration)"]
provisioner -->|"tenant destruction event"| catalogue["Catalogue Service\n(site metadata + cloudId)"]
provisioner -->|"product deactivation events"| identity["Identity Service\n(users, groups, permissions)"]
provisioner -->|"product deactivation events"| jira_db[("Jira Databases\n(per-shard, multi-tenant)")]
provisioner -->|"product deactivation events"| confluence_db[("Confluence Databases\n(per-shard, multi-tenant)")]
provisioner -->|"product deactivation events"| media[("Media Service\n(attachments, images)")]
provisioner -->|"product deactivation events"| opsgenie_db[("Opsgenie\n(incident management)")]
provisioner -->|"product deactivation events"| statuspage_db[("Statuspage\n(status communication)")]
provisioner -->|"product deactivation events"| commerce["Commerce / Identity\n(billing + contacts DELETED)"]
commerce -.->|"contact info gone\u2014no way to reach customers"| support_gap["Support Gap\n(customers cannot file tickets)"]
style script fill:#ef4444,color:#fff
style support_gap fill:#f59e0b,color:#fff
style catalogue fill:#ef4444,color:#fff
WHY A GLOBAL ROLLBACK WASN'T POSSIBLE
The instinctive solution — roll back the entire database to before the script ran — was blocked by the
. Each database shard contained data from hundreds of customers, most of whom were completely unaffected. A global rollback would have wiped hours of real work from tens of thousands of innocent customers. The only option was surgical: extract and replay each deleted customer's records individually, from 30-day immutable backups, without touching any surrounding data.
Restoration 2 — re-creating records in-place using original identifiers to avoid re-mapping
flowchart LR
backup[("Point-in-Time Backup\n(30-day immutable)")] --> restore_point["Restore Point\n5 min before deletion"]
restore_point --> catalogue_restore["1. Re-create Catalogue Record\n(preserve original cloudId)"]
catalogue_restore --> parallel_work["Run in Parallel"]
parallel_work --> identity_restore["2a. Identity Restore\n(users, groups, permissions)"]
parallel_work --> db_restore["2b. Product DB Restore\n(Jira, Confluence, Opsgenie)"]
parallel_work --> media_restore["2c. Media Restore\n(attachments, images)"]
identity_restore --> services_restore["3. Cross-Service Restore\n(feature flags, apps)"]
db_restore --> services_restore
media_restore --> services_restore
services_restore --> auto_validate["4. Automated Validation\n(health checks per product)"]
auto_validate --> customer_validate["5. Customer Sign-Off\n(manual final check)"]
customer_validate --> restored(["Site Restored"])
style backup fill:#22c55e,color:#fff
style restored fill:#22c55e,color:#fff
style catalogue_restore fill:#0052CC,color:#fff
ℹ️The Missing Layer: Multi-Site Automated DR
Atlassian's disaster recovery had been designed for infrastructure failures — a lost database, a failed availability zone, a corrupted single service. It had never been designed for the scenario of selectively restoring hundreds of customers from shared backups into a live production environment. The capability existed for single-site recovery; it simply hadn't been automated or tested at this scale. The incident forced Atlassian to build, from scratch, the tooling that would eventually become a core part of their DR program.