Slack

Every Slack engineering case study on TechLogStack — real production incidents, post-mortems, and fixes.

Slack Reliability
17 min

Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work.

1.5 years migration time · 99.99% SLA maintained

Slack Reliability
17 min

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

On February 22, 2022, Slack went down for many users, including the engineer designated as Incident Commander, who wrote the postmortem from personal experience. The culprit was a new component that worked exactly as designed.

Slack Distributed Systems
16 min

Slack Rewrote Its Core Architecture for Enterprise — Because the Old One Was a Lie

Slack was built for teams in single workspaces. Enterprise customers were using it across dozens of workspaces simultaneously — and the architecture had never been designed for that. Every major enterprise feature was a workaround on top of a foundation that assumed one workspace per person. Slack spent two years rebuilding the foundation.

2 years development time

Slack Reliability
16 min

Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. Eighteen months later, customer impact hours were down 90%.

73% incidents from own deploys · 90% reduction in impact hours · 18 months of sustained investment