⚡ Engineering War Stories from the Trenches

When Big Tech Breaks
& How They Fix It

The postmortems they published. The outages they survived. The fixes that saved millions of users. Read the real story — not the press release.

∞

Things Broke

100%

No Fiction

Netflix Chaos Engineering

Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose

It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.

July 19 2011 blog published · Business-hours instance killing · 10 Simian Army members

Hotstar Live Streaming

When MS Dhoni Got Out: How Hotstar Survived 25 Million Concurrent Users

July 9th, 2019. India vs New Zealand, Cricket World Cup semi-final. MS Dhoni walks to the crease and 1.1 million new viewers join Hotstar every single minute. Then he gets run out — and 24 million people hit the back button almost simultaneously.

25.3M peak concurrent · 1.1M users/min growth · 5.7 Tbps bandwidth

GitHub Databases

How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query

MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.

1,200+ MySQL hosts upgraded · 300+ TB data migrated · 5.5M queries/sec maintained

Slack Reliability

Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work.

2021-06-30 AZ outage trigger · 1.5 years migration time · AZ drain in <5 minutes

Cloudflare Reliability

Cloudflare Fixed a React Security Vulnerability and Broke the Entire Network

In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. To do so, they needed to disable an internal testing tool with a global killswitch. The killswitch, unexpectedly, triggered a bug that sent HTTP 500 errors across Cloudflare's entire global network. This was the third major configuration-related global outage in two years.

Dec 2025 global outage · React CVE fix triggered outage · Global killswitch bug

LinkedIn Messaging

LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.

In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.

1B events/day at launch (2011) · 1T messages/day by 2015 · 7T messages/day by 2019

Discord Databases

How Discord Migrated Trillions of Messages and Fired Their Garbage Collector

It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.

177 → 72 nodes · p99 latency 15ms (was 40–125ms) · 9-day migration (was 3-month est.)

Discord Reliability

Discord Killed the MacBook Dev Environment and Never Looked Back

Discord's engineering team had tripled in size and was drowning in a swamp of 'works on my machine' bugs — some engineers running macOS, some Ubuntu, all of them slowly. The solution was radical: no one gets a local dev environment anymore.

3x engineering org growth · Mac→V1→V2 (two migrations) · Began 2020, V2 done 2023

Netflix Distributed Systems

Netflix Hit the AWS Instance Ceiling and Built a Workflow Engine That Scales Forever

Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs — and running out of machine. Vertically scaling on AWS had a hard ceiling, and the workflows were doubling in size every year. The only way out was a complete architectural rethink.

2M+ jobs/day at peak · Hundreds of thousands of workflows · Meson → Maestro 2020

Slack Reliability

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

On February 22, 2022, Slack went down for many users — including the engineer designated as Incident Commander, who was authoring the postmortem from a position of personal experience. The culprit was a new component that worked exactly as designed.

Feb 22 2022 outage · Consul rollout to 75% of fleet · Cache hit rate collapsed

Stripe Databases

How Stripe Moves Petabytes Between Database Shards Without Stopping the Money

Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.

$1T+ payment volume 2023 · 99.999% uptime achieved · 5M database queries/sec

Slack Distributed Systems

Slack Rewrote Its Core Architecture for Enterprise — Because the Old One Was a Lie

Slack was built for teams in single workspaces. Enterprise customers were using it across dozens of workspaces simultaneously — and the architecture had never been designed for that. Every major enterprise feature was a workaround on top of a foundation that assumed one workspace per person. Slack spent two years rebuilding the foundation.

2 years development time · Workspace-centric → org-wide · Thousands of APIs refactored

Slack Reliability

Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. Eighteen months later, customer impact hours were down 90%.

73% incidents from own deploys · 90% reduction in impact hours · Manual → automatic rollbacks

Atlassian Reliability

How a Two-Line Script Silently Deleted 883 Customer Cloud Sites

At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.

883 sites deleted · 14 days max outage · 775 customers affected

Netflix Live Streaming

65 Million Streams: How Netflix Rebuilt Its Guts for Live

November 15, 2024: 65 million people log on to watch Mike Tyson fight Jake Paul, the largest live sports stream in history. Behind the scenes, Netflix engineers are white-knuckling a system they built from scratch — one where a single bad video segment, a CDN request storm, or a missed 2-second write deadline means millions of viewers see a black screen.

65M concurrent streams · 113ms → 25ms p50 latency · 200Gbps+ read throughput

Stripe Performance

Stripe Converted 3.7 Million Lines of JavaScript in One Pull Request on a Sunday

On Sunday, March 6, 2022, Stripe merged a single pull request that converted their entire largest JavaScript codebase from Flow to TypeScript. 3.7 million lines of code. Hundreds of engineers arrived Monday morning to start writing TypeScript. The migration had been invisible until it wasn't.

3.7M lines converted in 1 PR · Single Sunday deploy · Largest JS codebase at Stripe

GitHub Reliability

The Test That Broke GitHub: A Failover Drill Goes Live

June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their brand-new second Internet edge facility — six months of infrastructure work designed to eliminate a single point of failure. Within seconds, instead of validating their redundancy, they've created an outage that takes GitHub offline for millions of developers across North America and South America.

32-minute outage · 2-min detect-to-revert · US East + South America

Shopify Databases

Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened

The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding, but they had a Rails monolith that expected a single database, and a commerce platform used by millions daily that could not afford downtime.

KateSQL → Vitess migration · user_id as sharding key · VTGate transparent to app

Shopify Databases

Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second

During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.

19M MySQL QPS at BFCM peak · 58M requests/min app servers · 99.999%+ uptime maintained

Cloudflare Reliability

Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours

On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane — the system that lets customers configure DNS, firewall rules, and every Cloudflare service — went dark. It stayed dark, in various forms, for nearly 40 hours. The postmortem introduced a concept Cloudflare hadn't had before: Code Orange.

Nov 2–4 2023 outage · ~40 hours control plane down · Flexential datacenter failure

Cloudflare Reliability

A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic

On November 18, 2025, Cloudflare experienced a global outage that took roughly six hours to resolve fully. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.

Nov 18 2025 outage · 28% HTTP traffic impacted · 6 hours total duration

Netflix Performance

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead — not in crashes or alerts, but in slow step launches that nobody had measured. The fix was a complete engine rewrite that delivered 100x throughput improvement.

100x throughput improvement · 2.5 years before overhead visible · Sub-hourly scheduling trigger

Netflix Performance

Netflix's Containers Were Fighting Their Own CPUs — and Losing

Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'

Mount Mayhem investigation · Modern multi-core CPU topology · NUMA and cache locality

Netflix Reliability

Netflix Streamed Live Sports for Millions — and the Hard Part Wasn't the Video

When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.

Live events at Netflix scale · Sub-second escalation paths · NFL Christmas Day 2024

Figma Databases

Figma's Database Grew 100x in Four Years — Here's How a Small Team Kept It From Toppling

In 2020, Figma ran on a single Postgres instance on AWS's largest available machine. Four years later, that database had grown nearly 100x. Some tables had swelled to several terabytes and billions of rows. The Postgres vacuum process — the background job that keeps Postgres alive — was causing reliability incidents. They had months of runway left before hitting the IOPS ceiling. A small databases team had nine months to fix it.

100x DB growth since 2020 · Single instance → horizontal shards · 9-month migration

Datadog Reliability

Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.

24h+ global outage · $5M revenue loss · 50–60% Kubernetes nodes lost

OpenAI Databases

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.

800M users, 1 primary PG instance · ~50 read replicas globally · Millions of QPS, p99 <20ms

Uber Security

Uber Had 150,000 Secrets Scattered Across 25 Vaults — So They Built One Platform to Rule Them

150,000 secrets. 25 separate vaults. Hundreds of teams managing their own credentials in their own ways, some in plain text in version control. At Uber's scale — 5,000 microservices, 5,000 databases, 500,000 analytical jobs per day — secrets sprawl is not a compliance problem. It is an incident waiting to happen. A team of ten engineers decided to fix it.

150,000 secrets managed · 25 vaults → 6 managed vaults · 5,000 microservices secured

Shopify Reliability

The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between 'impressive demo' and 'product I'd trust with my customers' — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.

LLM judge: 0.02 → 0.61 Kappa · 300-example hand-crafted benchmark · Production mirroring closes gap in 2 weeks

Netflix · Performance

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

The Story

Two and a half years after Netflix's Maestro workflow orchestrator replaced Meson, it had achieved its design goals: horizontal scalability, support for hundreds of thousands of workflows, reliable execution of millions of jobs per day. By 2024, however, Netflix's business had changed in ways that revealed new performance requirements. Live programming, Ads, and Games drove use cases with sub-hourly scheduling needs — ad targeting pipelines that needed to run every 15 minutes, live event data processing that needed to execute within seconds of an event, low-latency ad hoc queries. These workloads exposed overhead in Maestro's step execution path that had been invisible during daily and hourly ETL workflows. The orchestrator wasn't broken — but it was noticeably slower than it needed to be for a new class of latency-sensitive use cases.

The overhead that sub-hourly workloads exposed wasn't measured in seconds of latency — it was measured in fractions of seconds of step launch time that added up across thousands of daily executions. For hourly ETL pipelines, a 200ms step launch overhead is irrelevant. For 15-minute ad targeting workflows with hundreds of steps, that overhead becomes a material fraction of the entire scheduling budget.
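The arithmetic behind that claim can be sketched in a few lines. Only the 200ms overhead and the 15-minute cadence come from the story above. The step count of 300 and the assumption that steps launch one after another are hypothetical, chosen to make the comparison concrete.

```java
// Illustrative arithmetic only. The 200 ms overhead and the 15-minute cadence
// come from the article; the step count and sequential launching are assumptions.
final class LaunchOverheadBudget {

    /** Fraction of a scheduling interval spent on step launch overhead, assuming sequential launches. */
    static double overheadFraction(int steps, long overheadMillisPerStep, long intervalMillis) {
        long totalOverheadMillis = (long) steps * overheadMillisPerStep;
        return (double) totalOverheadMillis / intervalMillis;
    }

    public static void main(String[] args) {
        int steps = 300;                        // assumed stand-in for "hundreds of steps"
        long overheadMillis = 200;              // per-step launch overhead from the article
        long fifteenMinutes = 15 * 60 * 1000L;
        long oneHour = 60 * 60 * 1000L;
        long oneDay = 24 * oneHour;

        System.out.printf("15-minute cadence: %.2f%% of the interval%n",
                100 * overheadFraction(steps, overheadMillis, fifteenMinutes));
        System.out.printf("hourly cadence:    %.2f%% of the interval%n",
                100 * overheadFraction(steps, overheadMillis, oneHour));
        System.out.printf("daily cadence:     %.2f%% of the interval%n",
                100 * overheadFraction(steps, overheadMillis, oneDay));
    }
}
```

Under these assumed inputs the same 60 seconds of accumulated launch overhead is about 6.7% of a 15-minute interval, about 1.7% of an hourly one, and about 0.07% of a daily one. The overhead itself is identical in all three cases; only the budget it is measured against shrinks.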

The Maestro engineering team investigated the overhead and traced it to the flow engine — the component responsible for managing state transitions between workflow steps. The original flow engine had been built on top of Netflix Conductor, an open-source workflow orchestration system that provided a full feature set of state management capabilities. Maestro used only a subset of Conductor's features — lightweight state transitions — but paid the overhead of Conductor's full implementation. This overhead was acceptable at 1-million-task-per-day scale with daily scheduling. It was unacceptable for the sub-hourly, low-latency workloads that Netflix's evolving product portfolio demanded.

THE INVISIBLE OVERHEAD

The flow engine overhead didn't cause errors or trigger alerts. Workflows completed. SLOs were met. But the step launch time was higher than it needed to be, and for sub-hourly workloads, 'higher than needed' became 'unacceptably slow.' This is a class of performance issue that only becomes visible when new use cases push the system closer to its boundaries — the boundary had always been there, but daily ETL workloads never reached it.

Problem

Sub-Hourly Workloads Expose Step Launch Overhead

Netflix's expansion into Live, Ads, and Games drove scheduling requirements as short as 15 minutes. Sub-hourly workflows executing hundreds of steps were sensitive to per-step launch overhead that was invisible on daily ETL pipelines. The Maestro flow engine's overhead, acceptable at hourly+ scheduling, became a bottleneck for the new use case class.

Cause

Flow Engine Built on Conductor's Full Feature Set

Maestro's flow engine used Netflix Conductor for state management, but only needed lightweight state transitions — not Conductor's full feature set. The team also considered Temporal (optimized for inter-process orchestration via external service calls) but concluded that coupling the DAG engine to an external service introduced unnecessary reliability risk at 1M+ daily tasks.

Solution

Purpose-Built State Machine: Keep DAG, Rewrite Flow Engine

The team kept the DAG engine (workflow definition and dependency management) and rewrote only the flow engine (state transitions). The new flow engine was purpose-built for Maestro's specific requirements: lightweight state transitions at very high frequency, without the overhead of a general-purpose state management framework.

Result

100x Throughput Improvement

The rewritten flow engine delivered 100x throughput improvement, enabling the sub-hourly and low-latency use cases that Netflix's Live, Ads, and Games products required. The improvement opened new possibilities for workflow orchestration at Netflix that hadn't been feasible on the original engine.

We felt it was an unnecessary source of risk to couple the DAG engine execution with an external service call. If our requirements went beyond lightweight state transition management we might reconsider because Temporal is a very robust control plane orchestration system, but for our needs it introduced complexity and potential reliability weak spots when there was no direct need for the advanced feature set that it offered.

— Netflix Engineering — via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'

Why Not Temporal?

Temporal is a popular workflow orchestration framework that handles complex, long-running workflows with strong durability guarantees. The Netflix team evaluated it seriously but concluded it was optimized for a different use case: inter-process orchestration via external service calls. Maestro operates at 1M+ daily tasks; coupling the DAG execution engine to an external Temporal service call for each state transition would add network latency and a reliability dependency to the most critical path in the system. For Maestro's needs — lightweight, in-process state transitions at very high frequency — Temporal was over-engineered and over-coupled.

The architectural decision to keep the DAG engine while rewriting only the flow engine reflects a key engineering principle: surgical rewrites are better than complete rewrites when you can precisely identify the component causing the problem. The DAG engine — the code that parses workflow definitions, evaluates dependencies, and determines which steps are ready to execute — was not the source of the overhead. Replacing it alongside the flow engine would have added scope, risk, and development time without addressing the actual bottleneck. The team's ability to identify precisely where the overhead lived was the prerequisite for a scoped, successful rewrite.
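To make the DAG engine's responsibility concrete, here is a minimal sketch of ready-step evaluation. It is not Maestro's implementation: the class name, the method signature, and the map-of-dependencies representation are all hypothetical, and the step names in the demo are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of dependency evaluation; not taken from Maestro's codebase.
final class ReadyStepEvaluator {

    /**
     * Returns the steps that are neither completed nor running and whose
     * dependencies have all completed. The map goes from a step id to the
     * ids of the steps it depends on.
     */
    static List<String> readySteps(Map<String, Set<String>> dependencies,
                                   Set<String> completed,
                                   Set<String> running) {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : dependencies.entrySet()) {
            String step = entry.getKey();
            if (completed.contains(step) || running.contains(step)) {
                continue; // already finished or in flight
            }
            if (completed.containsAll(entry.getValue())) {
                ready.add(step);
            }
        }
        Collections.sort(ready); // deterministic order for the caller
        return ready;
    }

    public static void main(String[] args) {
        // Invented workflow: transform and report both wait on extract.
        Map<String, Set<String>> deps = Map.of(
                "extract", Set.of(),
                "transform", Set.of("extract"),
                "load", Set.of("transform"),
                "report", Set.of("extract"));

        System.out.println(readySteps(deps, Set.of(), Set.of()));          // [extract]
        System.out.println(readySteps(deps, Set.of("extract"), Set.of())); // [report, transform]
    }
}
```

Nothing in this logic involves how a state change is recorded or dispatched, which is the sense in which the dependency evaluation and the flow engine can be separated.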

New Use Cases Unlocked

The 100x throughput improvement wasn't just a quantitative improvement in existing workflows — it unlocked qualitatively new use cases. Ad targeting pipelines that previously ran hourly can now run on 15-minute cycles, providing fresher signals. Live event data processing can now run within seconds of event completion rather than waiting for the next hourly window. The performance improvement changed what Netflix could build, not just how fast they could run existing things.

The 2.5-Year Latency to Visibility

Maestro had operated successfully for two and a half years before the sub-hourly workloads revealed the flow engine overhead. This timeline is instructive: performance bottlenecks are often invisible until a new use case pushes the system closer to its limits. Daily ETL pipelines completing in hours have no reason to notice a 200ms step launch overhead. 15-minute ad targeting pipelines immediately feel it. Building systems with performance observability from the start allows bottlenecks to be found proactively rather than reactively.

LIVE, ADS, GAMES: THE PRODUCT DRIVERS

Netflix's expansion into live events (sports, comedy specials, live programming), advertising (a new revenue stream launched 2022), and games (mobile and cloud gaming) created data pipeline requirements that hadn't existed in Netflix's purely subscription VOD model. Advertising requires near-real-time data to be effective: ad targeting signals from viewer behavior need to be processed and applied within minutes, not hours. Live events generate immediate engagement data that needs to flow through analytics pipelines before the event ends. These new product lines were the forcing function for Maestro's performance improvement.

We built the new flow engine from first principles specifically for Maestro's requirements — lightweight state transitions at very high frequency, without coupling the DAG execution engine to an external service call on every state change.

— Netflix Engineering — via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'

The Fix

The Flow Engine Rewrite

The new flow engine was designed from first principles for Maestro's specific requirements. Rather than building on Conductor's general-purpose state management or Temporal's inter-process orchestration, the team implemented a purpose-built state machine that handled exactly the transitions Maestro needed: step-ready → running → completed/failed, with retry and timeout logic, at extremely high frequency without external service dependencies. The design was minimal by intention: every abstraction layer that wasn't serving Maestro's use case was eliminated.
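The transitions named above (step-ready, running, completed or failed, plus retry and timeout) fit in a small in-process state machine. The sketch below is an illustration under assumptions, not the engine Netflix built: the class and method names, the attempt-counting retry policy, and the caller-supplied clock are all hypothetical.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and retry policy are assumptions, not Maestro's code.
final class StepStateMachine {
    enum State { READY, RUNNING, COMPLETED, FAILED }

    // Legal transitions: READY -> RUNNING -> COMPLETED | FAILED, and FAILED -> READY for a retry.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.READY, EnumSet.of(State.RUNNING));
        ALLOWED.put(State.RUNNING, EnumSet.of(State.COMPLETED, State.FAILED));
        ALLOWED.put(State.FAILED, EnumSet.of(State.READY));
        ALLOWED.put(State.COMPLETED, EnumSet.noneOf(State.class));
    }

    private final int maxAttempts;
    private final long timeoutMillis;
    private State state = State.READY;
    private int attempts = 0;
    private long startedAtMillis = 0;

    StepStateMachine(int maxAttempts, long timeoutMillis) {
        this.maxAttempts = maxAttempts;
        this.timeoutMillis = timeoutMillis;
    }

    State state() { return state; }
    int attempts() { return attempts; }

    private void transition(State next) {
        if (!ALLOWED.get(state).contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }

    void start(long nowMillis) {
        transition(State.RUNNING);
        attempts++;
        startedAtMillis = nowMillis;
    }

    void complete() { transition(State.COMPLETED); }

    /** Marks the attempt failed and re-queues the step if attempts remain. Returns true if a retry was queued. */
    boolean fail() {
        transition(State.FAILED);
        if (attempts < maxAttempts) {
            transition(State.READY);
            return true;
        }
        return false;
    }

    /** Fails a running attempt that has exceeded the timeout. Returns true if it timed out. */
    boolean checkTimeout(long nowMillis) {
        if (state == State.RUNNING && nowMillis - startedAtMillis > timeoutMillis) {
            fail();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        StepStateMachine step = new StepStateMachine(2, 1_000);
        step.start(0);
        step.checkTimeout(1_500);   // first attempt times out, a retry is queued
        step.start(2_000);
        step.complete();
        System.out.println(step.state() + " after " + step.attempts() + " attempts");
        // prints "COMPLETED after 2 attempts"
    }
}
```

Every transition here is a map lookup and a field assignment in the same process, which is the property the rewrite was after. Durability, concurrency control, and dispatch are deliberately left out of the sketch.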

100x

Throughput improvement from the flow engine rewrite — enabling sub-hourly scheduling and low-latency ad hoc queries that were infeasible on the original engine

2.5 years

Time Maestro operated successfully before the sub-hourly use case revealed the flow engine overhead — a reminder that performance requirements change as products evolve

0

External service dependencies in the new flow engine: state transitions happen in-process, eliminating the network latency and reliability coupling of external orchestration services

Kept DAG

Components preserved from the original architecture — the DAG engine was not the bottleneck and was not rewritten, limiting scope and risk

java

// Conceptual: The old flow engine approach vs new flow engine
// Old: Conductor-based state management (full feature set, higher overhead)
// New: Purpose-built lightweight state machine

// OLD APPROACH: Conductor state transitions
// Each step state change requires a round-trip to Conductor's state store
// Conductor evaluates full state management logic for each transition
class OldStepExecutor {
    void onStepComplete(Step step, StepResult result) {
        // Conductor handles state transition — full feature set overhead
        conductor.updateTaskStatus(
            step.taskId,
            result.toTaskResult()  // serialization + network call
        );
        // Conductor evaluates downstream dependencies
        conductor.decide(step.workflowId); // another network call
    }
}

// NEW APPROACH: Purpose-built in-process state machine
// State transitions are in-memory, no external service calls
// Only the transitions Maestro needs, optimized for high frequency
class NewStepExecutor {
    void onStepComplete(Step step, StepResult result) {
        // In-process state update — no network round-trip
        WorkflowState state = stateStore.get(step.workflowId);
        state.markStepComplete(step.id, result);
        
        // Evaluate ready steps locally — no external service dependency
        List<Step> readySteps = state.getReadySteps();
        
        // Dispatch ready steps to execution queue
        readySteps.forEach(this::dispatch);
        
        // Persist state change atomically
        stateStore.save(state);
    }
}

SURGICAL REWRITE: SCOPE IS A VIRTUE

The decision to rewrite only the flow engine — not the DAG engine, not the API layer, not the scheduling system — is what made the 100x improvement possible within a reasonable development timeline. A complete rewrite of Maestro would have taken years and carried enormous risk. A targeted rewrite of the bottleneck component took months and carried bounded risk. The prerequisite was precise understanding of where the overhead lived. Profiling and measurement before architectural decisions is not overhead — it's the work that makes targeted improvements possible.

Open-Source Beneficiaries

The 100x performance improvement was contributed to the open-source Maestro repository. Organizations that adopted Maestro after the original open-sourcing in July 2024 now benefit from an orchestration engine capable of sub-hourly scheduling at million-task-per-day scale. The compound value of open-sourcing battle-tested systems: community users get production-grade improvements as they're developed.

The Netflix Product Evolution That Drove the Fix

Maestro's 100x improvement is a case study in how product evolution creates engineering requirements that didn't exist at system design time. When Maestro was designed in 2020, Netflix's primary workflow use cases were daily ETL pipelines and hourly ML training runs. By 2024–2025, Live, Ads, and Games had created sub-hourly and real-time data requirements. Workflow orchestrators that were designed for daily batch jobs don't automatically handle real-time event-driven workloads — the latency requirements are an order of magnitude different.

Keeping the DAG Engine: The Right Scope Decision

The DAG engine — the component that parses workflow definitions, evaluates dependencies, and determines which steps are ready to run — was not contributing to the flow engine overhead. Rewriting it alongside the flow engine would have added months of development time, introduced new bugs in a working component, and required re-validating all of Maestro's workflow semantics. Scope discipline — rewriting only what needs to be rewritten — is the engineering decision that made 100x improvement achievable in a reasonable timeline.

THE OPEN SOURCE TIMELINE

The 100x improvement was contributed to the open-source Maestro repository following its development. Since Maestro was open-sourced in July 2024, external users who adopted it benefit from a continuously improving orchestration platform — not a snapshot. The value of open-sourcing production systems compounds over time as improvements driven by internal Netflix requirements become available to the broader engineering community.

Architecture

Maestro's architecture after the flow engine rewrite maintains the same three-layer structure: Workflow Engine (DAG state, dependency tracking), Step Runtime Workers (stateless executors), and Signal Service (event-driven triggers). The change is internal to the Workflow Engine layer: the flow engine that manages step state transitions was replaced with a purpose-built implementation. From the outside — from users defining workflows, from the Signal Service publishing events, from the Step Runtime Workers reporting completions — nothing changed. The optimization was architecturally invisible.

Maestro Before: Conductor-Based Flow Engine (Higher Overhead)

Maestro After: Purpose-Built Flow Engine (100x Faster)

flowchart TD
    dag_engine["DAG Engine\n(dependency evaluation)"] -->|"step ready"| new_flow["New Flow Engine\n(purpose-built state machine)"]
    new_flow -->|"in-process state"| in_memory["In-Process State\n(no external service calls)"]
    in_memory -->|"persist atomically"| postgres[("PostgreSQL\n(durable state)")]
    new_flow -->|"dispatch ready steps"| workers["Step Runtime Workers"]
    workers -->|"completion (in process)"| new_flow
    note["No external service round-trips\nfor state transitions: 100x faster"]

PROFILING BEFORE REWRITING

The 100x improvement was possible because the team could precisely identify the flow engine as the overhead source. This required detailed profiling of Maestro's step execution path — measuring where time was spent at each stage of a step state transition. Without this profiling work, a rewrite might have targeted the wrong component and produced minimal improvement. Measurement before optimization is not a platitude — it's the prerequisite for targeted, effective engineering.
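The source does not describe the tooling the team used, so the following is only a sketch of the general idea: accumulate wall-clock time per stage of a step transition and see which stage dominates. The class, its methods, and the stage names in the demo are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical per-stage timer; not the profiling tooling Netflix used.
final class StageProfiler {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the named stage, and returns its result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            record(stage, System.nanoTime() - start);
        }
    }

    /** Adds an externally measured duration to the named stage. */
    void record(String stage, long nanos) {
        nanosByStage.merge(stage, nanos, Long::sum);
    }

    long totalNanos(String stage) {
        return nanosByStage.getOrDefault(stage, 0L);
    }

    /** Stage with the largest accumulated time, or null if nothing was recorded. */
    String slowestStage() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : nanosByStage.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    public static void main(String[] args) {
        StageProfiler profiler = new StageProfiler();
        // Placeholder work for three invented stages of a step transition.
        for (int i = 0; i < 1_000; i++) {
            profiler.time("evaluate-dependencies", () -> Math.sqrt(42.0));
            profiler.time("state-transition", () -> String.valueOf(System.nanoTime()).hashCode());
            profiler.time("dispatch", () -> Integer.toString(7));
        }
        System.out.println("slowest stage: " + profiler.slowestStage());
    }
}
```

A breakdown of this kind is what lets a team say the overhead sits in one component rather than another before committing to a rewrite.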

The 1M+ Task/Day Scale Constraint

The new flow engine had to maintain support for Maestro's existing workload — 1M+ tasks per day, workflows with hundreds of thousands of steps, long-running daily ETL pipelines. The 100x improvement was not achieved by sacrificing existing workload support — it was achieved by removing overhead that wasn't serving existing workloads either. The new engine is faster at all scales, not just at sub-hourly scales. The improvement was architectural, not a tradeoff.

Performance Impact at Maestro's Scale

The 100x throughput improvement at Maestro's operating scale (1M+ tasks per day) translates to significant concrete capacity. The flow engine can now process step state transitions at roughly 100 times its previous rate on the same infrastructure, enabling Netflix to run sub-hourly workflows alongside existing daily ETL pipelines without requiring additional worker capacity. For a system already handling hundreds of thousands of workflows, the improvement effectively eliminates step-launch as a scaling bottleneck for the foreseeable future.

Lessons

The Maestro 100x story is about the intersection of product evolution, performance measurement, and surgical engineering. The lessons apply to any long-running production system that needs to serve new use cases it wasn't designed for.

Measure before you rewrite. The Maestro team knew exactly which component to rewrite because they had profiled the execution path and located the overhead precisely. A rewrite without measurement is a guess. A rewrite with measurement is a targeted intervention. The profiling work is not overhead — it's the work that makes targeted improvements possible.

Surgical rewrites have lower risk and faster delivery than complete rewrites. The flow engine was replaced; the DAG engine was kept. This scoping decision is why the improvement was achievable in months rather than years.

Performance requirements change as products evolve. Maestro was correctly designed for daily ETL workloads in 2020. Netflix's expansion into Live, Ads, and Games in 2024–2025 created sub-hourly requirements that didn't exist at design time. Build systems that are measurable and targetable for performance improvement as requirements evolve.

General-purpose frameworks have overhead that purpose-built implementations don't. Use general-purpose frameworks when their full feature set is needed; build purpose-built when it isn't. Conductor was the right choice when Maestro was designed — it provided reliable state management quickly. The rewrite was right when the overhead became the bottleneck — the team had the data to make that call.

Architectural improvements that remove external dependencies improve both performance and reliability simultaneously. The new flow engine is faster because it has no external service round-trips. It's also more reliable because it has fewer failure modes — no external service to go down, no network partition to handle in the hot path.

PERFORMANCE OBSERVABILITY AS DESIGN REQUIREMENT

The Maestro overhead existed for 2.5 years before it became visible. If per-step launch latency had been a tracked metric from day one, the overhead would have been visible from the beginning — even if it hadn't mattered yet. Building systems with detailed performance instrumentation from the start means bottlenecks are discovered via monitoring rather than via new use cases hitting walls. Performance observability is a first-class design requirement, not an afterthought.

The Temporal Consideration

The Netflix team explicitly evaluated Temporal before deciding to build a custom flow engine. Their conclusion: Temporal's value proposition is in managing long-running, durably-persisted workflows with complex retry and compensation logic — a use case that requires coupling the execution engine to an external orchestration service. Maestro's lightweight state transition needs don't justify that coupling. Choosing not to adopt a popular framework when its overhead exceeds its benefit is an engineering decision, not a gap.

Netflix's workflow orchestrator ran 2.5 years without anyone noticing a 100x performance improvement was available, which is either a compliment to how well Maestro worked or a reminder that daily ETL jobs don't complain about latency.

TechLogStack: built at scale, broken in public, rebuilt by engineers