Two and a half years after Netflix's Maestro workflow orchestrator replaced Meson, it had achieved its design goals: horizontal scalability, support for hundreds of thousands of workflows, reliable execution of millions of jobs per day. By 2024, however, Netflix's business had changed in ways that revealed new performance requirements. Live programming, Ads, and Games drove use cases with sub-hourly scheduling needs — ad targeting pipelines that needed to run every 15 minutes, live event data processing that needed to execute within seconds of an event, low-latency ad hoc queries. These workloads exposed overhead in Maestro's step execution path that had been invisible during daily and hourly ETL workflows. The orchestrator wasn't broken — but it was noticeably slower than it needed to be for a new class of latency-sensitive use cases.
The overhead that sub-hourly workloads exposed was measured not in seconds of latency but in fractions of a second of step launch time, which added up across thousands of daily executions. For hourly ETL pipelines, a 200 ms step launch overhead is irrelevant. For 15-minute ad targeting workflows with hundreds of steps, that overhead becomes a material fraction of the entire scheduling budget.
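To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The step count (300) and the assumption that steps launch strictly one after another are illustrative choices, not figures from Netflix; only the 200 ms overhead and the 15-minute interval come from the text above.

```python
def launch_overhead_fraction(steps, overhead_ms, schedule_interval_s):
    """Fraction of the scheduling interval spent on step launch overhead.

    Assumes steps launch strictly sequentially, which is a worst case;
    parallel branches in a real DAG would reduce the wall-clock impact.
    """
    total_overhead_s = steps * overhead_ms / 1000
    return total_overhead_s / schedule_interval_s


# Hypothetical workflow: 300 sequential steps, 200 ms launch overhead each.
hourly = launch_overhead_fraction(300, 200, 3600)
quarter_hourly = launch_overhead_fraction(300, 200, 900)

print(f"hourly: {hourly:.1%}, 15-minute: {quarter_hourly:.1%}")
# prints "hourly: 1.7%, 15-minute: 6.7%"
```

Under these assumptions the same 60 seconds of accumulated overhead is four times as large a share of a 15-minute window as of an hourly one, before any actual work has been done.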
The Maestro engineering team investigated the overhead and traced it to the flow engine — the component responsible for managing state transitions between workflow steps. The original flow engine had been built on top of Netflix Conductor, an open-source workflow orchestration system that provided a full feature set of state management capabilities. Maestro used only a subset of Conductor's features — lightweight state transitions — but paid the overhead of Conductor's full implementation. This overhead was acceptable at 1-million-task-per-day scale with daily scheduling. It was unacceptable for the sub-hourly, low-latency workloads that Netflix's evolving product portfolio demanded.
THE INVISIBLE OVERHEAD
The flow engine overhead didn't cause errors or trigger alerts. Workflows completed. SLOs were met. But the step launch time was higher than it needed to be, and for sub-hourly workloads, 'higher than needed' became 'unacceptably slow.' This is a class of performance issue that only becomes visible when new use cases push the system closer to its boundaries: the boundary had always been there, but daily ETL workloads never reached it.
Problem
Sub-Hourly Workloads Expose Step Launch Overhead
Netflix's expansion into Live, Ads, and Games drove scheduling intervals as short as 15 minutes. Sub-hourly workflows executing hundreds of steps were sensitive to per-step launch overhead that was invisible on daily ETL pipelines. The Maestro flow engine's overhead, acceptable when workflows ran hourly or less often, became a bottleneck for the new class of use cases.
Cause
Flow Engine Built on Conductor's Full Feature Set
Maestro's flow engine used Netflix Conductor for state management, but only needed lightweight state transitions — not Conductor's full feature set. The team also considered Temporal (optimized for inter-process orchestration via external service calls) but concluded that coupling the DAG engine to an external service introduced unnecessary reliability risk at 1M+ daily tasks.
Solution
Purpose-Built State Machine: Keep DAG, Rewrite Flow Engine
The team kept the DAG engine (workflow definition and dependency management) and rewrote only the flow engine (state transitions). The new flow engine was purpose-built for Maestro's specific requirements: lightweight state transitions at very high frequency, without the overhead of a general-purpose state management framework.
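As an illustration of what "lightweight state transitions" can look like, here is a minimal sketch of a table-driven state machine. It is not Maestro's implementation: the state names, the `TRANSITIONS` table, and the `FlowEngine` class are all hypothetical and chosen only to show the shape of the idea.

```python
from enum import Enum, auto


class StepState(Enum):
    CREATED = auto()
    RUNNING = auto()
    SUCCEEDED = auto()
    FAILED = auto()


# Allowed transitions (hypothetical; not Maestro's actual state model).
TRANSITIONS = {
    StepState.CREATED: {StepState.RUNNING},
    StepState.RUNNING: {StepState.SUCCEEDED, StepState.FAILED},
    StepState.FAILED: {StepState.RUNNING},  # a retry re-enters RUNNING
    StepState.SUCCEEDED: set(),             # terminal state
}


class FlowEngine:
    """Tracks per-step state in memory and validates each transition."""

    def __init__(self):
        self._states = {}

    def register(self, step_id):
        self._states[step_id] = StepState.CREATED

    def state(self, step_id):
        return self._states[step_id]

    def transition(self, step_id, new_state):
        current = self._states[step_id]
        if new_state not in TRANSITIONS[current]:
            raise ValueError(
                f"illegal transition for {step_id}: "
                f"{current.name} -> {new_state.name}"
            )
        self._states[step_id] = new_state
        return new_state


engine = FlowEngine()
engine.register("extract")
engine.transition("extract", StepState.RUNNING)
engine.transition("extract", StepState.SUCCEEDED)
```

Each transition here is a dictionary lookup and a set membership check. A production engine would also need persistence, concurrency control, and failure recovery, none of which this sketch attempts.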
Result
100x Throughput Improvement
The rewritten flow engine delivered a 100x throughput improvement, enabling the sub-hourly and low-latency use cases that Netflix's Live, Ads, and Games products required. The improvement opened new possibilities for workflow orchestration at Netflix that hadn't been feasible on the original engine.
The architectural decision to keep the DAG engine while rewriting only the flow engine reflects a key engineering principle: surgical rewrites are better than complete rewrites when you can precisely identify the component causing the problem. The DAG engine — the code that parses workflow definitions, evaluates dependencies, and determines which steps are ready to execute — was not the source of the overhead. Replacing it alongside the flow engine would have added scope, risk, and development time without addressing the actual bottleneck. The team's ability to identify precisely where the overhead lived was the prerequisite for a scoped, successful rewrite.
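For contrast with the flow engine, the DAG engine's job of deciding which steps are ready to execute can be sketched as a simple dependency check. The `ready_steps` function and the example workflow below are hypothetical and only illustrate the division of responsibilities, not Maestro's actual code.

```python
def ready_steps(dependencies, completed, running=frozenset()):
    """Return steps whose dependencies have all completed.

    dependencies maps each step name to the list of steps it depends on.
    Steps that are already completed or currently running are excluded.
    """
    done = set(completed)
    return sorted(
        step
        for step, deps in dependencies.items()
        if step not in done and step not in running and set(deps) <= done
    )


# Hypothetical four-step workflow.
deps = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["extract"],
}

first_wave = ready_steps(deps, completed=set())
second_wave = ready_steps(deps, completed={"extract"})
print(first_wave, second_wave)
# prints "['extract'] ['report', 'transform']"
```

Nothing in this dependency logic involves recording or validating state transitions, which is why, in a design split along these lines, one component can be replaced without touching the other.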
LIVE, ADS, GAMES: THE PRODUCT DRIVERS
Netflix's expansion into live events (sports, comedy specials, live programming), advertising (a new revenue stream launched in 2022), and games (mobile and cloud gaming) created data pipeline requirements that hadn't existed under Netflix's subscription-only VOD model.
Advertising requires near-real-time data to be effective: ad targeting signals from viewer behavior need to be processed and applied within minutes, not hours. Live events generate immediate engagement data that needs to flow through analytics pipelines before the event ends. These new product lines were the forcing function for Maestro's performance improvement.