A logistics scale-up went live with five agents on day one. A coordinator at the top, a research agent that pulled carrier rates, a drafting agent that wrote quote emails, a validator that checked the quote against contract terms, and a fifth agent that pushed the result into Salesforce. They had wired it together in CrewAI over a fortnight. The architecture diagram occupied a full slide and the founder had shown it to investors.

On day eight the coordinator started looping between the research agent and the drafting agent on a single quote request from a Belgian freight forwarder. The research agent kept handing back partial data because one carrier's API was rate-limiting it. The drafting agent kept rejecting the draft because three required fields were null and bouncing it back. Neither agent surfaced the rate-limit as the root cause. The loop ran for six hours, burned through the daily token budget by 04:00 GMT, tripped a billing alert, and woke an on-call engineer who had no trace tooling, no per-agent eval harness, and no idea which of the five agents had caused the spend. The post-mortem took three working days.

We asked the founder one question afterwards. Had they operated even one of those agents in production for ninety days before chaining them? They had not. They had a prototype that worked impressively on a Tuesday afternoon and shipped four agents that needed to coordinate before they had watched a single one of them survive a real workload.

We told them what we tell almost every team in this position. Ship one agent first. Operate it for sixty to ninety days. Then come back and talk about coordinators.

What multi-agent means in practice

Multi-agent, in the sense most vendors use it in 2026, means multiple specialised agents that pass work between each other, usually with a coordinator or router agent sitting in the middle deciding who handles what. The coordinator reads the input, picks a sub-agent, hands off context, receives the result, decides whether to escalate to another sub-agent, and eventually returns an answer. Frameworks such as LangGraph, CrewAI, AutoGen, and the custom orchestrators some teams write in-house all exist to make this kind of topology easier to express in code. We name them as examples of the category, not endorsements.

This is genuinely useful for problems that decompose naturally. A research task that branches into five sub-questions, each requiring a different corpus, is a real fit. A workflow where one component searches the web, another writes a draft, and a third fact-checks the draft against an internal knowledge base can be a real fit. The pattern earns its complexity in the right place.

It is overkill for almost every operational workflow a B2B team wants to automate first. Most first builds look like this. A claims triage step at an insurer's first-notice-of-loss desk. A vendor onboarding check. A customer-message classifier. A draft-and-review loop for one document type. These are single-agent problems wearing multi-agent costumes because multi-agent diagrams photograph better in a board deck.

Reason 1: trust calibration is observational

You do not learn how much to trust an agent from a benchmark. You learn it from watching one handle a thousand real cases. Benchmarks tell you the agent can perform on a curated set of examples a researcher hand-picked. Production tells you what it does on a Tuesday when the input is malformed, the upstream system is slow, and the user has copy-pasted something from an email client that mangled the encoding.

Single-agent calibration follows a recognisable sixty-day arc. Days one to fourteen, an operator reads every output before it leaves the system and logs every disagreement. Days fifteen to thirty, the operator samples roughly twenty per cent of outputs, weighted by the agent's confidence score so that low-confidence outputs get checked more often. Days thirty-one to forty-five, the team moves to spot-checking only flagged escalations and the weekly drift report. Days forty-six to sixty, the team runs a monthly review against a held-out set of historical cases and a small adversarial suite the operators have built from the surprises of the previous six weeks. By day sixty you have a defensible position on what the agent does well, what it does badly, and what category of input still needs a human.

A single agent gives you exactly one thing to observe. One set of inputs. One sequence of decisions. One log of escalations. One drift curve over weeks. When something goes wrong you can read the trace top to bottom and form a hypothesis in an hour. The four-question filter we use before any build exists in part to make sure the workflow you pick is one you can observe this way.

Put three agents in a chain and that observational discipline collapses. When the final output is wrong you cannot tell which agent caused it without specialised tracing tooling, structured handoff schemas, and per-agent evaluation pipelines that you do not yet have. You will spend the first three months building the debugging stack you should have built before you needed it, while the agents keep producing output you cannot fully audit. Teams in that position usually start patching prompts in panic and learn nothing durable about any of the agents.

Reason 2: the operational surface area is multiplicative

A single agent has one set of operational artefacts. One prompt-version history. One evaluation harness. One escalation policy. One audit-log shape. One on-call rotation. One drift monitor. One vendor relationship for the underlying model. One incident-response runbook. One set of red-team prompts.

Three agents have at least three of each. Often more, because the coordinator itself is a fourth agent with its own version of everything. Then there are the inter-agent contracts, the handoff schemas, the shared memory store, and the routing logic that all need their own versioning, monitoring, and rollback strategy. Most teams underestimate this expansion by a factor of five.

This matters because the operational layer is where AI projects live or die. We have written separately about escalation policies that survive an audit, and the short version is that maintaining one good escalation policy is hard. Maintaining four that have to agree with each other, in writing, in a way a regulator can read, is a different category of hard. A team that has never written one is not in a position to write four.

Reason 3: multi-agent failure modes are weird

Single-agent failures are mostly familiar. The model hallucinates. The retrieval is stale. The tool call times out. The prompt is too long. These are well-understood failure modes with well-understood mitigations, and your team can build intuition for them within a few weeks of operation.

Multi-agent failure modes are stranger. The first is compounding hallucination. Agent A produces a confident, plausible-sounding mistake. Agent B receives that mistake as input and treats it as fact, because nothing in the handoff schema flags it as uncertain. By the time the coordinator reads the final output the mistake has been laundered through three reasoning steps and looks like a conclusion. The output passes every shallow check you have.

The second is loop-and-lock. The coordinator cannot decide which sub-agent to route to and starts ping-ponging. Agent A defers to Agent B because the task looks like research. Agent B defers back to Agent A because the research is structurally a draft. The system burns through its token budget producing nothing. You discover the loop two days later when a finance dashboard flags an unusual spend pattern, which is exactly how the logistics team in the opening of this piece found theirs.

The third is silent capability drift. You update one sub-agent's prompt to fix a small issue. The update changes the format of the agent's tool calls in a subtle way. The coordinator, which was matching on the old format, silently stops routing to that agent. The fleet keeps running. Outputs degrade slowly because one of the four specialists has effectively been removed from the system, and you do not notice for three weeks because no metric you have was built to catch this specific class of regression.

None of these failure modes are obvious until you have spent six months debugging them. They are not in the vendor demo. They are the kind of thing you learn the way you learn that a particular database has a quiet locking pathology, which is to say, painfully and at three in the morning.

Reaching for a multi-agent framework before you have operated a single agent in production is the architecture equivalent of installing Kubernetes for a workload that fits on one VM.

When multi-agent IS the right answer

Not never. Multi-agent is a legitimate architecture for the right problem and the right team. We recommend it when three conditions all hold.

Most teams on their first build do not have all three. They usually have the first one in theory and neither of the others in practice. The teams that do have all three are almost always teams that have already shipped two or three single agents and have arrived at multi-agent through experience rather than ambition.

What we recommend instead

Pick the single most painful operational workflow your team owns. Not the most interesting one. Not the one with the best demo potential. The one where a person currently spends time doing something that is structurally repetitive, has clear inputs and outputs, and has a tolerable cost of being wrong on a single instance.

Build a single agent for that workflow. Give it one model, one prompt, one tool surface, one escalation path to a human. Ship it in a contained scope. Run it for sixty to ninety days. Measure the things you can measure and write down the things you cannot. Watch how it drifts. Watch what kind of inputs break it. Watch what your operators do when they take over a case.

Use those ninety days to build your operational vocabulary. What does drift look like for this team and this workflow? What is the right escalation cadence? What does an incident feel like when an agent rather than a service is the proximate cause? What does your evaluation harness need to look like to catch the regressions that matter? These answers are not generic. They are specific to your team, and you cannot learn them without operating something.

Only then start adding agents to adjacent workflows. Often the second and third agents do not need to coordinate. They just run alongside the first one, each owning their own workflow, sharing nothing but the operational practices you have now built. That is a fleet too, and for most teams it is the fleet they needed all along. The coordinator is something you reach for when the work genuinely demands it, not when the architecture diagram does.

A note on framework choice

The same logic applies to multi-agent frameworks. LangGraph, CrewAI, AutoGen, and the custom orchestrators teams sometimes build in-house are not bad tools. They are advanced tools. They solve real problems for teams that have those problems. Adopting one of them before you have operated a single agent in production is the architecture equivalent of installing Kubernetes for a workload that fits on one VM. The tooling is excellent. It is solving a problem you do not yet have, in exchange for an operational burden you are not yet equipped to carry.

If your first build genuinely needs structured handoffs, persistent multi-step state, and a router, then yes, reach for the framework. Most first builds do not. They need one agent, one prompt-version history, one evaluation pass, one escalation policy, and one team that knows how to read the logs.

Earning multi-agent: the signal in your audit log

There is a specific signal that tells you a team has earned the right to add a second agent and a coordinator. It is not a hunch and it is not the day a vendor finishes a slide deck. It lives in the audit log, and it looks like this. Thirty consecutive operating days where every escalation the agent raised was either correctly flagged or correctly not-flagged when reviewed by an operator, where operator overrides on non-escalated cases sat under three per cent, and where the post-hoc weekly review surfaced zero silent failures the agent had let through. If you can show that pattern, and your drift monitor is flat across the same window, you have the operational muscle to take on a second agent without dropping the first.

The team that hits that standard for ninety days has earned the right to consider multi-agent. They have an evaluation harness, an escalation policy, a drift monitor, a runbook, and an operator who has been paged at least once and survived it. They know what a quiet failure mode looks like in their environment. When they add a second agent they know what to instrument from day one. When they add a third and a coordinator they have the muscle to maintain four sets of everything without dropping the first.

The team that has not is buying complexity on credit. The bill arrives in month four, usually in the form of an incident that nobody on the team is positioned to debug. Almost every multi-agent rollout we have been asked to rescue was a first build. Almost every single-agent rollout we have been asked to extend was already running, already trusted, and already producing the kind of operational knowledge that earns the next agent the right to exist.