A demo agent that handles ten cases in a conference room is not the same artefact as a production agent that handles ten thousand cases over a quarter. The production version has to survive the seventeen records with a NULL field, the three whose upstream API returned a 200 for a failed write, the customer whose name has an apostrophe that breaks a downstream SQL query the integration team never tested, and the case that requires the operator to override at 02:14 on a Tuesday because the on-call rota has a name on it. Most of the engineering work is not in the model call. It is in the eleven stages below: workflow selection, architecture, integration plan, prompt and tool design, eval harness, shadow run, escalation policy, audit log, cut-over, operate, extend. Stages four through nine are where production diverges from demo, and the teams that skip those stages are the teams whose agents quietly stop being trusted three months in.

Stage 1: Workflow selection

Pick the wrong workflow and nothing else in this pillar saves you. We have watched well-designed agents fail on the wrong workflow and lazy agents succeed on the right one. Selection is the single most consequential decision in the lifecycle.

A workflow is implementable when five conditions hold. The input is structured enough that a person could specify the rules in a paragraph. The shape repeats: the same trigger, the same decision, the same downstream action, hundreds or thousands of times a month. The individual judgement per case is low; the work is volume, not artistry. There is a clear override path so a human can take a case back without filing a ticket. And every system the agent touches has an API, a service account, or an integration the agent can actually call. Miss any one of those and the build will stall.

The practical version of this filter is the four-question filter before building any agent. About seventy per cent of attractive-sounding ideas die at question three, which is the do-you-have-integration-access question. That is the result we want. Finding out at the scoping table is cheap; finding out three months into the build is not.

Stage 2: Architecture decision

Two architecture choices matter on the first build: single-agent versus multi-agent, and which model and framework you use. Almost always single-agent on the first build. The full reasoning is in single-agent before multi-agent on the first build, but the short version is that trust calibration is observational, operational surface area multiplies rather than adds, and the genuinely weird multi-agent failure modes only surface after months of production. Earn the right to multi-agent. Do not buy it.

Model choice in 2026 is a two-tier decision, not a one-tier one. The reasoner is a top-end general-purpose LLM (Claude Sonnet 4.7 or Opus 4.7 for the work that justifies the bill, GPT-5-class for OpenAI-aligned shops, Gemini 2.5 Pro for the teams already on Google Cloud). The classifier or router that sits in front of it is a smaller, cheaper model (Haiku, GPT-5-nano, Gemini Flash, or a fine-tuned Llama or Qwen running on the team's own infra). The classifier's job is binary or low-cardinality: is this a refund case or a fraud case, is the confidence above or below the threshold, should this escalate or proceed. The reasoner's job is the actual decision. Treating them as one model is what makes naive builds expensive at scale: every routing call pays the reasoner's per-token rate, and the bill compounds.
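
A minimal sketch of the two-tier split, assuming stub functions in place of real model calls. The names `handle`, `stub_classify` and `stub_reason` are hypothetical, and the per-token prices are placeholders rather than any vendor's actual rates.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical prices in dollars per 1,000 tokens. Substitute real vendor rates.
CLASSIFIER_PRICE = 0.001
REASONER_PRICE = 0.015


@dataclass
class Routed:
    label: str
    decision: Optional[str]  # None when the case never reached the reasoner
    cost: float


def handle(case: str,
           classify: Callable[[str], str],
           reason: Callable[[str, str], str],
           tokens: int = 1000) -> Routed:
    """Route with the cheap model; pay the reasoner's rate only when needed."""
    label = classify(case)
    cost = tokens / 1000 * CLASSIFIER_PRICE
    if label == "escalate":
        return Routed(label, None, cost)
    decision = reason(case, label)
    return Routed(label, decision, cost + tokens / 1000 * REASONER_PRICE)


# Stub models for illustration only; real ones would call an LLM API.
def stub_classify(case: str) -> str:
    return "escalate" if "chargeback" in case else "refund"


def stub_reason(case: str, label: str) -> str:
    return f"approve {label}"
```

With these placeholder prices, a case that stops at the classifier costs about a sixteenth of one that goes through both tiers, which is the gap a naive single-model build pays on every routing call.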

Framework choice is the one most teams over-think. The credible options in 2026 are LangGraph, CrewAI, AutoGen, the new Anthropic Agent SDK, and a hand-rolled state machine. Most production agents we ship use a hand-rolled state machine of a few hundred lines, because frameworks add maintenance overhead the team is not yet ready to absorb. A framework is a bet that you will operate at a complexity the framework was designed for. On a first build, you are not. You are running one agent through a handful of states with a couple of tool calls. Pick the framework when you have earned a reason. Until then, write the orchestration yourself and read it in one sitting.
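
For a sense of scale, a hand-rolled state machine can be as small as the sketch below. The states, the handler names and the `limit` rule are illustrative assumptions, not a description of any particular production agent.

```python
from enum import Enum, auto
from typing import Callable, Dict


class State(Enum):
    FETCH = auto()
    DECIDE = auto()
    ACT = auto()
    ESCALATE = auto()
    DONE = auto()


def fetch(ctx: dict) -> State:
    # A real handler would call the source system here.
    ctx["record"] = {"amount": ctx["amount"]}
    return State.DECIDE


def decide(ctx: dict) -> State:
    # A real handler would call the model; this stub applies a fixed rule.
    within_limit = ctx["record"]["amount"] <= ctx["limit"]
    return State.ACT if within_limit else State.ESCALATE


def act(ctx: dict) -> State:
    ctx["outcome"] = "auto-approved"
    return State.DONE


def escalate(ctx: dict) -> State:
    ctx["outcome"] = "sent to human"
    return State.DONE


HANDLERS: Dict[State, Callable[[dict], State]] = {
    State.FETCH: fetch,
    State.DECIDE: decide,
    State.ACT: act,
    State.ESCALATE: escalate,
}


def run(ctx: dict, max_steps: int = 20) -> dict:
    """Drive the case through the states, recording a trace for the audit log."""
    state = State.FETCH
    ctx["trace"] = []
    steps = 0
    while state is not State.DONE:
        if steps >= max_steps:
            raise RuntimeError("state machine did not terminate")
        ctx["trace"].append(state.name)
        state = HANDLERS[state](ctx)
        steps += 1
    return ctx
```

Every transition is visible in one dictionary and one loop, which is what makes it readable in one sitting.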

Vector stores are the other over-decision. The default assumption is that the agent needs a RAG layer with Pinecone, Weaviate or Qdrant on day one. Most do not. If the agent is reading a known set of structured records from Salesforce, NetSuite or Snowflake, the right answer is to query the source of truth directly: SQL, GraphQL, REST. The data is already governed there, already audited there, already current there. The vector store earns its place when the agent has to reason over a large corpus of unstructured documents (policy archives, lease libraries, prior-case files) and the corpus is too big to fit in the context window. Skip it on the first build unless that condition holds. Add it later if you must. Premature vector indexing is the second-most common form of architecture astronaut behaviour, behind premature multi-agent.

Stage 3: Integration plan

This is the stage where most projects either ship or quietly stall. The integration plan is a one-page document, one row per system the agent will touch, with five columns: authentication mechanism, rate limit, error model, data-quality story, change-management story. Fill the rows in before you write the agent.

Authentication is whether the system gives you a service account, OAuth, SAML, or nothing at all. Rate limit is the number the vendor will admit to in writing and the number you will actually see under load, usually different. Error model is the question of whether the API returns honest error codes or whether it returns a 200 with a hidden success-false payload that your code has to learn to read (Salesforce Bulk API and several Workday endpoints are notable repeat offenders here). Data quality is whether the upstream system contains the field the agent needs, populated correctly, on most of the records. Change management is who tells you when the vendor changes the schema, usually nobody, which means you instrument for it.
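
The error-model column usually turns into a small piece of defensive code. The sketch below assumes a hypothetical JSON payload with `success` and `errors` fields; real vendors name these differently, so treat the field names as placeholders to be replaced from the vendor's documentation.

```python
import json


class UpstreamWriteError(Exception):
    """Raised when a write failed, whatever the HTTP status claimed."""


def check_write(status_code: int, body: str) -> dict:
    """Return the parsed payload only if the write genuinely succeeded."""
    if not 200 <= status_code < 300:
        raise UpstreamWriteError(f"HTTP {status_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise UpstreamWriteError("response body is not JSON") from exc
    if not isinstance(payload, dict):
        raise UpstreamWriteError("response body is not a JSON object")
    # Hypothetical field names: a 200 can still carry a failure in the body.
    if payload.get("success") is False or payload.get("errors"):
        raise UpstreamWriteError(
            f"HTTP {status_code} but payload reports failure: {payload.get('errors')}"
        )
    return payload
```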

The integration plan is also where you say no to scraping. If the only way to reach the system is to drive a browser through it, you are not building an agent. You are building screen-scraping infrastructure that breaks the first Tuesday the vendor changes a button. Push back at scoping. The cost of that pushback is one awkward conversation. The cost of accepting it is six months of fragility.

Stage 4: Prompt and tool design

A production prompt is usually two hundred to five hundred lines, not twenty. It contains the role, the rubric the agent grades cases against, the escalation rules in plain language, the strict output format (usually JSON with a schema enforced server-side, not just requested in the prompt), four to eight few-shot examples drawn from real anonymised cases, two or three negative examples that show what the agent must not do, and the safety constraints that always apply. The prompt is a specification document the agent reads on every call. Treat it like one: version it in git, review changes in pull requests, attach the version number to every audit-log row.
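
Enforcing the output format server-side means the agent's JSON is validated before anything acts on it. The sketch below uses only the standard library so it stays self-contained; in practice a schema library would usually do this work. The three fields and the allowed actions are hypothetical.

```python
import json

ALLOWED_ACTIONS = {"approve", "deny", "escalate"}  # hypothetical action set
EXPECTED_FIELDS = {"action": str, "confidence": (int, float), "rationale": str}


def parse_decision(raw: str) -> dict:
    """Parse and validate the model's output; raise ValueError on any defect."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected or missing fields: {sorted(data)}")
    for key, expected_type in EXPECTED_FIELDS.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {key!r} has the wrong type")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action {data['action']!r}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data
```

A rejected output is a natural candidate for escalation rather than a silent retry, since it means the prompt's contract was broken.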

Prompt versioning matters more than most teams expect. The model changes underneath you: Anthropic ships a new minor version, OpenAI quietly updates the routing on their endpoint, the Gemini default switches from one checkpoint to another. The combination of (prompt version, model version) is what determines a decision, and the only honest way to debug a regression six weeks later is to know which combination produced the row in the log. Pin the model version explicitly in the API call. Tag every commit to the prompt with a semver. Write both to the audit log. The first time somebody asks "why did the agent do this on the fourteenth?" you will not be guessing.

Tools, the agent's API calls into the rest of the world, are designed with the same care you would give a public REST API. Typed schemas on both inputs and outputs. Idempotency keys on anything that writes (use the upstream system's native idempotency primitives where they exist: Stripe's Idempotency-Key header, the OMS's request-id field, a deterministic hash where you have to build one). A retry policy with exponential back-off for transient failures and a hard stop for the ones that should never retry. Structured errors that the agent can actually reason about, not free-text strings that look the same to the model whether the request failed for a network reason or a permission reason. Most the-agent-did-the-wrong-thing incidents are prompt or tool design failures, not model failures. The model did what the prompt told it to do, and what the prompt told it was ambiguous.
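
Three of those properties fit in a few dozen lines. This is a sketch under stated assumptions: `ToolError`, `idempotency_key` and `call_with_retry` are hypothetical names, the hash is the build-your-own fallback for systems with no native idempotency primitive, and the attempt count and delays are illustrative.

```python
import hashlib
import json
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ToolError(Exception):
    """Structured error: the agent can branch on kind and retryable."""

    def __init__(self, kind: str, message: str, retryable: bool):
        super().__init__(message)
        self.kind = kind            # e.g. "network", "permission", "validation"
        self.retryable = retryable


def idempotency_key(tool: str, args: dict) -> str:
    """Deterministic key: the same tool and arguments always hash the same."""
    canonical = json.dumps({"tool": tool, "args": args},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def call_with_retry(fn: Callable[[], T], *, attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with exponential back-off; never retry the rest."""
    for attempt in range(attempts):
        try:
            return fn()
        except ToolError as err:
            is_last = attempt == attempts - 1
            if not err.retryable or is_last:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry policy can be tested without waiting on real delays.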

Stage 5: Eval harness

Before the agent goes anywhere near live data, you build a written eval set. One hundred to five hundred representative cases, with known correct outputs, drawn from a real sample of the workflow's traffic rather than invented in a workshop. The set covers the boring middle, the long-tail edges you know the agent will struggle with, and the cases where the correct answer is escalate rather than act. Build it once, version it in git alongside the prompt, and add to it every time you find a new failure mode in production.

The harness runs the agent against the eval set on every meaningful prompt change. Three metrics are non-negotiable: accuracy against the known-correct output, escalation rate (because an agent that escalates everything is technically accurate and operationally useless), and time per case (because a prompt that triples the per-call latency and cost has to justify itself somewhere). Reject any change that moves any one of those metrics materially in the wrong direction. Most teams skip this stage and learn why three months later, when nobody can debug a regression because nobody has a reference set of cases that used to work.

The toolchain question is less interesting than it looks. Use what you already use for tests: pytest plus a thin wrapper, a Jupyter notebook for ad-hoc exploration, plus a CI step that fails the build if accuracy drops more than two percentage points. Promptfoo, Inspect, OpenAI Evals and the Anthropic eval SDK are all fine; pick one and stop shopping. The eval set itself, the labelled cases drawn from your data, is the asset. The runner is a commodity. Teams that obsess over the framework and underinvest in the labelled set get exactly the result they paid for.
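
The thin wrapper can be very thin. The sketch below computes the three metrics and applies a gate; the two-percentage-point accuracy limit comes from the text above, while the escalation and slowdown limits are illustrative assumptions, as are all the names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    expected: str   # known-correct output from the labelled set
    actual: str     # what the agent produced
    seconds: float  # wall-clock time for the case


@dataclass
class Metrics:
    accuracy: float
    escalation_rate: float
    mean_seconds: float


def summarise(results: List[EvalResult]) -> Metrics:
    n = len(results)
    if n == 0:
        raise ValueError("eval set is empty")
    correct = sum(r.actual == r.expected for r in results)
    escalated = sum(r.actual == "escalate" for r in results)
    total_seconds = sum(r.seconds for r in results)
    return Metrics(correct / n, escalated / n, total_seconds / n)


def gate(baseline: Metrics, candidate: Metrics,
         max_accuracy_drop: float = 0.02,     # two percentage points
         max_escalation_rise: float = 0.05,   # assumption, tune per workflow
         max_slowdown: float = 1.5) -> List[str]:  # assumption
    """Return the reasons to reject the change; an empty list means it passes."""
    failures = []
    if baseline.accuracy - candidate.accuracy > max_accuracy_drop:
        failures.append("accuracy")
    if candidate.escalation_rate - baseline.escalation_rate > max_escalation_rise:
        failures.append("escalation_rate")
    if candidate.mean_seconds > baseline.mean_seconds * max_slowdown:
        failures.append("time_per_case")
    return failures
```

In CI, a test that asserts `gate(baseline, candidate) == []` is enough to fail the build on a regression.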

The harness is also what makes the agent boring to operate. A new model release lands; you run it against the eval set in twenty minutes; you have an evidence-backed answer on whether to switch. Without the harness, every model upgrade becomes a small crisis. With it, it becomes a routine task.

Stage 6: Shadow run

The agent runs against live data for two to four weeks and writes nothing. The operator handles every case in the normal workflow, exactly as they would have without the agent, and after the case is closed the system shows them what the agent would have done. This is the single most important calibration step in the lifecycle, and our methodology is built around it.

Three things happen during a shadow run that cannot happen any other way. Confidence thresholds get calibrated against the agent's actual agreement rate with humans, not its training-time benchmark. Integration edge cases that never appeared in eval surface in production data: the customer record with the apostrophe, the legacy field full of free text where the schema promised an enum, the third-party endpoint that returns 200 for failed requests. And the operating team builds the vocabulary they will need once the agent is live: which kinds of cases the agent handles well, which it gets wrong in instructive ways, what to look at first when something does go wrong. None of that is on the slide deck. All of it is what makes the cut-over undramatic.
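
Threshold calibration from shadow data can start as simple arithmetic. The sketch below assumes each shadow case has been reduced to a pair of (agent confidence, whether the agent agreed with the operator); the function names and the 95% target are hypothetical, and a real calibration would want far more than a handful of cases.

```python
from typing import List, Optional, Tuple

Pair = Tuple[float, bool]  # (agent confidence, agent agreed with the human)


def agreement_at(pairs: List[Pair], threshold: float) -> Optional[Tuple[float, float]]:
    """Agreement rate and coverage among cases at or above the threshold."""
    kept = [agreed for confidence, agreed in pairs if confidence >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept), len(kept) / len(pairs)


def lowest_safe_threshold(pairs: List[Pair], target: float = 0.95) -> Optional[float]:
    """Lowest observed confidence at which agreement meets the target."""
    for threshold in sorted({confidence for confidence, _ in pairs}):
        stats = agreement_at(pairs, threshold)
        if stats is not None and stats[0] >= target:
            return threshold
    return None


# Toy shadow-run sample, far smaller than a real one.
shadow = [(0.55, False), (0.60, True), (0.70, False),
          (0.80, True), (0.90, True), (0.95, True)]
```

On the toy sample, the agent only clears the target from a confidence of 0.80 upward, and at that threshold it would handle half the cases. That trade between agreement and coverage is the number the shadow run exists to produce.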

Stage 7: Escalation policy

Escalation is the contract between the agent and the human. Three patterns cover almost every production workflow: confidence-thresholded handoff for high-volume work where per-case stakes are moderate, exception-typed routing for the small list of case types that always require a human regardless of confidence, and severity-gated cut-out for the catastrophic-tail cases that halt the agent and page on-call. The full design and the auditor-facing reasoning are in escalation policies that survive an audit.
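
The three patterns compose into one routing function. This is a sketch, assuming hypothetical case types and a hypothetical confidence floor; the ordering, with severity checked first, is a design choice so that a high confidence score can never bypass a cut-out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Policy:
    confidence_floor: float      # below this, hand off to a human
    always_human: FrozenSet[str]  # exception-typed routing
    halt_on: FrozenSet[str]       # severity-gated cut-out


def route(case_type: str, confidence: float, policy: Policy) -> Tuple[str, str]:
    """Return (destination, rule that fired) so both land in the audit log."""
    if case_type in policy.halt_on:
        return ("halt_and_page", "severity_cutout")
    if case_type in policy.always_human:
        return ("human", "exception_type")
    if confidence < policy.confidence_floor:
        return ("human", "confidence_threshold")
    return ("agent", "none")


# Hypothetical policy for a refunds workflow.
REFUNDS = Policy(
    confidence_floor=0.85,
    always_human=frozenset({"deceased_customer"}),
    halt_on=frozenset({"suspected_fraud"}),
)
```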

The escalation policy is the architecture; the model is the engine. A 90% accurate agent inside a tight escalation policy is a system the operations lead can defend across a table from a regulator. A 95% accurate agent without one is a finding waiting for a sample. Pick the pattern per workflow. Write the policy down before you cut over. Re-read it monthly.

Stage 8: Audit log and observability

Every decision the agent makes, not just the escalated ones, every decision, gets a row in the audit log. The full field list, in the order you will instrument:

  1. Full inputs the agent saw, with privacy redaction where applicable
  2. Model and prompt version, so a decision from three months ago can be replayed exactly
  3. Confidence score, on the calibration scale the threshold uses
  4. Escalation rules evaluated, with the one that fired marked clearly
  5. Decision and rationale: the agent's structured output plus its free-text reasoning
  6. Downstream actions taken, including which system was written to and the resulting identifier
  7. Human reviewer and their decision, if the case was escalated
  8. Timestamps for each step, including the gap between escalation and human review
  9. Stable case ID that ties the agent row back to the originating customer, claim, or order

Each field exists because an auditor will eventually ask the question it answers. What did the agent know at the time. What was the agent, exactly. What decision logic produced this outcome. What changed in the real world as a result. Who reviewed it. How long did it sit. What is the join key back to the customer record. This is what lets the head of operations answer the auditor in thirty seconds eighteen months later, instead of freezing.
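
The nine fields map directly onto a record type. The sketch below is one possible shape, serialised as a JSON line; the field names, the example values and the model identifier are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    case_id: str                      # 9. stable join key
    inputs: dict                      # 1. redacted inputs the agent saw
    model_version: str                # 2. pinned model identifier
    prompt_version: str               # 2. semver of the prompt
    confidence: float                 # 3. on the threshold's scale
    rules_evaluated: list             # 4. every escalation rule checked
    rule_fired: Optional[str]         # 4. the one that fired, if any
    decision: dict                    # 5. structured output
    rationale: str                    # 5. free-text reasoning
    actions: list                     # 6. systems written to, resulting IDs
    reviewer: Optional[str]           # 7. human reviewer, if escalated
    reviewer_decision: Optional[str]  # 7. what they decided
    timestamps: dict                  # 8. ISO 8601 time per step


def to_log_line(row: AuditRow) -> str:
    """Serialise with sorted keys so rows diff cleanly."""
    return json.dumps(asdict(row), sort_keys=True)


example = AuditRow(
    case_id="order-10482",
    inputs={"order_total": 42.5, "customer_name": "[REDACTED]"},
    model_version="example-model-2026-01-01",  # placeholder identifier
    prompt_version="1.4.2",
    confidence=0.91,
    rules_evaluated=["severity_cutout", "exception_type", "confidence_threshold"],
    rule_fired=None,
    decision={"action": "approve"},
    rationale="Order total is within the auto-refund limit.",
    actions=[{"system": "oms", "id": "RF-7731"}],
    reviewer=None,
    reviewer_decision=None,
    timestamps={"received": "2026-03-14T09:12:03Z", "decided": "2026-03-14T09:12:05Z"},
)
```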

Observability is the live half of the same story. The audit log answers what happened on the fourteenth of March. Observability tells you what is happening right now. Wire structured traces (OpenTelemetry spans for the agent's planning, tool calls, and tool responses) into the same Datadog, Splunk, Honeycomb or New Relic instance the rest of production already uses. Dashboards for accuracy, escalation rate, per-case latency, tool-call error rate, and per-call cost. Alerts on the four numbers that matter: accuracy below floor, escalation rate above ceiling, p95 latency above SLA, error rate above threshold. The team should see drift before the customer does.
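
Whatever the monitoring tool, the four alerts reduce to four comparisons. A minimal sketch, with hypothetical names and limits; in practice these rules would usually live in the monitoring platform's own alert configuration rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limits:
    accuracy_floor: float
    escalation_ceiling: float
    p95_latency_sla_s: float
    error_rate_threshold: float


@dataclass(frozen=True)
class Snapshot:
    accuracy: float
    escalation_rate: float
    p95_latency_s: float
    tool_error_rate: float


def alerts(snapshot: Snapshot, limits: Limits) -> List[str]:
    """Return the names of the alerts that should fire for this snapshot."""
    fired = []
    if snapshot.accuracy < limits.accuracy_floor:
        fired.append("accuracy_below_floor")
    if snapshot.escalation_rate > limits.escalation_ceiling:
        fired.append("escalation_rate_above_ceiling")
    if snapshot.p95_latency_s > limits.p95_latency_sla_s:
        fired.append("p95_latency_above_sla")
    if snapshot.tool_error_rate > limits.error_rate_threshold:
        fired.append("error_rate_above_threshold")
    return fired
```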

The log lives in the same observability stack the rest of the business uses for production systems (Datadog, Splunk, an internal warehouse on BigQuery or Snowflake), not a sidecar database the AI team keeps to itself. Retention matches the regulatory regime: six years for HIPAA documentation, the relevant local period for GDPR-governed personal data, three to six years for SEC broker-dealer records under Rule 17a-4, depending on the record type. Confirm the exact period with the compliance owner. Rows are append-only; changes to historical rows are themselves logged.

Stage 9: Cut-over

Cut-over is the day the agent starts writing to live systems. It is undramatic when the shadow run was thorough and dramatic when it was not. The checklist is short, written, and signed off by named people before anything goes live.

Three rollback patterns ship from day one. Per-decision override, so the operator can reverse any single agent action without filing a ticket. Threshold-rollback, which reverts to fully manual if accuracy or confidence drops below a defined floor. Full kill switch, which halts the agent entirely and routes the queue back to humans. Nobody should have to be a hero to roll back. If rolling back requires a code change or a deploy, you do not have a rollback. You have a hope.
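
The test for all three patterns is that they are runtime state, not code. A minimal sketch, assuming the controls are read from a flag store or config table the operator can change without a deploy; the names and the 0.90 floor are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Controls:
    kill_switch: bool = False       # full kill switch
    accuracy_floor: float = 0.90    # threshold-rollback trigger (assumption)
    reversed_actions: Set[str] = field(default_factory=set)  # per-decision overrides


def operating_mode(controls: Controls, rolling_accuracy: float) -> str:
    """Decide who handles the next case, checked before every agent action."""
    if controls.kill_switch:
        return "halted"   # queue routes back to humans
    if rolling_accuracy < controls.accuracy_floor:
        return "manual"   # automatic revert to fully manual
    return "agent"


def reverse_action(controls: Controls, action_id: str) -> None:
    """Operator reverses one agent action; no ticket, no deploy."""
    controls.reversed_actions.add(action_id)


def is_in_effect(controls: Controls, action_id: str) -> bool:
    return action_id not in controls.reversed_actions
```

A real per-decision override also has to undo the write in the downstream system; the set here only records that the operator asked for it.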

Stage 10: Operate

An agent in production is not done. It is on cadence. The cadence is fixed and instrumented, and the team treats missed reviews the way they would treat missed security patches. The short version is three rhythms.

Weekly, the operations lead reviews a sample of escalations and asks whether the right cases are arriving. Under-escalation is the silent failure: the cases that should have routed to a human but did not are the ones no one is looking for. Monthly, the build team reviews the confidence calibration against the previous month's data and adjusts the threshold or retrains the classifier. Quarterly, the compliance owner reviews the exception list and the severity cut-out list against any change in regulation, contract terms, or business product. Without the cadence the policy degrades and you do not notice until the auditor does.

Operating also means reading the audit log when nothing is wrong. The teams that get this right run a fifteen-minute log review every Friday afternoon: pick five random rows, walk through the decision, check that the rationale still makes sense. The teams that get it wrong only open the log when something is on fire, by which point the question is no longer what is the agent doing but what was the agent doing six weeks ago.

Stage 11: Extend

Once the first agent is paying back, usually six to ten weeks after cut-over, adjacent workflows are dramatically faster. The integration layer to the OMS, the CRM, the ticketing system is already in place. The escalation patterns are already calibrated. The audit log shape is already in production. The operator vocabulary is already built. The second agent inherits all of it.

Most teams who run this loop right ship three to five agents in the first twelve months. Most teams who try to ship all five in parallel ship zero. The discipline that earns the extensions is the discipline of finishing the first one cleanly. The boring workflows (high volume, low glamour, with a measurable hour count on the manual baseline) are usually where the second and third builds live.

What this lifecycle is not

This lifecycle is not a research programme. It does not start with a six-month evaluation of frontier models, a centre-of-excellence document, or a vendor bake-off. It starts with one workflow, one operator, one binding constraint, one rollback plan, and a model the team has already decided to use. The model choice is replaceable; the lifecycle is not. Teams that get this right are operating one agent in production while their competitors are still drafting the slide deck about which framework to evaluate.

It is also not a checklist you delegate to a vendor and walk away from. The operator on the workflow has to be in the room from stage one. The compliance owner has to be in the room from stage three. The on-call engineer who will be paged at 02:14 has to be in the room from stage eight. Skip any one of those seats and the agent ships into a vacuum: technically running, operationally unowned, quietly distrusted by everyone who is supposed to depend on it.

If the deliverable you want at the end of an engagement is a running agent with a named operator, a tested rollback plan, an audit log a regulator can read, and a measurable line item on what changed, the eleven stages above are how it gets built. The rest is overhead.

Where to start this week

Take the workflow you would put on stage one. Open a blank text file. Write an integration plan for it, one row per system, with seven columns: the system the agent will touch, the auth mechanism, the rate limit, the error model, the data-quality story, the change-management story, and the owner of the service-account conversation on the buyer side. If you cannot fill in all seven columns for any system, that is the conversation you need to have this week, not next quarter. Then sketch fifty representative cases from real data on the workflow. Label the correct output for each. That is the first quarter of an eval set. Doing both of those things will tell you in a week whether the workflow ships in twelve weeks or stalls in twenty-four. If you would like a second pair of eyes on the integration plan, that is the conversation we are happy to have.