A compliance officer is sitting opposite a head of operations with one question. Why did the agent close case 47? The case is a refund decision the agent made at 03:09 on a Tuesday in February, picked at random from the eight thousand cases the agent handled last quarter. In one version of the next ten minutes the head of operations opens a log, filters to case 47, and reads out the inputs the agent saw, the rule that fired, the confidence score, the model version live that night, and the system the action was written to. In the other version they say they will need to get back to the auditor by end of week. The first conversation ends. The second one expands. The gap is not the model. It is the escalation policy and the log that records what it did.
What escalation means in practice
Escalation is the contract between the agent and the human. It defines, in writing and before the agent ships, which decisions the agent makes alone, which it routes to a human with the relevant context attached, and which it is forbidden to make at all. Those three buckets are not optional. They are the structure of the system.
The default mistake is to phrase the contract as "the agent handles everything except edge cases". That sounds reasonable in a steering meeting and turns the human into a rubber stamp inside six weeks. The reviewer never sees the routine cases the agent already closed, so they have no calibration for what routine looks like this month versus last. By the time something has drifted, the reviewer is approving the agent's escalations on autopilot because the working assumption is that the agent was right on the ninety-eight percent they did not see.
A useful escalation policy works the other way round. The reviewer's role is defined positively: they own a named category of decisions and they see those decisions in full. It is the agent that acts under restriction, not the human. That framing matters because it is the only one an auditor can verify. "The human reviews exceptions" is unverifiable. "The human reviews every case where rule X fired, and the median time-to-review is logged" is verifiable, and the log will prove it or it will not.
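The verifiable version of that claim can be checked straight from the log. What follows is a minimal sketch, not a reference implementation: it assumes each log row is a dictionary with the hypothetical keys `rule_fired`, `escalated_at` and `reviewed_at`, and the sample rows are made up for illustration.

```python
from datetime import datetime
from statistics import median

def review_coverage(rows, rule):
    """Return (cases where `rule` fired, cases reviewed, median minutes to review)."""
    fired = [r for r in rows if r["rule_fired"] == rule]
    reviewed = [r for r in fired if r.get("reviewed_at") is not None]
    waits = [
        (r["reviewed_at"] - r["escalated_at"]).total_seconds() / 60
        for r in reviewed
    ]
    return len(fired), len(reviewed), (median(waits) if waits else None)

# Illustrative rows only; the key names and rule name are assumptions.
rows = [
    {"rule_fired": "refund_over_limit",
     "escalated_at": datetime(2025, 2, 4, 3, 9),
     "reviewed_at": datetime(2025, 2, 4, 3, 39)},
    {"rule_fired": "refund_over_limit",
     "escalated_at": datetime(2025, 2, 4, 9, 0),
     "reviewed_at": datetime(2025, 2, 4, 10, 0)},
    {"rule_fired": "refund_over_limit",
     "escalated_at": datetime(2025, 2, 5, 9, 0),
     "reviewed_at": None},  # fired but never reviewed: the gap an auditor will find
    {"rule_fired": None, "escalated_at": None, "reviewed_at": None},
]

fired, reviewed, median_minutes = review_coverage(rows, "refund_over_limit")
```

If `fired` and `reviewed` differ, the statement "the human reviews every case where rule X fired" is false for that period, and the log says so before the auditor does.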
Pattern 1: Confidence-thresholded handoff
The first pattern uses the agent's own confidence score (or the score from an LLM-as-judge layer sitting in front of the action) as the trigger. Above the threshold the agent acts. Below it the case routes to a human queue with the agent's draft decision attached as a starting point. This is the right pattern for high-volume work where the per-case stakes are moderate and the value of automation is in the throughput. Returns triage on a retailer pushing forty thousand RMAs a week. Ticket categorisation across a hundred-agent contact centre. CV screening for a graduate scheme with twelve thousand applications a cycle. First-pass anomaly classification on a SIEM feed.
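In code, the handoff itself is small. The sketch below is illustrative only: `Draft`, `hand_off` and the route names are hypothetical, and treating a score exactly equal to the threshold as "above" is an assumption the policy document would need to state.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    case_id: str
    decision: str      # the agent's proposed outcome
    confidence: float  # on the same scale the threshold was calibrated on

def hand_off(draft: Draft, threshold: float) -> dict:
    """Act at or above the threshold; otherwise queue for a human with the draft attached."""
    if draft.confidence >= threshold:
        return {"case_id": draft.case_id, "route": "agent_acts",
                "decision": draft.decision}
    # Below the threshold the agent's decision travels as a draft, not as an action.
    return {"case_id": draft.case_id, "route": "human_queue",
            "draft_decision": draft.decision}
```

The point of returning the draft on the human path is that the reviewer starts from the agent's reasoning rather than from a blank case.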
Here is the scene this pattern fits. The retailer's returns queue is sitting at four days. Sixty percent of returns are clean (item arrived, condition matches, refund approved against the policy table), and they have always been clean. A confidence-thresholded agent picks those off the top and closes them, and the human team focuses on the forty percent that need judgement: damage disputes, late returns, repeat offenders, cross-border tax recoveries. The queue drops to under a day. The audit story is that every closed case has a logged confidence above the threshold and a logged rule, and the auditor can pull the November distribution and check it against the January one.
The calibration step is where most teams get this wrong. The threshold is not picked from the model's training-time metrics, and it is not picked from the build team's intuition about what high confidence should mean. It is picked from a two-week shadow run on live data, where the agent makes a parallel decision on every real case and a human still does the work. At the end of the window the team plots the agent's confidence on the horizontal axis and the agreement rate against the human on the vertical axis. The threshold sits where the curve flattens, the point above which extra confidence does not buy extra accuracy. That number is almost never the round number the build team would have guessed.
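One way to read the threshold off a shadow run is sketched below. It assumes the shadow data is a list of `(confidence, agreed_with_human)` pairs with confidence in the range 0 to 1, and it treats "the curve flattens" as "every higher bin is within a small tolerance of the best agreement rate". That definition, the bin count and the tolerance are all assumptions, not a prescribed method.

```python
def agreement_by_bin(shadow, n_bins=10):
    """shadow: iterable of (confidence in [0, 1], agreed_with_human).
    Returns [(bin_lower_edge, agreement_rate, case_count)] for non-empty bins."""
    totals = [0] * n_bins
    agreed = [0] * n_bins
    for conf, ok in shadow:
        i = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        totals[i] += 1
        agreed[i] += int(ok)
    return [
        (i / n_bins, agreed[i] / totals[i], totals[i])
        for i in range(n_bins)
        if totals[i]
    ]

def flattening_threshold(curve, tolerance=0.01):
    """Lowest bin edge from which every higher bin is within `tolerance` of the best rate."""
    best = max(rate for _, rate, _ in curve)
    threshold = None
    # Walk down from the most confident bin and stop at the first bin that falls short.
    for edge, rate, _ in reversed(curve):
        if rate < best - tolerance:
            break
        threshold = edge
    return threshold
```

The case count is returned alongside each rate because a bin holding a handful of cases can look flat by accident; sparse bins deserve a look before the number goes into the policy.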
Confidence-thresholded handoff also needs a re-calibration cadence, because the data drifts. A returns triage agent calibrated in November will see a different distribution in January once the post-holiday returns arrive, and a threshold that was right for November can under-escalate in January because the agreement curve it was read from no longer describes the cases arriving. Calibration is not a launch task. It is a monthly task.
Pattern 2: Exception-typed routing
The second pattern ignores the agent's confidence entirely. Certain types of case route to a human every time, by rule, regardless of how sure the agent says it is. Any refund above a defined value. Any prior-authorisation denial. Any contract clause that deviates from the standard playbook. Any first notice of loss on a motor claim with bodily injury. Any change to a customer's payment instrument inside the seven days following a password reset.
The list is short. It is written down in a single document the operations lead can read in five minutes. It is reviewed monthly with a named owner. And it is treated as a hard switch in the orchestrator, not a soft preference in a prompt. A confidence score the agent generates cannot override an exception rule, because the rule lives one layer above the agent in the routing logic.
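A hard switch of this kind might look like the sketch below. The rule names, the refund limit of 500 and the case keys are hypothetical; what matters is the ordering, with the exception rules evaluated in the router before the confidence score is read at all.

```python
from typing import Callable

# Hypothetical exception rules: name -> predicate over the case. Values are illustrative.
EXCEPTION_RULES: dict[str, Callable[[dict], bool]] = {
    "refund_over_limit": lambda case: case.get("refund_amount", 0) > 500,
    "payment_change_after_reset": lambda case: (
        case.get("payment_instrument_changed", False)
        and case.get("days_since_password_reset", 999) <= 7
    ),
}

def route(case: dict, agent_confidence: float, threshold: float) -> tuple[str, list[str]]:
    """Return (route, rules_fired). Exception rules run before confidence is consulted."""
    fired = [name for name, applies in EXCEPTION_RULES.items() if applies(case)]
    if fired:
        # Hard switch: no confidence value can reach this branch and change the outcome.
        return "human_queue", fired
    if agent_confidence >= threshold:
        return "agent_acts", []
    return "human_queue", ["below_confidence_threshold"]
```

Because the rules live in the router and not in the prompt, changing the list is a code or configuration change with an owner and a review, which is what makes the monthly review meaningful.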
Consider the insurance scene. A US carrier runs a first-notice-of-loss agent on motor claims. The exception list says any claim mentioning bodily injury, any claim above $25,000, and any claim involving a commercial vehicle routes straight to a human adjuster regardless of confidence. The NAIC market-conduct examiner who arrives eighteen months later does not want to hear that the agent was 96% confident on a $40,000 claim it closed. They want the rule, the routing decision, the adjuster's name, and the timestamp. Same logic for a UK lender running an affordability-check agent under PRA and FCA expectations. Any application from a customer flagged as vulnerable, any application above a defined LTV, any application where the income evidence is non-standard: routed, full stop. The cost of false escalation is a small amount of human time. The cost of missed escalation is a Section 166 skilled-person review the firm pays for itself.
Pattern 3: Severity-gated cut-out
The third pattern is the rarest in volume and the most consequential when it fires. The agent has a defined list of severity flags. Any one of them, if detected in the inputs or in the agent's own intermediate reasoning, halts the agent immediately and pages a named on-call human through the on-call system (PagerDuty or its equivalent) with the case context attached. The agent takes no further action on that case.
These are the catastrophic-tail cases. A healthcare claims agent that reads clinical notes suggesting a suspected adverse drug event triggers the cut-out, because the path from there runs into HIPAA breach-assessment territory and OCR will want to know who saw the notes and when. A payments agent where one party matches an OFAC sanctions list, or a UK firm's equivalent HM Treasury list, halts before any value moves. A customer-support agent that detects self-harm language in a message halts and pages the safeguarding rota. A privileged-access change request landing outside the maintenance window halts and pages the security on-call. The defining feature is that the agent's confidence is meaningless. Not low. Meaningless. The cost of getting it wrong is unbounded and the events are too rare to calibrate against.
The cut-out list is also short, also written, also owned. The difference from Pattern 2 is the action: Pattern 2 routes to a queue that a human will work through during business hours. Pattern 3 wakes someone up. The on-call rota, the escalation tree if the primary does not acknowledge inside the defined window, and the back-out procedure for the agent's partial state are all part of the policy. Without them the cut-out is a comforting line in a document that has never been tested under load.
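The halt-and-page behaviour can be sketched as a check that runs before every agent step. Everything named here is hypothetical: the flag names, the `CutOut` exception, and the `page` callable, which stands in for whatever on-call integration the organisation uses and is not a real vendor API. The acknowledgement window, escalation tree and back-out procedure are deliberately left out of the sketch.

```python
from typing import Callable, Iterable

class CutOut(Exception):
    """Raised to stop all further agent work on a case."""

# Hypothetical severity flags; a real list would come from the written policy.
SEVERITY_FLAGS = {
    "sanctions_match",
    "self_harm_language",
    "suspected_adverse_drug_event",
}

def check_severity(case_id: str, detected: Iterable[str],
                   page: Callable[[str, dict], None]) -> None:
    """Page the on-call human and halt if any detected flag is on the cut-out list."""
    hits = sorted(SEVERITY_FLAGS.intersection(detected))
    if hits:
        page("primary_on_call", {"case_id": case_id, "flags": hits})
        raise CutOut(f"case {case_id} halted: {', '.join(hits)}")

def run_step(case_id: str, detected_flags: Iterable[str],
             action: Callable[[], object], page: Callable[[str, dict], None]):
    """Run `action` only if no severity flag fired; the check precedes every step."""
    check_severity(case_id, detected_flags, page)
    return action()
```

Raising an exception rather than returning a status is a design choice: the caller cannot carry on by forgetting to inspect a return value.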
Escalation is not a safety feature bolted on to the side of the agent. It is the architecture. The model is just the engine.
The audit log, and what to put in it
This is the section that justifies the article's title. The three patterns are how the agent behaves. The audit log is how anyone outside the agent can later verify the behaviour was correct. Without the log, the policy is an assertion. With the log, the policy is a defence. For every decision the agent makes (not just the escalated ones, every decision) the log captures the following:
- the full inputs the agent saw, with privacy redaction where applicable
- the model and prompt version used, so a decision from three months ago can be replayed against the same model
- the confidence score, on the calibration scale the threshold uses
- the escalation rules that applied: which one fired, and which were evaluated and did not
- the decision and its rationale, in free text from the agent
- the downstream actions taken, including which system was written to
- the human reviewer and their decision, if the case was escalated
- the timestamps for each step, including the gap between escalation and human review
- the stable case ID that ties the agent action back to the originating customer, claim, or order
Each field exists because an auditor will ask the question it answers. Imagine the agent rejected a loan application three months ago and the customer complains to the FOS. Can the team reconstruct the decision in 30 seconds? The inputs field answers "what did the agent know at the time?" The model and prompt version field answers "what was the agent, exactly?" (because a December model and a March model are different systems even if they share a name). The confidence score and rules-fired fields answer "what decision logic produced this outcome?", which is the question pure free-text rationales never quite settle. The downstream-actions field answers "what changed in the real world as a result?", which is the question that distinguishes an agent that recommended an action from an agent that took one. The stable case ID is the join key that lets the auditor pull every related artefact (the original application, the customer record, the bureau pull, the regulatory filing) without anyone reconstructing it by hand.
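The nine fields translate into a record shape along the following lines. This is a sketch under assumptions: the class name, field names and sample values are all hypothetical, and a real schema would follow whatever the observability stack already expects.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class AuditRecord:
    case_id: str                      # stable join key to the customer, claim or order
    inputs: dict                      # what the agent saw, after privacy redaction
    model_version: str
    prompt_version: str
    confidence: float                 # on the calibration scale the threshold uses
    rules_fired: list
    rules_evaluated_not_fired: list
    decision: str
    rationale: str                    # free text from the agent
    downstream_actions: list          # which systems were written to
    timestamps: dict                  # one entry per step
    reviewer: Optional[str] = None    # populated only when the case was escalated
    reviewer_decision: Optional[str] = None

    def to_json(self) -> str:
        """Serialise with sorted keys so identical records produce identical lines."""
        return json.dumps(asdict(self), sort_keys=True)

# Illustrative values only.
record = AuditRecord(
    case_id="case-47",
    inputs={"order_id": "[REDACTED]", "reason": "item not as described"},
    model_version="refund-agent-2025-01",
    prompt_version="v14",
    confidence=0.97,
    rules_fired=[],
    rules_evaluated_not_fired=["refund_over_limit"],
    decision="approve_refund",
    rationale="Return within window; condition matches policy table.",
    downstream_actions=[{"system": "payments", "action": "refund_issued"}],
    timestamps={"decided_at": "2025-02-04T03:09:00Z"},
)
line = record.to_json()
```

One record per decision, escalated or not, is what lets the distribution for one month be compared against another.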
The log lives in the same observability stack the rest of the business uses for production systems: Splunk, Datadog, an internal warehouse on top of BigQuery or Snowflake. It is not a sidecar database the AI team keeps to itself. The retention period matches the regulatory regime the workflow falls under: six years for HIPAA records the OCR may want to see, the ICO's documented periods for personal data under UK GDPR, the SEC's Rule 17a-4 periods for broker-dealer records (three or six years for most record types), the NAIC's state-by-state schedules for insurance. Access is read-only for everyone except the ingest pipeline, and changes to historical rows are themselves logged.
What most agent-governance writing gets wrong
Most of what is written about agent governance focuses on the model. Bias evaluations on benchmark datasets. Hallucination rates on held-out questions. Prompt-injection resistance under red-team prompts. Real engineering concerns that belong in the build. Not what an auditor asks about.
An auditor asks: "show me this decision and your defence for it." If the operations lead can pull the row, walk through the rule that fired, point at the reviewer who approved the escalation, and produce the model version that was running at the time, the conversation moves on. If they cannot, the model evaluations are irrelevant, not because they are wrong, but because they are answers to a different question. The auditor is not assessing whether the model is good in general. They are assessing whether one specific decision was defensible, on the specific day, under the specific policy in force.
Review cadence keeps the policy honest
A policy without a review cadence degrades, and no one notices until the auditor does. Three cadences cover most cases. Weekly, the operations lead reviews a sample of escalations and asks whether the right cases are arriving. Under-escalation is the silent failure mode, because the cases that should have routed to a human but did not are the ones no one is looking for. Monthly, the build team reviews the confidence calibration against the previous month's data and adjusts the threshold or retrains the judge layer. Quarterly, the compliance owner reviews the exception list and the severity cut-out list against any change in regulation, contract terms, or business product. Without the cadence, the policy is a document. With it, the policy is a control.
The architecture, not the safety net
It is tempting to treat escalation as a safety feature, the bit of the design that catches the agent when the model gets it wrong. That framing misses what is happening. The escalation policy decides which decisions the agent owns and which the organisation owns, which means it decides the agent's effective scope. Everything else (the model choice, the prompt engineering, the evaluation suite) is in service of that scope. A 90% accurate agent inside a tight escalation policy is a system the operations lead can defend on a Tuesday morning across the table from an auditor. A 95% accurate agent without one is a regulatory finding waiting for a sample size. The model is the engine. The escalation policy is the vehicle.
One move for this week
Before the next steering meeting, pick a closed case at random from the last quarter of the agent's output. Time yourself reconstructing it: the inputs the agent saw, the model and prompt version, the confidence, the rule that fired, the downstream action, the reviewer if any. If it takes more than two minutes, the gap is your audit log, not your agent. Write down which of the nine fields above were missing, hand the list to whoever owns the observability stack, and fix the field that would have been the first one the auditor asked for. That is the upgrade. Everything else follows from it.
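The "which fields were missing" step can be run mechanically across a sample as well as by hand. A minimal sketch, assuming each logged row is a dictionary keyed by the hypothetical field names below (one key per item in the nine-field list):

```python
NINE_FIELDS = [
    "inputs",
    "model_and_prompt_version",
    "confidence",
    "rules_applied",
    "decision_and_rationale",
    "downstream_actions",
    "human_review",
    "timestamps",
    "case_id",
]

def missing_fields(row: dict, escalated: bool) -> list:
    """List the audit fields that are absent or empty for one logged decision."""
    missing = []
    for name in NINE_FIELDS:
        if name == "human_review" and not escalated:
            continue  # a reviewer is only expected on escalated cases
        if row.get(name) in (None, "", [], {}):
            missing.append(name)
    return missing
```

Running this over a random sample from the last quarter gives a count per field, which is a more useful thing to hand to the observability owner than a single anecdote.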