Methodology: how Synarsi builds and ships AI agents

Phase 1: scope

Scope is where the engagement is won or lost. Most failed AI projects fail here, not in the build. The build is the easy part once the scope is honest. We use a written four-question filter, and roughly seventy percent of attractive-sounding ideas die at question three. Better to kill a bad idea on a Tuesday call than six months into a build nobody wants to admit is failing.

Can you describe the workflow in two sentences? If it takes a paragraph, it is several workflows wearing one hat. "We want to automate claims" is not a workflow. "The agent reads the FNOL email, pulls the policy from Guidewire, checks coverage, and either opens a claim or routes to the adjuster's queue" is.
What is the single most constraining boundary? Every workflow has one: HIPAA Minimum Necessary in healthcare, playbook fidelity in legal, the auditor's reconstruction test in banking, the refund-authority limit in retail, line-down risk in manufacturing, SLA breach risk on the FNOL queue. Naming it up front shapes the build.
Do you have API or integration access, or do we have to scrape? This is the kill question. Without integration access we are building screen-scraping infrastructure that breaks the first time the vendor ships a UI update. We have walked away at this question and will do it again.
Who keeps override authority, and what does the rollback look like? If no human keeps override, no compliance team will sign off. If there is no rollback plan, no ops team will trust the cut-over. Name the operator, on paper, before the build starts.

Output of phase one: a one-page scope. One workflow, one named operator, one binding constraint, one measurable target (hours back, error rate down, time-to-decision down), one rollback plan. We do not write multi-thousand-dollar discovery reports. The scope fits on a page or it is not scoped. Full reasoning in this article.

Phase 2: shadow run

We do not run sandbox pilots. A sandbox pilot proves the agent works on cherry-picked data with stubbed integrations. It proves nothing about Monday at 6am when the carrier API throttles, the vendor portal returns a 502, or the OMS returns a success code for an order that quietly failed. We have watched sandbox pilots get exec approval and collapse inside two weeks of production traffic. The most predictable failure pattern in this category.

The Synarsi alternative is a shadow run in production. The agent reads live data from the real systems (Salesforce, Guidewire, Epic, ServiceNow, the FNOL inbox, the OMS), makes a live decision on every case, writes nothing. The operator handles every case as normal, then sees what the agent would have done. Two to four weeks gives calibrated confidence thresholds, an honest accuracy number on YOUR data, and the integration edge cases: the customer record with an apostrophe, the API that returns 200 OK for a failed request, the Guidewire policy with a null effective date because the broker entered it as "see notes".

Output of phase two: a calibrated agent and a written go/no-go decision against the targets from phase one. Hit them, we cut over. Miss, we tighten scope or recommend stopping. We do not let shadow runs quietly extend without a decision, which is the single most common way agent projects die. Why pilots kill projects covers the failure mode in detail.

Phase 3: cut over

Cut-over is what makes the build a production agent rather than a demo. Three things go live together: written escalation policy, audit log, rollback plan. Miss any one and the agent is a demo with production traffic, which is worse than no agent at all.

The written escalation policy

Escalation is the contract between the agent and the operator. Three patterns, picked per workflow, sometimes layered:

Confidence-thresholded handoff. Below a calibrated threshold, the case goes to a human. A retail returns-triage agent routes anything below 0.85 confidence to the human queue. A recruitment CV-screener sends borderline matches to the sourcer.
Exception-typed routing. Certain case types always go to a human regardless of confidence. A refunds agent escalates anything above the store manager's authority limit. A prior-auth agent routes denials to the medical director. A contract-review agent flags any clause off the playbook to the partner of record.
Severity-gated cut-out. A defined severity flag halts the agent and pages on-call. A banking agent halts on a sanctions hit and pages compliance. A pharma agent halts on a suspected serious adverse event and pages the safety lead. A support agent halts on self-harm language and pages the on-call manager.

The audit log

Every decision is instrumented: full inputs the agent saw (with privacy redaction), model and prompt version, confidence score, which escalation rules fired, the decision and rationale, downstream actions (which Salesforce object updated, which Guidewire claim opened, which ServiceNow ticket created), the human reviewer if escalated, timestamps, a stable case ID. This is what lets the head of ops answer the auditor's question in thirty seconds, not three weeks. Full spec in this article.

The rollback plan

Three rollback patterns live from day one. Per-decision override: the operator reverses any single action without engineering. Threshold-rollback: if accuracy or confidence drops below a defined floor, the agent reverts to manual and pages the operator. Full kill switch: one button stops the agent and routes everything to the manual queue. Nobody should have to be a hero, or wake an engineer at 2am, to roll back.

The model is the engine. Escalation, audit log, and rollback are the architecture. The architecture is what decides whether the agent ships or quietly fails.

Phase 4: extend

Once the first agent is paying back, usually six to ten weeks after cut-over, extending to adjacent workflows is faster. Integration layer, escalation patterns, audit log shape, operator vocabulary already in place. A second agent on the same Salesforce + ServiceNow surface is roughly half the build time of the first. We almost never recommend multi-agent on the second build: the single-agent-first thesis holds for the first 60 to 90 days of every subsequent workflow.

Teams who run this loop right ship three to five agents in the first twelve months. Teams who try to ship all five at once ship zero.

Review cadence (after cut-over)

An agent in production is not done. We instrument and review on a fixed cadence: weekly on escalations (are the right cases escalating, are humans agreeing with the agent on routine ones, is the reviewer overturn rate above 30 percent), monthly on threshold recalibration (has the confidence distribution drifted, are we letting too much through or escalating too much), and quarterly on the exception list (have business rules changed, new regulator guidance, new product line, new case types that should always escalate).

Without the cadence the policy degrades and you do not notice until the auditor does. Or worse, a customer does.

Why this methodology, and not the alternatives

Every consulting firm has a methodology. Most are designed to fit a slide deck. Ours is designed to fit production. The output of a Synarsi engagement is not a 40-page report with a roadmap and a maturity model. It is a running agent with a named operator, a rollback plan, an audit log a regulator can read, and a measurable line item on what it changed. If you want that deliverable, the methodology above is how we get there. If you want the report, other firms will sell it to you, and we will say so on the first call.

How we work: the four-phase build methodology

Phase 1: scope

Phase 2: shadow run

Phase 3: cut over

The written escalation policy

The audit log

The rollback plan

Phase 4: extend

Review cadence (after cut-over)

Why this methodology, and not the alternatives

Want to run this on your workflow?

How we work: the four-phase build methodology

Phase 1: scope

Phase 2: shadow run

Phase 3: cut over

The written escalation policy

The audit log

The rollback plan

Phase 4: extend

Review cadence (after cut-over)

Why this methodology, and not the alternatives

Keep reading

Why AI pilots quietly kill agent projects, and what to ship instead

The four questions we ask before building any agent

Writing an agent escalation policy that survives an audit

Why we recommend single-agent before multi-agent on the first build

Want to run this on your workflow?