The slide read "Pilot complete: 94% agreement with adjuster baseline." The room nodded. The COO asked when production would land. The programme director said, "We will scope phase two in the next planning cycle." Eighteen months later there was no agent in production, the programme director had moved to a different remit, and the procurement team could not work out who was meant to renew the original statement of work. The Looker board still existed. Nobody opened it.
That is the pattern, and it is not unusual. In our intake conversations across 2025 and early 2026, roughly two in three buyers describe a previous AI engagement that ended exactly this way. The pilot did not fail. It simply did not graduate.
Pilots are safe, and that is the problem
The 90-day pilot is the most common shape AI consulting takes in 2026, and the shape that ships the least production code. It is easy to see how it became the default. A pilot has a fixed end date, which protects the buyer from runaway spend. It has a small budget, usually under the GBP 50k mark that triggers a multi-stakeholder procurement review. It has no production owner, which means no one inside the organisation has to publicly promise that the agent will be running their workflow on the other side.
Each of those features removes a different category of risk. Together they remove almost every reason for anyone to fight for the project once the demo is done. The operations director was not asked to commit headcount. The procurement lead was not asked to defend a recurring software line item in next year's budget. The team leads on the floor were not told their override authority would be preserved on day one of cutover, because there was no day one of cutover on anyone's calendar.
So when the pilot ends and the real question lands, "do we move this into the live workflow?", the answer requires someone to spend political capital they were specifically told they would not need to spend. Nobody volunteers. The operations director who would have to reshape a 40-person team's daily work was not asked to plan for that reshaping. The IT lead who would have to sign off write scopes against the production Salesforce instance was not asked to scope that approval. The team leads who would have to retrain their floor on a new tool were not given a date to plan around. Every approval that was deferred to the production phase is now a brand-new conversation, started from cold, with people whose calendars filled the week the pilot ended.
Meanwhile the Looker board with the favourable comparison sits in a Confluence page that fewer people open each week. Six months later it gets a one-line reference in a slide deck about "AI initiatives we have explored", which is the corporate equivalent of a death certificate. The vendor has moved on to the next pilot at the next account. The internal champion who pushed for the project has either left or learned to stop pushing. The integration code is still in the feature branch. Nobody merges it because nobody's job description says they should.
The "measure ROI" tell
The other reason pilots feel reassuring is that they promise to measure ROI before the organisation commits anything serious. This is the part that deserves the hardest look, because the ROI number that comes out of a pilot is almost always meaningless, and the people producing it usually know.
The pilot ran on a curated slice of data. The build team picked the cleanest fortnight of order history, or the three product categories where the SKU master is well-maintained, or the customer segment where call-centre notes are reliably structured because that team happens to use a template. The buyer-side operator was the build team's internal champion: the person who tolerated the rough edges, filled in the missing fields by hand, and gave the agent the benefit of the doubt when its confidence score sat at 0.71 against a working threshold of 0.85. The integration to the live OMS was a hand-stubbed mock returning canned responses at a 180ms latency that the real system, sharing a database with batch reporting on a Tuesday afternoon, cannot hit on its best day.
Extrapolating from that environment to "the agent will save 4.2 hours per planner per week across 60 planners, worth GBP 480k annually" is not analysis. It is a sales artefact. The honest version is that you cannot know what the agent will do in production until it has run in production. The pilot did not answer the question it was sold as answering. It answered an easier one, "can this thing work at all on a good day, on data we chose?", and called that ROI.
A pilot answers an easier question than the one it was sold to answer, then calls the easier answer ROI.
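To see how much weight the headline figure puts on one pilot-measured number, it helps to unpack the arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 52 working weeks per year is our assumption, not something a pilot deck would necessarily state.

```python
def annual_value_gbp(hours_saved_per_week, planners, hourly_rate_gbp,
                     weeks_per_year=52):
    """Headline saving = hours x planners x weeks x hourly rate."""
    return hours_saved_per_week * planners * weeks_per_year * hourly_rate_gbp


def implied_hourly_rate_gbp(annual_value, hours_saved_per_week, planners,
                            weeks_per_year=52):
    """Back out the hourly rate a headline figure silently assumes."""
    return annual_value / (hours_saved_per_week * planners * weeks_per_year)


# Figures from the example claim; weeks_per_year=52 is an assumption.
rate = implied_hourly_rate_gbp(480_000, 4.2, 60)

# The claim is linear in the hours figure: if production delivers half
# the pilot's 4.2 hours, the annual value halves with it.
half = annual_value_gbp(2.1, 60, rate)

print(f"implied rate: GBP {rate:.2f}/hour")
print(f"value at half the pilot saving: GBP {half:,.0f}")
```

Every term except the hours saved is a fixed input. The one number measured on curated data is the one the whole figure scales with.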
Ship a scoped first-workflow instead
The alternative is not bigger or riskier. It is just shaped differently. Pick one workflow, not a portfolio of three to compare. The FNOL triage queue in a claims operation. The reconciliation step between the OMS and the 3PL's shipment confirmations. The first-pass review on motor claims under GBP 5k. One workflow, with a named operator team on the other side, whose floor lead knows the agent is coming.
Build against the production integrations from week one. Not a sandbox copy of Salesforce but the real Salesforce instance, with read scopes pared down and write scopes behind a feature flag controlled by the IT lead. Not a CSV export of Guidewire but a service account on Guidewire, with audit logging routed to the security team's existing SIEM. Integration work is sixty to seventy per cent of the project. Doing it against fake systems means doing it twice, and the second time is the time the live data shape disagrees with the sample data shape in three places nobody wrote down.
Run the agent in shadow mode in production for a defined window, three to six weeks depending on the workflow's volume. The agent sees real cases, makes real decisions, and writes them to a parallel log. A human still does the work. At the end of the window the team compares the agent's decisions against the human's on the same cases, on the same day, at the same level of queue pressure. This is the only ROI measurement that means anything, because it is the only one that ran on the real distribution under real load.
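The end-of-window comparison can be sketched in a few lines. This is a minimal illustration, assuming both the parallel log and the human record reduce to a case ID and an outcome label; the `Decision` type, the `shadow_agreement` function, and the outcome labels are hypothetical names, not part of any particular orchestrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    case_id: str
    outcome: str  # hypothetical labels, e.g. "approve" or "refer"


def shadow_agreement(agent_log, human_log):
    """Return (agreement_rate, disagreements) over cases in both logs."""
    human_by_case = {d.case_id: d.outcome for d in human_log}
    compared = 0
    disagreements = []
    for d in agent_log:
        human_outcome = human_by_case.get(d.case_id)
        if human_outcome is None:
            continue  # no human decision recorded for this case; skip it
        compared += 1
        if d.outcome != human_outcome:
            disagreements.append((d.case_id, d.outcome, human_outcome))
    if compared == 0:
        raise ValueError("no cases appear in both logs")
    return 1 - len(disagreements) / compared, disagreements


agent_log = [Decision("c1", "approve"), Decision("c2", "refer"),
             Decision("c3", "approve"), Decision("c4", "approve")]
human_log = [Decision("c1", "approve"), Decision("c2", "refer"),
             Decision("c3", "refer"), Decision("c4", "approve")]

rate, diffs = shadow_agreement(agent_log, human_log)
print(rate, diffs)
```

Returning the disagreeing cases alongside the rate matters more than the rate itself. The floor lead reads the list of disagreements before cutover.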
Then cut over with the rollback wired in. The agent takes the workflow. The operator team keeps override authority on every decision, visibly, with one click, with the override logged to a dashboard the floor lead reviews every morning. A flag in the orchestrator can route 100% of cases back to the human queue inside two minutes if drift starts to show up in the override rate. The budget was scoped to this workflow, not to "the AI strategy", which means it can be defended in a single P&L conversation rather than a portfolio narrative that depends on five other things going right.
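A rollback flag of this kind can be very small. The sketch below assumes a rolling window of recent decisions and a fixed override-rate threshold. The `Orchestrator` class, the window size, and the 20% threshold are all hypothetical values chosen for illustration. It only signals suspected drift; a named person still flips the flag.

```python
from collections import deque


class Orchestrator:
    def __init__(self, window=100, max_override_rate=0.2):
        self.agent_enabled = True  # the rollback flag
        self.max_override_rate = max_override_rate  # assumed threshold
        self._recent = deque(maxlen=window)  # True = operator overrode

    def record_decision(self, overridden):
        self._recent.append(bool(overridden))

    def override_rate(self):
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def drift_suspected(self):
        # Only judge once the window is full, to avoid noisy early alarms.
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and self.override_rate() > self.max_override_rate

    def rollback(self):
        """Route every subsequent case back to the human queue."""
        self.agent_enabled = False

    def route(self, case_id):
        return "agent" if self.agent_enabled else "human_queue"


orch = Orchestrator(window=10, max_override_rate=0.2)
for overridden in [False] * 7 + [True] * 3:
    orch.record_decision(overridden)

before = orch.route("case-1")
if orch.drift_suspected():
    orch.rollback()  # in practice a person with authority makes this call
after = orch.route("case-2")
print(before, after)
```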
Three questions that reveal the shape
If you are the buyer, the difference between a vendor selling a pilot and a vendor selling a scoped build is not always obvious from the pitch deck. Both decks have a timeline, a price, and an outcomes slide. The difference shows up when you ask three questions and watch what happens in the next thirty seconds.
- What does the rollback plan look like on day one of cutover, who has authority to trigger it, and how is it tested before go-live?
- What does the agent explicitly not decide, which cases route to a human, by what rule, and what does the human see when they receive one?
- Who owns this workflow on our side once it is live, what does their first month look like in terms of override review and tuning authority, and where does that show up in their performance review?
A vendor selling a real build has answers ready. The rollback is a flag in the orchestrator, the operations lead can flip it, and it was last flipped in the staging environment on a Wednesday two weeks ago. The agent does not decide refund cases above GBP 250, or any case where the customer has logged a complaint in the last 30 days, or where the model's calibrated confidence falls below 0.85, and those rules are written into the escalation policy, not implied in a sales conversation. The owner on the buyer side is named in the SOW, has two hours per week blocked for the first month, and has been in the design conversations since week two.
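Written down, an escalation policy like the one in that answer is a handful of explicit rules. The sketch below uses the three example boundaries from the paragraph above; the `Case` fields and reason strings are hypothetical, and a real policy would sit in configuration the operations lead can read.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REFUND_LIMIT_GBP = 250
COMPLAINT_WINDOW = timedelta(days=30)
MIN_CONFIDENCE = 0.85


@dataclass
class Case:
    is_refund: bool
    amount_gbp: float
    last_complaint: Optional[date]  # None if no complaint on record
    confidence: float  # the model's calibrated confidence


def escalation_reason(case, today):
    """Return why a case goes to a human, or None if the agent may decide."""
    if case.is_refund and case.amount_gbp > REFUND_LIMIT_GBP:
        return "refund_above_limit"
    if (case.last_complaint is not None
            and today - case.last_complaint <= COMPLAINT_WINDOW):
        return "recent_complaint"
    if case.confidence < MIN_CONFIDENCE:
        return "low_confidence"
    return None


today = date(2026, 3, 1)
print(escalation_reason(Case(True, 300.0, None, 0.95), today))
print(escalation_reason(Case(False, 40.0, date(2026, 2, 15), 0.95), today))
print(escalation_reason(Case(False, 40.0, None, 0.71), today))
print(escalation_reason(Case(True, 120.0, None, 0.93), today))
```

Returning a reason rather than a bare yes/no answers the second half of the question: the human who receives the case sees why it was routed to them.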
A vendor selling a pilot dressed as a build will redirect each question into "that will be defined in the next phase." That phrase is the tell. The next phase is where intentions go to be measured against budgets that have not been written and committees that have not been formed. If the rollback, the boundary, and the owner cannot be named before the contract is signed, the project does not have the shape of something that will run in production. It has the shape of something that will demo well, attract a polite round of questions, and then drift into a planning cycle that never finds room for it.
When a pilot is the right answer
Pilots have one legitimate use. There is a genuine open question about whether an agent's accuracy on the buyer's specific data will clear a usable threshold, usually because the data is unusually messy, or the domain is unusually nuanced, or the buyer is the first organisation in a sector to try this particular agent shape. A clinical coding agent on a hospital trust's free-text discharge summaries written by junior doctors at 3am. A claims classifier on a regional insurer's 1990s-era policy taxonomy that still references product codes the company retired in 2009. A contract clause extractor on a law firm's mixed scanned-and-digital archive, where a quarter of the documents are 1990s fax copies that OCR turns into Cyrillic.
In those cases a pilot is the right tool, but only if it is structured as one. That means a single written question, "can the agent classify the top 20 ICD-10 categories with precision above 0.92 on our last 12 months of discharge summaries from this trust?", and a written go/no-go gate that every named stakeholder has signed before the work starts. If precision clears 0.92, we move to a scoped build against the live EHR by Q3, with the clinical informatics lead as owner. If it does not, we stop, document the failure mode, and the budget reverts. Both branches have an owner and a date. The go/no-go meeting is in the calendar before week one.
If the gate is not written down in advance, the project is not a pilot. It is a hedge: a way for everyone involved to do something AI-shaped without anyone committing to the outcome. Hedges are politically useful and operationally inert. They produce decks, not deployments. "We will run another month and see how it goes," said no agent project that ever made it to production.
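A written gate is mechanical enough to express as code, which is a reasonable test of whether it has really been written. The sketch below takes the clinical coding example and assumes the gate is judged on micro-averaged precision over predictions that fall in the in-scope categories. That averaging choice, the function names, and the two sample ICD-10 codes are our assumptions for illustration; the signed gate should state its own definition.

```python
def micro_precision(pairs, in_scope):
    """pairs: (predicted_code, actual_code); precision over in-scope predictions."""
    scored = [(p, a) for p, a in pairs if p in in_scope]
    if not scored:
        raise ValueError("agent made no in-scope predictions")
    correct = sum(1 for p, a in scored if p == a)
    return correct / len(scored)


def go_no_go(pairs, in_scope, threshold=0.92):
    """Return ("go" | "no_go", precision); 'go' means strictly above threshold."""
    precision = micro_precision(pairs, in_scope)
    return ("go" if precision > threshold else "no_go"), precision


# Illustrative data only: two codes stand in for the top 20 categories.
in_scope = {"I10", "J18"}
passing = [("I10", "I10")] * 93 + [("I10", "J18")] * 7
failing = [("I10", "I10")] * 92 + [("I10", "J18")] * 8

print(go_no_go(passing, in_scope))
print(go_no_go(failing, in_scope))
```

Both branches return a decision, which is the point: the function has no "run another month" output.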
The shape that ships
The agents that make it into production look the same across sectors. Operators give them a name and use it in conversation: not "the AI", but "the triage bot" or "the recon agent" or "Helga". They handle one workflow the floor team can describe in a sentence. They have an owner on the buyer side whose quarterly review references the agent's outcomes by number. They have a rollback that was tested in anger before go-live, and a feature flag that has been flipped at least once because something looked wrong. They were scoped, built, shadowed, and cut over on a calendar everyone agreed to before the contract was signed.
The agents that do not ship had a 90-day window and a "we will see". The technical work in both cases is comparable. The shape is not. If a project does not have a name, a workflow, an owner, a rollback, and a written go/no-go gate from day one, it is not an agent project. It is a pilot, and the pilot is where the project quietly ends.
The tactical move for any buyer reading this: before you sign the next AI engagement, write the production owner's name and the go/no-go criteria into clause one of the SOW. If the vendor pushes back, you have your answer about what was being sold under the pilot label.