Why most AI projects fail, and what we do differently.

We've audited a few dozen stalled or failed AI projects over the last two years. The failure modes are remarkably consistent. Here are the six we see most often, ranked by how much money they cost the company, with the prevention next to each.

1 · No eval suite

Cost: a lot.The team builds something that “feels good” in demos but has no measurable definition of done. When a stakeholder pushes back on quality, there's no objective answer — so the team rebuilds, again, against the new taste of the new stakeholder. We've seen this consume eighteen months on a project that should have shipped in three.

Fix:Write the eval suite in week one, before any model code. 50–200 representative examples with expected outputs. The eval bar (e.g. “≥85% accuracy on this set”) is your ship gate. No exceptions.

2 · Framework lock-in

Cost: high.Teams adopt LangChain or LlamaIndex on day one because “everyone uses them.” A few months in they hit a bug in the orchestration layer, can't debug it, and can't move off without a partial rewrite. The vendor's incident pages and Github issues become required reading.

Fix: Go direct to the model API for the parts you control. Frameworks are great for getting to a prototype faster; they are bad at being load-bearing infrastructure for a year-long engagement.

3 · Latency tax under-counted

Cost: medium.The team ships an agent that takes fifteen seconds per request. Users abandon. Daily active usage drops, and the team can't figure out why the deflection number looks fine but the impact metric doesn't.

Fix:Treat latency as a first-class metric next to quality from day one. If you're replacing a human who took a day, 15 seconds is free. If you're replacing a click that took half a second, 15 seconds is a non-starter.

4 · Hidden prompt injection

Cost: medium, occasionally catastrophic. The team accepts user-controlled text into the prompt without sanitization. A user discovers (or a security researcher reports) that they can override the system prompt and get the agent to misbehave or leak data from the context window.

Fix: Treat the prompt boundary like a SQL boundary. Sanitize, escape, and where possible structure user input as JSON arguments rather than free-form text. Run a red-team pass before launch.

5 · No cost monitoring

Cost: small, then suddenly large. The team ships. Usage grows. The model bill grows with it. Two months in, the CFO asks why the OpenAI bill is $42k for the month and nobody on the eng team can explain it without a half-day investigation.

Fix: Token-usage telemetry per route, per user tier, per feature. Cost dashboard in the same place as the latency dashboard. Alert on weekly spend exceeding plan + 30%.

6 · The handoff was a handover

Cost: ongoing.External vendor builds the thing, delivers a zip, leaves. Six months later the model API has changed, the eval suite hasn't been touched, and the internal team doesn't know how to add a new use case. The system stops mattering.

Fix:A real handoff includes the runbook, the eval suite the system was built against, the cost dashboard, the observability dashboard, the on-call rotation transition, and one to two weeks of paired work after the build is “done” where the internal team adds a new use case with the build team watching.

What we do differently

We open with the eval suite. We go direct to the model API. We treat latency, cost, and prompt-injection as first-class metrics next to quality. And the engagement isn't closed until your team has shipped a change on their own. Boring. Effective.

If you're in the middle of one of these failure modes already, the cheapest fix is to stop adding features and spend a week building the eval suite. Most of the “the AI isn't working” problems become tractable engineering problems the moment they have a number attached to them.