Most LLM ROI math is wrong. Not because the model is mispriced — the numbers are easy enough to find — but because teams forget that the deflection rate is a distribution, not a constant, and they forget that latency is a tax on user behavior.
Here's the framework we use when we scope an engagement, written plainly so you can apply it before you talk to any vendor.
The five inputs
- Task volume. How many of the relevant tasks does your team handle per week? (Tickets, claims, RFPs, queries — pick the unit.)
- Loaded hourly cost. Salary + benefits + tooling + overhead, divided by hours worked. Don't use base salary; you'll under-count the savings.
- Expected deflection rate. What percentage of the tasks can plausibly be handled without a human, at acceptable quality? Be conservative — we usually start at 50% for the business case even when pilots suggest 70%+.
- Deployment cost. All-in build cost (us, your team, third-party services for 6–12 months). For an 8-week engagement, the build cost is a known fixed number; for the first year of ongoing cost, model it generously.
- Latency tax. If responses take 30 seconds, some percentage of users will abandon, and you can measure that in your funnel. If you're replacing humans who took 24 hours, latency is a free lunch.
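One way to keep the five inputs honest is to write them down as a single record before running any math. The sketch below is a minimal illustration, not a prescribed model: the class name `RoiInputs`, its field names, and the choice to treat the latency tax as a simple multiplicative haircut on deflection (through a hypothetical `abandonment_rate`) are all assumptions made for this example, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoiInputs:
    """The five inputs in one record. All names here are illustrative."""
    tasks_per_week: float          # task volume, in whatever unit you picked
    loaded_hourly_cost: float      # salary + benefits + tooling + overhead, per hour
    deflection_rate: float         # fraction of tasks handled without a human (0..1)
    deployment_cost: float         # all-in year-one cost
    abandonment_rate: float = 0.0  # ASSUMED share of users lost to slow responses (0..1)

    def __post_init__(self):
        # Rates are fractions; reject anything outside 0..1 early.
        for name in ("deflection_rate", "abandonment_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def effective_deflection(self) -> float:
        """Deflection after the latency tax (assumption: multiplicative haircut)."""
        return self.deflection_rate * (1.0 - self.abandonment_rate)


# Invented sample values, only to show the shape of the record.
sample = RoiInputs(
    tasks_per_week=1000,
    loaded_hourly_cost=60.0,
    deflection_rate=0.50,
    deployment_cost=200_000,
    abandonment_rate=0.10,
)
print(f"Effective deflection: {sample.effective_deflection():.1%}")
```

If your funnel data suggests abandonment behaves differently (for example, only above a latency threshold), replace `effective_deflection` with whatever matches what you measure.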
The worked example
A 200-analyst capital-markets firm. Loaded hourly cost: $84. Internal-research tasks per week per analyst: ~12. Conservative deflection rate at confidence ≥0.85: 35%. Build cost for a retrieval copilot like the one we built for Affidavit Mapp: in the low six figures all-in for year one.
Run the math: 200 analysts × 12 tasks per week × 35% deflection × ~20 minutes (1/3 hour) saved per task × $84/hr ≈ $23,500 per week, or about $1.2M per year over 52 weeks in recovered time. Even with a generous build budget, payback is under three months.
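The same arithmetic can be sketched in a few lines of Python so you can substitute your own inputs. Two things here are assumptions rather than figures from the example: the 52 working weeks per year, and the $300,000 build cost, which is a hypothetical stand-in for "low six figures". The function names are illustrative.

```python
WEEKS_PER_YEAR = 52  # ASSUMPTION: tasks arrive every week of the year


def annual_savings(analysts, tasks_per_analyst_per_week, deflection_rate,
                   minutes_saved_per_task, loaded_hourly_cost,
                   weeks_per_year=WEEKS_PER_YEAR):
    """Dollar value of analyst time recovered per year."""
    deflected_tasks_per_week = analysts * tasks_per_analyst_per_week * deflection_rate
    hours_saved_per_week = deflected_tasks_per_week * minutes_saved_per_task / 60
    return hours_saved_per_week * loaded_hourly_cost * weeks_per_year


def payback_months(build_cost, savings_per_year):
    """Months of savings needed to cover the build cost."""
    return build_cost / (savings_per_year / 12)


savings = annual_savings(200, 12, 0.35, 20, 84)
# HYPOTHETICAL build cost: the example only says "low six figures".
months = payback_months(300_000, savings)
print(f"Recovered time: ${savings:,.0f} per year")
print(f"Payback: {months:.1f} months")
```

Under the 52-week assumption this comes to roughly $1.22M per year. Drop `weeks_per_year` to 46 or 48 if you want to account for holidays and leave, and the payback period lengthens accordingly.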
The sensitivity check
The example above assumes 35% deflection. What if it lands at 20%? Re-run the same math at 20% and you still net out positive in year one, but the payback period stretches to five or six months. What if it lands at 10%? Now you need to be honest with yourself: maybe this is the wrong use case, or maybe the eval bar needs to come down, or maybe the quality you're targeting is unrealistic.
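Re-running the math at several deflection rates is easy to script. The sketch below reuses the worked example's inputs as defaults; the 52 weeks per year and the $300,000 build cost are assumptions (the example only says "low six figures"), and the function name is illustrative.

```python
def payback_months(deflection_rate, build_cost,
                   analysts=200, tasks_per_analyst_per_week=12,
                   minutes_saved_per_task=20, loaded_hourly_cost=84,
                   weeks_per_year=52):
    """Months to recover build_cost at a given deflection rate."""
    if deflection_rate <= 0:
        raise ValueError("deflection_rate must be positive")
    hours_per_year = (analysts * tasks_per_analyst_per_week * deflection_rate
                      * minutes_saved_per_task / 60 * weeks_per_year)
    savings_per_year = hours_per_year * loaded_hourly_cost
    return 12 * build_cost / savings_per_year


BUILD_COST = 300_000  # HYPOTHETICAL stand-in for "low six figures"

for rate in (0.35, 0.20, 0.10):
    months = payback_months(rate, BUILD_COST)
    print(f"deflection {rate:.0%}: payback {months:.1f} months")
```

With these assumed inputs the payback comes out at roughly 2.9, 5.2, and 10.3 months. Sweep the other inputs the same way (minutes saved per task is usually the next softest number) to see which one moves the result most.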
The point of the sensitivity check isn't to find the answer. It's to find the assumption your ROI depends on, then decide whether you believe it before you commit budget.
The 90-day post-mortem
Whatever number you computed pre-build, write it down. Then measure the same number ninety days post-launch. If the gap is more than 20%, something in your assumptions was wrong — and you should figure out which one before you scale the system to the next use case.
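A minimal sketch of that check, assuming the 20% is read as a relative gap (the difference divided by the pre-build forecast) rather than percentage points. The function names and the sample figures are hypothetical.

```python
def relative_gap(predicted, measured):
    """Gap between forecast and measurement, as a fraction of the forecast."""
    if predicted == 0:
        raise ValueError("predicted must be non-zero")
    return abs(measured - predicted) / abs(predicted)


def needs_review(predicted, measured, threshold=0.20):
    """True when the gap exceeds the threshold (20% by default)."""
    return relative_gap(predicted, measured) > threshold


# Invented figures: forecast 35% deflection, measured 26% at day ninety.
predicted_rate, measured_rate = 0.35, 0.26
gap = relative_gap(predicted_rate, measured_rate)
print(f"gap: {gap:.0%}, review assumptions: {needs_review(predicted_rate, measured_rate)}")
```

Run the same comparison on each input you forecast (deflection rate, minutes saved, task volume), not only on the final dollar figure, so the gap points at a specific assumption.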
Most of the AI projects that “fail” never get this far, because nobody measures rigorously. Don't be that team.