We get asked this on roughly every other first call. The question usually arrives in the shape of “we're considering fine-tuning a model on our data” — and the answer, nine times out of ten, is “you don't need to.”
Here's the decision tree we walk teams through, in the order we walk them.
Default: start with RAG
Retrieval-augmented generation (pulling relevant context from a vector store and passing it to the model at inference time) solves the problem most teams have. Which is: “the model is generally smart but it doesn't know our specifics.”
RAG is cheap to iterate on. You can change the corpus daily. You can add a new document and have it instantly accessible to the model. You don't need GPUs. You don't need a training pipeline. You don't need to wait six hours for a fine-tune job. For most internal-search, internal-copilot, and customer-support use cases, this is correct, and you're done.
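To make the retrieve-then-prompt loop concrete, here is a minimal sketch. It assumes a toy word-count "embedding" and an in-memory dict standing in for the vector store; the corpus contents and the names `embed`, `retrieve`, and `build_prompt` are hypothetical, and a real system would call an embedding model and a proper vector store instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(
        corpus, key=lambda doc_id: cosine(q, embed(corpus[doc_id])), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(
        f"[{doc_id}] {corpus[doc_id]}" for doc_id in retrieve(query, corpus, k)
    )
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"


# Hypothetical corpus; the "vector store" here is just a dict.
corpus = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

# Adding a document makes it retrievable on the next query, with no training step.
corpus["warranty"] = "Hardware carries a two year warranty from the purchase date."

top_hit = retrieve("How long is the warranty?", corpus, k=1)
prompt = build_prompt("How long is the warranty?", corpus, k=1)
```

The point of the sketch is the last few lines: updating what the model can draw on is a write to the store, not a training job.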
Three real signals to consider fine-tuning
We'll recommend fine-tuning over (or in addition to) RAG when one of these is true:
- The output needs to follow a specific style or format that's hard to prompt. Brand voice. Legal-document structure. Code in your house style. Things you can show ten examples of and have the model still drift back to its prior.
- Latency is critical and prompt size is the bottleneck. Smaller fine-tuned models can outperform a larger model + RAG context for narrow tasks. We've seen 40-token completions go from 800ms to 120ms by moving from gpt-4o + RAG to a fine-tuned Llama model for the same task.
- Compliance requires the model to never “see” certain data at inference time. If sensitive data has to be in the model's weights instead of in a prompt, fine-tuning is the answer.
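The three signals above can be read as a simple checklist. The sketch below is one hypothetical way to encode it; the `UseCase` fields and the `recommend` function are illustrative names, not part of any tool, and the real decision involves judgment that three booleans do not capture.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    # Hypothetical flags, one per signal in the list above.
    style_drifts_despite_examples: bool = False
    latency_bound_by_prompt_size: bool = False
    data_barred_from_prompts: bool = False


def fine_tuning_signals(uc: UseCase) -> list[str]:
    """Return the names of the signals that are present for this use case."""
    checks = {
        "style or format is hard to prompt": uc.style_drifts_despite_examples,
        "latency is limited by prompt size": uc.latency_bound_by_prompt_size,
        "data may not appear at inference time": uc.data_barred_from_prompts,
    }
    return [name for name, present in checks.items() if present]


def recommend(uc: UseCase) -> str:
    """Default to RAG; flag fine-tuning only when at least one signal is present."""
    signals = fine_tuning_signals(uc)
    if not signals:
        return "start with RAG"
    return "consider fine-tuning: " + ", ".join(signals)


internal_search = recommend(UseCase())
brand_copy = recommend(UseCase(style_drifts_despite_examples=True))
```

With no signals set, the helper returns the default; any single signal is enough to put fine-tuning on the table.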
The hybrid case
Sometimes you want both — a fine-tuned base model that knows your style and structure, with RAG on top for facts and fresh data. This is most common in product-facing use cases where output quality is public and brand-sensitive.
Don't start here. Start with RAG. Add fine-tuning if and when a specific quality bar in your eval suite isn't reachable any other way.
Costs over time
A RAG system has roughly flat operating cost as your data grows — you pay for embeddings and vector storage, both of which are cheap. A fine-tuned system has near-zero marginal data cost but a periodic re-training cost when your data shifts. At six months in, RAG is usually cheaper. At twenty-four months in, it depends on how much your domain drifts.
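A rough way to reason about the crossover is a toy cumulative-cost model. Everything numeric below is a hypothetical placeholder, not a measured figure: the sketch assumes RAG costs a flat amount per month, and that a fine-tuned system costs an initial training run plus one re-train every fixed number of months. It ignores inference and engineering costs on both sides.

```python
def rag_cumulative_cost(months: int, monthly_operating_cost: float) -> float:
    """Flat monthly cost for embeddings and vector storage (assumed constant)."""
    return months * monthly_operating_cost


def fine_tune_cumulative_cost(
    months: int,
    initial_training_cost: float,
    retrain_cost: float,
    retrain_every_months: int,
) -> float:
    """Initial training plus one re-train per completed drift interval."""
    retrains = months // retrain_every_months
    return initial_training_cost + retrains * retrain_cost


# Hypothetical figures, chosen only to illustrate the shape of the comparison.
RAG_MONTHLY = 300.0
FT_INITIAL = 4000.0
FT_RETRAIN = 1500.0

for months in (6, 24):
    rag = rag_cumulative_cost(months, RAG_MONTHLY)
    slow_drift = fine_tune_cumulative_cost(months, FT_INITIAL, FT_RETRAIN, 12)
    fast_drift = fine_tune_cumulative_cost(months, FT_INITIAL, FT_RETRAIN, 3)
    print(f"{months:>2} months: RAG={rag:.0f} "
          f"fine-tune(slow drift)={slow_drift:.0f} "
          f"fine-tune(fast drift)={fast_drift:.0f}")
```

Under these made-up numbers, RAG is cheaper at six months either way, and at twenty-four months the answer flips depending on the re-train interval, which is the drift dependence described above. Plug in your own figures before drawing conclusions.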
What we ship by default
On every engagement we open with the question “is RAG enough here?” — and on most engagements, the answer is yes. We ship a pgvector-backed retrieval layer, evals that measure citation accuracy, and a clean integration with the surface your team already uses.
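As an illustration of what a citation-accuracy eval can look like, here is a minimal sketch. It assumes one particular definition (the share of cited document ids that appear in a hand-labeled set of supporting documents) and a hypothetical case format with `cited` and `supporting` keys; other definitions are reasonable, and this is not a description of any specific eval suite.

```python
def citation_accuracy(cases: list[dict]) -> float:
    """Fraction of cited document ids that are in the labeled supporting set.

    Each case is a dict with:
      "cited":      list of document ids the answer cited
      "supporting": set of document ids a reviewer marked as supporting the answer
    Returns 0.0 when no citations were made at all.
    """
    cited_total = 0
    cited_correct = 0
    for case in cases:
        for doc_id in case["cited"]:
            cited_total += 1
            if doc_id in case["supporting"]:
                cited_correct += 1
    return cited_correct / cited_total if cited_total else 0.0


# Hypothetical eval cases: three citations in total, two of them supported.
cases = [
    {"cited": ["refunds", "shipping"], "supporting": {"refunds"}},
    {"cited": ["warranty"], "supporting": {"warranty", "returns"}},
]

score = citation_accuracy(cases)
```

A metric like this only checks that citations point at the right documents; it says nothing about whether the answer text itself is correct, so it is usually paired with other checks.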
If RAG isn't enough, we'll tell you in week one. We won't spend your budget proving it.