RAG vs fine-tuning: which is right for you?

We get asked this on roughly every other first call. The question usually arrives in the shape of “we're considering fine-tuning a model on our data” — and the answer, nine times out of ten, is “you don't need to.”

Here's the decision tree we walk teams through, in the order we walk them.

Default: start with RAG

Retrieval-augmented generation (pulling relevant context from a vector store and passing it to the model at inference time) solves the problem most teams have: “the model is generally smart, but it doesn't know our specifics.”

RAG is cheap to iterate on. You can change the corpus daily. You can add a new document and have it available to the model as soon as it's embedded and indexed. You don't need GPUs. You don't need a training pipeline. You don't need to wait six hours for a fine-tune job. For most internal-search, internal-copilot, and customer-support use cases, RAG is the right answer, and you're done.
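To make the retrieve-then-prompt loop concrete, here is a minimal sketch in plain Python. It is an illustration under loud assumptions, not a production design: the "embedding" is a toy bag-of-words count standing in for a real embedding model, and `ToyVectorStore`, `embed`, and `build_prompt` are hypothetical names invented for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self.docs: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Adding a document is just "embed and append": no training step.
        self.docs.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question: str, store: ToyVectorStore, k: int = 2) -> str:
    """Retrieve the top-k documents and place them in the prompt as context."""
    context = "\n".join(f"- {c}" for c in store.search(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
store.add("Refunds are processed within 5 business days.")
store.add("The office is closed on public holidays.")
store.add("Enterprise plans include single sign-on.")  # searchable right after add()

prompt = build_prompt("How long do refunds take?", store, k=1)
print(prompt)
```

The point of the sketch is the shape of the loop: updating what the model "knows" is a call to `add`, not a training job.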

Three real signals to consider fine-tuning

We'll recommend fine-tuning over (or in addition to) RAG when one of these is true:

The output needs to follow a specific style or format that's hard to prompt. Brand voice. Legal-document structure. Code in your house style. Things you can show ten examples of and have the model still drift back to its prior.
Latency is critical and prompt size is the bottleneck. Smaller fine-tuned models can outperform a larger model + RAG context for narrow tasks. We've seen 40-token completions go from 800ms to 120ms by moving from gpt-4o + RAG to a fine-tuned Llama for the same task.
Compliance requires the model to never “see” certain data at inference time. If sensitive data has to be in the model's weights instead of in a prompt, fine-tuning is the answer.

The hybrid case

Sometimes you want both — a fine-tuned base model that knows your style and structure, with RAG on top for facts and fresh data. This is most common in product-facing use cases where output quality is public and brand-sensitive.

Don't start here. Start with RAG. Add fine-tuning if and when a specific quality bar in your eval suite isn't reachable any other way.
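The decision tree described so far can be sketched as a small function. This is one possible reading, not a formal rule: the `Signals` fields, the `needs_fresh_facts` flag, and the returned labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical checklist mirroring the three signals described above."""
    style_hard_to_prompt: bool = False          # output drifts from the required style or format
    latency_bound_by_prompt_size: bool = False  # prompt size is the latency bottleneck
    data_must_stay_out_of_prompts: bool = False # compliance constraint on inference-time data
    needs_fresh_facts: bool = True              # assumption: most systems still need current data


def recommend(s: Signals) -> str:
    """Return 'rag', 'fine-tune + rag', or 'fine-tune' (labels are illustrative)."""
    any_signal = (
        s.style_hard_to_prompt
        or s.latency_bound_by_prompt_size
        or s.data_must_stay_out_of_prompts
    )
    if not any_signal:
        return "rag"  # the default: start here
    if s.needs_fresh_facts:
        return "fine-tune + rag"  # hybrid: tuned style, retrieved facts
    return "fine-tune"


print(recommend(Signals()))
print(recommend(Signals(style_hard_to_prompt=True)))
print(recommend(Signals(latency_bound_by_prompt_size=True, needs_fresh_facts=False)))
```

In practice each flag should be backed by a failing eval rather than a hunch, which is why the default path returns RAG.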

Costs over time

A RAG system has roughly flat operating cost as your data grows — you pay for embeddings and vector storage, both of which are cheap. A fine-tuned system has near-zero marginal data cost but a periodic re-training cost when your data shifts. At six months in, RAG is usually cheaper. At twenty-four months in, it depends on how much your domain drifts.
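A back-of-the-envelope model can make the comparison easier to reason about. Every number below is a made-up placeholder, not a benchmark or a quote, and the model deliberately ignores inference and hosting costs on both sides. The function names and the "retrain every N months" simplification are assumptions for illustration.

```python
def rag_cost(months: int, monthly_cost: float) -> float:
    """Cumulative RAG cost, assuming a flat monthly bill for embeddings and storage."""
    return months * monthly_cost


def fine_tune_cost(
    months: int,
    initial_training: float,
    retrain_cost: float,
    retrain_every_months: int,
) -> float:
    """Cumulative fine-tuning cost: one initial run plus periodic retrains."""
    retrains = months // retrain_every_months
    return initial_training + retrains * retrain_cost


# Hypothetical placeholder figures (arbitrary currency units).
RAG_MONTHLY = 200.0
FT_INITIAL = 3000.0
FT_RETRAIN = 500.0

for months in (6, 24):
    for label, every in (("slow drift", 12), ("fast drift", 3)):
        rag = rag_cost(months, RAG_MONTHLY)
        ft = fine_tune_cost(months, FT_INITIAL, FT_RETRAIN, every)
        cheaper = "RAG" if rag <= ft else "fine-tuning"
        print(f"{months:>2} months, {label}: RAG={rag:.0f} fine-tune={ft:.0f} -> {cheaper}")
```

With these placeholder inputs, RAG is cheaper at six months in both scenarios, and the 24-month result flips depending on how often retraining is needed. Swap in your own figures before drawing conclusions.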

What we ship by default

On every engagement we open with the question “is RAG enough here?” — and on most engagements, the answer is yes. We ship a pgvector-backed retrieval layer, evals that measure citation accuracy, and a clean integration with the surface your team already uses.
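As an illustration of what a citation-accuracy eval might look like, here is a minimal sketch. The metric definition (the share of cited document IDs that actually support the answer) is an assumption made for this example, and `EvalCase` and `citation_accuracy` are hypothetical names, not part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One eval example: what the model cited vs. what actually supports the answer."""
    question: str
    cited_ids: list[str]      # document IDs the model cited in its answer
    supporting_ids: set[str]  # hand-labeled IDs that genuinely support the answer


def citation_accuracy(cases: list[EvalCase]) -> float:
    """Fraction of all citations that point at a genuinely supporting document."""
    total = sum(len(c.cited_ids) for c in cases)
    if total == 0:
        return 0.0  # no citations at all: treat as a failing score
    correct = sum(
        1 for c in cases for doc_id in c.cited_ids if doc_id in c.supporting_ids
    )
    return correct / total


cases = [
    EvalCase("How long do refunds take?", ["doc-1"], {"doc-1"}),
    EvalCase("Is SSO included?", ["doc-3", "doc-2"], {"doc-3"}),
    EvalCase("When is the office closed?", ["doc-2"], {"doc-2"}),
]
score = citation_accuracy(cases)
print(f"citation accuracy: {score:.2f}")
```

A score like this is only as good as the hand-labeled `supporting_ids`, so the labeled set is worth reviewing whenever the corpus changes.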

If RAG isn't enough, we'll tell you in week one. We won't spend your budget proving it.
