Here's the reflex worth unlearning. Your RAG system retrieves mediocre chunks, the answers are vague, and the instinct is to fine-tune the language model — teach it your domain. But if retrieval handed the model the wrong passages, a smarter generator just writes a more fluent wrong answer. The bottleneck is upstream, in the part almost nobody fine-tunes: the embedding model that decides which chunks the model ever sees.

Fine-tuning that little model is one of the highest-leverage, lowest-cost moves in the whole stack — and the evidence for how cheap and how effective it is has gotten unusually concrete.

The numbers are smaller than you think

In a widely reproduced worked example, fine-tuning BGE-base on 6,300 query–chunk pairs lifts retrieval quality by about 7.4% on NDCG@10. Not 6.3 million pairs. Not a labeled dataset you commissioned. 6.3k pairs, a few minutes on a single GPU, a few dollars.
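If NDCG@10 is unfamiliar, here is a minimal sketch of how the metric can be computed for one query. It assumes the linear-gain form of DCG (some implementations use an exponential gain instead; for binary relevant/not-relevant labels the two agree), and the function names are illustrative rather than taken from any library.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (rank 1 comes first)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# One relevant chunk per query, as in the synthetic setup:
# the score depends only on how high that chunk was ranked.
top_hit = ndcg_at_k([1, 0, 0, 0, 0])    # relevant chunk at rank 1
third_hit = ndcg_at_k([0, 0, 1, 0, 0])  # relevant chunk at rank 3
```

With a single relevant chunk per query, the score reduces to 1 / log2(rank + 1), so moving the right chunk from rank 3 to rank 1 doubles that query's score.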

And you likely don't have to label any of it. The standard recipe is synthetic: take each chunk of your corpus, hand it to an LLM, and ask it to write a handful of questions that chunk would answer. Those (question, chunk) pairs are your positive training examples. You can fine-tune an embedding model on a body of documents that has never had a single labeled query against it — which is to say, on basically any corpus you already own.
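The synthetic recipe can be sketched in a few lines. This is an illustration under assumptions, not a definitive pipeline: `generate_questions` is a hypothetical callable standing in for whatever LLM client you use, and the `anchor` / `positive` field names are a convention assumed here, not something the recipe requires.

```python
from typing import Callable, Iterable

def build_training_pairs(
    chunks: Iterable[str],
    generate_questions: Callable[[str, int], list[str]],
    questions_per_chunk: int = 3,
) -> list[dict[str, str]]:
    """Turn unlabeled chunks into (question, chunk) positive pairs."""
    pairs = []
    for chunk in chunks:
        for question in generate_questions(chunk, questions_per_chunk):
            question = question.strip()
            if question:  # skip empty generations
                pairs.append({"anchor": question, "positive": chunk})
    return pairs

def fake_generate_questions(chunk: str, n: int) -> list[str]:
    """Hypothetical stand-in for an LLM call, so the sketch runs offline."""
    return [f"Question {i + 1} about: {chunk[:30]}" for i in range(n)]

corpus = ["Refunds are processed within 14 days.",
          "API keys rotate every 90 days."]
pairs = build_training_pairs(corpus, fake_generate_questions, questions_per_chunk=2)
```

Keeping the generator behind a plain callable means the LLM can be swapped or mocked without touching the pairing logic.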

Almost all the lift is one ingredient

The non-obvious part isn't that fine-tuning helps. It's what helps. Throw more positive pairs at the model and you get marginal gains. The substantial improvement comes from hard negatives — documents that look relevant but aren't — because they're what teach the model the fine distinctions your domain actually turns on.

This is the entire thesis of NVIDIA's NV-Retriever, which topped the MTEB retrieval leaderboard on the strength of how it mines those negatives. The trap is false negatives: naively, the "hardest" negative is the passage most similar to the query — but past a threshold, that passage is often genuinely relevant, and training against it teaches the model to push away correct answers. NV-Retriever's positive-aware mining filters those out, keeping the hard-but-wrong and discarding the hard-but-actually-right.
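One way to picture positive-aware filtering is as a ceiling set relative to the positive's own score. The sketch below is not NVIDIA's implementation. The function name, the `max_ratio` value of 0.95, and the example scores are all illustrative assumptions, and it presumes similarity scores are positive (as with typical cosine scores between a query and plausible candidates).

```python
def mine_hard_negatives(candidates, positive_id, positive_score,
                        max_ratio=0.95, num_negatives=4):
    """Pick the most similar candidates that stay below a ceiling tied to
    the positive's score. `candidates` is a list of (doc_id, score) pairs."""
    ceiling = positive_score * max_ratio  # assumed threshold, tune per corpus
    kept = [(doc_id, score) for doc_id, score in candidates
            if doc_id != positive_id and score < ceiling]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest first
    return [doc_id for doc_id, _ in kept[:num_negatives]]

# Hypothetical retrieval scores for one query.
scored = [
    ("positive", 0.80),
    ("duplicate", 0.79),   # almost as similar as the positive: likely relevant
    ("near_miss", 0.74),
    ("related", 0.60),
    ("unrelated", 0.10),
]
negatives = mine_hard_negatives(scored, "positive", 0.80, num_negatives=2)
```

Here the near-duplicate is dropped as a probable false negative, and the two hardest remaining candidates are kept.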

The model doesn't learn what "relevant" means from a thousand right answers. It learns it from the wrong answers that were one degree away from right.

There's even a ceiling worth knowing: in practice the gains plateau around 40 negatives per query. More than that buys little. The loss function doing the work is usually MultipleNegativesRankingLoss, which also recruits the rest of the batch as in-batch negatives for free.
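To make "in-batch negatives" concrete, here is a dependency-free sketch of the idea behind that loss: each query is scored against every chunk in the batch, and cross-entropy rewards ranking its own chunk first. It is a teaching approximation rather than the library's code, and the `scale` of 20.0 with cosine similarity is an assumption about common defaults.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_negatives_loss(query_vecs, chunk_vecs, scale=20.0):
    """Mean cross-entropy where chunk i is the positive for query i and
    every other chunk in the batch serves as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, chunk) for chunk in chunk_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The aligned batch yields a loss near zero and the swapped one a large loss. Explicitly mined hard negatives would simply be extra columns in the same score matrix.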

The win-win nobody advertises

Fine-tuning is normally a tradeoff: more accuracy, more cost. Embeddings can be the rare exception, because of Matryoshka Representation Learning. Train with a Matryoshka loss and the model packs the most important information into the first dimensions of every vector, so you can truncate them later with almost no loss: ~99% of full quality at 128 dimensions, >99.5% at 256.

The startling consequence from the worked example: the fine-tuned model at 64 dimensions outperforms the off-the-shelf baseline at 768. That's a 12× smaller vector that's also more accurate on your data. Cheaper storage, faster nearest-neighbor search, lower latency — and better retrieval. You almost never get both directions at once. Here you do.
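Mechanically, using a Matryoshka-trained vector at a smaller size is just "keep the first dimensions, then renormalize". The sketch below shows that step plus the storage arithmetic behind the 12× figure. The helper names and the float32 (4 bytes per value) assumption are illustrative, and truncating like this only preserves quality if the model was actually trained with a Matryoshka-style loss.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale to unit length so that
    cosine and dot-product comparisons stay meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value

short = truncate_embedding([3.0, 4.0, 1.0, 2.0], dims=2)

full_size = index_size_bytes(1_000_000, 768)   # 3,072,000,000 bytes
small_size = index_size_bytes(1_000_000, 64)   #   256,000,000 bytes
ratio = full_size // small_size
```

For a million chunks, that is roughly 3 GB of raw vectors at 768 dimensions against about 256 MB at 64.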

Where the real cost hides

So why isn't everyone doing this? Because the cost isn't the training run; it's everything around it. Once you fine-tune the embedding model, every vector already in your index was produced by a different function than the one now embedding your queries, so you have to re-embed the entire corpus to make old and new vectors comparable. On a large index that migration is the expensive part, not the gradient descent. And a fine-tuned model is a snapshot of a moment: as your documents drift, the model that was tuned to last quarter's language slowly decays, and you're back at the keyboard.

That leaves an honest decision rule. Fine-tune embeddings when your corpus is stable and full of domain-specific language the base model never saw: legal, medical, internal jargon, a product nobody's written about. Don't bother when the domain is generic (a good off-the-shelf model already handles it) or when your data changes weekly.

And try the cheaper levers first. Better chunking, hybrid search, and especially a reranker often close most of the gap with zero training. Fine-tuning the embedding model is what you reach for when those run out — and it remains a smarter first move than fine-tuning the LLM, which can't retrieve what retrieval never found.