Your RAG system answers the question correctly and the stakeholder asks the only question that matters in production: says who? So you add "cite your sources" to the system prompt, ship it, and a week later someone notices the answer is footnoted to a passage that says the opposite. The citation was there. It just wasn't true.

That gap is the whole subject. A citation is a pointer, not a proof. It claims the model looked at a source; it does not claim the source supports the sentence. Once you internalize that, the engineering stops being a prompt-writing exercise and becomes an architecture question with a clear gradient of answers.

The two failures hiding in one feature

There are two ways a citation can be wrong, and they are independent. Recall: is every claim in the answer backed by some cited passage, or are there bald assertions with no footnote? Precision: do the cited passages actually support the claim they're attached to, or did the model staple a plausible-looking chunk to a sentence it made up?

A model can hit perfect citation recall and abysmal precision: footnote every sentence, and footnote them to the wrong thing.

This isn't theoretical. The ALCE benchmark — the standard academic harness for this exact task — scores generation on fluency, correctness, and citation quality, and it splits citation quality into recall and precision precisely because they fail separately. Its headline finding is sobering: even strong models leave a large fraction of their statements without complete support on the harder datasets. Most teams instrument recall ("did it cite anything?") and never measure precision, which is the failure that actually burns you.

The non-negotiable precondition: provenance has to survive

Before any strategy works, one thing has to be true: the model needs something real to point at. That means every retrieved chunk carries a stable id and a span locator — a unique chunk id, the source document identifier, and either character offsets, a page range, or a section anchor. And that metadata has to survive the whole trip: from the retriever, into the generation prompt, and back out into the rendered answer.

The common bug is a retriever that returns bare strings. You concatenate them into context, the model emits "[1]", and "[1]" maps to nothing because you never told the model what document one was. The fix is unglamorous: format each chunk in the prompt with its id visible (<doc id="d3" page="7">…</doc>), and keep a lookup table so a returned id resolves to a real source and span. If you skip this, every citation downstream is a hallucinated index.

Four strategies, one reliability gradient

With provenance in place, the choices line up from "cheap and hopeful" to "forced and verified."

1. Prompt instruction. Tell the model to cite the chunk ids inline, and specify the exact format and placement. This is the right first cut and it genuinely improves with specificity — naming the format and demanding a citation per sentence helps. But it's best-effort: nothing forces a citation, so coverage drifts, especially on the long, synthesized sentences that most need a source.

2. Structured output. Instead of free text with inline markers, have the model return a schema — an answer plus a citations list mapping each claim (or sentence span) to a chunk id and offset. Now the citation is a first-class, parseable field you can validate, render, and align. The shape is guaranteed; the mapping is still the model's judgment, but you've removed the "did it even emit a citation" class of bug. This is the workhorse for portable, provider-agnostic pipelines, and it pairs naturally with the structured-output tooling you're probably already using.

3. A provider citation API. The major labs now ship this as a managed feature, which forces the citation step rather than hoping for it. Anthropic's Citations API chunks your documents (sentence-level for text and PDFs), and returns a cited_text field plus a location object — char_location for text, page_location for PDFs, content_block_location for custom blocks — with start/end indices, and the cited_text doesn't count toward your output tokens. Anthropic reports its built-in citations lift recall accuracy up to 15% over typical hand-rolled implementations. Google's Gemini grounding returns groundingMetadata with groundingSupports that tie answer segments to source chunks by byte offset; OpenAI's Responses API emits file_citation annotations (file id, filename, position) from file_search. The cost is portability — you've bound your citation layer to one vendor.

4. Verify-after. None of the first three checks precision — whether the cited passage supports the claim. That's a separate pass: decompose the answer into atomic claims and check each for entailment against its cited evidence. This is exactly the RAGAS faithfulness metric — supported claims over total claims — and the academic lineage runs through Attributed QA's AIS/AutoAIS framework, which uses an NLI model to estimate whether evidence entails a statement. It's the only step that catches a confident, well-cited, wrong sentence, and it's the step everyone skips because it costs an extra model call per claim.

Where to stop

You don't need all four for every system. A low-stakes internal tool can live on a good prompt. A customer-facing answer that quotes a source should use structured output or a provider API so the spans are exact — and if you want to display character-level highlights rather than whole sentences, know that positional, fine-grained citation is its own harder problem, which is why benchmarks like ALiiCE exist to evaluate it. And anywhere a wrong-but-cited answer carries real cost — legal, medical, financial — the verify-after pass is not optional, because that's the only layer that closes the gap between pointing at a source and being supported by it.

The reflex is to treat citations as the last 5% of the build, a formatting flourish on top of a working pipeline. They're not. They're a property you design in from the retriever forward, or you don't really have.