Pick a coding agent and the argument starts immediately: Opus or GPT, this benchmark or that one. Almost nobody asks the question that decides whether the model's good idea ever reaches your disk intact: how does the agent actually write the change? That mechanism has a name — the edit format — and it is frequently the real bottleneck, not the model.

The three ways a model touches a file

There are three dominant formats: whole-file rewrites, search/replace blocks, and unified diffs. They sit on a single tradeoff line: token cost against apply-reliability.

The catch with the cheap formats is the same in every case: the model has to reproduce a slice of the existing file verbatim so the tool can find where to cut. Anthropic's reference implementation of the text editor tool is unforgiving about this. If your search string doesn't appear, you get "No replacement was performed, old_str ... did not appear verbatim", and if it appears more than once, the tool refuses and tells you to make it unique. Miss a space, hallucinate a line you couldn't see, and the edit bounces.
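To make the strictness concrete, here is a minimal sketch of an exact-match search/replace applier. It is not Anthropic's code; the names `EditError` and `apply_search_replace` are invented for illustration, and the error wording is only loosely modeled on the behavior described above.

```python
class EditError(Exception):
    """Raised when a search/replace edit cannot be applied safely."""


def apply_search_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one verbatim occurrence of `old` in `text` with `new`."""
    count = text.count(old)
    if count == 0:
        # A single missing space is enough to land here.
        raise EditError("search text did not appear verbatim")
    if count > 1:
        # Ambiguous location: refuse rather than guess which one was meant.
        raise EditError(f"search text appears {count} times; make it unique")
    return text.replace(old, new, 1)


source = "def area(w, h):\n    return w * h\n"
patched = apply_search_replace(source, "return w * h", "return abs(w * h)")
```

Calling it with "return w*h" instead of "return w * h" raises `EditError`, even though a human reader would call the two identical. That gap between "obviously the same" and "byte-for-byte the same" is where edits bounce.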

The non-obvious part: the format is often the bottleneck

Here is the thing the leaderboards bury. A weaker model on a forgiving format can beat a stronger model forced into a strict one — because the strict format is where capability leaks out.

Aider's code-editing leaderboard is built to expose exactly this. It reports two numbers per model: the code score (did the fix work) and the percent using correct edit format (could the model even produce a well-formed edit). When those diverge, the format is eating the model. In Aider's own data, llama3-70b on the diff format came out only 73.5% well-formed — more than a quarter of its attempts were malformed before correctness was even on the table. gemini-1.5-pro on diff-fenced landed at 87.2%. The format is silently capping the score.
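A rough way to see the two numbers pull apart is to track them separately for each attempt. The sketch below is not Aider's scoring code: the `Attempt` record, the `summarize` helper, and the ten sample results are all made up, and it simplifies by treating a malformed edit as a failed attempt.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    well_formed: bool  # the edit parsed and applied
    passed: bool       # the tests passed after applying it


def summarize(attempts):
    """Return headline score, format adherence, and score among applied edits."""
    total = len(attempts)
    well_formed = sum(a.well_formed for a in attempts)
    passed = sum(a.well_formed and a.passed for a in attempts)
    return {
        "pass_rate": 100 * passed / total,
        "well_formed_rate": 100 * well_formed / total,
        # What the model scores when the format does not get in the way.
        "pass_rate_given_well_formed": 100 * passed / well_formed if well_formed else 0.0,
    }


# Hypothetical run: 10 attempts, 3 malformed, 6 of the remaining 7 correct.
attempts = (
    [Attempt(True, True)] * 6
    + [Attempt(True, False)]
    + [Attempt(False, False)] * 3
)
report = summarize(attempts)
```

In this invented run the headline score is 60%, but among the edits that actually applied the model was right about 86% of the time. The difference is what the format cost.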

The cleanest proof is a single model held constant while only the format changes. Aider's unified-diff writeup measured GPT-4 Turbo on a laziness benchmark: 20% with the existing search/replace format, 61% once they switched it to unified diffs — a 3x cut in lazy "rest of code here" placeholder comments. Same weights, same prompt budget, same task. The only thing that moved was how the edit was expressed.

You can buy a smarter model, or you can stop wasting the one you have on a format it can't reliably produce. The second is cheaper and the leaderboards keep proving it.

This is why the format choice isn't cosmetic. Whole-file always applies and never mis-locates — Aider found that for files under a few hundred lines, full rewrites can actually beat diffs. But whole-file burns tokens and invites dropped code. Diffs are cheap and surgical until the model's memory of the file drifts and the patch won't land. There is no free lunch on this axis; every agent, from Claude Code to the CLI competition, is picking a point on it.
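The cost side of the tradeoff is easy to measure yourself. The sketch below uses Python's standard `difflib` to compare the size of a full rewrite against a unified diff for a one-line change in a 300-line file. The file contents are synthetic, and character count is only a rough stand-in for tokens.

```python
import difflib

# A synthetic 300-line file with a single changed line.
original_lines = [f"line_{i} = {i}\n" for i in range(300)]
edited_lines = list(original_lines)
edited_lines[150] = "line_150 = 'changed'\n"

# Whole-file format: the model must emit every line again.
whole_file = "".join(edited_lines)

# Unified diff: only the change plus a few lines of context.
diff = "".join(
    difflib.unified_diff(original_lines, edited_lines, fromfile="a.py", tofile="b.py")
)

print(f"whole file: {len(whole_file)} chars, unified diff: {len(diff)} chars")
```

The diff is a small fraction of the rewrite, and the gap widens as the file grows. The price is that every context line in the diff has to match the file on disk.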

The escape hatch: let the big model be lazy

The newest move breaks the tradeoff instead of navigating it. Fast-apply models split the job in two: the expensive frontier model emits a loose edit (only the changed lines, with "// ... existing code ..." markers standing in for everything it didn't touch), and a small, dedicated model does the mechanical merge into the full file.
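To show what the merge step has to work out, here is a deliberately naive, deterministic version. It is not how any fast-apply product works (those use a trained model precisely because this kind of anchoring is fragile). The function name `merge_loose_edit` and the rule that every chunk must start and end on an unchanged line are assumptions made for the sketch.

```python
MARKER = "// ... existing code ..."


def merge_loose_edit(original: str, snippet: str) -> str:
    """Toy merge: each chunk between markers must start and end on an
    unchanged line that also exists in the original file."""
    orig = original.splitlines()

    # Split the snippet into chunks of real lines, separated by marker lines.
    chunks, current = [], []
    for line in snippet.splitlines():
        if line.strip() == MARKER:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)

    merged, cursor = [], 0
    for chunk in chunks:
        # list.index raises ValueError when an anchor line is missing.
        start = orig.index(chunk[0], cursor)
        end = start if len(chunk) == 1 else orig.index(chunk[-1], start + 1)
        merged.extend(orig[cursor:start])  # untouched code before the chunk
        merged.extend(chunk)               # the edited region
        cursor = end + 1
    merged.extend(orig[cursor:])           # untouched code after the last chunk
    return "\n".join(merged) + "\n"


original = "\n".join([
    "function total(items) {",
    "  let sum = 0;",
    "  for (const item of items) {",
    "    sum += item.price;",
    "  }",
    "  return sum;",
    "}",
]) + "\n"

snippet = "\n".join([
    MARKER,
    "  for (const item of items) {",
    "    sum += item.price * item.qty;",
    "  }",
    MARKER,
])

merged = merge_loose_edit(original, snippet)
```

The toy breaks as soon as an anchor line is repeated, reindented, or slightly misremembered, which is the argument for handing placement to a model trained for it.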

The economics are stark. Morph's Fast Apply is a 7B model that merges edits at about 10,500 tokens per second with a claimed 98% accuracy, taking the original file plus the loose snippet and returning the complete merged result. Cursor pioneered the consumer version: its "instant apply" used a ~70B model with a technique it calls speculative edits to rewrite files at over 1,000 tokens per second. Because the apply model regenerates the full file rather than emitting a diff, there is no search string to mis-locate, which is where a brittle diff loses code.

The logic is almost obvious once stated: don't make your most expensive model spend its attention counting whitespace. Let it think about the change and offload placement to a model that does nothing else. It's the same instinct that separates Cursor, Windsurf, Copilot, and Claude Code at the product layer — the apply step is a real engineering surface, not a footnote.

What to actually do

If your agent keeps "failing to edit" a file, the model probably isn't the problem — the format is too strict for it. Loosen it: prefer whole-file or a fast-apply pipeline for weaker or smaller models, and reserve tight diffs for models that score high on format adherence. And when you read a coding benchmark, look for the second number. A headline score with no format-adherence column is telling you how smart the model is, not whether it can land the patch.
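One way to loosen the format without abandoning diffs is a tiered applier: try an exact match, then a match that ignores leading and trailing whitespace, and only then ask for a whole-file rewrite. This is a sketch under those assumptions, not any particular agent's logic; `find_unique` and `apply_with_fallback` are hypothetical names, and matching is done line by line.

```python
def find_unique(lines, needle, normalize):
    """Return the index of the single place `needle` matches, else None."""
    n = len(needle)
    if n == 0:
        return None
    target = [normalize(line) for line in needle]
    hits = [
        i
        for i in range(len(lines) - n + 1)
        if [normalize(line) for line in lines[i:i + n]] == target
    ]
    return hits[0] if len(hits) == 1 else None


def apply_with_fallback(text, old, new):
    """Return (patched_text, strategy); patched_text is None when the caller
    should fall back to requesting a whole-file rewrite."""
    lines, old_lines = text.splitlines(), old.splitlines()
    strategies = (
        ("exact", lambda s: s),
        ("trimmed", str.strip),  # tolerate indentation and trailing-space drift
    )
    for name, normalize in strategies:
        at = find_unique(lines, old_lines, normalize)
        if at is not None:
            merged = lines[:at] + new.splitlines() + lines[at + len(old_lines):]
            return "\n".join(merged) + "\n", name
    return None, "whole-file"


source = "def area(w, h):\n    return w * h\n"
# The model misremembered the indentation: two spaces instead of four.
patched, strategy = apply_with_fallback(
    source, "  return w * h", "    return abs(w * h)"
)
```

Here the exact tier misses and the trimmed tier lands the edit. Each looser tier trades a little safety for fewer bounced edits, so it is worth logging which tier fired for which model.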

The model picks the fix. The edit format decides whether you ever see it.