There is a category of open-weights model on the Hugging Face leaderboard that was never trained. Nobody ran a gradient step. Nobody held a dataset. Someone took two or three existing fine-tunes, did arithmetic on their weight tensors, and uploaded the result — and it scored at or near the top. The technique is model merging, and the most useful thing to understand about it is that the competing methods you'll see named — SLERP, TIES, DARE — are not a quality ladder. They are escalating answers to a single problem.
That problem is interference.
The one idea: a fine-tune is mostly a small, redundant nudge
Start with the unit everything else is built on. A "task vector," as defined by Ilharco et al. in Editing Models with Task Arithmetic, is just the fine-tuned weights minus the base weights: τ = θ_finetuned − θ_base. It's the change fine-tuning made, isolated as a direction in weight space. Their surprising result is that these vectors behave algebraically — add two task vectors and you get a model better at both tasks; negate one and the model selectively forgets that task while barely touching the others.
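A toy sketch of that algebra, assuming each "model" is just a flat list of floats standing in for its weight tensors. The function names and the two fine-tunes are hypothetical and exist only for illustration; real code would do the same arithmetic tensor by tensor over a checkpoint.

```python
# Toy illustration: a "model" is a flat list of floats (hypothetical stand-in
# for real weight tensors).

def task_vector(finetuned, base):
    """tau = theta_finetuned - theta_base, element by element."""
    return [f - b for f, b in zip(finetuned, base)]

def apply_task_vectors(base, vectors, scale=1.0):
    """theta_new = theta_base + scale * (sum of task vectors)."""
    summed = [sum(column) for column in zip(*vectors)]
    return [b + scale * s for b, s in zip(base, summed)]

base = [1.0, 2.0, 3.0, 4.0]
math_ft = [1.5, 2.0, 3.0, 4.0]  # hypothetical fine-tune A
code_ft = [1.0, 2.0, 2.0, 4.0]  # hypothetical fine-tune B

tau_math = task_vector(math_ft, base)  # [0.5, 0.0, 0.0, 0.0]
tau_code = task_vector(code_ft, base)  # [0.0, 0.0, -1.0, 0.0]

# Addition: one model carrying both changes.
merged = apply_task_vectors(base, [tau_math, tau_code])  # [1.5, 2.0, 2.0, 4.0]

# Negation: push the base away from the "math" direction.
forgot_math = apply_task_vectors(base, [[-t for t in tau_math]])  # [0.5, 2.0, 3.0, 4.0]
```

In this toy case the two task vectors touch different parameters, so nothing collides. The interesting cases are the ones where they overlap.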
The reason any of this is possible is linear mode connectivity: models fine-tuned from the same pretrained checkpoint tend to stay inside one low-loss basin, so a point partway between them is usually low-loss too. Model Soups (Wortsman et al.) leaned on exactly this, averaging the weights of dozens of fine-tunes of a single base that differ only in their hyperparameters. Their best ViT-G "soup" reached 90.94% top-1 on ImageNet, state of the art at the time, at the inference cost of one model rather than an ensemble.
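The simplest form of this is a uniform average over checkpoints that share a base. The sketch below assumes flat lists of floats in place of real weights, and `uniform_soup` is a hypothetical helper name. The paper also describes a greedy variant that only keeps a model if held-out accuracy improves, which is not shown here.

```python
def uniform_soup(models):
    """Average corresponding weights across fine-tunes of one shared base."""
    if not models:
        raise ValueError("need at least one model")
    n = len(models)
    return [sum(column) / n for column in zip(*models)]

# Three hypothetical fine-tuning runs of the same base, differing only in
# hyperparameters, each reduced to two weights for illustration.
runs = [
    [1.0, 2.0],
    [3.0, 2.0],
    [2.0, 5.0],
]

soup = uniform_soup(runs)  # [2.0, 3.0]
```

The result is a single set of weights, which is why inference costs the same as one model.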
Then DARE — Yu et al.'s memorably titled Language Models are Super Mario — delivered the finding the whole field now rests on. You can randomly drop 90%, and in places 99%, of a fine-tune's delta parameters, rescale the survivors by 1/(1−p) to preserve the expected magnitude, and lose almost no performance. Read that again: nine out of ten of the weight changes fine-tuning made are discardable. The nudge is tiny and redundant. That is the fact that makes merging robust, and it's the fact each method below is exploiting.
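A minimal sketch of drop-and-rescale, assuming the delta is a flat list of floats and using Python's standard `random` module. The function name `dare` and the toy delta are assumptions for illustration, not the authors' implementation.

```python
import random

def dare(delta, p, rng):
    """Zero each delta entry with probability p; scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    # rng.random() is uniform on [0, 1), so an entry survives with probability 1 - p.
    return [d * keep_scale if rng.random() >= p else 0.0 for d in delta]

rng = random.Random(0)        # seeded so the run is repeatable
delta = [0.01] * 100_000      # hypothetical fine-tuning delta
sparse = dare(delta, p=0.9, rng=rng)

kept_fraction = sum(1 for d in sparse if d != 0.0) / len(sparse)
mean_before = sum(delta) / len(delta)
mean_after = sum(sparse) / len(sparse)
# kept_fraction lands near 0.1, and mean_after stays close to mean_before:
# the rescale is what preserves the expected value of each entry.
```

Without the 1/(1−p) factor the surviving delta would be about ten times too small on average at p = 0.9.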
The methods, as answers to interference
When you sum two task vectors naively, they fight. Yadav et al. (TIES) name the two ways: redundancy — a flood of tiny, low-magnitude changes that don't matter individually but dilute the merge — and sign disagreement — for a given parameter, one model pushed it up and another pushed it down, so they cancel into noise. Every method is a strategy for not letting that happen.
The deltas are sparse and redundant. Every merge method is a different way of deciding which ones to keep and how to stop them from canceling each other out.
- SLERP — spherical linear interpolation. Blend exactly two models, but along the geodesic on the hypersphere their weight vectors define, not the straight Euclidean line. Linear interpolation cuts through the interior of the sphere, shrinking the vector norm and producing a degraded midpoint; SLERP stays on the high-density surface where trained weights actually live. It's the smoothest two-way mix, and nothing more — it can't take three models.
- Task Arithmetic — the additive baseline for many models. Sum the task vectors, scale, add back to the base. Composes skills; can also subtract them. But it does nothing about interference, so it degrades as the models you add disagree.
- TIES — Trim, Elect Sign, Merge. First trim each task vector to its top-magnitude entries (kill the redundancy), then elect a single winning sign per parameter by total magnitude across models (kill the sign conflict), then merge only the entries that agree with the elected sign. It's task arithmetic with the interference explicitly removed.
- DARE — not a merge but a preprocessor. Drop-and-rescale each model's deltas before you fuse them, so there's far less to collide. It's combinable with the others; mergekit ships it as dare_ties and dare_linear.
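To make the SLERP geometry concrete, here is a small sketch that compares the midpoint of a straight-line blend with the midpoint along the arc. It assumes weights flattened into plain lists; the helper names are hypothetical, and the fallback to linear interpolation for near-parallel vectors is one common way to handle that case, not a claim about any particular library.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def lerp(a, b, t):
    """Straight-line interpolation between a and b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def slerp(a, b, t, eps=1e-8):
    """Interpolate along the arc between a and b instead of the chord."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return lerp(a, b, t)
    cos_omega = sum(x * y for x, y in zip(a, b)) / (na * nb)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)                # angle between the two vectors
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:                    # (anti)parallel: arc is undefined
        return lerp(a, b, t)
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

a = [1.0, 0.0]
b = [0.0, 1.0]
mid_lerp = lerp(a, b, 0.5)    # norm shrinks to about 0.707
mid_slerp = slerp(a, b, 0.5)  # norm stays at about 1.0
```

Both inputs have norm 1. The linear midpoint loses roughly 29% of that norm, while the spherical midpoint keeps it.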
The progression is legible once you see the enemy: SLERP avoids the bad geometry, task arithmetic ignores interference, TIES surgically resolves it, DARE pre-empts it by deletion.
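The trim, elect, merge steps of TIES can be sketched on toy data as follows. This assumes flat lists of floats, a hypothetical `density` parameter for the fraction of entries kept, and averaging over the agreeing entries in the final step; it is an illustration of the idea, not the reference implementation.

```python
def trim(tau, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, int(density * len(tau)))
    threshold = sorted((abs(x) for x in tau), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept in this sketch.
    return [x if abs(x) >= threshold else 0.0 for x in tau]

def ties_merge(base, task_vectors, density=0.5, scale=1.0):
    trimmed = [trim(tau, density) for tau in task_vectors]
    merged = []
    for b, column in zip(base, zip(*trimmed)):
        total = sum(column)  # its sign is the sign with more total magnitude
        if total == 0.0:
            merged.append(b)  # nothing survived, or an exact tie: keep the base
            continue
        sign = 1.0 if total > 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        merged.append(b + scale * sum(agreeing) / len(agreeing))
    return merged

base = [2.0, 2.0, 2.0, 2.0]
tau_a = [1.0, -0.75, 0.0625, 0.125]   # hypothetical task vectors
tau_b = [-0.5, -0.25, 0.03125, 0.125]

naive = [b + x + y for b, x, y in zip(base, tau_a, tau_b)]  # [2.5, 1.0, 2.09375, 2.25]
merged = ties_merge(base, [tau_a, tau_b], density=0.5)      # [3.0, 1.5, 2.0, 2.0]
```

At the first parameter the two models disagree in sign. The naive sum lets them partly cancel (net +0.5), while the elected sign keeps the larger +1.0 update intact. The small entries in the last two positions are trimmed away before they can dilute anything.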
What this is good for — and what it isn't
The practical payoff is leverage. Merging is training-free: no labeled data, no optimizer, no gradient steps, and it runs on a memory-constrained CPU. The standard tool, mergekit (~7.2k stars), implements all of the above plus DELLA, Model Stock, and passthrough "frankenmerges" that stack layers to grow a model's depth. It's why a large share of the open leaderboard is merged rather than trained — the iteration loop is minutes, not GPU-days.
The hard limit is the one the whole mechanism depends on: homologous models. Same architecture, shared base lineage. Merge two fine-tunes of the same Llama checkpoint and the math holds; try to merge a Llama with a Qwen and the same parameter index means different things in each, the interpolation crosses high-loss barriers, and you get garbage. Cross-base merging needs permutation-matching or optimal-transport alignment first, and remains a research frontier rather than a mergekit one-liner.
So merging doesn't replace the work in LoRA vs QLoRA vs full fine-tuning or the tooling in Unsloth vs Axolotl vs Torchtune — it sits downstream of it. Fine-tuning creates specialists. Merging combines them, for free, in the time it takes to load the weights twice. The reason it's nearly free is the reason it works at all: most of what fine-tuning writes into a model is a small, redundant nudge, and arithmetic is enough to keep the nudges that matter.



