There are three ways to adapt a large model to your data, and they're separated by more than a slider. They're separated by orders of magnitude of GPU memory — the difference between needing a datacenter, needing one A100, and needing the card already in your workstation. The instinct is to read that as a quality ladder, where you pay memory for results. It mostly isn't. The thing that decides whether your fine-tune works is hiding somewhere else entirely.
Full fine-tuning: why the model size is the small number
When people budget for full fine-tuning, they look up the model's parameter count, double it for 16-bit weights, and get a number that turns out to be wildly optimistic. The weights are the cheap part.
With mixed-precision training and the Adam optimizer, the per-parameter bill is roughly 16 bytes: 2 bytes for the FP16 weight, 2 bytes for its gradient, and then 12 bytes of FP32 optimizer state (a master copy of the weight plus Adam's first and second moments, 4 bytes each). That's before activations, which scale with batch size and sequence length on top. So full fine-tuning needs about 8x the raw 16-bit weight memory (16 bytes against 2) just to hold the training state, and more once activations are counted. (This 16-byte breakdown is the standard engineering derivation, not a figure printed in any one paper; treat it as a rule of thumb.)
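As a rough sketch of that arithmetic, the per-parameter bill can be added up directly. The function names and the 7B example size are illustrative assumptions, and activations are deliberately left out.

```python
# Per-parameter training-state bytes for mixed-precision Adam,
# following the rule-of-thumb breakdown in the text.
BYTES_PER_PARAM = {
    "fp16_weight": 2,
    "fp16_gradient": 2,
    "fp32_master_weight": 4,
    "fp32_adam_first_moment": 4,
    "fp32_adam_second_moment": 4,
}


def full_ft_state_gb(n_params: float) -> float:
    """Training-state memory in GB (1 GB = 1e9 bytes), excluding activations."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


def fp16_weights_gb(n_params: float) -> float:
    """Memory for the 16-bit weights alone, in GB."""
    return n_params * BYTES_PER_PARAM["fp16_weight"] / 1e9


n_params = 7e9  # hypothetical 7B-parameter model, chosen for illustration
print(f"16-bit weights only: {fp16_weights_gb(n_params):.0f} GB")
print(f"full training state: {full_ft_state_gb(n_params):.0f} GB")
```

The gap between the two printed figures is the part a weights-only budget misses.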
That's the wall LoRA was built to climb over.
LoRA: train a thin update, freeze everything else
LoRA (Hu et al., 2021) starts from an observation about what fine-tuning actually changes. The weight update a model learns during adaptation has low "intrinsic rank": it can be approximated by a much smaller matrix. So instead of updating the full weight W₀, LoRA freezes it and learns a low-rank decomposition of the update. The forward pass becomes h = W₀x + BAx, where B and A are skinny matrices whose product BA = ΔW has rank at most r, and r is tiny next to the weight dimensions. The update is scaled by α/r, with A initialized randomly and B initialized to zero, so training starts from exactly the pretrained model.
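A minimal pure-Python sketch of that forward pass, using toy dimensions and plain lists rather than a tensor library. The helper names, sizes, and the 0.01 init scale are assumptions for illustration, not the paper's code.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 treated as frozen."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))  # low-rank path: k -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4  # toy sizes, chosen only for illustration
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # random init, r x k
B = [[0.0] * r for _ in range(d)]                                   # zero init, d x r
x = [random.gauss(0, 1) for _ in range(k)]

h_pretrained = matvec(W0, x)
h_lora = lora_forward(W0, A, B, x, alpha, r)
print(h_lora == h_pretrained)  # B is all zeros, so the update term vanishes

# For an assumed 4096 x 4096 weight at rank 8:
print(4096 * 4096, lora_param_count(4096, 4096, 8))
```

Because B starts at zero, the adapted layer reproduces the pretrained output before any training step, whatever A was initialized to.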
The numbers from the paper are striking. On GPT-3 175B, LoRA cut trainable parameters about 10,000x and GPU memory about 3x (roughly 1.2TB down to 350GB), and the saved artifact shrank from a 350GB checkpoint to about 35MB of adapter. Crucially, there's no added inference latency: because the update is linear, you can merge BA back into W₀ at deployment and ship a model indistinguishable in speed from a fully fine-tuned one. Swap tasks by subtracting one adapter and adding another.
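The merge-and-swap step can be sketched on a toy example. The 3x3 weight, the rank-1 adapter, and the helper names below are hypothetical, and the values are chosen so the float arithmetic is exact; with real weights you would compare within a tolerance instead.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add_scaled(W, D, s):
    """Return W + s * D, element-wise."""
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, D)]


W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0],
      [3.0, 0.0, 1.0]]          # frozen base weight (toy values)
B = [[0.5], [0.0], [-1.0]]      # d x r adapter factor, r = 1
A = [[1.0, 2.0, 0.0]]           # r x k adapter factor
alpha, r = 2.0, 1
scale = alpha / r

delta = matmul(B, A)                               # the low-rank update BA
W_merged = add_scaled(W0, delta, scale)            # merge for deployment
W_restored = add_scaled(W_merged, delta, -scale)   # subtract to swap adapters

x = [1.0, -1.0, 2.0]
h_adapter = [b + scale * u
             for b, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]
h_merged = matvec(W_merged, x)   # one matmul, no extra adapter path

print(h_adapter, h_merged)
```

After merging, inference is a single matrix-vector product per layer, which is why the adapter adds no latency.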
LoRA's real saving isn't the weights — it's that you stop storing a gradient and three optimizer slots for parameters you've frozen.
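To make that concrete, here is a back-of-the-envelope sketch using the same 16-byte rule of thumb: frozen parameters cost only their 16-bit weight, while adapter parameters pay the full bill. The 7B base size and the 40M adapter size are hypothetical, and activations are excluded.

```python
FROZEN_BYTES = 2      # FP16 weight only: no gradient, no optimizer state
TRAINABLE_BYTES = 16  # weight + gradient + three FP32 optimizer slots


def lora_state_gb(n_base_params: float, n_adapter_params: float) -> float:
    """Weights-plus-training-state memory in GB (1e9 bytes), no activations."""
    frozen = n_base_params * FROZEN_BYTES
    trainable = n_adapter_params * TRAINABLE_BYTES
    return (frozen + trainable) / 1e9


def full_ft_state_gb(n_params: float) -> float:
    """Same estimate when every parameter is trainable."""
    return n_params * TRAINABLE_BYTES / 1e9


base = 7e9       # hypothetical 7B base model
adapters = 40e6  # hypothetical adapter size; depends on rank and target modules

print(f"full fine-tuning: {full_ft_state_gb(base):.1f} GB")
print(f"16-bit LoRA:      {lora_state_gb(base, adapters):.2f} GB")
```

Almost all of the LoRA figure is the frozen base itself, which is the part QLoRA goes after next.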
QLoRA: quantize the part you're not training anyway
If the base model is frozen, why hold it in full 16-bit precision? That's the question QLoRA (Dettmers et al., 2023) answers. It stores the frozen base in 4-bit NormalFloat (NF4) — a datatype the authors argue is information-theoretically optimal for the normally-distributed weights you actually find in a trained network — and backpropagates through that 4-bit model into LoRA adapters that stay in 16-bit. Two more tricks make it fit: double quantization (quantizing the quantization constants, saving ~0.37 bits/param) and paged optimizers that spill to CPU memory during gradient-checkpoint spikes instead of OOM-ing.
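The ~0.37 bits/param figure can be reproduced with a few lines of arithmetic. The block sizes below (64 weights per block, 8-bit second-level constants in groups of 256) are the settings reported in the QLoRA paper as best I recall them, so treat them as assumptions; the 7B model size is hypothetical.

```python
BLOCK_SIZE = 64           # weights sharing one quantization constant (assumed)
SECOND_LEVEL_BLOCK = 256  # first-level constants sharing one FP32 constant (assumed)


def constant_overhead_bits(double_quant: bool) -> float:
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / BLOCK_SIZE  # one FP32 constant per block
    # 8-bit constant per block, plus one FP32 constant per 256 blocks
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)


def nf4_base_gb(n_params: float, double_quant: bool = True) -> float:
    """Size of a 4-bit base model in GB (1e9 bytes), constants included."""
    bits_per_param = 4 + constant_overhead_bits(double_quant)
    return n_params * bits_per_param / 8 / 1e9


saving = constant_overhead_bits(False) - constant_overhead_bits(True)
print(f"saving from double quantization: {saving:.3f} bits/param")
print(f"7B base in NF4 with double quantization: {nf4_base_gb(7e9):.2f} GB")
```

The saving looks small per parameter, but it is multiplied by every weight in the frozen base.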
The headline: fine-tuning a 65B model on a single 48GB GPU, dropping memory requirements from over 780GB to under 48GB, while matching 16-bit fine-tuning quality. Their Guanaco models reached 99.3% of ChatGPT's level on the Vicuna benchmark after 24 hours on one GPU. And the quality caveat is precise and worth knowing: NF4 with double quantization recovers 16-bit LoRA quality, but the older FP4 format lagged by about a point. The choice of 4-bit format is doing real work; it's not interchangeable.
For a 7B model, the practical landscape is approximately this: full FT wants 100GB+ (multi-GPU or an 80GB card), 16-bit LoRA lands around 16–20GB, and QLoRA fits in roughly 6–12GB, which is consumer-card territory. Those 7B figures are engineering estimates, not paper claims, and they drift with batch and sequence length, but the 10–20x gap between full FT and QLoRA is the durable shape.
The lever nobody tunes
Here's the part that reframes the whole decision. Teams agonize over the rank r — is 8 enough, should I push to 64 — as if it were the master dial. The evidence says it mostly isn't.
Biderman et al. (2024), in a paper bluntly titled LoRA Learns Less and Forgets Less, found two things. First, on genuinely hard target domains (programming and mathematics), LoRA can substantially underperform full fine-tuning, because full FT learns weight perturbations of much higher rank (10–100x) than a typical LoRA config can represent. So there's a real capacity ceiling, and on demanding domains it bites. Second, and as compensation, LoRA forgets less: it preserves the base model's out-of-domain abilities better than full FT, and better than regularizers like weight decay or dropout. LoRA behaves like a regularizer: less capacity to learn the new thing, less tendency to clobber the old things.
And the levers that actually move LoRA's quality, per the same work, are the learning rate (LoRA wants a higher one than full FT) and which modules you target — adapting all of them, attention and MLP, matters more than cranking the rank. The early convention of adapting only the attention projections was leaving quality on the table. Rank is a knob; learning rate and module coverage are the levers.
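One way to see how rank and module coverage trade against each other is to count trainable parameters. This sketch assumes Llama-7B-like dimensions and Hugging Face style module names; both are illustrative assumptions, and the counts say nothing by themselves about quality.

```python
HIDDEN = 4096         # assumed hidden size
INTERMEDIATE = 11008  # assumed MLP width
N_LAYERS = 32         # assumed number of transformer blocks

# (input_dim, output_dim) of each linear layer in one block (assumed names)
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_LINEAR = list(MODULE_SHAPES)


def lora_trainable_params(target_modules, r):
    """Total LoRA parameters: r * (d_in + d_out) per adapted layer."""
    per_block = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_block * N_LAYERS


attn_r64 = lora_trainable_params(ATTENTION_ONLY, r=64)
all_r16 = lora_trainable_params(ALL_LINEAR, r=16)
print(f"attention only, r=64: {attn_r64:,}")
print(f"all linear,     r=16: {all_r16:,}")
```

Under these assumptions, covering every linear layer at rank 16 trains fewer parameters than attention-only adapters at rank 64, so broader coverage need not cost more memory than a higher rank.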
So the real decision tree isn't "how much memory can I spare." It's: is my target domain far enough from the base model that I need full FT's higher-rank capacity? If yes, and you can afford it, full fine-tune. If no — which is most instruction-tuning and domain-adaptation work — LoRA or QLoRA will match it, and the choice between those two genuinely is just memory. Then spend your tuning budget on learning rate and target modules, not on the rank. If you're still deciding whether to fine-tune at all versus retrieve, that's the prior question — start with fine-tuning vs RAG before you price out a single GPU-hour.



