Every team that ships an open model hits the same fork: the full-precision weights are too big and too slow to serve, so you quantize. Then a tab explosion: GGUF, GPTQ, AWQ, and a dozen suffixes like Q4_K_M and w4a16. The comparison posts argue about perplexity to three decimals.
Here is the one thing that actually decides it, and it is not accuracy: the format you want is downstream of where the model runs. Get the deployment target right and the rest is tuning. Get it wrong and you will fight your serving engine for a week.
The split that matters: local vs. served
GGUF is for the machine in front of you — or in someone's pocket. It is the binary format of llama.cpp, the successor to the old GGML, and its whole reason for living is running well across wildly different hardware: CPU, GPU, and especially Apple Silicon via Metal.
If you run models through Ollama or LM Studio, you are already running GGUF whether you knew it or not — they wrap llama.cpp. GGUF's quant menu is the richest of the three: the k-quants (Q4_K_M, Q5_K_M, Q6_K) that replaced the legacy Q4_0 family, and the importance-matrix i-quants (IQ4_XS and friends) that squeeze a big model into a small card at the cost of more sensitivity. For local and offline work, GGUF is not a contender — it's the default.
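To make "squeeze a big model into a small card" concrete, here is a rough back-of-the-envelope sketch, not a measurement. The bits-per-weight values for Q4_0, Q6_K, and Q8_0 follow from their block layouts; the figures for the mixed types (Q4_K_M, Q5_K_M) are approximations that shift with model architecture, and the helper name `estimate_gib` is made up for illustration. It counts weights only and ignores the KV cache and runtime overhead.

```python
# Approximate bits per weight for common GGUF quant types.
# Q4_0, Q6_K and Q8_0 come from their block layouts; the *_K_M entries are
# rough, model-dependent averages and should be treated as assumptions.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q5_K_M": 5.7,   # approximate
    "Q4_K_M": 4.85,  # approximate
    "Q4_0": 4.5,
}


def estimate_gib(n_params: float, quant: str) -> float:
    """Estimate weight storage in GiB for a model with n_params parameters."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical 8-billion-parameter model.
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:>7}: ~{estimate_gib(8e9, quant):.1f} GiB")
```

The useful comparison is between that number and the memory you actually have. If the estimate for Q4_K_M already fits with room for context, there is little reason to drop to an i-quant.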
GGUF answers "how do I run this on my laptop." GPTQ and AWQ answer "how do I serve this to a thousand users on a GPU." Those are different questions, and the format is the answer, not the debate.
GPTQ and AWQ are GPU-server formats. Both are post-training methods, most often run at 4-bit, and both want calibration data, but they get there differently.
GPTQ (Frantar et al., ICLR 2023) uses approximate second-order information to quantize weights one layer at a time while minimizing the error introduced — famously quantizing a 175B model in a few GPU-hours. It is the old reliable: broadly supported, well understood, everywhere.
AWQ (Lin et al., MLSys 2024 Best Paper) starts from a sharper observation: weights are not equally important, and you can find the ~1% that matter most by looking at the activations, then protect them by scaling those channels up before quantization. Because it relies on neither backpropagation nor reconstruction, it doesn't overfit the calibration set, which tends to make it robust across tasks. On vLLM, AWQ and GPTQ are the fast path, accelerated by Marlin kernels.
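The "protect the weights that matter" idea can be seen in a toy, pure-Python sketch. This is not the AWQ implementation. It assumes a made-up layer with two input channels, a hand-picked scale of 2 for the salient channel (the real method derives scales from activation statistics and searches over them), and one quantization step per row rather than per group. The trick it illustrates: multiply a salient input channel's weights by `s` before rounding and divide the matching activation by `s`, which leaves the full-precision output unchanged but shrinks the rounding error that channel contributes.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization with one step size per row."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    step = max(abs(w) for w in row) / qmax
    return [round(w / step) * step for w in row]


def dot(row, x):
    return sum(w * a for w, a in zip(row, x))


# Toy layer: 61 output rows, 2 input channels.
# Channel 0 holds the largest weight, so it sets the quantization step.
# Channel 1 holds small weights but sees large activations: it is "salient".
weights = [[0.7, k / 100] for k in range(-30, 31)]
x = [1.0, 20.0]

# Hypothetical per-channel scales, chosen by hand for this sketch.
s = [1.0, 2.0]

naive_err = 0.0
awq_err = 0.0
for row in weights:
    y_true = dot(row, x)

    # Plain round-to-nearest on the original weights.
    y_naive = dot(quantize_row(row), x)

    # Activation-aware variant: scale weights up, scale activations down.
    scaled_row = [w * c for w, c in zip(row, s)]
    scaled_x = [a / c for a, c in zip(x, s)]
    y_awq = dot(quantize_row(scaled_row), scaled_x)

    naive_err += abs(y_naive - y_true)
    awq_err += abs(y_awq - y_true)

naive_err /= len(weights)
awq_err /= len(weights)
print(f"mean |error| naive:  {naive_err:.3f}")
print(f"mean |error| scaled: {awq_err:.3f}")
```

The scale only helps here because the scaled weights stay below the row maximum, so the step size does not grow. Push `s` too high and the step widens for every weight in the row, which is why the real method searches for the scale instead of fixing it.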
The trap nobody updated their READMEs for
Now the part that catches teams in 2026, and the reason this piece exists. The libraries you remember are dead.
AutoGPTQ and AutoAWQ both went read-only in 2025. Transformers dropped AutoGPTQ support. Yet the top Google result and half the Medium tutorials still tell you to pip install autoawq. The GPTQ and AWQ algorithms are perfectly alive and first-class in every serious serving stack; it's the tooling that consolidated. Two successors now own the ground:
For the Hugging Face / Transformers route, GPTQModel is the drop-in. For the vLLM route, llm-compressor is the path the project itself points you to: you compress to the compressed-tensors format and vLLM loads it natively, with FP8/INT8 (W8A8) options for newer datacenter GPUs. If you read one canonical reference before choosing, make it the vLLM quantization docs, which carry the hardware-compatibility matrix.
And if you're fine-tuning, not serving
One more format people conflate with the above and shouldn't: bitsandbytes. Its NF4 / QLoRA path quantizes on the fly so you can fine-tune a big model on one GPU. It is a training-time convenience, not a high-throughput serving format — reach for it when the job is adaptation, then export to GGUF or a compressed-tensors checkpoint for the actual deployment.
The decision rule
Don't pick a format to win a perplexity decimal. Pick where it runs, then tune the quant level:
- Laptop, Mac, edge, offline, Ollama/LM Studio → GGUF, a Q4_K_M or Q5_K_M to start, an i-quant only if you're squeezing.
- GPU serving on vLLM → AWQ or GPTQ, produced by llm-compressor (compressed-tensors) or GPTQModel, not the archived libs; consider FP8/INT8 on Hopper-class hardware. Pairs naturally with picking an inference engine.
- Fine-tuning on one GPU → bitsandbytes NF4 / QLoRA, then export — the same QLoRA story behind choosing a fine-tuning framework.
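The rule above is small enough to write down as a lookup. This is only a sketch of the list as stated. The function name `pick_format` and the target keys (`"local"`, `"vllm_serving"`, `"finetune"`) are invented for illustration, and real deployments will have more cases than three.

```python
def pick_format(target: str) -> dict:
    """Map a deployment target to a starting format, quant level, and tooling.

    The target keys are hypothetical labels for the three cases in the list.
    """
    rules = {
        "local": {
            "format": "GGUF",
            "start_with": ["Q4_K_M", "Q5_K_M"],
            "tooling": ["llama.cpp", "Ollama", "LM Studio"],
            "note": "Reach for an i-quant only if memory is tight.",
        },
        "vllm_serving": {
            "format": "AWQ or GPTQ",
            "start_with": ["4-bit weights"],
            "tooling": ["llm-compressor (compressed-tensors)", "GPTQModel"],
            "note": "Consider FP8/INT8 on Hopper-class hardware.",
        },
        "finetune": {
            "format": "bitsandbytes NF4 / QLoRA",
            "start_with": ["NF4"],
            "tooling": ["bitsandbytes"],
            "note": "Export to GGUF or compressed-tensors for deployment.",
        },
    }
    if target not in rules:
        raise ValueError(f"unknown target {target!r}; expected one of {sorted(rules)}")
    return rules[target]


if __name__ == "__main__":
    choice = pick_format("local")
    print(choice["format"], choice["start_with"])
```

Note what the function does not take as input: a perplexity score. The target is the only argument, which is the whole point of the rule.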
The formats survived the year. The tooling didn't. Check the archive banner before you trust the tutorial.



