Every team that ships an open model hits the same fork: the full-precision weights are too big and too slow to serve, so you quantize. Then a tab explosion: GGUF, GPTQ, AWQ, and a dozen suffixes like Q4_K_M and w4a16. The comparison posts argue about perplexity to three decimals.

Here is the one thing that actually decides it, and it is not accuracy: the format you want is downstream of where the model runs. Get the deployment target right and the rest is tuning. Get it wrong and you will fight your serving engine for a week.

The split that matters: local vs. served

GGUF is for the machine in front of you — or in someone's pocket. It is the binary format of llama.cpp, the successor to the old GGML, and its whole reason for living is running well across wildly different hardware: CPU, GPU, and especially Apple Silicon via Metal.

llama.cpp: LLM inference in C/C++; GGUF format, k-quants & i-quants; runs on CPU/GPU/Apple Silicon.
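To make "binary format" concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file. It assumes the little-endian v2/v3 layout as I understand it (4-byte magic, uint32 version, then uint64 tensor and metadata counts); `read_gguf_header` is a hypothetical helper, not part of llama.cpp, and the header bytes below are synthetic.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the leading header bytes of a GGUF file (assumed v2/v3, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least %d bytes" % HEADER_SIZE)
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    if version < 2:
        # Version 1 stored the counts as 32-bit values; this sketch does not handle it.
        raise ValueError("unsupported GGUF version: %d" % version)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only; the counts are made up.
fake_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = read_gguf_header(fake_header)
print(header)
```

Against a real file you would pass the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))`, where `path` is whatever model file you have on disk.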

If you run models through Ollama or LM Studio, you are already running GGUF whether you knew it or not: they wrap llama.cpp. GGUF's quant menu is the richest of the three: the k-quants (Q4_K_M, Q5_K_M, Q6_K) that replaced the legacy Q4_0 family, and the importance-matrix i-quants (IQ4_XS and friends) that squeeze a big model into a small card at the cost of more sensitivity to the calibration data behind that matrix. For local and offline work, GGUF is not a contender; it's the default.
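Tuning the quant level is mostly a memory-budget exercise, and a rough estimate is simple arithmetic. The sketch below assumes the per-block sizes as I understand them from llama.cpp's block layouts (bytes per block, weights per block); verify them against the source before relying on them. The 8-billion parameter count is a hypothetical example, and the result covers weights only, not the KV cache or activations.

```python
# (bytes per block, weights per block); assumed from llama.cpp's block layouts.
BLOCK_LAYOUT = {
    "Q4_0": (18, 32),
    "Q8_0": (34, 32),
    "Q4_K": (144, 256),
    "Q5_K": (176, 256),
    "Q6_K": (210, 256),
}


def bits_per_weight(quant_type: str) -> float:
    """Effective bits per weight, including the per-block scales."""
    block_bytes, block_weights = BLOCK_LAYOUT[quant_type]
    return block_bytes * 8 / block_weights


def estimated_weight_gib(n_params: float, quant_type: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight(quant_type) / 8 / 2**30


n_params = 8e9  # hypothetical 8B-parameter model
for quant_type in BLOCK_LAYOUT:
    print(quant_type, bits_per_weight(quant_type),
          round(estimated_weight_gib(n_params, quant_type), 2))
```

File suffixes such as Q4_K_M name a mix of tensor types rather than a single one, so real files tend to land somewhat above the pure Q4_K figure.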

GGUF answers "how do I run this on my laptop." GPTQ and AWQ answer "how do I serve this to a thousand users on a GPU." Those are different questions, and the format is the answer, not the debate.

GPTQ and AWQ are GPU-server formats. Both are post-training methods, typically run at 4 bits, and both want calibration data, but they get there differently.

GPTQ (Frantar et al., ICLR 2023) uses approximate second-order information to quantize weights one layer at a time while minimizing the error introduced — famously quantizing a 175B model in a few GPU-hours. It is the old reliable: broadly supported, well understood, everywhere.
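For readers who want the "second-order information" spelled out, this is the layer-wise objective and error-compensation step as I read them from the GPTQ paper. Here W is the layer's full-precision weight matrix, X the calibration inputs to that layer, F the set of weights not yet quantized, and w_q the weight being rounded; treat the notation as a summary rather than a full derivation.

```latex
% Layer-wise objective: match the layer's output on the calibration inputs
\hat{W} = \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2

% Second-order information: Hessian of that objective over the remaining weights F
H_F = 2 X_F X_F^{\top}

% After rounding w_q, the remaining weights in F are adjusted to absorb the error
\delta_F = -\frac{w_q - \operatorname{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}
```

The practical upshot is that each rounding error is partly cancelled by nudging the weights that have not been quantized yet, which is why the method needs calibration inputs at all.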

AWQ (Lin et al., MLSys 2024 Best Paper) starts from a sharper observation: weights are not equally important, and you can find the ~1% that matter most by looking at the activations, then protect them. Because it does no gradient-based reconstruction, it doesn't overfit the calibration set, which tends to make it robust across tasks. On vLLM, AWQ and GPTQ are the fast path, accelerated by Marlin kernels.
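The "protect the weights that matter" idea can be shown with a toy example. This is a simplified sketch, not the AWQ implementation: it uses symmetric round-to-nearest on one group of four weights, and the per-channel scale of 2.0 is hand-picked, whereas AWQ derives its scales from activation statistics on calibration data. All numbers and function names are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest: snap each weight to a shared step size."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    step = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero step
    return [round(w / step) * step for w in weights]


def output_error(weights, activations, channel_scales):
    """Absolute error of one output after quantizing the scaled weights.

    Each weight is multiplied by its channel scale before quantization and
    divided by it afterwards, which is equivalent to dividing the activation.
    """
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    dequantized = [q / s for q, s in zip(quantize_group(scaled), channel_scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(q * x for q, x in zip(dequantized, activations))
    return abs(approx - exact)


weights = [0.9, -0.31, 0.2, 0.05]
activations = [0.1, 0.1, 8.0, 0.1]  # channel 2 sees much larger activations

plain_error = output_error(weights, activations, [1.0, 1.0, 1.0, 1.0])
scaled_error = output_error(weights, activations, [1.0, 1.0, 2.0, 1.0])
print(plain_error, scaled_error)
```

Scaling up the weight on the high-activation channel gives it finer effective resolution without changing the group's step size, so the output error drops even though every weight is still stored at 4 bits.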

The trap nobody updated their READMEs for

Now the part that catches teams in 2026, and the reason this piece exists. The libraries you remember are dead.

AutoGPTQ/AutoGPTQ (Python, ★ 5.1k): the classic GPTQ packaging lib. ARCHIVED April 2025; the banner says use GPTQModel.
casper-hansen/AutoAWQ (Python, ★ 2.3k): the classic AWQ packaging lib. ARCHIVED May 2025; "officially deprecated," points to llm-compressor.

Both went read-only in 2025. Transformers dropped AutoGPTQ support. Yet the top Google result and half the Medium tutorials still tell you to pip install autoawq. The algorithms GPTQ and AWQ are perfectly alive and first-class in every serious serving stack — it's the tooling that consolidated. Two successors now own the ground:

ModelCloud/GPTQModel (Python, ★ 1.2k): active quantization toolkit (GPTQ + more); HF/Transformers, vLLM, SGLang; supplants AutoGPTQ/AutoAWQ for the HF route.
llm-compressor: vLLM's compression workflow (GPTQ, AWQ, SmoothQuant, FP8/INT8); outputs the compressed-tensors format.

For the Hugging Face / Transformers route, GPTQModel is the drop-in. For the vLLM route, llm-compressor is the path the project itself points you to: you compress to the compressed-tensors format and vLLM loads it natively, with FP8/INT8 (W8A8) options for newer datacenter GPUs. If you read one canonical reference before choosing, make it the vLLM quantization docs, which carry the hardware-compatibility matrix.
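Before trusting a tutorial, it can help to check what a downloaded checkpoint says about itself. The sketch below assumes the Hugging Face convention of a `quantization_config` entry with a `quant_method` field in `config.json`; the sample config is made up, and `quant_method_of` is a hypothetical helper.

```python
import json


def quant_method_of(config_text: str):
    """Return the quant_method recorded in a config.json, or None if absent."""
    config = json.loads(config_text)
    return config.get("quantization_config", {}).get("quant_method")


# Made-up config for illustration; a real one carries many more fields.
sample_config = json.dumps({
    "model_type": "llama",
    "quantization_config": {"quant_method": "compressed-tensors"},
})
print(quant_method_of(sample_config))
```

Values you are likely to see include `gptq`, `awq`, `compressed-tensors`, and `bitsandbytes`; a missing entry usually means the checkpoint is unquantized or uses a format, such as GGUF, that keeps its metadata elsewhere.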

And if you're fine-tuning, not serving

One more format people conflate with the above and shouldn't: bitsandbytes. Its NF4 / QLoRA path quantizes on the fly so you can fine-tune a big model on one GPU. It is a training-time convenience, not a high-throughput serving format — reach for it when the job is adaptation, then export to GGUF or a compressed-tensors checkpoint for the actual deployment.

bitsandbytes: 8-bit optimizers, LLM.int8(), and 4-bit NF4/QLoRA on-the-fly quantization for fine-tuning.

The decision rule

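Don't pick a format to win a perplexity decimal. Pick where it runs, then tune the quant level:

Laptop, phone, or Apple Silicon (llama.cpp, Ollama, LM Studio): GGUF, with a k-quant or i-quant sized to the memory you have.
GPU serving on vLLM: AWQ or GPTQ, compressed with llm-compressor into the compressed-tensors format.
Hugging Face / Transformers: GPTQ through GPTQModel.
Fine-tuning on one GPU: bitsandbytes NF4 / QLoRA, then export to GGUF or a compressed-tensors checkpoint for deployment.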

The formats survived the year. The tooling didn't. Check the archive banner before you trust the tutorial.