On March 30, 2026, Ollama did something quietly radical: on Apple Silicon, it tore out llama.cpp — the engine it had been built on — and replaced it with Apple's MLX. The preview numbers were not subtle. 57% faster prefill, 93% faster decode; on qualifying Macs, generation roughly doubled, from about 58 to 112 tokens per second.
If you run models locally on a Mac, this is the most consequential infrastructure change of the year, and it surfaces a decision most people have been making by accident. The two engines under every local-LLM app — Ollama, LM Studio, Jan — are MLX and llama.cpp. Knowing which one you're standing on, and why, tells you your real ceiling on speed, context length, and what hardware you can flee to later.
The same job, two philosophies
llama.cpp is the universal donor of local inference. Its whole personality is portability: one ggml codebase, shaped largely around CUDA's compute model, that gets translated onto every backend there is, from plain CPU to NVIDIA, AMD's ROCm, Vulkan, and Apple's Metal. That breadth is the reason it became the substrate for an entire ecosystem, and the reason GGUF, the file format that carries its quantized weights, is what you've probably already got a folder full of.
MLX makes the opposite bet. Apple released it in December 2023 as an array framework written for Apple Silicon, treating the chip's unified memory — where CPU and GPU address the same pool with no copying — as the architectural primitive from the first line, not a backend to translate onto. On the newest M5-class chips it can drive the GPU Neural Accelerators directly.
A portable engine pays a translation tax on every platform. A native engine collects that tax back — but only on the one platform it was born for.
The switch is a bottleneck story
Here's the part worth internalizing, because it reframes the benchmark wars. MLX's win is real but conditional, and the condition is where your bottleneck sits.
For models under roughly 14B parameters, decoding is compute-bound: the chip can feed weights faster than it can crunch them, so a runtime with tighter, native kernels pulls ahead. There, MLX leads by anywhere from 20% to 87%. But push to a 27B model and the story inverts — you become memory-bandwidth-bound, limited by how fast the chip can stream weights from unified memory, full stop. At that ceiling both engines run at nearly the same tokens per second, because the bottleneck is the silicon, not the software. The runtime stopped mattering; the chip took over.
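The bandwidth ceiling can be put in numbers. What follows is a minimal sketch, assuming each generated token has to stream every weight from memory once and ignoring KV-cache traffic and other overhead. The function name, the 4-bit quantization, and the 400 GB/s bandwidth figure are illustrative assumptions, not measurements from any particular Mac.

```python
def decode_ceiling_tps(params_billion: float,
                       bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/sec when every token streams all weights once.

    A runtime can fall short of this number, but it cannot exceed it.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    # Hypothetical chip: 400 GB/s of unified-memory bandwidth (assumed figure).
    for size in (8, 14, 27):
        ceiling = decode_ceiling_tps(size, bits_per_weight=4, bandwidth_gb_s=400)
        print(f"{size}B at 4-bit: at most {ceiling:.1f} tokens/sec")
```

Under those assumed numbers the 27B model tops out just under 30 tokens per second no matter which runtime is doing the work. The smaller the model, the higher the ceiling sits, and the more room there is for software to make a difference beneath it.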
So the honest reading of Ollama's switch isn't "MLX is faster." It's: on Apple's hardware, for the model sizes that fit comfortably in a Mac, the portability tax llama.cpp pays to support five backends finally cost more than it saved. That's a statement about Apple Silicon, not a verdict on llama.cpp.
Where llama.cpp still wins
Two places, and both are easy to forget in the rush to the faster number.
The first is long-context prefill. The benchmark comparisons that crown MLX on sustained decode also note that its prefill is comparatively slow at long context, while llama.cpp with FlashAttention chews through long prompts faster. If your workload is short prompts and long generations (chat, agents thinking out loud), MLX's profile is ideal. If it's stuffing a 100K-token document in and asking one question, the older engine may still beat it.
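To see why the workload shape decides the winner, it helps to split a request into its two phases. The sketch below is a rough model, not a benchmark: every throughput figure in it is invented for illustration, and the two profiles are hypothetical stand-ins for "faster decode, slower prefill" and the reverse.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Total time = time to read the prompt + time to generate the answer."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Invented rates, chosen only to show the crossover (tokens per second).
FAST_DECODE = {"prefill_tps": 400.0, "decode_tps": 100.0}
FAST_PREFILL = {"prefill_tps": 800.0, "decode_tps": 60.0}

if __name__ == "__main__":
    workloads = {
        "chat (500 in, 2000 out)": (500, 2000),
        "long document (100K in, 200 out)": (100_000, 200),
    }
    for name, (prompt, output) in workloads.items():
        a = request_seconds(prompt, output, **FAST_DECODE)
        b = request_seconds(prompt, output, **FAST_PREFILL)
        print(f"{name}: fast-decode {a:.1f}s, fast-prefill {b:.1f}s")
```

With these made-up rates the fast-decode profile wins the chat workload and loses the long-document one. Swap in your own measured rates before drawing conclusions about a real setup.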
The second is everywhere that isn't a Mac. MLX is tuned for Apple Silicon first; builds for Linux and CUDA exist, but the speed advantage described here is a Mac story. The instant your deployment story includes a Linux box with an NVIDIA card, or a Raspberry Pi, or a cloud GPU, llama.cpp's portability stops being overhead and becomes the entire point: one engine, one GGUF file, every machine you own.
There's also a quieter constraint. Ollama's MLX path requires 32GB or more of unified memory; 8GB and 16GB Macs stay on the old Metal route. The future arrived, but it checks your spec sheet at the door.
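If you're not sure which side of that line your machine falls on, a few lines of Python can tell you. This is a sketch under stated assumptions: it reads "32GB" as 32 GiB, it relies on `os.sysconf`, which should work on macOS and Linux but not Windows, and `expected_ollama_path` is a hypothetical helper that mirrors the rule as described here, not Ollama's actual selection logic.

```python
import os

# The 32GB threshold from the text, interpreted as GiB (assumption).
MLX_MIN_BYTES = 32 * 1024**3


def total_memory_bytes() -> int:
    """Physical memory as reported by the OS (POSIX systems only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def expected_ollama_path(total_bytes: int) -> str:
    """Hypothetical helper: which route the rule above would put you on."""
    return "mlx" if total_bytes >= MLX_MIN_BYTES else "metal (llama.cpp)"


if __name__ == "__main__":
    mem = total_memory_bytes()
    print(f"{mem / 1024**3:.0f} GiB detected -> {expected_ollama_path(mem)}")
```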
How to actually choose
Pick MLX if you're Mac-only, on 32GB+ of recent Apple Silicon, running models that fit with headroom, and your prompts are short. You'll get the fastest tokens-per-second available on the platform, and the ecosystem is voting with its feet.
Pick llama.cpp if you want one engine across mixed hardware, you lean on long-context prefill, or you've already standardized on GGUF and don't want to re-quantize your shelf. Portability is a feature you only notice the day you need it — and then it's the only feature.
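The two rules above compress into a small decision function. Treat it as a sketch of this article's heuristics rather than a definitive chooser: the `Setup` fields and `pick_engine` are names invented here, and it leaves out softer criteria such as chip generation and how much headroom the model leaves.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    mac_only: bool              # no Linux, NVIDIA, or cloud targets in the plan
    unified_memory_gb: int      # installed unified memory
    long_context_prefill: bool  # workload is dominated by very long prompts
    committed_to_gguf: bool     # existing GGUF library you don't want to redo


def pick_engine(s: Setup) -> str:
    """Apply the heuristics in order; any llama.cpp condition wins."""
    if not s.mac_only:
        return "llama.cpp"  # mixed hardware: portability is the point
    if s.long_context_prefill or s.committed_to_gguf:
        return "llama.cpp"
    if s.unified_memory_gb < 32:
        return "llama.cpp"  # below the 32GB line described earlier
    return "MLX"


if __name__ == "__main__":
    laptop = Setup(mac_only=True, unified_memory_gb=64,
                   long_context_prefill=False, committed_to_gguf=False)
    print(pick_engine(laptop))
```

The ordering is a design choice: portability and workload constraints are checked before memory, because no amount of RAM makes up for needing to run on a machine that isn't a Mac.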
Most people will touch neither directly; they'll pick Ollama, LM Studio, or Jan and inherit an engine. That's fine — but inherit it on purpose. The same question shows up one layer down in GGUF vs GPTQ vs AWQ when you choose how to compress the weights, and one layer up in vLLM vs SGLang vs Ollama when you outgrow your laptop and serve the thing. The engine isn't an implementation detail. It's the floor your whole local stack stands on.



