NVIDIA sells the DGX Spark as a supercomputer you can hold in one hand: a GB10 Grace Blackwell superchip, 128GB of unified memory, and a headline claim of running local models up to 200 billion parameters — all in a box the size of a hardback, for about $4,699. The remarkable part is that the pitch is basically true. You can put a 70-billion-parameter model on your desk with no cloud account and no data leaving the room.

The catch is that "you can load it" and "you can run it" are different sentences, and the spec that separates them is not on the box.

The number NVIDIA doesn't put on the box

Two figures do the marketing work: 128GB of coherent, CPU-and-GPU-shared memory, and roughly 1 PetaFLOP of FP4 compute (with the asterisk that the petaflop is sparse FP4 — structured-sparsity math that roughly halves to ~500 dense TFLOPS when your workload isn't full of zeros).

The figure that governs what the machine actually feels like is quieter: the unified LPDDR5X memory runs at about 273 GB/s. An H100 moves data across its HBM3 at ~3.35 TB/s. That is a ~12× gap, and for the thing you most want a local model to do — generate text — it is the entire story.

Here is why. Generating a token (the decode phase) is memory-bandwidth-bound, not compute-bound. To produce each new token, the hardware has to stream the model's active weights through the memory bus once. So the decode ceiling is, to a first approximation:

tokens per second ≈ memory bandwidth ÷ bytes read per forward pass

That is a formula you can run on a napkin, and it predicts the published benchmarks with uncomfortable accuracy. A 70B model in FP8 is ~70GB of weights; 273 GB/s ÷ 70GB ≈ 3.9 tok/s before overhead. And indeed, independent testers and NVIDIA's own developer forum report Llama 3.1 70B (FP8) decoding at roughly 2.7–3 tok/s on the Spark at batch 1, slower than most people read. Drop to a 4-bit quant (~35GB) and the same arithmetic predicts ~7–8 tok/s; the measurements land there too. An 8B model in FP16 (~16GB) predicts ~17 tok/s, and SGLang clocks it at ~20, a little above the napkin figure but in the same range. The bus, not the GPU, is doing the rationing: it is the same memory wall that decides B200 vs H200 vs H100 in the datacenter, only here the wall is a foot in front of you.
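The napkin arithmetic is easy to reproduce. The sketch below is illustrative only: the function name `decode_ceiling_tps` is hypothetical, the weight sizes are the rounded figures used in the text, and it assumes every token streams the full weights once while ignoring KV-cache reads and runtime overhead.

```python
# Napkin estimate of the decode ceiling: tokens/sec ≈ bandwidth / bytes read per token.
# Assumption: each generated token streams all model weights once; KV cache and
# runtime overhead are ignored.

SPARK_BANDWIDTH_GBPS = 273.0  # unified LPDDR5X bandwidth quoted in the article


def decode_ceiling_tps(weight_gb: float, bandwidth_gbps: float = SPARK_BANDWIDTH_GBPS) -> float:
    """Rough upper bound on tokens/sec for single-stream decode."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gbps / weight_gb


# Rounded weight footprints used in the text (GB).
models = {
    "70B FP8": 70.0,
    "70B 4-bit": 35.0,
    "8B FP16": 16.0,
}

ceilings = {name: decode_ceiling_tps(gb) for name, gb in models.items()}

for name, tps in ceilings.items():
    print(f"{name:>10}: ~{tps:.1f} tok/s ceiling")
```

Swapping in a different bandwidth figure (for example 3350.0 for the H100 number quoted earlier) shows how directly the estimate scales with the bus.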

The inversion

Put the two headline numbers together and you get the trap in one line:

The models the 128GB lets you fit are exactly the ones the 273 GB/s won't let you run interactively.

The memory is generous precisely so you can hold a 70B or a ~120B that fits on no single consumer GPU. But the bandwidth that has to feed that model is the same bandwidth whether the model is 8B or 120B — so the bigger the model you were sold on fitting, the slower each token comes back. A machine optimized to store large models is, by the same design, poorly optimized to serve them one stream at a time.
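One way to see the inversion is to run the napkin formula backwards: pick a decode speed and ask how many gigabytes of weights the bus can afford. This is a rough sketch, not a measurement. The helper `max_weight_gb` and the speed targets are hypothetical, and it assumes batch 1 with all weights streamed once per token.

```python
# Inverse of the napkin formula: how large can the weights be if you want a
# given single-stream decode speed? Assumes every token streams all weights once.

SPARK_BANDWIDTH_GBPS = 273.0  # figure quoted in the article
SPARK_MEMORY_GB = 128.0       # unified memory capacity quoted in the article


def max_weight_gb(target_tps: float, bandwidth_gbps: float = SPARK_BANDWIDTH_GBPS) -> float:
    """Largest weight footprint (GB) that can still reach target_tps at batch 1."""
    if target_tps <= 0:
        raise ValueError("target_tps must be positive")
    return bandwidth_gbps / target_tps


# Hypothetical speed targets; pick your own definition of "interactive".
budgets = {tps: max_weight_gb(tps) for tps in (5, 10, 20, 40)}

for tps, gb in budgets.items():
    share = gb / SPARK_MEMORY_GB
    print(f"{tps:>3} tok/s -> at most ~{gb:.1f} GB of weights ({share:.0%} of the 128GB)")
```

Under these assumptions, a 20 tok/s target leaves a weight budget of under 14GB, roughly a tenth of the memory the box advertises.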

If your mental model of "local AI" is a chatbot answering you fast, the Spark will disappoint you exactly where you expected it to shine.

What it's actually for

But calling the Spark slow is a misread, not a verdict. It is very good at two things the marketing curiously underplays.

The first is prefill — chewing through the prompt before the first token. Prefill is compute-bound, and compute is the thing the Spark has plenty of. It processes prompts at thousands of tokens per second (GPT-OSS 20B prefills at ~2,000 tok/s; 8B on SGLang exceeds 7,000). Long system prompts, big retrieved contexts, tool schemas — the stuff that makes an agent turn expensive — the Spark eats cheaply.
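To make the prefill-versus-decode split concrete, here is a minimal sketch of where the time goes in one request. The function `request_latency_s` and the request shape (8,000 prompt tokens, 200 output tokens) are hypothetical; the rates are the approximate 8B SGLang figures cited in this article, and real runs will vary with context length and engine settings.

```python
# Rough latency split for one request: prefill (prompt) time plus decode (output) time.
# The rates are inputs, not constants of the hardware.


def request_latency_s(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for a single request."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / prefill_tps, output_tokens / decode_tps


# Hypothetical agent turn: 8,000 prompt tokens in, 200 tokens out.
# Rates: ~7,000 tok/s prefill and ~20 tok/s decode, the 8B SGLang figures cited in the text.
prefill_s, decode_s = request_latency_s(8000, 200, prefill_tps=7000.0, decode_tps=20.0)

print(f"prefill: {prefill_s:.2f} s, decode: {decode_s:.2f} s")
```

With these inputs the 8,000-token prompt costs about a second while the 200-token reply costs ten, which is the sense in which a long prompt is cheap on this machine.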

The second is batching. When you run many sequences at once, a single read of the weights serves all of them, so the bandwidth tax gets amortized across the batch. This is where the Spark stops looking anemic. Llama 3.1 8B climbs from ~20 tok/s at batch 1 to ~368 tok/s at batch 32, and roughly 924 tok/s at 128 concurrency; a Qwen3-Coder 30B (A3B) MoE hits ~483 tok/s at batch 64. Point a real serving engine at it (vLLM, SGLang, or LMDeploy, the same stack you'd run in the cloud) and continuous batching does the amortizing for you. The box that gives you ~3 tok/s for one big model will give you hundreds of tok/s spread across a crowd of small ones.
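The trade batching makes is visible in the numbers themselves. The sketch below only rearranges the Llama 3.1 8B figures quoted above; the helper `batching_summary` is hypothetical, and it assumes "batch" and "concurrency" both mean the number of sequences decoded together.

```python
# Per-stream vs aggregate throughput, using the Llama 3.1 8B figures quoted above.
# Assumption: "batch" and "concurrency" both mean sequences decoded together.

measured_tps = {1: 20.0, 32: 368.0, 128: 924.0}  # concurrent sequences -> aggregate tok/s


def batching_summary(aggregate_by_batch: dict[int, float]) -> list[tuple[int, float, float, float]]:
    """Return (batch, aggregate tok/s, per-stream tok/s, gain vs smallest batch)."""
    baseline = aggregate_by_batch[min(aggregate_by_batch)]
    rows = []
    for batch in sorted(aggregate_by_batch):
        aggregate = aggregate_by_batch[batch]
        rows.append((batch, aggregate, aggregate / batch, aggregate / baseline))
    return rows


rows = batching_summary(measured_tps)

for batch, aggregate, per_stream, gain in rows:
    print(f"batch {batch:>3}: {aggregate:6.0f} tok/s total, "
          f"{per_stream:5.1f} tok/s per stream, {gain:4.1f}x batch-1 throughput")
```

Aggregate throughput rises about 46x by 128 concurrent sequences while each individual stream slows from ~20 to roughly 7 tok/s, so the gain only counts if nobody is waiting on a single stream.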

That crowd is the point — and it happens to be this publication's whole beat. An agent fleet is not one genius answering you in real time; it is dozens of small, prompt-heavy, non-interactive calls running in parallel, overnight, on data you'd rather not ship to anyone. That workload is prefill-heavy and batch-shaped: the two axes the Spark is built for. A bandwidth-poor, batch-rich machine is the wrong tool for a chatbot and close to the right tool for a swarm.

The buying rule

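So decide by the workload, not the parameter count on the box. If what you want is one large model answering one person in real time, the bandwidth sets the pace: a 70B at roughly 3 tok/s, and no amount of spare memory changes that. If what you want is many small, prompt-heavy calls running in parallel on data that stays in the room, the Spark is close to the right tool.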

The Spark isn't slow and it isn't fast. It's a throughput device wearing a latency device's marketing. Buy it for the width of the road, never the speed limit — and if you're building agents rather than chatting with one, the width is exactly what you were short on.