A year ago, if you wanted to train an agent with reinforcement learning, the hard part was the algorithm. You fought to stabilize PPO, you stood up a reward model, you watched the policy collapse and started over. The optimizer was where the expertise — and the moat — lived.

That is over. Group Relative Policy Optimization, the method that powered DeepSeek-R1, did to RL fine-tuning what FastAPI did to web servers: it made the hard thing a default. GRPO throws out the separate critic model, samples a group of answers to the same prompt (sixteen is a common group size), and scores each one relative to the group: its reward minus the group's mean, usually divided by the group's standard deviation. It is stable, it is cheap to reason about, and it ships in every major training library: TRL, veRL, OpenRLHF. Its successors, DAPO among them, and the wider RLVR (reinforcement learning with verifiable rewards) line of work are iterations on the same commodity. If you have read our veRL vs OpenRLHF vs TRL and GRPO vs PPO breakdowns, you already have the algorithm. Nobody is winning on the optimizer anymore.
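To make "relative to the group" concrete, here is a minimal sketch of the advantage calculation only, not a full GRPO trainer. The function name, the epsilon, and the choice of population standard deviation are assumptions for illustration; real libraries differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer against its own group (hypothetical helper).

    The group mean stands in for the critic's baseline; dividing by the
    group's standard deviation keeps the scale comparable across prompts.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # assumption: population std; libraries vary
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to one prompt: two passed the check, two failed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every answer in a group gets the same reward, every advantage is zero,
# so that prompt contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Answers that beat the group average get a positive advantage and the rest get a negative one, which is all the signal the policy update needs. No learned value model is involved.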

So where did the difficulty go? It moved one layer down the stack, to the thing the algorithm trains against: the environment.

An environment is a task, a harness, and a verifier

Strip the jargon and an RL environment is three concrete things. A dataset of task inputs. A harness the model acts through — the tools it can call, the sandbox it runs in, the context it's handed. And a reward: a function or rubric that looks at what the model did and returns a number. Prime Intellect's verifiers library, at 4.2k stars, defines it in exactly those words — "a dataset of task inputs, a harness for the model, and a reward function or rubric."
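In code, those three parts fit in a few lines. The sketch below is a toy, and every name in it (Task, Environment, run_harness, exact_match) is hypothetical; it is not the verifiers library's API, just the same three-part shape.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    prompt: str
    expected: str


@dataclass(frozen=True)
class Environment:
    dataset: Sequence[Task]                                # task inputs
    harness: Callable[[Callable[[str], str], Task], str]   # how the model acts
    reward: Callable[[Task, str], float]                   # what it did -> number


def run_harness(policy, task):
    # A real harness would expose tools and a sandbox; this one only
    # passes the prompt through and trims whitespace from the answer.
    return policy(task.prompt).strip()


def exact_match(task, output):
    return 1.0 if output == task.expected else 0.0


env = Environment(
    dataset=[Task("2 + 2 =", "4"), Task("3 * 3 =", "9")],
    harness=run_harness,
    reward=exact_match,
)


def toy_policy(prompt):
    # Stand-in for a model: always answers "4".
    return " 4 "


scores = [env.reward(task, env.harness(toy_policy, task)) for task in env.dataset]
```

The toy policy gets the first task right and the second wrong, so the environment returns one number per task. That list is the whole interface a trainer or a dashboard needs.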

The reward is where the real shift hides. RLHF's reward came from a trained model fed human preference data — expensive, noisy, and itself a research project. The 2026 environment's reward is verifiable: did the code pass its tests, did the SQL return the right rows, did the agent satisfy a checkable rule. You are not learning a reward; you are computing one. That is what makes crowdsourcing environments tractable, and it is why Prime Intellect could launch an Environments Hub with 2,500-plus open-source environments and pitch it, without irony, as "the GitHub for RL environments." Their training framework, prime-rl, is built to run a single asynchronous job across a thousand-plus GPUs — and it consumes those Hub environments natively. The compute and the optimizer are solved plumbing. The environments are the supply.
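"Computing a reward" can be as plain as the following sketch for the SQL case. It uses Python's standard sqlite3 module; the schema, the function name, and the order-insensitive comparison are assumptions chosen for illustration.

```python
import sqlite3


def sql_reward(candidate_sql, expected_rows, setup_sql):
    """Return 1.0 if the candidate query yields the expected rows, else 0.0."""
    conn = sqlite3.connect(":memory:")  # fresh, isolated database per check
    try:
        conn.executescript(setup_sql)
        rows = conn.execute(candidate_sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return 0.0  # invalid SQL earns nothing
    finally:
        conn.close()
    # Assumption: row order does not matter for this task.
    return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0


SETUP = """
CREATE TABLE users (id INTEGER, name TEXT, active INTEGER);
INSERT INTO users VALUES (1, 'ada', 1), (2, 'bob', 0), (3, 'cy', 1);
"""
EXPECTED = [("ada",), ("cy",)]

good = sql_reward("SELECT name FROM users WHERE active = 1", EXPECTED, SETUP)
bad = sql_reward("SELECT name FROM users", EXPECTED, SETUP)
broken = sql_reward("SELEC name FROM users", EXPECTED, SETUP)
```

Nothing here is trained or labeled by a human. The check is deterministic, so anyone can write one, run it, and audit it.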

Environment quality is the new differentiator. The teams shipping reliable agents are not the ones with a cleverer loss function — they are the ones who built realistic, reproducible places for the agent to fail safely.

The part nobody says out loud: your eval is a half-built environment

Here is the insight worth the read. Look again at what an environment is — inputs plus a scorer — and look at what an evaluation is. Inputs plus a scorer. They are the same artifact. The only difference is which direction the score flows: an eval reports it to you; an environment feeds it back to the model as a reward.
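A small sketch shows how little separates the two. The task list, scorer, and function names are all hypothetical; the point is only that one scorer serves both directions.

```python
from statistics import mean

# Shared artifact: inputs plus a scorer.
TASKS = [("2 + 2 =", "4"), ("3 * 3 =", "9")]


def score(expected, output):
    return 1.0 if output.strip() == expected else 0.0


def run_eval(policy):
    """Eval direction: the score is aggregated and reported to a person."""
    return mean(score(expected, policy(prompt)) for prompt, expected in TASKS)


def collect_rollouts(policy):
    """Training direction: same tasks, same scorer, score kept as a reward."""
    rollouts = []
    for prompt, expected in TASKS:
        output = policy(prompt)
        rollouts.append(
            {"prompt": prompt, "completion": output, "reward": score(expected, output)}
        )
    return rollouts


def toy_policy(prompt):
    # Stand-in for a model: always answers "4".
    return "4"


accuracy = run_eval(toy_policy)       # a number for a dashboard
batch = collect_rollouts(toy_policy)  # records a trainer could consume
```

Both functions call the same score on the same tasks. The only difference is whether the result ends up in a report or in a training batch.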

This is not a metaphor; it is literally how the tooling is built. The verifiers library is described as a library for "RL environments and evals," and its environments are "the foundational unit used by both the evaluation system and the prime-rl training framework." HUD, another entrant, markets itself in one breath as agent evaluation and RL training. The vendors aren't bundling two products — they noticed it was one.

Which means the eval suite you already maintain to keep your agent from regressing is a partially finished training environment. The expensive, irreducible work (defining the tasks that matter, building the harness, writing a rubric precise enough to be trusted) is the same work either way. We argued in The Evals Are the Product that your test set is your real spec; the RL turn sharpens that claim. Your test set is also your training data, the moment you decide to use the score for something other than a dashboard.

What this changes for you

If you train models, the takeaway is blunt: stop optimizing the optimizer and start investing in environments. That means broader task coverage, more realistic harnesses, and reward functions that are hard to game. That is now the high-leverage surface.

If you only ever prompt a frontier model, you are not off the hook, because the discipline is identical. The reason agents fail in production is almost never the model's raw capability; it is that no one specified the task crisply, built a faithful harness, or wrote a check for "did it actually work." Do that, and — as we covered in evaluating an agent's tool use — you get a reliable agent today and a training environment for free tomorrow. Skip it, and no algorithm, commodity or otherwise, will save you.

The moat was never the math. It was always the world you put the agent in.