If you came here to be told which algorithm to use, I'll save you the scroll: it doesn't much matter. GRPO, its relatives, the PPO variants underneath them — they're solved, packaged, and sitting in a dozen libraries you can pip install before lunch. The interesting fight in reinforcement learning for AI agents in 2026 isn't in the optimizer. It's one rung down, in the part nobody puts on a slide: where does the reward come from, and can you trust it?

That's the whole story. Everything good that's happened in agent RL over the last eighteen months traces back to a boring engineering question — how do you build an environment that emits a reward signal you'd actually stake a training run on — and everything stuck traces back to not having one.

RLVR is why coding agents pulled ahead

The acronym to know is RLVR: Reinforcement Learning with Verifiable Rewards. The reward isn't a learned model guessing how good an answer looks. It's a deterministic checker. Did the unit tests pass? Did the math expression evaluate to the reference answer? Did the SQL actually run and return the right rows? The signal is rule-based, cheap, and (this is the part that matters) not gameable by sounding confident.
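To make "deterministic checker" concrete, here is a minimal sketch of two such rewards in plain Python. The function names and the 0/1 scoring are illustrative assumptions, not any library's API, and the unit-test checker runs candidate code with a bare exec, which a real harness would replace with a sandbox.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """1.0 if the answer equals the reference as an exact rational number, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable output earns nothing, however confident it sounds.
        return 0.0
    return 1.0 if same else 0.0


def unit_test_reward(source: str, func_name: str, cases: list) -> float:
    """1.0 only if the function defined in `source` passes every (args, expected) case."""
    namespace: dict = {}
    try:
        # ASSUMPTION: unsandboxed exec keeps the sketch short; isolate this in practice.
        exec(source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and crashes all score zero.
        return 0.0
    return 1.0 if passed else 0.0


if __name__ == "__main__":
    print(math_reward("0.5", "1/2"))
    print(math_reward("I am certain it is 1/2", "1/2"))
    print(unit_test_reward("def add(a, b):\n    return a + b", "add", [((1, 2), 3)]))
```

Nothing here looks at how the answer is phrased. The only way to raise the score is to be right.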

DeepSeek-R1 made this legible to everyone. For its reasoning-focused RL, the recipe was GRPO plus two rule-based rewards: an accuracy reward (is the final answer correct against ground truth) and a format reward (did it produce the right structure). No reward model in that loop. The reasoning gains came from the verifier, not the algorithm, and the algorithm choice itself is a smaller decision than the discourse implies, as we covered in GRPO vs PPO.
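A rough sketch of how little machinery that recipe needs: two rule-based rewards, plus the group-relative normalization that gives GRPO its name (each sample's reward is centered and scaled against the other samples for the same prompt). The tag template, the exact-string answer match, and the use of population standard deviation are assumptions made for illustration, not DeepSeek's published code.

```python
import re
import statistics

# ASSUMPTION: reasoning goes in <think> tags and the final answer in <answer> tags.
_TEMPLATE = re.compile(r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the reference string exactly."""
    match = _TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def group_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Center and scale rewards within one prompt's group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    samples = [
        "<think>2 + 2 is 4</think>\n<answer>4</answer>",
        "<think>maybe 5?</think>\n<answer>5</answer>",
        "The answer is 4",
    ]
    rewards = [accuracy_reward(s, "4") + format_reward(s) for s in samples]
    print(rewards)
    print(group_advantages(rewards))
```

Note what happens when every sample in a group scores the same: the advantages are all zero and the group contributes no gradient. The signal lives entirely in the verifier separating good samples from bad ones.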

Here's the uncomfortable corollary. Coding, math, and structured tool use got dramatically better because they come with verifiers for free. The test suite is the reward function. Meanwhile "be a generally helpful agent" has barely moved under RL, and it's not because the algorithms can't handle it. It's because nobody can cheaply, reliably verify open-ended helpfulness. No verifier, no trustworthy reward, no signal worth optimizing against.

The model didn't get smart because of GRPO. It got smart because the test suite told the truth a few hundred thousand times.

The environment is the product

So the real work moved to environments. An environment, in the modern sense, bundles three things: a dataset of tasks, a harness that lets the model act (tools, a sandbox, context management, multi-turn protocol), and a rubric that scores the result. Prime Intellect's verifiers library — tagline, "our library for RL environments + evals" — codifies exactly this split. You build the environment once; it works as an eval, a synthetic-data pipeline, or an RL target against any OpenAI-compatible endpoint.
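The three-part split is easy to picture as a data structure. What follows is a hypothetical sketch in plain Python, not the verifiers API: Task, Environment, single_turn, and exact_match are names invented here, and the policy is just a function from prompt to completion standing in for a model endpoint.

```python
from dataclasses import dataclass
from typing import Callable

# ASSUMPTION: a policy is prompt in, completion out; in practice a call to a model endpoint.
Policy = Callable[[str], str]


@dataclass(frozen=True)
class Task:
    prompt: str
    reference: str


@dataclass
class Environment:
    dataset: list                            # the tasks
    harness: Callable[[Policy, Task], str]   # lets the model act, returns its final output
    rubric: Callable[[Task, str], float]     # scores that output

    def rollout(self, policy: Policy, task: Task) -> tuple:
        """One episode: the (output, reward) pair an RL trainer would consume."""
        output = self.harness(policy, task)
        return output, self.rubric(task, output)

    def evaluate(self, policy: Policy) -> float:
        """The same pieces reused as an eval: mean reward over the dataset."""
        if not self.dataset:
            raise ValueError("environment has no tasks")
        return sum(self.rollout(policy, t)[1] for t in self.dataset) / len(self.dataset)


def single_turn(policy: Policy, task: Task) -> str:
    """Simplest possible harness: one prompt, one completion, no tools."""
    return policy(task.prompt)


def exact_match(task: Task, output: str) -> float:
    return 1.0 if output.strip() == task.reference else 0.0


arithmetic_env = Environment(
    dataset=[Task("2+2", "4"), Task("3*3", "9")],
    harness=single_turn,
    rubric=exact_match,
)

if __name__ == "__main__":
    always_four = lambda prompt: "4"  # toy policy standing in for a model
    print(arithmetic_env.evaluate(always_four))
```

The point of the shape is that rollout and evaluate share one harness and one rubric, so the number you report as an eval and the number you train against cannot quietly diverge.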

The clever move is making environments portable packages. Prime Intellect's Environments Hub distributes them as ordinary Python wheels with their own pyproject.toml. Their trainer, prime-rl, treats environments as the shared unit: the trainer owns the model, endpoint, and sampling; the environment owns the harness and reward. You can develop and test an environment in isolation against an API model, push it, and train on it without touching the trainer code. That's the shape of a healthy ecosystem — the reward logic is a versioned dependency, not a fork.

This reframes the toolchain question entirely. You don't pick a library for its algorithm. You pick it for how it gets you a reward signal.

The training-loop frameworks underneath (verl, TRL, OpenRLHF) are a separate, already commoditized layer; see verl vs OpenRLHF vs TRL. Worth knowing, not worth agonizing over.

When you can't write a checker

Plenty of useful tasks have no clean rule. Did the agent write a good customer email? Pick a reasonable next action in a workflow? You can't unit-test taste.

OpenPipe's answer is RULER — Relative Universal LLM-Elicited Rewards. Instead of a hand-crafted reward function, you let a frontier model rank multiple agent trajectories against each other. No labeled data, no expert feedback. The startling result from their writeup: RULER-trained models matched or beat hand-crafted reward functions on three of four benchmarks. When a checker is expensive and a judge is cheap, the judge often wins on effort-adjusted return.
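The core idea, relative judgment within a group rather than an absolute score, can be sketched without any particular library. This is not OpenPipe's API. The Judge signature is hypothetical, the mapping from rank to evenly spaced rewards is one assumption among many possible, and the stand-in judge below exists only so the example runs without a model call.

```python
from typing import Callable, Sequence

# ASSUMPTION: a judge takes the task and the candidate trajectories and returns
# their indices ordered best to worst. A real one would wrap a frontier model.
Judge = Callable[[str, Sequence[str]], list]


def relative_rewards(task: str, trajectories: Sequence[str], judge: Judge) -> list:
    """Turn a judge's ranking into rewards evenly spaced from 1.0 (best) to 0.0 (worst)."""
    n = len(trajectories)
    order = judge(task, trajectories)
    if sorted(order) != list(range(n)):
        raise ValueError("judge must return each trajectory index exactly once")
    if n == 1:
        return [0.0]  # a lone sample carries no relative signal
    rewards = [0.0] * n
    for rank, index in enumerate(order):
        rewards[index] = 1.0 - rank / (n - 1)
    return rewards


def shortest_first_judge(task: str, trajectories: Sequence[str]) -> list:
    """Stand-in judge for the demo only: prefers shorter trajectories."""
    return sorted(range(len(trajectories)), key=lambda i: len(trajectories[i]))


if __name__ == "__main__":
    drafts = ["a long rambling reply", "short", "medium reply"]
    print(relative_rewards("write a customer email", drafts, shortest_first_judge))
```

Because the rewards are relative within the group, they slot directly into the same group-normalized update a verifier would feed. The judge only has to say which attempt was better, not how good any of them was.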

But be clear-eyed about what you bought. An LLM judge is a learned, approximate reward — exactly the thing RLVR exists to avoid. It can be gamed by outputs that look right. It drifts. It's a reasonable bridge over terrain where rules don't reach, not a replacement for a real verifier where one exists. Even the frontier agentic models hedge: Kimi K2's report describes a joint RL stage mixing verifiable rewards with self-critique, precisely because neither covers the whole space alone.
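One way to keep that distinction honest in code is to route each task to the strongest reward source it has and record which one was used. This is a hypothetical sketch: routed_reward, the Verifier and JudgeScore types, and the clamping to [0, 1] are assumptions for illustration, not a description of how Kimi K2 or any other system mixes its rewards.

```python
from typing import Callable, Optional

Verifier = Callable[[str], bool]      # deterministic pass/fail check
JudgeScore = Callable[[str], float]   # learned, approximate score


def routed_reward(output: str, verifier: Optional[Verifier], judge: JudgeScore) -> tuple:
    """Use the verifier when the task has one; otherwise fall back to the judge.

    Returns (reward, source) so the share of judge-derived signal can be tracked.
    """
    if verifier is not None:
        return (1.0 if verifier(output) else 0.0, "verifier")
    # Clamp the judge's score so an out-of-range value cannot dominate a batch.
    score = min(1.0, max(0.0, judge(output)))
    return (score, "judge")


if __name__ == "__main__":
    overeager_judge = lambda text: 1.7   # stand-in for a model call
    returns_rows = lambda text: text.startswith("SELECT")
    print(routed_reward("SELECT 1", returns_rows, overeager_judge))
    print(routed_reward("Dear customer, ...", None, overeager_judge))
```

Logging the source alongside the reward gives you a cheap audit: if most of a run's signal turns out to come from the judge, you know how much of it is approximate.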

The actual decision

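Strip away the library marketing and the choice is a triage of your task by reward source: use a deterministic checker wherever one exists, fall back to an LLM judge where rules don't reach, and mix the two when the task spans both.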

The teams winning at agent RL right now aren't the ones with a secret optimizer. They're the ones who spent their quarter building a sandbox that says true when the agent did the thing and false when it didn't — and who refused to ship a training run until that signal was honest. The algorithm was never the bottleneck. The verifier always was.