If you want to train an AI agent with reinforcement learning, the algorithm is the part you no longer have to think about. PPO is a solved import. GRPO (group-relative, critic-free, the default that swept 2025) is forty lines you copy from a paper. The general-purpose RLHF trainers that popularized both (verl, i.e. Volcano Engine RL, plus TRL and OpenRLHF) are commodity substrate now; you pip install one and the loss function is handled.
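To make "forty lines" concrete, here is a minimal sketch of the group-relative part of GRPO: each sampled completion is scored against the mean and spread of its own group, so no learned critic is needed. The function name, the `eps` constant, and the choice of population standard deviation are assumptions for illustration; real trainers differ in those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (no critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for the same prompt, scored 0/1 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions get a positive advantage, incorrect ones a negative one, and a group where every completion scores the same contributes no signal at all.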

So why did four separate teams ship four new frameworks this year, all pointed at the same problem? Because none of them are really about the algorithm. They are about the rollout — and the rollout is where agent RL actually breaks.

The trainer is a solved import. The environment is the part nobody can buy off the shelf.

Why the environment, not the loss, is the hard part

Classic RLHF has a trivially cheap environment: prompt in, completion out, scalar reward. The GPU that generates the completion is the same GPU you train on, and it is busy the whole time. An agent detonates that assumption. A single agent rollout is a long-horizon loop — call a tool, wait on a sandbox, read a file, hit a browser, run a test, decide again — that can run for minutes and dozens of turns before any reward exists. Most of that wall-clock is I/O: your expensive accelerator sits idle while a headless Chromium paints or pytest collects.

That is the whole game. Get it wrong and utilization craters to single digits; you are renting H100s to watch a spinner. Every framework below is a different bet on how to hide that latency and how to model the environment that produces it. Read them as answers to one question: what do you do while the agent is thinking?
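A back-of-envelope model shows how quickly utilization collapses, and why overlapping rollouts is the obvious fix. The numbers below (2 seconds of generation and 38 seconds of tool waiting per turn) are invented for illustration, and the model assumes waits overlap perfectly while generation does not, which is a simplification.

```python
def accelerator_utilization(gen_s, wait_s, concurrent_rollouts=1):
    """Fraction of wall-clock time the accelerator spends generating.

    gen_s: seconds of generation per agent turn
    wait_s: seconds of tool/sandbox/browser waiting per turn
    concurrent_rollouts: rollouts interleaved on the same accelerator
    """
    busy = gen_s * concurrent_rollouts
    # One turn of every rollout takes at least gen_s + wait_s, and at least
    # as long as the accelerator needs to serve all of them.
    cycle = max(gen_s + wait_s, busy)
    return busy / cycle


serial = accelerator_utilization(gen_s=2.0, wait_s=38.0)
overlapped = accelerator_utilization(gen_s=2.0, wait_s=38.0, concurrent_rollouts=16)
print(f"serial: {serial:.0%}, 16 concurrent: {overlapped:.0%}")
# serial: 5%, 16 concurrent: 80%
```

Under these assumptions a single serial rollout keeps the accelerator busy 5% of the time; interleaving sixteen of them gets to 80% without making any individual tool call faster.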

Trains an existing agent with RL via near-zero code change — decouples agent execution from the trainer through a central LightningStore

Agent Lightning's bet: don't touch the agent at all. Your agent keeps running in whatever it already runs in (LangChain, the OpenAI Agents SDK, AutoGen, CrewAI, or raw Python), and a tracer siphons off every prompt, tool call, and reward into a LightningStore. "On the other side of the store sits the algorithm you choose." The pitch is literally "ZERO CODE CHANGE (almost)", and it supports RL, automatic prompt optimization, and SFT against the same trace stream. The environment here is your real production agent, unmodified. That is the cleanest answer to the latency problem: you were going to run those rollouts in production anyway.
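The decoupling pattern is easy to sketch. What follows is not the Agent Lightning API: `TraceStore`, `Span`, and `run_agent` are hypothetical names for a toy in-memory version of the idea, in which the agent side only records spans and the training side only reads finished rollouts.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    rollout_id: str
    kind: str  # "prompt", "tool_call", or "reward"
    payload: object


@dataclass
class TraceStore:
    """Hypothetical in-memory stand-in for a central trace store."""
    spans: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, span: Span) -> None:
        self.spans[span.rollout_id].append(span)

    def completed_rollouts(self):
        """Yield (rollout_id, spans) for rollouts that already have a reward."""
        for rid, spans in self.spans.items():
            if any(s.kind == "reward" for s in spans):
                yield rid, spans


def run_agent(store: TraceStore, rollout_id: str, question: str) -> str:
    """Existing agent logic; record() stands in for what a tracer captures."""
    store.record(Span(rollout_id, "prompt", question))
    store.record(Span(rollout_id, "tool_call", {"name": "calculator", "args": "2+2"}))
    answer = "4"
    store.record(Span(rollout_id, "reward", 1.0 if answer == "4" else 0.0))
    return answer


store = TraceStore()
run_agent(store, "r1", "What is 2+2?")
store.record(Span("r2", "prompt", "still running"))  # no reward yet

# Trainer side: sees only the store, never the agent code.
batch = {rid: [s.kind for s in spans] for rid, spans in store.completed_rollouts()}
print(batch)  # {'r1': ['prompt', 'tool_call', 'reward']}
```

The point of the design is that the store is the only shared surface, so the agent framework and the training algorithm can each change without the other noticing.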

Modular full-stack RL library for long-horizon agents, with a fully async trainer, an in-flight weight-update pipeline, and a gym of tool-use environments
NovaSky-AI/SkyRL · Python · ★ 2k

SkyRL's bet: own the whole stack and make it async. It splits into skyrl-train (the trainer), skyrl-agent (the long-horizon agent layer, tuned for SWE-Bench-style tasks), and skyrl-gym (a gymnasium of math, coding, search, and SQL environments behind the standard Gymnasium interface). The headline feature is "Fully Async RL with In-Flight Weight Updates": the policy keeps generating rollouts while fresh weights are swapped in mid-flight, so the generators never stall waiting for the optimizer to finish. If your bottleneck is a slow, variable-length environment like a repo agent, async-with-in-flight-updates is the direct countermeasure.
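A toy simulation makes the in-flight idea visible. This is not SkyRL code: `PolicyServer`, `rollout`, and the `updates` mapping are hypothetical, and the "weights" are just a version counter. What it shows is a single long rollout that keeps going while newer weights arrive, with each step tagged by the version that produced it.

```python
from dataclasses import dataclass


@dataclass
class PolicyServer:
    """Hypothetical inference server whose weights can be swapped between steps."""
    version: int = 0

    def generate_step(self, turn: int) -> dict:
        # Tag each step with the weights version that produced it.
        return {"turn": turn, "policy_version": self.version}

    def load_weights(self, version: int) -> None:
        self.version = version


def rollout(server, num_turns, updates):
    """Run one rollout; `updates` maps turn index -> weights version arriving then."""
    trajectory = []
    for turn in range(num_turns):
        if turn in updates:
            server.load_weights(updates[turn])  # swap in place, rollout continues
        trajectory.append(server.generate_step(turn))
    return trajectory


traj = rollout(PolicyServer(), num_turns=6, updates={2: 1, 4: 2})
versions = [step["policy_version"] for step in traj]
print(versions)  # [0, 0, 1, 1, 2, 2]
```

The cost of never stalling is that one trajectory now mixes several policy versions, so recording the version per step gives a trainer what it would need to account for the staleness.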

RL infrastructure for embodied and agentic AI; hybrid + disaggregated execution ("M2Flow") aimed squarely at rollout throughput
RLinf/RLinf · Python · ★ 4k

RLinf's bet: it's a systems problem, so bring systems tools. It offers hybrid and disaggregated execution (colocate the generator and trainer, or split them across separate GPU pools) under a "macro-to-micro flow" (M2Flow) abstraction, with both FSDP+HuggingFace and Megatron backends feeding SGLang or vLLM. The numbers it advertises are all about the rollout, not the reward: up to 2.43× throughput over existing frameworks on embodied RL, and a 25× end-to-end speedup on the BEHAVIOR simulator by cutting rollout latency from 1028.7 ms/step to 41.2 ms/step. If your environment is heavy (embodied, VLA, a real simulator), RLinf is the one that treats latency as the primary enemy.
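The 25× figure follows directly from the two per-step latencies quoted above, which is worth checking once. The snippet uses only those two numbers; the steps-per-second conversion is plain arithmetic added for scale.

```python
before_ms, after_ms = 1028.7, 41.2  # rollout latency per step, as quoted

speedup = before_ms / after_ms
steps_per_s_before = 1000.0 / before_ms
steps_per_s_after = 1000.0 / after_ms

print(f"{speedup:.2f}x")  # 24.97x
print(f"{steps_per_s_before:.2f} -> {steps_per_s_after:.2f} steps/s")
```

In other words, the simulator goes from roughly one step per second to roughly twenty-four.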

Multi-turn RL for long-horizon decision-making, with a curriculum ("ScalingInter-RL") that grows interaction depth over training

AgentGym-RL's bet: the curriculum is the product. Coming out of the AgentGym line of work (Xi et al.), it argues that dumping a fresh policy into deep, many-turn episodes just teaches it to flail; its ScalingInter-RL schedule starts shallow and grows the interaction horizon as competence builds. The framework is decoupled and modular across web, search, game, and embodied environments, but its real contribution is a claim about how you should feed rollouts to the optimizer over time. It is the most research-forward of the four and the right starting point if what you want to study is the training dynamics themselves.
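A horizon curriculum of this kind can be as small as one function. The sketch below is not the ScalingInter-RL schedule itself: the staged shape, the function name, and every constant (5 starting turns, 5 more per stage, 100 steps per stage, a cap of 30) are assumptions picked to illustrate "start shallow, grow the horizon".

```python
def max_turns_for_step(step, start_turns=5, turns_per_stage=5,
                       steps_per_stage=100, cap=30):
    """Staged horizon schedule: allow more interaction turns as training progresses."""
    stage = step // steps_per_stage
    return min(start_turns + stage * turns_per_stage, cap)


schedule = [max_turns_for_step(s) for s in (0, 99, 100, 250, 1000)]
print(schedule)  # [5, 5, 10, 15, 30]
```

A rollout worker would consult this value when starting an episode and truncate the interaction once the turn budget is spent, so early training sees only short episodes.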


The one idea to take away

Pick by the shape of your environment, not the name of your algorithm. Framework-locked and I/O-bound, agent already in production? Agent Lightning, because the rollout is free. Want to own a SWE-style stack end to end? SkyRL, for the async pipeline. Heavy simulator, embodied, throughput-critical? RLinf. Researching how curricula and horizons shape learning? AgentGym-RL.

But the deeper point is the one these repos keep quietly conceding: the loss function is a footnote. The environment is the new dataset. verl and TRL commoditized the optimizer, and the moment they did, all the remaining leverage rushed downstream — into how fast, how realistically, and how cheaply you can run an agent through a world that fights back. Whoever owns that owns the training run. Everyone ships the same PPO. Nobody can buy your environment.