
Reinforcement learning (RL) has shaped AI for decades, from early control systems to game-playing agents and, more recently, large language models that learn through interaction. At its core, RL is a feedback loop: the model acts, receives feedback on its actions, and updates its behavior, improving over time.

However, as AI becomes agentic, capable of multi-step reasoning, tool use, and decision-making, we are entering the "Era of Experience", where progress is driven by systems that learn from their own experience rather than just static data. RL must evolve from optimizing single responses to shaping behaviors across entire trajectories. In this paradigm, learning happens through interaction with "environments" that define permissible actions, state changes, and the definition of success.
An RL workflow unifies a policy model, a training algorithm, and an environment with a method to verify agent responses. This interaction loop enables agents to plan, adapt, and recover from failure.
This blog explores how RL is evolving for agentic AI, why environments are central to this shift, and how open tools like Unsloth, NVIDIA NeMo RL, NVIDIA NeMo Gym, and NVIDIA NeMo Data Designer help developers build these RL workflows efficiently.
Before building an environment, it is critical to understand when RL is the right tool.
Supervised Fine-Tuning (SFT) fits best when you can provide clear target behaviors via demonstrations (instruction-response pairs). It is great for teaching format and style. However, SFT can only imitate what its demonstrations show, which becomes limiting as tasks grow more complex.
Reinforcement Learning (RL) becomes the better choice as complexity grows. Instead of telling the model "say exactly this," you provide a goal and a way to verify it. This lets the model explore its own reasoning paths, making it more resilient to edge cases. RL tends to work well for tasks like math, code, and tool calling, where there is a clear path to verifying answers.
In practice, SFT and RL are not mutually exclusive; a hybrid strategy, with SFT establishing a baseline that RL then refines, is often employed.
For example, the NVIDIA Nemotron 3 family of models utilizes SFT as a substantial first stage to ground the model before moving into RL refinement. The ultimate choice depends on your compute budget, data availability, and the level of generalization your agent requires. The industry is generally shifting towards allocating more compute during RL stages, especially as RL environments become more sophisticated and accessible.
Traditionally, RL methods like PPO (Proximal Policy Optimization) were the standard. However, their resource intensiveness (requiring multiple complex and compute-intensive models, such as the reward and critic models) has driven a shift toward more scalable algorithms.
Modern workflows are increasingly adopting more efficient methods like DPO and GRPO to handle different aspects of model improvement:
Direct Preference Optimization (DPO) sidesteps the RL loop entirely, treating alignment as a classification problem on static preference data.
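For reference, the standard DPO objective trains the policy directly on preference pairs $(x, y_w, y_l)$, where $y_w$ is the preferred response and $y_l$ the dispreferred one, against a frozen reference policy $\pi_{\mathrm{ref}}$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $\beta$ controls how far the policy may drift from the reference, and $\sigma$ is the logistic function. Note that this is a pure classification loss over a fixed dataset: no rollouts are generated during training.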
However, DPO lacks explicit reward optimization or exploration. It learns from fixed preference pairs, preventing it from discovering new strategies or optimizing long-horizon outcomes. Furthermore, because the DPO algorithm models the relative output preference rather than trajectory reward, it is less effective for agentic workflows requiring multi-step reasoning and tool use.
To address these limitations in agentic domains, developers are turning to algorithms that leverage verifiable rewards.
One such algorithm is Group Relative Policy Optimization (GRPO), a more efficient variant of PPO. In this setup, the heavy critic model is replaced by generating a group of outputs per prompt and scoring them against a deterministic verifier, with the group's own statistics serving as the baseline.
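Concretely, GRPO normalizes each sampled output's reward against the mean and standard deviation of its group, producing a relative advantage without any learned value model. A minimal sketch of this idea (illustrative, not the exact NeMo RL implementation):

```python
# Sketch of GRPO-style group-relative advantages (illustrative only).
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All outputs scored the same: no relative signal for this group
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: a verifier scored 4 sampled answers (1 = passed, 0 = failed)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that beat the group average get positive advantages and are reinforced; outputs below it are penalized. This is why GRPO pairs so naturally with verifiers: the group itself supplies the baseline a critic model would otherwise have to learn.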
The reward signal in GRPO is not restricted to binary values (0 or 1), but supports continuous values (-∞ to +∞). While it thrives when an environment can programmatically say "Yes" or "No" (for example, passing a unit test), it also supports complex rewards, where scores may exceed 1, for more granular feedback.

This broader shift toward verifiable correctness is distinct from any single algorithm. While verification can drive improvements even in supervised settings (such as rejection sampling), it is central to the paradigm of Reinforcement Learning from Verifiable Rewards (RLVR). By replacing subjective scoring with explicit checks (did the agent produce the correct answer, or did it call the right tools?), RLVR moves the "center of gravity" from the optimizer to the environment. Algorithms like GRPO simply provide an efficient mechanism to optimize against these environmental signals.
In RLVR, the environment becomes the contract between learning and behavior. Let’s now define more concretely what we mean by an environment.
The environment is everything outside the absolute control of the agent. An environment is defined by the task for the agent to accomplish, the actions the agent can take, and the state of the world the agent observes and acts upon. The environment also determines how the agent’s performance is evaluated: what constitutes success and how reward is assigned.
Before we move further, it’s important to formally introduce key terminology:
Rollout: The process of executing a policy in an environment to generate experience. It emphasizes the act of collecting data: stepping through the environment, taking actions, and recording what happens.
Trajectory: The resulting sequence of states, actions, and rewards produced by a rollout. It emphasizes the data itself: the ordered record of what happened. In practice, most codebases (and papers) treat the two as synonymous, since a rollout produces exactly one trajectory, and when people say "trajectory" they usually imply it came from rolling out a policy.
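As a mental model, a trajectory is simply an ordered record of steps, each pairing what the agent observed, what it did, and what reward it received. A minimal sketch (field names are illustrative, not the NeMo Gym trajectory schema):

```python
# Toy trajectory data structure; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str   # what the agent saw this turn
    action: str        # what the agent did (text, tool call, ...)
    reward: float      # per-step reward (often 0 until the final step)

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)

traj = Trajectory()
traj.steps.append(Step("user asks to send an email", "email_send_email(...)", 0.0))
traj.steps.append(Step("tool result: email sent", "Done!", 1.0))
```

In agentic settings the reward is frequently sparse, assigned only after the final step once the verifier has inspected the end state, which is exactly why whole-trajectory optimization matters.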
NeMo Gym is an open-source library for building and scaling RL environments, battle-tested through the development of the Nemotron 3 model family.
NeMo Gym addresses the practical challenges of agentic RL by cleanly decoupling rollout collection from training, standardizing trajectories via the OpenAI Responses API, and providing infrastructure to manage resource lifecycles at the scale of thousands of parallel environments.
In NeMo Gym, tasks define what the agent must accomplish. Resources provide the external state the agent interacts with (for example, tools, databases, and sandboxed execution) as well as the verification logic that scores performance. The Model Interface handles generation, producing the model’s actions each turn, such as text, tool calls, or code. The Agent orchestrates each rollout: calling the model to generate actions, updating the environment state via resource servers, and collecting the final reward.

Figure 1: The architecture of NeMo Gym that works alongside an RL training framework, illustrating the decoupling of environment rollout orchestration from model training and generation.
NeMo Gym integrates with RL training libraries such as NeMo RL, Unsloth, HuggingFace TRL, and others which implement the training algorithms (for example, GRPO) that update the model. NeMo Gym collects rollout trajectories and rewards from the environment and passes them to the training framework, which manages policy updates and serves the updated model for the next round of rollouts.
Before writing a single line of code, it is essential to understand the two-phase journey of an RLVR practitioner.

Figure 2: The RLVR workflow: environment preparation precedes and shapes model training.

The key insight is that environment preparation is how you define what "better" means. The training phase simply optimizes for the signal you’ve built.
For the purposes of this blog post, we will assume that you have a good understanding of the capability or benchmark you want the model to improve on. The following section specifically covers Phase 1.3 above: building an environment.
NeMo Gym, an open-source library within the NVIDIA NeMo framework, defines and orchestrates RL environments and generates scalable, verifiable rollout data, while Unsloth consumes these rollouts for efficient RL training.
Within the NeMo Gym ecosystem, building an environment relies on three foundational pillars, culminating in model training executed via an integrated RL framework.
Agents need to be exposed to a diverse set of scenarios in order to specialize and improve on a given task. For instance, in the Workplace Assistant environment, task data consists of natural language business requests that require the agent to autonomously navigate simulated databases and tools over multiple steps. A simple single-step example of a user query and expected response would be:
User query:
"Send an email to [email protected] with the subject 'Team Meeting' and body 'Let's meet tomorrow at 2pm to discuss the project.'"
Expected tool call:
```python
email_send_email(
    recipient="[email protected]",
    subject="Team Meeting",
    body="Let's meet tomorrow at 2pm to discuss the project."
)
```
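A verifier for such a single-step task can simply compare the agent's emitted tool call against the expected one. The following is a hedged sketch of that idea, not the actual Workplace Assistant scoring logic:

```python
# Illustrative binary verifier for the email task above; the real
# environment's verification logic may be richer than exact matching.
expected = {
    "name": "email_send_email",
    "arguments": {
        "recipient": "[email protected]",
        "subject": "Team Meeting",
        "body": "Let's meet tomorrow at 2pm to discuss the project.",
    },
}

def verify_tool_call(actual, expected):
    """Return 1.0 only when the tool name and every argument match."""
    if actual["name"] != expected["name"]:
        return 0.0
    return 1.0 if actual["arguments"] == expected["arguments"] else 0.0
```

Exact matching is the simplest possible check; multi-step variants typically inspect the final database or mailbox state instead of the raw call.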
When short on task-specific data, developers can turn to synthetic data generation (SDG) using tools such as NeMo Data Designer to programmatically create task queries, and potentially corresponding ground truths. To train effectively, you need thousands of diverse prompts that exercise the environment’s tools. For example, if you are building a coding environment, you might use an LLM to generate 5,000 unique Python word problems, while a deterministic script generates the unit tests (the ground truth) used to verify the answers.
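As a toy illustration of that pattern, a deterministic script can stamp out task/ground-truth pairs, while an LLM (not shown) would diversify the natural-language phrasing. All names below are hypothetical and are not NeMo Data Designer APIs:

```python
# Hypothetical SDG sketch: deterministic task generation with verifiable
# ground truth. In practice an LLM would rewrite each prompt into a
# varied word problem; the unit test stays deterministic.
import random

def make_task(seed):
    """Deterministically generate one task plus its verifiable ground truth."""
    rng = random.Random(seed)
    a, b = rng.randint(1, 50), rng.randint(1, 50)
    prompt = f"Write a function solve() that returns the sum of {a} and {b}."
    unit_test = f"assert solve() == {a + b}"  # ground truth for the verifier
    return {"prompt": prompt, "unit_test": unit_test}

tasks = [make_task(i) for i in range(5000)]
```

Keeping the ground truth generated by code rather than by the LLM is the key design choice: the prompts can be noisy and diverse, but the verification signal stays exact.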
Understanding the task is the first step in designing the environment itself.
Referring back to the left half of Figure 1, the environment design consists of three primary components:
Take a look at an example Agent Server pseudocode below. It sends the conversation to the model, gets back a response, and if the model makes any tool calls, it routes the tool calls to the resources server and feeds the results back to the model. This repeats until the model replies with a plain text message (no tool calls), hits the token limit, or exceeds max_steps.
```python
# Agent Server pseudocode (based on SimpleAgent)

async def run(task_data):
    # 1. Initialize episode
    resource_server.seed_session(task_data)
    # 2. Run the agent loop
    response = self.responses(task_data.prompt, task_data.tools)
    # 3. Grade the result
    reward = resource_server.verify(response, task_data.ground_truth)
    return response, reward

async def responses(prompt, tools):
    conversation = prompt
    step = 0
    while step < max_steps:
        model_output = model_server.responses(conversation, tools)
        conversation.append(model_output)
        if model_output is text:
            break  # model is done, no more tool calls
        for tool_call in model_output.function_calls:
            result = resource_server.post(
                f"/{tool_call.name}",
                tool_call.arguments,
            )
            conversation.append(result)
        step += 1
    return conversation
```
Importantly, you can use an existing agent in NeMo Gym, bring your own, or create a completely new one. As such, it’s possible this loop looks very different depending on your setup. The MiniSWEAgent, for example, delegates the run logic to an external harness running in Docker containers, and then converts the output back into the NeMo Gym format.
Existing agents may also come with predefined tools, allowing you to leverage them directly. You can then seamlessly use the resources server to supplement the agent with any additional, external tools it may need.
The Resources Server is the "world" the agent interacts with. In NeMo Gym, this is implemented as a lightweight FastAPI application. It exposes tools as HTTP endpoints (for example, POST /search_database) that the model can call via standard OpenAI-compatible tool schemas, as well as reward calculation logic.
Crucially, these servers handle session management. Because an agentic rollout involves multiple steps, the environment must "remember" what happened in previous steps. NeMo Gym uses a session_id to maintain isolated state for every parallel rollout.
```python
# Conceptual Resources Server structure
class MyResourceServer(SimpleResourcesServer):
    async def seed_session(self, session_id, initial_data):
        # Initialize the "sandbox" for this specific rollout
        self.state[session_id] = initialize_environment(initial_data)

    async def my_custom_tool(self, session_id, tool_args):
        # Model calls this during the rollout
        result = execute_action(self.state[session_id], tool_args)
        return result
```
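The session-isolation idea can be illustrated in a few lines of plain Python. The real resources server exposes these operations as FastAPI HTTP endpoints; the class and method names below are illustrative only:

```python
# Toy model of per-rollout session isolation (illustrative; NeMo Gym's
# actual resources server is a FastAPI app keyed by session_id).
class SessionStore:
    def __init__(self):
        self._state = {}

    def seed(self, session_id, initial_data):
        """Create an isolated sandbox for one rollout."""
        self._state[session_id] = {"data": dict(initial_data), "log": []}

    def call_tool(self, session_id, name, args):
        """Record a tool call against this rollout's state only."""
        self._state[session_id]["log"].append((name, args))
        return {"ok": True, "calls": len(self._state[session_id]["log"])}

store = SessionStore()
store.seed("rollout-1", {"inbox": []})
store.seed("rollout-2", {"inbox": []})
store.call_tool("rollout-1", "email_send_email", {"recipient": "x"})
```

Because each session's state is keyed independently, thousands of parallel rollouts can share one server process without their tool calls interfering with each other.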
The verifier is one of the most critical parts of environment design. It is often a deterministic function that evaluates the final state of a rollout and returns a reward signal.
Common ways to design verification logic include exact comparison against a golden ground truth, sandboxed execution (running generated code or artifacts against unit tests), using an LLM-as-a-judge (for semantic or open-ended evaluation), and training reward models (to capture human preferences), among others.
```python
# Conceptual verification logic
async def verify(self, session_id, agent_response, ground_truth):
    # 1. Extract what the agent actually did
    actual_outcome = self.state[session_id].get_final_state()
    # 2. Compare against the "golden" result
    if actual_outcome == ground_truth:
        return reward(1.0)  # Success!
    return reward(0.0)  # Failure
```
Whatever approach you choose, design verification logic carefully: the reward it produces is the definition of success your model will optimize against.
With environments in place, agent training proceeds by generating rollouts through repeated interaction between the policy model(s) and the environment. NeMo Gym orchestrates this process by running environments at scale, managing session state, and producing structured rollout trajectories annotated with rewards from the verification logic.
These rollouts are then consumed by an RL training framework such as Unsloth, NeMo RL, or HuggingFace TRL, which applies an optimization algorithm (for example, GRPO or PPO-style methods) to update model weights. Check out tutorials for GRPO runs with NeMo RL and NeMo Gym and RL training with Unsloth and the NeMo Gym Sudoku environment.
The training framework remains decoupled from environment implementation, allowing teams to swap optimizers, scaling strategies, or hardware backends without modifying environment logic.
Training follows an iterative loop: generate rollouts, verify outcomes, update the policy, and re-evaluate performance. This separation of rollout generation and optimization enables scalable, flexible RL workflows across different domains and infrastructure.
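The iterative loop above can be made concrete with a self-contained toy. Everything here is a stand-in (real systems delegate rollouts to NeMo Gym and weight updates to a trainer such as NeMo RL or Unsloth), and the update rule shown is a simple move-toward-verified-answers heuristic, not GRPO:

```python
# Toy sketch of the RLVR loop: generate rollouts, verify, update, repeat.
# The "policy" is a single number and the "task" a target value, chosen
# purely so the loop runs end to end; none of this is NeMo Gym API.
import random

def rollout(policy, rng):
    # A "rollout" here is just the policy proposing a noisy answer
    return policy["bias"] + rng.gauss(0, 1)

def verify(answer, ground_truth):
    # Deterministic verifier: binary reward for landing near the target
    return 1.0 if abs(answer - ground_truth) < 1.0 else 0.0

def train(policy, tasks, iterations=50, group_size=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        for target in tasks:
            # 1. Generate a group of rollouts per task
            group = [rollout(policy, rng) for _ in range(group_size)]
            # 2. Verify outcomes to obtain rewards
            rewards = [verify(a, target) for a in group]
            # 3. Update the policy toward verified answers
            if any(rewards):
                good = [a for a, r in zip(group, rewards) if r > 0]
                policy["bias"] += lr * (sum(good) / len(good) - policy["bias"])
    return policy

policy = train({"bias": 0.0}, tasks=[1.5])
```

The structure, not the arithmetic, is the point: rollout generation, verification, and optimization are three separate stages, which is precisely what lets the environment and the trainer be swapped independently.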
Deep dive: For a step-by-step technical walkthrough, including code examples for stateful and multi-step environments, refer to the supplemental developer guide for building environments.
Environment-driven RL workflows are increasingly shaping how agentic systems are trained across research and industry. By separating environment definition, rollout generation, and optimization, teams can iterate faster and scale reinforcement learning without tightly coupling reward logic to a single training framework.
This pattern has already been applied in real-world systems. For example, the NVIDIA Nemotron 3 model family was predominantly refined using structured RL across interactive environments, where verification logic prioritized correct trajectories and tool usage over single-step responses. The same environment abstractions used in that work are now available as open libraries and integrate with multiple RL training frameworks.
RL environments are also being developed for applied domains. For example, Edison Scientific integrated NeMo Gym with their Aviary gym to train scientific agents that explore hypotheses, run simulations, and receive deterministic feedback from domain-specific environments. See also NVIDIA’s post on how to train scientific agents with reinforcement learning.
Today, interactive environments built with NeMo Gym generate verifiable rollout data that can be consumed by libraries such as Unsloth, HuggingFace TRL, NeMo RL, and other PyTorch-native stacks. This interoperability allows practitioners to choose optimizers, memory strategies, and hardware backends independently of environment design, supporting scalable agentic AI from research through production.
In the era of agentic AI, the environment defines the contract for intelligence. Here’s how you can get started today:





