Introduction
JAX is widely recognized for its power in training large-scale AI models. However, a primary bottleneck in the next phase of AI development—LLM post-training with Reinforcement Learning (RL)—is the scarcity of environments with verifiable rewards.
Today, we are highlighting the work of the GRL (Game Reinforcement Learning) team at UC San Diego. To solve the data bottleneck, they have built a pipeline that turns video games into rigorous reasoning benchmarks. They used Tunix, a JAX-native, research-friendly RL framework with multi-host and multi-turn support, and leveraged the Google TPU Research Cloud (TRC) to scale their experiments. The results are promising: this approach has yielded significant improvements in model quality, particularly on planning and reasoning tasks, showing that games can be a viable substrate for serious AI capability training.
In this blog post, the GRL team explains how they are combining game environments, the modular Tunix library for RL post-training, and TPU compute to train the next generation of agents.
Why Verifiable Games for LLM Post-Training?
Current RL post-training has shown strong gains in domains like math and coding because success can be auto-checked. However, these settings are often narrow and short-term. We are effectively overfitting RL to clean problems, while the next generation of agents must operate in messy, multi-step worlds.
To unlock RL as a systematic method for reasoning, we need a diverse pool of environments where rewards are grounded in explicit, machine-checkable rules. Games are this missing, underused substrate.
- The Performance Gap: LLMs still perform surprisingly poorly on many strategy games, revealing a clear gap between model behavior and human-level interactive competence.
- Verifiable Signals: Games come with built-in verifiable signals—wins, scores, puzzle completion—meaning outcomes are automatically and unambiguously graded without human labeling.
- Long-Horizon Reasoning: Unlike short QA tasks, games force models to plan, explore, and reason over many steps.
- Abundance: Decades of RL research have produced a standardized ecosystem of diverse environments ready to be recycled.
Game Reinforcement Learning (GRL): A Unified Game-to-Post-Training Pipeline
To harness this ecosystem, we built GRL, a comprehensive suite designed to recycle diverse game environments into a reusable post-training resource. Our mission is to prioritize environments with executable success checks—ranging from text-based puzzles to embodied 3D worlds and web/GUI workflows. Our code and ecosystem live under the LM Games organization (lmgame.org).
GRL provides three key capabilities:
- A Unified Pipeline: We standardize the conversion of games into RL-ready environments with structured states and consistent metrics. This makes results comparable across models and research groups.
- Versatile Configuration: Researchers can tailor interaction styles (e.g., max_turns, natural-language feedback) while seamlessly mixing training data from different tasks. This allows training on puzzles, math, and web tasks within a single run (see the configuration sketch after this list).
- Algorithm-Agnostic Interface: GRL works with any agentic training algorithm. While we frequently use PPO, the system serves as a robust testbed for developing new RL techniques.
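To make the task-mixing idea concrete, here is a minimal sketch of what a mixed-task setup could look like. The field names (game, max_turns, feedback, weight) and the sampling helper are illustrative assumptions for this post, not GRL's actual configuration schema.

```python
import random

# Hypothetical mixed-task configuration: each entry describes one environment
# plus its interaction style and its share of the training mix.
agent_cfgs = {
    "sokoban": {"game": "sokoban", "max_turns": 5, "feedback": "natural_language", "weight": 0.5},
    "math":    {"game": "verifiable_math", "max_turns": 1, "feedback": "none", "weight": 0.3},
    "webshop": {"game": "webshop", "max_turns": 8, "feedback": "observation_only", "weight": 0.2},
}

rng = random.Random(0)

def sample_task() -> str:
    """Pick the environment for the next rollout according to the mixture weights."""
    names = list(agent_cfgs)
    weights = [agent_cfgs[n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```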
The Engine: Plugging into the Tunix RL Framework
Designed for Research Flexibility and Multi-Turn Agents
In practice, plugging a GRL game agent into Tunix is seamless thanks to its modular design. Tunix is built specifically to support multi-turn agentic tasks, allowing researchers to leverage its native one-turn inference APIs to compose complex multi-turn rollouts, then batch those outputs directly back into the training flow. This research flexibility is key: the framework is lightweight enough for quick iteration and benchmarking, yet modular enough to allow fine-grained adjustments to reward functions, algorithms, and hardware-aware settings like mesh sizes.
We first define an agent_cfg (see the picture above) that tells the system which game to play (e.g., Sokoban or Tetris), how the LLM should talk (chat template + reasoning style), and its budgets (max turns, tokens per turn, action format). On the Tunix side, we then load a pre-trained model into three roles (actor, critic, and reference), build a ClusterConfig to specify rollout and training configs, and a PpoConfig to specify RL hyperparameters. The glue is minimal and the layout is clear and research-friendly: once agent_cfg, ppo_cfg, and cluster_cfg are defined, we construct an RLCluster and pass everything into PpoLearner, which gives us a complete multi-turn PPO trainer in JAX.
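Put together, the wiring looks roughly like the sketch below. The class names (ClusterConfig, PpoConfig, RLCluster, PpoLearner) come from the description above, but the import paths and constructor arguments shown here are assumptions and may differ from the current Tunix API; treat it as a map of the layout rather than copy-paste code (the real configs live in the GRL repo).

```python
# Rough layout of the GRL + Tunix glue described above.
# Import paths and constructor arguments are assumptions, not the exact Tunix API.
from tunix.rl import rl_cluster as rl_cluster_lib   # assumed module path
from tunix.rl.ppo import ppo_learner                 # assumed module path

# 1. Game-agent side: which game, how the LLM talks, and its budgets.
agent_cfg = dict(
    game="sokoban",               # or "tetris"
    chat_template="qwen2.5",      # chat + reasoning style
    max_turns=5,                  # turn budget per episode
    max_tokens_per_turn=512,
    action_format="text",         # how actions are parsed from the model output
)

# 2. Tunix side: the same pre-trained checkpoint loaded into three roles.
actor = ...      # policy being trained (e.g., Qwen2.5-7B-Instruct)
critic = ...     # value model for PPO
reference = ...  # frozen reference model for the KL penalty

cluster_cfg = rl_cluster_lib.ClusterConfig(...)  # rollout + training settings (mesh, batching, ...)
ppo_cfg = ppo_learner.PpoConfig(...)             # PPO hyperparameters (clip ratio, KL coeff, ...)

# 3. Minimal glue: the cluster plus the learner give a multi-turn PPO trainer in JAX.
rl_cluster = rl_cluster_lib.RLCluster(
    actor=actor, critic=critic, reference=reference, cluster_config=cluster_cfg
)
trainer = ppo_learner.PpoLearner(rl_cluster=rl_cluster, ppo_config=ppo_cfg)
# trainer then drives the multi-turn rollout + PPO update loop described below.
```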
Our multi-turn RL workflow is equally lightweight from the user's point of view. For example, with a 5-turn budget, the trainer repeatedly lets the LLM "play" the game for up to five conversational turns: at each turn it sees the current grid or state, reasons in language using the chat template, outputs a series of actions, and receives the next state plus a verifiable reward signal (win/loss/score/step penalty). GRL's agent and environment configs handle all the orchestration: they log observations, actions, and rewards into structured trajectories, which Tunix then turns into token-level advantages and returns for PPO updates. You don't manually build datasets or rollouts; the trainer owns the loop: interact -> log -> compute rewards -> update policy -> repeat.
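Conceptually, that loop looks something like the following sketch. It is purely illustrative: env, llm, and ppo_update are stand-ins for the configured game environment, the policy being trained, and Tunix's PPO step, not GRL's actual implementation.

```python
# Illustrative interact -> log -> compute rewards -> update loop owned by the trainer.
def collect_trajectory(env, llm, max_turns=5):
    """Let the LLM play one episode and log a structured trajectory."""
    trajectory = []
    obs = env.reset()                        # initial grid / game state
    for _ in range(max_turns):
        prompt = env.render_prompt(obs)      # state rendered through the chat template
        response = llm.generate(prompt)      # language reasoning + a series of actions
        actions = env.parse_actions(response)
        obs, reward, done = env.step(actions)  # verifiable signal: win/loss/score/step penalty
        trajectory.append({"obs": obs, "response": response, "reward": reward})
        if done:
            break
    return trajectory

def train(env, llm, ppo_update, num_iterations, rollouts_per_batch=32):
    """Outer loop: collect rollouts, then hand them to the PPO update."""
    for _ in range(num_iterations):
        batch = [collect_trajectory(env, llm) for _ in range(rollouts_per_batch)]
        ppo_update(llm, batch)   # Tunix converts trajectories into token-level
                                 # advantages/returns and applies the PPO update
```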
In our preliminary experiments using this setup, training Qwen2.5-7B-Instruct on Sokoban and Tetris yielded strong in-domain gains (+2-56% across game variants). We also observed modest generalization to out-of-domain tasks, with consistent improvements in planning tasks (Blocksworld: +3-7%) and positive but unstable signals in computer use (Webshop: ~+6%). All scripts and configs are available in the GRL repo: https://github.com/lmgame-org/GRL/tree/main; see the Get Started section below to reproduce the end-to-end Tunix + GRL training example, including our Sokoban/Tetris runs.
Google TRC & TPUs: Accelerating Game-Based RL at Scale
A critical component of our research was the Google TPU Research Cloud (TRC) program. Access to Cloud TPUs allowed us to move from small-scale prototypes to production-grade training runs with minimal friction.
TPUs and JAX directly addressed our biggest bottlenecks:
- Rollout Throughput: Using the vLLM-TPU path via tpu-inference, we could serve multiple model families on the same TPU v5p backend. This boosted sampling throughput, making the data-collection loop tighter and multi-environment concurrency cheaper.
- Multi-Host Scale for 7B Models: Tunix's lightweight design, combined with JAX's mesh-based sharding, allowed us to scale the same code from a single host to multi-host setups declaratively (see the short JAX sketch after this list). This capability was essential for our experiments with 7B-parameter models such as Qwen2.5-7B, where we leveraged 2 v5p-8 hosts with minimal code change (in practice, only an environment-variable change). The scale-up was seamless, showing that the infrastructure can handle the heavy computational lifting required for modern LLM post-training without complex engineering overhauls.
- Hardware Advantage: At the hardware level, the performance gains were significant. Each TPU v5p chip delivers around 459 BF16 TFLOPs, compared to roughly 312 on an NVIDIA A100. This raw power, combined with the TRC program's support, meant that large-N studies—involving more seeds, longer horizons, and more environments—became routine experiments rather than "special ops" engineering challenges.
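To make the "declarative" scaling point concrete, here is a small JAX sketch of mesh-based sharding, referenced from the multi-host bullet above. The 2x2 device reshape and the sharding axes are illustrative for a small slice, not the exact topology or partitioning of our runs; the key property is that the same code works whether jax.devices() reports chips from one host or several.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever chips are visible into a named 2D mesh (here assuming 4 chips).
devices = np.array(jax.devices()).reshape(2, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard a (batch, hidden) array: batch across "data", hidden across "model".
x = jnp.ones((128, 4096))
x = jax.device_put(x, NamedSharding(mesh, P("data", "model")))

@jax.jit
def scaled(x):
    # jit compiles a single program that runs partitioned across the mesh.
    return x * 2.0

y = scaled(x)
print(y.sharding)  # the result stays sharded across the same mesh
```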
This combination of Tunix's flexible abstraction and TRC's massive compute resources allowed us to iterate quickly on ideas while benefiting from production-grade infrastructure.
Get Started
GRL and Tunix are open for the community to explore. You can reproduce our end-to-end training example (including the Sokoban/Tetris runs) by cloning the repo, following the installation instructions, and then running a single command:
bash tunix_quick_training_example.sh