Introduction
We are seeing increasing interest in Tunix among researchers focused on the post-training phase of model development. As a native JAX library, Tunix offers the flexibility needed to refine foundation models, not just LLMs but also Vision-Language Models (VLMs), and to help them significantly improve capabilities such as spatial reasoning.
Today, we are highlighting the work of the PLAN Lab (Perception and LANguage Lab) at the University of Illinois Urbana-Champaign (UIUC). To address the critical lack of spatial awareness in VLMs, they built SpatialReasoner-R1, a model capable of fine-grained spatial logic. They utilized Tunix and leveraged the Google TPU Research Cloud (TRC) to scale their experiments.
In this blog, Professor Ismini Lourentzou and her team explain how they used Tunix's modular design to implement novel alignment algorithms and improve spatial reasoning in VLMs.
The "Where" Problem in VLMs
Modern Vision-Language Models (VLMs) can describe images and answer basic visual questions with impressive fluency. However, they often struggle with fine-grained spatial understanding. If you ask a VLM to estimate distances, directions, or the precise relative positions of objects, it frequently "hallucinates" coordinates or produces vague answers backed by inconsistent reasoning.
These capabilities are critical for real-world applications, such as robotics, where precise spatial reasoning enables safe and intelligent interaction with physical environments.
To bridge this gap, we developed SpatialReasoner-R1 (in 4B and 8B versions), a model trained to perform step-by-step, visually grounded spatial reasoning. On the SPATIALRGPT-Bench, our 8B fDPO model achieves 95.59 on Qualitative Accuracy and 77.3 on Quantitative Accuracy, outperforming the strongest baseline by ~9% in average accuracy while preserving strong general vision-language abilities.
The Method: Fine-Grained Direct Preference Optimization (fDPO)
The secret sauce behind SpatialReasoner-R1 is a new technique called Fine-Grained Direct Preference Optimization (fDPO).
Standard alignment methods (like DPO) usually give a model a simple "thumbs up" or "thumbs down" for an entire response. But spatial reasoning is complex: for example, a model might correctly identify an object yet make a flawed logical inference about its location.
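For reference, standard DPO scores each preference pair with a single sequence-level term:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is a frozen reference policy, and $\beta$ scales the implicit reward. The whole response enters through one aggregate log-probability, so there is no way to reward a correct description while penalizing a flawed deduction within the same answer.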
fDPO introduces segment-specific preference granularity. We optimize separate loss components (sketched in code after this list) for:
- Descriptive Grounding: Does the model correctly perceive and describe the objects in the image?
- Logical Reasoning: Is the step-by-step deduction sound, and does it follow coherent spatial logic?
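To make the segment-level idea concrete, here is a minimal JAX sketch of a fine-grained preference loss. It assumes the summed log-probabilities of each segment (description vs. reasoning) have already been computed for the policy and reference models; the function names, the two-segment split, and the weighting scheme are illustrative assumptions, not the exact SpatialReasoner-R1 or Tunix implementation.

```python
import jax.numpy as jnp
from jax.nn import log_sigmoid

def segment_dpo_term(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """One DPO preference term restricted to a single response segment.

    Each argument is the summed log-probability of that segment's tokens
    under the policy or the frozen reference model (shape: [batch]).
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - rejected_reward)

def fdpo_loss(logps, beta=0.1, w_desc=1.0, w_reason=1.0):
    """Illustrative fine-grained DPO loss: one term per segment, then a
    weighted sum. `logps` maps a segment name ("description", "reasoning")
    to a dict of per-example log-prob arrays keyed by 'policy_chosen',
    'policy_rejected', 'ref_chosen', 'ref_rejected'. The weights and the
    segmentation are assumptions for illustration only.
    """
    total = 0.0
    for name, weight in (("description", w_desc), ("reasoning", w_reason)):
        s = logps[name]
        total += weight * segment_dpo_term(
            s["policy_chosen"], s["policy_rejected"],
            s["ref_chosen"], s["ref_rejected"], beta)
    return jnp.mean(total)
```

Because each segment contributes its own term, a response that grounds the scene correctly but reasons about it incorrectly is penalized only on the reasoning segment, rather than receiving one undifferentiated response-level signal.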
To generate high-quality training signals, we built a Multi-Model Monte Carlo Tree Search (M3CTS) data generation pipeline, which constructs diverse reasoning trajectories that guide the model toward reliable spatial understanding.
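To give a sense of what the resulting training signal can look like, here is a hypothetical preference record with the two segment types marked. The field names, values, and structure are purely illustrative and do not reflect the released dataset schema.

```python
# Hypothetical fDPO training record; all names and values are illustrative.
preference_pair = {
    "image": "example_scene.jpg",
    "question": "How far is the mug from the edge of the counter?",
    "chosen": {
        "description": "A white mug sits near the front-left corner of the counter.",
        "reasoning": "Its base is roughly one mug-width, about 8 cm, from the edge, "
                     "so the distance is approximately 8 cm.",
    },
    "rejected": {
        "description": "A white mug sits near the front-left corner of the counter.",
        "reasoning": "The counter is about 60 cm deep, so the mug must be 60 cm from the edge.",
    },
}
```

In this made-up pair, both responses share the same correct description and differ only in the reasoning segment, which is exactly the situation the reasoning-specific loss term targets.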
Tunix: Modularity for Novel Research
Implementing a custom objective like fDPO can be difficult in rigid frameworks. Tunix addresses this by providing a well-structured and extensible DPOTrainer that makes it possible to introduce new alignment objectives without reengineering the training pipeline.
This modularity meant we could reuse the entire underlying training stack—sharding, data loading, and loop management—while injecting our novel research logic with just a small amount of well-contained code.
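As a rough sketch of that pattern (generic JAX/Optax code, not Tunix's actual trainer API), the research-specific piece reduces to the loss function handed to an otherwise unchanged update step:

```python
import jax
import optax

def make_update_step(loss_fn, optimizer):
    """Generic JAX/Optax update step. Everything here is boilerplate the
    framework already provides; only `loss_fn` carries the research logic."""
    @jax.jit
    def update(params, opt_state, batch):
        loss, grads = jax.value_and_grad(loss_fn)(params, batch)
        updates, opt_state = optimizer.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)
        return params, opt_state, loss
    return update

# Swapping standard DPO for fDPO is then a matter of passing a different
# loss function, e.g. one that computes per-segment log-probs and calls
# fdpo_loss from the sketch above (hypothetical wiring):
# update_step = make_update_step(fdpo_batch_loss, optax.adamw(1e-5))
```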
While our backbone model (Sa2VA) required specific architectural handling, the core fDPO algorithm is model-agnostic. We found the Tunix experience smooth and well-documented, making it easy to prototype and iterate on fine-tuning workflows without reinventing the wheel.
Google TRC & TPUs: Reliability at Scale
Training a model to reason over long horizons requires significant compute. The Google TPU Research Cloud (TRC) provided the infrastructure we needed to make large-scale training practical.
- Scalability: Tunix's integration with TPUs allowed us to scale our experiments seamlessly.
- Reliability: The system performed reliably across multiple TPU runs, which was essential for conducting large-scale spatial reasoning benchmarks.
- Support: The Google Tunix and TRC teams assisted with infrastructure setup and experiment design, helping us refine our multi-model exploration strategy.
Looking Ahead: Open Source Contributions
We believe that open-source, extensible tools like Tunix are vital for fostering innovation. They lower the barrier for researchers to experiment with new training objectives without rebuilding core infrastructure.
In that spirit, we contributed our fDPO implementation back to the Tunix ecosystem. We have open-sourced the core fDPO components, enabling the community to apply segment-specific preference optimization to their own models.
Get Started
You can explore our research and the tools we used below: