Last summer, we partnered with Google to share how Marin trained a fully open 8B foundation model using JAX and TPUs. Since then, our process hasn't changed much, but the scale has. Over the summer, we trained a 32B model entirely in the open, and most days there was just one person keeping the run moving.
Large-scale training is usually associated with big teams and bigger infrastructure. Large model releases typically have hundreds of authors. Marin tests a different hypothesis: using open source software and data, small teams can train serious foundation models if the tooling is good, the platform is stable, and the process is transparent. The Marin 32B run was our strongest validation yet.
A model built with one hand on the helm
Marin was started at Stanford University's Center for Research on Foundation Models with the goal of building radically open foundation models. In May, we released Marin 8B Base, which bested the popular Llama 3.1 8B Base on 14 of 19 benchmarks. Marin 8B was trained using Google TPU v4 and TPU v5e from the TPU Research Cloud.
Building on that success, we set out to build a 32B model starting in June. Our 32B training run followed Marin's usual "Tootsie Roll" style: start with a solid recipe, instrument heavily, and adapt mid-flight when necessary. That flexibility matters, because the first time you train at a larger scale, issues inevitably arise.
The timing, however, was less than ideal, as universities tend to empty out over the summer. Students graduate, get internships, go home, or travel the world. Marin was no different. By June, our team was down to one full-time research engineer, with a few PhD students providing guidance when they weren't busy with their dissertations. Nevertheless, we pushed forward.
To spoil the ending, the model turned out quite well. On release, Marin 32B Base was the best open-source base model, and it outperformed comparable open-weight models like Google's Gemma 3 27B PT on 24 of 42 base-model evaluations.
There were many bumps along the way, resulting in multiple mid-run corrections, but through it all Google's TPU infrastructure stayed rock-solid, and JAX's predictable performance let us iterate quickly. This meant that even with a tiny team, we could diagnose, patch, and continue training without losing momentum.
To be blunt: one researcher kept the 32B run alive all summer, juggling preemptible slices, rebuilding optimizer state, switching architectures, and generally shepherding ~6.4 trillion tokens across v5p and v4 pods—while mostly working on other Marin projects. The fact that this was possible speaks to the stability of the TPU platform and the maturity of the JAX/Marin stack.
The short version of a long summer
Our retrospective goes into much more detail about every spike, switch, and cooldown. Here's the condensed version.
We began with a Llama-3-style 32B backbone and our best 8B data mix, running on preemptible TPU v5p pods. Preemptions were predictable, and recovery was nearly automatic. As availability tightened, however, we moved to dedicated TPU v4 capacity. After a slight tweak to gradient checkpointing to accommodate the older hardware (made easy by JAX's built-in support), we were back up and running, and performance stayed excellent.
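For readers unfamiliar with that knob: JAX's `jax.checkpoint` (also known as `jax.remat`) lets you trade memory for recomputation layer by layer via a rematerialization policy. The sketch below is illustrative rather than Levanter's actual configuration; `block` and its parameter names are placeholders, and the two policies shown are just one plausible way a v5p-to-v4 memory trade-off might be expressed.

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-in for a transformer block; the real Levanter layer is more involved.
def block(params, x):
    h = jnp.dot(x, params["w_attn"])                  # placeholder for the attention projections
    return jax.nn.gelu(jnp.dot(h, params["w_mlp"]))   # placeholder for the MLP

# Roomier HBM (e.g. v5p): save matmul outputs so the backward pass recomputes less.
block_save_dots = jax.checkpoint(
    block, policy=jax.checkpoint_policies.dots_with_no_batch_dims_saveable)

# Tighter HBM (e.g. v4): save nothing and rematerialize the whole block during backprop.
block_remat_all = jax.checkpoint(
    block, policy=jax.checkpoint_policies.nothing_saveable)
```

In this sketch the entire "tweak" is the policy argument; the model function itself is untouched, which is what makes this kind of hardware switch quick.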
Around 70k steps, persistent loss spikes appeared. We tried clipping, update-norm guards, skip-step heuristics, "necromancy" (rebuilding optimizer state), and swapping in optimizers like Muon. Nothing helped. The model needed architectural support.
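To make one of those guards concrete, here is a minimal optax-style sketch of a skip-step heuristic: if the global gradient norm exceeds a threshold, the step's update is zeroed and the optimizer state is left untouched. This is not Marin's implementation; `skip_large_updates` and the `max_norm` threshold are hypothetical.

```python
import jax
import jax.numpy as jnp
import optax

def skip_large_updates(inner: optax.GradientTransformation,
                       max_norm: float) -> optax.GradientTransformation:
    """Skip any step whose global gradient norm exceeds max_norm (hypothetical helper)."""

    def init_fn(params):
        return inner.init(params)

    def update_fn(grads, state, params=None):
        updates, new_state = inner.update(grads, state, params)
        skip = optax.global_norm(grads) > max_norm
        # On a skipped step: emit zero updates and keep the previous optimizer state.
        updates = jax.tree_util.tree_map(
            lambda u: jnp.where(skip, jnp.zeros_like(u), u), updates)
        new_state = jax.tree_util.tree_map(
            lambda old, new: jnp.where(skip, old, new), state, new_state)
        return updates, new_state

    return optax.GradientTransformation(init_fn, update_fn)

# Example: wrap AdamW so steps with a gradient norm above 1.0 are dropped.
optimizer = skip_large_updates(optax.adamw(3e-4), max_norm=1.0)
```

(optax also ships `optax.apply_if_finite` for the related case of NaN/Inf updates.) As noted above, none of these interventions cured the spikes for us.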
So, we warm-started the run onto a Qwen3-style architecture, essentially the Llama 3 architecture with QK-Norm added to attention. After a brief loss bump, the spikes vanished. The model recovered to its expected trajectory within ~10 billion tokens and remained stable.
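QK-Norm applies a normalization (RMSNorm in Qwen-style models) to the query and key vectors, per head, before the attention logits are computed, which keeps the logits from growing unboundedly. Below is a minimal sketch of the idea, not Levanter's actual attention code; the function names and the `q_gain`/`k_gain` parameters are placeholders.

```python
import jax
import jax.numpy as jnp

def rms_norm(x, gain, eps=1e-6):
    # Normalize over the head dimension, the usual placement for QK-Norm.
    var = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
    return x * jax.lax.rsqrt(var + eps) * gain

def attention_logits(q, k, q_gain, k_gain):
    # q, k: [batch, heads, seq, head_dim]; q_gain / k_gain: learned [head_dim] scales.
    q = rms_norm(q, q_gain)   # QK-Norm: normalize the queries...
    k = rms_norm(k, k_gain)   # ...and the keys before the dot product.
    return jnp.einsum("bhqd,bhkd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
```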
Towards the end of training, it was time for a cooldown. When training LLMs, one "cools down" the model by lowering the learning rate and shifting the data mix toward higher-quality data. Our first cooldown surfaced two issues: contamination from a cached math dataset, and a training-loss phase shift caused by our linear-congruential shuffle. Switching to a Feistel-based shuffle fixed the latter completely. After cleaning the data, we re-ran the cooldown; this second cooldown was smooth and produced the final model.
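For context on that last fix: a Feistel network turns a keyed hash into a pseudorandom permutation of the example indices, so each example still appears exactly once per epoch but without the regular structure of a linear-congruential sequence. The sketch below illustrates the idea, with cycle-walking so the permutation covers exactly [0, n); it is not Marin's actual shuffle, and `feistel_permute`, the round count, and the choice of BLAKE2 as the round function are all assumptions.

```python
import hashlib

def feistel_permute(index: int, n: int, key: bytes, rounds: int = 4) -> int:
    """Map index to a pseudorandom position in [0, n) via a keyed Feistel permutation."""
    side = 1
    while side * side < n:          # smallest power of two whose square covers n
        side *= 2

    def round_fn(r: int, half: int) -> int:
        digest = hashlib.blake2b(key + bytes([r]) + half.to_bytes(8, "little"),
                                 digest_size=8)
        return int.from_bytes(digest.digest(), "little") % side

    x = index
    while True:
        left, right = divmod(x, side)
        for r in range(rounds):
            left, right = right, (left + round_fn(r, right)) % side
        x = left * side + right
        if x < n:                   # cycle-walk: re-encrypt until we land back in [0, n)
            return x

# A fresh key per epoch gives a fresh permutation; every index appears exactly once.
order = [feistel_permute(i, 10, key=b"epoch-0") for i in range(10)]
assert sorted(order) == list(range(10))
```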
The result: a strong, open 32B base model
Marin 32B Base is a competitive open-source base model. It outperformed Olmo 2 32B Base—the previous best fully open-source base model—on 32 of 42 tasks, and it performs especially well on knowledge-heavy evaluations like ARC, BoolQ, and PIQA.
Head-to-head, Marin 32B Base also beat Gemma 3 27B PT on 24 of 42 tasks, and its overall average rank places it alongside Qwen 2.5 32B and the newer Olmo 3 32B models. On our evaluation suite, Marin 32B Base actually ties Olmo 3 32B Base in win rate, despite Olmo 3 being trained by a much larger team and arriving a month later.
Mean rank across our evaluation suite (lower is better). Marin 32B Base lands in the top cluster of open(-weight) models, alongside Qwen 2.5 and Olmo 3, and ahead of Gemma 3 27B PT and Olmo 2 32B. Gray bars indicate open-weight models, while blue bars indicate open-source models.
While Olmo 3 32B Base now comfortably leads on math and coding benchmarks, Marin 32B Base holds its own and still leads on many knowledge QA evaluations. For a model trained with a fraction of the team size typically expected for a 30B-scale run, we're proud of where it landed.
Because Marin 32B Base (like Olmo 3 32B) is open source, the weights, code, data recipes, and every experimental detour are public. Anyone can reproduce, audit, or build on the work.
The stack that made it possible
TPU stability across large slices
During the run, we moved across preemptible v5p-512 slices coordinated with Cloud TPU Multislice, a v4-2048 slice for the long middle, and several mid-run architectural transitions. Throughout, TPUs were completely reliable for us: no mysterious hangs, no collective-op debugging. Preemptions were predictable and easy to recover from.
JAX + Levanter = predictable performance
Levanter builds on JAX's XLA compilation. In practice, what mattered most for us were deterministic restarts, stable MFU (model FLOPs utilization) at scale without custom kernels, and JAX's activation checkpointing, which made the v5p-to-v4 migration easy.
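One concrete example of what "deterministic restarts" can look like in JAX: derive all per-step randomness from a fixed seed plus the step number rather than from mutable RNG state, so a resumed run replays exactly the randomness the original would have produced. A minimal sketch, not necessarily Levanter's exact mechanism:

```python
import jax

base_key = jax.random.PRNGKey(0)  # fixed seed, recorded alongside the run config

def step_key(step: int) -> jax.Array:
    # (seed, step) -> key, with no mutable RNG state: a run resumed at step N
    # replays exactly the randomness the original run would have drawn at step N.
    return jax.random.fold_in(base_key, step)

# Data-order and dropout keys for step 70_000 are identical on every restart.
k_data, k_dropout = jax.random.split(step_key(70_000))
```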
Marin's experiment system
Marin logs every step of the experimental pipeline: hyperparameters, code versions, datasets, metrics, and artifacts. Even with architectural switches and restarts, the run never devolved into a tangle of scripts. And because it's all open, anyone can retrace or reproduce the training.
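As an illustration of the kind of record this implies (not Marin's actual API; the function and field names here are hypothetical), an experiment step can be captured as a content-addressed blob tying together config, code version, and data snapshot:

```python
import hashlib
import json

def log_experiment_step(config: dict, code_version: str,
                        dataset_fingerprint: str, metrics: dict) -> dict:
    """Hypothetical provenance record; Marin's real system is richer, this is only the shape of the idea."""
    record = {
        "config": config,               # hyperparameters for this phase of the run
        "code_version": code_version,   # e.g. the git commit the step was launched from
        "dataset": dataset_fingerprint, # e.g. a hash of the data manifest / mixture weights
        "metrics": metrics,             # loss curves, eval results, ...
    }
    blob = json.dumps(record, sort_keys=True).encode()
    record["id"] = hashlib.sha256(blob).hexdigest()  # same inputs always give the same ID
    return record
```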
What's next
Marin 32B Base is a strong base model, but we're not done. Here's what's coming next:
- A reasoning-optimized Marin 32B
- Hardened multislice TPU support for smoother preemptible training
- Exploring MoE variants for the next scale
- Continuing to release everything, including successes and failures, openly
Closing thought
Training a 32B model with a small team isn't about heroics but about using the right tools and infrastructure. TPUs' reliability, JAX's clarity and performance, and Marin's open, reproducible process provided the leverage we needed. If the 8B run showed that open labs can build credible models, the 32B run showed they can do it at scale: quietly, steadily, and with far fewer people than you might expect.
