opensource.google.com

In-place Pod restarts: Boosting efficiency and workload reliability in Kubernetes v1.36

Thursday, June 18, 2026

Operational efficiency and system resilience are critical when running platforms at scale. Yet in Kubernetes, recovering from software crashes has long been a headache: you could not trigger a clean restart of a Pod's containers without recreating the entire Pod object, which wastes resources.
To address this, Restart All Containers on Container Exits graduated to beta and is enabled by default in Kubernetes v1.36. Developed in close collaboration with the CNCF community, this capability represents Google's commitment to investing in the success of foundation-led open source projects. By sharing best practices from running large distributed systems internally, we are helping build a more resilient and efficient ecosystem. Letting containers restart while keeping the Pod's runtime identity provides a built-in way to perform in-place Pod recovery, boosting application reliability and saving resource costs.

The Problem: The High Cost of Pod Re-creation

Historically, Kubernetes managed failures using Pod-level restart policies. While sufficient for simple services, modern multi-container Pods often have complex dependencies. When a failure required a full environment reset, your only option was to delete and recreate the entire Pod.
This introduces massive control plane churn, causing latency and pressure on the etcd backend during large failures:

  • Initialization Dependencies: If a main container corrupts its local environment (for example, single-use secrets that must be re-requested), restarting just that container is insufficient; the setup must run again.
  • Watcher Interoperability: If a watcher sidecar detects a fatal error, it must trigger a full re-creation of the entire Pod and its infrastructure, including the sandbox.
  • Stale States: If a database sidecar proxy restarts, the main application can get stuck attempting to use stale, broken connections.
  • Resource Race Conditions: When a large job finds a proper set of nodes, recreating Pods can lead to other pending Pods taking over those resources. In-place restarts eliminate this race condition risk.

Previously, resolving these failures required destroying the entire Pod. For large batch or AI/ML workloads, where thousands of Pods might fail simultaneously, this can lead to "Thundering Herd" scheduling requests, delaying recovery and wasting expensive GPU/TPU compute time.

Introducing In-Place Restarts: The RestartAllContainers Action

Kubernetes v1.35 introduced the RestartAllContainers action behind the RestartAllContainersOnContainerExits feature gate, which graduated to beta in v1.36 alongside its dependencies ContainerRestartRules and NodeDeclaredFeatures. This lets a container's exit behavior trigger a fast, in-place restart of the entire Pod on its existing node.
The Kubelet halts all containers while keeping the Pod sandbox intact, preserving critical infrastructure:

  • Network Identity: Keeps the same IP, network namespace, and UID, completely bypassing IP reassignment.
  • Hardware and Devices: Keeps GPUs/TPUs bound, eliminating scheduling and re-allocation delays.
  • Storage Mounts: Volumes, including emptyDir and PVCs, remain fully mounted; their content is not cleared during restarts.

Once all containers have terminated, the Kubelet re-runs the init containers (including sidecars, which are part of the init sequence) in order, guaranteeing a clean setup in a known-good environment.

A Native Pod Specification Example

You can implement this under the container's restartPolicyRules field. Here is a quick example of how a watcher sidecar can trigger an in-place restart of the entire Pod by exiting with code 88:
Note: Image names and paths in the YAML below are for illustrative purposes.

apiVersion: v1
kind: Pod
metadata:
  name: ml-worker-pod
spec:
  restartPolicy: Never
  initContainers:
    - name: setup-environment
      image: registry.k8s.io/ml-tools/setup-worker:v1.0
    - name: watcher-sidecar
      image: registry.k8s.io/ml-tools/watcher:v1.0
      restartPolicy: Always
      restartPolicyRules:
        - action: RestartAllContainers
          exitCodes:
            operator: In
            values: [88]
  containers:
    - name: main-application
      image: registry.k8s.io/ml-tools/training-app:v1.0
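
As a rough illustration of how an exit-code rule like the one above selects an action, the Python sketch below models a rule as an action plus an operator and a list of values. It is not Kubelet code: ExitCodeRule and matching_action are hypothetical names, and both the NotIn operator and the first-match-wins evaluation order are assumptions of this sketch rather than details stated in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExitCodeRule:
    """Hypothetical in-memory model of one restartPolicyRules entry."""
    action: str        # e.g. "RestartAllContainers"
    operator: str      # "In" (from the example above) or "NotIn" (assumed)
    values: List[int]  # exit codes the operator is applied to


def matching_action(rules: List[ExitCodeRule], exit_code: int) -> Optional[str]:
    """Return the action of the first rule that matches exit_code, else None."""
    for rule in rules:
        if rule.operator == "In":
            matched = exit_code in rule.values
        elif rule.operator == "NotIn":
            matched = exit_code not in rule.values
        else:
            raise ValueError(f"unknown operator: {rule.operator}")
        if matched:
            return rule.action
    return None


# Mirrors the watcher-sidecar rule from the Pod spec above.
watcher_rules = [ExitCodeRule("RestartAllContainers", "In", [88])]
print(matching_action(watcher_rules, 88))  # RestartAllContainers
print(matching_action(watcher_rules, 1))   # None
```

When no rule matches, the sketch returns None, standing in for whatever the container's regular restart policy would decide.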

The Operational Impact of In-Place Restarts

For organizations running distributed workloads, RestartAllContainers provides serious operational advantages:

  • No Control Plane Overhead: By preserving identity, clusters avoid scheduling latency and DNS propagation. This was a key factor for JobSet using this feature to reduce recovery from minutes to seconds.
  • Node Locality Preservation: Since the Pod stays anchored to the same node, restarted containers can instantly access local, warm storage caches.
  • Maximized Hardware Efficiency: In distributed AI training, losing a single node halts the entire job. Keeping accelerators like GPUs/TPUs bound lets workloads resume training significantly faster, directly reducing compute costs.

Observability and SRE Best Practices

To support monitoring, Kubernetes v1.35 introduces the AllContainersRestarting Pod condition. It is set to True while an in-place restart is in progress, signaling to SREs and autoscalers that the disruption is expected and helping prevent false-positive alerts. Container restart counts still increment, so tools such as Prometheus can track recovery events.
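
A monitoring script could check for this condition before raising an alert. The sketch below is one possible way to do that in Python, assuming the Pod has already been loaded as a dictionary shaped like the JSON form of a Pod object; the function name and the sample Pod are invented for illustration.

```python
def is_restarting_in_place(pod: dict) -> bool:
    """Return True if the Pod reports AllContainersRestarting with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        cond.get("type") == "AllContainersRestarting" and cond.get("status") == "True"
        for cond in conditions
    )


# Invented sample data, trimmed to the fields this check reads.
sample_pod = {
    "metadata": {"name": "ml-worker-pod"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False"},
            {"type": "AllContainersRestarting", "status": "True"},
        ]
    },
}

if is_restarting_in_place(sample_pod):
    print("in-place restart in progress; suppressing alert")
```
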
To use in-place restarts successfully, shift your mental model to "persistent sandboxes" and follow three best practices:

  1. Ensure Reentrancy: Kubelet only guarantees "at least once" execution for init containers. Reentrancy is now a standard requirement, so your code must be fully idempotent.
  2. Plan for Termination Handling: Graceful termination (preStop hooks) is not supported for in-place restarts. SIGKILL is almost immediate, so applications must handle sudden exits gracefully.
  3. Prepare External Tooling: CD and observability tools should expect re-running init containers without interpreting them as new deployments.
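
The first two practices can be sketched together in Python. This is a minimal example under stated assumptions, not a prescribed pattern: run_setup, write_atomically, the config.json file name, and the config values are all hypothetical. The setup step produces the same result however many times it runs, and it writes through a temporary file so that a sudden SIGKILL cannot leave a half-written file behind.

```python
import json
import os
import tempfile


def write_atomically(path: str, data: dict) -> None:
    """Write JSON so readers see the old file or the new one, never a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_setup(workdir: str) -> dict:
    """Idempotent init step: safe to run once or many times."""
    os.makedirs(workdir, exist_ok=True)
    # Clear temp files left by an earlier run that was killed mid-write.
    for name in os.listdir(workdir):
        if name.startswith(".tmp-"):
            os.remove(os.path.join(workdir, name))
    config = {"shards": 4, "ready": True}  # hypothetical values
    write_atomically(os.path.join(workdir, "config.json"), config)
    return config


with tempfile.TemporaryDirectory() as workdir:
    first = run_setup(workdir)
    second = run_setup(workdir)  # simulates the init container re-running
    print(first == second, sorted(os.listdir(workdir)))
```
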

What's Next?

This beta capability is a major step toward fluid workload management and serves as a building block for advanced community features like JobSet in-place restarts (KEP-467).
Our work on KEP-5532 reflects our commitment to transparent open source governance. Developed collaboratively within SIG Node, this feature shows how we hold ourselves to high citizenship standards: making our design, goals, and intentions transparent while building shared best practices that benefit everyone. We encourage you to experiment with Kubernetes v1.36 and share your feedback with the community!

Open rails for agentic commerce at Open Source Summit North America 2026

Tuesday, June 16, 2026

At Open Source Summit North America 2026, I shared why agentic commerce needs open rails.

As AI agents become more capable, the shopping journey is shifting from "show me" to "help me." Instead of browsing, comparing, clicking, and checking out step by step, people can increasingly ask an agent to help them decide what to buy and, in some cases, complete the purchase. Industry forecasts suggest agentic shopping could account for roughly 10% to 25% of U.S. e-commerce by 2030 (Bain), which points to a meaningful shift in how digital commerce will work. Watch the full keynote here.

Why shared rules matter

That shift also exposes a challenge. Commerce is still highly fragmented. Different businesses, payment providers, and platforms operate with their own rules, workflows, and business logic. Every new surface adds more integration work. Every bespoke connection creates more complexity. And that fragmentation makes it harder for AI systems to understand and perform commerce actions consistently across businesses. A shared language lowers that barrier for everyone.

A common language for agentic commerce

That is the problem Universal Commerce Protocol (UCP) is designed to solve.

We launched the Universal Commerce Protocol, or UCP, with industry leaders to establish an open standard for agentic commerce, built to work across the shopping journey. UCP creates a common language for agents and systems to operate together across consumer surfaces, businesses, and payment providers, so the ecosystem does not need a different bespoke integration for every new agent or platform.

Just as importantly, UCP is designed for the real world. Every business has its own way of selling. Checkout, fulfillment, loyalty, policy logic, shipping, and post-purchase flows can vary widely between a local shop, a marketplace, and a large retailer. UCP is built to support that reality.

A diagram of the Universal Commerce Protocol (UCP), subtitled 'The common language for platforms, agents and businesses.' It illustrates a central UCP framework containing modules for 'Shopping' and 'Common' services, flanked by 'Consumer platforms' on the left and 'Business platforms' on the right, with bidirectional arrows showing how they connect and communicate through the central protocol.

A layered architecture for a shared commerce language

UCP uses a layered model to create a reusable shared language for commerce. Services organize domains like shopping and common. Capabilities define core actions such as checkout, catalog, cart, orders, and shared functions like identity linking. Extensions keep those capabilities configurable, so features like fulfillment can be modeled once and reused across multiple flows instead of being hardwired each time. At the transport layer, UCP stays agnostic, supporting bindings like REST, Model Context Protocol, and Agent2Agent.

Together with capability discovery and payment handling, these layers help consumer platforms, agents, and businesses interoperate more consistently over time. They also let different participants advertise what they support, compose new behaviors, and communicate over the transport that works best for them.
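
To make the idea of advertising and discovering support concrete, here is a small Python sketch. It is not the UCP schema or wire format: the Profile class, the negotiate function, and the lowercase labels are hypothetical, chosen only to echo the capabilities and transports named above. Each participant lists what it supports, and the two sides settle on the shared capabilities plus the first transport both understand.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Profile:
    """Hypothetical summary of what one participant advertises."""
    name: str
    capabilities: FrozenSet[str]
    transports: Tuple[str, ...]  # ordered by preference


def negotiate(agent: Profile, business: Profile) -> Tuple[List[str], Optional[str]]:
    """Return the shared capabilities and the agent's preferred common transport."""
    shared = sorted(agent.capabilities & business.capabilities)
    transport = next((t for t in agent.transports if t in business.transports), None)
    return shared, transport


agent = Profile("shopping-agent", frozenset({"checkout", "catalog", "cart"}), ("mcp", "rest"))
shop = Profile("local-shop", frozenset({"checkout", "catalog", "orders"}), ("rest",))
print(negotiate(agent, shop))  # (['catalog', 'checkout'], 'rest')
```
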

Built in the open

A standard for everyone should be shaped by everyone. Because UCP is open, merchants, developers, and community contributors can pressure-test real-world gaps, propose new capabilities and extensions, and help make sure the protocol reflects more than the needs of the largest players. That kind of participation is what keeps an ecosystem moving.

Since launch, UCP has continued to evolve through new capabilities, an expanded Tech Council, and new consumer experiences built on top of the protocol. That momentum matters because standards only work when the ecosystem uses them.

Watch the full keynote

Agentic commerce is still evolving, and UCP is a foundational building block to support what's next in this new era.

If you want the full architecture walkthrough and the complete story from Open Source Summit North America, watch the session here. And if you want to go deeper, you can explore the UCP documentation, join the community conversation, and contribute to the public repository.

CEL finds a new home at github.com/cel-expr!

We're excited to announce that the official Common Expression Language (CEL) repositories have moved to a dedicated GitHub organization. Visit the new cel-expr organization now!

Why the move?

This move is a key step in strengthening the CEL ecosystem. By centralizing our projects, including the language specification, Go, C++, C, Java, and Python implementations, under the cel-expr organization, we aim to:

  • Enhance Branding: Create a clear and unified brand identity for CEL.
  • Improve Discoverability: Make it easier for users and contributors to find all official CEL resources in one place.
  • Ensure Consistency: Foster consistency across all CEL projects.
  • Streamline Development: Simplify our development and release processes.

What's Changing?

The following repositories now reside in the cel-expr organization:

  • google/cel-spec is now cel-expr/cel-spec
  • google/cel-cpp is now cel-expr/cel-cpp
  • google/cel-go is now cel-expr/cel-go
  • google/cel-java is now cel-expr/cel-java
  • cel-expr/cel-python and cel-expr/cel-c were already in the cel-expr namespace

All future development, issues, and pull requests for these projects will take place in their new homes within the cel-expr organization. This is a non-breaking change, due to automatic redirects, but you should update your URLs where possible.

What Stays the Same?

We've worked to make this transition as seamless as possible:

  • Automatic Redirects: GitHub will automatically redirect all web traffic and git operations from the old google/cel-* URLs to the new cel-expr/cel-* locations. Your existing links and git remote configurations pointing to the old URLs should continue to work for cloning and fetching.
  • Preserved History: The full commit history, issues, and pull requests for each repository have been migrated and are available in the new locations.

Action Required: Update Your Dependencies

While existing links and git remote configurations pointing to the old URLs should continue to work thanks to GitHub's redirects, we recommend updating your dependency management configurations (e.g., go.mod, pom.xml, requirements.txt, etc.) to point directly to the new repository URLs under https://github.com/cel-expr. This ensures you are fetching the latest code and releases from the canonical source.
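
For bulk updates, a short script can rewrite old repository URLs in documentation or configuration text. The Python sketch below is one way to do it, assuming the old URLs appear in full https://github.com/google/cel-* form; update_cel_urls is a hypothetical helper, and it only touches the four repositories listed as moved above.

```python
import re

# Repositories listed above as moved from google/ to cel-expr/.
MOVED_REPOS = ("cel-spec", "cel-cpp", "cel-go", "cel-java")

# Match the old URL only when the repository name ends there, so names that
# merely start with a moved repository's name are left alone.
_OLD_URL = re.compile(
    r"https://github\.com/google/("
    + "|".join(re.escape(repo) for repo in MOVED_REPOS)
    + r")(?![\w-])"
)


def update_cel_urls(text: str) -> str:
    """Rewrite old google/cel-* repository URLs to their cel-expr/ locations."""
    return _OLD_URL.sub(r"https://github.com/cel-expr/\1", text)


print(update_cel_urls("git clone https://github.com/google/cel-go.git"))
# git clone https://github.com/cel-expr/cel-go.git
```

URLs for other repositories pass through unchanged, so the helper can be run over a whole file.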

We're thrilled about this new chapter for CEL, bringing all our core components under one roof. We believe this will foster a stronger CEL community and accelerate the development and adoption of CEL.

Explore the new organization at https://github.com/cel-expr!

A new pkg.go.dev API for Go

Friday, June 12, 2026

Access to Go metadata has been an ever-present need for the Go community. Since its launch, pkg.go.dev has served as a central hub for Go package documentation and discovery. While we initially prioritized providing this comprehensive access via a web interface, the need for streamlined programmatic access has become increasingly clear.

Structured API access has been one of the most highly requested features for pkg.go.dev for a while now. Developers building tools, IDE integrations, automated workflows, and other systems have had to rely on inconsistent and fragile scraping methods. By providing a formal API, we can provide fast and efficient access to required data. This foundation also sets Go up for the future of AI-assisted coding. Large language models and agents can access the context necessary to reason about the Go ecosystem with greater precision and accuracy.

Empowering Tool Builders

Our goal with this API is to reduce the technical churn for builders and innovators. By offering structured JSON metadata, we address the following use cases:

  • Search and Discovery: The API enables fast and efficient search across the entire Go module ecosystem.
  • Driving AI Innovation: As AI-assisted coding evolves, LLMs and agents need precise context. This API provides the data required for agents and models to reason deterministically about Go packages.

The Service Interface

Built for stability and efficient caching, the API uses a stateless, GET-only architecture. Primary endpoints are currently hosted under the v1beta path. Following a period of feedback from the Go community and confirmed stability, we intend to transition toward a formal v1 release.

For a complete interactive reference of all endpoints, query parameters, and response shapes, see pkg.go.dev/api. The machine-readable API contract is also published directly at pkg.go.dev/v1beta/openapi.yaml.

Endpoint                      Description
/v1beta/imported-by/{path}    Paths of packages importing the package at {path}.
/v1beta/module/{path}         Information about the module at {path}.
/v1beta/package/{path}        Information about the package at {path}.
/v1beta/packages/{path}       Information about packages of the module at {path}.
/v1beta/search?q={query}      Search results for a given query.
/v1beta/symbols/{path}        List of symbols declared by the package at {path}.
/v1beta/versions/{path}       Versions of the module at {path}.
/v1beta/vulns/{path}          Vulnerabilities of the module or package at {path}.

An example of retrieving package information is shown below:

curl https://pkg.go.dev/v1beta/package/github.com/google/go-cmp/cmp | jq
{
  "modulePath": "github.com/google/go-cmp",
  "version": "v0.7.0",
  "isLatest": true,
  "isStandardLibrary": false,
  "goos": "all",
  "goarch": "all",
  "path": "github.com/google/go-cmp/cmp",
  "name": "cmp",
  "synopsis": "Package cmp determines equality of values.",
  "isRedistributable": true
}
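
The same lookup can be scripted. The Python sketch below is a minimal example rather than an official client: package_url, fetch_package, and summarize are hypothetical helper names, and only the fields visible in the response above are assumed to exist. The demonstration at the end parses a copy of that sample response, so it runs without network access.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pkg.go.dev/v1beta"


def package_url(import_path: str) -> str:
    """Build the package endpoint URL for a Go import path."""
    return f"{BASE_URL}/package/{quote(import_path, safe='/')}"


def fetch_package(import_path: str, timeout: float = 10.0) -> dict:
    """Fetch package metadata over HTTP (requires network access)."""
    with urllib.request.urlopen(package_url(import_path), timeout=timeout) as resp:
        return json.load(resp)


def summarize(info: dict) -> str:
    """Format a one-line summary from a package response."""
    latest = " (latest)" if info.get("isLatest") else ""
    return f'{info["path"]} {info["version"]}{latest}: {info["synopsis"]}'


# Copy of the sample response shown above, used instead of a live request.
SAMPLE = json.loads("""
{
  "modulePath": "github.com/google/go-cmp",
  "version": "v0.7.0",
  "isLatest": true,
  "path": "github.com/google/go-cmp/cmp",
  "name": "cmp",
  "synopsis": "Package cmp determines equality of values."
}
""")

print(package_url("github.com/google/go-cmp/cmp"))
print(summarize(SAMPLE))
```

To query the live service, call fetch_package("github.com/google/go-cmp/cmp") and pass the result to summarize.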

A Reference Implementation

To demonstrate how to interact with our API, we are providing a reference CLI implementation: pkgsite-cli. This implementation serves as a practical example for developers looking to build their own integrations, showing how to handle the data directly from the terminal. Note that as the API continues to evolve, the interface and behavior of this CLI may change.

You can use it to search for packages or inspect symbols without leaving your shell:

go install golang.org/x/pkgsite/cmd/internal/pkgsite-cli@latest

pkgsite-cli search "uuid"
github.com/google/uuid
  Module:   github.com/google/uuid@v1.6.0
  Synopsis: Package uuid generates and inspects UUIDs.
... more


pkgsite-cli package github.com/google/go-cmp/cmp
github.com/google/go-cmp/cmp
  Name:      cmp
  Module:    github.com/google/go-cmp
  Version:   v0.7.0 (latest)
  Synopsis:  Package cmp determines equality of values.

pkgsite-cli package --symbols github.com/google/go-cmp/cmp
github.com/google/go-cmp/cmp
  Name:     cmp
  Module:   github.com/google/go-cmp
  Version:  v0.7.0 (latest)
  Synopsis: Package cmp determines equality of values.

Symbols:
  type Indirect struct{}
  type MapIndex struct{}
  type Option interface{}
  ... more

Looking Ahead

While we prioritize stability for our new /v1beta endpoints, we are eager to hear how open source communities use these resources to solve real-world problems.

We look forward to your feedback via our issue tracker and to seeing the tools you’ll build next.

Introducing OpenRL: A self-hosted post-training API for fine-tuning LLMs

Thursday, June 11, 2026

We are pleased to share a research preview of OpenRL, a new open-source project coming out of GKE Labs. OpenRL is a self-hosted training API for fine-tuning LLMs on your own Kubernetes cluster.

Why we built it

If you look at agentic RL on LLMs, it is incredibly easy to get bogged down in system complexity. To run a single RL loop, you have to coordinate a dozen different things: selecting and cleaning datasets, choosing RL environments, debugging training loops, managing reward signals, handling inference mismatches, allocating hardware, and managing infrastructure. The picture looks something like this:

Figure shows an AI researcher and an infrastructure engineer staring at the hurdles in post training along the way to the summit.

Each of these is a hard problem. But what makes it more complex is how tightly AI research and infrastructure concerns are mixed together in today's tooling and frameworks.

We believe decoupling the infrastructure from AI research can make these problems more tractable, so that infrastructure engineers and AI researchers can tackle them independently. We have seen this pattern with Kubernetes, which abstracted away the infrastructure and made life easier for application developers and SREs.

So, can you abstract out post-training infrastructure? We believe so, and we drew huge inspiration and validation from Tinker (from Thinking Machines). The Tinker APIs for post-training hit that Goldilocks zone: they hide all the post-training infrastructure behind four key APIs:

Figure shows the high-level components and their interaction in an OpenRL-based RL workflow.

The end result of this abstraction is that AI researchers get full flexibility over their RL loop, while infrastructure engineers can focus on scaling, orchestration, and reliability. OpenRL lets you run the same training APIs on your own infrastructure. This decoupling has other interesting benefits as well.

Sharing GPUs

Traditional RL loops are strictly sequential. The trainer waits for the sampler to finish rollouts, the sampler waits for the environment to score rewards (which is often bound by slow CPU/network tasks), and the whole loop sits blocked. Your expensive GPUs spend a lot of time doing nothing. The abstraction allows running multiple RL jobs and allows infrastructure engineers to pack the training/sampling steps to utilize more of their GPUs. The graph below shows the GPU consumption in OpenRL for running one, two, and three RL jobs concurrently.

The figure shows the trainer/sampler duty cycle in OpenRL for scenarios with 1 RL job, 2 RL jobs, and 3 RL jobs respectively.
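
A back-of-the-envelope model shows why packing jobs helps. The Python sketch below is a toy calculation, not a measurement from OpenRL: it assumes each job alternates between a fixed amount of GPU work and a fixed wait on CPU or network tasks, that jobs interleave perfectly on one GPU, and that the 30-second and 70-second figures are made-up inputs. Under those assumptions the result is an idealized upper bound on how busy the GPU can be.

```python
def gpu_busy_fraction(jobs: int, gpu_seconds: float, wait_seconds: float) -> float:
    """Upper bound on GPU utilization when `jobs` identical RL loops share one GPU.

    Each loop iteration needs gpu_seconds of GPU time, then waits wait_seconds
    on non-GPU work (for example, reward scoring). Perfect packing is assumed.
    """
    if jobs < 1 or gpu_seconds <= 0 or wait_seconds < 0:
        raise ValueError("jobs must be >= 1, gpu_seconds > 0, wait_seconds >= 0")
    cycle = gpu_seconds + wait_seconds
    return min(1.0, jobs * gpu_seconds / cycle)


# Hypothetical workload: 30 s of GPU work, then 70 s blocked on the environment.
for n in (1, 2, 3):
    print(n, f"{gpu_busy_fraction(n, 30, 70):.0%}")
```

With these inputs a single job keeps the GPU busy 30% of the time, and three interleaved jobs raise the bound to 90%; real schedules will fall short of the bound because phases rarely line up perfectly.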

Better UX

Once you separate the infrastructure behind the APIs, developing the RL loop becomes a better experience, because AI researchers no longer have to wrangle complex Python dependencies such as CUDA. When you are doing R&D, you do not have to run the RL loop directly on the machines with GPUs; you can simply run it on your Mac, pointing to the training APIs running on a Kubernetes cluster or VMs.

Autoresearch

We believe that frontier AI research will become more and more automated, and that abstracting out infrastructure as a building block is key to that. To demonstrate this, we added an autoresearch recipe inspired heavily by Karpathy's work. The recipe demonstrates how to run parallel experiments to perform a parameter sweep and improve the reward signal for our text-to-SQL recipe for Gemma models.

Figure showing autoresearch UI with multiple AI researchers conducting experiments in parallel in OpenRL

What OpenRL is not

  • A managed service. OpenRL is self-hosted. We aim to make it easy for users to deploy and operate it on their own Kubernetes clusters.
  • An RL framework. OpenRL gives AI researchers full control over their RL loop.

Get started

We have made it easy to run OpenRL on your Mac, Nvidia GPUs, or on GKE. This allows you to test your RL loop on Mac and when you are ready to scale, you can point the RL loop to the OpenRL endpoint running in the GKE cluster.

Try out our text-to-SQL example for teaching the latest Gemma model SQL here: guides.

One of the benefits of a Tinker-compatible endpoint is that you can use Tinker-Cookbook with OpenRL. Tinker-Cookbook is one of the best resources for post-training infrastructure for RL.

Future steps

We have started with a simple architecture focusing on LoRA fine-tuning and plan to evolve the project in the coming months, so please give it a try and share your feedback. A few things we are very excited to work on:

  • Full parameter fine-tuning
  • Multitenancy (simultaneous RL on different types of base models)

Acknowledgement

We have been inspired by the work of various open source projects in the AI community, so a huge thank you to Thinking Machines, vLLM, PyTorch, prime-rl, verl, SkyRL, and llm-d.
