opensource.google.com

Kubernetes goes AI-First: Unpacking the new AI conformance program

Monday, April 6, 2026

As AI workloads move from experimental notebooks into massive production environments, the industry is rallying around a new standard to ensure these workloads remain portable, reliable, and efficient.

At the heart of this shift is the launch of the Certified Kubernetes AI Conformance program.

This initiative represents a significant investment in common, accessible, industry-wide standards, ensuring that the benefits of AI-first Kubernetes are available to everyone.

How Kubernetes is Evolving for an AI-First World

Traditional Kubernetes was built for stateless, cloud-native applications. However, AI workloads introduce unique complexities that standard conformance doesn't fully cover:

  • Specific Hardware Demands: AI models require precise control over accelerators like GPUs and TPUs.
  • Networking and Latency: Inference and distributed training require low-latency networking and specialized configurations.
  • Stateful Nature: Unlike traditional web apps, AI often relies on complex, stateful data pipelines.

The AI Conformance program acts as a superset of standard Kubernetes conformance. To be AI-conformant, a platform must first pass all standard Kubernetes tests and then meet additional requirements specifically for AI.

Key Pillars of the AI Conformance Program

The Kubernetes AI Conformance program is being developed in the open as a cross-company effort, led by industry experts Janet Kuo (Google), Mario Fahlandt (Kubermatic GmbH), Rita Zhang (Microsoft), and Yuan Tang (Red Hat), with contributions from many organizations and individuals across the open source ecosystem. By developing the standard in the open, the community ensures it is built on trust and directly addresses the diverse needs of the global ecosystem. The program establishes a verified set of capabilities that platforms across the industry, such as Google Kubernetes Engine (GKE) and Azure Kubernetes Service (AKS), are already adopting.

Dynamic Resource Allocation (DRA)

DRA is the cornerstone of the new standard. It shifts resource allocation from simple accelerator quantity to fine-grained hardware control via attributes. For data scientists, this means they can now request specific hardware based on characteristics such as memory capacity or specialized capabilities, ensuring the environment perfectly matches the model's needs.
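In spirit, a DRA-style request matches device attributes rather than just a count. A minimal Python sketch of that matching idea (the device names and attributes are invented for illustration; the real mechanism works through Kubernetes ResourceClaims, not this function):

```python
# Hypothetical sketch of attribute-based device selection, in the spirit of
# Dynamic Resource Allocation (DRA). Device names and attributes are invented;
# the actual Kubernetes API expresses this via ResourceClaims.

def select_device(devices, min_memory_gib, required_capability=None):
    """Return the first device whose attributes satisfy the request."""
    for name, attrs in devices.items():
        if attrs["memory_gib"] < min_memory_gib:
            continue
        if required_capability and required_capability not in attrs["capabilities"]:
            continue
        return name
    return None  # no device satisfies the claim; the workload stays pending

devices = {
    "gpu-0": {"memory_gib": 16, "capabilities": {"fp16"}},
    "gpu-1": {"memory_gib": 80, "capabilities": {"fp16", "fp8"}},
}

print(select_device(devices, min_memory_gib=40, required_capability="fp8"))  # gpu-1
```

The point is the shift in the request's vocabulary: from "give me 2 accelerators" to "give me a device with at least this much memory and this capability."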

All-or-Nothing Scheduling

Distributed training jobs often face "deadlocks" where some pods start while others wait for resources, wasting expensive GPU time. AI Conformance mandates support for solutions like Kueue, allowing developers to ensure a job only begins when all required resources are available, improving cluster efficiency.
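The admission rule can be sketched in a few lines of Python. This is an illustrative simplification of gang scheduling, reduced to a single resource type; Kueue's real accounting covers quotas, queues, and preemption:

```python
# Minimal sketch of all-or-nothing ("gang") admission: a job is admitted only
# if every pod it needs fits in the remaining capacity; otherwise none of its
# pods start and no GPUs are held idle. Quantities are illustrative.

def admit(job_pods_gpus, free_gpus):
    """Admit the whole job or nothing; return (admitted, GPUs remaining)."""
    need = sum(job_pods_gpus)
    if need <= free_gpus:
        return True, free_gpus - need   # all pods start together
    return False, free_gpus             # nothing starts; no partial deadlock

print(admit([8, 8, 8, 8], free_gpus=32))  # (True, 0)
print(admit([8, 8, 8, 8], free_gpus=24))  # (False, 24)
```

The second call is the case the program cares about: rather than starting three of four workers and stalling, the job waits intact until all 32 GPUs are free.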

Intelligent Autoscaling for AI Workloads

Conformant clusters must support Horizontal Pod Autoscaling (HPA) based on custom AI metrics, such as GPU or TPU utilization, rather than just standard CPU/memory. This allows clusters to scale up for heavy inference demand and scale down to save costs when idle.
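Conceptually, the scaling decision follows the standard HPA proportion rule, applied to an accelerator metric instead of CPU. A sketch, using GPU utilization as an illustrative custom metric:

```python
import math

# The standard HPA scaling rule:
#   desired = ceil(current_replicas * current_metric / target_metric)
# applied here to a custom accelerator metric (GPU utilization, in percent).

def desired_replicas(current_replicas, current_gpu_util, target_gpu_util):
    return math.ceil(current_replicas * current_gpu_util / target_gpu_util)

print(desired_replicas(4, current_gpu_util=90, target_gpu_util=60))  # 6 -> scale up
print(desired_replicas(4, current_gpu_util=15, target_gpu_util=60))  # 1 -> scale down
```

The same arithmetic that scales web servers on CPU drives replica counts here; the conformance requirement is that the cluster can feed accelerator metrics into it.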

Standardized Observability for High Performance

To manage AI at scale, you need deep visibility. The program requires platforms to expose rich accelerator performance metrics directly, enabling teams to monitor inference latency, throughput, and hardware health in a standardized way.

What's Next?

The launch of AI Conformance is just the beginning. As we head further into 2026, the community is adding automated testing for certification and expanding the standard to include more advanced inference patterns and stricter security requirements.

The ultimate goal? Making "AI-readiness" an inherent, invisible part of the Kubernetes standard.

To get involved and help shape the future of AI on Kubernetes, consider joining AI Conformance in Open Source Kubernetes. We welcome diverse perspectives, as your expertise and feedback are crucial to building a robust and inclusive standard for all.

Gemma 4: Expanding the Gemmaverse with Apache 2.0

Thursday, April 2, 2026

For over 20 years, Google has maintained an unwavering commitment to the open-source community. Our belief has been simple: open technology is good for our company, good for our users, and good for our world. This commitment to fostering collaborative learning and rigorous testing has consistently proven more effective than pursuing isolated improvements. It has been our approach ever since the 2005 launch of Google Summer of Code, through our open-sourcing of Kubernetes, Android, and Go, and it remains central to our ongoing, daily work alongside maintainers and organizations.

Today, we are taking a significant step forward in that journey. Since its first launch, the community has downloaded Gemma models over 400 million times and built a vibrant universe of over 100,000 inspiring variants, known in the community as the Gemmaverse.

The release of Gemma 4 under the Apache 2.0 license — our most capable open models, ranging in size from variants suited to edge devices up to 31B parameters — provides cutting-edge AI models for this community of developers. The industry-standard Apache license broadens the horizon for Gemma 4's applicability and usefulness, providing well-understood terms for modification, reuse, and further development.

A long legacy of open research

We are committed to making helpful, accessible AI technology and research so that everyone can innovate and grow. That's why many of our innovations are freely available, easy to deploy, and useful to developers across the globe. We have a long history of making our foundational machine-learning research, including word2vec, JAX, and the seminal Transformer paper, publicly available for anyone to use and study.

We accelerated this commitment last year. By sharing models that interpret complex genomic data and identify tumor variants, we contributed to the "magic cycle" of research breakthroughs that translate into real-world impact. This week, however, marks a pivotal moment — Gemma 4 models are the first in the Gemmaverse to be released under the OSI-approved Apache 2.0 license.

Empowering developers and researchers to deliver breakthrough innovations

Since we first launched Gemma in 2024, the community of early adopters has grown into a vast ecosystem of builders, researchers, and problem solvers. Gemma is already supporting sovereign digital infrastructure, from automating state licensing in Ukraine to scaling Project Navarasa across India's 22 official languages. And we know that developers need autonomy, control, and clarity in licensing for further AI innovation to reach its full potential.

Gemma 4 brings three essential elements of free and open-source software directly to the community:

  • Autonomy: By letting people build on and modify the Gemma 4 models, we are empowering researchers and developers with the freedom to advance their own breakthrough innovations however they see fit.
  • Control: We understand that many developers require precise control over their development and deployment environments. Gemma 4 allows for local, private execution that doesn't rely on cloud-only infrastructure.
  • Clarity: By applying the industry-standard Apache 2.0 license terms, we are providing clarity about developers' rights and responsibilities so that they can build freely and confidently from the ground up without the need to navigate prescriptive terms of service.

Building together to drive real-world impact

Gemma 4, as a release, is an invitation. Whether you are a scientific researcher exploring the language of dolphins, an industry developer building the next generation of open AI agents, or a public institution looking to provide more effective, efficient, and localized services to your citizens, Google is excited to continue building with you. The Gemmaverse is your playground, and with Apache 2.0, the possibilities are greater than ever.

We can't wait to see what you build.

Google Cloud: Investing in the future of PostgreSQL

Tuesday, March 31, 2026

At Google Cloud, we are deeply committed to open source, and PostgreSQL is a cornerstone of our managed database offerings, including Cloud SQL and AlloyDB.

Continuing our work with the PostgreSQL community, we've been contributing to the core engine and participating in the patch review process. Below is a summary of that technical activity, highlighting our efforts to enhance the performance, stability, and resilience of the upstream project. By strengthening these core capabilities, we aim to drive innovation that benefits the entire global PostgreSQL ecosystem and its diverse user base.

Our investments in PostgreSQL logical replication aim to unlock critical capabilities for all users. By enhancing conflict detection, we are paving the way for robust active-active replication setups, increasing write scalability and high availability. We are also focused on expanding logical replication to cover object types it does not yet replicate. This is key to enabling major version upgrades with minimal downtime, offering a more flexible alternative to pg_upgrade. Furthermore, our ongoing bug-fix contributions are dedicated to improving the overall stability and resilience of PostgreSQL for everyone in the community.

Technical contributions: July 2025 – December 2025

The following sections detail technical enhancements and bug fixes contributed to the PostgreSQL open source project between July 2025 and December 2025. Primary engineering efforts were dedicated to advancing logical replication toward active-active capabilities, implementing missing features, optimizing pg_upgrade, and fixing bugs.

Logical Replication Enhancements

Logical replication is a critical PostgreSQL feature, enabling capabilities such as near-zero-downtime major version upgrades, selective replication, and active-active replication. We have been working to close some of the key remaining gaps.

Automatic Conflict Detection

Active-active replication is a mechanism for increasing PostgreSQL write scalability. One of the most significant hurdles for active-active PostgreSQL setups is handling row-level conflicts when the same data is modified on two different nodes. Historically, these conflicts could stall replication, requiring manual intervention.

In this cycle, the community committed Automatic Conflict Detection, the first phase of Automatic Conflict Detection and Resolution. This foundation allows the replication worker to automatically detect when an incoming change (insert, update, or delete) conflicts with the local state.
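The core of the detection step can be sketched as follows. The conflict names echo those used in this line of work, but the types and rules are heavily simplified for illustration; real detection also considers commit timestamps, replica identity, and change origin:

```python
# Illustrative sketch of row-level conflict detection in logical replication.
# Simplified: real PostgreSQL detection also uses commit timestamps and
# origin tracking, and distinguishes more conflict types than shown here.

def detect_conflict(op, incoming_key, local_rows):
    """Classify an incoming change against local state."""
    exists = incoming_key in local_rows
    if op == "insert" and exists:
        return "insert_exists"    # remote insert collides with a local row
    if op == "update" and not exists:
        return "update_missing"   # remote update targets a row we don't have
    if op == "delete" and not exists:
        return "delete_missing"   # remote delete targets a row we don't have
    return None                   # no conflict; apply the change normally

local = {1: "alice", 2: "bob"}
print(detect_conflict("insert", 2, local))  # insert_exists
print(detect_conflict("update", 9, local))  # update_missing
print(detect_conflict("delete", 1, local))  # None
```

Once conflicts are detected rather than silently stalling the apply worker, the second phase (structured conflict logging) becomes possible.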

Contributors: Dilip Kumar helped by performing code and design reviews. He is currently advancing the project's second phase, focusing on implementing conflict logging into a dedicated log table.

Logical replication of sequences

Until recently, logical replication in PostgreSQL was primarily limited to table data. Sequences did not synchronize automatically. This meant that during a migration or a major version upgrade, DBAs had to manually sync sequence values to prevent "duplicate key" errors on the new primary node. Since many databases rely on sequences, this was a significant hurdle for logical replication.
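The manual workaround amounted to generating, for each sequence, a setval() call that advances it past the highest key already present on the new primary. A sketch of that chore (the sequence name and value are illustrative; a DBA would run the generated SQL against the new primary):

```python
# Sketch of the manual sequence sync DBAs performed before sequences were
# replicated: emit a setval() statement so nextval() on the new primary
# starts above the highest existing key. Names and values are illustrative.

def sequence_sync_sql(sequence, max_id):
    # setval(seq, n) marks n as the last used value, so nextval() returns n + 1
    return f"SELECT setval('{sequence}', {max_id});"

print(sequence_sync_sql("orders_id_seq", 41817))
# SELECT setval('orders_id_seq', 41817);
```

Forgetting this step for even one sequence meant "duplicate key" errors on the first insert after cutover, which is why automating it matters.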

Contributors: Dilip Kumar helped by performing code and design reviews.

Drop subscription deadlock

The DROP SUBSCRIPTION command previously held an exclusive lock while connecting to the publisher to delete a replication slot. If the publisher was a new database on the same server, the connection process would stall while trying to access that same locked catalog. This conflict created a "self-deadlock," in which the command was essentially waiting for itself to finish.

Contributors: Dilip Kumar analyzed and authored the fix.

Upgrade Resilience

Operational ease of use and frictionless upgrades are important to PostgreSQL users. We have been working on improving the upgrade experience.

pg_upgrade optimization for Large Objects

For databases with massive volumes of Large Objects, upgrades could previously span several days. This bottleneck is resolved by exporting the underlying data table directly rather than executing individual Large Object commands, resulting in an upgrade process that is several orders of magnitude faster.

Contributors: Hannu Krosing, Nitin Motiani, and Saurabh Uttam highlighted the severity of the issue, proposed the initial fix, and actively drove it to resolution.

Prevent logical slot invalidation during upgrade

Previously, an upgrade to PostgreSQL 17 would fail unless max_slot_wal_keep_size was set to -1. This fix improves pg_upgrade's resilience by removing that requirement: the server now automatically retains the WAL data needed to upgrade logical replication slots, simplifying the upgrade process and reducing the risk of errors.

Contributors: Dilip Kumar analyzed and authored the fix.

pg_upgrade NOT NULL constraint-related bug fix

A bug in pg_dump previously caused non-inherited NOT NULL constraints on inherited columns to be lost during upgrades from version 17 or older.

The fix updates the underlying query to ensure these specific schema constraints are correctly identified and migrated during the pg_upgrade process.

Contributors: Dilip Kumar analyzed and authored the fix.

Miscellaneous Bug Fixes

We continue to contribute bug fixes to help improve the stability and quality of PostgreSQL.

Make pgstattuple more robust against empty or invalid index pages

pgstattuple is a PostgreSQL extension for analyzing the physical storage of tables and indexes at the row (tuple) level, to determine whether a table needs maintenance. However, pgstattuple would raise errors on empty or invalid index pages in the hash and GiST code. This fix handles those pages, making pgstattuple more robust.

Contributors: Nitin Motiani and Dilip Kumar participated as author and reviewer.

Loading extension from different path

A bug incorrectly stripped the prefix from nested module paths when dynamically loading shared library files, causing libraries in subdirectories to fail to load. The fix ensures the prefix is only removed for simple filenames, allowing the dynamic library expander to correctly resolve nested paths.

Contributors: Dilip Kumar reported and co-authored the fix for this bug.

WAL flush logic hardening

XLogFlush() and XLogNeedsFlush() are internal PostgreSQL functions that flush log records to the WAL, or check whether a flush is needed, to guarantee durability. In certain edge cases, such as the end-of-recovery checkpoint, these functions relied on inconsistent criteria to decide which code path to follow. This inconsistency posed a risk for upcoming features, such as asynchronous I/O for writes, that require XLogNeedsFlush() to work reliably.

Contributors: Dilip Kumar co-authored the fix for this bug.

Major Features in Development

Beyond our recent commits, the team is actively working on several high-impact proposals to further strengthen the PostgreSQL ecosystem.

  • Conflict Log Table for Detection: Dilip Kumar is developing a proposal for a conflict log table designed to offer a queryable, structured record of all logical replication conflicts. This feature would include a configuration option to determine whether conflict details are recorded in the history table, server logs, or both.
  • Adding a pg_dump flag for parallel export to pipes: Nitin Motiani is working on a flag that lets users supply pipe commands when doing parallel export/import with pg_dump/pg_restore (in directory format).

Leadership

Beyond code, our team supports the ecosystem through community leadership. We are pleased to share that Dilip Kumar has been selected for the PGConf.dev 2026 Program Committee to help shape the project's premier developer conference.

Community Roadmap: Your Feedback Matters

We encourage you to use the comments section to propose new capabilities or refinements you would like to see in future iterations, and to identify key areas where the PostgreSQL open-source community should focus its investment.

Acknowledgement

We want to thank our open source contributors for their dedication to improving the upstream project.

Dilip Kumar: Significant PostgreSQL contributor

Hannu Krosing: Significant PostgreSQL contributor

Nitin Motiani: Contributing features and bug fixes

Saurabh Uttam: Contributing bug fixes

We also extend our sincere gratitude to the wider PostgreSQL open source members, especially the committers and reviewers, for their guidance, reviews, and for collaborating with us to make PostgreSQL the most advanced open source database in the world.

Advanced TPU optimization with XProf: Continuous profiling, utilization insights, and LLO bundles

Monday, March 23, 2026

In our previous post, we introduced the updated XProf and the Cloud Diagnostics XProf library, which are designed to help developers identify model bottlenecks and optimize memory usage. As machine learning workloads on TPUs continue to grow in complexity—spanning both massive training runs and large-scale inference—developers require even deeper visibility into how their code interacts with the underlying hardware.

Today, we are exploring three advanced capabilities designed to provide "flight recorder" visibility and instruction-level insights: Continuous Profiling Snapshots, the Utilization Viewer, and LLO Bundle Visualization.

Continuous Profiling Snapshots: The "Flight Recorder" for ML

Standard profiling often relies on "sampling mode," where users manually trigger high-fidelity traces for short, predefined durations. While effective for general optimization, this traditional approach can miss transient anomalies, intermittent stragglers, or unexpected performance regressions that occur during long-running training jobs.

To address this visibility gap, XProf is introducing Continuous Profiling Snapshots. This feature functions as an "always-on" flight recorder for your TPU workloads.

How it works: Continuous Profiling Snapshots (demonstrated in a Google Colab) operate quietly in the background with minimal overhead (approximately 7µs of CPU time per packet). A host-side circular buffer of roughly 2GB seamlessly retains the last ~90 seconds of performance data. This architecture allows developers to snapshot performance data programmatically, precisely when an anomaly occurs, bypassing the overhead and unpredictability of traditional one-shot profiling.
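The buffer behaves like a flight recorder in the literal sense: everything older than the window is discarded, and a snapshot copies out whatever led up to "now." A toy model in Python, with tiny sizes standing in for the real ~2GB / ~90-second window:

```python
from collections import deque

# Toy model of the host-side circular buffer behind Continuous Profiling
# Snapshots. Sizes are tiny for illustration; this is a conceptual sketch,
# not the XProf implementation.

class FlightRecorder:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.packets = deque()  # (timestamp, payload), oldest first

    def record(self, ts, payload):
        self.packets.append((ts, payload))
        while self.packets and self.packets[0][0] < ts - self.window:
            self.packets.popleft()  # evict data older than the window

    def snapshot(self):
        return list(self.packets)  # everything leading up to "now"

rec = FlightRecorder(window_seconds=90)
for t in range(0, 200, 10):
    rec.record(t, f"packet@{t}")
snap = rec.snapshot()
print(snap[0][0], snap[-1][0])  # 100 190 -- only the last ~90s survive
```

The design choice worth noting is that recording is always cheap (append plus occasional eviction); the expensive act, copying data out, happens only when an anomaly makes it worth doing.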

Figure 1: Traditional trace capturing without Continuous Profiling Snapshots. A transient performance anomaly is missed because the trace capture was manually triggered before or after the anomaly occurred.

Figure 2: Comprehensive context captured via Continuous Profiling Snapshots. The always-on circular buffer lets the user snapshot the performance data, capturing the anomaly together with the preceding 90 seconds of context.

Key technical features include:

  • Circular Buffer Management: Continuously holds recent trace data to ensure you can capture the exact moments leading up to an anomaly or regression.
  • Out-of-band State Tracking: A lightweight service polls hardware registers for P-state (voltage and frequency) and trace-drop counters, ensuring the snapshot contains the necessary environmental context for accurate analysis.
  • Context Reconstruction: The system safely decouples state capture from the trace stream. This ensures that any arbitrary snapshot retains the ground truth required for precise, actionable debugging.

Visualizing Hardware Efficiency with the Utilization Viewer

Raw performance counters are powerful, but interpreting thousands of raw hardware metrics can be a daunting, time-consuming process. The new Utilization Viewer bridges the gap between raw data streams and actionable optimization strategies.

This tool translates raw performance counter values into easily understandable utilization percentages for specific hardware components, such as the TensorCore (TC), SparseCore (SC), and High Bandwidth Memory (HBM).

Figure 3: Deriving actionable insights from raw performance counters, which otherwise appear as a long, detailed list of thousands of uninterpreted hardware metrics and event counts.

From Counters to Insights: Instead of requiring developers to manually analyze a raw list of event counts, the Utilization Viewer automatically derives high-level metrics. For example, it can translate raw bus activity into a clear utilization percentage (e.g., displaying an average MXU bus utilization of 7.3%). This immediate clarity allows you to determine at a glance whether your model is compute-bound or memory-bound.

Figure 4: Performance counter visualization in the Utilization Viewer, showing derived utilization percentages for key hardware components such as the TensorCore (TC), SparseCore (SC), and High Bandwidth Memory (HBM).
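The derivation itself is simple arithmetic over a sampling window: busy cycles observed on a unit's bus divided by total cycles elapsed. A sketch with invented counter names (these are not actual TPU register names):

```python
# Sketch of how a viewer turns raw counters into a utilization percentage.
# Counter names are illustrative inventions, not real TPU hardware registers.

def utilization_pct(busy_cycles, total_cycles):
    return round(100.0 * busy_cycles / total_cycles, 1)

counters = {"mxu_bus_busy_cycles": 73_000, "window_cycles": 1_000_000}
print(utilization_pct(counters["mxu_bus_busy_cycles"], counters["window_cycles"]))
# 7.3 -- at 7.3% MXU bus utilization, the model is likely memory-bound
```

The value of the tool is not the division; it is knowing which of thousands of counters to divide by what, and doing it consistently across components.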

Inspecting the Metal: Low-Level Operations (LLO) Bundles

For advanced users and kernel developers utilizing Pallas, we are now exposing Low-Level Operations (LLO) bundle data. LLO bundles represent the specific machine instructions issued to the TPU's functional units during every clock cycle.

This feature is critical for "Instruction Scheduling" verification—ensuring that the compiler is honoring your programming intentions and correctly re-ordering instructions to maximize hardware performance.

New Visualizations via Trace View Integration: You can now visualize LLO bundles directly within the trace viewer. Through dynamic instrumentation, XProf inserts traces exactly when a bundle executes. This provides exact execution times and block utilization metrics, rather than relying on static compiler estimates.

Why it matters: Accessing this level of granularity enables hyper-specific bottleneck analysis. For instance, developers can now identify idle cycles within the Matrix Multiplication Unit (MXU) pipeline, making it easier to spot and resolve latency between vmatmul and vpop instructions.
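Finding those idle cycles from a per-cycle bundle trace can be sketched as follows. The instruction names echo the post (vmatmul, vpop), but the trace format here is invented for illustration:

```python
# Sketch of spotting idle runs in a functional unit's pipeline from a
# per-cycle bundle trace. The trace format is an invented simplification:
# a list of (cycle, set-of-instructions-issued) pairs.

def idle_gaps(trace, instruction):
    """Yield (start_cycle, length) for each idle run between issues."""
    busy = sorted(cycle for cycle, issued in trace if instruction in issued)
    for a, b in zip(busy, busy[1:]):
        if b - a > 1:
            yield (a + 1, b - a - 1)

trace = [(0, {"vmatmul"}), (1, {"vmatmul"}), (5, {"vpop"}), (6, {"vmatmul"})]
print(list(idle_gaps(trace, "vmatmul")))  # [(2, 4)] -- 4 idle MXU cycles
```

With LLO bundle data in the trace viewer, this kind of gap shows up visually instead of requiring post-hoc analysis of compiler estimates.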

Conclusion

Whether you are trying to capture a fleeting performance regression with Continuous Profiling, verifying kernel efficiency with LLO Bundles, or assessing overall hardware saturation with the Utilization Viewer, these new features bring internal-grade Google tooling directly to the open-source community. These tools are engineered to provide the absolute transparency required to optimize high-scale ML workloads.

Get started by checking out the updated resources:

Open Source, Open Doors, Apply Now for Google Summer of Code!

Monday, March 16, 2026

Join Google Summer of Code (GSoC) and start contributing to the world of open source development! Applications for GSoC are open now through March 31, 2026 at 18:00 UTC.

Google Summer of Code is celebrating its 22nd year in 2026! GSoC started back in 2005 and has brought over 22,000 new contributors from 123 countries into the open source community. This is an exciting opportunity for students and beginners to open source (18+) to gain real-world experience during the summer. You will spend 12+ weeks coding, learning about open source development, and earning a stipend under the guidance of experienced mentors.

Apply and get started!

Please remember that mentors are volunteers and they are being inundated with hundreds of requests from interested participants. It may take time for them to respond to you. Follow their Contributor Guidance instructions exactly. Do not just start submitting PRs without reading their guidance section first.

Complete your registration and submit your project proposals on the GSoC site before the deadline on Tuesday, March 31, 2026 at 18:00 UTC.

We wish all our applicants the best of luck!
