opensource.google.com

Menu
Showing posts with label artificial intelligence. Show all posts
Showing posts with label artificial intelligence. Show all posts

Transforming Kubernetes and GKE into the leading platform for AI/ML

Wednesday, May 21, 2025

The world is rapidly embracing the power of AI/ML, from training cutting-edge foundation models to deploying intelligent applications at scale. As these workloads become more sophisticated and demanding, the infrastructure required to support them must evolve. Kubernetes has emerged as the standard for container orchestration, but AI/ML introduces unique challenges that push traditional infrastructure to its limits.

AI training jobs often require massive scale, needing to coordinate thousands of specialized hardware like GPUs and TPUs. Reliability is critical, as failures can be costly for long running, large-scale training jobs. Efficient resource sharing across teams and workloads is essential given the expense of accelerators. Furthermore, deploying and scaling AI models for inference demands low latency and faster startup times for large container images and models.

At Google, we are deeply invested in the AI/ML revolution. This is why we are doubling down on our commitment to advancing Kubernetes as the foundational open standard for these workloads. Our strategy centers on evolving the core Kubernetes platform to meet the needs of the "next trillion core hours," specifically focusing on batch and AI/ML. We then bring these advancements, alongside enterprise-grade management and optimizations, to users through Google Kubernetes Engine (GKE).

Here's how we are transforming Kubernetes and GKE:

Redefining Kubernetes' relationship with specialized hardware

Kubernetes was initially designed for more uniform CPU compute. The surge of AI/ML brought new requirements for seamless integration and efficient management of expensive, sparse, and diverse accelerators. To support these new demands, Google has been a key investor in upstream Kubernetes to offer robust support for a diverse portfolio of the latest accelerators, including multiple generations of TPUs and a wide range of NVIDIA GPUs.

A core Kubernetes enhancement driven by Google and the community to better support AI/ML workloads is Dynamic Resource Allocation (DRA). This framework, developed in the heart of Kubernetes, provides a more flexible and extensible way for workloads to request and consume specialized hardware resources beyond traditional CPU and memory, which is crucial for efficiently managing accelerators. Building on such foundational open-source capabilities, GKE can then offer features like Custom Compute Classes, which improve the obtainability of these resources through intelligent fallback priorities across different capacity types like reservations, on-demand, and Spot instances. Google's active contributions to advanced resource management and scheduling capabilities within the Kubernetes community ensure that the platform evolves to meet the sophisticated demands of AI/ML, making efficient use of these specialized hardware resources more broadly accessible.

Unlocking scale and reliability

AI/ML workloads demand unprecedented scale and have new failure modes compared to traditional applications. GKE is built to handle this, supporting up to 65,000 nodes in a single cluster. We've demonstrated the ability to run the largest publicly announced training jobs, coordinating 50,000 TPU chips with near-ideal scaling efficiency.

Critically, we are enhancing core Kubernetes capabilities to support the scale and reliability needed for AI/ML. For instance, to better manage distributed AI workloads like serving large models split across multiple hosts, Google has been instrumental in developing features like JobSet (emerging from earlier concepts like LeaderWorkerSet) within the Kubernetes community (SIG Apps). This provides robust orchestration for co-scheduled, interdependent groups of Pods. We are also actively working upstream to improve Kubernetes reliability and stability through initiatives like Production Readiness Reviews, promoting safer upgrade paths, and enhancing etcd stability for the benefit of all Kubernetes users.

Optimizing Kubernetes performance for efficient inference

Low-latency and cost-efficient inference is critical for AI applications. For serving, the GKE Inference Gateway routes requests based on model server metrics like KVCache utilization and pending queue length, reducing serving costs by up to 30% and tail latency by 60% compared to traditional load balancing. We've even achieved vLLM fungibility across TPUs and GPUs, allowing users to serve the same model on either accelerator without incremental effort.

To address slow startup times for large AI/ML container images (often 20GB+), GKE offers rapid scale-out features. Secondary boot disks allow preloading container images and data, resulting in up to 29x faster container mounting time. GCS FUSE enables streaming data directly from Cloud Storage, leading to faster model load times. Furthermore, GKE Inference Quickstart provides data-driven, optimized Kubernetes deployment configurations, saving extensive benchmarking effort and enabling up to 30% lower cost, 60% lower tail latency, and 40% higher throughput.

Simplifying the Kubernetes experience and enhancing observability for AI/ML

We understand that data scientists and ML researchers may not be Kubernetes experts. Google aims to simplify the setup and management of AI-optimized Kubernetes clusters. This includes contributions to Kubernetes usability efforts and SIG-Usability. Managed offerings like GKE provide multiple paths to set up AI-optimized environments, from default configurations to customizable blueprints. Offerings like GKE Autopilot further abstract away infrastructure management, aiming for the ease of use that benefits all users.
Ensuring visibility into AI/ML workloads is paramount. Google actively supports and contributes to the integration of standard open-source observability tools within the Kubernetes ecosystem, such as Prometheus, Grafana, and OpenTelemetry. Building on this open foundation, GKE then provides enhanced, out-of-the-box observability integrated with popular AI frameworks & tools, including specific insights into workload startup latency and end-to-end tracing.

Looking ahead: continued investment in Open Source Kubernetes for AI/ML

The transformation continues. Our roadmap includes exciting developments in upstream Kubernetes for easily deploying and managing large-scale clusters, support for new GPU & TPU generations integrated through open-source mechanisms, and continued community-driven innovations in fast startup, reliability, and ease of use for AI/ML workloads.

Google is committed to making Kubernetes the premier open-source platform for AI/ML, pushing the boundaries of scale, performance, and efficiency while maintaining stability and ease of use. By driving innovation in core Kubernetes and building powerful, deeply integrated capabilities in our managed offering, GKE, we are empowering organizations to accelerate their AI/ML initiatives and unlock the next generation of intelligent applications built on an open foundation.

Come explore the possibilities with Kubernetes and GKE for your AI/ML workloads!

By Francisco Cabrera & Federico Bongiovanni, GCP Google Kubernetes Engine

Cirq Turns 1.0

Tuesday, July 19, 2022



Today we are excited to announce the first full version release of the open source quantum programming framework Cirq: Cirq 1.0. Cirq is a Python framework for writing, running, and analyzing the results of quantum computer programs. It was designed for near-term quantum computers, those with a few hundred qubits and few thousands of quantum gates. The significance of the 1.0 release is that Cirq has support for the vast majority of workflows for these systems and is considered to be a stable API that we will only update with breaking changes at major version numbers.

Getting to Cirq 1.0 is the culmination of a large amount of hard work by hundreds of contributors from Google, industry, and academia. We have been running a weekly meeting, called the “Cirq Cync”, for over four years where community members gather to discuss work on Cirq, bugs, and to generally tell terrible but amusing quantum programming jokes. We’re proud of this inclusive community, and we’ve been particularly happy to see the growth of many software developers into quantum computing experts, and quantum computing experts into solid software developers. One of our contributors, Victory Omole, won the 2021 Witteck Quantum Prize for Open Source Software. Way to go Victory!

The first commit to Cirq on GitHub (an internal version of Cirq at Google existed prior to this) was on Dec 19, 2017 by Craig Gidney, and we publicly announced Cirq in July of 2018. 3,200+ commits later to the GitHub repo, in the hands of the team at Google and the Cirq community, we’ve seen Cirq help accomplish some amazing things:
  • Cirq is the lingua franca that Google’s hardware team uses to write quantum programs that run on Google’s quantum computing hardware. Because of this, we have been able to post open source code in our ReCirq repo for these experiments for anyone to examine and extend. A few highlights of the past few years:
    • “Realizing topologically ordered states on a quantum processor”, K. J. Satzinger et al., Science 374 6572, 1237-1241 (2021) [paper] [ReCirq code]
    • “Information scrambling in quantum circuits”, X. Mi, P. Roushan, C. Quintana et al, Science 374, 6574 1479-1483 (2021) [paper] [ReCirq code]
    • “Hartree-Fock on a superconducting qubit quantum computer”, F. Arute et al., Science 369, 6507 1084--1089 (2020) [paper] [ReCirq code]
  • A healthy community of libraries have now been built on top of Cirq, enabling different quantum computing research areas. These libraries include:
    • TensorFlow Quantum: a tool for exploring quantum machine learning. Using TensorFlow Quantum researchers trained a machine learning model on 30 qubits at a rate of 1.1 petaflops per second (1.1 x 1015 operations per second).
    • OpenFermion: an open source tool for quantum computations involved in chemistry simulations.
    • Pytket (pytkey-cirq): an open source Python tool for optimizing and manipulating quantum circuits.
    • Mitiq: an open source library developed by the non-profit Unitary fund for error mitigation techniques developed by the non-profit Unitary fund.
    • Qsim: a high performance state vector simulator written using AVX/FMA vectorized instructions with optional GPU acceleration. qsimcirq is the Cirq interface one can use to access qsim from Cirq.
  • Numerous quantum computing cloud services from companies in the industry have also integrated/standardized Cirq. Programs written in Cirq can be used to run through AQT, IonQ, Pascal, Rigetti, and IQM vendors. In addition, Cirq can be used on Azure Quantum to run on the hardware supported by Azure Quantum. Finally, one can get realistic noise simulations of Google’s quantum computing hardware using our newly released Quantum Virtual Machine.
  • Cirq is not just for stuffy research. Cirq has also been used to help develop Quantum Chess, a version of chess that uses superposition and entanglement. This notebook shows you how the game of Quantum Chess can be programmed using Cirq.
Cirq moving to its first full version does not just come with new features (see 1.0 release notes), but also with more guarantees about stability. Cirq uses semantic versioning, which means that future point release of Cirq will be compatible with the full version release. For example, version 1.1 of Cirq will not introduce breaking changes to Cirq’s interfaces from version 1.0; only at major version bumps (from 1.x to 2.0, for example) will breaking changes occur.

When we began working on Cirq, quantum computers consisted of only a few qubits and a few quantum gates on these qubits. Building Cirq and the supporting software for these custom systems and having them start to scale to hundreds of qubits over the past (nearly) five years has taught us many lessons. One key takeaway from these lessons is that: As quantum computing hardware continues to grow in scale and complexity, we expect that making software to support this growth will be essential to continue meaningful research and progress. In the next five years, with hardware expected to reach hundreds or even thousands of qubits, the software that is developed for quantum computing will need to have a careful eye set on supporting these bigger and bigger systems. Going forward we will need an ever wider set of frameworks, programming languages, and libraries to achieve quantum computing’s promise.

Acknowledgements

We are indebted to all 169 contributors to the Cirq github repo, and the many more who have filed issues and used Cirq in their own software. A particular shout out to the original lead of Cirq, Craig Gidney, to Cirq’s second lead, ‪Bálint Pató who guided Cirq through its middle ages, and to Alan Ho and Catherine Vollgraff Heidweiller for product wisdom. A special thanks to the core Cirq contributors including Doug Strain, Matthew Neely, Tanuj Khatter, Dax Fohl, Adam Zalcman, Kevin Sung, Matt Harrigan, Casey Duckering, Orion Martin, Smit Sanghavi, Bryan O'Gorman, Wojciech Mruczkiewicz, Ryan LaRose, Tony Bruguier, Victory Omole, and Cheng Xing, and our documentarians Auguste Hirth and Abe Asfaw.

By Dave Bacon and Michael Broughton – Quantum AI Team

Dopamine 2.0: providing more flexibility in reinforcement learning research

Wednesday, February 6, 2019

Reinforcement learning (RL) has become one of the most popular fields of machine learning, and has seen a number of great advances over the last few years. As a result, there is a growing need from both researchers and educators to have access to a clear and reliable framework for RL research and education.

Last August, we announced Dopamine, our framework for flexible reinforcement learning.  For the initial version we decided to focus on a specific type of RL research: value-based agents evaluated on the Atari 2600 framework supported by the Arcade Learning Environment. We were thrilled to see how well it was received by the community, including a live coding session, its inclusion in a recently-announced benchmark for RL, considered as the top “Cool new open source project of 2018” by the Octoverse, and over 7K GitHub stars on our repository.

One of the most common requests we have received is support for more environments. This confirms what we have seen internally, where simpler environments, such as those supported by OpenAI’s Gym, are incredibly useful when testing out new algorithms. We are happy to announce Dopamine 2.0, which includes support for discrete-domain gym environments (e.g. discrete states and actions). The core of the framework remains unchanged, we have simply generalized the interface with the environment. For backwards compatibility, users will still be able to download version 1.0.

We include default configurations for two classic control environments: CartPole and Acrobot; on these environments one can train a Dopamine agent in minutes. When compared with the training time for a standard Atari 2600 game (around 5 days on a standard GPU), these environments allow researchers to iterate much faster on research ideas before testing them out on larger Atari games. We also include a Colaboratory that illustrates how to train an agent on Cartpole and Acrobot. Finally, our GymPreprocessing class serves as an example for how to use Dopamine with other custom environments.

We are excited by the new opportunities enabled by Dopamine 2.0, and look forward to seeing what the research community creates with it!

By Pablo Samuel Castro and Marc G. Bellemare, Dopamine Team

Google Summer of Code wrap-up post: Institute for Artificial Intelligence

Friday, September 11, 2015

Now that the 11th year of Google Summer of Code has officially come to a close, we will devote Fridays to wrap-up posts from a handful of the 137 mentoring organizations that participated in 2015. Organizations this year represented a wide range of computing fields including artificial intelligence, featured below.


logo-ai.png

Two software libraries that originate from our laboratory, the Institute for Artificial Intelligence, that are used and supported by a larger user community are the KnowRob system for robot knowledge processing and the CRAM (Cognitive Robot Abstract Machine) framework for plan-based robot control. In our group, we have a very strong focus on open source software and active maintenance and integration of projects. The systems we develop are available under BSD and MIT licenses, and partly (L)GPL.

Within the context of these frameworks, we offered four projects during the summer term in 2015, which were all accepted to Google Summer of Code (GSoC).

Multi-modal Big Data Analysis for Robotic Everyday Manipulation Activities

The project "Multi-modal Big Data Analysis for Robotic Everyday Manipulation Activities" added to our ongoing work to build the robotic perception system RoboSherlock for service robots performing household chores. Our GSoC student, Alexander, made exciting progress and valuable contributions during the summer. He ported an earlier prototypical proprioceptive module from Java to C++ to integrate it into RoboSherlock, he developed tools for visualizing the module's various detections and annotations, and applied this infrastructure to detect collisions of the robot's arms with unperceived parts of the environment in a shelf reordering task. We are also very happy that Alexander decided to stay and keep on working on RoboSherlock after GSoC ended.

Kitchen Activity Games GUI

Our GSoC student, Mesut, developed a GUI to interact with the robotics simulator Gazebo. The simulator has been used as a library, allowing different scenarios (worlds) to be selected and executed. Playlists can be generated in order to replay logged episodes. During the replay, various plugins can be linked and executed from the GUI to allow post processing the data. The user interface will ease organizing and saving simulation data further used for learning. You can view Mesut’s project on GitHub here.

Symbolic Reasoning Tools with Bullet using CRAM

Autonomous robots performing complex manipulation tasks in household environments, such as preparing a meal or tidying up, are required to know where different objects are located and what properties they have. The knowledge about their environment is called “belief state”, i.e. the information that the robot believes holds true in the surrounding world. Our GSoC student, Kunal, worked on improving the world representation of the CRAM robotic framework, which represents the environment as a 3-dimensional world where simple physics rules of the Bullet Physics engine apply. The goal of the project was to issue events when errors are found in the belief state, such as, if the robot thinks its arm is inside of a table, which is physically impossible. A stand-alone ROS (Robot Operating System) publisher node, that would notify all its listeners about errors, was partially implemented while integration with the CRAM belief state is still in progress.

Report Card Generation from Robot Mobile Manipulation Activities

Throughout the summer, our GSoC student Kacper made great progress in developing a framework for automatically generating report cards from robot experiences. We have a special focus in mobile manipulation activities in robots and are interested in anomaly detection in our rather complex systems — the developed components greatly help us save time on mundane analysis tasks, and make complicated analysis steps (looking up all aspects of a certain action, comparing different trials) easier to do.

By Jan Winkler, Organization Administrator and PhD student at the Institute of Artificial Intelligence

.