Posts from January 2025

The New Frontier of Security: Creating Safe and Secure AI Models

Wednesday, January 29, 2025


Are you looking to safely create the next state-of-the-art AI model? Today we’re sharing a list of recommendations on how to create and distribute your models securely.


Choosing the Right Foundation: Safe Model Formats

Before you start building your model, consider using a safe file format, as it can influence your development tool options. However, if you've already created a model, you can also convert it to a safe format before sharing it.

Once trained, models are saved and distributed as binary files. Common formats include PyTorch pickle files (usually with .pt or .pth extensions), TensorFlow SavedModel (.pb), GGUF (.gguf), and Safetensors (.safetensors). However, binary files are risky because it's hard to verify whether their contents are safe. This is especially true of formats such as pickles and SavedModels, which can embed arbitrary code, raising the risk of remote code execution (RCE) on users' machines.

To mitigate these risks:

  • If sharing only model weights: Consider formats such as Safetensors. These formats contain only model weights and are therefore safe from RCE (see the sketch after this list).
  • If sharing weights and metadata: Consider formats like GGUF, which include weights and additional metadata but no executable code.
  • For any format, but especially if your model requires custom code: Keep reading to see how to help users verify that they're getting the correct model.
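
To make the first point concrete, a weights-only format can be inspected without executing anything. Below is a minimal, hypothetical Java sketch (the file name model.safetensors is a placeholder) that prints the JSON header of a Safetensors file. The format is an 8-byte little-endian header length, a JSON header describing each tensor's dtype, shape, and byte offsets, and then raw tensor bytes; there is no code to deserialize or run.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SafetensorsHeader {
  public static void main(String[] args) throws IOException {
    // Placeholder path to a downloaded weights-only checkpoint.
    Path file = Path.of(args.length > 0 ? args[0] : "model.safetensors");
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      // The file starts with an 8-byte little-endian length of the JSON header.
      ByteBuffer lengthBytes = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
      channel.read(lengthBytes);
      lengthBytes.flip();
      long headerSize = lengthBytes.getLong();

      // The header is plain JSON (tensor names, dtypes, shapes, byte offsets).
      // Everything after it is raw tensor data, so there is nothing to execute.
      ByteBuffer header = ByteBuffer.allocate((int) headerSize);
      while (header.hasRemaining() && channel.read(header) > 0) {
        // Keep reading until the header buffer is full.
      }
      System.out.println(new String(header.array(), StandardCharsets.UTF_8));
    }
  }
}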

Secure Releases and Verification Methods

To ensure your users are getting the model you originally deployed, consider automating your releases, making them transparent and auditable. Instead of training your model on your local machine, consider using a predefined script to train your model within an isolated environment. When building smaller models, using GitHub Actions can be a good option. However, for larger models, GitHub Actions might not have the necessary hardware capabilities or availability. In that case, and if budget allows, consider using other platforms with proper security safeguards, such as Google Cloud Platform (GCP).

If building your model on a cloud platform is not an option, you can sign the model on your local machine to give your users confidence it was created by you.

If you do build your model on a cloud platform, however, sign the model and generate a provenance attestation for the release. This allows users to confirm not only that the model was created by you, but also that it came from your approved infrastructure, was trained following the specific instructions defined in your training script, and wasn't tampered with by a malicious actor.

While signatures and provenance do not guarantee the absence of malicious intent from the developer, they provide users with a means to verify the integrity of the model they downloaded.
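
Signatures and provenance are the stronger mechanisms, but even a published checksum gives users a basic integrity check they can run themselves. The following is a minimal Java sketch with placeholder values (the file name and the expected digest would come from your release notes); it only detects tampering or corruption and does not prove who built the model.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class VerifyModelChecksum {
  public static void main(String[] args) throws Exception {
    // Placeholders: the downloaded model file and the SHA-256 digest published with the release.
    Path model = Path.of("model.safetensors");
    String expected = "<sha256 digest published by the model author>";

    // Hash the downloaded bytes and compare against the published value.
    // (For multi-gigabyte models, stream the file instead of reading it into memory.)
    MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
    String actual = HexFormat.of().formatHex(sha256.digest(Files.readAllBytes(model)));

    System.out.println(actual.equalsIgnoreCase(expected) ? "Checksum OK" : "Checksum MISMATCH");
  }
}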

On GitHub, signing and provenance can be easily achieved using GitHub Artifact Attestations. More generally, tools like Sigstore and SLSA are available to sign your releases and attest to their provenance.

For a few examples, check out this workflow to build a model with SLSA on GitHub and on GCP, and an example of how to sign models.


Educate Your Users

After sharing your model with the world, it is essential to educate users on the safety and security concerns surrounding model consumption. You should therefore:

  • Document potential biases in your models and datasets.
  • Clearly display all licenses associated with the model and datasets.
  • Benchmark your model, assessing and disclosing metrics around hallucinations, prompt injection risk and fairness.

With this information users can make informed decisions and abide by the ethical and technical guidelines associated with your model. For instance, they might choose to implement an input sanitization layer to enhance their software security.
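
For example, an input sanitization layer might cap prompt length, strip control characters, and flag known prompt-injection phrases before the text reaches the model. The sketch below is illustrative only; the length limit and the deny-list pattern are made-up examples, and pattern matching alone is not a complete defense against prompt injection.

import java.util.regex.Pattern;

// Illustrative input-sanitization layer; real deployments combine several defenses.
public class PromptSanitizer {
  private static final int MAX_LENGTH = 4000; // example limit, not a recommendation
  // Hypothetical deny-list pattern for one well-known injection phrasing.
  private static final Pattern SUSPICIOUS =
      Pattern.compile("(?i)ignore (all |the )?previous instructions");

  public static String sanitize(String userInput) {
    // Remove control characters and trim surrounding whitespace.
    String cleaned = userInput.replaceAll("\\p{Cntrl}", " ").trim();
    // Truncate overly long inputs.
    if (cleaned.length() > MAX_LENGTH) {
      cleaned = cleaned.substring(0, MAX_LENGTH);
    }
    // Reject inputs that match the deny-list.
    if (SUSPICIOUS.matcher(cleaned).find()) {
      throw new IllegalArgumentException("Input rejected by sanitization policy");
    }
    return cleaned;
  }
}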


Establish a Security Policy

Your model’s privacy, safety, and security guarantees can be documented in separate files or in a security policy. A security policy helps you address users' safety and security concerns regarding your models. It is a dedicated file that instructs users on how to privately report vulnerabilities, such as prompt injection strategies or potential out-of-memory (OOM) errors, allowing you to investigate and address the potential vulnerabilities before they become public knowledge. It is also a good place to define the scope of what your project considers a vulnerability.

In summary, considering model security from the outset of development is crucial. Additionally, ensuring safe distribution and informing users of potential risks is essential. It's important to remember that security is a continuous process – more like a marathon than a sprint – and constant vigilance is necessary to mitigate potential threats.


Keep Improving

The steps above will put your models’ security on a solid footing, but there is always more you can learn and do. Please take a look at Google's Secure AI Framework for a deeper dive in this subject, and take its risk self assessment to better understand which risks are most important for you.

By Gabriela Gutierrez and Pedro Nacht – GOSST Upstream Team

Producer java library for Data Lineage is now open source

Tuesday, January 28, 2025

Integrating OpenLineage producers with GCP Lineage just got a lot easier


What is Data Lineage

Data Lineage is a GCP feature for tracking data movement. It helps data owners and analysts detect anomalies in data flows, find connections between data sources, and verify the potential consequences of planned changes in data pipelines.

Lineage is injected automatically for some Google Cloud products (BigQuery, Cloud Data Fusion, Cloud Composer, Dataproc, Vertex AI). That means that if Lineage integration with any of these products is enabled in a project, data movements from jobs executed by these products are reported to GCP Lineage.

For custom integrations, the API can be used to report and fetch lineage.

Once injected, lineage can be viewed in the Google Cloud console (available from the Data Catalog, BigQuery, and Vertex AI UIs). There are two representations: a graph view, with data sources as nodes and data movements as edges, and a list view, a tabular representation. Lineage information can also be fetched from the API.

More information is available in the documentation.


GCP Lineage information model

We describe data flows using the following concepts:

  • A Process is the definition of a data transformation, for example a SQL or Spark script.
  • A Run is an execution of a Process.
  • A Lineage Event is a data transformation event, reported in the context of a Run.
  • A Link represents a connection between two data sources, where data in the link’s Target depends on its Source. A Lineage Event contains a list of Links.

OpenLineage support

OpenLineage is an open standard for reporting lineage information. It unifies lineage reporting between systems, which means events generated in this format can be consumed by any product that supports it. This leads to more flexibility: adding or replacing a lineage producer does not require changing the consumer, and vice versa.

The OpenLineage format is adopted by a number of lineage producers and consumers, meaning tooling is already available to report lineage from and to those systems. GCP Lineage is one of those consumers: users can report events in OpenLineage format, see the resulting lineage in the UI, and query it via the API.

OpenLineage is the preferred method for reporting lineage in GCP Lineage. It is used by the Dataproc lineage integration. To find out more about sending OpenLineage events to GCP Lineage, refer to the documentation.

Lineage injected in OpenLineage format can be accessed in the same way as lineage injected via other API methods or automatically: from the Google Cloud console or the API.


Why a producer library?

The GCP Lineage producer library is an extension of the client library. Client libraries are recommended for calling Cloud APIs programmatically. They handle low-level API call details, leaving the necessary user code simpler and shorter.

The producer library further simplifies integration by providing ready-to-use code for calling the API from Java. It adds functionality such as synchronous and asynchronous clients, translation of OpenLineage JSON messages to the API-friendly format, error handling, and more.

Using the producer library, all the code needed to send a request to the GCP Lineage API is:

// Create a client that sends requests synchronously (an asynchronous client is also available).
SyncLineageProducerClient client = SyncLineageProducerClient.create();
// parent identifies where the event is reported (the target project and location);
// openLineageMessage is the OpenLineage run event as a protobuf Struct (see below).
ProcessOpenLineageRunEventRequest request =
        ProcessOpenLineageRunEventRequest.newBuilder()
            .setParent(parent)
            .setOpenLineage(openLineageMessage)
            .build();
client.processOpenLineageRunEvent(request);

The field openLineageMessage here is a protobuf Struct that includes information about the job execution, its inputs and outputs, and other metadata. The object model is described in the documentation. An example message is:

{
  "eventType": "START",
  "eventTime": "2023-04-04T13:21:16.098Z",
  "run": {
    "runId": "502483d6-3e3d-474f-9380-da565eaa7516",
    "facets": {
      "spark_properties": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.22.0/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet",
        "properties": {
          "spark.master": "yarn",
          "spark.app.name": "sparkJobTest.py"
        }
      }
    }
  },
  "job": {
    "namespace": "project-name",
    "name": "cluster-name",
    "facets": {
      "jobType": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.22.0/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/facets/2-0-3/JobTypeJobFacet.json#/$defs/JobTypeJobFacet",
        "processingType": "BATCH",
        "integration": "SPARK",
        "jobType": "SQL_JOB"
      }
    }
  },
  "inputs": [
    {
      "namespace": "bigquery",
      "name": "project.dataset.input_table"
    }
  ],
  "outputs": [
    {
      "namespace": "bigquery",
      "name": "project.dataset.output_table"
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/integration/spark",
  "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"
}

Learn more about building an OpenLineage message.
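
For reference, if you already have the run event as a JSON string (such as the example message above), one way to obtain the protobuf Struct passed to setOpenLineage is JsonFormat from protobuf-java-util; the producer library can also perform this translation for you. A minimal sketch:

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Struct;
import com.google.protobuf.util.JsonFormat;

public class OpenLineageMessages {
  // Parses an OpenLineage run event, given as a JSON string, into a protobuf Struct.
  public static Struct fromJson(String openLineageJson) throws InvalidProtocolBufferException {
    Struct.Builder builder = Struct.newBuilder();
    // JsonFormat copies the JSON object onto the Struct builder field by field.
    JsonFormat.parser().merge(openLineageJson, builder);
    return builder.build();
  }
}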


Best Practices for Constructing OpenLineage Messages

The openLineageMessage should follow the OpenLineage format. The fields that are required for correct parsing by the GCP Lineage API are:

  • job: mapped to Process
  • job.namespace: used to construct the Process name
  • job.name: used to construct the Process name
  • run: mapped to Run
  • run.runId: used to construct the Run name
  • producer: URI identifying the producer of this metadata
  • eventTime: time of the data movement
  • schemaURL: URL pointing to the schema definition for this message

In addition to those, the fields used to create lineage are:

  • eventType: corresponds to the status of the Run
  • inputs: mapped to sources of links; must be specified according to the naming conventions
  • outputs: mapped to targets of links; must be specified according to the naming conventions

The GCP Lineage API supports OpenLineage major versions 1 and 2. For more information please refer to the documentation.


How to access GCP Lineage?

The code is now publicly available on GitHub. The library is also published to Maven.


GcpLineageTransport

To simplify integration with GCP Lineage, we offer GcpLineageTransport. It is available in the OpenLineage GitHub repository and is built as a separate Maven artifact. It is built on top of the producer library mentioned above.

Using the transport minimizes the code needed to send events to GCP Lineage. The GcpLineageTransport can be configured as the event sink for any existing OpenLineage producer, such as Airflow, Spark, or Flink. Find more information and examples on GCP Lineage.

By Mary Idamkina – Data Lineage

See the code that powered the Pebble smartwatches

Monday, January 27, 2025

We are excited to announce that the source code that powered Pebble smartwatches is now available for download.

This is part of an effort from Google to help and support the volunteers who have come together to maintain functionality for Pebble watches after the original company ceased operations in 2016.


A quick look back

Pebble was initially launched through a very successful Kickstarter project. Pebble’s first Kickstarter was the single most funded at the time, and its successor Kickstarter for the Pebble Time repeated that feat – and remains the second most funded today! Over the course of four years, Pebble sold over two million smartwatches, cultivating a thriving community of thousands of developers who created over ten thousand Pebble apps and watchfaces.

In 2016, Fitbit acquired Pebble, including Pebble’s intellectual property. Later on, Fitbit itself was acquired by Google, taking the Pebble OS with it.

Despite the Pebble hardware and software support being discontinued eight years ago, Pebble still has thousands of dedicated fans.


What is being released

We are releasing most of the source code for the Pebble operating system. This repository contains the entire OS, which provides all the standard smartwatch functionality – notifications, media controls, fitness tracking, and support for custom apps and watchfaces – on tiny ARM Cortex-M microcontrollers. Built with FreeRTOS, it contains multiple modules for memory management, graphics, and timekeeping, as well as an extensive framework to load and run custom applications written in C or in JavaScript via the JerryScript engine. The Pebble architecture allowed for a lightweight system delivering a rich user experience and very long battery life.

It's important to note that some proprietary code was removed from this codebase, particularly for chipset support and the Bluetooth stack. This means the code being released contains all the build system files (using the waf build system), but it will not compile or link as released.


The path forward

From here, we hope this release will help the dedicated community and volunteers from the Rebble project carry forward support for the Pebble watches that users still love. Building a new firmware update will take a non-trivial amount of work: finding replacements for the pieces that were stripped out of this code, and updating source code that has not been maintained for a few years.

By Matthieu Jeanson, Katharine Berry, and Liam McLoughlin

Introducing Eclipsa Audio: immersive audio for everyone

Wednesday, January 15, 2025

In the real world, we hear sounds from all around us. Some sounds are ahead of us, some are to our sides, some are behind us, and - yes - some are above or below us. Spatial audio technology brings an immersive audio experience that goes beyond traditional stereo sound. It creates a 3D soundscape, making you feel like sounds are coming from all around you, not just from the left and right speakers.

Spatial audio technologies were first developed over 50 years ago, and playback has been available to consumers for over a decade, but creating spatial audio has been mostly limited to professionals in the movie or music industries. That’s why Google and Samsung are releasing Eclipsa Audio, an open source spatial audio format for everyone.


From Creation to Distribution to Experience

Eclipsa Audio is based on Immersive Audio Model and Formats (IAMF), an audio format developed by Google, Samsung, and other key contributors within the Alliance for Open Media (AOM), and released under the AOM royalty-free license. Because IAMF is open source, Eclipsa Audio files can be created by anyone using freely available audio tools, which support a wide variety of workflows:

A diagram shows three workflows for encoding video and audio using iamf_tools and ffmpeg to create MP4 files with IAMF audio and video. Each workflow handles a different input type: ADM-BWF, WAV files, Textproto, and video.

An open source reference renderer [1] is freely available for standalone spatial audio playback, or you can test your Eclipsa Audio files right in your browser at the Binaural Web Demo Application.

Starting in 2025, creators will be able to upload videos with Eclipsa Audio tracks to YouTube. As the first in the industry to adopt Eclipsa Audio, Samsung is integrating the technology across its 2025 TV lineup — from the Crystal UHD series to the premium flagship Neo QLED 8K models — to ensure that consumers who want to experience this advanced technology can choose from a wide range of options. Google and Samsung will be launching a certification and brand licensing program in 2025 to provide quality assurance to manufacturers and consumers for products that support Eclipsa Audio.


Next Steps

To simplify the creation of Eclipsa Audio files, later this spring we will release a free Eclipsa Audio plugin for AVID Pro Tools Digital Audio Workstation. We also plan to bring native Eclipsa Audio playback to the Google Chrome browser as well as to TVs and Soundbars from multiple manufacturers later in 2025. Eclipsa Audio support will also arrive in an upcoming Android AOSP release; stay tuned for more information.

We believe that Eclipsa Audio has the potential to change the way we experience sound. We are excited to see how it is used to create new and innovative audio experiences.

By Matt Frost, Jani Huoponen, Jan Skoglund, Roshan Baliga – the Open Audio team

[1] Special thanks to Arm for providing high-performance optimizations to the IAMF reference software.

Google Summer of Code 2025 is here!

Tuesday, January 14, 2025

Level up your coding skills in Google Summer of Code 2025

Get ready for the 2025 Google Summer of Code (GSoC) program! We started this adventure in 2005 and over the past 20 years we have welcomed over 21,000 new contributors to open source through the program under the guidance of 20,000+ mentors from over 1,000 open source organizations. Check out the video below to learn more about the impact GSoC has made over the last two decades.

Our mission since day one has been to foster the next generation of open source contributors. Participants are immersed in a supportive environment where they spend 3+ months collaborating on real-world projects alongside experienced mentors. This deep dive into open source not only builds valuable coding skills, but cultivates a strong understanding of community dynamics and best practices, empowering them to become impactful contributors.

2024 was a milestone year with 1,127 GSoC contributors completing their projects with 195 open source organizations. We hope to surpass these numbers in 2025!


Be a GSoC 2025 mentoring organization

Application period: January 27 – February 11

Interested organizations can learn more by visiting our website; there you’ll find supportive materials to get started.

An invaluable resource is our Mentor Guide, which is a quick way to familiarize yourself with the GSoC program. You’ll find tips on how to engage your community, suggestions on how to present achievable project ideas, and guidance on applying these to your communities.

We are happy to welcome organizations new to GSoC each year. Typically, 20-30 organizations join us for the first time, and we encourage you to apply. In 2025, we're particularly excited to expand our reach in the Security and Machine Learning domains.


Learn more about being a GSoC mentoring organization

Join us in our first information session of the year:

  • Organization Applications Tips on Tuesday, January 22, 17:00 UTC

Be a GSoC 2025 contributor

Application period: March 24 - April 8

If you are a beginner to open source development or a student interested in learning about open source, this is your chance to get involved and gain experience on real-world software development projects. Follow these quick steps to set yourself up for success:


Learn more about being a GSoC contributor

Join one of our upcoming information sessions:

  • Contributor Talk #1 on Wednesday, February 19, 16:00 UTC
  • Contributor Talk #2 on Tuesday, February 25, 2:00 UTC
  • Contributor Talk #3 on Thursday, March 6, 16:00 UTC

Please help us spread the word about GSoC 2025 to your peers, family members, colleagues, universities and anyone interested in making a difference in the open source community. Join us and help shape the future of open source!

By Stephanie Taylor, Mary Radomile, and Lucila Ortiz
