opensource.google.com

Posts from February 2024

Mentor organizations announced for Google Summer of Code 2024

Wednesday, February 21, 2024

We are thrilled to share that we have 195 open source projects that have been selected for Google Summer of Code (GSoC) 2024! This year we are excited to welcome 30 new organizations for their first year as part of the program.

Check out our program site to view the complete list of GSoC 2024 accepted mentoring organizations. Get to know more about each organization on their GSoC program page, which includes reading through the project ideas that they are looking for GSoC contributors to work on this year.

Are you interested in being a GSoC Contributor?

The 2024 GSoC program is open to students and to beginners in open source software development. Contributor applications will open on Monday, March 18, 2024 at 18:00 UTC with a deadline of Tuesday, April 2, 2024 18:00 UTC to submit your application (including your project proposal).

If you are eager to enhance your chances of becoming a successful contributor this year, we highly recommend beginning your preparations and initiating communication with the organizations that interest you right away. Below are some tips for prospective GSoC contributors to act on before the application period begins March 18th:

  • Watch our ‘Introduction to GSoC’ video to see a quick overview of the program, and view our Community Talks or Org Highlight Videos to get inspired and learn more about some projects that contributors have worked on in the past.
  • Check out the Contributor Guide (so much great info in here!) and Advice for Applying to GSoC doc.
  • Review the list of accepted organizations here. We recommend finding two to four that interest you and reading through their project ideas lists. Use the filters on the site to help you narrow down based on the programming languages you are familiar with and the categories that interest you (cloud, AI, security, science, etc.).
  • As soon as you see an idea that sparks your interest, reach out to the organization via their preferred communication methods (listed on their org page on the GSoC program site). The earlier you start the conversation, the better your chances of being accepted as a GSoC contributor.
  • Talk with the mentors and community to determine if this project idea is something you would enjoy working on during the program. Find a project that excites you, otherwise it may be a challenging summer for you and your mentor.
  • Use the information you received during your communications with the mentors and other org community members to write up your proposal.

You can find more information about the program on our website which includes a full timeline of important dates. We also urge anyone interested in applying to read the FAQ and Program Rules and watch some of our other videos with more details about GSoC for contributors and mentors.

A hearty welcome—and thank you—to all of our mentor organizations. We look forward to working with all of you during this 20th year of Google Summer of Code!

By Stephanie Taylor – Google Open Source

Building Open Models Responsibly in the Gemini Era

Google has long believed that open technology is not only good for our company, but good for the industry, consumers, and the world. We’ve released open-source projects like Android and Chromium that transformed access to mobile and web technologies, and have done the same in AI with Transformers, TensorFlow, and AlphaFold. The release of our Gemma family of open models is a next step in how we’re deepening our commitment to open technology alongside an industry-leading safe, responsible approach. At the same time, the rapidly evolving nature of AI raises important considerations for how to enable safety-aligned open models: an approach that supports broad innovation while promoting safe uses.

A benefit of open source is that once it is released, its license gives users full creative autonomy. This is a powerful guarantee of technology access for developers and end users. Another benefit is that open-source technology can be modified to fit the unique use case of the end user, without restriction.

In the hands of a malicious actor, however, the lack of restrictions can raise risks. Computing has been through similar cycles before, addressing issues such as protecting users of the open internet, handling cryptography, and addressing open-source software security. We now face this challenge with AI. Below we share the approach we took to openly releasing Gemma models, and the advancements in open model safety we hope to accelerate.


Providing access to Gemma open models

Today, Gemma models are being released as what the industry collectively has begun to refer to as “open models.” Open models feature free access to the model weights, but terms of use, redistribution, and variant ownership vary according to each model’s specific terms, which may not be based on an open-source license. The Gemma models’ terms of use make them freely available for individual developers, researchers, and commercial users for access and redistribution. Users are also free to create and publish model variants. In using Gemma models, developers agree to avoid harmful uses, reflecting our commitment to developing AI responsibly while increasing access to this technology.

We’re precise about the language we’re using to describe Gemma models because we’re proud to enable responsible AI access and innovation, and we’re equally proud supporters of open source. The definition of "Open Source" has been invaluable to computing and innovation because of requirements for redistribution and derived works, and against discrimination. These requirements enable cross-industry collaboration, individual innovation and entrepreneurship, and shared research to happen with exponential effects.

However, existing open-source concepts can’t always be directly applied to AI systems, which raises questions on how to use open-source licenses with AI. It’s important that we carry forward open principles that have made the sea-change we’re experiencing with AI possible while clarifying the concept of open-source AI and addressing concepts like derived work and author attribution.


Taking a comprehensive approach to releasing Gemma safely and responsibly

Licensing and terms of use are only one part of the evaluations, technical tools, and considered decision-making that went into aligning this release with our responsible AI Principles. Our approach involved:

  • Systematic internal review in accordance with our AI Principles: Consistent with our AI Principles, we release models only when we have determined the benefits are significant, and the risks of misuse are low or can be mitigated. We take that same approach to open models, incorporating a balance of the benefits of wider access to a particular model as well as the risks of misuse and how we can mitigate them. With Gemma, we considered the increased AI research and innovation by us and many others in the community, the access to AI technology the models could bring, and what access was needed to support these use cases.
  • A high evaluation bar: Gemma models underwent thorough evaluations, and were held to a higher bar for evaluating risk of abuse or harm than our proprietary models, given the more limited mitigations currently available for open models. These evaluations cover a broad range of responsible AI areas, including safety, fairness, privacy, and societal risk, as well as capabilities such as chemical, biological, radiological, and nuclear (CBRN) risks, cybersecurity, and autonomous replication. As described in our technical report, the Gemma models exhibit state-of-the-art safety performance in human side-by-side evaluations.
  • Responsibility tools for developers: As we release the Gemma models, we are also releasing a Responsible Generative AI Toolkit for developers, providing guidance and tools to help them create safer AI applications.

We continue to evolve our approach. As we build these frameworks further, we will proceed thoughtfully and incorporate what we learn into future model assessments. We will continue to explore the full range of access mechanisms, with benefits and risk mitigation in mind, including API-based access and staged releases.


Advancing open model safety together

Many of today’s AI safety tools are designed for systems whose design assumes restricted access and redistribution, as well as auxiliary controls like query filters. Similarly, much of the AI safety research for improving mitigations takes on the design assumptions of those systems. Just as we have created unique threat models and solutions for other open technology, we are developing safety and security tools appropriate for the differences of openly available AI.

As models become more and more capable, we are conducting research and investing in rigorous safety evaluation, testing, and mitigations for open models. We are also actively participating in conversations with policymakers and open-source community leaders on how the industry should approach this technology. This challenge is multifaceted, just like AI systems themselves. Model-sharing platforms like Hugging Face and Kaggle, where developers inspire each other with novel model iterations, play a critical role in efforts to develop open models safely; there is also a role for the cybersecurity community to contribute learnings and best practices.

Building those solutions requires access to open models and the sharing of innovations and improvements. We believe sharing the Gemma models will not just help increase access to AI technology, but also help the industry develop new approaches to safety and responsibility.

As developers adopt Gemma models and other safety-aligned open models, we look forward to working with the open-source community to develop more solutions for responsible approaches to AI in the open ecosystem. A global diversity of experiences, perspectives, and opportunities will help build safe and responsible AI that works for everyone.

By Anne Bertucio – Sr Program Manager, Open Source Programs Office; Helen King – Sr Director of Responsibility, Google DeepMind

Magika: AI powered fast and efficient file type identification

Thursday, February 15, 2024

Today we are open-sourcing Magika, Google’s AI-powered file-type identification system, to help others accurately detect binary and textual file types. Under the hood, Magika employs a custom, highly optimized deep-learning model, enabling precise file identification within milliseconds, even when running on a CPU.

Magika command line tool used to identify the type of a diverse set of files

You can try the Magika web demo today, or install it as a Python library and standalone command line tool (output is showcased above) with the standard command pip install magika.
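If you want to call Magika from your own Python code, the library exposes a small API. Below is a minimal sketch based on the documented interface at the time of release; exact attribute names on the result object (such as ct_label) may differ in other versions:

    from pathlib import Path

    from magika import Magika

    magika = Magika()  # loads the ~1MB model once; reuse the instance for speed

    # Identify a file type from raw bytes...
    result = magika.identify_bytes(b"# Example\nThis is an example of markdown!")
    print(result.output.ct_label)  # e.g. "markdown"
    print(result.output.score)     # the model's confidence in the prediction

    # ...or from a file on disk.
    result = magika.identify_path(Path("some_unknown_file"))
    print(result.output.ct_label)

Because the model is loaded once per Magika instance, batching many identifications through a single instance keeps per-file latency in the millisecond range described below.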

Why identifying file type is difficult

Since the early days of computing, accurately detecting file types has been crucial in determining how to process files. Linux comes equipped with libmagic and the file utility, which have served as the de facto standard for file type identification for over 50 years. Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file. For example, modern code editors use file-type detection to choose which syntax coloring scheme to use as the developer starts typing in a new file.

Accurate file-type detection is a notoriously difficult problem because each file format has a different structure, or no structure at all. This is particularly challenging for textual formats and programming languages as they have very similar constructs. So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

This manual approach is both time-consuming and error-prone, as it is hard for humans to create generalized rules by hand. For security applications in particular, creating dependable detection is especially challenging, as attackers constantly attempt to confuse detection with adversarially crafted payloads.

To address this issue and provide fast and accurate file-type detection, we researched and developed Magika, a new AI-powered file-type detector. Under the hood, Magika uses a custom, highly optimized deep-learning model, designed and trained using Keras, that only weighs about 1MB. At inference time Magika uses ONNX as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as non-AI tools, even on CPU.

Magika Performance

Magika detection quality compared to other tools on our 1M files benchmark

Performance-wise, thanks to its AI model and large training dataset, Magika is able to outperform other existing tools by about 20% when evaluated on a 1M-file benchmark that encompasses over 100 file types. Breaking the results down by file type, as reported in the table below, we see even greater performance gains on textual files, including code files and configuration files that other tools can struggle with.

Performance of various file type identification tools for a selection of the file types included in our benchmark (n/a indicates the tool doesn’t detect the given file type)

Magika at Google

Internally, Magika is used at scale to help improve Google users’ safety by routing files in Gmail, Drive, and Safe Browsing to the proper security and content policy scanners. Looking at a weekly average of hundreds of billions of files reveals that Magika improves file type identification accuracy by 50% compared to our previous system, which relied on handcrafted rules. In particular, this increase in accuracy lets us scan 11% more files with our specialized AI scanners for malicious documents and reduce the number of unidentified files to 3%.

The upcoming integration of Magika with VirusTotal will complement the platform's existing Code Insight functionality, which employs Google's generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the platform’s efficiency and accuracy. This integration, due to VirusTotal’s collaborative nature, directly contributes to the global cybersecurity ecosystem, fostering a safer digital environment.

Open Sourcing Magika

By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.

Magika’s code and model are freely available starting today on GitHub under the Apache 2.0 license. Magika can also quickly be installed as a standalone utility and Python library via the PyPI package manager by simply typing pip install magika, with no GPU required. We also have an experimental npm package if you would like to use the TFJS version.

To learn more about how to use it, please refer to the Magika documentation site.


Acknowledgements

Magika would not have been possible without the help of many people including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Luca Invernizzi, Young Maeng, Alex Petit-Bianco, David Tao, Kurt Thomas, Amanda Walker, and Zhixun Tan.

By Elie Bursztein – Cybersecurity AI Technical and Research Lead and Yanick Fratantonio – Cybersecurity Research Scientist

YouTube releases scripts to help partners and creators optimize their work

Thursday, February 8, 2024

At YouTube Technology Services, we believe that open source software is essential for driving innovation and collaboration in the YouTube ecosystem. We want to make automation on YouTube more accessible by providing publicly available scripts that automate common use cases, decreasing the cost for partners and creators of handling the most frequent scenarios when managing their content on YouTube.

In order to do so, we are announcing a new GitHub organization, YouTubeLabs, where you will find open source code examples covering a variety of use cases in the code-samples repository.

Most code samples rely on public YouTube APIs or Google APIs, and are well documented and well commented so that they can be easily modified by partners and creators.
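To give a flavor of what such a script looks like, here is a minimal, hypothetical Python sketch that lists a channel’s most recent uploads through the public YouTube Data API v3; the API key and channel ID are placeholders, and the samples in the repository are the authoritative, more complete versions:

    # Hypothetical sketch: list a channel's most recent uploads using the
    # public YouTube Data API v3. Requires google-api-python-client and an
    # API key from the Google Cloud console; both values below are placeholders.
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"
    CHANNEL_ID = "YOUR_CHANNEL_ID"

    youtube = build("youtube", "v3", developerKey=API_KEY)

    # Every channel exposes an "uploads" playlist in its contentDetails.
    channel = youtube.channels().list(part="contentDetails", id=CHANNEL_ID).execute()
    uploads = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    # Fetch the ten most recent items from that playlist.
    response = youtube.playlistItems().list(
        part="snippet", playlistId=uploads, maxResults=10
    ).execute()

    for item in response["items"]:
        snippet = item["snippet"]
        print(snippet["publishedAt"], snippet["title"])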

We are delivering code that aims to be as accessible as possible to our partners and creators, with minimal configuration and minimal installation required. That’s why we rely on Colaboratory notebooks (Colab) and Apps Script as the main pillars of our open source offering. Colab is a free, cloud-based Jupyter notebook environment that makes it easy to run Python code in the browser, and it is integrated with Google Drive. Apps Script is a serverless platform that allows you to write scripts that run on Google’s servers.

We believe that open source software is key to the future of the YouTube ecosystem. By making our code available to the public, we are helping to empower partners and creators to do more with YouTube.

Want to get started? Check out the code examples already available in YouTubeLabs’ code-samples repository.

We look forward to continuing to build out our open source examples in the coming months, so don’t forget to “like and subscribe” to our repository to stay tuned for more!

By Federico Villa and Haley Schafer – Partner Technology Managers on behalf of YouTube Technology Services

Kubernetes 1.29 is available in the Regular channel of GKE

Wednesday, February 7, 2024

Kubernetes 1.29 has been available in the GKE Regular channel since January 26th, and was available in the Rapid channel from January 11th, less than 30 days after the OSS release! For more information about the content of Kubernetes 1.29, read the Kubernetes 1.29 Release Notes.

New Features

Using CEL for Validating Admission Policy

Validating admission policies offer a declarative, in-process alternative to validating admission webhooks.

Validating admission policies use the Common Expression Language (CEL) to declare the validation rules of a policy. Validating admission policies are highly configurable, enabling policy authors to define policies that can be parameterized and scoped to resources as needed by cluster administrators. [source]

Validating Admission Policy graduates to beta in 1.29. We are especially excited about the work that Googlers Cici Huang, Joe Betz, and Jiahui Feng have led in this release to get to the beta milestone. As we move toward v1, we are actively working to ensure scalability and would appreciate any end-user feedback. [public doc here for those interested]

You can opt into the beta ValidatingAdmissionPolicy feature by enabling the beta APIs.
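To make the declarative style concrete, here is a hypothetical example (not one from the release notes) that uses a CEL expression to cap Deployment replicas, together with the binding that puts the policy into effect:

    # Hypothetical example: reject Deployments that request more than 5 replicas.
    apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingAdmissionPolicy
    metadata:
      name: demo-replica-limit
    spec:
      failurePolicy: Fail
      matchConstraints:
        resourceRules:
        - apiGroups: ["apps"]
          apiVersions: ["v1"]
          operations: ["CREATE", "UPDATE"]
          resources: ["deployments"]
      validations:
      - expression: "object.spec.replicas <= 5"
        message: "replicas must be 5 or fewer"
    ---
    # A binding applies the policy cluster-wide (or to a chosen subset of namespaces).
    apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingAdmissionPolicyBinding
    metadata:
      name: demo-replica-limit-binding
    spec:
      policyName: demo-replica-limit
      validationActions: ["Deny"]

Because the rule is evaluated in-process by the API server, no webhook endpoint needs to be deployed, monitored, or kept highly available.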

InitContainers as a Sidecar

InitContainers can now be configured as sidecar containers and kept running alongside the normal containers in a Pod. This is only supported by nodes running version 1.29 or later, so ensure all nodes in a cluster are at version 1.29 or later before using this feature in Pods. The feature was long awaited; this is evident from the fact that Istio has already widely tested it, and the Istio community is working hard to make sure it can be enabled early with minimal disruption for clusters with older nodes. You can participate in the discussion here.
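As a minimal sketch (names and images below are placeholders), a sidecar is declared as an init container with the new restartPolicy: Always field, which keeps it running alongside the main containers and shuts it down after they finish:

    # Hypothetical Pod: a log-shipping sidecar next to a run-to-completion task.
    apiVersion: v1
    kind: Pod
    metadata:
      name: job-with-sidecar
    spec:
      restartPolicy: Never
      initContainers:
      - name: log-shipper                        # placeholder name
        image: example.com/log-shipper:latest    # placeholder image
        restartPolicy: Always                    # marks this init container as a sidecar
      containers:
      - name: main-task                          # placeholder name
        image: example.com/batch-task:latest     # placeholder image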

A big driver for delivering the feature is the growing number of AI/ML workloads, which are often represented by Pods running to completion. Those Pods need infrastructure sidecars (Istio and GCSFuse are examples), and Google recognizes this trend.

Implementation of sidecar containers is, and continues to be, a community effort. We are proud to highlight that Googler Sergey Kanzhelev is driving it via the Sidecar working group, and many other Googlers worked to make sure this KEP landed so fast: John Howard made sure the early versions of the implementation were tested with Istio, Wojciech Tyczyński ensured a safe rollout via production readiness review, Tim Hockin spent many hours in API review of the feature, and Clayton Coleman gave advice and helped with code reviews.

New APIs

API Priority and Fairness/Flow Control

We are super excited to share that API Priority and Fairness (APF) graduated to Stable v1/GA in 1.29! Controlling the behavior of the Kubernetes API server in an overload situation is a key task for cluster administrators, and this is what APF addresses. This ambitious project was initiated by Googler and founding API Machinery SIG lead Daniel Smith, and expanded to become a community-wide effort. Special thanks to Googler Wojciech Tyczyński and API Machinery members Mike Spreitzer from IBM and Abu Kashem from Red Hat for landing this critical feature in Kubernetes 1.29 (more details in the Kubernetes publication). In GKE we tested and utilized it early: in fact, any version above 1.26.4 sets higher kubelet QPS values, trusting the API server to handle the load gracefully.
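As a hypothetical sketch of the now-GA API (every name below is a placeholder), a cluster administrator could give a monitoring service account its own concurrency slice so that a flood of its requests cannot starve other API clients:

    # Hypothetical sketch: a dedicated priority level and flow schema for a
    # monitoring service account; all names are placeholders.
    apiVersion: flowcontrol.apiserver.k8s.io/v1
    kind: PriorityLevelConfiguration
    metadata:
      name: monitoring-tools
    spec:
      type: Limited
      limited:
        nominalConcurrencyShares: 10      # this level's slice of server concurrency
        limitResponse:
          type: Queue                     # queue excess requests instead of rejecting
          queuing:
            queues: 16
            queueLengthLimit: 50
            handSize: 4
    ---
    apiVersion: flowcontrol.apiserver.k8s.io/v1
    kind: FlowSchema
    metadata:
      name: monitoring-tools
    spec:
      priorityLevelConfiguration:
        name: monitoring-tools            # route matching traffic to the level above
      matchingPrecedence: 1000
      distinguisherMethod:
        type: ByUser
      rules:
      - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: metrics-scraper         # placeholder service account
            namespace: monitoring
        resourceRules:
        - verbs: ["get", "list", "watch"]
          apiGroups: [""]
          resources: ["pods"]
          namespaces: ["*"]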

Deprecations and Removals

  • The previously deprecated v1beta2 Priority and Fairness APIs are no longer served in 1.29, so update usage to v1beta3 before upgrading to 1.29.
  • With the API Priority and Fairness graduation to v1, the v1beta3 Priority and Fairness APIs are newly deprecated in 1.29, and will no longer be served in 1.32.
  • In the Node API, take a look at the changes to the status.kubeProxyVersion field, which will not be populated starting in v1.33. The field is currently populated with the kubelet version, not the kube-proxy version, and might not accurately reflect the kube-proxy version in use. For more information, see KEP-4004.
  • 1.29 removed support for the insecure SHA1 algorithm. To prevent impact on your clusters, you must replace incompatible certificates of webhook servers and extension API servers before upgrading your clusters to version 1.29.
    • GKE will not auto-upgrade clusters with webhook backends using incompatible certificates to 1.29 until you replace the certificates or until version 1.28 reaches end of life. For more information refer to the official GKE documentation.
  • The Ceph CephFS (kubernetes.io/cephfs) and RBD (kubernetes.io/rbd) volume plugins have been deprecated since 1.28 and will be removed in a future release.

Shoutout to the Production Readiness Review (PRR) team

For each new Kubernetes release, a dedicated subgroup of SIG Architecture, composed of very senior contributors in the Kubernetes community, conducts Production Readiness Reviews, going through each feature.

  • OSS Production Readiness Reviews (PRR) reduce toil for all the different Cloud Providers, by shifting the effort onto OSS developers.
  • OSS Production Readiness Reviews surface production safety, observability, and scalability issues with OSS features at design time, when it is still possible to affect the outcomes.
  • By ensuring feature gates, solid enable → disable → enable testing, and attention to upgrade and rollout considerations, OSS Production Readiness Reviews enable rapid mitigation of failures in new features.

As part of this group, we want to thank Googlers John Belamaric and Wojciech Tyczyński for doing this remarkable heavy lifting on unglamorous and often invisible work. Additionally, we’d like to congratulate Googler Joe Betz, who recently graduated as a new PRR reviewer after shadowing the process throughout 2023.

By Jordan Liggitt, Jago Macleod, Sergey Kanzhelev, and Federico Bongiovanni – Google Kubernetes Kernel team

Announcing Google Season of Docs 2024

Friday, February 2, 2024

Google Season of Docs provides direct grants to open source projects to improve their documentation and gives professional technical writers an opportunity to gain experience in open source. Together we raise awareness of open source, of docs, and of technical writing.

How does GSoD work?

Google Season of Docs allows open source organizations to apply for a grant based on their documentation needs. If selected, the open source organizations use their grant to directly hire a technical writer to complete their documentation project. Organizations have up to six months to complete their documentation project. At the end of the program, organizations complete their final case study which outlines the problem the documentation project was intended to solve, what metrics were used to judge the effectiveness of the documentation, and what the organization learned for the future. All project case studies are published on the Season of Docs site at the end of the program.

Organizations: apply to be part of GSoD

The applications for Google Season of Docs open February 22 for the 2024 cycle. We strongly suggest that organizations take the time to complete the steps in the exploration phase before the application process begins, including:

  • Creating a project page to gauge community and technical writer participant interest (see our project ideas page for examples).
  • Publicizing your interest in participating in GSoD through your project channels and adding your project to our list of interested projects on GitHub.
  • Lining up community members who are interested in mentoring or helping onboard technical writers to your project.
  • Brainstorming requirements for technical writers to work on your project. Will they need to be able to test code, work with video, or have prior experience with your project or related technologies? Will you allow the use of generative AI tools in creating documentation for your project?
  • Reading through the case studies from previous Season of Docs participants.

Organizations: create your project page

Every Google Season of Docs project begins with a project page, which is a publicly visible page that serves as an overview of your documentation project. A good project page includes:

  • A statement of the problem your project needs to solve: “users on Windows don’t have clear guidance on how to install our project”.
  • The documentation that might solve this problem: “We want to create a quickstart doc and installation guide for Windows users”.
  • How you’ll measure the success of your documentation: “With a good quickstart, we expect to see 50% fewer issues opened about Windows installation problems.”
  • What skills your technical writer would need (break down into “must have” and “nice to have” categories): “Must have: access Windows machine to test instructions”.
  • What volunteer help is needed from community members: “need help onboarding technical writers to our discussion groups”. Include a way for the community to discuss the proposal.
  • Most importantly, include a way for interested technical writers to reach you and ask questions!

Technical writers: reach out to organizations early

Technical writers do not submit a formal application through Google Season of Docs, but instead apply to accepted organizations directly. Technical writers can share their contact information now via the Google Season of Docs GitHub repository. They can also submit proposals directly to organizations using the contact information shared on the organization’s project page. Check out our technical writer guide for more information. We suggest that interested technical writers read through the case studies from the previous Google Season of Docs participants to get an idea of the kinds of projects that have been accepted and what organizations have learned from working with technical writers.

General Timeline

  • February 22 - April 2, 2024: Open source organizations apply to take part in Google Season of Docs.
  • April 10: Google publishes the list of accepted organizations, along with their project proposals, and doc development can begin.
  • May 22: Technical writer hiring deadline.
  • June 5: Organization administrators begin to submit monthly evaluations to report on the status of their project.
  • November 22 - December 10: Organization administrators submit their case study and final project evaluation.
  • December 13: Google publishes the 2024 case studies and aggregate project data.
  • May 1, 2025: Organizations begin to participate in post-program followup surveys.

See the full program timeline for more details.

Join us

Explore the Google Season of Docs website at g.co/seasonofdocs to learn more about participating in the program. Use our logo and other promotional resources to spread the word. Check out the timeline and FAQ, and get ready to apply!

By Erin McKean – Google Open Source Programs Office
