opensource.google.com

This Week in Open Source #7

Friday, August 8, 2025

This Week in Open Source for 08/08/2025

A look around the world of open source
by Daryl Ducharme, Google Open Source

Upcoming Events

  • August 14-16: Open Source Festival 2025 (OSCAFest'25) is happening in Lagos, Nigeria. It uses community to help integrate the act of open source contribution to African developers whilst strongly advocating the movement of free and open source software.
  • August 25-27: Open Source Summit Europe (OSSEU) is happening in Amsterdam, Netherlands. It is the premier event for the open source community to collaborate, share information, solve problems, and gain knowledge, furthering open source innovation and ensuring a sustainable open source ecosystem. Many Googlers will be there giving talks along with so many others.
  • September 5-7: NixCon 2025 is happening in Switzerland. It is the annual conference for the Nix and NixOS community where Nix enthusiasts learn, share, and connect with others.

Open Source Reads and Links

  • The Asymmetry of Open Source - Open source software projects need funding, but users are not obligated to pay for them. Companies should invest in open source to maintain quality and avoid issues, while hobbyists can contribute without financial pressure. Proper boundaries and mutual responsibility between companies and developers are essential for a healthy open source ecosystem. How do we find and set those boundaries?
  • Linux Foundation Announces Intent to Form Developer Relations Foundation - The Linux Foundation has created the Developer Relations Foundation, which aims to unify best practices and enhance the role of developer relations in technology. The DRF will focus on collaboration and shared knowledge. Having an open source organization behind this helps to make sure DevRel is always of service to developers along with whoever is employing them.
  • 5 tips to get started on accessibility - Not exactly open source and yet super important. So important to the open source community that All Things Open posted it on their site. Accessibility (A11y) is always useful. The more it gets used properly, the more useful it is for everyone.
  • Bringing open source development to Trust and Safety - Ever the open source champion, former Googler and now COO at ROOST, Anne Bertucio discusses how some teams still have a difficult time understanding open source. The standards that they are used to don't always apply within the transparent world of open source. This means bringing open source to those teams requires understanding where they are coming from and discussing its limitations as well as its benefits.
  • How we made JSON.stringify more than twice as fast - One of the beautiful things about open source is the transparency in projects. Google's Chromium V8 engine is no exception. This walkthrough of the technical restructuring that led to a faster JSON.stringify is a great way to learn some approaches to solving software bottlenecks that you may not have thought of. Because it is open source, you can also visit the repository and follow along with the history of these code changes.

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account.

What's new in Apache Iceberg v3?

Thursday, August 7, 2025

A Deeper Dive into Apache Iceberg V3: How New Designs Are Solving Core Data Lake Challenges

The Next Chapter for Apache Iceberg: Welcoming the Iceberg V3 Spec
by Talat Uyarer, BigQuery Managed Iceberg & Shane Glass, Google Open Source Programs Office

An infographic illustrating the new features in Apache Iceberg V3. In the center is a logo of an iceberg with V3 written on it. Arrows point from the central logo to four surrounding illustrations, each representing a new feature: Top left: Deletion Vectors, depicted as a tall stack of data blocks. Top right: Variant Data Type, shown as a collection of colorful circles and cubes. Bottom right: Geospatial Data Types, illustrated by a folded world map with location pins. Bottom left: Row Lineage, represented by a grid of various colorful icons.

The data community has long grappled with the challenge of how to bring database-like agility to petabyte-scale datasets stored in open cloud storage. The trade-off has often been between the scalability of data lakes and the performance and ease-of-use of traditional data warehouses. Executing fine-grained updates or evolving table schemas on massive tables often required slow, expensive, and disruptive operations.

The Apache Iceberg project is taking on this challenge. Early versions introduced a revolutionary metadata layer that brought reliability and ACID transactions to data lakes. However, certain operations still presented performance bottlenecks at scale.

With the ratification of the V3 specification, the Apache Iceberg community has introduced new designs that directly address these core issues. These advancements represent a significant leap forward in the mission to build an open and high-performance data lakehouse architecture. Let's explore the technical details of these solutions.

More Efficient Row-Level Transactions with Deletion Vectors

A primary challenge for data lakes has been handling row-level deletes efficiently. Previous approaches, like positional delete files, were a clever solution but could lead to performance degradation at query time when a reader had to reconcile many small delete files against large data files.

The Iceberg V3 spec introduces binary deletion vectors, a more performant and scalable architecture. The core idea is to attach a bitmap to each data file, where each bit corresponds to a row, marking it as deleted or not.

When a query engine reads a data file, it also reads its corresponding deletion vector. As it scans rows, it can check the bitmap with minimal overhead and skip rows marked for deletion. This design is made exceptionally efficient through the use of Roaring bitmaps. This data structure is ideal for this task because it can compress sparse sets of integers—like the positions of deleted rows—into a tiny footprint.

The practical difference is profound:

  • Previous Model (Positional Deletes): A query might involve reading a central log of deletes, like deletes.avro, containing tuples of (file_path, row_position).
  • V3 Model (Deletion Vectors): Each data file (e.g., file_A.parquet) is paired with a small, efficient sidecar file (e.g., file_A.puffin) containing a Roaring bitmap of its deleted rows.

This change localizes delete information, streamlines the read path, and dramatically improves the performance of workloads that rely on frequent Change Data Capture (CDC) or row-level updates.
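To make the contrast concrete, here is a minimal Python sketch of the two read paths. It is an illustration under simplifying assumptions, not Iceberg's implementation: a plain Python integer used as a bitset stands in for the Roaring bitmap, and the function and variable names are hypothetical.

```python
# Illustrative sketch only: a Python int used as a bitset stands in for a
# Roaring bitmap, and all names here are hypothetical, not Iceberg APIs.

def build_deletion_vector(deleted_positions):
    """Pack deleted row positions into a bitset (bit i set => row i deleted)."""
    bitmap = 0
    for pos in deleted_positions:
        bitmap |= 1 << pos
    return bitmap


def scan_with_deletion_vector(rows, bitmap):
    """V3-style read: one bit test per row, using only this file's bitmap."""
    return [row for pos, row in enumerate(rows) if not (bitmap >> pos) & 1]


def scan_with_positional_deletes(file_path, rows, positional_deletes):
    """Older-style read: filter a shared list of (file_path, row_position)
    tuples down to this file before any row can be skipped."""
    deleted = {pos for path, pos in positional_deletes if path == file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]


rows = ["a", "b", "c", "d", "e"]

# Delete information shared across files, as (file_path, row_position) tuples.
positional_deletes = [
    ("file_A.parquet", 1),
    ("file_B.parquet", 0),
    ("file_A.parquet", 3),
]

# Delete information localized per data file.
deletion_vectors = {"file_A.parquet": build_deletion_vector([1, 3])}

live_old = scan_with_positional_deletes("file_A.parquet", rows, positional_deletes)
live_v3 = scan_with_deletion_vector(rows, deletion_vectors["file_A.parquet"])
print(live_old, live_v3)
```

Both paths return the same live rows. The difference is that the second one never has to look at delete entries belonging to other files.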

Simplified Schema Evolution with Default Column Values

Another common operational hurdle in managing large tables has been schema evolution. Adding a column to a table with billions of rows traditionally required a "backfill"—a costly and time-consuming job to rewrite all existing data files to add the new column.

Iceberg V3 eliminates this friction with default column values. This feature allows a default value to be specified directly in the table's metadata when a column is added.

ALTER TABLE events ADD COLUMN version INT DEFAULT 1;

This operation is instantaneous because it only modifies metadata. No data files are touched. When a query engine encounters an older data file without the version column, it consults the table schema, finds the default value, and seamlessly populates it in the query results on the fly. This simple but powerful mechanism makes schema evolution a fast, non-disruptive operation, allowing data models to evolve quickly.
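The read-time behavior can be sketched in a few lines of Python. This is a simplified model, not Iceberg's metadata format or reader: the schema layout, the sample rows, and the `read_rows` helper are all hypothetical.

```python
# Illustrative sketch only: the schema layout and function names are
# hypothetical and do not mirror Iceberg's metadata format.

# Table schema after `ADD COLUMN version INT DEFAULT 1`; only this metadata changed.
table_schema = [
    {"name": "event_id", "default": None},
    {"name": "version", "default": 1},
]

# A data file written before the column existed, and one written after.
old_file_rows = [{"event_id": 101}, {"event_id": 102}]
new_file_rows = [{"event_id": 103, "version": 2}]


def read_rows(file_rows, schema):
    """Project each stored row onto the current schema, filling in the
    column default when a file predates the column."""
    result = []
    for row in file_rows:
        result.append(
            {col["name"]: row.get(col["name"], col["default"]) for col in schema}
        )
    return result


results = read_rows(old_file_rows, table_schema) + read_rows(new_file_rows, table_schema)
for row in results:
    print(row)
```

The stored rows in `old_file_rows` are never rewritten. The default appears only in the query results.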

Improved Query Engine Compatibility with Enhanced Data Types and Lineage

Beyond these headline features, V3 broadens the capabilities of Iceberg to support more advanced use cases:

  • Row-Level Lineage: For robust auditing and reliable CDC pipelines, V3 formalizes the tracking of row history. By embedding metadata about when a row was added or last modified, Iceberg tables can now provide a clear lineage, simplifying data governance and enabling more efficient downstream data replication.
  • Rich Data Types: V3 closes the gap with traditional databases by introducing a more expressive type system. This includes a VARIANT type for handling semi-structured data like JSON, native GEOMETRY and GEOGRAPHY types for advanced geospatial analysis, and nanosecond-precision timestamps through the new timestamp_ns and timestamptz_ns data types, a significant increase in precision over the previous microsecond limit.
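What the move from microsecond to nanosecond precision buys can be shown with simple arithmetic. The Python sketch below uses plain integers as stand-ins for stored timestamp values. The event values are made up for illustration, and nothing here calls an Iceberg API.

```python
# Illustrative arithmetic only; plain integers stand in for stored timestamp
# values and nothing here calls an Iceberg API.

NANOS_PER_MICRO = 1_000

# Two hypothetical events 400 nanoseconds apart, as nanoseconds since the epoch.
event_a_ns = 1_754_611_200_000_000_100
event_b_ns = event_a_ns + 400

# Stored at microsecond precision, the sub-microsecond digits are dropped.
event_a_us = event_a_ns // NANOS_PER_MICRO
event_b_us = event_b_ns // NANOS_PER_MICRO

print(event_a_us == event_b_us)  # the two events become indistinguishable
print(event_b_ns - event_a_ns)   # nanosecond storage keeps the 400 ns gap
```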

Building the Future of the Open Data Lakehouse

These V3 features—deletion vectors, default values, row lineage, and richer types—are more than just individual improvements. Together, they represent a cohesive step toward a new paradigm where the lines between the data lake and the data warehouse are erased. They enable faster, more efficient, and more flexible data operations than previously thought possible.

This progress is a testament to the collaborative spirit of the Apache Iceberg community. At Google, we are proud to contribute to and support open-source projects like Iceberg that are defining the future of data architecture. We are excited to see the innovative applications the community will build on this powerful new foundation.

Want to get started with Iceberg? Check out this blog post to learn more about how Google Cloud's managed Iceberg offering, BigLake tables for Apache Iceberg in BigQuery, makes building Iceberg-native lakehouses easier by maximizing performance without sacrificing governance.


This Week in Open Source #6

Friday, August 1, 2025

This Week in Open Source for 08/01/2025

A look around the world of open source

by Daryl Ducharme & amanda casari, Google Open Source Programs Office

Diving into the open source world this week, we'll cover upcoming events that foster collaboration and innovation, alongside new reads and links that highlight significant advancements and discussions within the open source community. From new Google projects enhancing package ecosystem confidence to thought-provoking articles on open source funding, we hope this keeps you aware of new areas of the ecosystem.

Upcoming Events

  • August 14-16: Open Source Festival 2025 (OSCAFest'25) is happening in Lagos, Nigeria. It uses community to help integrate the act of open source contribution to African developers whilst strongly advocating the movement of free and open source software.
  • August 25-27: Open Source Summit Europe (OSSEU) is happening in Amsterdam, Netherlands. It is the premier event for the open source community to collaborate, share information, solve problems, and gain knowledge, furthering open source innovation and ensuring a sustainable open source ecosystem. Many Googlers will be there giving talks along with so many others.
  • September 5-7: NixCon 2025 is happening in Switzerland. It is the annual conference for the Nix and NixOS community where Nix enthusiasts learn, share, and connect with others.

Open Source Reads and Links

  • [Blog] Google introduced OSS Rebuild, a new project designed to enhance confidence in open source package ecosystems through the reproduction of upstream artifacts.
  • [Story] SF-Based Internet Archive Is Now a Federal Depository Library. What Does That Mean? - The Internet Archive is a foundational reference and repository for open-access information and digital archives. The San Francisco-based digital library now has federal depository status, joining a network of over 1,100 libraries that archive government documents and make them accessible to the public, even as ongoing legal challenges pose an existential threat to the organization.
  • [Video] Keynote: Building community through collaborative datasets - Mago Torres' keynote from csv,conf 8, on her work building collaborative datasets for award-winning data journalism, captures the spirit of how open technology enables communities to accomplish more together.
  • [Paper] Anubis Pilot Project Report - June 2025 - In May and June 2025, Duke University Libraries (DUL) successfully implemented Anubis, a configurable open source web application firewall (WAF), to combat persistent AI-related bot scraping. During this pilot (May 1 - June 10, 2025), aggressive bot scraping caused outages for three critical library platforms (Duke Digital Repository, Archives & Manuscripts, and the Books & Media Catalog); Anubis mitigated the problem in each instance.
  • [Article] Microsoft-owned GitHub says open source needs to be funded - The Register published this editorial, which asks whether open source software has reached the point where it should be managed as infrastructure and funded by the governments that rely on it. Some studies show impressive numbers for how much it contributes to many economies.
  • [Blog] Open Source Explained Like You're Five (But Smarter) - Explaining open source to people outside the tech world is tough. This article uses some good metaphors along with some details you may not have known to better explain it and spread the word. Or, you could just send them this article and hope they read it. 😜

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account.

This Week in Open Source #5

Friday, July 25, 2025

This Week in Open Source for July 25, 2025

A look around the world of open source

by Daryl Ducharme & amanda casari, Google Open Source Programs Office

We hope everyone is having a good summer. The world of open source certainly is, with more events and news that caught our attention.

Upcoming Events

  • July 31-August 3: FOSSY (Free and Open Source Software Yearly) will be held in Portland, Oregon and is focused on the creation and impact of free and open source software, uplifting contributors of all experience.
  • August 14-16: Open Source Festival 2025 (OSCAFest'25) is happening in Lagos, Nigeria. It uses community to help integrate the act of open source contribution to African developers whilst strongly advocating the movement of free and open source software.
  • August 25-27: Open Source Summit Europe (OSSEU) is happening in Amsterdam, Netherlands. It is the premier event for the open source community to collaborate, share information, solve problems, and gain knowledge, furthering open source innovation and ensuring a sustainable open source ecosystem. Many Googlers will be there giving talks along with so many others.

Open Source Reads and Links

  • [Press Release] Tech Veterans Anne Bertucio and Vinay Rao Join ROOST - A bit of a bittersweet post as our recent, now former Head of Open Source Programs Office, Anne Bertucio, joins ROOST as COO and the previous Head of Safeguards at Anthropic, Vinay Rao, joins as CTO.
  • [Article] An open-source SDK for finding dead code - Maintaining dead code is a waste of resources. So, having good tools for finding dead code in your applications is important. The open sourcing of Reaper for iOS and Android applications might be a worthwhile part of your toolbelt.
  • [Blog] Why I used to prefer permissive licenses and now favor copyleft - Choosing the right license for your open source projects is a very personal choice. A choice that is worth revisiting once in a while to see if your values have shifted and if there are new ideas for what might constitute free software that better align with those new values.
  • [Blog] Announcing FOKS: The Federated Open Key Service - Security and authentication are key to the tech world, and open source is a good way to get many eyes on the problems to find solutions. A new federated open key service, FOKS, built from the ground up and based on concepts learned while working with Keybase, is available now.
  • [Article] Kubernetes Surges in Enterprise, But What Can Take It Mainstream? - Different teams in the development work streams have their own ideas about the tech stack. Many teams using Kubernetes have made it quite popular for use in enterprise work, but some are still using systems that have been tried and tested in their own domains. What work needs to be done to get all teams on-board with using Kubernetes?
  • [Blog] Death by a thousand slops - The lead maintainer of the open source project curl continues to blog about how low-quality submissions to curl's Bug Bounty program are increasing the workload for the security team.
  • [Article] From A2A to MCP, a look at the protocols that might one day help AI automate you out of a job - Click-bait headline aside, a good overview of where these protocols are at, what they do, and a certain view on whether that's useful or not. We have our opinions, but we are probably biased ;)
  • [Article] How the Free Software Foundation battles the LLM bots - There are many bots out there crawling the web. In the early days of search, the solution was robots.txt files and bots that crawled slowly enough for systems to keep running smoothly. However, many LLM bots ignore robots.txt and are greedy with site resources, and that's on top of the other bot traffic sites have to deal with. Seeing how a large organization approaches this trend offers some useful shared knowledge.

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account.

Stop Leaked Credentials in Their Tracks with Veles, Our New Open-Source Secret Scanner

Tuesday, July 22, 2025

by Kevin Dungs, Charl de Nysschen & Sarah Lucas, Google

In today's complex software supply chain, a single leaked credential—an API key, a service account token, a password—can be all an attacker needs to breach your systems. These secrets can be accidentally committed to a source code repository, embedded in a container image, or attached to a support ticket, creating a critical and often invisible risk.

To help developers and security teams proactively find and fix these exposures, we are excited to announce Veles, a new open-source secret and credential scanner from Google.

Veles is designed to detect unintended exposure of sensitive credentials across your organization's internal systems. It helps you find secrets where they don't belong, so you can prevent them from being abused.

Why Veles? Key Features

Veles is a new, standalone module within our OSV-SCALIBR (Software Composition Analysis LIBRary) ecosystem, but it is built to be used independently. This means you can easily integrate it into your existing security tooling or use it as a standalone scanner.

In its initial release, Veles helps you find high-risk secrets in source code and user-provided artifacts. Our detection library currently identifies:

  • Google Cloud Platform (GCP) API Keys
  • GCP Service Account Keys
  • RubyGems API Keys

This is just the beginning. Veles is built to be extensible, allowing for the rapid addition of new secret types.
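To illustrate what an extensible, pattern-based detection library might look like, here is a minimal Python sketch. It is not Veles's API or detection logic. The registry and function names are hypothetical, and both regular expressions are assumptions based on commonly cited key formats rather than rules taken from Veles.

```python
import re

# Illustrative sketch only: this is not Veles's API. The detector registry is
# hypothetical, and both patterns are assumptions based on commonly cited key
# formats rather than rules taken from Veles.
DETECTORS = {}


def register_detector(secret_type, pattern):
    """Add a new secret type by registering a name and a regular expression."""
    DETECTORS[secret_type] = re.compile(pattern)


register_detector("gcp_api_key", r"AIza[0-9A-Za-z_\-]{35}")
register_detector("rubygems_api_key", r"rubygems_[0-9a-f]{48}")


def scan(text):
    """Return (secret_type, matched_text) pairs for every detector hit."""
    findings = []
    for secret_type, pattern in DETECTORS.items():
        for match in pattern.finditer(text):
            findings.append((secret_type, match.group(0)))
    return findings


# Obviously fake values built to fit the assumed formats.
fake_gcp_key = "AIza" + "x" * 35
fake_gem_key = "rubygems_" + "0" * 48
sample = f'api_key = "{fake_gcp_key}"\nGEM_HOST_API_KEY={fake_gem_key}\nname = "demo"'

findings = scan(sample)
for secret_type, value in findings:
    print(secret_type, value[:8] + "...")
```

In this sketch, supporting a new secret type takes one `register_detector` call. A production scanner would need considerably more than regular expressions, for example context checks to reduce false positives.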

Battle-Tested at Google: Powerful Real-World Integration

At Google, we're not just releasing Veles; we're actively using it to protect our own systems and the open-source ecosystem.

  • Internal Protection: Veles is already scanning Google's internal source code repositories and artifacts, helping us find and remediate leaked secrets before they become a problem.
  • Securing the Open Source Ecosystem: The Google Open Source Security Team is incorporating Veles into its pipeline that powers deps.dev, scanning hundreds of millions of open-source artifacts (packages, Docker images, and repositories) to detect and remediate leaked credentials across the community.
  • Integration with Google Cloud Products: Veles is being integrated directly into Google Cloud security services to bring secret scanning to our customers:
    • Artifact Analysis & Artifact Registry: Veles will power secret scanning in Artifact Registry, with findings surfaced through the Container Analysis API and, eventually, in the Artifact Registry UI.
    • Security Command Center (SCC): SCC's integration will provide comprehensive secret detection across the entire cloud lifecycle. This means scanning "left" into the development pipeline (like Infrastructure as Code) and "right" into active runtime environments (like Compute Engine and GKE). SCC will then unify these findings, helping you prioritize the most critical exposures and visualize potential attack paths.

The Road Ahead: What's Next for Veles?

This first release is a foundational step. Our roadmap for Veles includes:

  • Broader Detection: We will continuously expand the library of supported secret and credential types.
  • Automated Validation: We plan to add functionality to intelligently validate if a discovered secret is active.
  • Remediation Workflows: In the future, we aim to help automate the revocation of confirmed, leaked secrets.

Get Started with Veles Today

Veles is open-source and ready for you to use. You can integrate it into your CI/CD pipeline, run it against your existing repositories, or contribute to its development. Protecting your organization from leaked credentials is a critical part of a strong security posture, and Veles is here to help.

Ready to start scanning? Head over to the Veles GitHub repository to get started!
