Posts from August 2025

This Week in Open Source #7

Friday, August 8, 2025

This Week in Open Source for 08/08/2025

A look around the world of open source
by Daryl Ducharme, Google Open Source

Upcoming Events

  • August 14-16: Open Source Festival 2025 (OSCAFest'25) is happening in Lagos, Nigeria. It brings the community together to make open source contribution a regular practice for African developers while strongly advocating for the free and open source software movement.
  • August 25-27: Open Source Summit Europe (OSSEU) is happening in Amsterdam, Netherlands. It is the premier event for the open source community to collaborate, share information, solve problems, and gain knowledge, furthering open source innovation and ensuring a sustainable open source ecosystem. Many Googlers will be giving talks there, alongside many other speakers.
  • September 5-7: NixCon 2025 is happening in Switzerland. It is the annual conference for the Nix and NixOS community where Nix enthusiasts learn, share, and connect with others.

Open Source Reads and Links

  • The Asymmetry of Open Source - Open source software projects need funding, but users are not obligated to pay for them. Companies should invest in open source to maintain quality and avoid issues, while hobbyists can contribute without financial pressure. Proper boundaries and mutual responsibility between companies and developers are essential for a healthy open source ecosystem. How do we find and set those boundaries?
  • Linux Foundation Announces Intent to Form Developer Relations Foundation - The Linux Foundation has announced its intent to form the Developer Relations Foundation (DRF), which aims to unify best practices and enhance the role of developer relations in technology. The DRF will focus on collaboration and shared knowledge. Having an open source organization behind this helps make sure DevRel always serves developers as well as whoever employs them.
  • 5 tips to get started on accessibility - Not exactly open source and yet super important. So important to the open source community that All Things Open posted it on their site. Accessibility (A11y) is always useful. The more it gets used properly, the more useful it is for everyone.
  • Bringing open source development to Trust and Safety - Longtime open source champion, former Googler, and now COO at Roost, Anne Bertucio discusses how some teams still have a difficult time understanding open source. The standards they are used to don't always apply in the transparent world of open source. Bringing open source to those teams therefore requires understanding where they are coming from and discussing its limitations as well as its benefits.
  • How we made JSON.stringify more than twice as fast - One of the beautiful things about open source is the transparency in projects. Google's Chromium V8 engine is no exception. This walkthrough of the technical structuring that led to a faster JSON.stringify is a great way to learn some approaches to solving software bottlenecks that you may not have thought of. Because it is open source, you can also visit the repository and follow along with the history of these code changes.

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account.

What's new in Apache Iceberg v3?

Thursday, August 7, 2025

A Deeper Dive into Apache Iceberg V3: How New Designs Are Solving Core Data Lake Challenges

The Next Chapter for Apache Iceberg: Welcoming the Iceberg V3 Spec
by Talat Uyarer, BigQuery Managed Iceberg & Shane Glass, Google Open Source Programs Office

An infographic illustrating the new features in Apache Iceberg V3. In the center is a logo of an iceberg with V3 written on it. Arrows point from the central logo to four surrounding illustrations, each representing a new feature: Top left: Deletion Vectors, depicted as a tall stack of data blocks. Top right: Variant Data Type, shown as a collection of colorful circles and cubes. Bottom right: Geospatial Data Types, illustrated by a folded world map with location pins. Bottom left: Row Lineage, represented by a grid of various colorful icons.

The data community has long grappled with the challenge of how to bring database-like agility to petabyte-scale datasets stored in open cloud storage. The trade-off has often been between the scalability of data lakes and the performance and ease-of-use of traditional data warehouses. Executing fine-grained updates or evolving table schemas on massive tables often required slow, expensive, and disruptive operations.

The Apache Iceberg project is taking on this challenge. Early versions introduced a revolutionary metadata layer that brought reliability and ACID transactions to data lakes. However, certain operations still presented performance bottlenecks at scale.

With the ratification of the V3 specification, the Apache Iceberg community has introduced new designs that directly address these core issues. These advancements represent a significant leap forward in the mission to build an open and high-performance data lakehouse architecture. Let's explore the technical details of these solutions.

More Efficient Row-Level Transactions with Deletion Vectors

A primary challenge for data lakes has been handling row-level deletes efficiently. Previous approaches, like positional delete files, were a clever solution but could lead to performance degradation at query time when a reader had to reconcile many small delete files against large data files.

The Iceberg V3 spec introduces binary deletion vectors, a more performant and scalable architecture. The core idea is to attach a bitmap to each data file, where each bit corresponds to a row, marking it as deleted or not.

When a query engine reads a data file, it also reads its corresponding deletion vector. As it scans rows, it can check the bitmap with minimal overhead and skip rows marked for deletion. This design is made exceptionally efficient through the use of Roaring bitmaps. This data structure is ideal for this task because it can compress sparse sets of integers—like the positions of deleted rows—into a tiny footprint.

The practical difference is profound:

  • Previous Model (Positional Deletes): A query might involve reading one or more separate delete files, like deletes.avro, containing tuples of (file_path, row_position), and reconciling them against the data files.
  • V3 Model (Deletion Vectors): Each data file (e.g., file_A.parquet) is paired with a small, efficient sidecar file (e.g., file_A.puffin) containing a Roaring bitmap of its deleted rows.

This change localizes delete information, streamlines the read path, and dramatically improves the performance of workloads that rely on frequent Change Data Capture (CDC) or row-level updates.
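The contrast between the two models can be sketched in a few lines of Python. This is a toy illustration, not Iceberg's implementation: the `DeletionVector` class, the `scan` function, and the sample rows are hypothetical, and a plain Python integer stands in for the compressed Roaring bitmap described above.

```python
from typing import Iterable, Iterator, List, Set, Tuple


class DeletionVector:
    """Toy per-file deletion vector: bit N is set when row N is deleted.

    Assumption: a plain int is used as an uncompressed bitmap in place of
    a Roaring bitmap, purely for illustration.
    """

    def __init__(self, deleted_positions: Iterable[int] = ()) -> None:
        self._bits = 0
        for pos in deleted_positions:
            self._bits |= 1 << pos

    def is_deleted(self, pos: int) -> bool:
        return (self._bits >> pos) & 1 == 1


def positions_for(path: str, deletes: List[Tuple[str, int]]) -> Set[int]:
    """Previous model: filter a shared list of (file_path, row_position)."""
    return {pos for file_path, pos in deletes if file_path == path}


def scan(rows: List[str], dv: DeletionVector) -> Iterator[str]:
    """V3 model: check the file's own bitmap while scanning its rows."""
    for pos, row in enumerate(rows):
        if not dv.is_deleted(pos):
            yield row


# Hypothetical delete entries that mix several data files together.
positional_deletes = [
    ("file_A.parquet", 1),
    ("file_B.parquet", 0),
    ("file_A.parquet", 3),
]

file_a_rows = ["r0", "r1", "r2", "r3", "r4"]

# The delete information for file_A is localized into one bitmap.
file_a_dv = DeletionVector(positions_for("file_A.parquet", positional_deletes))
live_rows = list(scan(file_a_rows, file_a_dv))
print(live_rows)
```

In the sketch, the reader of file_A never has to look at entries that belong to file_B once the bitmap exists, which is the localization the V3 design aims for.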

Simplified Schema Evolution with Default Column Values

Another common operational hurdle in managing large tables has been schema evolution. Adding a column to a table with billions of rows traditionally required a "backfill"—a costly and time-consuming job to rewrite all existing data files to add the new column.

Iceberg V3 eliminates this friction with default column values. This feature allows a default value to be specified directly in the table's metadata when a column is added.

ALTER TABLE events ADD COLUMN version INT DEFAULT 1;

This operation is instantaneous because it only modifies metadata. No data files are touched. When a query engine encounters an older data file without the version column, it consults the table schema, finds the default value, and seamlessly populates it in the query results on the fly. This simple but powerful mechanism makes schema evolution a fast, non-disruptive operation, allowing data models to evolve quickly.
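The read-time behavior described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not how any particular engine is implemented: `SCHEMA_DEFAULTS`, `read_file`, and the `event_id` column are hypothetical names, and data files are modeled as a set of column names plus a list of row dictionaries.

```python
from typing import Any, Dict, List, Set

# Hypothetical table schema after the ALTER TABLE: column -> default value.
# "version" was added later with DEFAULT 1; "event_id" has no default.
SCHEMA_DEFAULTS: Dict[str, Any] = {"event_id": None, "version": 1}


def read_file(
    file_columns: Set[str],
    rows: List[Dict[str, Any]],
    schema_defaults: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Project stored rows onto the current schema.

    Columns missing from the file are filled from the table metadata;
    the stored data itself is never rewritten.
    """
    result = []
    for row in rows:
        projected = {}
        for column, default in schema_defaults.items():
            if column in file_columns:
                projected[column] = row[column]
            else:
                projected[column] = default
        result.append(projected)
    return result


# A file written before the column existed, and one written after.
old_rows = read_file({"event_id"}, [{"event_id": 10}], SCHEMA_DEFAULTS)
new_rows = read_file(
    {"event_id", "version"}, [{"event_id": 11, "version": 2}], SCHEMA_DEFAULTS
)
print(old_rows)
print(new_rows)
```

The older file yields `version` equal to the default, while the newer file keeps the value that was actually written.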

Improved Query Engine Compatibility with Enhanced Data Types and Lineage

Beyond these headline features, V3 broadens the capabilities of Iceberg to support more advanced use cases:

  • Row-Level Lineage: For robust auditing and reliable CDC pipelines, V3 formalizes the tracking of row history. By embedding metadata about when a row was added or last modified, Iceberg tables can now provide a clear lineage, simplifying data governance and enabling more efficient downstream data replication.
  • Rich Data Types: V3 closes the gap with traditional databases by introducing a more expressive type system. This includes a VARIANT type for handling semi-structured data like JSON, native GEOMETRY and GEOGRAPHY types for advanced geospatial analysis, and nanosecond-precision timestamps through the new timestamp_ns and timestamptz_ns data types, a thousandfold increase in resolution over the previous microsecond limit.
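To make the precision difference concrete, here is a small sketch that assumes timestamps are represented as integer counts since the epoch. The sample values and the `to_micros` helper are hypothetical and chosen only for illustration.

```python
NANOS_PER_MICRO = 1_000


def to_micros(ts_nanos: int) -> int:
    """Truncate a nanosecond count to microsecond precision."""
    return ts_nanos // NANOS_PER_MICRO


# Two hypothetical events that fall within the same microsecond.
event_a_nanos = 1_754_550_000_000_000_123
event_b_nanos = 1_754_550_000_000_000_456

# At nanosecond precision the events remain distinct and ordered.
distinct_at_nanos = event_a_nanos < event_b_nanos

# At microsecond precision they collapse to the same value.
collide_at_micros = to_micros(event_a_nanos) == to_micros(event_b_nanos)

print(distinct_at_nanos, collide_at_micros)
```

Workloads that need to order or deduplicate high-frequency events are the kind of case where the extra resolution matters.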

Building the Future of the Open Data Lakehouse

These V3 features—deletion vectors, default values, row lineage, and richer types—are more than just individual improvements. Together, they represent a cohesive step toward a new paradigm where the lines between the data lake and the data warehouse are erased. They enable faster, more efficient, and more flexible data operations than previously thought possible.

This progress is a testament to the collaborative spirit of the Apache Iceberg community. At Google, we are proud to contribute to and support open-source projects like Iceberg that are defining the future of data architecture. We are excited to see the innovative applications the community will build on this powerful new foundation.

Want to get started with Iceberg? Check out this blog post to learn more about how Google Cloud's managed Iceberg offering, BigLake tables for Apache Iceberg in BigQuery, makes building Iceberg-native lakehouses easier by maximizing performance without sacrificing governance.


This Week in Open Source #6

Friday, August 1, 2025

This Week in Open Source for 08/01/2025

A look around the world of open source

by Daryl Ducharme & amanda casari, Google Open Source Programs Office

Diving into the open source world this week, we'll cover upcoming events that foster collaboration and innovation, alongside new reads and links that highlight significant advancements and discussions within the open source community. From new Google projects enhancing package ecosystem confidence to thought-provoking articles on open source funding, we hope this keeps you aware of new areas of the ecosystem.

Upcoming Events

  • August 14-16: Open Source Festival 2025 (OSCAFest'25) is happening in Lagos, Nigeria. It brings the community together to make open source contribution a regular practice for African developers while strongly advocating for the free and open source software movement.
  • August 25-27: Open Source Summit Europe (OSSEU) is happening in Amsterdam, Netherlands. It is the premier event for the open source community to collaborate, share information, solve problems, and gain knowledge, furthering open source innovation and ensuring a sustainable open source ecosystem. Many Googlers will be giving talks there, alongside many other speakers.
  • September 5-7: NixCon 2025 is happening in Switzerland. It is the annual conference for the Nix and NixOS community where Nix enthusiasts learn, share, and connect with others.

Open Source Reads and Links

  • [Blog] Google introduced OSS Rebuild, a new project designed to enhance confidence in open source package ecosystems through the reproduction of upstream artifacts.
  • [Story] SF-Based Internet Archive Is Now a Federal Depository Library. What Does That Mean? - The Internet Archive is a foundational reference and repository for open-access information and digital archives. The San Francisco-based digital library now has federal depository status, joining a network of over 1,100 libraries that archive government documents and make them accessible to the public, even as ongoing legal challenges pose an existential threat to the organization.
  • [Video] Keynote: Building community through collaborative datasets - Mago Torres' keynote from csv,conf 8, on her work building collaborative datasets for award-winning data journalism, captures the spirit of how open technology enables communities to accomplish more together.
  • [Paper] Anubis Pilot Project Report - June 2025 - In May and June 2025, Duke University Libraries (DUL) successfully implemented Anubis, a configurable open source web application firewall (WAF), to combat persistent AI-related bot scraping. During this pilot (May 1 - June 10, 2025), aggressive bot scraping caused outages for three critical library platforms (Duke Digital Repository, Archives & Manuscripts, and the Books & Media Catalog); Anubis mitigated the problem in each instance.
  • [Article] Microsoft-owned GitHub says open source needs to be funded - The Register published this editorial, which asks whether open source software has reached the point where it should be managed as infrastructure and funded by the governments that rely on it. Some studies show impressive numbers for how much it contributes to many economies.
  • [Blog] Open Source Explained Like You're Five (But Smarter) - Explaining open source to people outside the tech world is tough. This article uses some good metaphors along with some details you may not have known to better explain it and spread the word. Or, you could just send them this article and hope they read it. 😜

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account.
