The promise of the Open Data Lakehouse is simple: your data should not be locked into a single engine. It should be accessible, interoperable, and built on open standards. Today, we are taking a major step forward in making that promise a reality for developers, data engineers, and researchers everywhere.
We are thrilled to announce the availability of high-quality Public Datasets served via the Apache Iceberg REST Catalog. Hosted on Google Cloud's BigLake, these datasets are available for read-only access to anyone with a Google Cloud account.
Whether you are using Apache Spark, Trino, Flink, or BigQuery, you can now connect to a live, production-grade Iceberg catalog and start querying data immediately. No files to copy, no storage buckets to manage. Just configure your catalog and query.
How to Access Public Datasets
This initiative is designed to be engine-agnostic. We provide the storage and the catalog; you bring the compute. This lets you benchmark different engines, test new Iceberg features, or simply explore interesting data without setting up infrastructure or finding data to ingest.
How to Connect with Apache Spark
You can connect to the public dataset using any standard Spark environment (local, Google Cloud Dataproc, or other vendors). You only need to point your Iceberg catalog configuration to our public REST endpoint.
Prerequisites:
- A Google Cloud Project (for authentication).
- Standard Google Application Default Credentials (ADC) set up in your environment.
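If you have not set up Application Default Credentials before, a typical setup with the gcloud CLI looks like this (a sketch; it assumes gcloud is installed, and you must substitute your own project ID):

```shell
# Log in and create Application Default Credentials (opens a browser)
gcloud auth application-default login

# Set the project that quota usage will be attributed to
gcloud config set project <YOUR_PROJECT_ID>
```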
Spark Configuration:
Use the following configuration flags when starting your Spark Shell or SQL session. This configures a catalog named bqms (BigQuery Metastore) pointing to our public REST endpoint.
PROJECT_ID=<YOUR_PROJECT_ID>
spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0 \
--conf spark.hadoop.hive.cli.print.header=true \
--conf spark.sql.catalog.bqms=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.bqms.type=rest \
--conf spark.sql.catalog.bqms.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog \
--conf spark.sql.catalog.bqms.warehouse=gs://biglake-public-nyc-taxi-iceberg \
--conf spark.sql.catalog.bqms.header.x-goog-user-project=$PROJECT_ID \
--conf spark.sql.catalog.bqms.rest.auth.type=google \
--conf spark.sql.catalog.bqms.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO \
--conf spark.sql.catalog.bqms.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.defaultCatalog=bqms
Note: Replace <YOUR_PROJECT_ID> with your actual Google Cloud Project ID. This is required so the REST Catalog can attribute quota usage to your project, even for free public access.
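The same catalog can also be configured programmatically, for example from PySpark. The following is a minimal sketch (assuming pyspark is installed and the Iceberg runtime and GCP bundle JARs from the --packages flags above are on the classpath); the property keys mirror the spark-sql flags one-for-one:

```python
from pyspark.sql import SparkSession

PROJECT_ID = "<YOUR_PROJECT_ID>"  # replace with your Google Cloud Project ID

spark = (
    SparkSession.builder
    .appName("biglake-public-datasets")
    # Register an Iceberg REST catalog named "bqms"
    .config("spark.sql.catalog.bqms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bqms.type", "rest")
    .config("spark.sql.catalog.bqms.uri", "https://biglake.googleapis.com/iceberg/v1/restcatalog")
    .config("spark.sql.catalog.bqms.warehouse", "gs://biglake-public-nyc-taxi-iceberg")
    .config("spark.sql.catalog.bqms.header.x-goog-user-project", PROJECT_ID)
    .config("spark.sql.catalog.bqms.rest.auth.type", "google")
    .config("spark.sql.catalog.bqms.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")
    .config("spark.sql.catalog.bqms.header.X-Iceberg-Access-Delegation", "vended-credentials")
    .config("spark.sql.defaultCatalog", "bqms")
    .getOrCreate()
)

# Sanity check: list the namespaces visible in the public catalog
spark.sql("SHOW NAMESPACES").show()
```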
Exploring the Data: Sample Queries
Once connected, you have full SQL access to the datasets. We are launching with the classic NYC Taxi dataset, modeled as an Iceberg table to showcase partitioning and metadata capabilities.
1. The "Hello World" of Analytics
This query aggregates millions of records to find the average fare and trip distance by passenger count. It demonstrates how Iceberg efficiently scans data files without needing to list directories.
SELECT
passenger_count,
COUNT(1) AS num_trips,
ROUND(AVG(total_amount), 2) AS avg_fare,
ROUND(AVG(trip_distance), 2) AS avg_distance
FROM
bqms.public_data.nyc_taxicab
WHERE
data_file_year = 2021
AND passenger_count > 0
GROUP BY
passenger_count
ORDER BY
num_trips DESC;
What this demonstrates:
- Partition Pruning: The query filters on data_file_year, allowing the engine to skip scanning data from other years entirely.
- Vectorized Reads: Engines like Spark can process the Parquet files efficiently in batches.
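You can inspect this behavior yourself: Iceberg exposes metadata tables alongside every table. For example (the exact columns returned may vary slightly by engine and Iceberg version):

```sql
-- Inspect the table's partitions and per-partition record counts
SELECT * FROM bqms.public_data.nyc_taxicab.partitions;

-- List the snapshots that make up the table's history
SELECT snapshot_id, committed_at, operation
FROM bqms.public_data.nyc_taxicab.snapshots;
```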
2. Time Travel: Auditing Data History
One of Iceberg's most powerful features is Time Travel. You can query the table as it existed at a specific point in the past.
-- Compare the row count of the current version vs. a specific snapshot
SELECT
'Current State' AS version,
COUNT(*) AS count
FROM bqms.public_data.nyc_taxicab
UNION ALL
SELECT
'Past State' AS version,
COUNT(*) AS count
FROM bqms.public_data.nyc_taxicab VERSION AS OF 2943559336503196801;
Description:
This query allows you to audit changes. By querying the history metadata table (e.g., SELECT * FROM bqms.public_data.nyc_taxicab.history), you can find snapshot IDs and "travel back" to see how the dataset grew over time.
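Snapshot IDs are one way to time travel; Spark SQL also accepts a timestamp, which reads the snapshot that was current at that moment. A sketch (the timestamp below is illustrative and must fall within the table's retained history):

```sql
SELECT COUNT(*) AS trips_at_timestamp
FROM bqms.public_data.nyc_taxicab TIMESTAMP AS OF '2024-06-01 00:00:00';
```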
Coming Soon: An Iceberg V3 Playground
We are not just hosting static data; we are building a playground for the future of Apache Iceberg. We plan to release new datasets specifically designed to help you test Iceberg V3 Spec features.
Start Building Today
The goal of these public datasets is to lower the barrier to entry. You don't need to manage infrastructure to learn Iceberg; you just need to connect. Whether you are a data analyst, data scientist, data engineer, or a data enthusiast, today you can:
- Use BigQuery (via BigLake) to query these tables directly using SQL, combining them with your private data.
- Test your OSS engine (e.g. Spark, Trino, Flink etc.) configurations against a live REST Catalog.
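For instance, a Trino Iceberg catalog pointed at the same endpoint might look roughly like the following properties file (a sketch; authentication and header configuration vary by Trino version, so consult your engine's REST catalog documentation):

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog
iceberg.rest-catalog.warehouse=gs://biglake-public-nyc-taxi-iceberg
```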
Start building an open, managed, and high-performance Iceberg lakehouse for advanced analytics and data science with https://cloud.google.com/biglake today!
Happy Querying!