HarbourBridge: From PostgreSQL to Cloud Spanner

Would you like to try out Cloud Spanner with data from an existing PostgreSQL database? Maybe you’ve wanted to ‘kick the tires’ on Spanner, but have been discouraged by the effort involved?

Today, we’re announcing a tool that makes it easy to try out Cloud Spanner with PostgreSQL data.

HarbourBridge is a tool that loads Spanner with the contents of an existing PostgreSQL database. It requires zero configuration—no manifests or data maps to write. Instead, it ingests pg_dump output, automatically builds a Spanner schema, and creates a new Spanner database populated with data from pg_dump.

HarbourBridge is part of the Cloud Spanner Ecosystem, a collection of public, open source repositories contributed to, owned, and maintained by the Cloud Spanner user community. None of these repositories are officially supported by Google as part of Cloud Spanner.

Get up and running fast

HarbourBridge is designed to simplify Spanner evaluation, and in particular to bootstrap the process by getting moderate-size PostgreSQL datasets (up to a few GB) into Spanner. Many features of PostgreSQL, especially those that don't map directly to Spanner features, are ignored, for example non-primary-key indexes, functions, and sequences.

View HarbourBridge as a way to get up and running fast, so you can focus on critical things like tuning performance and getting the most out of Spanner. Expect that you'll need to tweak and enhance what HarbourBridge produces; more on this later.

Quick-start guide

The HarbourBridge README contains a step-by-step quick-start guide; here we review the main steps. Before you begin, you'll need a Cloud Spanner instance, the Cloud Spanner API enabled for your Google Cloud project, authentication credentials configured to use the Cloud API, and Go installed on your development machine.

To download and install HarbourBridge, run:

go get -u github.com/cloudspannerecosystem/harbourbridge

The tool should now be installed as $GOPATH/bin/harbourbridge. To use HarbourBridge on a PostgreSQL database called mydb, run:

pg_dump mydb | $GOPATH/bin/harbourbridge

The tool uses the cloud project specified by the GCLOUD_PROJECT environment variable and automatically determines the Cloud Spanner instance associated with that project. It then converts the PostgreSQL schema for mydb to a Spanner schema, creates a new Cloud Spanner database with this schema, and finally populates the new database with the data from mydb. HarbourBridge also generates several files when it runs: a schema file, a report file (with details of the conversion), and a bad data file (if any data is dropped). See Files Generated by HarbourBridge.

Take care with ACLs

Note that PostgreSQL table-level and row-level ACLs are dropped during conversion since they are not supported by Spanner (Spanner manages access control at the database level). All data written to Spanner will be visible to anyone who can access the database created by HarbourBridge (which inherits default permissions from your Cloud Spanner instance).

Next steps

The tables created by HarbourBridge provide a starting point for evaluation of Spanner. While they preserve much of the core structure of your PostgreSQL schema and data, many important PostgreSQL features have been dropped.

In particular, HarbourBridge preserves primary keys but drops all other indexes. This means that the out-of-the-box performance you get from the tables created by HarbourBridge can be significantly slower than PostgreSQL performance. If HarbourBridge has dropped indexes that are important to the performance of your SQL queries, consider adding Secondary Indexes to the tables created by HarbourBridge. Use the existing PostgreSQL indexes as a guide. In addition, Spanner's Interleaved Tables can provide a significant performance boost.
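
To make the index step concrete, here is a minimal Go sketch that builds a Spanner CREATE INDEX statement from the columns of a PostgreSQL index you want to recreate. The helper createIndexDDL and the table, column, and index names are hypothetical and chosen only for illustration. It is not part of HarbourBridge, and it does no quoting or validation of identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// createIndexDDL builds a Spanner CREATE INDEX statement for the given
// table and columns. This is a hypothetical helper for illustration only:
// it does not validate or quote identifiers.
func createIndexDDL(indexName, table string, columns []string, unique bool) string {
	kind := "INDEX"
	if unique {
		kind = "UNIQUE INDEX"
	}
	return fmt.Sprintf("CREATE %s %s ON %s (%s)",
		kind, indexName, table, strings.Join(columns, ", "))
}

func main() {
	// Hypothetical example: recreate a PostgreSQL index on
	// orders(customer_id, created_at) as a Spanner secondary index.
	fmt.Println(createIndexDDL(
		"orders_by_customer", "orders",
		[]string{"customer_id", "created_at"}, false))
}
```

The generated statement can then be applied to the database created by HarbourBridge as a schema update, using whichever Spanner client or console you normally use.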

Other dropped features include functions, sequences, procedures, triggers, and views. In addition, types have been mapped based on the types supported by Spanner. Types such as integers, floats, char/text, bools, timestamps and (some) array types map fairly directly to Spanner, but many other types do not and instead are mapped to Spanner's STRING(MAX). See Schema Conversion for details of the type conversions and their tradeoffs.
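
The shape of this type mapping can be sketched in a few lines of Go. The table below is an assumption made for illustration, not HarbourBridge's actual mapping (see Schema Conversion for that). It covers a handful of types that map directly, wraps one-dimensional arrays of those types, and falls back to STRING(MAX) for everything else. The function name spannerType is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// fallback is the catch-all Spanner type for PostgreSQL types that have
// no direct equivalent.
const fallback = "STRING(MAX)"

// direct lists a few PostgreSQL types that map fairly directly to Spanner.
// Illustrative assumption only; not HarbourBridge's real mapping table.
var direct = map[string]string{
	"smallint":                 "INT64",
	"integer":                  "INT64",
	"bigint":                   "INT64",
	"real":                     "FLOAT64",
	"double precision":         "FLOAT64",
	"boolean":                  "BOOL",
	"text":                     "STRING(MAX)",
	"timestamp with time zone": "TIMESTAMP",
}

// spannerType returns an illustrative Spanner type for a PostgreSQL type name.
func spannerType(pgType string) string {
	t := strings.ToLower(strings.TrimSpace(pgType))

	// One-dimensional arrays of directly mapped types become ARRAY<T>.
	// Anything else, including multi-dimensional arrays, falls back.
	if strings.HasSuffix(t, "[]") {
		if elem, ok := direct[strings.TrimSuffix(t, "[]")]; ok {
			return "ARRAY<" + elem + ">"
		}
		return fallback
	}

	if s, ok := direct[t]; ok {
		return s
	}
	return fallback
}

func main() {
	for _, t := range []string{"bigint", "integer[]", "jsonb", "integer[][]"} {
		fmt.Printf("%s -> %s\n", t, spannerType(t))
	}
}
```

The tradeoff of the fallback is that values survive the migration as text, but type-specific behaviour (validation, comparison, operators) is lost and has to be handled by the application.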

Recap

HarbourBridge automates much of the manual work of trying out Cloud Spanner using PostgreSQL data. The goal is to bootstrap your evaluation and help get you to the meaty issues as quickly as possible. The tables generated by HarbourBridge provide a starting point, but they will likely need to be tweaked and enhanced to support a full evaluation.

We encourage you to try out the tool, send feedback, file issues, fork and modify the codebase, and send PRs for fixes and new functionality. Our plans and aspirations for developing HarbourBridge further are outlined in the HarbourBridge Whitepaper. HarbourBridge is part of the Cloud Spanner Ecosystem, owned and maintained by the Cloud Spanner user community. It is not officially supported by Google as part of Cloud Spanner.

By Nevin Heintze, Cloud Spanner

opensource.google.com

Google Open Source Blog