Protocol Buffers: Google's Data Interchange Format
Monday, July 7, 2008
At Google, our mission is organizing all of the world's information. We use literally thousands of different data formats to represent networked messages between servers, index records in repositories, geospatial datasets, and more. Most of these formats are structured, not flat. This raises an important question: How do we encode it all?
XML? No, that wouldn't work. As nice as XML is, it isn't going to be efficient enough for this scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy.
Do we just write the raw bytes of our in-memory data structures to the wire? No, that's not going to work either. When we roll out a new version of a server, it almost always has to start out talking to older servers. New servers need to be able to read the data produced by old servers, and vice versa, even if individual fields have been added or removed. When data on disk is involved, this is even more important. Also, some of our code is written in Java or Python, so we need a portable solution.
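To make the versioning problem concrete, here is a minimal sketch using Python's standard struct module as a stand-in for "raw bytes of an in-memory structure." The record layouts and field names are invented for this illustration and are not taken from any real Google server.

```python
import struct

# Hypothetical fixed layouts, invented for this illustration.
V1_LAYOUT = "<IH"   # version 1: user id (uint32), age (uint16)
V2_LAYOUT = "<IHI"  # version 2 appends a signup timestamp (uint32)


def write_v2(user_id, age, signup_ts):
    """A new server dumps its record using the version 2 layout."""
    return struct.pack(V2_LAYOUT, user_id, age, signup_ts)


def read_v1(data):
    """An old server only knows the version 1 layout."""
    return struct.unpack(V1_LAYOUT, data)


new_record = write_v2(42, 30, 1215388800)

try:
    read_v1(new_record)
    old_reader_ok = True
except struct.error:
    # The buffer holds 10 bytes, but the old layout expects exactly 6.
    old_reader_ok = False
```

With a fixed layout, adding a single field is enough to make the old reader reject the new data, which is the kind of breakage described above.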
Do we write hand-coded parsing and serialization routines for each data structure? Well, we used to. Needless to say, that didn't last long. When you have tens of thousands of different structures in your code base that need their own serialization formats, you simply cannot write them all by hand.
Instead, we developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes that represent those structures in the language of your choice. These classes come complete with heavily optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple "get" and "set" methods, and once you're ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream takes just a single method call.
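As a rough sketch of what such a class offers, the hand-written Python below mimics the shape described here: get and set methods per field, plus one call to serialize and one to parse. It is not output from the Protocol Buffers compiler. The Person message, its field numbers, and the method names are hypothetical, and the encoding is a simplified tagged, variable-length scheme in the style of the Protocol Buffers wire format, chosen for illustration because this post does not describe the format itself.

```python
WIRE_VARINT = 0  # value is a variable-length integer
WIRE_BYTES = 2   # value is a length prefix followed by raw bytes


def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


class Person:
    """Hand-written stand-in for a generated class (hypothetical)."""

    def __init__(self):
        self._name = ""
        self._id = 0

    def get_name(self):
        return self._name

    def set_name(self, value):
        self._name = value

    def get_id(self):
        return self._id

    def set_id(self, value):
        self._id = value

    def serialize(self):
        """Write each non-default field as a (key, value) pair."""
        out = bytearray()
        if self._name:
            raw = self._name.encode("utf-8")
            out += encode_varint((1 << 3) | WIRE_BYTES)   # field 1
            out += encode_varint(len(raw))
            out += raw
        if self._id:
            out += encode_varint((2 << 3) | WIRE_VARINT)  # field 2
            out += encode_varint(self._id)
        return bytes(out)

    @classmethod
    def parse(cls, data):
        """Read (key, value) pairs, ignoring field numbers we don't know."""
        msg = cls()
        pos = 0
        while pos < len(data):
            key, pos = decode_varint(data, pos)
            field, wire = key >> 3, key & 7
            if wire == WIRE_VARINT:
                value, pos = decode_varint(data, pos)
            elif wire == WIRE_BYTES:
                length, pos = decode_varint(data, pos)
                value = data[pos:pos + length]
                pos += length
            else:
                raise ValueError("unsupported wire type in this sketch")
            if field == 1 and wire == WIRE_BYTES:
                msg._name = value.decode("utf-8")
            elif field == 2 and wire == WIRE_VARINT:
                msg._id = value
            # Any other field number is skipped, so data from a newer
            # writer can still be read.
        return msg


person = Person()
person.set_name("Ada")
person.set_id(150)
wire = person.serialize()   # 8 bytes in total
copy = Person.parse(wire)
```

Because every value travels with its field number, a reader can skip fields it does not recognize. Appending an unknown field 3 to the bytes above, for example, still parses into the same name and id.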
OK, I know what you're thinking: "Yet another IDL?" Yes, you could call it that. But IDLs in general have earned a reputation for being hopelessly complicated. By contrast, one of Protocol Buffers' major design goals is simplicity. By sticking to a simple lists-and-records model that solves the majority of problems and resisting the desire to chase diminishing returns, we believe we have created something that is powerful without being bloated. And, yes, it is very fast – at least an order of magnitude faster than XML.
And now, we're making Protocol Buffers available to the Open Source community. We have seen how effective a solution they can be for certain tasks, and we want more people to be able to take advantage of and build on this work. Take a look at the documentation, download the code, and let us know what you think.