Gumbo: A C library for parsing HTML

Tuesday, August 13, 2013

We're pleased to announce the open source release of the Gumbo HTML parser, a C implementation of the HTML5 parsing algorithm.

One of the big accomplishments of the HTML5 standard was to specify the HTML parsing algorithm, so that all browsers see the same HTML document in the same way. So far, most implementations of this algorithm have either been tied to specific browsers or rendering engines, or they've been written in specific scripting languages. This makes it hard to write quick one-off tools to manipulate and clean up HTML if you don't happen to be working in a language that already has an HTML5-compatible parsing library.

Gumbo seeks to provide a simple library that can serve as a basic building block for linters, refactoring tools, templating languages, page analysis, and other small programs that need to manipulate HTML. It's written in pure C for ease of interfacing with other languages, and has no outside dependencies. Gumbo was built from the start to track source locations, so that nodes in the parse tree can be correlated with positions in the original text.
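To give a feel for what correlating a node with a position in the original text means, here is a minimal sketch in plain C. It is not Gumbo's API: the `SourcePos` struct and the `pos_from_offset` and `find_pos` helpers are hypothetical names invented for this illustration, and a plain substring search stands in for a real parser. It only shows how a byte offset into the source maps to a line and column.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type for illustration only; not Gumbo's own struct. */
typedef struct {
    unsigned int line;    /* 1-based line number */
    unsigned int column;  /* 1-based column, counted in bytes */
    size_t offset;        /* 0-based byte offset into the text */
} SourcePos;

/* Walk the text up to `offset`, counting newlines to derive line and column. */
SourcePos pos_from_offset(const char *text, size_t offset) {
    assert(text != NULL);
    assert(offset <= strlen(text));

    SourcePos pos = {1, 1, offset};
    for (size_t i = 0; i < offset; ++i) {
        if (text[i] == '\n') {
            pos.line++;       /* a newline starts the next line... */
            pos.column = 1;   /* ...and resets the column */
        } else {
            pos.column++;
        }
    }
    return pos;
}

/* Find the first occurrence of `needle` (for example a start tag) and
   report where it sits in the text. A substring search is a stand-in
   for a real parser here. Returns 1 and fills *out on success, 0 if
   `needle` does not occur. */
int find_pos(const char *text, const char *needle, SourcePos *out) {
    const char *hit = strstr(text, needle);
    if (hit == NULL) {
        return 0;
    }
    *out = pos_from_offset(text, (size_t)(hit - text));
    return 1;
}
```

A linter or refactoring tool can use a position like this to point at the exact spot in the input that a tree node came from. A parser that records positions while it tokenizes avoids the rescan this sketch performs on every lookup.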

For more information, including download, installation, and usage instructions, please visit the Gumbo project page.

By Jonathan Tang, Search Features team