RE2: a principled approach to regular expression matching

Thursday, March 11, 2010

Regular expressions are one of computer science's shining examples of the benefits of good theory. They were originally developed by theorists as a way to describe infinite sets, but Ken Thompson introduced them to programmers as a way to describe text patterns in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. They remain a key feature of the open source landscape today, in those venerable Unix tools and at the core of newer languages like Perl, Python, and JavaScript.

The feature-rich regular expression implementations of today are based on a backtracking search with a potential for exponential run time and unbounded stack usage. At Google, we use regular expressions as part of the interface to many external and internal systems, including Code Search, Sawzall, and Bigtable. Those systems process large amounts of data; exponential run time would be a serious problem. On a more practical note, these are multithreaded C++ programs with fixed-size stacks: the unbounded stack usage in typical regular expression implementations leads to stack overflows and server crashes. To solve both problems, we've built a new regular expression engine, called RE2, which is based on automata theory and guarantees that searches complete in linear time with respect to the size of the input and in a fixed amount of stack space.
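
To make the contrast with backtracking concrete, here is a minimal sketch of the automata-style idea: instead of trying one path at a time and backing up on failure, the matcher tracks the set of all pattern positions it could be in and advances that set one input character at a time. This is not RE2's implementation. The function names (parse, closure, step, full_match) are hypothetical, and the tiny pattern language is an assumption made for illustration: literal characters and '.', each optionally followed by '*', '+', or '?', with no groups, alternation, or escapes.

```python
def parse(pattern):
    """Split a pattern into (char, quantifier) items; quantifier is '', '*', '+', or '?'."""
    items = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        quant = ''
        if i + 1 < len(pattern) and pattern[i + 1] in '*+?':
            quant = pattern[i + 1]
            i += 1
        items.append((ch, quant))
        i += 1
    return items


def closure(items, states):
    """Add every state reachable without consuming input (skipping optional items)."""
    result = set()
    stack = list(states)
    while stack:
        s = stack.pop()
        if s in result:
            continue
        result.add(s)
        # State s means "about to match item s"; '*' and '?' items may be skipped.
        if s < len(items) and items[s][1] in ('*', '?'):
            stack.append(s + 1)
    return result


def step(items, states, c):
    """Advance every live state over the input character c."""
    nxt = set()
    for s in states:
        if s == len(items):  # accepting state has no outgoing transitions
            continue
        ch, quant = items[s]
        if ch == '.' or ch == c:
            nxt.add(s + 1)           # move past this item
            if quant in ('*', '+'):  # repeating items may also match again
                nxt.add(s)
    return nxt


def full_match(pattern, text):
    """Return True if the whole text matches the pattern.

    Each input character is examined once and the state set never holds more
    than len(items) + 1 entries, so the work is O(len(pattern) * len(text)).
    There is no recursion, so stack use does not grow with the input.
    """
    items = parse(pattern)
    states = closure(items, {0})
    for c in text:
        states = closure(items, step(items, states, c))
        if not states:
            return False
    return len(items) in states


if __name__ == "__main__":
    n = 25
    # A pattern of the form a?a?...aa...a is a well-known worst case for backtracking.
    print(full_match("a?" * n + "a" * n, "a" * n))
```

A backtracking matcher may explore on the order of 2^n ways of choosing which of the optional items to use before it finds the match for a pattern like this one; the state-set simulation does a bounded amount of work per input character regardless.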

Today, we released RE2 as an open source project. It's a mostly drop-in replacement for PCRE's C++ bindings and is available under a BSD-style license. See the RE2 project page for details.