If you’ve worked on compilers, translators, or other tools that need to read a programming language, chances are you’ve spent many painful hours detailing that language’s grammar. Recently, we opened up the source code for Classp, a side-project a few of us have been working on that demonstrates it’s possible to have an automatic parser generator that is not based on formal grammars. Instead of grammar productions, you write classes similar to C++ or Java and you write class patterns to define the syntax. Although there are libraries like Boost.Spirit and Scala Parsers that give you a nice way to write a grammar in the programming language itself, in the end you are still writing a grammar. Even though Classp looks a lot like C++ or Java, it is not just a C-like way to write a grammar. It’s an entirely different way to specify syntax.
Grammars are great for diagramming a complex syntactic structure for human readers, but as a computer specification, they leave a lot to be desired. Four key problems with grammars inspired us to work on Classp as an alternative.
First, a grammar is intended to represent the actual syntactic structure of the language: all of the little details like what goes first and what goes second, where to put your commas and semicolons, where can you substitute one thing for another, etc... But this surface structure doesn’t really matter to the programmer who wants to process the language. It just gets in the way. What you really care about are the logical parts of the declaration: what is the type? What is the name? Is there an initializer and what is it?
Second, many common parser generators don’t actually specify any tree at all. They let the programmer write actions to build up a parse tree. But the actions in most systems tend to form an awkward fit with the grammar.
Next, when you write a grammar, you have to worry not just about the surface structure of the language, but also about how the language will be parsed. You have to write the grammar around ambiguities in the language and sometimes around other features. You can’t just write the rules as you would write them for a human reader:
Expr ::= Int | ( Expr ) | Expr + Expr | Expr - Expr | Expr * Expr | Expr / Expr
instead, you have to write them in a way that avoids ambiguity:
Expr ::= Expr + Term | Expr - Term;
Term ::= Term * Factor | Term / Factor
Factor ::= Int | ( Expr )
Finally, grammar-based parsers are extremely verbose. For serious parsing tasks, it is common to write a grammar, design a parse tree, write actions in the grammar to create the parse tree, design an abstract syntax tree, and write code to translate the parse tree into an abstract syntax tree. There are many dependencies among these parts that all have to be kept consistent over the life of the program. It’s a complex and error-prone process.
Classp attempts to avoid these problems. The abstract syntax tree is what programmers typically want to work with. With class patterns, you only have two jobs: design the abstract syntax tree and write a formatter for it. (A formatter is the function that writes out the abstract syntax tree in the target language.)
Here’s an example class declaration for an abstract syntax tree. The class pattern is the part inside the parentheses of the syntax statement: “arg1 '+' arg2”.
class Plus: Expression {
Expression arg1;
Expression arg2;
syntax(arg1 '+' arg2);
}
This class pattern says that to print a Plus node, you first print arg1, then you print a plus sign, then you print arg2. So it looks like a nice formatting language, but where do we specify the parser? The answer is that we don’t specify a parser; Classp will invert the formatter to generate a parser for us. Since formatters are typically much easier to write and maintain than parsers, it almost feels like magic.
Classp is still a work in progress. We still have to deal with ambiguity in languages, features that are only output in one way but may be input in several ways, and a few other issues. But it’s ready to play with now and we’d love to hear from others interested in this subject. To learn more, visit http://google.github.io/classp.
By David Gudeman, Classp team