Parsing is a common task for software systems. Most domain specific languages and every programming language require a parser to process their input before acting. Most bridges between two or more systems need to encode then parse the data passed between them.
I've probably written dozens of parsers over the years, of which I remember less than half. The following experience report and light introduction to the topics of parsing & grammars may lead to better decisions when building parsers.
We've got some input. It's a string. The string has some structure which follows a recognizable format. We want to turn that string into data we can use. We need a parser.
There are two primary approaches (that I know of) to writing parsers: hand built, or generated from a grammar by a parser generator.
In my experience, parsers begin hand built. The input syntax is simple or you just want to get it done quickly. You write a small regular expression. You add an iterative loop or recursion. Suddenly, you've got a hand built parser.
You've got a string with a general syntax. You need code that finds the parts of every string matching the syntax and acts on them. You write code that finds matches then directly calls the action code.
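As a rough sketch of that find-match-then-act pattern, here's a minimal hand-built parser for a hypothetical `key=value` syntax (the syntax and names are invented for illustration):

```python
import re

# One regular expression recognizes the syntax; the loop finds every
# match and calls the "action" code directly -- here, storing the pair.
PAIR = re.compile(r"(\w+)\s*=\s*(\w+)")

def parse_pairs(text):
    """Find every part of the string matching the syntax and act on it."""
    result = {}
    for match in PAIR.finditer(text):
        key, value = match.group(1), match.group(2)
        result[key] = value  # the action code, called inline
    return result
```

A regex, a loop, and inline actions: that's often all a hand-built parser starts as.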
Hand built parsers can be fast. Being purpose built for the task, the code can be optimized for performance. Any abstraction would require more machine effort than a well chosen algorithm.
Time passes and after a couple updates or changes in syntax, the code gets messy. Each change brings an accumulating pain. You've got difficult-to-follow recursion or incomprehensible clauses in your switch/cond statement. You long for a better abstraction or easier debugging but you're vibing sunk cost fallacy and can't bear to toss this significant subsystem. If you muster enough courage or 20% time then you go for the full refactor but like an old back injury, the pain returns in time.
Finally, hand built parsers (often) act directly on the rules of the domain specific language.
Let's use this information to find a better abstraction. As Rich Hickey would say, "let's decomplect it".
A lexical analyzer (or lexer) scans the input and splits it into tokens. In a string, a token is a sequence of one or more characters. Tokens should carry meaning: the meaning a parser needs in order to apply the rules of the domain specific language.
A lexer definition often looks like a set of regular expressions, each recognizing a specific character or sequence of characters. The lexer produces a series of tokens pulled from the input.
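A minimal sketch of that idea in Python, with invented token names, one regular expression per token type:

```python
import re

# Each token type is a regular expression; the lexer scans the input
# and yields (type, text) tokens, skipping whitespace.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
```

The output is exactly the "series of tokens pulled from the input" that a parser would consume next.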
A common example of a lexical analyzer generator is Lex. Interestingly, Lex was originally written in 1975 by Mike Lesk and Eric Schmidt (the future CEO of Novell & Google).
Using the rules of a language, a parser takes a stream of tokens and produces a tree. Most languages are recursive so a tree data structure makes it clear which tokens are composed within the body of others.
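To make the tokens-to-tree step concrete, here's a small sketch (the token shapes and rules are invented for illustration): it parses parenthesized lists of words, e.g. `(a (b c))`, into nested Python lists, using recursion to mirror the recursion in the language.

```python
# Turn a flat token stream into a tree. Nested parentheses in the input
# become nested lists in the output, so composition is explicit.
def parse(tokens):
    tokens = iter(tokens)
    def expr(tok):
        if tok == "(":
            node = []
            for tok in tokens:
                if tok == ")":
                    return node
                node.append(expr(tok))
            raise SyntaxError("missing ')'")
        return tok  # a leaf token
    return expr(next(tokens))
```

Given the token stream `["(", "a", "(", "b", "c", ")", ")"]`, this produces `["a", ["b", "c"]]`: the tree makes it clear which tokens are composed within the body of others.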
Yacc is a commonly used parser generator, often paired with Lex. This is what my University computer science courses required (15 years ago).
Grammars are an expressive language for describing the rules of a domain specific language. You write a grammar then give it to a parser generator, which generates code for interpreting the input (usually a string).
Here's an example grammar for the common CSV (comma separated values) format. This grammar is defined in ANTLR 4 which combines both lexer and parser definitions in the same grammar.
```
csvFile : hdr row+ ;
hdr     : row ;
row     : field (',' field)* '\r'? '\n' ;
field   : TEXT | STRING | ;
TEXT    : ~[,\n\r"]+ ;
STRING  : '"' ('""'|~'"')* '"' ;
```
In ANTLR's grammar language, a lexer rule identifier begins with an upper case letter and a parser rule identifier does not.
`TEXT` and `STRING` are both lexer rules which result in tokens. The `field` parser rule uses those tokens (including the inline `','` in the `row` rule) to build the higher level abstractions. In ANTLR rules that use alternatives (`|`), order matters; the `field` rule will prefer `TEXT` tokens over `STRING` tokens.
There are languages that cannot be specified in a grammar, so beware, but (in my experience) they are rare. More commonly, you're going to find languages that are ambiguous.
An ambiguous language can have more than one parser rule match a set of characters. For example, let's say you have a language with the following rules.
```
link: [[ STRING ]]
alias: [ STRING ]( STRING )
STRING: [a-zA-Z0-9 ]+
```
These two rules share the same left stop character. If a parser were to parse `[[alias](target)]`, it would be unable to determine which rule to follow. Likely, the parser would fail trying to apply the `link` rule but not finding the `]]` right stop characters.
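A small sketch of the failure mode, using Python regexes that mirror the hypothetical `link` and `alias` rules (a naive parser that tries each rule in order from the start of the input):

```python
import re

# Both rules begin with '[': on "[[alias](target)]" the link rule sees
# '[[' but never finds ']]', and the alias rule is blocked by the second '['.
LINK  = re.compile(r"\[\[([a-zA-Z0-9 ]+)\]\]")
ALIAS = re.compile(r"\[([a-zA-Z0-9 ]+)\]\(([a-zA-Z0-9 ]+)\)")

def first_rule_matching(text):
    """Try the rules in order, as a naive parser might."""
    for name, rule in (("link", LINK), ("alias", ALIAS)):
        if rule.match(text):
            return name
    return None
```

Plain `[[a link]]` and `[alias](target)` each match one rule cleanly, but the ambiguous `[[alias](target)]` matches neither from the start.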
There are ways to work around ambiguous rules, but it would be better to design the language to remove these ambiguities if possible. The best work around I have discovered is to define each rule with optional characters to cover the other ambiguous rules. From our previous example, you could add an optional `[` like so.
```
link: [[ STRING ]]
alias: [? [ STRING ]( STRING )
STRING: [a-zA-Z0-9 ]+
```
The parser can remove the ambiguity by matching the left stop characters on both rules. Note that this is ANTLR 4 specific, but you may be able to find a similar solution in other grammar definition languages.
I am a fan of ANTLR 4. I have found it to be powerful, easy to use, performant, and well supported. A Clojure wrapper exists for its Java implementation. @aphyr even did some performance tests of it (specifically comparing it to Instaparse). If you want a deeper dive into using ANTLR, I'd recommend The Definitive ANTLR 4 Reference. There are plenty of helpful examples of ANTLR-based grammars for different languages available on GitHub.