The Seductive Trap of the Hand-Written Parser

Every developer, at some point in their career, falls into the same seductive trap. You have some string input—maybe a custom config file format, a DSL for layout, or a set of nested parenthetical commands. You need to parse it.

You think: "I don't need a heavy grammar engine. I'll just write a quick regular expression and a split loop. It'll take ten minutes."

And it does. And it works beautifully. You feel like a wizard.

But then, the requirements change. You need to support nested brackets. You need to handle escaped quotes. Suddenly, that "quick regex" turns into a recursive, state-holding, back-tracking monster. You spend your weekends stepping through 300 lines of nested if/else statements, muttering dark oaths under your breath. You are trapped by the sunk-cost fallacy, unwilling to throw away the subsystem you’ve spent weeks debugging.

A hand-written parser is a financial loan shark: it offers incredibly easy terms upfront, followed by ruinous interest rates later.

Decomplecting the Problem

If we want to build a better parser, we have to follow the classic Rich Hickey advice: let's decomplect it. We must untangle the three distinct duties that a parser implicitly performs:

  1. Lexing (The Tokenizer): Slicing raw characters into meaningful words, or "tokens." For example, in Javascript, the characters f, u, n, c, t, i, o, n are recognized collectively as a single keyword token.
  2. Parsing (The Structurer): Taking the flat stream of tokens and arranging them into a hierarchical tree (an Abstract Syntax Tree, or AST) that reflects the recursive nature of the language.
  3. Evaluating (The Actor): Walking the AST and executing the actual logic or translating it into another format.

Historically, computer scientists separated these concerns into specialized tools. In 1975, Mike Lesk and Eric Schmidt (who would later become the CEO of Google) wrote Lex, a lexical analyzer generator. It was typically paired with Yacc ("Yet Another Compiler-Compiler"), a parser generator.

For decades, CompSci students (including myself, fifteen years ago) were forced to learn these tools. It was dry, theoretical, and slightly painful. But the professors had a point: formal grammars are vastly superior to custom procedural parsing code.

The Modern Alternative: ANTLR 4

Today, we don't have to use Lex and Yacc. We have tools like ANTLR 4 (Another Tool for Language Recognition), which combines lexer and parser definitions into a single, clean grammar file.

Here is a simple ANTLR 4 grammar for the ubiquitous CSV (comma-separated values) format:

csvFile: hdr row+ ;
hdr    : row ;
row    : field (',' field)* ' '? '\r'? '\n' ;
field  : TEXT
       | STRING
       |
       ;

TEXT   : ~[, "\r\n]+ ;
STRING : '"' ('""'|~'"')* '"' ;

In ANTLR, lexer rules (which define tokens) begin with an uppercase letter (TEXT, STRING), while parser rules (which define structure) begin with a lowercase letter (csvFile, hdr, row, field).

Look at how declarative this is. You don't write loops or search algorithms. You simply describe what a CSV file is. The generator takes this description and outputs highly optimized parsing code in Java, Clojure, Python, or JavaScript.

The Ambiguity Cliff

The trickiest part of designing your own language is ambiguity. An ambiguous language is one where a single string of characters can match more than one parser rule.

Consider this naive grammar for links:

link: '[[' STRING ']]' ;
alias: '[' STRING '](' STRING ')' ;
STRING: [a-zA-Z0-9 ]+ ;

What happens if the parser encounters [[alias](target)]?

Because both rules start with the open bracket [, the parser has to guess which path to follow. Without sufficient lookahead, it may commit to the link rule, proceed to look for ]], fail to find it, and throw a syntax error.

To avoid this, you should design your language syntax to be structurally unambiguous from the start. If that isn't possible, you can resort to lookahead markers or design your parser rules to explicitly allow optional characters to resolve the conflict. For example, using lookahead checks like ANTLR’s semantic predicates to check what characters follow.

Further Reading

If you want to escape the cycle of hand-written parser grief, I highly recommend:

Published: 2022-12-11

Tagged: dsl clojure blog