I like software built with care , healthy teams and puzzle pieces that fit
Every developer, at some point in their career, falls into the same seductive trap. You have some string input—maybe a custom config file format, a DSL for layout, or a set of nested parenthetical commands. You need to parse it.
You think: "I don't need a heavy grammar engine. I'll just write a quick regular expression and a split loop. It'll take ten minutes."
And it does. And it works beautifully. You feel like a wizard.
But then, the requirements change. You need to support nested brackets. You need to handle escaped quotes. Suddenly, that "quick regex" turns into a recursive, state-holding, back-tracking monster. You spend your weekends stepping through 300 lines of nested if/else statements, muttering dark oaths under your breath. You are trapped by the sunk-cost fallacy, unwilling to throw away the subsystem you’ve spent weeks debugging.
A hand-written parser is a financial loan shark: it offers incredibly easy terms upfront, followed by ruinous interest rates later.
If we want to build a better parser, we have to follow the classic Rich Hickey advice: let's decomplect it. We must untangle the three distinct duties that a parser implicitly performs:
f, u, n, c, t, i, o, n are recognized collectively as a single keyword token.Historically, computer scientists separated these concerns into specialized tools. In 1975, Mike Lesk and Eric Schmidt (who would later become the CEO of Google) wrote Lex, a lexical analyzer generator. It was typically paired with Yacc ("Yet Another Compiler-Compiler"), a parser generator.
For decades, CompSci students (including myself, fifteen years ago) were forced to learn these tools. It was dry, theoretical, and slightly painful. But the professors had a point: formal grammars are vastly superior to custom procedural parsing code.
Today, we don't have to use Lex and Yacc. We have tools like ANTLR 4 (Another Tool for Language Recognition), which combines lexer and parser definitions into a single, clean grammar file.
Here is a simple ANTLR 4 grammar for the ubiquitous CSV (comma-separated values) format:
csvFile: hdr row+ ;
hdr : row ;
row : field (',' field)* ' '? '\r'? '\n' ;
field : TEXT
| STRING
|
;
TEXT : ~[, "\r\n]+ ;
STRING : '"' ('""'|~'"')* '"' ;
In ANTLR, lexer rules (which define tokens) begin with an uppercase letter (TEXT, STRING), while parser rules (which define structure) begin with a lowercase letter (csvFile, hdr, row, field).
Look at how declarative this is. You don't write loops or search algorithms. You simply describe what a CSV file is. The generator takes this description and outputs highly optimized parsing code in Java, Clojure, Python, or JavaScript.
The trickiest part of designing your own language is ambiguity. An ambiguous language is one where a single string of characters can match more than one parser rule.
Consider this naive grammar for links:
link: '[[' STRING ']]' ;
alias: '[' STRING '](' STRING ')' ;
STRING: [a-zA-Z0-9 ]+ ;
What happens if the parser encounters [[alias](target)]?
Because both rules start with the open bracket [, the parser has to guess which path to follow. Without sufficient lookahead, it may commit to the link rule, proceed to look for ]], fail to find it, and throw a syntax error.
To avoid this, you should design your language syntax to be structurally unambiguous from the start. If that isn't possible, you can resort to lookahead markers or design your parser rules to explicitly allow optional characters to resolve the conflict. For example, using lookahead checks like ANTLR’s semantic predicates to check what characters follow.
If you want to escape the cycle of hand-written parser grief, I highly recommend:
Published: 2022-12-11