Perfectionism-induced hiatus complete

In my previous writings (more than 2 years ago), I discussed building a simple blog generator. These posts described the basic ideas behind the blog generator I built for this (personal) blog.

I built my own software but did not use it. Why?

In my experience, root cause analysis often finds many factors. The reasons I haven't updated the blog in a couple of years are no exception.

First, my personal life became busier. I started a new job. I had a second child. I haven't been doing a lot of personal programming. These are all excuses - if I wanted to, I could have found time to make a new post.

The real reason I stopped updating the blog was that I wasn't happy with it. Writing new posts involved some friction: copying and pasting DSL snippets to add new paragraphs or headings. I wanted to make some changes to reduce the friction, and didn't want to write code for more posts until I'd made these changes. In my mind, new posts hinged on completing these updates first, so I just didn't write any.

What are priorities for a blog?

I think building the blog generator was a great experience. It may not have been a challenging pursuit, but it was enjoyable (even if it was a yak-shave). It gave me complete control over my site.

I had wanted to do major surgery on the blog generator to take the DSL in a new direction. As I thought about the mental burden of the changes, I realized that the benefits of building my own software were no longer outweighing the costs.

If I want the blog to be an effective communication tool, or just a personal record of my thoughts, the most important feature is ease of writing. My own blog was not easy (at least, not yet) to write for.

The benefits of building a blog generator had run their course; I know I can do it, but is building and maintaining this tool how I want to spend my time? The answer came back a resounding no.

Priorities come first

One important lesson I’ve been pondering lately is that priorities and judgement matter; it’s orders of magnitude faster to get on an effective path sooner than to move quickly on an ineffective one.

Prioritizing is discussed so often that it’s boring for me to mention it here, but it has been revelatory for me, including several moments of thinking “how could I have been so lost before?”

As a mid-career engineer, I’ve discovered that I can have an outsized impact by guiding efforts away from rabbit holes and pitfalls. For example, an early-career Adam would have spent many hours considering the dozens of blog generator tools available and the merits of each. Now, I’m able to first consider my priorities and criteria then quickly filter the available options to make a decision.

So, how does the blog work now?

Once I realized that I wanted to get out of the blog generator game, it became clear that I should just leverage an existing tool that makes it easy to write and publish content.

I considered many blogging tools before landing on quickblog. I've been enjoying Clojure for nearly a decade and quickblog is built on tools and a language that I already understand while fitting with my priorities.

Porting over the handful of previous posts and existing styles required a bit of work, but I expect that, in the long run, the investment will pay for itself. And, you're likely to read new content from me in the future.

A short segue on parenting with priorities

My daughter’s favorite reminder when we’re late is to “remember the story of the tortoise and the hare? Slow and steady wins!”. She’s quite wise for a 5-year-old! Even at my age, the rushing mindset is an easy trap to fall into. I think it’s actually quite rare that rushing is the optimal path; the downside of being late is often smaller than the risk of making mistakes by mindlessly rushing. My daughter may test my patience by being so frequently distracted, but does trying to rush her help? I’d venture “no”.

Published: 2023-02-01

Tagged: priorities quickblog personal blog

Parsing and grammars

Parsing is a common task for software systems. Most domain specific languages and every programming language require a parser to process their input before acting. Most bridges between two or more systems need to encode then parse the data passed between them.

I've probably written dozens of parsers over the years, of which I remember fewer than half. The following experience report and light introduction to the topics of parsing & grammars may lead to better decisions when building parsers.

So, we need to build a parser?

We've got some input. It's a string. The string has some structure which follows a recognizable format. We want to turn that string into data we can use. We need a parser.

There are two primary approaches (that I know of) to writing parsers: hand built, or a parser generator with a grammar.

In my experience, parsers begin hand built. The input syntax is simple or you just want to get it done quickly. You write a small regular expression. You add an iterative loop or recursion. Suddenly, you've got a hand built parser.

Hand built parser

You've got a string with a general syntax. You need code that finds the parts of every string matching the syntax and acts on them. You write code that finds matches then directly calls the action code.
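
Here's a minimal sketch of that shape in Clojure: a small regular expression, a loop, and the action applied directly to each match. The input format (key=value pairs separated by semicolons) and the names pair-pattern and parse-pairs are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input format: "key=value" pairs separated by semicolons.
(def pair-pattern #"([a-z]+)=([^;]*)")

(defn parse-pairs
  "Finds each key=value match and acts on it right away (here, by adding it to a map)."
  [input]
  (loop [matches (re-seq pair-pattern input)
         result  {}]
    (if-let [[_ k v] (first matches)]
      (recur (rest matches) (assoc result (keyword k) (str/trim v)))
      result)))

(parse-pairs "name=Adam; lang=clojure")
;; => {:name "Adam", :lang "clojure"}
```

Notice that finding tokens, applying rules and acting on the result are all tangled together in one function. That's fine at this size, and it's exactly what gets painful later.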

Hand built parsers can be fast. Being purpose built for the task, the code can be optimized for performance. Any abstraction would require more machine effort than a well chosen algorithm.

Time passes and after a couple updates or changes in syntax, the code gets messy. Each change brings an accumulating pain. You've got difficult-to-follow recursion or incomprehensible clauses in your switch/cond statement. You long for a better abstraction or easier debugging but you're vibing sunk cost fallacy and can't bear to toss this significant subsystem. If you muster enough courage or 20% time then you go for the full refactor but like an old back injury, the pain returns in time.

Breaking down the work

Whether explicit or not, hand built parsers perform 3 duties. First, they search the input for specific tokens. Often input languages are defined in mutually exclusive states. In the JavaScript programming language for example, some characters are invalid in identifiers but valid in strings.

Second, they parse the token stream into the rules for the domain specific language. In JavaScript, the var keyword must be followed by an identifier string.

Third, hand built parsers (often) act on the rules of the domain specific language.

Let's use this information to find a better abstraction. As Rich Hickey would say "let's decomplect it".

Lexer

A lexical analyzer (or lexer) scans the input and splits it into tokens. In a string, a token is a collection of characters (including a collection of size one). Tokens should carry meaning: the meaning that a parser needs in order to apply the rules of the domain specific language.

A lexer definition often looks like a set of regular expressions, each recognizing a specific character or sequence of characters. The lexer produces a series of tokens pulled from the input.
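
As a rough illustration, here is a tiny regex-driven lexer in Clojure. The token kinds and the names token-patterns, next-token and lex are assumptions for this sketch, not taken from Lex or any other tool.

```clojure
;; Hypothetical token kinds for a tiny arithmetic language.
;; Order matters: the first pattern that matches wins.
(def token-patterns
  [[:number #"\d+"]
   [:ident  #"[a-zA-Z_]\w*"]
   [:op     #"[-+*/=]"]
   [:lparen #"\("]
   [:rparen #"\)"]
   [:space  #"\s+"]])

(defn next-token
  "Returns [kind text] for the first pattern matching at the start of s, or nil."
  [s]
  (some (fn [[kind pattern]]
          (when-let [text (re-find (re-pattern (str "^(?:" pattern ")")) s)]
            [kind text]))
        token-patterns))

(defn lex
  "Splits input into a vector of [kind text] tokens, dropping whitespace."
  [input]
  (loop [s input, tokens []]
    (if (empty? s)
      tokens
      (if-let [[kind text] (next-token s)]
        (recur (subs s (count text))
               (if (= kind :space) tokens (conj tokens [kind text])))
        (throw (ex-info "Unexpected character" {:remaining s}))))))

(lex "x = (40 + 2)")
;; => [[:ident "x"] [:op "="] [:lparen "("] [:number "40"]
;;     [:op "+"] [:number "2"] [:rparen ")"]]
```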

A common example of a lexical analyzer generator is Lex. Interestingly, Lex was originally written in 1975 by Mike Lesk and Eric Schmidt (the future CEO of Novell & Google).

Parser

Using the rules of a language, a parser takes a stream of tokens and produces a tree. Most languages are recursive so a tree data structure makes it clear which tokens are composed within the body of others.
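
To make the token-stream-to-tree step concrete, here's a small recursive descent sketch in Clojure. It assumes tokens shaped like [kind text] pairs, and a made-up language of numbers, identifiers and parenthesized lists; parse-expr and parse-list are hypothetical names.

```clojure
(declare parse-expr)

(defn parse-list
  "Parses expressions until the matching :rparen.
  Returns [children remaining-tokens]."
  [tokens]
  (loop [tokens tokens, children []]
    (let [[kind] (first tokens)]
      (cond
        (nil? kind)      (throw (ex-info "Missing closing parenthesis" {}))
        (= kind :rparen) [children (rest tokens)]
        :else            (let [[node remaining] (parse-expr tokens)]
                           (recur remaining (conj children node)))))))

(defn parse-expr
  "Parses one expression from the front of tokens.
  Returns [node remaining-tokens]."
  [tokens]
  (let [[[kind text] & remaining] tokens]
    (case kind
      :lparen (let [[children remaining] (parse-list remaining)]
                [(into [:list] children) remaining])
      :number [[:number (Long/parseLong text)] remaining]
      :ident  [[:ident text] remaining]
      (throw (ex-info "Unexpected token" {:token [kind text]})))))

;; Tokens for the input "(add 1 (mul 2 3))"
(def tokens
  [[:lparen "("] [:ident "add"] [:number "1"]
   [:lparen "("] [:ident "mul"] [:number "2"] [:number "3"] [:rparen ")"]
   [:rparen ")"]])

(first (parse-expr tokens))
;; => [:list [:ident "add"] [:number 1] [:list [:ident "mul"] [:number 2] [:number 3]]]
```

The nesting of the result mirrors the nesting of the parentheses, which is the point of producing a tree rather than a flat list.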

Yacc is a commonly used parser generator, often paired with Lex. This is what my University computer science courses required (15 years ago).

Grammar

Grammars are an expressive language for describing rules of a domain specific language. You write a grammar then give it to a parser generator, which generates code for interpreting the input (usually a string).

Here's an example grammar for the common CSV (comma separated values) format. This grammar is defined in ANTLR 4 which combines both lexer and parser definitions in the same grammar.

csvFile: hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field
    : TEXT
    | STRING
    |
    ;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;

In ANTLR's language, a lexer rule identifier begins with an upper case letter and a parser rule identifier does not. TEXT and STRING are both lexer rules, which result in tokens. The field parser rule uses the tokens (including the inline ',' in the row rule) to build the higher level abstractions. In ANTLR rules that use alternatives (|), order matters; the field rule will prefer TEXT tokens over STRING tokens.

Ambiguous and unambiguous languages

There are languages that cannot be specified in a grammar, so beware but (in my experience) they are rare. More commonly, you're going to find languages that are ambiguous.

An ambiguous language can have more than one parser rule match a set of characters. For example, let's say you have a language with the following rules.

link: [[ STRING ]]
alias: [ STRING ]( STRING )
STRING: [a-zA-Z0-9 ]+

These two rules share the same left stop character. If a parser built from this grammar were given [[alias](target)] then it would be unable to determine which rule to follow. Likely, the parser would fail while trying to apply the link rule, not finding the ]] right stop characters.

There are ways to work around ambiguous rules, but it would be better to design the language to remove these ambiguities if possible. The best workaround I have discovered is to define each rule with optional characters to cover other ambiguous rules. From our previous example, you could add an optional [ like so.

link: [[ STRING ]]
alias: [? [ STRING ]( STRING )
STRING: [a-zA-Z0-9 ]+

The parser can remove the ambiguity by matching the left stop characters on both rules. Note that this is ANTLR 4 specific, but you may be able to find a similar solution in other grammar definition languages.
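
A loose analogue using Clojure regular expressions (not ANTLR) shows the effect of the optional character. The patterns are anchored with ^ to mimic a parser sitting at the start of the input; the names link, strict-alias and tolerant-alias are made up for this sketch.

```clojure
(def input "[[alias](target)]")

;; link: [[ STRING ]]
(def link           #"^\[\[([a-zA-Z0-9 ]+)\]\]")
;; alias: [ STRING ]( STRING )
(def strict-alias   #"^\[([a-zA-Z0-9 ]+)\]\(([a-zA-Z0-9 ]+)\)")
;; alias: [? [ STRING ]( STRING )
(def tolerant-alias #"^\[?\[([a-zA-Z0-9 ]+)\]\(([a-zA-Z0-9 ]+)\)")

(re-find link input)           ;; => nil (no ]] after "alias")
(re-find strict-alias input)   ;; => nil (second [ is not part of STRING)
(re-find tolerant-alias input) ;; => ["[[alias](target)" "alias" "target"]
```

Only the rule with the optional leading [ can consume the doubled bracket, so the input is no longer stuck between two rules that both fail.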

Further reading

I am a fan of ANTLR 4. I have found it to be powerful, easy to use, performant and well supported. A Clojure wrapper exists for its Java implementation. @aphyr even did some performance tests of it (specifically comparing it to Instaparse). If you want a deeper dive into using ANTLR then I'd recommend The Definitive ANTLR 4 Reference. There are plenty of helpful examples of ANTLR-based grammars for different languages available on GitHub.

Published: 2022-12-11

Tagged: dsl clojure blog

Static site generator

I built a static site generator. It's used to generate all the content on this site. Let me explain the unique environment which led to building a static site generator, then discuss some details about how it functions.

My environment

I like building things. Software things in particular. I like the malleability of software. I like that new ideas can significantly improve expressiveness, performance, features and developer productivity. I love the feeling of a clean and cohesive code base. I spend hours on a good refactor like I'm engrossed in a pressure washing video. I love the quick feedback loops that few other disciplines can provide.

I'm really into this software stuff.

I'm also very particular about the software I use. I've built a number of web sites with some of the currently (they seem to change so often!) popular static site generators. They make simple things complex or just don't work the way I want. It's hard to build good abstractions that work for everyone.

I understand HTML and CSS. Most static site generators are built to translate other markup languages (like Markdown) into web site assets. Other markup languages are great for some people, but I don't need that abstraction when I can speak the destination markup language.

I need a different type of tool. A tool that makes it easy to manage the complexity of code re-use. A tool that gives me access to the full expressiveness of the destination data format. A tool that can be composed into a larger system. A tool that is simple to understand.

Simple (even leaky) abstraction over HTML/CSS

Instead of choosing a friendlier markup language, let's talk about a structured data representation of HTML and CSS. Once we have structured data, we can simply translate it into HTML and CSS. To get started, let's focus on just HTML with inline CSS.

The Clojure programming language is my weapon of choice. Its data manipulation primitives make building domain specific languages relatively easy. Clojure has a popular domain specific language for representing HTML and CSS. Hiccup is a simple translation of HTML elements and CSS properties into collections/arrays and maps/objects. Here's what Hiccup looks like in a Clojure REPL:

user=> (html [:span {:class "foo"} "bar"])
"<span class=\"foo\">bar</span>"

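To show how little machinery that translation needs, here's a stripped-down renderer in plain Clojure. It is not Hiccup's implementation: it skips escaping, void elements and Hiccup's shorthand syntax, and the names render and render-attrs are made up for this sketch.

```clojure
(defn render-attrs
  "Turns an attribute map into a string like ` class=\"foo\"`."
  [attrs]
  (apply str (for [[k v] attrs]
               (str " " (name k) "=\"" v "\""))))

(defn render
  "Renders a Hiccup-style vector to an HTML string (no escaping)."
  [node]
  (if (vector? node)
    (let [[tag & more]     node
          [attrs children] (if (map? (first more))
                             [(first more) (rest more)]
                             [{} more])]
      (str "<" (name tag) (render-attrs attrs) ">"
           (apply str (map render children))
           "</" (name tag) ">"))
    (str node)))

(render [:span {:class "foo"} "bar"])
;; => "<span class=\"foo\">bar</span>"
```

Because the input is ordinary vectors and maps, every Clojure function that works on data also works on our markup.
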
Composition

My theory on building static sites is that we can build the most complex static web assets with simple composition. Composition should give us the option to abstract away HTML and CSS (if we want) and build re-usable components (like layouts or common heading styles). I'm confident that this theory will pan out as function composition is my primary tool for building any software in any language.

Clojure has built-in tools for reading EDN data, which employs the same syntax as Clojure data structures. Let's add the Aero library, which offers a set of tag literals to our EDN content. Aero also makes it easy to implement our own tag literals which allows us to add composition to our static site language. We could build everything discussed here simply using clojure.edn but we'd end up re-implementing some of the code that Aero includes.

Process

We're building static content for our website, so let's keep our structured data in static files. We can version static files with git and our tool can guarantee idempotency (the same data will always produce the same output).

Our tool should accept configuration for each of the site's assets, read Aero's tag literals, apply Hiccup rendering then produce the web site assets back to the filesystem.

Configuration and Content

Let's get into what structured data for an Aero/Hiccup HTML+CSS document might look like. Here's the asset configuration for the index page of this website.

{
    :type :html
    :slug "/index.html"
    :content #template ["pages/index.edn" [:content] {:path "/"}]
}

Let's talk about the #template ["pages/index.edn" [:content] {:path "/"}] section.

#template is a custom tag literal. Think of it like a function being called. It's very similar to Aero's built-in #include tag literal except that it adds additional data into its render context. #template will be the basis of composition from which we build our static content.

["pages/index.edn" [:content] {:path "/"}] are the three arguments to our function.

First is the path to the template's definition. We'll keep another file with the structured data for :slug "/index.html" at "pages/index.edn". Side note: "slug" may have been better named "output-path".

Second, a pull selection to filter the output. In a finished version of this tool, you might consider implementing the EDN Query Language but let's stick to a simple vector of keys that can be applied to Clojure's get-in. I would set a sane default (like [:content]) to provide some consistent structure to our template files.

Third, a map of input variables. The input is a map so that each template can apply a similar pull syntax for extracting the data it expects. You can include any data structure as input, so it's extremely flexible.
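
To give a feel for the three arguments, here's a rough sketch of a #template reader using only clojure.edn rather than Aero. It's a simplification under several assumptions: the files map stands in for the filesystem, the input is merged under a hypothetical :input key, and there is no #ref resolution. The names files, template and read-edn are all made up.

```clojure
(require '[clojure.edn :as edn])

;; Stand-in for the filesystem: path -> EDN source (hypothetical content).
(def files
  {"pages/index.edn" "{:content [:h1 \"Hello\"] :draft true}"})

(declare read-edn)

(defn template
  "Reader for the #template tag: loads `path`, adds `input` to the
  loaded data, then pulls `selection` out of it with get-in."
  [[path selection input]]
  (-> (read-edn (get files path))
      (assoc :input input)
      (get-in selection)))

(defn read-edn
  "Reads EDN source, expanding any #template tag literals along the way."
  [source]
  (edn/read-string {:readers {'template template}} source))

(read-edn
  "{:type :html
    :slug \"/index.html\"
    :content #template [\"pages/index.edn\" [:content] {:path \"/\"}]}")
;; => {:type :html, :slug "/index.html", :content [:h1 "Hello"]}
```

The selection is what keeps :draft out of the result: only the [:content] path of the loaded template makes it into the page.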

Let's take a look at a template definition.

{
  :color #include "../styles/color.edn"
  :content
    #template ["../components/layout.edn"
               [:content]
               {:title "home"
                :body #ref [:body]}]
  :body
    [:div
      {:style
        #css {:display :flex
              :flex-direction :column
              :justify-content :flex-start
              :align-items :flex-start}}
    [:div
      {:style
        #css {:font-size "50px"
              :font-weight 700
              :color #ref [:color :yellow]}}
    "Hi. I'm Adam Tait."]]
}

This is some of the template definition for this site's (adamtait.com) home page. Hopefully, you first notice the resemblance to Hiccup or HTML. We have a body element with a flexbox column layout and a single text element.

In the site configuration, we said we would be pulling [:content] from the evaluated data of this template, so the :content section is the output. :content renders a layout template which uses :body as input. :body is the heart of our template definition.
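
The #css tag literal isn't described in detail here, so treat the following as a guess at its simplest possible form: a function from a property map to an inline style string. The name css and the exact output format are assumptions.

```clojure
(require '[clojure.string :as str])

(defn css
  "Turns a map of CSS properties into an inline style string."
  [properties]
  (str/join ";" (for [[k v] properties]
                  (str (name k) ":" (if (keyword? v) (name v) v)))))

(css {:display :flex :flex-direction :column})
;; => "display:flex;flex-direction:column"

(css {:font-size "50px"})
;; => "font-size:50px"
```

Registered as a tag reader, a function like this would let the template keep CSS as data right up until the final HTML string is produced.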

Are we there yet?

Given what you've seen of our tool so far, you might extrapolate what a larger site may look like. You'll find ways of reducing complexity by adding sane defaults, refactoring out shared references and templates.

Rich Hickey has a talk titled Are We There Yet? where he talks about incidental complexity. Incidental complexity is hidden; it wasn't requested (or expected), it just comes along for the ride.

Seek simplicity, and distrust it (Alfred North Whitehead)

Our site and template definitions don't hide incidental complexity but they don't hide complexity either. The complexity is (mostly) laid bare. There's not much "magic" to this tool. You have full access to the base languages of the system (HTML and CSS) or you can abstract the Hiccup/HTML/CSS away in templates. You have the power to build your own tool. The tool you build is one that you deeply understand (you created most of it, after all) and that is well adapted to your specific use case.

I built this static site generator because I wanted an honest tool. I wanted to easily understand what data was available at each point. I wanted to build up my abstractions and organize my content in the most intuitive structure for me. Most people would consider this a poor abstraction because it's too raw. What we have built is a tool to build static site generators.

What's next?

As I grow this site, this as-yet-unnamed tool is also maturing. I may eventually open source it (and the code for this site) in its entirety. I'll post an update if it becomes available.

Published: 2020-11-29

Tagged: dsl clojure blog