.@ Tony Finch – blog


I have added syntax highlighting to my blog using tree-sitter. Here are some notes about what I learned, with some complaining.

static site generator

I moved my blog to my own web site a few years ago. It is produced using a scruffy Rust program that converts a bunch of Markdown files to HTML using pulldown-cmark, and produces complete pages from Handlebars templates.

Why did I write another static site generator?

Well, partly as an exercise when learning Rust.

Partly, since I wrote my own page templates, I’m not going to benefit from a library of existing templates. On the contrary, it’s harder to create new templates that work with a general-purpose SSG than write my own simpler site-specific SSG.

It’s miserable to write programs in template languages. My SSG can keep the logic in the templates to a minimum, and do all the fiddly stuff in Rust. (Which is not very fiddly, because my site doesn’t have complicated navigation – compared to the multilevel menus on www.dns.cam.ac.uk for instance.)

markdown ingestion

There are a few things to do to each Markdown file:

Before this work I was using regexes to do all these jobs, because that allowed me to treat pulldown-cmark as a black box: Markdown in, HTML out.

But for syntax highlighting I had to be able to find fenced code blocks. It was time to put some code into the pipeline between pulldown-cmark’s parser and renderer.

And if I’m using a proper parser I can get rid of a few regexes: after some hacking, now only the YAML frontmatter is handled with a regex.

Sub-heading linkification and ToC construction are fiddly and more complicated than they were before. But they are also less buggy: markup in headings actually works now!

Compared to the ToC, it’s fairly simple to detect code blocks and pass them through a highlighter.

You can look at my Markdown munger here. (I am not very happy with the way it uses state, but it works.)

highlighting

As well as the tree-sitter-highlight documentation I used femark as an example implementation. I encountered a few problems.

incompatible?!

I could not get the latest tree-sitter-highlight to work as described in its documentation. I thought the current tree-sitter crates were incompatible with each other! For a while I downgraded to an earlier version, but eventually I solved the problem. Where the docs say,

    let javascript_language =
        tree_sitter_javascript::language();

They should say:

    let javascript_language =
        tree_sitter::Language::new(
            tree_sitter_javascript::LANGUAGE
        );

highlight names

I was offended that tree-sitter-highlight seems to expect me to hardcode a list of highlight names, without explaining where they come from or what they mean. I was doubly offended that there’s an array of STANDARD_CAPTURE_NAMES but it isn’t exported, and doesn’t match the list in the docs. You mean I have to copy and paste it? Which one?!

There’s some discussion of highlight names in the tree-sitter manual’s “syntax highlighting” chapter, but that is aimed at people who are writing a tree-sitter grammar, not people who are using one.

Eventually I worked out that tree_sitter_javascript::HIGHLIGHT_QUERY in the tree-sitter-highlight example corresponds to the contents of a highlights.scm file. Each @name in highlights.scm is a highlight name that I might be interested in. In principle I guess different tree-sitter grammars should use similar highlight names in their highlights.scm files? (Only to a limited extent, it turns out.)

I decided the obviously correct list of highlight names is the list of every name defined in the HIGHLIGHT_QUERY. The query is just a string so I can throw a regex at it and build an array of the matches.

This should make the highlighter produce <span> wrappers for as many tokens as possible in my code, which might be more than necessary but I don’t have to style them all.

class names

The tree-sitter-highlight crate comes with a lightly-documented HtmlRenderer, which does much of the job fairly straightforwardly.

The fun part is the attribute_callback. When the HtmlRenderer is wrapping a token, it emits the start of a <span then expects the callback to append whatever HTML attributes it thinks might be appropriate.

Uh, I guess I want a class="..." here? Well, the highlight names work a little bit like class names: they have dot-separated parts which tree-sitter-highlight can match more or less specifically. (However I am telling it to match all of them.) So I decided to turn each dot-separated highlight name into a space-separated class attribute.

The nice thing about this is that my Rust code doesn’t need to know anything about a language’s tree-sitter grammar or its highlight query. The grammar’s highlight names become CSS class names automatically.

styling code

Now I can write some simple CSS to add some colours to my code. I can make type names green,

    code span.hilite.type {
        color: #aca;
    }

If I decide builtin types should be cyan like keywords I can write,

    code span.hilite.type.builtin,
    code span.hilite.keyword {
        color: #9cc;
    }

results

You can look at my tree-sitter-highlight wrapper here.

Getting it to work required a bit more creativity than I would have preferred, but it turned out OK. I can add support for a new language by adding a crate to Cargo.toml and a couple of lines to hilite.rs – and maybe some CSS if I have not yet covered its highlight names. (Like I just did to highlight the CSS above!)

future work

While writing this blog post I found myself complaining about things that I really ought to fix instead.

frontmatter

I might simplify the per-page source format knob so that I can use pulldown-cmark’s support for YAML frontmatter instead of a separate regex pass. This change will be easier if I can treat the html pages as Markdown without mangling them too much (is Markdown even supposed to be idempotent?). More tricky are a couple of special case pages whose source is Handlebars instead of Markdown.

templates

I’m not entirely happy with Handlebars. It’s a more powerful language than I need – I chose Handlebars instead of Mustache because Handlebars works neatly with serde. But it has a dynamic type system that makes the templates more error-prone than I would like.

Perhaps I can find a more static Rust template system that takes advantage of the close coupling between my templates and the data structure that describes the web site. However, I like my templates to be primarily HTML with a sprinkling of insertions, not something weird that’s neither HTML nor Rust.

feed style

There’s no CSS in my Atom feed, so code blocks there will remain unstyled. I don’t know if feed readers accept <style> tags or if it has to be inline styles. (That would make a mess of my neat setup!)

highlight quality

I’m not entirely satisfied with the level of detail and consistency provided by the tree-sitter language grammars and highlight queries. For instance, in the CSS above the class names and property names have the same colour because the CSS highlights.scm gives them the same highlight name. The C grammar is good at identifying variables, but the Rust grammar is not.

Oh well, I guess it’s good enough for now. At least it doesn’t involve Javascript.