I have added syntax highlighting to my blog using tree-sitter. Here are some notes about what I learned, with some complaining.
static site generator
I moved my blog to my own web site a few years ago. It is produced using a scruffy Rust program that converts a bunch of Markdown files to HTML using pulldown-cmark, and produces complete pages from Handlebars templates.
Why did I write another static site generator?
Well, partly as an exercise when learning Rust.
Partly because I wrote my own page templates, so I’m not going to benefit from a library of existing templates. On the contrary, it’s harder to create new templates that work with a general-purpose SSG than to write my own simpler site-specific SSG.
It’s miserable to write programs in template languages. My SSG can keep the logic in the templates to a minimum, and do all the fiddly stuff in Rust. (Which is not very fiddly, because my site doesn’t have complicated navigation – compared to the multilevel menus on www.dns.cam.ac.uk for instance.)
markdown ingestion
There are a few things to do to each Markdown file:
- split off and deserialize the YAML frontmatter
- find the <cut> or <toc> marker that indicates the end of the teaser / where the table of contents should be inserted
- augment headings with self-linking anchors (which are also used by the ToC)
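To make the first step concrete, here is a minimal sketch of splitting a page into frontmatter and body. It is an illustration rather than the code from my SSG: the function name `split_frontmatter` is made up, it assumes the frontmatter is delimited by lines containing only `---`, and it uses plain string searches instead of a regex. Deserializing the YAML is left out.

```rust
/// Split a page source into (YAML frontmatter, Markdown body).
/// Assumption: the frontmatter is delimited by lines containing only `---`.
/// Returns `None` for the frontmatter when the page does not start with one.
fn split_frontmatter(source: &str) -> (Option<&str>, &str) {
    // The opening delimiter must be the very first line.
    let Some(rest) = source.strip_prefix("---\n") else {
        return (None, source);
    };
    // An empty frontmatter block closes immediately.
    if let Some(body) = rest.strip_prefix("---\n") {
        return (Some(""), body);
    }
    // Otherwise look for the closing delimiter on a line of its own.
    match rest.find("\n---\n") {
        // Keep the newline that ends the last frontmatter line,
        // and skip the five bytes of "\n---\n" to reach the body.
        Some(end) => (Some(&rest[..end + 1]), &rest[end + 5..]),
        // No closing delimiter: treat the whole file as body.
        None => (None, source),
    }
}
```

The frontmatter half would then go to a YAML deserializer and the body half to the Markdown parser. A closing `---` with no newline after it is not handled by this sketch.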
Before this work I was using regexes to do all these jobs, because that allowed me to treat pulldown-cmark as a black box: Markdown in, HTML out.
But for syntax highlighting I had to be able to find fenced code blocks. It was time to put some code into the pipeline between pulldown-cmark’s parser and renderer.
And if I’m using a proper parser I can get rid of a few regexes: after some hacking, now only the YAML frontmatter is handled with a regex.
Sub-heading linkification and ToC construction are fiddly and more complicated than they were before. But they are also less buggy: markup in headings actually works now!
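For illustration, the anchor part of that job can be sketched roughly as follows. This is not the code from my munger: the names `slugify`, `unique_id` and `heading_html` are hypothetical, and the slug rules (lowercase, hyphens between words, numbered repeats) are an assumption about one reasonable way to do it.

```rust
use std::collections::HashMap;

/// Turn heading text into an anchor id: lowercase, with each run of
/// non-alphanumeric characters collapsed into a single hyphen.
fn slugify(heading: &str) -> String {
    let mut slug = String::new();
    for c in heading.chars() {
        if c.is_alphanumeric() {
            slug.extend(c.to_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    // Drop the hyphen left behind by trailing punctuation.
    slug.trim_end_matches('-').to_string()
}

/// Make ids unique within a page by numbering repeated headings.
fn unique_id(seen: &mut HashMap<String, usize>, heading: &str) -> String {
    let slug = slugify(heading);
    let count = seen.entry(slug.clone()).or_insert(0);
    *count += 1;
    let n = *count;
    if n == 1 { slug } else { format!("{slug}-{n}") }
}

/// Wrap already-rendered heading content in a self-linking anchor.
fn heading_html(level: u8, id: &str, inner_html: &str) -> String {
    format!("<h{level} id=\"{id}\"><a href=\"#{id}\">{inner_html}</a></h{level}>")
}
```

The same id can then be reused for the ToC entry, which is why the heading text and the rendered heading markup are kept separate here.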
Compared to the ToC, it’s fairly simple to detect code blocks and pass them through a highlighter.
You can look at my Markdown munger here. (I am not very happy with the way it uses state, but it works.)
highlighting
As well as the tree-sitter-highlight documentation I used femark as an example implementation. I encountered a few problems.
incompatible?!
I could not get the latest tree-sitter-highlight to work as described in its documentation. I thought the current tree-sitter crates were incompatible with each other! For a while I downgraded to an earlier version, but eventually I solved the problem. Where the docs say,
let javascript_language =
    tree_sitter_javascript::language();
They should say:
let javascript_language =
    tree_sitter::Language::new(
        tree_sitter_javascript::LANGUAGE
    );
highlight names
I was offended that tree-sitter-highlight seems to expect me to hardcode a list of highlight names, without explaining where they come from or what they mean. I was doubly offended that there’s an array of STANDARD_CAPTURE_NAMES but it isn’t exported, and doesn’t match the list in the docs. You mean I have to copy and paste it? Which one?!
There’s some discussion of highlight names in the tree-sitter manual’s “syntax highlighting” chapter, but that is aimed at people who are writing a tree-sitter grammar, not people who are using one.
Eventually I worked out that tree_sitter_javascript::HIGHLIGHT_QUERY in the tree-sitter-highlight example corresponds to the contents of a highlights.scm file. Each @name in highlights.scm is a highlight name that I might be interested in. In principle I guess different tree-sitter grammars should use similar highlight names in their highlights.scm files? (Only to a limited extent, it turns out.)
I decided the obviously correct list of highlight names is the list of every name defined in the HIGHLIGHT_QUERY. The query is just a string so I can throw a regex at it and build an array of the matches. This should make the highlighter produce <span> wrappers for as many tokens as possible in my code, which might be more than necessary but I don’t have to style them all.
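The idea can be sketched without any dependencies. The version below is an illustration, not my actual code: it scans for `@` by hand instead of using a regex, the function name `highlight_names` is made up, and it assumes capture names consist of ASCII letters, digits, `.`, `_` and `-`. It makes no attempt to skip an `@` that appears inside a string literal or comment in the query.

```rust
/// Collect every distinct `@name` capture from a highlight query string,
/// in order of first appearance.
fn highlight_names(query: &str) -> Vec<String> {
    let mut names: Vec<String> = Vec::new();
    let mut rest = query;
    while let Some(at) = rest.find('@') {
        let after = &rest[at + 1..];
        // A capture name runs until the first character that is not
        // an ASCII letter, digit, `.`, `_` or `-`.
        let len = after
            .find(|c: char| !(c.is_ascii_alphanumeric() || matches!(c, '.' | '_' | '-')))
            .unwrap_or(after.len());
        let name = &after[..len];
        // Keep each name once, preserving the order it was first seen.
        if !name.is_empty() && !names.iter().any(|n| n == name) {
            names.push(name.to_string());
        }
        rest = &after[len..];
    }
    names
}
```

Given a query containing `@comment`, `@keyword`, `@type` and `@type.builtin`, this returns those four names once each. The resulting list is the kind of thing that gets handed to the highlighter configuration in place of a hardcoded array.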
class names
The tree-sitter-highlight crate comes with a lightly-documented HtmlRenderer, which does much of the job fairly straightforwardly.
The fun part is the attribute_callback. When the HtmlRenderer is wrapping a token, it emits the start of a <span then expects the callback to append whatever HTML attributes it thinks might be appropriate.
Uh, I guess I want a class="..." here? Well, the highlight names work a little bit like class names: they have dot-separated parts which tree-sitter-highlight can match more or less specifically. (However, I am telling it to match all of them.) So I decided to turn each dot-separated highlight name into a space-separated class attribute.
The nice thing about this is that my Rust code doesn’t need to know anything about a language’s tree-sitter grammar or its highlight query. The grammar’s highlight names become CSS class names automatically.
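The name-to-class conversion is small enough to sketch on its own. This is an illustration under a few assumptions, not a copy of my wrapper: the helper name `push_class_attr` is hypothetical, the fixed `hilite` marker class is inferred from the CSS selectors used on this site, and the attribute is appended to a byte buffer because the callback is described as appending to the renderer's output.

```rust
/// Append a `class="..."` attribute for one highlight name to `out`.
/// The class list is a fixed `hilite` marker followed by each
/// dot-separated part of the name, so `type.builtin` becomes
/// `class="hilite type builtin"`.
fn push_class_attr(name: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(b"class=\"hilite");
    for part in name.split('.') {
        out.push(b' ');
        out.extend_from_slice(part.as_bytes());
    }
    out.push(b'"');
}
```

In a real callback the highlight arrives as an index into the list of highlight names, so the name would be looked up first and then passed to a helper like this. Whether a leading space is needed before `class` depends on exactly what the renderer has already written, which this sketch does not check.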
styling code
Now I can write some simple CSS to add some colours to my code. I can make type names green,
code span.hilite.type {
    color: #aca;
}
If I decide builtin types should be cyan like keywords I can write,
code span.hilite.type.builtin,
code span.hilite.keyword {
    color: #9cc;
}
results
You can look at my tree-sitter-highlight wrapper here. Getting it to work required a bit more creativity than I would have preferred, but it turned out OK. I can add support for a new language by adding a crate to Cargo.toml and a couple of lines to hilite.rs – and maybe some CSS if I have not yet covered its highlight names. (Like I just did to highlight the CSS above!)
future work
While writing this blog post I found myself complaining about things that I really ought to fix instead.
frontmatter
I might simplify the per-page source format knob so that I can use pulldown-cmark’s support for YAML frontmatter instead of a separate regex pass. This change will be easier if I can treat the HTML pages as Markdown without mangling them too much (is Markdown even supposed to be idempotent?). More tricky are a couple of special-case pages whose source is Handlebars instead of Markdown.
templates
I’m not entirely happy with Handlebars. It’s a more powerful language than I need – I chose Handlebars instead of Mustache because Handlebars works neatly with serde. But it has a dynamic type system that makes the templates more error-prone than I would like.
Perhaps I can find a more static Rust template system that takes advantage of the close coupling between my templates and the data structure that describes the web site. However, I like my templates to be primarily HTML with a sprinkling of insertions, not something weird that’s neither HTML nor Rust.
feed style
There’s no CSS in my Atom feed, so code blocks there will remain unstyled. I don’t know if feed readers accept <style> tags or if it has to be inline styles. (That would make a mess of my neat setup!)
highlight quality
I’m not entirely satisfied with the level of detail and consistency provided by the tree-sitter language grammars and highlight queries. For instance, in the CSS above the class names and property names have the same colour because the CSS highlights.scm gives them the same highlight name. The C grammar is good at identifying variables, but the Rust grammar is not.
Oh well, I guess it’s good enough for now. At least it doesn’t involve JavaScript.