2020-09-17 – Generalized string literal syntax, 10 years later

In 2010 I wrote down my idea for generalized string literals.

A lot has changed in programming languages since then: Go, Kotlin, Rust, and Swift have become important and influential; C++ and JavaScript have been significantly revamped. String literals in many languages are a lot more complicated than they were 10 years ago.

My design has evolved a little bit since my old description, and recently it has been agitating me for another write-up and a comparison with non-fantasy literal syntaxes.

context

In polyglot programming we often need to quote some code in another language. Even when we are coding in “one” language, we will use a bunch of microlanguages embedded in quotes.

There are a few difficulties when we’re quoting a big language, such as SQL, HTML, or shell commands:

Regular expressions sit somewhere between microlanguages and full languages: they clash so badly with string escapes that regexes often get their own special quoting syntax; they are crying out for doing more work at compile time; and they are complicated enough to benefit from having their own syntax highlighting.

requirements

We should be able to just paste some code in a foreign language, and add quote marks without having to alter the quoted code. Therefore:

With escaping and fixed delimiters, we can’t mention the escape character or the delimiter in quoted code without having to alter it.

language independence

A lexer should be able to find the start and end of the quote without knowing anything about the quoted language.

A lexer should be able to find the start and end of an interpolation without knowing anything about the quoted language or the surrounding / interpolated language.

The quoted language should be identified simply and explicitly, so that it’s easy for editors and static analysis tools to know what is quoted.

my generalized literals

There are four forms: flat or nested, short or long:

    $tag"literal"

    $tag{literal${interpolation}literal}

    $$tag"""
        literal
        """

    $$tag{{{
        literal${{{interpolation}}}literal
        }}}

mark

The $ or $$ marks introduce a short or long literal respectively. They are the only fixed part of the syntax.

tag

The tag identifies the quoted language. It is an identifier from a special namespace, so that tags can be short without clashing with other identifiers. A tag might imply some special compile-time or run-time handling of the quoted string. For example, fmt, q, re, sql, time.

delimiters

Short delimiters are one character; long delimiters can be one or more characters.

Nested delimiters can use any kind of brackets. There’s an official list of Unicode bidi paired brackets which specifies how they pair with each other. To get the closing delimiter, reverse the opening delimiter, and swap each character for its pair.

Flat delimiters can be any non-bracket punctuation. The closing delimiter is the same as the opening delimiter.
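
As a rough illustration of the two delimiter rules, here is a minimal Python sketch. The PAIRS table and the closing_delimiter name are hypothetical, and the table only covers three ASCII bracket pairs; a real implementation would presumably load the full Unicode bidi paired brackets list.

```python
# Hypothetical pair table: three ASCII bracket pairs only. The design
# described above would use the Unicode bidi paired brackets list.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def closing_delimiter(opening: str) -> str:
    """Return the closing delimiter for a nested or flat opening delimiter."""
    if not opening:
        raise ValueError("empty delimiter")
    if all(ch in PAIRS for ch in opening):
        # Nested: reverse the opening delimiter and swap each bracket.
        return "".join(PAIRS[ch] for ch in reversed(opening))
    if any(ch in PAIRS or ch in PAIRS.values() for ch in opening):
        raise ValueError("delimiter mixes brackets with other punctuation")
    # Flat: the closing delimiter is the same as the opening delimiter.
    return opening
```

So an opening ([{ would close with }]), while a flat """ closes with itself.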

flat

A short flat literal ends at the first occurrence of the delimiter after the opening delimiter.

A long flat literal ends at the first occurrence of the delimiter on a line by itself. (Leading and trailing whitespace are allowed.)

Flat literals do not support interpolation.
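
The two termination rules for flat literals are simple enough to sketch in a few lines of Python. The function names are made up for illustration, and the short form here does not check for vertical whitespace.

```python
def end_of_short_flat(text: str, start: int, delim: str) -> int:
    """Index of the closing delimiter of a short flat literal.

    `start` is the index just after the opening delimiter.
    """
    end = text.find(delim, start)
    if end == -1:
        raise SyntaxError("unterminated short literal")
    return end


def end_of_long_flat(lines: list, delim: str) -> int:
    """Index of the first line that holds only the delimiter.

    `lines` are the source lines following the opening delimiter.
    Leading and trailing whitespace around the delimiter is allowed.
    """
    for i, line in enumerate(lines):
        if line.strip() == delim:
            return i
    raise SyntaxError("unterminated long literal")
```

Note that the long form ignores delimiters that share a line with other text, which is what lets the quoted code mention the delimiter freely.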

nested

Nested literals end at the matching close delimiter. They can contain arbitrary matched nestings of the open and close delimiters.

Interpolations use the same delimiters as the surrounding literal. They are marked with a single $ for long as well as short literals.

Interpolations do not affect how delimiters are matched. In particular, if an interpolation contains a string or generalized literal, that has no effect on how the outer literal’s delimiters are matched.
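
Matching nested delimiters amounts to counting depth. The following Python sketch uses a hypothetical end_of_nested helper and assumes that only whole occurrences of the chosen delimiters count, so a lone { inside a {{{ literal is ignored. It deliberately knows nothing about interpolations or strings inside them.

```python
def end_of_nested(text: str, start: int, open_delim: str, close_delim: str) -> int:
    """Index of the close delimiter that matches an already-consumed opener.

    `start` is the index just after the opening delimiter.
    """
    depth = 1
    i = start
    while i < len(text):
        if text.startswith(open_delim, i):
            depth += 1
            i += len(open_delim)
        elif text.startswith(close_delim, i):
            depth -= 1
            if depth == 0:
                return i
            i += len(close_delim)
        else:
            i += 1
    raise SyntaxError("unterminated nested literal")
```

Because the scan ignores the contents of interpolations, a close bracket inside a string inside an interpolation still counts towards the matching. The fix in that case is to choose different delimiters.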

whitespace

Short literals must not contain vertical whitespace. This is to make error recovery easier, and to limit the amount of syntax highlighting churn caused by incomplete literals.

Long literals can be indented. Every line of the literal must start with the same whitespace as the closing delimiter, and this indentation is removed from the resulting string. (A nested long literal is not indented if there is any non-whitespace on the line before the closing delimiter.)

In a long literal, the newline just after the opening delimiter and the newline just before the closing delimiter (if they exist) are not included in the resulting string.
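
The un-indenting rules for long literals could look something like this Python sketch. The unindent_long name is hypothetical, and the function takes the raw text between the opening and closing delimiters, so the indentation of the closing delimiter is whatever follows the last newline.

```python
def unindent_long(body: str) -> str:
    """Strip the delimiter newlines and the indentation from a long literal.

    `body` is the raw text between the opening and closing delimiters.
    """
    # Drop the newline just after the opening delimiter, if present.
    if body.startswith("\n"):
        body = body[1:]
    head, sep, last = body.rpartition("\n")
    if last.strip():
        # Non-whitespace before the closing delimiter: not indented.
        return body
    if not sep:
        # Nothing but the closing delimiter's own indentation.
        return ""
    indent = last
    lines = head.split("\n")
    for line in lines:
        if not line.startswith(indent):
            raise SyntaxError("line is indented less than the closing delimiter")
    # Remove the indentation; the newline before the closing delimiter
    # was already dropped by rpartition.
    return "\n".join(line[len(indent):] for line in lines)
```

Indentation beyond that of the closing delimiter is kept, so relative indentation inside the quoted code survives.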

escaping

There is no way to escape delimiters - choose different ones that don’t clash instead.

In general, control character escapes and suchlike are a feature of the quoted language, not the generic syntax.

C++

In C++ you can define operator""tag(arg) to define tag as a user-defined literal suffix. The argument type depends on the kind of literal - numbers, characters, strings of various types. String literals look like

    "literal"tag

    R"delimiter(literal)delimiter"tag

The first includes traditional escapes; the second is raw.

There’s no interpolation, and indentation is not removed from raw strings.

C#

String literals can be normal, verbatim, interpolated, or both.

    @"literal"

    $"literal{interpolated}literal"

    $@"literal{interpolated}literal"

Verbatim literals are not completely raw: you can include a " by doubling it "".

In an interpolation, ASCII brackets (){}[] must be balanced. You can escape { and } by doubling them, {{ and }}.

There are no custom delimiters, and indentation is not removed.

D

All strings are multiline in D. Raw strings can be like:

    r"literal"

    `literal`

    q"{literal}"

    q"/literal/"

    q"HERE
    literal
    HERE"

In the third form you can use matching ASCII brackets (){}[]<>. There’s no interpolation, and indentation is not removed.

Golang

Go has traditional escaped "" strings and raw strings delimited by backticks. There are no interpolations or custom delimiters, and indentation is not removed.

JavaScript

Template literals look like

    tag`literal${interpolation}literal`

The tag is a function (with no special namespacing) that is passed a list of strings and interpolations. It can use the strings raw, or after escape sequences have been interpreted.

The interpolation has to be properly parsed as JavaScript to correctly find the end of the template literal. In particular, it can contain nested template literals.

There are no custom delimiters and indentation is not removed.

Kotlin

String literals can be short and escaped or long and raw. Both of them support interpolation.

    "literal${interpolated}literal"

    "literal $interpolated literal"

    """
    literal${interpolated}literal
    """

You can omit the curly brackets when the interpolated expression is just a variable name. You need to use the circumlocution ${'$'} to include a $ in a raw string.

Indentation is not removed, but it’s idiomatic to use the trimMargin() method to achieve a similar effect.

There are no custom delimiters.

OCaml

There are quoted strings with escapes and raw strings with custom delimiters:

    "literal"

    {delimiter|literal|delimiter}

There is no interpolation and indentation is not removed.

Perl

A very elaborate literal syntax.

    'literal'

    "literal${interpolated}literal"

    tag"literal${interpolated}literal"

    tag{literal${interpolated}literal}

There is a fixed set of tags: single quotes desugar to q{}, double quotes desugar to qq{}, backquotes desugar to qx{}, regexes desugar to qr{}, etc. Delimiters can be matching ASCII brackets (){}[]<> or punctuation.

Unlike interpolation in most other languages that support it, in Perl you can only interpolate an lvalue - a variable or array element or hash member, etc. (There’s a circumlocution for interpolating arbitrary expressions.)

    <<'HERE'
    literal
    HERE

    <<"HERE"
    literal${interpolated}literal
    HERE

    <<~HERE
        literal${interpolated}literal
        HERE

Here documents can be raw or interpolated, depending on how the delimiter is quoted (interpolated is the default), and in recent versions indentation is removed if you use the ~ flag.

PHP

In PHP you can interpolate a simple expression denoting a variable (including array elements and object properties) without curly brackets. Curlies after $ just delimit a variable name. The expression inside outer curlies {$var} is for more complicated variable denotations, not arbitrary expressions.

    'literal'

    "literal $interpolated literal"

    "literal${interpolated}literal"

    "literal{$interpolated}literal"

    <<<'HERE'
    literal
    HERE

    <<<HERE
    literal{$interpolated}literal
    HERE

Indentation is not removed from here documents.

Python

Strings can be long or short, and may be raw and/or formatted.

    "literal"

    """
    literal
    """

    r"literal"

    f"literal{interpolated=!:}literal"

Bizarrely for an indentation-oriented language, indentation is not removed from long strings.

In a formatted string, you can escape { and } by doubling them, {{ and }}. The interpolated expression can be followed by various options indicated by punctuation marks.

There are no custom delimiters.
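
For a concrete taste of these points, here is a small example. It uses textwrap.dedent from the standard library, which is the usual workaround for the missing un-indenting, and the = option, which needs Python 3.8 or later. The table variable and the query are invented for illustration.

```python
import textwrap

table = "users"

# A long string keeps the source indentation, so dedent it explicitly.
# Doubled braces {{}} produce literal braces in a formatted string.
query = textwrap.dedent(f"""\
    SELECT *
      FROM {table}
     WHERE tags = '{{}}'
    """)

# Options after the expression: = echoes the expression text,
# !r applies repr(), and : introduces a format specification.
debug = f"{table=}"
padded = f"{table!r:>9}"
```

The backslash after the opening quotes suppresses the first newline. The interpolation happens before dedent runs, so a multi-line interpolated value can upset the indentation.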

Ruby

There’s a strong taint of Perl in Ruby, though Ruby uses a % mark where Perl tends to start tags with q.

    'literal'

    "literal#{interpolated}literal"

    %tag"literal#{interpolated}literal"

    %tag{literal#{interpolated}literal}

There is a fixed set of tags: single quotes desugar to %q{}, double quotes desugar to %Q{}, backquotes desugar to %x{}, regexes desugar to %r{}, etc. Delimiters can be matching ASCII brackets (){}[]<> or punctuation.

    <<'HERE'
    literal
    HERE

    <<"HERE"
    literal#{interpolated}literal
    HERE

    <<~HERE
        literal#{interpolated}literal
        HERE

Here documents can be raw or interpolated, depending on how the delimiter is quoted (interpolated is the default), and indentation is removed if you use the ~ flag. (The older - flag only allows the closing delimiter to be indented.)

Rust

In Rust, raw strings look like

    r####"literal"####

You can also add a suffix to a string literal, but this only has meaning in the context of a macro call.

You can choose the number of # marks in the delimiters. There is no interpolation, and indentation is not removed from raw strings.
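
Choosing the number of # marks can be automated. This Python sketch (the rust_raw_literal name is made up) picks the fewest marks that keep the content from ending the literal early. It does not check other restrictions on raw strings.

```python
import re


def rust_raw_literal(content: str) -> str:
    """Quote `content` as a Rust raw string using the fewest # marks."""
    # The literal would end early at a " followed by n or more # marks,
    # so n must exceed the longest run of # that follows any " in content.
    runs = [len(m.group(1)) for m in re.finditer(r'"(#*)', content)]
    n = max(runs) + 1 if runs else 0
    hashes = "#" * n
    return f'r{hashes}"{content}"{hashes}'
```

Content with no double quotes needs no # marks at all, and each extra # following a quote in the content costs one more in the delimiters.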

Scala

Scala has an interpolation mechanism somewhat similar to JavaScript template literals:

    tag"literal $interpolated literal"

    tag"literal${interpolated}literal"

There are built-in tags s for simple interpolation, f for formatted interpolation (where each interpolation has a %f printf-style format suffix), and raw for unescaped interpolation. Users can define their own tags, which correspond to a method invocation at run time.

There are no custom delimiters and indentation is not removed.

Swift

There are single-line (one ") and multiline literals (three """), and orthogonally, normal or extended literals.

    "literal\(interpolated)literal"

    """
        literal\(interpolated)literal
        """

    ####"literal\####(interpolated)literal"####

    ####"""
        literal\####(interpolated)literal
        """####

Like Rust, you can choose the number of # marks in the delimiters, and backslash-escapes are only interpreted if they have the same number of # marks.

Indentation is removed from multiline strings.

There are some restrictions on backslashes and newlines inside interpolations, but you can nest strings inside interpolations.

zzz

I think when I originally started thinking about generalized literals, the syntax I had in mind was rather outlandishly complicated, but by today’s standards it seems quite reasonable.

The $$ markers are reminiscent of $@ in C# or % in Ruby or << here-document markers. C++, JavaScript, Perl, Ruby, and Scala have something more or less like a language tag. D and Perl and Ruby have punctuated or bracketed delimiters. Un-indenting is gradually becoming more common.

But Swift is the only language with literals that can avoid clashing with the quoted language’s syntax, and also support interpolation.

I would like to see more compile-time code generation from string literals. C++ is furthest ahead there. (If you don’t count Lisp!)

