In 2010 I wrote down my idea for generalized string literals.
A lot has changed in programming languages since then: Go, Kotlin, Rust, and Swift have become important and influential; C++ and JavaScript have been significantly revamped. String literals in many languages are a lot more complicated than they were 10 years ago.
My design has evolved a little bit since my old description, and recently it has been agitating me for another write-up and a comparison with non-fantasy literal syntaxes.
- context
- requirements
- my generalized literals
- C++
- C#
- D
- Golang
- JavaScript
- Kotlin
- OCaml
- Perl
- PHP
- Python
- Ruby
- Rust
- Scala
- Swift
- zzz
context
In polyglot programming we often need to quote some code in another language. Even when we are coding in “one” language, we will use a bunch of microlanguages embedded in quotes.
- String escapes are a microlanguage. They combine awkwardly with the language we are quoting if it has a similar escape syntax, so we should be able to quote without escaping.
- Format strings are a collection of microlanguages - printf, scanf, strftime, etc. They are effectively a hand-written bytecode that is interpreted at run time. It would be more efficient to compile them.
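To make the compile-instead-of-interpret point a bit more concrete, here is a rough Python sketch. The name compile_format and the tiny %s / %d / %% subset are my own assumptions for illustration, not any real library's API. Python has no compile-time stage, so "compile" here only means parsing the format once, ahead of the calls that use it.

```python
def compile_format(fmt):
    """Parse a tiny printf subset (%s, %d, %%) once and return a formatter."""
    parts = []      # literal strings interleaved with converter functions
    literal = []    # characters of the literal run being collected
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch != "%":
            literal.append(ch)
            i += 1
            continue
        if i + 1 >= len(fmt):
            raise ValueError("dangling % at end of format")
        spec = fmt[i + 1]
        i += 2
        if spec == "%":
            literal.append("%")
            continue
        if spec not in ("s", "d"):
            raise ValueError(f"unsupported conversion %{spec}")
        if literal:
            parts.append("".join(literal))
            literal = []
        # str handles %s; format(v, "d") rejects non-integers for %d.
        parts.append(str if spec == "s" else (lambda v: format(v, "d")))
    if literal:
        parts.append("".join(literal))
    nargs = sum(1 for p in parts if callable(p))

    def formatter(*args):
        # Only argument substitution happens per call; no parsing.
        if len(args) != nargs:
            raise TypeError(f"expected {nargs} arguments, got {len(args)}")
        values = iter(args)
        return "".join(p(next(values)) if callable(p) else p for p in parts)

    return formatter


summary = compile_format("%s has %d items (100%%)")
print(summary("cart", 3))
```

A real implementation would do the parsing in the compiler and report a bad format or a wrong argument count as a compile error, which is the part I actually care about.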
There are a few difficulties when we’re quoting a big language, such as SQL, HTML, or shell commands:
- Our editor should be able to do syntax highlighting and code completion inside the quote according to the language that is quoted.
- Interpolation should be context-aware to protect against injection vulnerabilities, e.g. in SQL, interpolating string literals or identifiers or SQL fragments safely.
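The injection problem is easy to demonstrate with Python's standard sqlite3 module. This is only a sketch of the hazard and of the usual workaround (placeholders), not of a context-aware literal; the users table and the two find_user functions are made up for the example.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Naive interpolation: the value is spliced into the SQL text itself,
    # so quote marks in the value change the structure of the query.
    query = f"SELECT id FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Placeholder: the value travels separately from the SQL text,
    # so it can only ever be treated as a string value.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

hostile = "nobody' OR '1'='1"
print(len(find_user_unsafe(conn, hostile)))  # every row matches
print(len(find_user_safe(conn, hostile)))    # no row matches
```

A context-aware sql literal could turn an interpolation in value position into the second form automatically, and would need some other safe treatment for identifiers and fragments.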
Regular expressions sit somewhere between microlanguages and full languages: they clash so badly with string escapes that regexes often get their own special quoting syntax; they are crying out for doing more work at compile time; and they are complicated enough to benefit from having their own syntax highlighting.
requirements
We should be able to just paste some code in a foreign language, and add quote marks without having to alter the quoted code. Therefore:
- no escaping
- no fixed delimiters
With escaping and fixed delimiters, we can’t mention the escape character or the delimiter in quoted code without having to alter it.
language independence
A lexer should be able to find the start and end of the quote without knowing anything about the quoted language.
A lexer should be able to find the start and end of an interpolation without knowing anything about the quoted language or the surrounding / interpolated language.
The quoted language should be identified simply and explicitly, so that it’s easy for editors and static analysis tools to know what is quoted.
my generalized literals
There are four forms: flat or nested, short or long:
$tag"literal"
$tag{literal${interpolation}literal}
$$tag"""
literal
"""
$$tag{{{
literal${{{interpolation}}}literal
}}}
mark
The $ or $$ marks introduce a short or long literal respectively.
They are the only fixed part of the syntax.
tag
The tag identifies the quoted language. It is an identifier from a special namespace, so that tags can be short without clashing with other identifiers. A tag might imply some special compile-time or run-time handling of the quoted string. For example, fmt, q, re, sql, time.
delimiters
Short delimiters are one character; long delimiters can be one or more characters.
Nested delimiters can use any kind of brackets. There’s an official list of Unicode bidi paired brackets which specifies how they pair with each other. To get the closing delimiter, reverse the opening delimiter, and swap each character for its pair.
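The reverse-and-swap rule is small enough to sketch in Python. The PAIRS table below is an assumption: a handful of entries I picked from the Unicode paired-brackets list, where a real lexer would load the whole list. The function name closing_delimiter is also just for illustration.

```python
# Assumption: a small hand-picked subset of the Unicode bidi paired brackets.
PAIRS = {
    "(": ")",
    "[": "]",
    "{": "}",
    "⟦": "⟧",
    "⟨": "⟩",
    "「": "」",
}


def closing_delimiter(opening):
    """Reverse the opening delimiter and swap each bracket for its pair."""
    try:
        return "".join(PAIRS[ch] for ch in reversed(opening))
    except KeyError as err:
        raise ValueError(f"not an opening bracket: {err.args[0]!r}") from None


print(closing_delimiter("{{{"))
print(closing_delimiter("([{"))
```

Mixed openers such as ([{ close as }]), which is what makes the delimiter look balanced on the page.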
Flat delimiters can be any non-bracket punctuation. The closing delimiter is the same as the opening delimiter.
flat
A short flat literal ends at the first occurrence of the delimiter after the opening delimiter.
A long flat literal ends at the first occurrence of the delimiter on a line by itself. (Leading and trailing whitespace are allowed.)
Flat literals do not support interpolation.
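Here is a rough Python sketch of how a lexer might find the end of a flat literal without knowing anything about the quoted language. Several details are my assumptions rather than part of the design: tags are matched as ordinary identifiers, a long opening delimiter is taken to be whatever follows the tag on the opening line, and the function name read_flat is made up. Indentation handling is left out.

```python
import re

# Assumption: tags look like ordinary identifiers.
_HEAD = re.compile(r"(\$\$?)([A-Za-z_]\w*)")
_BRACKETS = "()[]{}<>"


def _check_flat(delim):
    # Flat delimiters are non-bracket punctuation, so reject everything else.
    if not delim or any(c.isalnum() or c.isspace() or c in _BRACKETS for c in delim):
        raise ValueError(f"bad flat delimiter {delim!r}")


def read_flat(src):
    """Split the flat literal at the start of src into (tag, body, rest)."""
    head = _HEAD.match(src)
    if head is None:
        raise ValueError("expected $tag or $$tag")
    mark, tag = head.groups()
    pos = head.end()

    if mark == "$":
        # Short: one delimiter character; the body ends at its next occurrence
        # and must not contain a newline.
        delim = src[pos:pos + 1]
        _check_flat(delim)
        end = src.find(delim, pos + 1)
        if end < 0 or "\n" in src[pos + 1:end]:
            raise ValueError("unterminated short literal")
        return tag, src[pos + 1:end], src[end + 1:]

    # Long: the body ends at the first line holding only the delimiter,
    # with leading and trailing whitespace allowed.
    eol = src.find("\n", pos)
    if eol < 0:
        raise ValueError("unterminated long literal")
    delim = src[pos:eol].rstrip()
    _check_flat(delim)
    lines = src[eol + 1:].split("\n")
    for n, line in enumerate(lines):
        if line.strip() == delim:
            return tag, "\n".join(lines[:n]), "\n".join(lines[n + 1:])
    raise ValueError("unterminated long literal")


print(read_flat('$re/a"b/ + x'))
print(read_flat('$$sql"""\nSELECT "x"\n  """\nrest'))
```

Note that the double quote inside the long literal is harmless, because only a line consisting of the whole delimiter ends it.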
nested
Nested literals end at the matching close delimiter. They can contain arbitrary matched nestings of the open and close delimiters.
Interpolations use the same delimiters as the surrounding literal.
They are marked with a single $ for long as well as short literals.
Interpolations do not affect how delimiters are matched. In particular, if an interpolation contains a string or generalized literal, that has no effect on how the outer literal’s delimiters are matched.
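The matching rule for nested literals amounts to a depth counter. The Python sketch below is illustrative only: nested_body and split_interpolations are hypothetical names, and the bracket table is cut down to ASCII. It never looks at what the quoted or interpolated code means, only at the delimiters.

```python
# Assumption: ASCII brackets only; a real lexer would use the Unicode list.
_PAIRS = {"(": ")", "[": "]", "{": "}"}


def nested_body(src, start, opening):
    """Return (body, end) for the nested literal whose opening delimiter
    begins at src[start]; end is the index just past the closing delimiter."""
    closing = "".join(_PAIRS[c] for c in reversed(opening))
    if not src.startswith(opening, start):
        raise ValueError("opening delimiter not found at start")
    depth = 1
    i = body_start = start + len(opening)
    while i < len(src):
        if src.startswith(opening, i):
            depth += 1
            i += len(opening)
        elif src.startswith(closing, i):
            depth -= 1
            if depth == 0:
                return src[body_start:i], i + len(closing)
            i += len(closing)
        else:
            i += 1
    raise ValueError("unterminated nested literal")


def split_interpolations(body, opening):
    """Split a literal body into ("text", ...) and ("expr", ...) parts.
    An interpolation is a single $ followed by the literal's own delimiters."""
    parts = []
    marker = "$" + opening
    text_start = 0
    while True:
        at = body.find(marker, text_start)
        if at < 0:
            break
        expr, end = nested_body(body, at + 1, opening)
        parts.append(("text", body[text_start:at]))
        parts.append(("expr", expr))
        text_start = end
    parts.append(("text", body[text_start:]))
    return parts


src = "x = $sql{SELECT ${f({1: 2})} FROM t}; y"
body, end = nested_body(src, src.index("{"), "{")
print(body)
print(split_interpolations(body, "{"))
```

Because interpolations get no special treatment during matching, an unbalanced bracket inside an interpolated string would end the outer literal early; the answer, as with any clash, is to pick different delimiters.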
whitespace
Short literals must not contain vertical whitespace. This is to make error recovery easier, and to limit the amount of syntax highlighting churn caused by incomplete literals.
Long literals can be indented. Every line of the literal must start with the same whitespace as the closing delimiter, and this indentation is removed from the resulting string. (A nested long literal is not indented if there is any non-whitespace on the line before the closing delimiter.)
In a long literal, the newline just after the opening delimiter and the newline just before the closing delimiter (if they exist) are not included in the resulting string.
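The long-literal whitespace rules can be sketched in Python as a post-processing step on the raw text between the delimiters. The function name finish_long is hypothetical, and the sketch takes the strictest reading of the rules: every line, blank ones included, must start with the closing delimiter's indentation.

```python
def finish_long(raw):
    """Apply the long-literal whitespace rules to the raw text found
    between the opening and closing delimiters."""
    lines = raw.split("\n")
    if len(lines) == 1 or lines[-1].strip():
        # Something other than whitespace precedes the closing delimiter on
        # its line, so the literal is not indented. Only the newline just
        # after the opening delimiter (if there is one) is dropped.
        return raw[1:] if raw.startswith("\n") else raw

    indent = lines.pop()    # whitespace in front of the closing delimiter
    if lines[0] == "":
        lines.pop(0)        # newline just after the opening delimiter
    for line in lines:
        if not line.startswith(indent):
            raise ValueError(f"line is not indented like the closing delimiter: {line!r}")
    return "\n".join(line[len(indent):] for line in lines)


print(repr(finish_long("\n    a\n      b\n    ")))
```

Indentation beyond the closing delimiter's is kept, so the closing delimiter acts as the ruler for the whole block.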
escaping
There is no way to escape delimiters - choose different ones that don’t clash instead.
In general, control character escapes and suchlike are a feature of the quoted language, not the generic syntax.
C++
In C++ you can define operator""tag(arg) to define tag as a user-defined literal suffix. The argument type depends on the kind of literal - numbers, characters, strings of various types. String literals look like
"literal"tag
R"delimiter(literal)delimiter"tag
The first includes traditional escapes; the second is raw.
There’s no interpolation, and indentation is not removed from raw strings.
C#
String literals can be normal, verbatim, interpolated, or both.
@"literal"
$"literal{interpolated}literal"
$@"literal{interpolated}literal"
Verbatim literals are not completely raw: you can include a " by doubling it "".
In an interpolation, ASCII brackets (){}[] must be balanced. You can escape { and } by doubling them, {{ and }}.
There are no custom delimiters, and indentation is not removed.
D
All strings are multiline in D. Raw strings can be like:
r"literal"
`literal`
q"{literal}"
q"/literal/"
q"HERE
literal
HERE"
In the third form you can use matching ASCII brackets (){}[]<>.
There’s no interpolation, and indentation is not removed.
Golang
Go has traditional escaped "" strings and raw strings delimited by backticks. There are no interpolations or custom delimiters, and indentation is not removed.
JavaScript
Template literals look like
tag`literal${interpolation}literal`
The tag is a function (with no special namespacing) that is passed a list of strings and interpolations. It can use the strings raw, or after escape sequences have been interpreted.
The interpolation has to be properly parsed as JavaScript to correctly find the end of the template literal. In particular, it can contain nested template literals.
There are no custom delimiters and indentation is not removed.
Kotlin
String literals can be short and escaped or long and raw. Both of them support interpolation.
"literal${interpolated}literal"
"literal $interpolated literal"
"""
literal${interpolated}literal
"""
You can omit the curly brackets when the interpolated expression is just a variable name. You need to use the circumlocution ${'$'} to include a $ in a raw string.
Indentation is not removed, but it’s idiomatic to use the trimMargin() method to achieve a similar effect.
There are no custom delimiters.
OCaml
There are quoted strings with escapes and raw strings with custom delimiters:
"literal"
{delimiter|literal|delimiter}
There is no interpolation and indentation is not removed.
Perl
A very elaborate literal syntax.
'literal'
"literal${interpolated}literal"
tag"literal${interpolated}literal"
tag{literal${interpolated}literal}
There is a fixed set of tags: single quotes desugar to q{}, double quotes desugar to qq{}, backquotes desugar to qx{}, regexes desugar to qr{}, etc. Delimiters can be matching ASCII brackets (){}[]<> or punctuation.
Unlike interpolation in most other languages that support it, in Perl you can only interpolate an lvalue - a variable or array element or hash member, etc. (There’s a circumlocution for interpolating arbitrary expressions.)
<<'HERE'
literal
HERE
<<"HERE"
literal${interpolated}literal
HERE
<<~HERE
literal${interpolated}literal
HERE
Here documents can be raw or interpolated, depending on how the delimiter is quoted (interpolated is the default), and in recent versions indentation is removed if you use the ~ flag.
PHP
In PHP you can interpolate a simple expression denoting a variable (including array elements and object properties) without curly brackets. Curlies after $ just delimit a variable name. The expression inside outer curlies {$var} is for more complicated variable denotations, not arbitrary expressions.
'literal'
"literal $interpolated literal"
"literal${interpolated}literal"
"literal{$interpolated}literal"
<<<'HERE'
literal
HERE
<<<HERE
literal{$interpolated}literal
HERE
Indentation is not removed from here documents.
Python
Strings can be long or short, and may be raw and/or formatted.
"literal"
"""
literal
"""
r"literal"
f"literal{interpolated=!:}literal"
Bizarrely for an indentation-oriented language, indentation is not removed from long strings.
In a formatted string, you can escape { and } by doubling them, {{ and }}. The interpolated expression can be followed by various options indicated by punctuation marks.
There are no custom delimiters.
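For the record, here is what the workarounds and the punctuation options look like in practice. This uses only the standard library (textwrap.dedent) and f-string features available from Python 3.8; the usage function and its text are made up for the example.

```python
import textwrap


def usage():
    # The long string keeps its source indentation; dedent() strips the
    # common leading whitespace after the fact, at run time.
    text = """\
        usage: tool [options]
          -h  show help
        """
    return textwrap.dedent(text)


count = 3
print(usage())
print(f"{count=}")      # "=" echoes the expression text before the value
print(f"{'a'!r}")       # "!r" applies repr() to the value
print(f"{count:04d}")   # ":" introduces a format specification
print(f"{{literal}}")   # doubled braces stand for literal braces
```

The backslash after the opening quotes is a line continuation that suppresses the first newline, which only works because this is not a raw string.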
Ruby
There’s a strong taint of Perl in Ruby, though Ruby uses a % mark where Perl tends to start tags with q.
'literal'
"literal#{interpolated}literal"
%tag"literal#{interpolated}literal"
%tag{literal#{interpolated}literal}
There is a fixed set of tags: single quotes desugar to %q{}, double quotes desugar to %Q{}, backquotes desugar to %x{}, regexes desugar to %r{}, etc. Delimiters can be matching ASCII brackets (){}[]<> or punctuation.
<<'HERE'
literal
HERE
<<"HERE"
literal#{interpolated}literal
HERE
<<~HERE
literal#{interpolated}literal
HERE
Here documents can be raw or interpolated, depending on how the delimiter is quoted (interpolated is the default), and indentation is removed if you use the ~ flag. (The older - flag only allows the closing delimiter to be indented; it does not remove indentation from the body.)
Rust
In Rust, raw strings look like
r####"literal"####
You can also add a suffix to a string literal, but this only has meaning in the context of a macro call.
You can choose the number of # marks in the delimiters. There is no interpolation, and indentation is not removed from raw strings.
Scala
Scala has an interpolation mechanism somewhat similar to JavaScript template literals:
tag"literal $interpolated literal"
tag"literal${interpolated}literal"
There are built-in tags s for simple interpolation, f for formatted interpolation (where each interpolation has a %f printf-style format suffix), and raw for unescaped interpolation. Users can define their own tags, which correspond to a method invocation at run time.
There are no custom delimiters and indentation is not removed.
Swift
There are single-line (one ") and multiline literals (three """), and orthogonally, normal or extended literals.
"literal\(interpolated)literal"
"""
literal\(interpolated)literal
"""
####"literal\####(interpolated)literal"####
####"""
literal\####(interpolated)literal
"""####
Like Rust, you can choose the number of # marks in the delimiters, and backslash-escapes are only interpreted if they have the same number of # marks.
Indentation is removed from multiline strings.
There are some restrictions on backslashes and newlines inside interpolations, but you can nest strings inside interpolations.
zzz
I think when I originally started thinking about generalized literals, the syntax I had in mind was rather outlandishly complicated, but by today’s standards it seems quite reasonable.
The $$ markers are reminiscent of $@ in C# or % in Ruby or << here-document markers. C++, JavaScript, Perl, Ruby, and Scala have something more or less like a language tag. D and Perl and Ruby have punctuated or bracketed delimiters. Un-indenting is gradually becoming more common.
But Swift is the only language with literals that can avoid clashing with the quoted language’s syntax, and also support interpolation.
I would like to see more compile-time code generation from string literals. C++ is furthest ahead there. (If you don’t count Lisp!)