We need less powerful languages

Many systems boast of being ‘powerful’, and it sounds difficult to argue that this is a bad thing. Almost everyone who uses the word assumes that it is always a good thing.

The thesis of this post is that in many cases we need less powerful languages and systems.

Before I get going, I should say that there is very little original insight in this post. The train of thought behind it was set off by reading Hofstadter’s book Gödel, Escher, Bach – an Eternal Golden Braid, which helped me pull together various threads in my own thinking where I’ve seen the principle in action. Philip Wadler’s post on the rule of least power was also formative, and most of all I’ve taken a lot from the content of this video from a Scala conference about everything that is wrong with Scala, which makes the following fairly central point:

Every increase in expressiveness brings an increased burden on all who care to understand the message.

My aim is simply to illustrate this point using examples that might be more accessible to the Python community than the internals of a Scala compiler.

I also need a word about definitions. What do we mean by “more powerful” or “less powerful” languages? In this article, I mean something roughly like this: “the freedom and ability to do whatever you want to do”, seen mainly from the perspective of the human author entering data or code into the system. This roughly aligns with the concept of “expressiveness”, though not perhaps with a formal definition. (More formally, many languages have equivalent expressiveness in that they are all Turing complete, but we still recognise that some are more powerful in that they allow a certain outcome to be produced with fewer words or in multiple ways, with greater freedoms for the author).

The problem with this kind of freedom is that every bit of power you insist on having when writing in the language corresponds to power you must give up at other points of the process – when ‘consuming’ what you have written. I’ll illustrate this with various examples which range beyond what might be described as programming, but have the same principle at heart.

We’ll also need to ask “Does this matter?” It matters, of course, to the extent that you need to be able to ‘consume’ the messages that have been put into the system. The players who might ‘consume’ a message include software maintainers, compilers and other development tools – which means you almost always care. This has implications for performance and correctness, as well as for human concerns.

Databases and schema

Starting at the low end of the scale in terms of expressiveness, there is what you might call data rather than language. But both “data” and “language” can be thought of as “messages to be received by someone”, and the principle applies to both.

In my years of software development, I’ve found that clients and users often ask for “free text” fields, often for “notes”. A free text field is maximally powerful as far as the end user is concerned – they can put whatever they like in. In this sense, this is the “most useful” field – you can use it for anything.

But precisely because of this, it is also the least useful, because it is the least structured. Even search doesn’t work reliably because of typos and alternative ways of expressing the same thing. The longer I do software development involving databases, the more I want to tightly constrain everything as much as possible. When I do so, the data I end up with is massively more useful. I can do powerful things when consuming the data only when I severely limit the power (i.e. the freedom) of the agents putting data into the system.
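
For example, here is a hypothetical Django model (the names are invented for illustration) contrasting a free text field with constrained alternatives:

from django.db import models

class Ticket(models.Model):
    # A free text field: maximally flexible to write to, but nearly
    # impossible to query reliably.
    notes = models.TextField(blank=True)

    # Constrained alternatives: less freedom for the author, far more
    # useful for anyone consuming the data.
    STATUS_CHOICES = [('open', 'Open'), ('closed', 'Closed')]
    status = models.CharField(max_length=10, choices=STATUS_CHOICES)
    due_date = models.DateField(null=True, blank=True)

A query like Ticket.objects.filter(status='open') is now exact, where searching the notes field for the word “open” would be guesswork.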

In terms of database technologies, the same point can be made. Databases that are “schema-less” give you great flexibility and power when putting data in, and are extremely unhelpful when getting it out. A key-value store is a more technical version of “free text”, with the same drawbacks – it is pretty unhelpful when you want to extract info or do anything with the data, since you cannot guarantee that any specific keys will be there.

HTML

The success of the web has been partly due to the fact that some of the core technologies, HTML and CSS, have been deliberately limited in power. Indeed, you probably wouldn’t call them programming languages, but markup languages. This was not an accident, but a deliberate design principle on the part of Tim Berners-Lee. I can’t do better than to quote his “Principles of Design” page at length:

Computer Science in the 1960s to 80s spent a lot of effort making languages which were as powerful as possible. Nowadays we have to appreciate the reasons for picking not the most powerful solution but the least powerful. The reason for this is that the less powerful the language, the more you can do with the data stored in that language. If you write it in a simple declarative form, anyone can write a program to analyze it in many ways. The Semantic Web is an attempt, largely, to map large quantities of existing data onto a common language so that the data can be analyzed in ways never dreamed of by its creators. If, for example, a web page with weather data has RDF describing that data, a user can retrieve it as a table, perhaps average it, plot it, deduce things from it in combination with other information. At the other end of the scale is the weather information portrayed by the cunning Java applet. While this might allow a very cool user interface, it cannot be analyzed at all. The search engine finding the page will have no idea of what the data is or what it is about. Thus the only way to find out what a Java applet means is to set it running in front of a person.

This has become a W3C principle:

Good Practice: Use the least powerful language suitable for expressing information, constraints or programs on the World Wide Web.

Note that this is almost exactly the opposite of Paul Graham’s advice (with the caveat that ‘power’ is often too informally defined to compare):

if you have a choice of several languages, it is, all other things being equal, a mistake to program in anything but the most powerful one.

Python setup.py MANIFEST.in file

Moving up towards ‘proper’ programming languages, I came across this example — the MANIFEST.in file format used by distutils/setuptools. If you have had to create a package for a Python library, you may well have used it.

The file format is essentially a very small language for defining which files should be included in your Python package (paths are relative to the directory containing the MANIFEST.in file, which we’ll call the working directory from now on). It might look something like this:

include README.rst
recursive-include foo *.py
recursive-include tests *
global-exclude *~
global-exclude *.pyc
global-exclude .DS_Store

There are two types of directive: include type directives (include, recursive-include, global-include and graft), and exclude type directives (exclude, recursive-exclude, global-exclude and prune).

The question arises: how are these directives to be interpreted (i.e. what are the semantics)?

You could interpret them in this way:

A file from the working directory (or sub-directories) should be included in the package if it matches at least one include type directive, and does not match any exclude type directive.

This would make it a declarative language.
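
To make the difference concrete, here is a minimal sketch of that declarative reading in Python (hypothetical code, ignoring the recursive-/global- directive variants and using simple glob matching):

import fnmatch

def included(path, include_patterns, exclude_patterns):
    # A file is in the package iff it matches at least one include
    # pattern and no exclude pattern; the order of patterns is irrelevant.
    def matches(patterns):
        return any(fnmatch.fnmatch(path, p) for p in patterns)
    return matches(include_patterns) and not matches(exclude_patterns)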

Unfortunately, that is not how the language is defined. The distutils docs for MANIFEST.in are specific about this – the directives are to be understood as follows (my paraphrase):

  1. Start with an empty list of files to include in the package (or technically, a default list of files).

  2. Go down the directives in the MANIFEST.in in order.

  3. For every include type directive, copy any matching files from the working directory to the list for the package.

  4. For every exclude type directive, remove any matching files from the list for the package.

As you can see, this interpretation defines a language that is imperative in nature – each line of MANIFEST.in is a command that implies an action with side effects.
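
A correspondingly minimal sketch of this imperative reading (again hypothetical, with the same simplifications as the declarative sketch above):

import fnmatch

def build_file_list(all_files, directives):
    result = []  # the running list of files for the package
    for action, pattern in directives:
        if action == 'include':
            # Copy matching files from the full list into the package list.
            result += [f for f in all_files
                       if fnmatch.fnmatch(f, pattern) and f not in result]
        elif action == 'exclude':
            # Remove matching files from the package list.
            result = [f for f in result if not fnmatch.fnmatch(f, pattern)]
    return result

Note that directive order now matters: an exclude followed by a later include can re-add files, a trap we will come back to below.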

The point to note is that this makes the language more powerful than my speculative declarative version above. For example, consider the following:

recursive-include foo *
recursive-exclude foo/bar *
recursive-include foo *.png

The end result of the above commands is that .png files that are below foo/bar are included, but all other files below foo/bar are not. If I’m thinking straight, replicating the same result using the declarative language is harder – you would have to do something like the following, which is obviously sub-optimal:

recursive-include foo *
recursive-exclude foo/bar *.txt *.rst *.gif *.jpeg *.py ...

So, because the imperative language is more powerful, there is a temptation to prefer that one. However, the imperative version comes with significant drawbacks:

  1. It is much harder to optimise.

    When it comes to interpreting the MANIFEST.in and building a list of files to include in the package, one fairly efficient solution for a typical case is to first build an immutable list of all files in the directory and its sub-directories, and then apply the rules: addition rules involve copying from the full list to an output list, and subtraction rules involve removing from the output list. This is how the Python implementation currently does it.

    This works OK, unless you have many thousands of files in the full list, most of which are going to get pruned or not included, in which case you can spend a lot of time building up the full list, only to ignore most of it.

    An obvious shortcut is to not recurse into directories that would be excluded by some exclude directive. However, you can only do that if the exclude directives come after all include directives.

    This is not a theoretical problem – I’ve found that doing setup.py sdist and other commands can take 10 minutes to run, due to a large number of files in the working directory (which happens if you use the tool tox, for instance). This means that runs of tox itself (which uses setup.py) become very slow. I am currently attempting to fix this issue, but it is looking like it will be really hard.

    Adding the optimised case might not look that hard (you can shortcut the file system traversal using any exclude directives that come after all include directives), but it adds sufficiently to the complexity that a patch is unlikely to be accepted – it increases the number of code paths and the chances of mistakes, to the point of it not being worth it.

    It might be that the only practical solution is to avoid MANIFEST.in altogether and optimise only the case when it is completely empty.

  2. The power has a second cost – MANIFEST.in files are harder to understand.

    First, in understanding how the language works – the docs for this are considerably longer than for the declarative version I imagined.

    Second, in analysing a specific MANIFEST.in file – you have to execute the commands in your head in order to work out what the result will be, rather than being able to take each line on its own, or in any order that makes sense to you.

    This actually results in packaging bugs. For instance, it would be easy to believe that a directive like:

    global-exclude *~

    at the top of a MANIFEST.in file would result in any file name ending in ~ (temporary files created by some editors) being excluded from the package. In reality it does nothing at all, and the files will be erroneously included if other commands include them.

    Examples I’ve found of this mistake (exclude directives that don’t function as intended or are useless) include:

    • hgview (exclude directives at the top do nothing)

    • django-mailer (global-exclude at the top does nothing)

  3. Another result is that you cannot group lines in the MANIFEST.in file in any way you please, for clarity, since re-ordering changes the meaning of the file.

In addition, virtually no-one will actually use the additional power. I’m willing to bet that 99.99% of MANIFEST.in files do not make use of the additional power of the imperative language (I downloaded 250 and haven’t found any that do). So we could have been served much better by a declarative language here instead of an imperative one. But backwards compatibility forces us to stick with this. That highlights another point – it is often possible to add features to a language to make it more powerful, but compatibility concerns usually don’t allow you to make it less powerful, for example by removing features or adding constraints.

URL reversing

One core piece of the Django web framework is URL routing. This is the component that parses URLs and dispatches them to the handler for that URL, possibly passing some components extracted from the URL.

In Django, this is done using regular expressions. For an app that displays information about kittens, you might have a kittens/urls.py with the following:

from django.conf.urls import url

from kittens import views

urlpatterns = [
    url(r'^kittens/$', views.list_kittens, name="kittens_list_kittens"),
    url(r'^kittens/(?P<id>\d+)/$', views.show_kitten, name="kittens_show_kitten"),
]

The corresponding views.py file looks like:

def list_kittens(request):
    # ...

def show_kitten(request, id):
    # ...

Regular expressions have a capture facility built in, which is used to capture parameters that are passed to the view functions. So, for example, if this app were running on cuteness.com, a URL like http://www.cuteness.com/kittens/23/ results in calling the Python code show_kitten(request, id="23").

Now, as well as being able to route URLs to specific functions, web apps almost always need to generate URLs. For example, the kitten list page will need to include links to the individual kitten page i.e. show_kitten. Obviously we would like to do this in a DRY way, re-using the URL routing configuration.

However, we would be using the URL routing configuration in the opposite direction. When doing URL routing, we are doing:

URL path -> (handler function, arguments)

In URL generation, we know the handler function and arguments we want the user to arrive at, and want to generate a URL that will take the user there, after going through the URL routing:

(handler function, arguments) -> URL path

In order to do this, we essentially have to predict the behaviour of the URL routing mechanism. We are asking “given a certain output, what is the input?”

In the very early days, Django did not include this facility, but it was found that for most URLs it was possible to ‘reverse’ the URL pattern: the regex can be parsed, looking for the static elements and the capture elements.

Note, first of all, that this is only possible at all because the language being used to define URL routes is a limited one – regular expressions. We could easily have defined URL routes using a more powerful language. For example, we could have defined them using functions that:

  • take a URL path as input

  • raise NoMatch if they do not match

  • return a truncated URL and an optional set of captures if they do match.

Our kittens urls.py would look something like this:

from django.conf.urls import url, NoMatch

from kittens import views

def match_kitten(path):
    KITTEN = 'kitten/'
    if path.startswith(KITTEN):
        return path[len(KITTEN):], {}
    raise NoMatch()

def capture_id(path):
    part = path.split('/')[0]
    try:
        id = int(part)
    except ValueError:
        raise NoMatch()
    return path[len(part)+1:], {'id': id}

urlpatterns = [
    url([match_kitten], views.list_kittens, name='kittens_list_kittens'),
    url([match_kitten, capture_id], views.show_kitten, name="kittens_show_kitten"),
]

Of course, we could provide helpers that make things like match_kitten and capture_id much more concise:

from django.conf.urls import url, m, c

urlpatterns = [
    url([m('kitten/')], views.list_kittens, name='kittens_list_kittens'),
    url([m('kitten/'), c(id=int)], views.show_kitten, name="kittens_show_kitten"),
]

Now, this language for URL routing is actually a lot more powerful than our regex based one, assuming that m and c are returning functions as above. The interface for matching and capturing is not limited to the capabilities of regexes – for instance, we could do database lookups for the IDs, or many other things.

The downside, however, is that URL reversing would be entirely impossible. For general, Turing complete languages, you cannot ask “given this output, what is the input?”. We could potentially inspect the source code of the function and look for known patterns, but it quickly becomes totally impractical.

With regular expressions, however, the limited nature of the language gives us more options. In general, URL configuration based on regexes is not reversible — a regex as simple as . cannot be reversed uniquely. (Since we normally want to generate canonical URLs, a unique solution is important. As it happens, for this wild card, Django currently picks an arbitrary character, but other wild cards are not supported.) But as long as wild cards of any sort are only found within capture groups (and possibly some other constraints), the regex can be reversed.
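
To illustrate why the restriction helps, here is a minimal sketch (emphatically not Django’s implementation) that reverses patterns in which wild cards appear only inside named capture groups, by splicing the supplied arguments into the static text:

import re

GROUP_RE = re.compile(r'\(\?P<(\w+)>[^)]+\)')

def reverse(pattern, **kwargs):
    # Drop the anchors; everything outside the capture groups is static.
    path = pattern.lstrip('^').rstrip('$')
    return GROUP_RE.sub(lambda m: str(kwargs[m.group(1)]), path)

reverse(r'^kittens/(?P<id>\d+)/$', id=23)   # -> 'kittens/23/'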

So, if we want to be able to reliably reverse the URL routes, we actually want a language less powerful than regular expressions. Regular expressions were presumably chosen because they were powerful enough, without realising that they were too powerful.

Additionally, in Python defining mini-languages for this kind of thing is quite hard, and requires a fair amount of boilerplate and verbosity both for implementation and usage – much more than when using a string based language like regexes. In languages like Haskell, relatively simple features like easy definitions of algebraic data types and pattern matching make these things much easier.

Regular expressions

The mention of regexes as used in Django’s URL routing reminds me of another problem:

Many usages of regexes are relatively simple, but whenever you invoke a regex, you get the full power whether you need it or not. One consequence is that for some regular expressions, the need to do backtracking to find all possible matches means that it is possible to construct malicious input that takes a huge amount of time to be processed by the regex implementation.
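
The classic illustration (not the pattern from the Django CVE mentioned below) is a regex with nested quantifiers:

import re

# Nested quantifiers force the backtracking engine to try exponentially
# many ways of splitting the input before concluding there is no match.
evil = re.compile(r'^(a+)+$')
evil.match('a' * 25 + 'b')   # each extra 'a' roughly doubles the running time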

This has been the cause of a whole class of Denial Of Service vulnerabilities in many web sites and services, including one in Django due to an accidentally 'evil' regex in the URL validator — CVE-2015-5145 (and one that took down the whole of Stack Exchange - update 2016-07-22).

A less powerful string matching language wouldn’t have these problems.

Django templates vs Jinja templates

The Jinja template engine was inspired by the Django template language, but with some differences in philosophy and syntax.

One major advantage of Jinja2 over Django templates is performance. Jinja2’s implementation strategy is to compile templates to Python code, rather than run an interpreter written in Python (which is how Django works), and this results in a big performance increase – often 5 to 20 times. (YMMV etc.)

Armin Ronacher, the author of Jinja, attempted to use the same strategy to speed up Django template rendering. There were problems, however.

The first problem was one he knew about when he proposed the project – namely, that the extension API in Django makes the approach taken in Jinja very difficult. Django allows custom template tags that have almost complete control over the compilation and rendering steps. This allows some powerful custom template tags, like addtoblock in django-sekizai, that seem impossible at first glance. However, if a slower fallback were provided for these less common situations, a fast implementation might still have been useful.

However, there is another key difference, and it affects a lot of templates: in Django, the context object that is passed in (which holds the data needed by the template) is writable during the template rendering process. Template tags are able to assign to the context, and in fact some built-in template tags, like url, do just that.
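
A minimal sketch of a custom template tag node (illustrative, not one of the built-ins) shows the mutability involved:

from django import template

class AssignNode(template.Node):
    def render(self, context):
        # render() receives the live context and is free to write to it.
        context['greeting'] = 'hello'
        return ''   # renders nothing, but the context has changed

A later {{ greeting }} in the template would see the new value, so a compiler cannot treat the context as fixed data.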

The result of this is that a key part of the compilation to Python that happens in Jinja is impossible in Django.

Notice that in both of these, it is the power of Django’s template engine that is the problem – it allows code authors to do things that are not possible in Jinja2. However, the result is that a very large obstacle is placed in the way of attempts to compile to fast code.

This is not a theoretical consideration. At some point, performance of template rendering becomes an issue for many projects, and a number have been forced to switch to Jinja because of that. This is far from an optimal situation!

Often the issues that make optimisation difficult are only clear with the benefit of hindsight, and it isn’t true to say that simply adding restrictions to a language is necessarily going to make it easier to optimise. There are certainly languages which somehow manage to hit a “sour spot” of providing little useful power to either the authors or the consumers!

You might also say that for the Django template designers, allowing the context object to be writable was the obvious choice because Python data structures are typically mutable by default. Which brings us to Python...

Python

There are many ways that we could think about the power of the Python language, and how it makes life hard for every person and program that wants to make sense of Python code.

Compilation and performance of Python is an obvious one. The unrestricted effects that are possible at any point, including writable classes and modules etc., not only allow authors to do some very useful things, but also make it extremely difficult to execute Python code quickly. PyPy has made some impressive progress, but looking at the curve from PyPy 1.3 onward, which shows diminishing returns, it seems unlikely that much bigger gains will come in the future. And the gains that have been made in run time have often come at the expense of memory usage. There is simply a limit to how well you can optimise Python code.

(Please note, to all who continue reading this – I’m not a Python basher, or a Django basher for that matter. I’m a core developer of Django, and I use Python and Django in almost all my professional programming work. The point of this post is to illustrate the problems caused by powerful languages).

However, rather than focus on the performance problems of Python, I’m going to talk about refactoring and maintenance. If you do any serious work in a language, you find yourself doing a lot of maintenance, and being able to do it quickly and correctly often becomes very important.

So, for example, in Python, and with typical VCS tools (Git or Mercurial, for instance), if you re-order functions in a module e.g. move a 10 line function to a different place, you get a 20 line diff, despite the fact that nothing changed in terms of the meaning of the program. And if something did change (the function was both moved and modified), it’s going to be very difficult to spot.

This happened to me recently, and set me off thinking just how ridiculously bad our toolsets are. Why on earth are we treating our highly structured code as a bunch of lines of text? I can’t believe that we are still programming like this – it’s insane!

At first, you might think that this could be solved with a more intelligent diff tool. But the problem is that in Python, the order in which functions are defined can in fact change the meaning of a program (i.e. change what happens when you execute it).

Here are a few examples:

Using a previously defined function as a default argument:

def foo():
    pass

def bar(a, callback=foo):
    pass

These functions can’t be re-ordered or you’ll get a NameError for foo in the definition of bar.

Using a decorator:

@decorateit
def foo():
    pass

@decorateit
def bar():
    pass

Due to unrestricted effects that are possible in @decorateit, you can’t safely re-order these functions and be sure the program will do the same thing afterwards. Similarly, calling some code in the function argument list:

def foo(x=Something()):
    pass

def bar(x=Something()):
    pass

Similarly, class level attributes can’t be re-ordered safely:

class Foo():
    a = Bar()
    b = Bar()

Due to unrestricted effects possible inside the Bar constructor, the definitions of a and b cannot be re-ordered safely. (This might seem theoretical, but Django, for instance, actually uses this ability inside Model and Form definitions to provide a default order for the fields, using a cunning class level counter inside the base Field constructor).
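
A simplified sketch of the trick (Django’s actual Field class really does use an attribute called creation_counter for this):

import itertools

class Field:
    _counter = itertools.count()

    def __init__(self):
        # Record the order in which instances are created, so that
        # declaration order inside a class body becomes meaningful.
        self.creation_counter = next(Field._counter)

class Form:
    b = Field()
    a = Field()

# Sorting by creation_counter recovers the declaration order: b, then a.
fields = sorted([Form.a, Form.b], key=lambda f: f.creation_counter)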

Ultimately, you have to accept that a sequence of function statements in Python is a sequence of actions in which objects (functions and default arguments) are created, possibly manipulated, etc. It is not a re-orderable set of function declarations as it might be in other languages.

This gives Python an amazing power when it comes to writing it, but imposes massive restrictions on what you can do in any automated way to manipulate Python source code.

Above I used the simple example of re-ordering two functions or class attributes. But every single type of refactoring that you might do in Python becomes virtually impossible to do safely because of the power of the language. For example, duck typing means you can’t safely do method renames, and the possibility of reflection/dynamic attribute access (getattr and friends) means you can’t in fact do any kind of automated rename safely.
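
For example (illustrative code), dynamic attribute access can refer to a method via a string that no rename tool can see:

class Kitten:
    def meow(self):
        return 'meow'

name = 'me' + 'ow'          # the method name only exists at runtime
getattr(Kitten(), name)()   # renaming meow() silently breaks this call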

So, if we are tempted to blame our crude VCS or refactoring tools, we actually have to blame the power of Python – despite the huge amount of structure in correct Python source code, there is very little that any software tool can do with it when it comes to manipulating it, and the line-based diffing that got me so mad is actually a reasonable approach.

Now, 99% of the time, we don’t write Python decorators which mean that the order of function definitions makes a difference, or silly things like that – we are “consenting adults”, as Guido put it, and this makes life easier for human consumers. But the fact remains that our tools are limited by what we do in the remaining 1% of cases. For some consumers, we can optimise on the basis of the common case and detect when that fails – e.g. a JIT compiler using guards. But with others, e.g. VCS or refactoring tools, the “runtime” information that you hit the unlucky case comes far too late – you might have released your subtly-broken code by the time you find out, so you have to be safe rather than sorry.

In an ideal world, with my dream language, when you rename a function, the entire “diff” in your VCS should simply be “Function foo renamed to bar”. (And, this should be exportable/importable, so that when you upgrade a dependency to a version in which foo is renamed to bar, it should be exactly zero work to deal with this). In a “less powerful” language, this would be possible, but the power given to the program author in Python has taken power from all the other tools in the environment.

Does this matter? It depends on how much time you spend manipulating your code, compared to using code to manipulate data.

At the beginning of a project, you may be tempted to desire the most powerful language possible, because it gives you the most help and freedom in terms of manipulating data. But later on, you spend a huge amount of time manipulating code, and often using an extremely basic tool to do so – a text editor. This treats your highly structured code as one of the least structured forms of data – a string of text – exactly the kind of manipulation you would avoid at all costs inside your code. But all the practices you would choose and rely on inside your program (manipulating all data inside appropriate containers) are no longer available to you when it comes to manipulating the program itself.

Some popular languages make automated refactoring easier, but more is needed: to actually make use of the structure of your code, you need an editor and VCS that understand your code properly. Projects like Lamdu, Unison and isomorf.io are steps in the right direction, but they are still in their infancy, and unfortunately involve re-thinking the entire software development stack :-(

Summary

When you consider the total system and all the players (whether software or human), including the need to produce efficient code and ensure long term maintainability, less powerful languages are actually more powerful – “slavery is freedom”. There is a balance to be struck between expressiveness and the ability to reason about what has been written.

The more powerful a language, the greater the burden on software tools, which either need to be more complicated in order to work, or are forced to do less than they could. This includes:

  • compilers – with big implications for performance.

  • automated refactoring and VCS tools – with big implications for maintenance.

Similarly, the burden also increases for humans – for anyone attempting to understand the code or modify it.

A natural instinct is to go for the most powerful solution, or a solution that is much more powerful than is actually needed. We should try to do the opposite — find the least powerful solution that will do the job.

This won’t happen if creating new languages (which might involve parsers etc.) is hard work. We should prefer software ecosystems that make it easy to create very small and weak languages.

