Some Problems Of URLs

7 November 2017

Background

The uniform resource locator (URL) is a data structure and an associated serialization format that aims to uniquely identify any resource on the Internet (and other networks). (See also uniform resource identifier (URI).) That’s a lofty goal, but it has proven more or less tractable and practical. Which is astounding and great! A global network namespace enables powerful applications, and powerful interactions between applications.

However, URLs have some problems of usability, security, and economics. Many of us have wished for a global namespace with fewer problems. I’ll address that first, and then I’ll have some fun with the technical aspects of the problem. You can skip that stuff, if you like.

Names Are Power

URLs have very poor usability both because they are structurally complex, and because their textual representation is unnecessarily ambiguous and ugly (syntaxy). Some of the structural complexity is necessary, and some of it is not.

The poor usability of URLs is a weak spot for advocates of semi-decentralized naming schemes like URLs and the DNS. People sometimes propose that a centralized naming scheme would be less chaotic and hence more usable and more safe. They do have a point, and we should address it.

For example, my colleague Owen Campbell-Moore argues that URLs are un-fixably terrible, and advocates for search engines to provide the trusted, and hopefully trustworthy, mapping between human-meaningful names and origins or URLs.

However, that requires the search engine, or other centralized naming authority, to be trustworthy. This proves difficult:

A
screenshot of Google Play Store showing search results for “whatsapp”, with
numerous perfect spoofs. — “This is horrible.. minefield” — Cristian Vat on Twitter

Similarly, a Google search for [ download chrome ] has lots of legitimate and correct results up top, but there are still fakes on the first page of results (at least at the time of writing, and for as long as I can remember). In fact, we used to have a recurring problem that obvious spoofs were at the very top of the results. Google’s search engine could not reliably find Google’s browser. In one sense, that indicates trustworthiness — Google doesn’t seem to put its thumb on the scale. In another sense... sigh. 🤷🏻‍♂️

Perhaps ✨ machine learning ✨ could be useful in identifying spoofs, such as by comparing names and icons for similarity and raising them for human review. That would speed up the process of finding potential spoofs, potentially improving the centralized naming authority’s trustworthiness. But we’d still be trusting the authority with a lot of power.

Obviously, in most of this post, I agree with Owen about the badness of URLs. But ultimately I do not agree that a centralized authority would be better, nor that we should switch to one.

I think the problem Owen poses can be resolved by investigating this question:

“Origins are not very user-friendly”

I fully agree that URLs are not usable, but I do believe that origins (the scheme, host, port tuple) are or can be made usable — and that if we succeed at that technical problem, we can reduce some of the pressure to centralize power.

Making Origins Usable

I think we can make origins more usable by doing the following things in the Location Bar:

Show only the hostname. Not the port, not the scheme. If necessary or useful, we can also consider showing only the effective TLD + 1 label (eTLD+1). In fact I think it will prove necessary and useful.
Show a negative security indicator, like Chrome’s Not Secure chip, for non-secure schemes.
Show no indicator for secure schemes.
Continue to deprecate and remove non-secure schemes. Ideally, the ongoing project to HTTPS-ify the web could get the number of schemes people regularly see down to 1.

Note that Safari already does most of the above, although it commits what I consider an error: for sites with Extended Validation (EV) certificates, it shows the EV name instead of the eTLD+1. This opens a whole other can of goat-worms (which I have yelled about elsewhere). But you can get a glimpse of a better naming future by trying out Safari. Brave for desktop also shows only the hostname until you focus the Location Bar.

A screenshot of
Safari showing only the eTLD+1. — Glorious, isn’t it?

As a practical matter, eTLD+1 names, hostnames, and copied-and-pasted blobs are all people really use in the real world. Very few people use URLs as such — and that is perfectly fine! To improve URL usability and safety, application and platform developers need only go with the flow. Let’s.

However, some more-or-less tractable problems remain even after we do all of the above.

Ill-formed hostnames will still be confusing.
- I believe confusingness is by itself a useful indicator for people, and that well-formed hostnames like “facebook.com” and “baidu.com” make good brands. Hostnames are ubiquitous in advertising and pop culture.
- Ill-formed hostnames are inherently fishy, and people have a decent chance to distinguish “facebook.com” from “facebook.com.wumpgarble.phishing.blog”, especially if we show only eTLD+1: “phishing.blog” is clearly not “facebook.com”.
- As baidu.com and mixi.jp show, they are even usable across a language and character set barrier, although IDNs also exist and can help.
People will still need to share links, and developers will still need to read, write, and modify them.
- I think we can handle this by making the full URL visible and editable when the Location Bar takes focus.
Homoglyph attacks will continue to exist.
- Note that centralized naming authorities also have this problem.
- I believe that the mechanisms for coping with it can work either in centralized or decentralized naming schemes.

The Problem

URLs became user interface components almost immediately: people are expected to be able to type in URLs, copy and paste them, and (at least partially) parse them to extract security-relevant information, and sometimes to modify them. All this, including on tiny phone screens.

This turns out to be not-so-great, because in order to meet their goal, URLs have to be fairly complex, and object serialization and deserialization is a surprisingly hard problem even in simple cases. The end result is that most people have a very hard time actually using URLs in practice.

URLs Are Surprisingly Complex

Although the implementation Chrome uses is more complex (see also url::Parsed), we might imagine that the structure of a URL object need not be too complex. For example:

class URL {
  string scheme
  string username
  string password
  string host
  string port
  string path
  string query
  string ref    // Also called "fragment".
}

Well, that’s a bit too simple. First, TCP and UDP port numbers are unsigned 16-bit integers, not arbitrary strings. Then, the host could be an IPv4 address, an IPv6 address, or an address in another network type. Or it could be a DNS hostname, a NetBIOS hostname, or a name in some other domain. Even considering just DNS and NetBIOS names, simple strings don’t quite capture the type information we need.

The DNS is hierarchical (having up to 127 levels), and a name consists of 1 or more labels, each containing 1 to 63 octets, and the total length of the internal representation of a name can be at most 255 octets. (Wikipedia)
Wikipedia says that “NetBIOS names are 16 octets in length and vary based on the particular implementation. Frequently, the 16th octet, called the NetBIOS Suffix, designates the type of resource, and can be used to tell other applications what type of services the system offers.” That leaves a lot of questions open, but we won’t dig into them here.

So, we’ll have to complicate our representation a bit. Let’s try this:

abstract class NetworkAddress { ... }

class IPv4NetworkAddress extends NetworkAddress { ... }

class IPv6NetAddress extends NetworkAddress { ... }

abstract class HostName { ... }

class DNSName extends HostName { ... }

class NetBIOSName extends HostName { ... }

class HostIdentifier {
  enum Type {
    Address,
    Name
  }

  union {
    NetworkAddress address
    HostName name
  }
}

class URL {
  string scheme
  string username
  string password
  HostIdentifier host
  uint16_t port
  string path
  string query
  string ref  // Also called "fragment".
}

We’ve more tightly specified the port, and HostIdentifier is a sum type of 2 abstract NetworkAddress and HostName types. In turn, the abstract types are made concrete for specific addressing and naming systems; we’ve given some modern examples for each.

Although the real details are madness-inducing, let’s further assume for the moment that the string type is a sequence of Unicode characters.

The real-world analogues to each of these hypothetical classes has at least 1 serialization function and least 1 deserialization function or parsing constructor. Even the IPv4 address, a humble 4-octet data structure, has a delightfully wacky set of representations. A simple program that uses the BSD functions inet_aton (deserializer) and inet_ntoa (serializer) produces the following equivalencies:

Serialized       Deserialized  Reserialized   
222.173.190.239  0xDEADBEEF    222.173.190.239
0xDEADBEEF       0xDEADBEEF    222.173.190.239
033653337357     0xDEADBEEF    222.173.190.239
222.11386607     0xDEADBEEF    222.173.190.239
222.173.48879    0xDEADBEEF    222.173.190.239
127.0.0.1        0x7F000001    127.0.0.1      
0x7F000001       0x7F000001    127.0.0.1      
127.1            0x7F000001    127.0.0.1      
127.0.1          0x7F000001    127.0.0.1

As of this writing, Chrome will indeed take http://0x9765C143, convert it to http://151.101.193.67/, and navigate to it. Firefox navigates directly to http://0x9765C143 without first converting it to dotted decimal in the Location Bar.

IPv6 addresses have their own various representations, as Wikipedia discusses. Notably, to disambiguate colon-separated hextets of the IPv6 address from the colon-separated port number in URL string representations, IPv6 addresses must be surrounded with square braces in URLs:

https://[2001:db8:85a3:8d3:1319:8a2e:370:7348]:443/foo/bar/noodles
         +------- IPv6 address -------------+   ^
                                                |
                                               port

Syntaxyness

Whenever a language has lots of syntactic meta-characters, especially when some of the meta-characters have multiple meanings depending on their context, I say the language is “syntaxy”. If you try to write a URL parser, you’ll find that it has to keep a fair amount of state to know whether this : is part of the scheme separator ://, or a hextet separator, or the port number separator. Similarly, / has at least 2 meanings.

Unconsciously perhaps, humans need to build the same state machine in their minds to parse URLs — or fail to, and get confused. Add on top of that the fact that many URL schemes are not real words, / looks kind of like \, and so on, and and pretty soon people are just plain confused about the URL language. It’s not a language people can speak easily.

Goals For A Solution

An ideal solution to the URL usability problem would have (at least) the following properties:

unambiguous grammar
clear delineation of the security-relevant origin
not too tedious to write
not too tedious to read (low in syntaxyness)
fewer components to avoid confusion and to reduce the need for syntactic complexity

Mitigations

We can’t truly solve the problem without fundamentally re-thinking URLs. URLs are ubiquitous, and their problems are structural: there are just too many things in the data structure.

Perhaps what we can do is mitigate the badness somewhat. Arguably, it is fun to brainstorm about how.

Deprecate And Remove Fields From URLs

First, we can remove parts of the URL we don’t need or which exacerbate our problems. There is a beautiful example of syntaxyness gone wrong in Chromium issue 661005:

Steps to reproduce the problem:
1. navigate to https://www.google.com:443+q=elon@tesla.com
2. the resulting page should be https://www.tesla.com

What is the expected behavior?
Warn the user that they are about to post credentials
      - username : "www.google.com"
      - password : "443+q=elon"

A more important problem with usernames and passwords in URLs is that they obfuscate the URL’s hostname, potentially improving the effectiveness of phishing attacks. For example, people might think that the URL https://paypal@phishing.com points to PayPal, but in fact it points to phishing.com.

Internet Explorer dropped support for credentials embedded in URLs a long-ass time ago. Wisely, Edge has not resumed supporting them.

Firefox supports embedded credentials, but warns you about the ambiguity.

Firefox
warns you when you browse to a URL that contains embedded credentials. — Firefox warns you when you browse to a URL that contains embedded credentials.

Chrome does not support embedded credentials in URLs for subresources, but inevitably, that broke someone’s use case.

If we consider that the problem with embedded credentials is that they confuse people, it would seem that we could break as few use cases as possible by allowing them in subresource URLs, and (like Firefox) warning the person about the ambiguity for top-level navigations.

But if we consider that the problem is not only that embedded credentials confuse people, but that they also increase the complexity, decrease the reliability, and decrease the uniformity of our URL parsers, then that suggests the minimal-breakage approach does not solve the whole problem.

Since IE and Edge do not support embedded credentials, they are effectively dead as a reliable web platform feature, and have been for over a decade. Why should Chrome and Firefox continue to indulge this phishiness?

(Other partially-specified languages, like JSON, suffer from terrible reliability and uniformity problems. A forward-looking platform, as I believe the web should be, should seek to gradually, gently, definitely shed these ambiguous legacy interfaces. And here’s an interesting problem related to the non-uniformity of URL parsers.)

Deprecate And Remove Weird Host Address Representations

There’s no credible, user-focused reason to support hexadecimal, octal, or other strange IP address representations. They might be used in attacks to obscure things somewhat (although even a dotted-quad representation might sufficiently obscure the nature of the host). Internet Explorer once granted special privileges (‘Intranet Zone’) to URLs with no dots in the host component — including URLs using these obscure forms.

Other than for attacks, I would bet that nobody uses or wants these address forms. Probably at least some people reading this, already a technical audience, were surprised to learn that the strange representations exist at all. So let’s just get rid of these historical quirks.

Imaginary Approaches

These are mitigation approaches that perhaps might be nice to do, but which I suspct it’s too late to try. Alas. But still...

Hierarchical Names That Go In The Same Direction

2 of the several namespaces in URLs, DNS hostnames and pathnames, are hierarchical. But textually, they go in opposite directions!

In the DNS name www.example.com, com is the parent of example is the parent of www. The labels go left to right, child to parent. I’ll call this little-endian naming.

In the pathname /noodles/doodles/poodles.php, noodles is the parent of doodles is the parent of poodles.php. The components go left to right, parent to child — the opposite relationship of DNS names. I’ll call this big-endian naming.

https://www.example.com/noodles/doodles/poodles.php
        --------------- +++++++++++++++++++++++++++
        little-endian   big-endian

That’s confusing enough on its own, but it gets weirder when you consider internationalized domain names, and other Unicode URL components. What makes it extra tricky is that some languages read right to left (RTL), like Arabic or Hebrew, instead of left to right (LTR), like English. Consider further that URLs can contain both LTR and RTL components. (Indeed, all URLs with RTL hostnames still have to have at least one LTR component: the leading https or other scheme.)

typhoonfilsy provided a nice example of this:

URL with both Arabic and English
in both the hostname and path components. — URL with both Arabic and English in both the hostname and path components.

So now we have both little- and big-endian names, each containing sub-components that go LTR and RTL. Imagine trying to read that (a) at all; and (b) correctly; and (c) when trying to make a security decision about an origin!

So it sure would be helpful if the namespace hierarchies all went in the same direction, you know? That would reduce at least 1 aspect of the confusion.

https://com.example.www/noodles/doodles/poodles.php
        +++++++++++++++ +++++++++++++++++++++++++++
        big-endian      big-endian

This would be less confusing in an RTL language:

php.seldoop/seldood/seldoon/www.elpmaxe.moc://https
+++++++++++++++++++++++++++ +++++++++++++++
                 big-endian      big-endian
                        RTL             RTL     LTR

However, the proliferation of new top-level domain names (TLDs) reduces the effectiveness of the hypothetical plan to make DNS names big-endian. For example, both blog.google and google.blog are legal DNS hostnames with valid TLDs. (Only the former is currently registered and serving a live site. Another huge problem with the proliferation of TLDs is the creation of new spoofing opportunities.) Swapping the endianness of the names would create more confusion, not less, at least for these pathological cases.

Minimizing Syntaxyness

We could also imagine a new URL syntax, with fewer and less ambiguous syntactic meta-characters. Just as a thought experiment and not as a serious proposal, imagine using only the comma , to separate URL components, and using the slash / only to separate tokens in namespaces:

https,com/example/www,,noodles/doodles/poodles.php
https,com/example/www,443,noodles/doodles/poodles.php
https,com/example/www,,noodles/doodles/poodles.php,q=cute%20puppies
https,com/example/www,,noodles/doodles/poodles.php,q=cute%20puppies,table-of-contents

As always, the meta-characters must be escaped when used inside a given component. Here, the , is escaped as %2C in the query string:

https,com/example/www,,noodles/doodles/poodles.php,q=cute%2C%20puppies

If the ,, indicating the default port for the scheme bothers you, and it probably should, we could imagine something like this:

https/443,com/example/www,noodles/doodles/poodles.php
https/8443,com/example/www,noodles/doodles/poodles.php

We could also imagine tagging each component with its name, rather than relying on their order. This would remove the requirement of empty placeholders for optional or default components. The result is harder to write, but perhaps easier to read:

scheme:https,port:443,host:com/example/org
scheme:https,port:443,host:com/example/org,path:a/b/c
scheme:https,host:com/example/org,path:a/b/c
host:com/example/org,path:a/b/c,scheme:https

Now we have a third meta-character to escape (,, /, and now :) as well.

Anyway, you get the idea: other, arguably better and/or differently-bad syntaxes are possible. Or, were possible.

That’s more than enough for now. Time for beeeeeeerrrr...