The uniform resource locator (URL) is a data structure and an associated serialization format that aims to uniquely identify any resource on the Internet (and other networks). (See also uniform resource identifier (URI).) That's a lofty goal, but it has proven more or less tractable and practical. Which is astounding and great! A global network namespace enables powerful applications, and powerful interactions between applications.
However, URLs have some problems of usability, security, and economics. Many of us have wished for a global namespace with fewer problems. I'll address that first, and then I'll have some fun with the technical aspects of the problem. You can skip that stuff, if you like.
URLs have very poor usability both because they are structurally complex, and because their textual representation is unnecessarily ambiguous and ugly (syntaxy). Some of the structural complexity is necessary, and some of it is not.
The poor usability of URLs is a weak spot for advocates of semi-decentralized naming schemes like URLs and the DNS. People sometimes propose that a centralized naming scheme would be less chaotic and hence more usable and more safe. They do have a point, and we should address it.
For example, my colleague Owen Campbell-Moore argues that URLs are un-fixably terrible, and advocates for search engines to provide the trusted, and hopefully trustworthy, mapping between human-meaningful names and origins or URLs.
However, that requires the search engine, or other centralized naming authority, to be trustworthy. This proves difficult:
Similarly, a Google search for [ download chrome ] has lots of legitimate and correct results up top, but there are still fakes on the first page of results (at least at the time of writing, and for as long as I can remember). In fact, we used to have a recurring problem that obvious spoofs were at the very top of the results. Google's search engine could not reliably find Google's browser. In one sense, that indicates trustworthiness: Google doesn't seem to put its thumb on the scale. In another sense... sigh. 🤷🏻‍♂️
Perhaps ⨠machine learning ⨠could be useful in identifying spoofs, such as by comparing names and icons for similarity and raising them for human review. That would speed up the process of finding potential spoofs, potentially improving the centralized naming authorityâs trustworthiness. But weâd still be trusting the authority with a lot of power.
Obviously, in most of this post, I agree with Owen about the badness of URLs. But ultimately I do not agree that a centralized authority would be better, nor that we should switch to one.
I think the problem Owen poses can be resolved by investigating this question:
“Origins are not very user-friendly”
I fully agree that URLs are not usable, but I do believe that origins (the scheme, host, port tuple) are or can be made usable, and that if we succeed at that technical problem, we can reduce some of the pressure to centralize power.
I think we can make origins more usable by doing the following things in the Location Bar:
Note that Safari already does most of the above, although it commits what I consider an error: for sites with Extended Validation (EV) certificates, it shows the EV name instead of the eTLD+1. This opens a whole other can of goat-worms (which I have yelled about elsewhere). But you can get a glimpse of a better naming future by trying out Safari. Brave for desktop also shows only the hostname until you focus the Location Bar.
As a practical matter, eTLD+1 names, hostnames, and copied-and-pasted blobs are all people really use in the real world. Very few people use URLs as such, and that is perfectly fine! To improve URL usability and safety, application and platform developers need only go with the flow. Let's.
However, some more-or-less tractable problems remain even after we do all of the above.
URLs became user interface components almost immediately: people are expected to be able to type in URLs, copy and paste them, and (at least partially) parse them to extract security-relevant information, and sometimes to modify them. All this, including on tiny phone screens.
This turns out to be not-so-great, because in order to meet their goal, URLs have to be fairly complex, and object serialization and deserialization is a surprisingly hard problem even in simple cases. The end result is that most people have a very hard time actually using URLs in practice.
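As a small illustration (in Python, not from the original post), even a mature standard-library parser does not round-trip every URL string exactly:

```python
from urllib.parse import urlsplit, urlunsplit

# A URL with an empty-but-present query string.
original = "http://example.com/?"

# Split into components and reassemble.
roundtripped = urlunsplit(urlsplit(original))

# The empty query serializes away: the trailing "?" is lost.
print(roundtripped)  # http://example.com/
```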
Although the implementation Chrome uses is more complex (see also url::Parsed), we might imagine that the structure of a URL object need not be too complex. For example:
class URL {
  string scheme
  string username
  string password
  string host
  string port
  string path
  string query
  string ref  // Also called "fragment".
}
Well, that's a bit too simple. First, TCP and UDP port numbers are unsigned 16-bit integers, not arbitrary strings. Then, the host could be an IPv4 address, an IPv6 address, or an address in another network type. Or it could be a DNS hostname, a NetBIOS hostname, or a name in some other domain. Even considering just DNS and NetBIOS names, simple strings don't quite capture the type information we need.
So, we'll have to complicate our representation a bit. Let's try this:
abstract class NetworkAddress { ... }
class IPv4NetworkAddress extends NetworkAddress { ... }
class IPv6NetworkAddress extends NetworkAddress { ... }

abstract class HostName { ... }
class DNSName extends HostName { ... }
class NetBIOSName extends HostName { ... }

class HostIdentifier {
  enum Type { Address, Name }
  union {
    NetworkAddress address
    HostName name
  }
}

class URL {
  string scheme
  string username
  string password
  HostIdentifier host
  uint16_t port
  string path
  string query
  string ref  // Also called "fragment".
}
We've more tightly specified the port, and HostIdentifier is a sum type of the two abstract NetworkAddress and HostName types. In turn, the abstract types are made concrete for specific addressing and naming systems; we've given some modern examples for each.
Although the real details are madness-inducing, let's further assume for the moment that the string type is a sequence of Unicode characters.
The real-world analogues to each of these hypothetical classes have at least 1 serialization function and at least 1 deserialization function or parsing constructor. Even the IPv4 address, a humble 4-octet data structure, has a delightfully wacky set of representations. A simple program that uses the BSD functions inet_aton (deserializer) and inet_ntoa (serializer) produces the following equivalencies:
Serialized       Deserialized  Reserialized
222.173.190.239  0xDEADBEEF    222.173.190.239
0xDEADBEEF       0xDEADBEEF    222.173.190.239
033653337357     0xDEADBEEF    222.173.190.239
222.11386607     0xDEADBEEF    222.173.190.239
222.173.48879    0xDEADBEEF    222.173.190.239
127.0.0.1        0x7F000001    127.0.0.1
0x7F000001       0x7F000001    127.0.0.1
127.1            0x7F000001    127.0.0.1
127.0.1          0x7F000001    127.0.0.1
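Python's socket module wraps the same BSD routines, so you can reproduce a few rows of the table above directly. This is a sketch, and acceptance of the more exotic forms (notably hexadecimal) is platform-dependent:

```python
import socket

def reserialize(s: str) -> str:
    """Deserialize with inet_aton, then reserialize with inet_ntoa."""
    return socket.inet_ntoa(socket.inet_aton(s))

# Shorthand forms: the final number fills all remaining octets.
print(reserialize("127.1"))          # 127.0.0.1
print(reserialize("127.0.1"))        # 127.0.0.1
print(reserialize("222.173.48879"))  # 222.173.190.239
```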
As of this writing, Chrome will indeed take http://0x9765C143, convert it to http://151.101.193.67/, and navigate to it. Firefox navigates directly to http://0x9765C143 without first converting it to dotted decimal in the Location Bar.
IPv6 addresses have their own various representations, as Wikipedia discusses. Notably, to disambiguate the colon-separated hextets of the IPv6 address from the colon-separated port number in URL string representations, IPv6 addresses must be surrounded with square brackets in URLs:
https://[2001:db8:85a3:8d3:1319:8a2e:370:7348]:443/foo/bar/noodles
        +------------ IPv6 address ----------+  ^
                                                |
                                               port
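As an illustration (a Python sketch, not from the original post), a standard parser uses the brackets to keep the two meanings of the colon straight:

```python
from urllib.parse import urlsplit

u = urlsplit("https://[2001:db8:85a3:8d3:1319:8a2e:370:7348]:443/foo/bar/noodles")

# The brackets are stripped from the parsed hostname...
print(u.hostname)  # 2001:db8:85a3:8d3:1319:8a2e:370:7348
# ...and the trailing :443 is parsed as the port, not another hextet.
print(u.port)      # 443
```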
Whenever a language has lots of syntactic meta-characters, especially when some of the meta-characters have multiple meanings depending on their context, I say the language is “syntaxy”. If you try to write a URL parser, you'll find that it has to keep a fair amount of state to know whether this : is part of the scheme separator ://, or a hextet separator, or the port number separator. Similarly, / has at least 2 meanings.
Unconsciously perhaps, humans need to build the same state machine in their minds to parse URLs, or fail to, and get confused. Add on top of that the fact that many URL schemes are not real words, / looks kind of like \, and so on, and pretty soon people are just plain confused about the URL language. It's not a language people can speak easily.
An ideal solution to the URL usability problem would have (at least) the following properties:
We can't truly solve the problem without fundamentally re-thinking URLs. URLs are ubiquitous, and their problems are structural: there are just too many things in the data structure.
Perhaps what we can do is mitigate the badness somewhat. Arguably, it is fun to brainstorm about how.
First, we can remove parts of the URL we don't need or which exacerbate our problems. There is a beautiful example of syntaxyness gone wrong in Chromium issue 661005:
Steps to reproduce the problem:

1. navigate to https://www.google.com:443+q=elon@tesla.com
2. the resulting page should be https://www.tesla.com

What is the expected behavior?

Warn the user that they are about to post credentials
- username : "www.google.com"
- password : "443+q=elon"
A more important problem with usernames and passwords in URLs is that they obfuscate the URL's hostname, potentially improving the effectiveness of phishing attacks. For example, people might think that the URL https://paypal@phishing.com points to PayPal, but in fact it points to phishing.com.
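You can see the ambiguity directly with a standard parser (Python here, purely as an illustration):

```python
from urllib.parse import urlsplit

u = urlsplit("https://paypal@phishing.com/login")

# Everything before the "@" in the authority is credentials, not the host.
print(u.username)  # paypal
print(u.hostname)  # phishing.com
```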
Internet Explorer dropped support for credentials embedded in URLs a long-ass time ago. Wisely, Edge has not resumed supporting them.
Firefox supports embedded credentials, but warns you about the ambiguity.
Chrome does not support embedded credentials in URLs for subresources, but inevitably, that broke someone's use case.
If we consider that the problem with embedded credentials is that they confuse people, it would seem that we could break as few use cases as possible by allowing them in subresource URLs, and (like Firefox) warning the person about the ambiguity for top-level navigations.
But if we consider that the problem is not only that embedded credentials confuse people, but that they also increase the complexity, decrease the reliability, and decrease the uniformity of our URL parsers, then that suggests the minimal-breakage approach does not solve the whole problem.
Since IE and Edge do not support embedded credentials, they are effectively dead as a reliable web platform feature, and have been for over a decade. Why should Chrome and Firefox continue to indulge this phishiness?
(Other partially-specified languages, like JSON, suffer from terrible reliability and uniformity problems. A forward-looking platform, as I believe the web should be, should seek to gradually, gently, definitely shed these ambiguous legacy interfaces. And here's an interesting problem related to the non-uniformity of URL parsers.)
There's no credible, user-focused reason to support hexadecimal, octal, or other strange IP address representations. They might be used in attacks to obscure things somewhat (although even a dotted-quad representation might sufficiently obscure the nature of the host). Internet Explorer once granted special privileges (“Intranet Zone”) to URLs with no dots in the host component, including URLs using these obscure forms.
Other than for attacks, I would bet that nobody uses or wants these address forms. Probably at least some people reading this, already a technical audience, were surprised to learn that the strange representations exist at all. So letâs just get rid of these historical quirks.
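Newer, stricter parsers already take this position. Python's ipaddress module, for instance, accepts only the plain dotted-quad form and rejects the historical quirks:

```python
import ipaddress

# The canonical dotted-quad form parses fine.
print(ipaddress.IPv4Address("127.0.0.1"))  # 127.0.0.1

# The legacy hex, octal, and short forms are all rejected.
rejected = []
for s in ["0x7F000001", "033653337357", "127.1"]:
    try:
        ipaddress.IPv4Address(s)
    except ValueError:
        rejected.append(s)
print(rejected)  # ['0x7F000001', '033653337357', '127.1']
```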
These are mitigation approaches that perhaps might be nice to do, but which I suspect it's too late to try. Alas. But still...
2 of the several namespaces in URLs, DNS hostnames and pathnames, are hierarchical. But textually, they go in opposite directions!
In the DNS name www.example.com, com is the parent of example is the parent of www. The labels go left to right, child to parent. I'll call this little-endian naming.
In the pathname /noodles/doodles/poodles.php, noodles is the parent of doodles is the parent of poodles.php. The components go left to right, parent to child, the opposite relationship of DNS names. I'll call this big-endian naming.
https://www.example.com/noodles/doodles/poodles.php
        --------------- +++++++++++++++++++++++++++
        little-endian   big-endian
That's confusing enough on its own, but it gets weirder when you consider internationalized domain names, and other Unicode URL components. What makes it extra tricky is that some languages read right to left (RTL), like Arabic or Hebrew, instead of left to right (LTR), like English. Consider further that URLs can contain both LTR and RTL components. (Indeed, all URLs with RTL hostnames still have to have at least one LTR component: the leading https or other scheme.)
typhoonfilsy provided a nice example of this:
So now we have both little- and big-endian names, each containing sub-components that go LTR and RTL. Imagine trying to read that (a) at all; and (b) correctly; and (c) when trying to make a security decision about an origin!
So it sure would be helpful if the namespace hierarchies all went in the same direction, you know? That would reduce at least 1 aspect of the confusion.
https://com.example.www/noodles/doodles/poodles.php
        +++++++++++++++ +++++++++++++++++++++++++++
        big-endian      big-endian
This would be less confusing in an RTL language:
php.seldoop/seldood/seldoon/www.elpmaxe.moc://https
+++++++++++++++++++++++++++ +++++++++++++++
big-endian                  big-endian
RTL                         RTL             LTR
However, the proliferation of new top-level domain names (TLDs) reduces the effectiveness of the hypothetical plan to make DNS names big-endian. For example, both blog.google and google.blog are legal DNS hostnames with valid TLDs. (Only the former is currently registered and serving a live site. Another huge problem with the proliferation of TLDs is the creation of new spoofing opportunities.) Swapping the endianness of the names would create more confusion, not less, at least for these pathological cases.
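The hypothetical big-endian transformation is just a label reversal. A toy Python sketch (the function name is made up for illustration) also shows the TLD-proliferation hazard:

```python
def to_big_endian(hostname: str) -> str:
    """Reverse DNS labels so the hierarchy reads parent-to-child, left to right."""
    return ".".join(reversed(hostname.split(".")))

print(to_big_endian("www.example.com"))  # com.example.www

# The pathological case: blog.google and google.blog are distinct,
# valid hostnames that simply swap into each other under this scheme.
print(to_big_endian("blog.google"))  # google.blog
```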
We could also imagine a new URL syntax, with fewer and less ambiguous syntactic meta-characters. Just as a thought experiment and not as a serious proposal, imagine using only the comma , to separate URL components, and using the slash / only to separate tokens in namespaces:
https,com/example/www,,noodles/doodles/poodles.php
https,com/example/www,443,noodles/doodles/poodles.php
https,com/example/www,,noodles/doodles/poodles.php,q=cute%20puppies
https,com/example/www,,noodles/doodles/poodles.php,q=cute%20puppies,table-of-contents
As always, the meta-characters must be escaped when used inside a given component. Here, the , is escaped as %2C in the query string:
https,com/example/www,,noodles/doodles/poodles.php,q=cute%2C%20puppies
If the ,, indicating the default port for the scheme bothers you, and it probably should, we could imagine something like this:
https/443,com/example/www,noodles/doodles/poodles.php
https/8443,com/example/www,noodles/doodles/poodles.php
We could also imagine tagging each component with its name, rather than relying on their order. This would remove the requirement of empty placeholders for optional or default components. The result is harder to write, but perhaps easier to read:
scheme:https,port:443,host:com/example/org
scheme:https,port:443,host:com/example/org,path:a/b/c
scheme:https,host:com/example/org,path:a/b/c
host:com/example/org,path:a/b/c,scheme:https
Now we have a third meta-character to escape (in addition to , and /): the : as well.
Anyway, you get the idea: other, arguably better and/or differently-bad syntaxes are possible. Or, were possible.
That's more than enough for now. Time for beeeeeeerrrr...