In Pursuit of Laziness

Manish Goregaokar's blog

Picking Apart the Crashing iOS String

Posted by Manish Goregaokar on February 15, 2018 in programming, unicode

So there’s yet another iOS text crash, where just looking at a particular string crashes iOS. Basically, if you put this string in any system text box (and other places), it crashes that process. I’ve been testing it by copy-pasting characters into Spotlight so I don’t end up crashing my browser.

The original sequence is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, which is a sequence of Telugu characters: the consonant ja (జ), a virama ( ్ ), the consonant nya (ఞ), a zero-width non-joiner, and the vowel aa ( ా).

I was pretty interested in what made this sequence “special”, and started investigating.

So first when looking into this, I thought that the <ja, virama, nya> sequence was the culprit. That sequence forms a special ligature in many Indic scripts (ज्ञ in Devanagari) which is often considered a letter of its own. However, the ligature for Telugu doesn’t seem very “special”.

Also, from some experimentation, this bug seemed to occur for any pair of Telugu consonants with a vowel, as long as the vowel is not   ై (ai). Huh.

The ZWNJ must be doing something weird, then. <consonant, virama, consonant, vowel> is a pretty common sequence in any Indic script; but ZWNJ before a vowel isn’t very useful for most scripts (except for Bengali and Oriya, but I’ll get to that).

And then I saw that there was a sequence in Bengali that also crashed.

The sequence is U+09B8 U+09CD U+09B0 U+200C U+09C1, which is the consonant “so” (স), a virama ( ্ ), the consonant “ro” (র), a ZWNJ, and vowel u (  ু).

Before we get too into this, let’s first take a little detour to learn how Indic scripts work:

Indic scripts and consonant clusters

Indic scripts are abugidas; which means that their “letters” are consonants, which you can attach diacritics to to change the vowel. By default, consonants have a base vowel. So, for example, क is “kuh” (kə, often transcribed as “ka”), but I can change the vowel to make it के (the “ka” in “okay”) का (“kaa”, like “car”).

Usually, the default vowel is the ə sound, though not always (in Bengali it’s more of an o sound).

Because of the “default” vowel, you need a way to combine consonants. For example, if you wished to write the word “ski”, you can’t write it as स + की (sa + ki = “saki”), you must write it as स्की. What’s happened here is that the स got its vowel “killed”, and got tacked on to the की to form a consonant cluster ligature.

You can also write this as स्‌की . That little tail you see on the स is known as a “virama”; it basically means “remove this vowel”. Explicit viramas are sometimes used when there’s no easy way to form a ligature, e.g. in ङ्‌ठ because there is no simple way to ligatureify ङ into ठ. Some scripts also prefer explicit viramas, e.g. “ski” in Malayalam is written as സ്കീ, where the little crescent is the explicit virama.

In unicode, the virama character is always used to form a consonant cluster. So स्की was written as <स,  ्, क,  ी>, or <sa, virama, ka, i>. If the font supports the cluster, it will show up as a ligature, otherwise it will use an explicit virama.

For Devanagari and Bengali, usually, in a consonant cluster the first consonant is munged a bit and the second consonant stays intact. There are exceptions – sometimes they’ll form an entirely new glyph (क + ष = क्ष), and sometimes both glyphs will change (ड + ड = ड्ड, द + म = द्म, द + ब = द्ब). Those last ones should look like this in conjunct form:

Investigating the Bengali case

Now, interestingly, unlike the Telugu crash, the Bengali crash seemed to only occur when the second consonant is র (“ro”). However, I can trigger it for any choice of the first consonant or vowel, except when the vowel is  ো (o) or  ৌ (au).

Now, র is an interesting consonant in some Indic scripts, including Devanagari. In Devanagari, it looks like र (“ra”). However, it does all kinds of things when forming a cluster. If you’re having it precede another consonant in a cluster, it forms a little feather-like stroke, like in र्क (rka). In Marathi, that stroke can also look like a tusk, as in र्‍क. As a suffix consonant, it can provide a little “extra leg”, as in क्र (kra). For letters without a vertical stroke, like ठ (tha), it does this caret-like thing, ठ्र (thra).

Basically, while most consonants retain some of their form when put inside a cluster, र does not. And a more special thing about र is that this happens even when र is the second consonant in a cluster – as I mentioned before, for most consonant clusters the second consonant stays intact. While there are exceptions, they are usually specific to the cluster; it is only र for which this happens for all clusters.

It’s similar in Bengali, র as the second consonant adds a tentacle-like thing on the existing consonant. For example, প + র (po + ro) gives প্র (pro).

But it’s not just র that does this in Bengali, the consonant “jo” does as well. প + য (po + jo) forms প্য (pjo), and the য is transformed into a wavy line called a “jophola”.

So I tried it with য — , and it turns out that the Bengali crash occurs for য as well! So the general Bengali case is <consonant, virama, র OR য, ZWNJ, vowel>, where the vowel is not  ো or  ৌ.

Suffix-joining consonants

So we’re getting close, here. At least for Bengali, it occurs when the second consonant is such that it often combines with the first consonant without modifying its form much.

In fact, this is the case for Telugu as well! Consonant clusters in Telugu are usually formed by preserving the original consonant, and tacking the second consonant on below!

For example, the original crashy string contains the cluster జ + ఞ, which looks like జ్ఞ. The first letter isn’t really modified, but the second is.

From this, we can guess that it will also occur for Devanagari with र. Indeed it does! U+0915 U+094D U+0930 U+200C U+093E, that is, <क,  ्, र, zwnj,  ा> (< ka, virama, ra, zwnj, aa >) is one such crashing sequence.

But this isn’t really the whole story, is it? For example, the crash does occur for “kro” + zwnj + vowel in Bengali, and in “kro” (ক্র = ক + র = ko + ro) the resultant cluster involves the munging of both the prefix and suffix. But the crash doesn’t occur for द्ब or ड्ड. It seems to be specific to the letter, not the nature of the cluster.

Digging deeper, the reason is that for many fonts (presumably the ones in use), these consonants form “suffix joining consonants”1 (a term I made up) when preceded by a virama. This seems to correspond to the pstf OpenType feature, as well as vatu.

For example, the sequence virama + क gives   ्क, i.e. it renders a virama with a placeholder followed by a क.

But, for र, virama + र renders  ्र, which for me looks like this:

In fact, this is the case for the other consonants as well. For me,  ्र  ্র  ্য  ్ఞ  ్క (Devanagari virama-ra, Bengali virama-ro, Bengali virama-jo, Telugu virama-nya, Telugu virama-ka) all render as “suffix joining consonants”:

(This is true for all Telugu consonants, not just the ones listed).

An interesting bit is that the crash does not occur for <र, virama, र, zwnj, vowel>, because र-virama-र uses the prefix-joining form of the first र (र्र). The same occurs for র with itself or ৰ or য. Because the virama is “stickier” to the left in these cases, it doesn’t cause a crash. (h/t hackbunny for discovering this using a script to enumerate all cases).

Kannada also has “suffix joining consonants”, but for some reason I cannot trigger the crash with it. Ya in Gurmukhi is also suffix-joining.

The ZWNJ

The ZWNJ is curious. The crash doesn’t happen without it, but as I mentioned before a ZWNJ before a vowel doesn’t really do anything for most Indic scripts. In Indic scripts, a ZWNJ can be used to explicitly force a virama if used after the virama (I used it to write स्‌की in this post), however that’s not how it’s being used here.

In Bengali and Oriya specifically, a ZWNJ can be used to force a different vowel form when used before a vowel (e.g. রু vs র‌ু), however this bug seems to apply to vowels for which there is only one form, and this bug also applies to other scripts where this isn’t the case anyway.

The exception vowels are interesting. They’re basically all vowels that are made up of two glyph components. Philippe Verdy points out:

And why this bug does not occur with some vowels is because these are vowels in two parts, that are first decomposed into two separate glyphs reordered in the buffer of glyphs, while other vowels do not need this prior mapping and keep their initial direct mapping from their codepoints in fonts, which means that this has to do to the way the ZWNJ looks for the glyphs of the vowels in the glyphs buffer and not in the initial codepoints buffer: there’s some desynchronization, and more probably an uninitialized data field (for the lookup made in handling ZWNJ) if no vowel decomposition was done (the same data field is correctly initialized when it is the first consonnant which takes an alternate form before a virama, like in most Indic consonnant clusters, because the a glyph buffer is created.

Generalizing

So, ultimately, the full set of cases that cause the crash are:

Any sequence <consonant1, virama, consonant2, ZWNJ, vowel> in Devanagari, Bengali, and Telugu, where:

  • consonant2 is suffix-joining (pstf/vatu) – i.e. र, র, য, ৰ, and all Telugu consonants
  • consonant1 is not a reph-forming letter like र/র (or a variant, like ৰ)
  • vowel does not have two glyph components, i.e. it is not   ై,   ো, or   ৌ

This leaves one question open:

Why doesn’t it apply to Kannada? Or, for that matter, Khmer, which has a similar virama-like thing called a “coeng”?

Are these valid strings?

A recurring question I’m getting is if these strings are valid in the language, or unicode gibberish like Zalgo text. Breaking it down:

  • All of the rendered glyphs are valid. The original Telugu one is the root of the word for “knowledge” (and I’ve taken to calling this bug “forbidden knowledge” for that reason).
  • In Telugu and Devanagari, there is no functional use of the ZWNJ as used before a vowel. It should not be there, and one would not expect it in typical text.
  • In Bengali (also Oriya), putting a ZWNJ before some vowels prevents them from ligatureifying, and this is mentioned in the Unicode spec. However, it seems rare for native speakers to use this.
  • In all of these scripts, putting a ZWNJ after viramas can be used to force an explicit virama over a ligature. That is not the position ZWNJ is used here, but it gives a hint that this might have been a mistype. Doing this is also rare at least for Devanagari (and I believe for the other two scripts as well)
  • Android has an explicit key for ZWNJ on its keyboards for these languages2, right next to the spacebar. iOS has this as well on the long-press of the virama key. Very easy to mistype, at least for Android.

So while the crashing strings are usually invalid, and when not, very rare, they are easy enough to mistype.

An example by @FakeUnicode was the string “For/k” (or “Foŕk”, if accents were easier to type). A slash isn’t something you’d normally type there, and the produced string is gibberish, but it’s easy enough to type by accident.

Except of course that the mistake in “For/k”/”Foŕk” is visually obvious and would be fixed; this isn’t the case for most of the crashing strings.

Conclusion

I don’t really have one guess as to what’s going on here – I’d love to see what people think – but my current guess is that the “affinity” of the virama to the left instead of the right confuses the algorithm that handles ZWNJs after viramas into thinking the ZWNJ applies to the virama (it doesn’t, there’s a consonant in between), and this leads to some numbers not matching up and causing a buffer overflow or something. Philippe’s diagnosis of the vowel situation matches up with this.

An interesting thing is that I can cause this crash to happen more reliably in browsers by clicking on the string.

Additionally, sometimes it actually renders in spotlight for a split second before crashing; which means that either the crash isn’t deterministic, or it occurs in some process after rendering. I’m not sure what to think of either. Looking at the backtraces, the crash seems to occur in different places, so it’s likely that it’s memory corruption that gets uncovered later.

I’d love to hear if folks have further insight into this.

Update: Philippe on the Unicode mailing list has an interesting theory

Yes, I could attach a debugger to the crashing process and investigate that instead, but that’s no fun 😂

  1. Philippe Verdy points out that these may be called “phala forms” at least for Bengali 

  2. I don’t think the Android keyboard needs this key; the keyboard seems very much a dump of “what does this unicode block let us do”, and includes things like Sindhi-specific or Kashmiri-specific characters for the Marathi keyboard as well as extremely archaic characters, whilst neglecting more common things like the eyelash reph (which doesn’t have its own code point but is a special unicode sequence; native speakers should not be expected to be aware of this sequence).