
@mranney
Created January 30, 2012 23:05
Why we can't process Emoji anymore
From: Chris DeSalvo <chris.desalvo@voxer.com>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
Message-Id: <AE459007-DF2E-4E41-B7A4-FA5C2A83025F@voxer.com>
If you are not interested in the technical details of why Emoji currently do not work in our iOS client, you can stop reading now.
Many, many years ago a Japanese cell phone carrier called SoftBank came up with the idea for emoji and built it into the cell phones that it sold for its network. The problem they had was in deciding how to represent the characters in electronic form. They decided to use Unicode code points in the private use areas. This is a perfectly valid thing to do as long as your data stays completely within your product. However, with text messages the data has to interoperate with other carriers' phones.
Unfortunately, SoftBank decided to copyright their entire set of images, their encoding, etc., etc., etc., and refused to license them to anyone. So, when NTT and KDDI (two other Japanese carriers) decided that they wanted emoji, they had to do their own implementations. To make things even sadder, they decided not to work with each other and gang up on SoftBank. So, in Japan, there were three competing emoji standards that did not interoperate.
In 2008 Apple released iOS 2.2 and added support for the SoftBank implementation of emoji. Since SoftBank would not license their emoji out for use on networks other than their own, Apple agreed to only make the emoji keyboard visible on iPhones that were on the SoftBank network. That's why you used to have to run an ad-ware app to make that keyboard visible.
Later, in 2010, the Unicode Consortium released version 6.0 of the Unicode standard. (In case anyone cares, Unicode originated in 1987 as a joint research project between Xerox and Apple.) The smart Unicode folks added all of the emoji (about 740 glyphs), plus the new Indian Rupee sign, more symbols needed for several African languages, and hundreds more CJK symbols for, well, Chinese/Japanese/Korean (CJK also covers Vietnamese, but now, like then, nobody gives Vietnam any credit).
With iOS 5.0, Apple (wisely) decided to adopt Unicode 6.0. The emoji keyboard was made available to all users and generates code points from their new Unicode 6.0 locations. Apple also added this support to OS X Lion.
You may be asking, "So this all sounds great. Why can't I type a smiley in Voxer and have the damn thing show up?" Glad you asked. Consider the following glyph:
😄
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84
You can get this info for any character that OS X can render by bringing up the Character Viewer panel and right-clicking on a glyph and selecting "Copy Character Info". So, what this shows us is that for this smiley face the Unicode code point is 0x1F604. For those of you who are not hex-savvy, that is the decimal number 128,516. That's a pretty big number.
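If you want to double-check those numbers yourself, a couple of lines of JavaScript will do it. This is just a sketch, and it assumes an engine that has String.prototype.codePointAt, which not every engine does:

    // Confirm the smiley's code point in hex and decimal.
    // (Sketch; assumes a JS engine with codePointAt.)
    const smiley = '😄';
    console.log(smiley.codePointAt(0).toString(16)); // '1f604'
    console.log(smiley.codePointAt(0));              // 128516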
The code point that SoftBank had used was 0xFB55 (or 64,341 decimal). That's a pretty tiny number. You can represent 64,341 with just 16 bits. Dealing with 16 bits is something computers do really well. To represent 0x1F604 you need 17 bits. Since bits come in 8-packs, you end up using 24 total. Computers hate odd numbers, and dealing with a group of 3 bytes is a real pain.
I have to make a side trip now and explain Unicode character encodings. Different kinds of computer systems, and the networks that connect them, think of data in different ways. Inside the computer the processor thinks of data in terms defined by its physical properties. An old Commodore 64 operated on one byte, 8 bits, at a time. Later computers had 16-bit hardware, then 32, and now most of the computers you will encounter on your desk prefer to operate on data 64 bits (8 bytes) at a time. Networks still like to think of data as a string of individual bytes and try to ignore any such logical groupings. To represent the entire Unicode code space you need 21 bits. That is a frustrating size. Also, if you tend to work in Latin script (English, French, Italian, etc.) where all of the codes you'll ever use fit neatly in 8 bits (the ISO Latin-1 set), then it is wasteful to have to use 24 bits (21 rounded up to the next byte boundary) because those top 16 bits will always be unused. So what do you do? You make alternate encodings.
There are many encodings, the most common being UTF-8 and UTF-16. There is also a UTF-32, but it isn't very popular since it's not space-friendly. UTF-8 has the nice property that all of the original ASCII characters preserve their encoding. So far in this email every single character I've typed (other than the smiley) has been an ASCII character and fits neatly in 7 bits. One byte per character is really friendly to work with, fits nicely in memory, and doesn't take much space on disk. If you sometimes need to represent a big character, like that smiley up there, then you do that with a multi-byte sequence. As we can see in the info above, the UTF-8 for that smiley is the 4-byte sequence [F0 9F 98 84]. Make a file with those four bytes in it, open it in any editor that is UTF-8 aware, and you'll get that smiley.
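If you'd like to try that yourself, here is a minimal Node.js sketch. It assumes Buffer.from and the fs module are available; adjust for whatever Node you happen to be running:

    // Write the 4-byte UTF-8 sequence for the smiley and read it back as text.
    // (Minimal sketch; assumes a Node.js with Buffer.from and fs.)
    const fs = require('fs');
    fs.writeFileSync('smiley.txt', Buffer.from([0xF0, 0x9F, 0x98, 0x84]));
    console.log(fs.readFileSync('smiley.txt', 'utf8')); // prints the smiley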
Some Unicode-aware programming languages such as Java, Objective-C, and (most) JavaScript systems use the UTF-16 encoding internally. UTF-16 has some really good properties of its own that I won't digress into here. The thing to note is that it uses 16 bits for most characters. So, whereas a small letter 'a' would be the single byte 0x61 in ASCII or UTF-8, in UTF-16 it is the 2-byte 0x0061. Note that the SoftBank 0xFB55 fits nicely in that 16-bit space. Hmm, but our smiley has a Unicode value of U+1F604 (we use U+ when throwing Unicode values around in hexadecimal) and that will NOT fit in 16 bits. Remember, we need 17. So what do we do? Well, the Unicode guys are really smart (UTF-8 is fucking brilliant, no, really!) and they invented a thing called a "surrogate pair". With a surrogate pair you can use two 16-bit values to encode a code point that is too big to fit into a single 16-bit field. Surrogate pairs have a specific bit pattern in their top bits that lets UTF-16 compliant systems know that they are a surrogate pair representing a single code point and not two separate UTF-16 code points. In the example smiley above, we find that the UTF-16 surrogate pair that encodes U+1F604 is [U+D83D U+DE04]. Put those four bytes into a file, open it in any program that understands UTF-16, and you'll see that smiley. He really is quite cheery.
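The arithmetic behind a surrogate pair is simple enough to sketch in a few lines of JavaScript (illustration only; any UTF-16 aware library does this for you):

    // Split a code point above U+FFFF into a UTF-16 surrogate pair.
    // (Sketch only, to show the arithmetic.)
    function toSurrogatePair(codePoint) {
      const offset = codePoint - 0x10000;   // 0xF604 for U+1F604
      const high = 0xD800 + (offset >> 10); // 0xD83D
      const low  = 0xDC00 + (offset & 0x3FF); // 0xDE04
      return [high, low];
    }
    toSurrogatePair(0x1F604).map(n => n.toString(16)); // ['d83d', 'de04']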
So, I've already said that Objective-C and Java and (most) JavaScript systems use UTF-16 internally, so we should be all cool, right? Well, see, it was that "(most)" that is the problem.
Before there was UTF-16 there was another encoding used by Java and JavaScript called UCS-2. UCS-2 is a strict 16-bit encoding. You get 16 bits per character and no more. So how do you represent U+1F604, which needs 17 bits? You don't. Period. UCS-2 has no notion of surrogate pairs. For most of its history this was OK because the Unicode Consortium hadn't defined many code points beyond the 16-bit range, so there was nothing out there to encode. But in 1996 it was clear that to encode all the CJK languages (and Vietnamese!) we'd start needing those 17+ bit code points. Sun updated Java to stop using UCS-2 as its default encoding and switched to UTF-16. NeXT did the same thing with NeXTSTEP (the precursor to iOS). Many JavaScript systems updated as well.
Now, here's what you've all been waiting for: the V8 runtime for JavaScript, which is what our node.js servers are built on, uses UCS-2 internally as its encoding and is not capable of handling any code point outside the base 16-bit range (we call that the BMP, or Basic Multilingual Plane). V8 fundamentally has no ability to represent the U+1F604 that we need to make that smiley.
Danny confirmed this with the node guys today. Matt Ranney is going to talk to the V8 guys about it and see what they want to do about it.
Wow, you read through all of that? You rock. I'm humbled that you gave me so much of your attention. I feel that we've accomplished something together. Together we are now part of the tiny community of people who actually know anything about Unicode. You may have guessed by now that I am a text geek. I have had to implement java.lang.String for three separate projects. I love this stuff. If you have any questions about anything I've written here, or want more info so that you don't have to read the 670-page Unicode 6.0 core specification (there are many, many addenda as well), then please don't hesitate to hit me up.
Love,
Chris
p.s. Remember that this narrative is almost all ASCII characters, and ASCII is a subset of UTF-8. That smiley is the only non-ASCII character. In UTF-8 this email (everything up to, but not including, my signature) is 8,553 bytes. In UTF-16 it is 17,102 bytes. In UTF-32 it would be 34,204 bytes. These space considerations are one of the many reasons we have multiple encodings.
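If you want to measure that kind of overhead yourself, a sketch like this works (assuming a Node.js with Buffer.byteLength; UTF-32 isn't built in, so it's approximated as 4 bytes per code point):

    // Compare the size of the same text in different encodings.
    // (Sketch; UTF-32 is approximated as 4 bytes per code point.)
    const text = 'That smiley 😄 is the only non-ASCII character here.';
    console.log(Buffer.byteLength(text, 'utf8'));    // UTF-8 size in bytes
    console.log(Buffer.byteLength(text, 'utf16le')); // UTF-16 size in bytes
    console.log([...text].length * 4);               // UTF-32 size in bytes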
@iskanbil

Hi Chris!

I found this post while searching for information on how to render emoji efficiently in Java. I am developing a chat application for both iOS and Android, and on Android I am having trouble identifying emoji Unicode characters in text messages (it works, but it takes a lot of time). I use the pattern "[\p{So}\p{Cn}]" to find the possible positions of emoji in text, but it is still very inefficient (I have a HashMap with more than 800 emoji code points from Unicode 6.0)....

Any ideas or any other pattern?
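One idea (sketched here in JavaScript rather than Java, purely for illustration) is to match UTF-16 surrogate pairs directly, since most of the emoji added in Unicode 6.0 live outside the BMP and therefore always encode as a pair; the same character-class approach should carry over to a Java Pattern:

    // Find candidate emoji by matching UTF-16 surrogate pairs, which covers
    // every code point outside the BMP (where most Unicode 6.0 emoji live).
    // (Sketch in JavaScript; the same character-class idea carries over to Java.)
    const surrogatePair = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
    const text = 'hello 😄 world';
    let match;
    while ((match = surrogatePair.exec(text)) !== null) {
      console.log(match.index, match[0]); // 6 '😄'
    }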

@e70838

e70838 commented Nov 26, 2012

UTF16, UCS2 and BMP are insane crap produced by crazy committees. I always use a UTF8 editor for all my programs and web pages. It supports full Unicode. It is easier to handle UTF8 than any of the other encodings (except ASCII or ISO 8859-*). The "wide" chars in C or Java were a stupid mistake that makes the life of developers a lot more difficult.

@broofa

broofa commented Nov 26, 2012

Not-very-related aside: the surrogate pair nonsense is one of the reasons why working with binary data in JS is such a hassle. Trying to [efficiently] pack a binary array into a JS string may result in invalid surrogate pairs, which end up throwing errors if/when you try to encode/decode them using encodeURIComponent()/decodeURIComponent().
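For example, a lone surrogate left over from naive packing is enough to make it throw (quick sketch):

    // A lone surrogate is not a valid code point, so percent-encoding it fails.
    const halfPair = String.fromCharCode(0xD83D); // high surrogate with no partner
    try {
      encodeURIComponent(halfPair);
    } catch (e) {
      console.log(e.name); // 'URIError' ("URI malformed")
    }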

@raggi

raggi commented Nov 26, 2012

@broofa and that actually isn't part of the URI standard, it was rejected. ECMA still specifies a unicode uri escape extension, but it's totally non-standard. In short, that shouldn't fail.

@pete

pete commented Nov 26, 2012

Well, the Unicode guys are really smart (UTF-8 is fucking brilliant, no, really!) and they invented a thing called a "surrogate pair".
I realize that I'm being a bit pedantic since "the history of Unicode and UTF-8" is a little off-topic. Apologies in advance for the digression, but it was actually Ken Thompson and Rob Pike who designed UTF-8, rather than the above-mentioned "Unicode guys": http://doc.cat-v.org/bell_labs/utf-8_history

@josephg

josephg commented Nov 26, 2012

@e70838: UCS2 made sense when we thought every Unicode character could be represented in 16 bits. When they couldn't, it was much easier to retrofit languages to use UTF16 than to swap to UTF8. New languages like Go support UTF8 out of the box. (Though it helps that Rob Pike, the mastermind behind Go, was also one of the guys who invented UTF8.)

@mranney I hear you. I'm having the same problem with a collaborative text library I'm writing. I need to be able to say "Insert 'Hi' at position 10 in a string". But 'position 10' changes depending on whether we're talking about UCS2 2-byte offsets or counting actual characters. I want cross-language support - so I should use UTF8 offsets. But that means I need to convert character offsets back and forth in Javascript.

I've ranted about this before, too: http://josephg.com/post/31707645955/string-length-lies
For myself, I ended up writing a string library in C which lets me efficiently (O(log n)) insert and delete characters using both UTF8 string offsets or wchar offsets: https://github.com/josephg/librope .
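A rough sketch of the kind of offset conversion involved (JavaScript, with made-up names, just to illustrate the mismatch):

    // Convert a code point index into the UTF-16 code unit index JS strings use.
    // (Sketch with made-up names, just to illustrate the offset mismatch.)
    function codeUnitOffset(str, codePointIndex) {
      let units = 0;
      let points = 0;
      while (points < codePointIndex && units < str.length) {
        const code = str.charCodeAt(units);
        // A high surrogate means this code point occupies two code units.
        units += (code >= 0xD800 && code <= 0xDBFF) ? 2 : 1;
        points += 1;
      }
      return units;
    }
    codeUnitOffset('ab😄cd', 3); // 4, because the smiley takes two code units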

@dgl

dgl commented Nov 26, 2012

Not sure why this has been linked to again, it looks like this was addressed in March: https://code.google.com/p/v8/issues/detail?id=761#c33

@xdamman

xdamman commented Nov 26, 2012

Thanks for this great post. We are also running into the same issue here at Storify, especially when people put Instagram pictures into their stories, since those tend to have a lot of iOS emoji in their captions. Our app runs on Node 0.8.14. We would definitely love it if the V8 guys could come up with a fix for that.

@emostar

emostar commented Nov 27, 2012

I believe the original creator was a company called J-Phone. They got bought out and became SoftBank later on.

@dylang

dylang commented Nov 27, 2012

Not an ideal solution, but could the client (browser/mobile app) translate emoji strings to some other encoding (like <U+1F604>) before sending to the server, and then reverse that when it sees those values come from the server?

@dylang

dylang commented Nov 27, 2012

Oops, my example encoding in my question above was decoded out by github. Here's a different example: $$U+1F604$$.
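A sketch of what that client-side translation could look like (hypothetical; the $$U+XXXX$$ marker and function names are made up, and it assumes String.fromCodePoint or an equivalent shim is available):

    // Escape astral code points (anything above U+FFFF) into a plain-ASCII
    // marker before sending, and reverse it on the way back.
    // (Hypothetical sketch; the $$U+XXXX$$ marker is made up.)
    function escapeAstral(str) {
      return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, function (pair) {
        const cp = (pair.charCodeAt(0) - 0xD800) * 0x400 +
                   (pair.charCodeAt(1) - 0xDC00) + 0x10000;
        return '$$U+' + cp.toString(16).toUpperCase() + '$$';
      });
    }
    function unescapeAstral(str) {
      return str.replace(/\$\$U\+([0-9A-F]+)\$\$/g, function (_, hex) {
        return String.fromCodePoint(parseInt(hex, 16));
      });
    }
    escapeAstral('😄');            // '$$U+1F604$$'
    unescapeAstral('$$U+1F604$$'); // the smiley again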

@fabianperez

Thanks so much for writing this up! Really interesting.

@jcoglan

jcoglan commented Nov 27, 2012

I'm slightly puzzled by this, since some time during the 0.7 series of Node.js a bunch of encoding problems got fixed. Specifically, a lot of test failures in faye-websocket related to 4-byte characters went away. These are tests that make sure that a Buffer that's converted to text using UTF-8 and then converted back to a Buffer for retransmission doesn't get altered in weird ways.

You can see the change here: note failing tests under 0.6 that work under 0.7.
http://faye.jcoglan.com/autobahn/clients/

Could you clarify how these issues are related and why the 0.7 change didn't fix the emoji problem?

@DTrejo

DTrejo commented Nov 27, 2012

I just read this. Happy to hear mranney is on it!

@DTrejo

DTrejo commented Nov 27, 2012

  • was on it & fixed it

@tvernum

tvernum commented Nov 27, 2012

@jcoglan

I think at least part of the answer to your question is that this email is from 2012-Jan-12, and the current version of Node at that time was 0.6.7.
I am led to believe this issue was fixed in the 0.7 series (but it may not have been)

@isaacs

isaacs commented Nov 27, 2012

@xdamman If you can provide an example of node handling emoji improperly, I'd be happy to look into it. Emoji utf8 characters should be handled properly in node now. They appear in the JS string as 2 characters, but otherwise they should behave properly.
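For example, a quick sketch (it assumes a Node new enough to have Buffer.from; older versions used new Buffer()):

    // The smiley survives a UTF-8 Buffer round-trip, but shows up as two
    // UTF-16 code units in the JS string.
    const original = '😄';
    const roundTripped = Buffer.from(original, 'utf8').toString('utf8');
    console.log(roundTripped === original); // true
    console.log(original.length);           // 2 (a surrogate pair)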

@al45tair

UTF16, UCS2 and BMP are insane crap produced by crazy committees. I always use a UTF8 editor for all my programs and web pages. It supports full Unicode. It is easier to handle UTF8 than any of the other encodings (except ASCII or ISO 8859-*). The "wide" chars in C or Java were a stupid mistake that makes the life of developers a lot more difficult.

Sorry, but that’s complete rubbish. UTF-16 is the preferred representation for Unicode text for a variety of very good reasons, which is why it’s used by the canonical reference implementation, the ICU project, as well as the Java and Objective-C runtimes.

As for C, the wide characters (and the associated wide and multibyte string routines) in C were never intended for use with Unicode; they were intended for use in East Asian countries with pre-existing standards. They were designed with the intent that a single wide character represented something that the end user would regard as a character (i.e. something that could be processed as an individual unit); this is not true even with UCS-4, and so using the wide character routines and wchar_t for Unicode (whether your wchar_t is 16 or 32-bit) is and always has been a mistake.

@dustin

dustin commented Nov 28, 2012

Sorry, but that’s complete rubbish. UTF-16 is the preferred representation for Unicode text for a variety of very good reasons, which is why it’s used by the canonical reference implementation, the ICU project, as well as the Java and Objective-C runtimes.

Can you name any of them? UCS-2 may have been justifiable at some point, but I can't think of any good reason for UTF-16 to exist anywhere.

@xdamman

xdamman commented Nov 29, 2012

@isaacs you are right, this issue has actually been solved with the migration from node 0.6.x to 0.8.x.

@apk

apk commented Nov 29, 2012

The only reason people are using UTF-16 (especially as a programmer-visible internal representation) is that it was originally UCS-2 in the same language, and we are stuck with the strange Java codepoint/index APIs that most people forget to use properly, because misuse only causes bugs in a few fringe languages (from a Western-centric viewpoint), as opposed to UTF-8, whose effects show up practically everywhere. Ironic that the emoji bring that problem back to the Western world. :-)

@jonathanwcrane

So psyched to learn about the history of Unicode and character encodings, especially the historical anomaly of the battle between competing Japanese wireless providers!

@MarcusJohnson91

Hey Chris, I just stumbled upon this page while trying to understand the UTF-8 encoding.

I'm trying to write a basic UTF-8 string handling library in C (my idea is to basically define utf8_t as an unsigned char pointer).

Anyway, I've skimmed through the Unicode PDF a few times, and tried googling it about a dozen times, and I can't find any good information on how the more complex features of this encoding are represented.

So, here goes.

How are emoji (flag emoji especially) represented? Are they 2 code points? How do you know there's a following code point? I know that usually the leading byte of a code point will set the top 4 bits depending on how many bytes are in the code point; does this work for emoji flags too?

Also, why is there sometimes a leading flag byte? How do you know when the flag byte will be separate, or part of the first coding byte?
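For what it's worth, a flag is two regional indicator code points, each carried in its own 4-byte UTF-8 sequence; there is no separate flag byte, and the leading byte of each sequence encodes that sequence's length just like any other code point. A quick sketch (in Node.js rather than C, purely for illustration):

    // A flag emoji is two regional-indicator code points, each encoded as its
    // own 4-byte UTF-8 sequence; there is no separate "flag byte".
    const flag = '🇺🇸';
    console.log([...flag].map(c => c.codePointAt(0).toString(16)));
    // [ '1f1fa', '1f1f8' ]  (REGIONAL INDICATOR SYMBOL LETTERS U and S)
    console.log(Buffer.byteLength(flag, 'utf8')); // 8 = two 4-byte sequences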
