Proposed Draft Unicode Technical Standard #40

BOCU-1:
MIME-Compatible Unicode Compression

Version	1.0
Authors	Markus W. Scherer, Mark Davis
Date	2006-02-03
This Version	http://www.unicode.org/reports/tr40/tr40-1.html
Previous Version	None
Latest Version	http://www.unicode.org/reports/tr40/
Revision	1

Summary

This document describes a Unicode compression scheme that is MIME-compatible, directly usable for emails, and preserves binary order (for databases and sorted lists). BOCU-1 is a MIME-compatible application of the Binary Ordered Compression for Unicode [BOCU] base algorithm.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Introduction
- 1.1 Definitions
- 1.2 Generic BOCU
- 1.3 BOCU-1 Profile
2 Features of BOCU-1
- 2.1 Compression
- 2.2 Binary Order
- 2.3 MIME Compatibility
- 2.4 Details
- 2.5 Signature Byte Sequence
3 Conformance
4 BOCU-1 Algorithm
- 4.1 Encoding
- 4.2 Difference Encoding
- 4.3 Decoding
5 Sample C Code
6 BOCU-1 Performance
- 6.1 SCSU vs. BOCU-1
- 6.2 Test Setup
Intellectual Property
References
Acknowledgments
Modifications

1 Introduction

The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.

There are many environments which are bandwidth-limited, for example, wireless networks for handheld systems. In such environments it can be desirable to reduce the average ratio of bytes per code point compared to UTF-8, while still retaining essential features of a character encoding.

The Standard Compression Scheme for Unicode [SCSU] was created as a Unicode compression scheme with a byte-per-code point ratio similar to language-specific codepages. It has not been widely adopted although it fulfills the criteria for an IANA charset and is registered as one. Unlike [SCSU], BOCU-1 is suitable for MIME “text” media types, i.e., it can be used directly in emails and similar protocols. (See Section 2.3, MIME Compatibility.) [SCSU] requires a relatively complicated encoder design for good performance (see there for details about small vs. full encoders).

BOCU-1 combines the wide applicability of UTF-8 with the compactness of [SCSU]. It is useful for short strings and maintains code point order.

1.1 Definitions

D1 128-block: A 128-block is a range of 128 code points starting on an even 128 boundary. This is not to be confused with the term block in the generic sense as used by the Unicode Standard.

1.2 Generic BOCU

Binary Ordered Compression for Unicode [BOCU] employs a kind of difference encoding tailored to Unicode text. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF₁₆ to 10FFFF₁₆. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.

Compression is achieved because normal Unicode text contains sequences of code points from the same script block. Small alphabets are allocated in blocks of 128 code points, so that the difference between two code points from such a block can be encoded with a single byte. Large script blocks like CJK Unihan are covered with 2-byte-encoded differences.

In order to minimize the absolute values of differences and thus improve the compression, the previous code point is adjusted to the middle of a 128-block. This covers a 128-block with differences of -64..+63 around the center code point instead of requiring differences of -127..+127 without the adjustment. A special range check for the larger CJK Unihan block ensures that it is covered with differences of about ±10500 which can be encoded with 2 bytes each.

BOCU preserves binary code point order by encoding differences into byte sequences in such a way that the lexical order of the byte sequences corresponds to the numeric order of the difference values.

Small differences are encoded with single bytes. Larger differences are encoded with multi-byte sequences consisting of a so-called “lead byte” and one or more “trail bytes”. Byte values for single and lead bytes are necessarily disjoint, but may overlap with byte values for trail bytes.

For BOCU, the center of the lead byte range is used for a zero difference, and lead bytes for positive and negative differences are arranged symmetrically around it, with the lead byte values farther from the middle for larger absolute differences.

1.3 BOCU-1 Profile

BOCU-1 enhances the generic algorithm mainly by using a different set of byte value ranges to achieve MIME compatibility. It encodes U+0000..U+0020 (that is, C0 control codes U+0000..U+001F and U+0020 SPACE) directly with byte values 00..20. The ASCII byte values for common C0 control codes (especially 0D for CR and 0A for LF) are not used in the encoding of other characters at all. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.

As a result, only the code points above U+0020 are encoded with multi-byte sequences for their differences, and the lead byte range for these sequences starts at 0x21. Some byte values below 0x20 corresponding to rarely-used C0 control codes are also used as trail bytes in multi-byte sequences. Each code point is encoded with at most four bytes.

BOCU-1 uses FF as a special reset byte; see Section 2.4, Details.

2 Features of BOCU-1

2.1 Compression

For runs of code points from the same 128-block, exactly one byte is output per input code point. The same is true for the Hiragana range, even though it does not start at an even 128 boundary. For runs of CJK Unihan or Hangul code points, two bytes are output per input code point. These byte-per-code point ratios are the same as for [SCSU].

For runs of characters up to U+2DD4B, which includes almost all assigned Unicode characters, at most three bytes are output per input code point. The maximum number of bytes per code point is four.

The startup overhead is very low (similar to [SCSU]), which makes BOCU-1 useful for very short strings like individual names.

With BOCU-1, texts in languages with Latin-based scripts take about the same amount of space as with UTF-8, while texts in all other languages take about 25%-60% less space. Compared to UTF-16, texts in all languages with small character repertoires take approximately half as much space in BOCU-1. For large character sets, i.e., Chinese/Japanese/Korean, texts are about the same size for UTF-16 and BOCU-1.

2.2 Binary Order

The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike [SCSU] — which allows the compression of large, sorted lists of strings. This makes BOCU-1 suitable for use in databases to reduce storage requirements as described above.

2.3 MIME Compatibility

For compatibility with MIME “text” media types it is a requirement to encode the C0 control codes NUL, CR and LF with the same byte values as in US-ASCII, and to not use those byte values in the encoding of any other character.

BOCU-1 fulfills this requirement, which makes it suitable for MIME “text” media types and directly usable in emails.

In addition, BOCU-1 also encodes nine other C0 control codes with the same byte values as in US-ASCII and does not use these byte values in any other way. In particular, the SUB control with its byte value of 1A is included in this set to avoid problems in DOS/Windows/OS/2 systems where 1A is interpreted as an end-of-file marker in text mode. As a result, BOCU-1 is generally “friendly” for some ASCII-based tools.

2.4 Details

BOCU-1 encodes an input sequence of zero or more Unicode code points into an output sequence of zero or more bytes. It is stateful, that is, the encoding of a character depends on the preceding characters. Its inter-character state consists of a single integer. BOCU-1 is also deterministic, that is, the same input sequence always results in the same output sequence (unlike with [SCSU]).

Like [SCSU], BOCU-1 can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a “charset” according to IANA, and it is suitable for MIME “text” media types.

The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other’s encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike [SCSU]). This also allows some limited random access.

Byte values for single-byte codes and lead bytes overlap with trail bytes. This makes large ranges of lead and trail bytes available and allows the encoding of large integers with few bytes. However, unlike UTF-8, character boundaries cannot be determined in random access, except by backing up to a reset point. As a result, BOCU-1 is unsuitable for text processing (as opposed to storage and data transfer). An exception is that BOCU-1 encoded text can be efficiently compared in Unicode code point order because that order is preserved in the lexical order of BOCU-1 byte streams (as long as the FF reset byte is not used).

Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes. US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.

As a single/lead byte, byte value FF is used as a special “reset-state-only” command. It does not encode any code point. FF is also a normal trail byte. Having a “reset only” command allows simple concatenation of BOCU-1 byte streams. (All other BOCU-1 byte sequences would append some code point.) Using FF to reset the state breaks the ordering and the deterministic encoding! The use of FF resets is discouraged. Byte stream concatenation without resetting with FF requires to scan back to a C0 control whose byte value is not used for trail bytes (last known reset to initial state); then decode to the end of the first stream and encode the first non-U+0020 code point of the second stream according to the current state; then append the rest of the second stream. The same procedure could be used to remove an FF reset command.

2.5 Signature Byte Sequence

An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.

3 Conformance

C1	An implementation of a BOCU-1 encoder (a converter from Unicode text to BOCU-1) that claims conformance to BOCU-1 shall produce the same results as the algorithm published in Section 4.1, Encoding. Encoders must be able to encode any sequence of Unicode code points. A conformant encoder must not emit illegal or reserved combinations of bytes.
C2	An implementation of a BOCU-1 decoder (a converter from BOCU-1 to Unicode text) that claims conformance to BOCU-1 shall produce the same results as the algorithm published in Section 4.3, Decoding. The action of a conformant decoder on illegal or reserved input is undefined.
C3	When a charset label of “BOCU-1” is specified for a sequence of bytes, then the corresponding output values from a BOCU-1 decoder shall be interpreted as code points of the Unicode Standard, version 2.0.0 or later.

4 BOCU-1 Algorithm

4.1 Encoding

BOCU-1 encodes exactly the Unicode code points U+0000..U+10FFFF. Supplementary characters must be encoded by their code points, not as pairs of surrogates.

Encode a sequence of Unicode code points as follows:

R1 Initialization: Define an inter-code point state variable prev (an integer capable of holding code point values 0..0x10FFFF) and initialize it to 0x40 (in the middle of the ASCII block).

This variable essentially holds the previous code point, adjusted according to the algorithm below.

For each code point c implement the following rules:

R2 C0 Control Codes: If c is less than U+0020, then output one byte with the same value as c and set prev=0x40. No further rules apply to these code points.

Output C0 control codes in their ASCII encoding and reset the state to the ASCII range.

R3 Space: If c is equal to U+0020, then output one byte 0x20 but do not modify prev. No further rules apply to this code point.

Output a space in its ASCII encoding and do not change the state.

R4 Other Characters: If c is greater than U+0020, then subtract prev from c and encode the resulting difference value in 1..4 bytes according to Section 4.2, Difference Encoding.

R5 Adjustment: After applying rule R4, adjust prev as follows.

If c is a code point in the Hiragana range U+3040..U+309F inclusive, then set prev to 0x3070.
If c is a code point in the Unihan range U+4E00..U+9FA5, then set prev to 0x7711.
If c is a code point in the Hangul syllables range U+AC00..U+D7A3, then set prev to 0xc1d1.
Otherwise, set prev to the middle of a 128-block: prev=(c&~0x7f)+0x40

This adjustment will usually set prev to the middle of a 128-block of code points which works well with the assignment pattern of scripts in Unicode. There are custom adjustments for Hiragana, which is not 128-aligned, and for CJK Unihan and Hangul syllables, which have much larger blocks.
The Unihan adjustment makes U+4E00 accessible exactly with the largest negative two-byte-encoded difference.
Like R4, R5 only applies to code points above U+0020.

4.2 Difference Encoding

Any code point above U+0020 (space) is encoded with a multi-byte sequence representing the difference between the code point value and the prev state variable (see rule R4 above).

R4.1 Single-Byte Differences: Differences of -0x40..+0x3F (-64..+63) are encoded with single bytes 0x50..0xCF by adding 0x90.

Differences outside of this range are encoded with multi-byte sequences in a way similar to decimal or hexadecimal numbers, but with a number base of 243 (=256-13) instead of 10 or 16. Lead bytes are chosen such that byte sequences for lower difference values sort lexically before those for higher difference values.

Encode such differences according to the following rules:

R4.2 Values Lookup: Determine values for the encoding of the difference value diff according to table 1:

The number of trail bytes.
The base lead byte value.
A difference offset: For a positive difference value, the lowest difference value that can be encoded with the same number of trail bytes; for a negative difference value, the lowest difference value that can be encoded with one fewer trail byte.

Table 1. Lookup values for the difference encoding
Diff. value	Number of trail bytes	Base lead byte value	Diff. offset
2DD0C..10FFFF	3	FE	2DD0C
2911..2DD0B	2	FB	2911
40..2910	1	D0	40
-40..3F	0	90	0
-2911..-41	1	50	-40
-2DD0C..-2912	2	25	-2911
-10FFFF..-2DD0D	3	22	-2DD0C

R4.3 Offset: Subtract the difference offset (bullet c. of R4.2) from the current difference value.

This ensures that each difference value has only one possible encoding (avoiding a shortest-form problem) and extends the ranges of difference values encodable with a given number of bytes.

R4.4 Trail Bytes: For each trail byte from last to first:

Calculate the quotient and remainder (t) of dividing the difference value (diff) by 243 (the number of possible trail byte values).
- quotient=diff/243
- t=diff%243 (modulo, division remainder)
- Make sure that the modulo/division operations always yield non-negative modulo values and that the modulo and quotient values together compute back to the original difference value according to diff=(quotient*243)+t; adjust if necessary by adding 243 to t and subtracting 1 from the quotient.
Encode the division remainder t with a trail byte according to table 2.
Continue with the quotient (which will always be negative if the original difference was negative): diff=quotient

Table 2: Trail byte values
Division remainder `t`	Add	Trail byte value
14..F2	0D	21..FF
10..13	0C	1C..1F
06..0F	0A	10..19
00..05	01	01..06

R4.5 Lead Byte: Encode the lead byte by adding the base lead byte value (see bullet b. of R4.2) to the final diff value (the last quotient) from R4.4.

R4.6 Output: Output the lead and trail bytes.

4.3 Decoding

A BOCU-1 byte stream is decoded (converted back to Unicode text) by reversing the encoding algorithm.

RD1 Initialization: Define an inter-code point state variable prev (an integer capable of holding code point values 0..0x10FFFF) and initialize it to 0x40.

Repeat the following rules for each byte sequence in the input. Define a variable lb for the lead byte of a byte sequence.

RD2 C0 Control Codes: If lb is less than 0x20, then output one code point with the same value as lb and set prev=0x40. Continue with the next byte sequence.

RD3 Space: If lb is equal to 0x20, then output one code point U+0020 but do not modify prev. Continue with the next byte sequence.

RD4 Single Byte: If lb is in the range 0x50..0xCF inclusive, then compute the output code point as prev+(lb-0x90) and adjust prev according to encoding rule R5. Continue with the next byte sequence.

RD5 Lead Byte: If lb is in the range 0x21..0x4F inclusive, or in the range 0xD0..0xFE inclusive, then it is a lead byte for a multi-byte sequence. Reverse the difference encoding and compute the output code point by adding the resulting difference value to prev:

Determine decoding values for the lead byte from table 3. Let n be the number of trail bytes for lb.
Subtract the base lead byte value from lb and assign the result to a code point variable c.
Multiply c with 243ⁿ.
Add to c the difference offset corresponding to lb.
For each of the n trail bytes:
1. Use table 4 to find the value t corresponding to the trail byte.
2. Multiply t with the multiplier according to table 3.
3. Add this value to c.
Add prev to c and output the resulting Unicode code point c.
Adjust prev according to encoding rule R5.
Continue with the next byte sequence.

RD6 Reset Byte: If lb is equal to 0x20, then do not output any code point but set prev=0x40. Continue with the next byte sequence.

FF is a “reset-state-only” byte.

Table 3. Lookup values for decoding
Lead byte	Number of trail bytes	Base lead byte value	Diff. offset	Multiplier for trail byte 1	2	3
FE	3	FE	2DD0C	243²	243	1
FB..FD	2	FB	2911	243	1	-
D0..FA	1	D0	40	1	-	-
50..CF	0	90	0	-	-	-
25..4F	1	50	-40	1	-	-
22..24	2	25	-2911	243	1	-
21	3	22	-2DD0C	243²	243	1

Table 4: Trail byte decoding values
Trail byte value	Subtract	`t`
21..FF	0D	14..F2
1C..1F	0C	10..13
10..19	0A	06..0F
01..06	01	00..05

While there is no illegal lead or single byte, the bytes 00, 07..0F, 1A, 1B and 20 are illegal as trail bytes. In addition, multi-byte sequences that result in code points outside of the Unicode range U+0000..U+10FFFF are illegal. Recovery from illegal input is not specified.

5 Sample C Code

Included with this document is sample C code containing a complete but unoptimized implementation of BOCU-1 together with a consistency test and simple file conversion code.

This code uses International Components for Unicode [ICU] standard headers and the one implementation file icu/source/common/utf_impl.c. (It is not necessary to link the entire [ICU] common library.) This is for convenience in the surrounding test functions and not necessary for the core BOCU-1 functions. These headers and implementation file provide the following:

Include inttypes.h or define its types.
Define UChar for UTF-16 as an unsigned 16-bit type (wchar_t or uint16_t).
Define UTF* macros to handle reading and writing of in-process UTF-8/16 strings.

This code is under the X license (ICU version) [ICULicense].

Files:

bocu1.h (constants and macros)
bocu1.c (encoder and decoder functions)
bocu1tst.c (test code with main() function, see below)

A complete, compiled sample executable for Windows® from this source code is available for download. Aside from basic implementation and consistency tests, this also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of “?” or “-h” for usage.

6 BOCU-1 Performance

This is a performance comparison between BOCU-1, UTF-8, and [SCSU] when implemented as [ICU] converters, which convert to and from UTF-16. The time is for roundtrip conversions from the internal UTF-16 form to the encoding and back. Values are relative to UTF-8.

	BOCU-1		SCSU
Languages	Size of text	Time to convert to/from UTF-16	Size of text	Time to convert to/from UTF-16
English/French	100%	160..170%	100%	125%
Greek/Russian/Arabic/Hebrew	60%	65..70%	55%	70%
Hindi	40%	45%	40%	45%
Thai (see below)	45%	60%	40%	55%
Japanese	60%	150%	55%	110%
Korean	75%	155%	85%	70%
Chinese	75%	165%	70%	65%

(Smaller percentages are better. Percentages are rounded to the nearest 5%.)

The compression ratio is smaller for web pages (lots of ASCII in HTML). The performance difference tends to be smaller for smaller buffers. When the text is transmitted between machines (emails, web pages), then the transmission time may swamp the conversion time. Smaller text will then transmit faster.

6.1 SCSU vs. BOCU-1

[SCSU] and BOCU-1 have similar compression characteristics.
[SCSU] tends to be faster than BOCU-1.
BOCU-1 is applicable in many more environments (it is MIME-friendly, CVS-friendly, and has the same binary order as UTF-8).
The optimized [ICU] converter for BOCU-1 is somewhat less complex than the one for [SCSU], and the output is deterministic, unlike with [SCSU].

6.2 Test Setup

The texts are the “What is Unicode” pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar. The Thai text, th18057.txt, has a different structure: It is a Thai word list from [ICU]’s test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF.

This comparison uses full-fledged [ICU] converters for UTF-8, [SCSU] and BOCU-1. “Full-fledged [ICU] converter” means that this is with the [ICU] conversion API, designed for external encodings, as used e.g. by an XML parser or web browser.

The [ICU] converter code for [SCSU] and BOCU-1 that is tested here is part of [ICU] 2.2. The [SCSU] converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. The test machine is a Pentium 2 laptop.

Intellectual Property

Letter regarding licensing of IBM Technology. Hardcopy is on file with the Chair of the UTC.

For information, see the Unicode Consortium Patent Policy.

January 23, 2006

Ms. Lisa Moore, UTC Chair
The Unicode Consortium
P.O. Box 391476
Mountain View, CA 94039-1476

VIA: US Mail and e-mail

SUBJECT: Binary Ordered Compression for Unicode (BOCU)

IBM would like to bring to your attention, US Patent 6737994 "Binary-Ordered Compression For Unicode", which may contain claims necessary to, or which may facilitate the implementation of, BOCU-1. Because we believe that access to this patent may be important to the successful implementation of BOCU-1, and although not required to do so by the Unicode Consortium’s IP policies, IBM would like to offer a royalty free license to this patent upon request to implementers of a fully compliant version of BOCU-1.

A party wishing to request a license should contact:

IBM Director of Licensing
North Castle Drive
Armonk, NY 10504

FAX: 914 765 4420

Sincerely,

Marcia Courtemanche
Program Director
Intellectual Property and Standards

References

[BOCU]	Binary Ordered Compression for Unicode http://icu.sourceforge.net/docs/papers/binary_ordered_compression_for_unicode.html Describes the base algorithm, of which BOCU-1 is a profile. It serves as background information and is not necessary for the specification of BOCU-1.
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[ICU]	International Components for Unicode http://ibm.com/software/globalization/icu
[ICULicense]	The X license, modified for ICU http://dev.icu-project.org/cgi-bin/viewcvs.cgi/checkout/icu/license.html
[Notes]	Unicode Technical Reports http://www.unicode.org/notes/ For information on the status for technical notes, and for a list of technical notes.
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[SCSU]	Unicode Technical Standard #6: A Standard Compression Scheme for Unicode http://www.unicode.org/reports/tr6/
[Unicode]	The Unicode Standard For the latest version see: http://www.unicode.org/versions/latest/. For the current version see: http://www.unicode.org/versions/Unicode4.1.0/. For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1).
[Versions]	Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Acknowledgments

Thanks to Doug Ewell and Asmus Freytag for valuable feedback on this document.

Modifications

The following summarizes modifications from the previous version of this document.

Revision 1:

Initial version as a Unicode Technical Standard. An earlier version of this specification was published as a Unicode Technical Note [Notes]. Both specifications are exactly equivalent.
Changes from Unicode Technical Note #6:
Added a formal description and updated links.
Added a Conformance section as required for Unicode Technical Standards.
Minor edits to conform to the format for Unicode Technical Standards.
The sample C code has minor fixes for 16-bit compilers.

Copyright © 2002-2006 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.