Technical Reports |
Version | 1.0 |
Authors | Markus W. Scherer, Mark Davis |
Date | 2006-02-03 |
This Version | http://www.unicode.org/reports/tr40/tr40-1.html |
Previous Version | None |
Latest Version | http://www.unicode.org/reports/tr40/ |
Revision | 1 |
This document describes a Unicode compression scheme that is MIME-compatible, directly usable for emails, and preserves binary order (for databases and sorted lists). BOCU-1 is a MIME-compatible application of the Binary Ordered Compression for Unicode [BOCU] base algorithm.
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.
There are many environments which are bandwidth-limited, for example, wireless networks for handheld systems. In such environments it can be desirable to reduce the average ratio of bytes per code point compared to UTF-8, while still retaining essential features of a character encoding.
The Standard Compression Scheme for Unicode [SCSU] was created as a Unicode compression scheme with a byte-per-code point ratio similar to language-specific codepages. It has not been widely adopted although it fulfills the criteria for an IANA charset and is registered as one. Unlike [SCSU], BOCU-1 is suitable for MIME “text” media types, i.e., it can be used directly in emails and similar protocols. (See Section 2.3, MIME Compatibility.) [SCSU] requires a relatively complicated encoder design for good performance (see there for details about small vs. full encoders).
BOCU-1 combines the wide applicability of UTF-8 with the compactness of [SCSU]. It is useful for short strings and maintains code point order.
Binary Ordered Compression for Unicode [BOCU] employs a kind of difference encoding tailored to Unicode text. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF16 to 10FFFF16. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
Compression is achieved because normal Unicode text contains sequences of code points from the same script block. Small alphabets are allocated in blocks of 128 code points, so that the difference between two code points from such a block can be encoded with a single byte. Large script blocks like CJK Unihan are covered with 2-byte-encoded differences.
In order to minimize the absolute values of differences and thus improve the compression, the previous code point is adjusted to the middle of a 128-block. This covers a 128-block with differences of -64..+63 around the center code point instead of requiring differences of -127..+127 without the adjustment. A special range check for the larger CJK Unihan block ensures that it is covered with differences of about ±10500 which can be encoded with 2 bytes each.
BOCU preserves binary code point order by encoding differences into byte sequences in such a way that the lexical order of the byte sequences corresponds to the numeric order of the difference values.
Small differences are encoded with single bytes. Larger differences are encoded with multi-byte sequences consisting of a so-called “lead byte” and one or more “trail bytes”. Byte values for single and lead bytes are necessarily disjoint, but may overlap with byte values for trail bytes.
For BOCU, the center of the lead byte range is used for a zero difference, and lead bytes for positive and negative differences are arranged symmetrically around it, with the lead byte values farther from the middle for larger absolute differences.
BOCU-1 enhances the generic algorithm mainly by using a different set of byte value ranges to achieve MIME compatibility. It encodes U+0000..U+0020 (that is, C0 control codes U+0000..U+001F and U+0020 SPACE) directly with byte values 00..20. The ASCII byte values for common C0 control codes (especially 0D for CR and 0A for LF) are not used in the encoding of other characters at all. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.
As a result, only the code points above U+0020 are encoded with multi-byte sequences for their differences, and the lead byte range for these sequences starts at 0x21. Some byte values below 0x20 corresponding to rarely-used C0 control codes are also used as trail bytes in multi-byte sequences. Each code point is encoded with at most four bytes.
BOCU-1 uses FF as a special reset byte; see Section 2.4, Details.
For runs of code points from the same 128-block, exactly one byte is output per input code point. The same is true for the Hiragana range, even though it does not start at an even 128 boundary. For runs of CJK Unihan or Hangul code points, two bytes are output per input code point. These byte-per-code point ratios are the same as for [SCSU].
For runs of characters up to U+2DD4B, which includes almost all assigned Unicode characters, at most three bytes are output per input code point. The maximum number of bytes per code point is four.
The startup overhead is very low (similar to [SCSU]), which makes BOCU-1 useful for very short strings like individual names.
With BOCU-1, texts in languages with Latin-based scripts take about the same amount of space as with UTF-8, while texts in all other languages take about 25%-60% less space. Compared to UTF-16, texts in all languages with small character repertoires take approximately half as much space in BOCU-1. For large character sets, i.e., Chinese/Japanese/Korean, texts are about the same size for UTF-16 and BOCU-1.
The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike [SCSU] — which allows the compression of large, sorted lists of strings. This makes BOCU-1 suitable for use in databases to reduce storage requirements as described above.
For compatibility with MIME “text” media types it is a requirement to encode the C0 control codes NUL, CR and LF with the same byte values as in US-ASCII, and to not use those byte values in the encoding of any other character.
BOCU-1 fulfills this requirement, which makes it suitable for MIME “text” media types and directly usable in emails.
In addition, BOCU-1 also encodes nine other C0 control codes with the same byte values as in US-ASCII and does not use these byte values in any other way. In particular, the SUB control with its byte value of 1A is included in this set to avoid problems in DOS/Windows/OS/2 systems where 1A is interpreted as an end-of-file marker in text mode. As a result, BOCU-1 is generally “friendly” for some ASCII-based tools.
BOCU-1 encodes an input sequence of zero or more Unicode code points into an output sequence of zero or more bytes. It is stateful, that is, the encoding of a character depends on the preceding characters. Its inter-character state consists of a single integer. BOCU-1 is also deterministic, that is, the same input sequence always results in the same output sequence (unlike with [SCSU]).
Like [SCSU], BOCU-1 can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a “charset” according to IANA, and it is suitable for MIME “text” media types.
The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other’s encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike [SCSU]). This also allows some limited random access.
Byte values for single-byte codes and lead bytes overlap with trail bytes. This makes large ranges of lead and trail bytes available and allows the encoding of large integers with few bytes. However, unlike UTF-8, character boundaries cannot be determined in random access, except by backing up to a reset point. As a result, BOCU-1 is unsuitable for text processing (as opposed to storage and data transfer). An exception is that BOCU-1 encoded text can be efficiently compared in Unicode code point order because that order is preserved in the lexical order of BOCU-1 byte streams (as long as the FF reset byte is not used).
Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes. US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.
As a single/lead byte, byte value FF is used as a special “reset-state-only” command. It does not encode any code point. FF is also a normal trail byte. Having a “reset only” command allows simple concatenation of BOCU-1 byte streams. (All other BOCU-1 byte sequences would append some code point.) Using FF to reset the state breaks the ordering and the deterministic encoding! The use of FF resets is discouraged. Byte stream concatenation without resetting with FF requires to scan back to a C0 control whose byte value is not used for trail bytes (last known reset to initial state); then decode to the end of the first stream and encode the first non-U+0020 code point of the second stream according to the current state; then append the rest of the second stream. The same procedure could be used to remove an FF reset command.
An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
C1 | An implementation of a BOCU-1 encoder (a converter from Unicode text to BOCU-1) that claims conformance to BOCU-1 shall produce the same results as the algorithm published in Section 4.1, Encoding. Encoders must be able to encode any sequence of Unicode code points. A conformant encoder must not emit illegal or reserved combinations of bytes. |
C2 | An implementation of a BOCU-1 decoder (a converter from BOCU-1 to Unicode text) that claims conformance to BOCU-1 shall produce the same results as the algorithm published in Section 4.3, Decoding. The action of a conformant decoder on illegal or reserved input is undefined. |
C3 | When a charset label of “BOCU-1” is specified for a sequence of bytes, then the corresponding output values from a BOCU-1 decoder shall be interpreted as code points of the Unicode Standard, version 2.0.0 or later. |
BOCU-1 encodes exactly the Unicode code points U+0000..U+10FFFF. Supplementary characters must be encoded by their code points, not as pairs of surrogates.
Encode a sequence of Unicode code points as follows:
R1 Initialization: Define an inter-code point state variable prev
(an integer capable of holding code point values 0..0x10FFFF
) and
initialize it to 0x40
(in the middle of the ASCII block).
This variable essentially holds the previous code point, adjusted according to the algorithm below.
For each code point c
implement the following rules:
R2 C0 Control Codes: If c
is less than U+0020, then output one byte with the same value as c
and set prev=0x40
.
No further rules apply to these code points.
R3 Space: If c
is equal to U+0020, then
output one byte 0x20
but do not modify prev
. No
further rules apply to this code point.
Output a space in its ASCII encoding and do not change the state.
R4 Other Characters: If c
is greater than U+0020, then
subtract prev
from c
and encode the resulting
difference value in 1..4 bytes according to Section 4.2, Difference Encoding.
R5 Adjustment: After applying rule R4, adjust prev
as follows.
If c
is a code point in the Hiragana
range U+3040..U+309F inclusive, then set prev
to 0x3070
.
If c
is a code point in the Unihan
range U+4E00..U+9FA5, then set prev
to 0x7711
.
If c
is a code point in the Hangul
syllables range U+AC00..U+D7A3, then set prev
to 0xc1d1
.
Otherwise, set prev
to the middle of a
128-block: prev=(c&~0x7f)+0x40
This adjustment will usually set prev
to the middle of a
128-block of code points which works well with the assignment pattern of
scripts in Unicode. There are custom adjustments for Hiragana, which is
not 128-aligned, and for CJK Unihan and Hangul syllables, which have much
larger blocks.
The Unihan adjustment makes U+4E00 accessible exactly with the largest negative two-byte-encoded difference.
Like R4, R5 only applies to code points above U+0020.
Any code point above U+0020 (space) is encoded with a multi-byte sequence
representing the difference between the code point value and the prev
state variable (see rule R4 above).
R4.1 Single-Byte Differences: Differences of -0x40..+0x3F (-64..+63) are encoded with single bytes 0x50..0xCF by adding 0x90.
Differences outside of this range are encoded with multi-byte sequences in a way similar to decimal or hexadecimal numbers, but with a number base of 243 (=256-13) instead of 10 or 16. Lead bytes are chosen such that byte sequences for lower difference values sort lexically before those for higher difference values.
Encode such differences according to the following rules:
R4.2 Values Lookup: Determine values for the encoding of the difference value
diff
according to
table 1:
Diff. value | Number of trail bytes |
Base lead byte value |
Diff. offset |
---|---|---|---|
2DD0C..10FFFF | 3 | FE | 2DD0C |
2911..2DD0B | 2 | FB | 2911 |
40..2910 | 1 | D0 | 40 |
-40..3F | 0 | 90 | 0 |
-2911..-41 | 1 | 50 | -40 |
-2DD0C..-2912 | 2 | 25 | -2911 |
-10FFFF..-2DD0D | 3 | 22 | -2DD0C |
R4.3 Offset: Subtract the difference offset (bullet c. of R4.2) from the current difference value.
R4.4 Trail Bytes: For each trail byte from last to first:
t
) of dividing
the difference value (diff
) by 243 (the number of
possible trail byte values).
quotient=diff/243
t=diff%243
(modulo,
division remainder)diff=(quotient*243)+t
;
adjust if necessary by adding 243 to t
and subtracting 1 from the
quotient
.t
with a trail byte according to table 2.diff=quotient
Division remainder t |
Add | Trail byte value |
---|---|---|
14..F2 | 0D | 21..FF |
10..13 | 0C | 1C..1F |
06..0F | 0A | 10..19 |
00..05 | 01 | 01..06 |
R4.5 Lead Byte: Encode the lead byte by adding the base lead byte value
(see bullet b. of R4.2) to
the final diff
value (the last quotient) from R4.4.
R4.6 Output: Output the lead and trail bytes.
A BOCU-1 byte stream is decoded (converted back to Unicode text) by reversing the encoding algorithm.
RD1 Initialization: Define an inter-code point state variable prev
(an integer capable of holding code point values 0..0x10FFFF
) and
initialize it to 0x40
.
Repeat the following rules for each byte sequence in the input. Define a
variable lb
for the lead byte of a byte sequence.
RD2 C0 Control Codes: If lb
is less than 0x20
,
then output one code point with the same value as lb
and set prev=0x40
.
Continue with the next byte sequence.
lb
is equal to 0x20
, then
output one code point U+0020 but do not modify prev
. Continue
with the next byte sequence.RD4 Single Byte: If lb
is
in the range 0x50..0xCF
inclusive, then compute the output code point as
prev+(lb-0x90)
and adjust prev
according to encoding rule R5.
Continue with the next byte sequence.
RD5 Lead Byte: If lb
is in the range 0x21..0x4F
inclusive, or in the range 0xD0..0xFE
inclusive, then it is a lead
byte for a multi-byte sequence. Reverse the
difference encoding and compute the output code point by adding the
resulting difference value to prev
:
n
be the number of trail bytes for lb
.lb
and assign
the result to a code point variable c
.c
with 243n
.c
the difference offset corresponding to lb
.n
trail bytes:
t
corresponding to
the trail byte.t
with the multiplier according to table
3.c
.prev
to c
and output the resulting Unicode code
point c
.prev
according to encoding rule R5.RD6 Reset Byte: If lb
is equal to 0x20
, then
do not output any code point but set prev=0x40
. Continue with the
next byte sequence.
Lead byte |
Number of trail bytes |
Base lead byte value |
Diff. offset | Multiplier for trail byte 1 | 2 | 3 |
---|---|---|---|---|---|---|
FE | 3 | FE | 2DD0C | 2432 | 243 | 1 |
FB..FD | 2 | FB | 2911 | 243 | 1 | - |
D0..FA | 1 | D0 | 40 | 1 | - | - |
50..CF | 0 | 90 | 0 | - | - | - |
25..4F | 1 | 50 | -40 | 1 | - | - |
22..24 | 2 | 25 | -2911 | 243 | 1 | - |
21 | 3 | 22 | -2DD0C | 2432 | 243 | 1 |
Trail byte value | Subtract | t |
---|---|---|
21..FF | 0D | 14..F2 |
1C..1F | 0C | 10..13 |
10..19 | 0A | 06..0F |
01..06 | 01 | 00..05 |
While there is no illegal lead or single byte, the bytes 00, 07..0F, 1A, 1B and 20 are illegal as trail bytes. In addition, multi-byte sequences that result in code points outside of the Unicode range U+0000..U+10FFFF are illegal. Recovery from illegal input is not specified.
Included with this document is sample C code containing a complete but unoptimized implementation of BOCU-1 together with a consistency test and simple file conversion code.
This code uses International Components for Unicode
[ICU] standard
headers and the one implementation file icu/source/common/utf_impl.c
.
(It is not necessary to link the entire [ICU] common library.) This is for
convenience in the surrounding test functions and not necessary for the core
BOCU-1 functions. These headers and implementation file provide the following:
This code is under the X license (ICU version) [ICULicense].
Files:
main()
function, see below)A complete, compiled sample executable for Windows® from this source code is available for download. Aside from basic implementation and consistency tests, this also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of “?” or “-h” for usage.
This is a performance comparison between BOCU-1, UTF-8, and [SCSU] when implemented as [ICU] converters, which convert to and from UTF-16. The time is for roundtrip conversions from the internal UTF-16 form to the encoding and back. Values are relative to UTF-8.
BOCU-1 | SCSU | |||
---|---|---|---|---|
Languages | Size of text | Time to convert to/from UTF-16 |
Size of text | Time to convert to/from UTF-16 |
English/French | 100% | 160..170% | 100% | 125% |
Greek/Russian/Arabic/Hebrew | 60% | 65..70% | 55% | 70% |
Hindi | 40% | 45% | 40% | 45% |
Thai (see below) | 45% | 60% | 40% | 55% |
Japanese | 60% | 150% | 55% | 110% |
Korean | 75% | 155% | 85% | 70% |
Chinese | 75% | 165% | 70% | 65% |
(Smaller percentages are better. Percentages are rounded to the nearest 5%.)
The compression ratio is smaller for web pages (lots of ASCII in HTML). The performance difference tends to be smaller for smaller buffers. When the text is transmitted between machines (emails, web pages), then the transmission time may swamp the conversion time. Smaller text will then transmit faster.
The texts are the “What is Unicode” pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar. The Thai text, th18057.txt, has a different structure: It is a Thai word list from [ICU]’s test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF.
This comparison uses full-fledged [ICU] converters for UTF-8, [SCSU] and BOCU-1. “Full-fledged [ICU] converter” means that this is with the [ICU] conversion API, designed for external encodings, as used e.g. by an XML parser or web browser.
The [ICU] converter code for [SCSU] and BOCU-1 that is tested here is part of [ICU] 2.2. The [SCSU] converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. The test machine is a Pentium 2 laptop.
Letter regarding licensing of IBM Technology. Hardcopy is on file with the Chair of the UTC.
For information, see the Unicode Consortium Patent Policy.
January 23, 2006
Ms. Lisa Moore, UTC Chair
The Unicode Consortium
P.O. Box 391476
Mountain View, CA 94039-1476
VIA: US Mail and e-mail
SUBJECT: Binary Ordered Compression for Unicode (BOCU)
IBM would like to bring to your attention, US Patent 6737994 "Binary-Ordered Compression For Unicode", which may contain claims necessary to, or which may facilitate the implementation of, BOCU-1. Because we believe that access to this patent may be important to the successful implementation of BOCU-1, and although not required to do so by the Unicode Consortium’s IP policies, IBM would like to offer a royalty free license to this patent upon request to implementers of a fully compliant version of BOCU-1.
A party wishing to request a license should contact:
IBM Director of Licensing
North Castle Drive
Armonk, NY 10504
FAX: 914 765 4420
Sincerely,
Marcia Courtemanche
Program Director
Intellectual Property and Standards
[BOCU] | Binary Ordered Compression for Unicode http://icu.sourceforge.net/docs/papers/binary_ordered_compression_for_unicode.html Describes the base algorithm, of which BOCU-1 is a profile. It serves as background information and is not necessary for the specification of BOCU-1. |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Feedback] | Reporting Errors and Requesting
Information Online http://www.unicode.org/reporting.html |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[ICU] |
International Components for Unicode http://ibm.com/software/globalization/icu |
[ICULicense] | The X license, modified for ICU http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html |
[Notes] | Unicode Technical Reports http://www.unicode.org/notes/ For information on the status for technical notes, and for a list of technical notes. |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[SCSU] | Unicode Technical Standard #6: A Standard Compression Scheme for Unicode http://www.unicode.org/reports/tr6/ |
[Unicode] | The Unicode Standard For the latest version see: http://www.unicode.org/versions/latest/. For the current version see: http://www.unicode.org/versions/Unicode4.1.0/. For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1). |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
Thanks to Doug Ewell and Asmus Freytag for valuable feedback on this document.
The following summarizes modifications from the previous version of this document.
Revision 1:
Copyright © 2002-2006 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.