Unicode Utilities: Description and Index

Boundaries

Breaks Demonstrates different boundaries within text.
- Enter the sample text.
- Pick the kind of boundaries, or hit Test.
Regex Shows transformation of (Java) Regex pattern to support Unicode.
- Enter the regex pattern
- Change the sample text if desired.
- Click Show Modified Regex Pattern
You'll then see the modified pattern. It is often much larger than the original, but any reasonable regex engine should compile the expanded character classes without trouble. Below that, you'll see a sample of how the expression works: it is used to find substrings of the sample text, which are then underlined.

Properties

Unicode Property Demo window
- Enter a character code on the right side, and hit Show. You'll see the properties for that character (where they have non-default values).
- If you click on any property (like Age), you'll see a list of all the properties and their values in the Unicode Property List window.
- If you click on any property value in either of these two windows, like 4.0.0.0 for Age, you'll see the characters with that property in the UnicodeSet Demo window.
UnicodeSet Demo window
- You can put in arbitrary UnicodeSets, allowing boolean combinations of any of the property+value combinations in the Unicode Property List window
- If you click on Compare at the top, you can compare any two UnicodeSets.

Transforms

Transform Demonstrates the effect of transform rules on sample text.
- Enter the Transform Rules
- Enter Sample Text
- Hit Show Transform
- Examples:
The rules can either be IDs (simple or compound) or general rules. To see a list of all the IDs, see ID List.

The sample can either be a piece of text or a UnicodeSet. In the latter case, only characters that are affected by the transform are shown. They are listed alphabetically by the result of the transform, with multiple entries shown in a UnicodeSet.
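The listing described above, where only affected characters are shown and grouped by result, can be sketched roughly as follows. This is a minimal illustration in Python, not the demo's actual code: `affected_by` is a hypothetical helper, and `str.lower` stands in for a real transform.

```python
from collections import defaultdict

def affected_by(transform, chars):
    """Group the characters changed by `transform`, keyed and sorted by result."""
    groups = defaultdict(list)
    for ch in sorted(set(chars)):
        result = transform(ch)
        if result != ch:              # unaffected characters are not listed
            groups[result].append(ch)
    return sorted(groups.items())     # ordered by the result of the transform

# Hypothetical sample set: a few Latin letters plus U+212A KELVIN SIGN.
for result, sources in affected_by(str.lower, "AaBbKk\u212A"):
    print(result, [f"U+{ord(ch):04X}" for ch in sources])
```

Characters that map to the same result (here K and KELVIN SIGN) end up in one group, which corresponds to the multiple entries shown together in a UnicodeSet.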

UnicodeSets use regular-expression syntax to allow for arbitrary set operations (Union, Intersection, Difference) on sets of Unicode characters. The base sets can be specified explicitly, such as [a-m w-z], or using Unicode Properties like [[:script=arabic:]&[:decompositiontype=canonical:]]. The latter set gets the Arabic script characters that have a canonical decomposition. The properties can be specified either with Perl-style notation (\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]). For more information, see ICU UnicodeSet Documentation.
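As a rough analogy, the set operations can be mimicked with ordinary Python sets. This is only a sketch under stated assumptions: the helper names are hypothetical, and because Python's standard library has no script property, the Arabic block range U+0600..U+06FF is used as a stand-in for [:script=arabic:], so the result is not identical to the real UnicodeSet.

```python
import unicodedata

def char_range(first, last):
    """Set of characters from `first` to `last`, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}

def has_canonical_decomposition(ch):
    """True if the character has a decomposition with no <compat> style tag."""
    decomposition = unicodedata.decomposition(ch)
    return bool(decomposition) and not decomposition.startswith("<")

# Analogue of [a-m w-z]: the union of two explicit ranges.
base = char_range("a", "m") | char_range("w", "z")

# Approximate analogue of [[:script=arabic:]&[:decompositiontype=canonical:]].
# Assumption: the Arabic block is used in place of the script property.
arabic_block = {chr(cp) for cp in range(0x0600, 0x0700)}
canonical = {ch for ch in arabic_block if has_canonical_decomposition(ch)}
arabic_canonical = arabic_block & canonical      # intersection (&)

# Difference (-) works the same way.
without_abc = base - char_range("a", "c")

print(sorted(f"U+{ord(ch):04X}" for ch in arabic_canonical))
```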

In the online demo, the implementation of UnicodeSet is customized in the following ways.

Query Use. The UnicodeSet can be typed in, or passed as a URL query parameter, as in the following example. When the set is passed in a URL, "&" must be replaced by "%26".
- list-unicodeset.jsp?a=[:whitespace:]
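A query URL of this form can be built programmatically. The following Python sketch uses a hypothetical helper, `build_query`; it assumes that brackets, colons, and equals signs may stay unescaped in the query value (as in the example above), while "&" and other reserved characters are percent-encoded.

```python
from urllib.parse import quote

def build_query(page, unicode_set):
    """Return `page?a=<set>` with '&' and other reserved characters escaped."""
    # Assumption: '[', ']', ':' and '=' are left readable, as in the page's example.
    return f"{page}?a={quote(unicode_set, safe='[]:=')}"

print(build_query("list-unicodeset.jsp", "[:whitespace:]"))
print(build_query("list-unicodeset.jsp",
                  "[[:script=arabic:]&[:decompositiontype=canonical:]]"))
```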
Regular Expressions. For the name property, regular expressions can be used for the value, enclosed in /.../. For example in the following expression, the first term will select all those Unicode characters whose names contain "CJK". The rest of the expression will then subtract the ideographic characters, showing that these can be used in arbitrary combinations.
- [[:name=/CJK/:]-[:ideographic:]] - the set of all characters with names that contain CJK that are not Ideographic
- [:name=/\bDOT$/:] - the set of all characters with names that end with the word DOT
- [:block=/(?i)arab/:] - the set of all characters in blocks that contain the sequence of letters "arab" (case-insensitive)
- [:toNFKC=/\./:] - the set of all characters with toNFKC values that contain a literal period
Some particularly useful regex features are:
- \b means a word break, ^ means the start of the string, and $ means the end. So /^DOT\b/ means the word DOT at the start.
- (?i) means case-insensitive matching.
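To get a feel for how such name patterns behave, the matching can be approximated offline. The sketch below is an assumption-laden stand-in: it uses Python's `re` module rather than the Java Pattern used by the demo, the helper name is hypothetical, and the character names come from whatever Unicode version the local Python ships with.

```python
import re
import unicodedata

def chars_with_name_matching(pattern):
    """Characters whose name contains a match for `pattern` (find semantics)."""
    regex = re.compile(pattern)
    matches = []
    for cp in range(0x110000):
        ch = chr(cp)
        name = unicodedata.name(ch, "")   # empty for unnamed code points
        if name and regex.search(name):
            matches.append(ch)
    return matches

# Names that end with the word DOT, e.g. U+00B7 MIDDLE DOT.
ends_with_dot = chars_with_name_matching(r"\bDOT$")
print(len(ends_with_dot), unicodedata.name(ends_with_dot[0]))
```

Replacing the pattern with `r"^DOT\b"` selects names that start with the word DOT instead, and a leading `(?i)` makes the match case-insensitive.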
Caveats:
1. The regex uses the standard Java Pattern. In particular, it does not have the extended functions in UnicodeSet, nor is it up-to-date with the latest Unicode. So be aware that you shouldn't depend on properties inside of the /.../ pattern.
2. The Unassigned, Surrogate, and Private Use code points are skipped in the Regex comparison, so [:Block=/Aegean_Numbers/:] returns a different number of characters than [:Block=Aegean_Numbers:], because the regex form skips the Unassigned code points in that block.
3. None of the normal "loose matching" is enabled. So [:Block=aegeannumbers:] works, but [:Block=/aegeannumbers/:] fails -- you have to use [:Block=/Aegean_Numbers/:] or [:Block=/(?i)aegean_numbers/:].
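The difference in caveat 3 can be illustrated with a small sketch. It assumes a simplified notion of loose matching (ignore case, spaces, underscores, and hyphens); the real rules have more detail, and `loose_key` is a hypothetical helper. Python's `re` is again used as a stand-in for the Java Pattern.

```python
import re

def loose_key(value):
    """Simplified loose-matching key: drop spaces, underscores, hyphens; lowercase."""
    return re.sub(r"[\s_-]", "", value).lower()

block_name = "Aegean_Numbers"

# Plain property syntax: values are compared loosely, so this succeeds.
plain_match = loose_key("aegeannumbers") == loose_key(block_name)

# Regex syntax: the pattern is applied to the value as written.
strict_regex = re.search(r"aegeannumbers", block_name) is not None
exact_regex = re.search(r"Aegean_Numbers", block_name) is not None
insensitive_regex = re.search(r"(?i)aegean_numbers", block_name) is not None

print(plain_match, strict_regex, exact_regex, insensitive_regex)
```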
Casing Properties. Unicode defines a number of string casing functions in Section 3.13 Default Case Algorithms. These string functions can also be applied to single characters. Warning: the first three sets may be somewhat misleading: isLowercase means that the character is the same as its lowercase version, which includes all uncased characters. To get those characters that are cased characters and lowercase, use [[:isLowercase:]&[:isCased:]]
1. The binary testing operations take no argument:
2. The string functions are also provided, and require an argument. For example:
  - [:toLowercase=a:] - the set of all characters X such that toLowercase(X) = a
  - [:toCaseFold=a:]
  - [:toUppercase=A:]
  - [:toTitlecase=A:]
  Note: The Unassigned, Surrogate, and Private Use code points are skipped in generation of the sets.
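The definitions above can be tried out with Python's built-in case functions. This is a rough sketch, not the demo's implementation: the helper names are hypothetical, and Python's case mappings follow the Unicode version bundled with the interpreter, which may differ from the version used by the online demo.

```python
import unicodedata

SKIPPED_CATEGORIES = {"Cn", "Cs", "Co"}   # unassigned, surrogate, private use

def assigned_chars():
    """Yield every code point except the skipped categories, as in the note above."""
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) not in SKIPPED_CATEGORIES:
            yield ch

def is_lowercase(ch):
    """isLowercase in the sense above: the character equals its lowercase version."""
    return ch.lower() == ch

def chars_mapping_to(case_function, target):
    """Analogue of [:toLowercase=a:] and friends for a given case function."""
    return [ch for ch in assigned_chars() if case_function(ch) == target]

# Uncased characters such as "1" satisfy is_lowercase, which is why the
# warning above suggests intersecting with the cased characters.
print(is_lowercase("a"), is_lowercase("1"), is_lowercase("A"))
print([f"U+{ord(ch):04X}" for ch in chars_mapping_to(str.lower, "k")])
```

Python's own `str.islower()` additionally requires at least one cased character, so `"1".islower()` is false; that makes it a rough stand-in for the intersected set, not an exact equivalent.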
Normalization Properties. Unicode defines a number of string normalization functions in UAX #15. These string functions can also be applied to single characters.
1. The binary testing operations have somewhat odd constructions:
  - [:^NFCquickcheck=N:] (use for [:isNFC:], and so on).
  - [:^NFKCquickcheck=N:]
  - [:^NFDquickcheck=N:]
  - [:^NFKDquickcheck=N:]
2. The string functions are also provided, and require an argument. For example:
  - [:toNFC=a:] - the set of all characters X such that toNFC(X) = a
  - [:toNFD=A\u0300:]
  - [:toNFKC=A:]
  - [:toNFKD=A\u0300:]
  Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of the sets.
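The same sets can be approximated with Python's `unicodedata` module. This is a sketch under assumptions: the helper names are hypothetical, `unicodedata.is_normalized` (Python 3.8+) is used as the single-character analogue of the binary tests, and the data reflects the Unicode version bundled with the interpreter.

```python
import unicodedata

SKIPPED_CATEGORIES = {"Cn", "Cs", "Co"}   # unassigned, surrogate, private use

def assigned_chars():
    """Yield every code point except the skipped categories, as in the note above."""
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) not in SKIPPED_CATEGORIES:
            yield ch

def chars_normalizing_to(form, target):
    """Analogue of [:toNFC=...:], [:toNFD=...:], and so on."""
    return [ch for ch in assigned_chars()
            if unicodedata.normalize(form, ch) == target]

def is_in_form(form, ch):
    """Single-character analogue of [:isNFC:], [:isNFD:], and so on."""
    return unicodedata.is_normalized(form, ch)

# Characters whose NFD form is A followed by COMBINING GRAVE ACCENT.
print([f"U+{ord(ch):04X}" for ch in chars_normalizing_to("NFD", "A\u0300")])
# U+00C0 is already in NFC but not in NFD.
print(is_in_form("NFC", "\u00C0"), is_in_form("NFD", "\u00C0"))
```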
IDNA Properties. The status of characters with respect to IDNA (internationalized domain names) can also be determined. The available properties are listed below.
1. [:idna=output:] The set of all characters allowed in the output of IDNA. An example is:
  - U+00E0 ( à ) LATIN SMALL LETTER A WITH GRAVE
2. [:idna=ignored:] The set of all characters ignored by IDNA on input. That is, these characters are mapped to nothing -- removed -- by NamePrep. An example is:
  - U+00AD ( ) SOFT HYPHEN.
3. [:idna=remapped:] The set of characters remapped to other characters by IDNA (NamePrep). Examples are:
  - U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE (remapped to the lowercase version).
  - U+FF21 ( Ａ ) FULLWIDTH LATIN CAPITAL LETTER A
4. [:idna=disallowed:] These are characters disallowed (on the registry side) by IDNA. An example is:
  - U+002C ( , ) COMMA
  Note: The client side adds characters unassigned in Unicode 3.2, for compatibility. To see just the characters disallowed in Unicode 3.2, you can use [[:idna=disallowed:]&[:age=3.2:]]. To also remove private-use, unassigned, surrogates, and controls, use [[:idna=disallowed:]&[:age=3.2:]-[:c:]].
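The remapped and ignored examples above can be reproduced with the NamePrep implementation in Python's standard library. This is a hedged sketch: `encodings.idna.nameprep` and the `stringprep` tables implement the original IDNA (2003) profile, and it is an assumption that they agree with the data behind the [:idna=...:] property for these characters. The disallowed category is not covered here.

```python
import stringprep
from encodings.idna import nameprep

def describe(ch):
    """Classify one character as ignored, remapped, or unchanged by NamePrep."""
    if stringprep.in_table_b1(ch):        # table B.1: mapped to nothing
        return "ignored"
    mapped = nameprep(ch)
    return "remapped" if mapped != ch else "unchanged"

for ch in ["\u00E0", "\u00AD", "\u00C0", "\uFF21"]:
    print(f"U+{ord(ch):04X}", describe(ch))

# The soft hyphen disappears from a label; the capital letters are lowercased.
print(nameprep("a\u00ADb"), nameprep("\u00C0"), nameprep("\uFF21"))
```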

Fonts and Display. If you don't have a good set of Unicode fonts (and a modern browser), you may not be able to read some of the characters. Some suggested fonts that you can add for coverage are: Noto Fonts site; Unicode Fonts for Ancient Scripts; Large, multi-script Unicode fonts. See also: Unicode Display Problems.

Version 3.9; ICU version: 74.1; Unicode/Emoji version: 15.1.0;