https://begriffs.com/posts/2019-05-23-unicode-icu.html


Most programming languages evolved awkwardly during the transition from ASCII to 16-bit UCS-2 to full Unicode. They contain internationalization features that often aren’t portable or don’t suffice.

Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may be visually confusable.

Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.

This article illustrates text processing ideas with example programs. We’ll use the International Components for Unicode (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.

IBM (the maintainer of ICU) officially supports C, C++ and Java APIs. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.


Concepts

Before getting into the example code, it’s important to learn the terminology. Let’s start with the most basic question.

What is a “character?”

“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.

Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.

You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.

In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways: