
What you need to know about Unicode

utf.hpp isn't a Unicode library, but as the UTF encodings are defined in terms of Unicode, it does use some Unicode concepts and terminology.

Feel free to skip to the bottom of the page for the extremely short version.

Code Points, Characters and Glyphs

At the core of Unicode are three concepts: (abstract) characters, code points, and glyphs. They are best illustrated by an example:

Take the letter a. This is a purely abstract entity, a unit carrying some semantic information, and nothing else. In Unicode, these are called characters, or abstract characters.

Because computers deal with numbers, this abstract character is mapped to the integer 97, or 0x61 in hexadecimal. And because Unicode deals with a lot of different characters, these integers can be up to 21 bits wide. These integers are known as code points.
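To make this concrete, here is a minimal sketch in plain standard C++ (it does not use utf.hpp at all) that prints the code point values of a couple of characters:

```cpp
#include <cstdio>

int main() {
    // char32_t (C++11) is wide enough to hold any single Unicode code point.
    char32_t latin_a = U'a';      // LATIN SMALL LETTER A
    char32_t euro    = U'\u20ac'; // EURO SIGN

    std::printf("U+%04X\n", static_cast<unsigned>(latin_a)); // prints U+0061 (decimal 97)
    std::printf("U+%04X\n", static_cast<unsigned>(euro));    // prints U+20AC
}
```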

Finally, the letter a also has a visual representation: what it actually looks like when written down. This is called a glyph.

And that is basically what Unicode is all about: the mappings from abstract characters to code points, and from code points to glyphs.

Now, in a nice simple world, you might expect these to be one-to-one mappings, but Unicode is complex, and, well, pretty much all of these are potentially many-to-many mappings.

But utf.hpp doesn't care about that. All that matters is that any Unicode string defines a sequence of code points (which map to a sequence of abstract characters which in turn define the semantic meaning of the string).

UTF encodings and Code Units

Given that it is inconvenient to encode strings as a sequence of 21-bit integers, several different Unicode Transformation Formats are defined. These are UTF-8, UTF-16 and UTF-32.

These are basically serialization formats, encoding code points as different byte sequences.

Each of these formats uses a different size of integer as its basic building block, known as a code unit, and encodes a code point as one or more of these.

UTF-32 is the simplest format, as it uses 32-bit wide code units. In other words, each 21-bit code point is mapped to a single 32-bit code unit (by padding the value with leading zeros). This makes UTF-32 a fixed-width encoding, as any code point, regardless of its value, is encoded as a single 32-bit code unit.
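For example, here is a small sketch (standard C++ only, not utf.hpp's API) showing that in UTF-32 each code point is exactly one code unit:

```cpp
#include <cassert>
#include <string>

int main() {
    // Two code points: U+0061 'a' and U+20AC '€'.
    std::u32string s = U"a\u20ac";
    assert(s.size() == 2);  // two code points -> two 32-bit code units
    assert(s[0] == 0x0061);
    assert(s[1] == 0x20AC);
}
```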

UTF-8 uses 8-bit wide code units. That is, each code point is mapped to a variable number (1 to 4) of 8-bit integers. A nice property of UTF-8 is that plain ASCII text is also valid UTF-8, as code units with values in the range 0-127 represent the same characters as they do in plain ASCII text. However, all other code points are encoded as sequences of multiple code units.
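To illustrate, here are a few code points hand-encoded as UTF-8; these byte values follow from the UTF-8 definition itself and have nothing to do with utf.hpp:

```cpp
#include <cassert>

int main() {
    // U+0061 'a'          -> 1 byte:  0x61 (identical to ASCII)
    // U+20AC '€'          -> 3 bytes: 0xE2 0x82 0xAC
    // U+1F600 (an emoji)  -> 4 bytes: 0xF0 0x9F 0x98 0x80
    const unsigned char utf8[] = {
        0x61,
        0xE2, 0x82, 0xAC,
        0xF0, 0x9F, 0x98, 0x80,
    };
    // Three code points, but eight code units (bytes).
    assert(sizeof utf8 == 8);
}
```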

UTF-16 falls in between the other two: as its name implies, it defines a code unit to be 16 bits wide, meaning that the most commonly used code points (in Western languages, at least), but not all of them, can be encoded in a single code unit. However, it is still a variable-length encoding, as all code points with values greater than 0xffff must be encoded as a pair of code units (known as a surrogate pair).
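The pairing itself is simple arithmetic. The following sketch (standard C++, not utf.hpp's API) encodes a single code point above U+FFFF as a surrogate pair:

```cpp
#include <cassert>
#include <cstdint>

// Encode one code point above U+FFFF as a UTF-16 surrogate pair.
// (Code points up to U+FFFF, excluding the surrogate range, are a single code unit.)
void to_surrogate_pair(std::uint32_t cp, std::uint16_t& high, std::uint16_t& low) {
    std::uint32_t v = cp - 0x10000;                          // at most 20 bits remain
    high = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
}

int main() {
    std::uint16_t hi, lo;
    to_surrogate_pair(0x1F600, hi, lo);  // the emoji from the UTF-8 example
    assert(hi == 0xD83D && lo == 0xDE00);
}
```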

tl;dr

Any Unicode string defines a sequence of code points.

A UTF-16 string encodes this as a sequence of 16-bit code units. Depending on the value of each code point, it may be encoded as a single code unit, or it might be encoded as a surrogate pair of two code units.

If the same string is encoded as UTF-8, it is instead encoded as a sequence of 8-bit code units. Depending on the value of each code point, it might be encoded as 1, 2, 3 or 4 code units.

So the length of the string in UTF-16 might be different from the length of the string in UTF-8, and both of these might have lengths that differ from the number of Unicode code points represented by the string.
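The following sketch (again standard C++ string literals, not utf.hpp) shows the same two code points producing three different lengths:

```cpp
#include <cassert>
#include <string>

int main() {
    // Two code points: U+0061 'a' and U+1F600 (an emoji).
    std::u32string utf32 = U"a\U0001F600";
    std::u16string utf16 = u"a\U0001F600";
    std::string    utf8  = "a\xF0\x9F\x98\x80";  // the UTF-8 bytes spelled out

    assert(utf32.size() == 2);  // 2 code points -> 2 code units
    assert(utf16.size() == 3);  // 'a' plus a surrogate pair
    assert(utf8.size()  == 5);  // 1 byte plus 4 bytes
}
```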

And when converting between UTF-8, UTF-16 and UTF-32, the sequence of code points is never modified. The string is made up of the same sequence of code points, only serialized to different byte sequences.
