
# A Bare Minimum of Unicode

In Unicode terminology, a code point is the basic unit of digital text, and often, but not always, corresponds to what we informally mean when we say "character". More precisely, a Unicode character is the abstract entity (such as the letter e), while the code point is the integer value 0x65 (which, in both ASCII and Unicode, represents the letter e). Where it gets tricky is that Unicode has a code point for the letter e, a code point for the combining acute accent ´, and a code point for the letter e with an acute accent (é). In other words, the character é can be represented in multiple ways: as a single code point, or as a sequence of a base character followed by a combining character. Nothing is simple when it comes to Unicode.
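
To make this concrete, here is a minimal Python 3 sketch (Python 3 strings are sequences of code points, and the standard `unicodedata` module can convert between the composed and decomposed forms):

```python
import unicodedata

# One user-perceived character, two possible code point sequences:
composed   = "\u00e9"        # é as a single code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

print(composed, decomposed)            # both render as é
print(composed == decomposed)          # False: different code point sequences
print(len(composed), len(decomposed))  # 1 and 2 code points

# Unicode normalization converts between the two forms:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```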

However, broadly speaking, a code point is the integer value (up to 21 bits, ranging from 0x0 to 0x10FFFF) associated with a character. And often, when people say "Unicode", what they mean is the table mapping characters to such integers, kind of like an ASCII chart, but bigger.
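
In Python, `ord()` and `chr()` convert between a character and its code point integer, which makes the mapping easy to poke at:

```python
print(ord("e"))        # 101, i.e. 0x65 -- the same value as in ASCII
print(hex(ord("é")))   # 0xe9, i.e. U+00E9
print(chr(0x1F600))    # 😀 -- code points go well beyond 16 bits, up to 0x10FFFF
```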

Note that code points are independent of encoding. Unicode dictates that the code point 0xe9 (usually written as U+00E9) is the code point for the character LATIN SMALL LETTER E WITH ACUTE, always and forever.

If a code point is the basic unit of digital text, the integer that makes up each character, a code unit is the basic unit of your chosen encoding. In UTF-8, a code unit is an 8-bit integer; in UTF-16, a 16-bit integer; and in UTF-32, a 32-bit integer. A code unit is simply a measure of the "unit" size of a particular encoding.

And each of the UTF encodings maps each individual code point to a sequence of code units. For example, the letter e, with code point 0x65, when encoded as UTF-8, becomes the single code unit 0x65, but the letter é, with code point 0xe9, is encoded as the sequence 0xc3 0xa9 in UTF-8, while it only needs a single code unit (0xe9) in both UTF-16 and UTF-32.
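
A small Python 3 sketch showing exactly this (the `-be` encoding names pick an explicit big-endian byte order so no BOM is prepended):

```python
# The same code points become different code unit sequences in each encoding:
for text in ("e", "é"):
    print(f"U+{ord(text):04X}")
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        data = text.encode(enc)
        print(f"  {enc}: " + " ".join(f"{b:02x}" for b in data))

# U+0065: utf-8: 65 / utf-16-be: 00 65 / utf-32-be: 00 00 00 65
# U+00E9: utf-8: c3 a9 / utf-16-be: 00 e9 / utf-32-be: 00 00 00 e9
```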

UTF-32, with 32-bit code units, allows every code point to be encoded as a single code unit, which is nice and simple, and often makes UTF-32 preferable when doing text manipulation (even though, as mentioned above, a character might sometimes be composed of multiple combined code points, making everything a mess anyway).

UTF-16, with 16-bit code units, encodes every code point as either a single 16-bit code unit or a surrogate pair: two consecutive 16-bit code units.
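
For example, 😀 (U+1F600) lies outside the 16-bit range, so UTF-16 has to split it into a surrogate pair, while é fits in a single code unit. A quick Python 3 demonstration:

```python
data = "\U0001F600".encode("utf-16-be")   # 😀, code point U+1F600
units = [int.from_bytes(data[i:i+2], "big") for i in range(0, len(data), 2)]
print([hex(u) for u in units])            # ['0xd83d', '0xde00']: a surrogate pair
print("é".encode("utf-16-be").hex())      # '00e9': a single 16-bit code unit
```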

UTF-8, with 8-bit code units, encodes code points as a sequence of 1 to 4 code units.
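
A short Python 3 sketch of that 1-to-4 range, using one character from each length class:

```python
# UTF-8 uses 1 to 4 code units (bytes) per code point:
for ch in ("e", "é", "€", "😀"):          # U+0065, U+00E9, U+20AC, U+1F600
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(data)} code unit(s): "
          + " ".join(f"{b:02x}" for b in data))

# U+0065 -> 1 code unit(s):  65
# U+00E9 -> 2 code unit(s):  c3 a9
# U+20AC -> 3 code unit(s):  e2 82 ac
# U+1F600 -> 4 code unit(s): f0 9f 98 80
```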

## tl;dr

* code points are the integers that represent each character, like in an ASCII chart
* code units are the building blocks of a specific encoding
* encodings map each code point to a sequence of one or more code units
* Unicode is complicated. Really, seriously complicated. Don't assume that you can just compute the position of the 4th character. With UTF-8 and UTF-16, code points take a variable number of code units, so you have to scan from the start. With UTF-32, it is still tricky and subtle, and depends on what precisely you mean by "character" (see the sketch below).
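
Even in Python 3, where indexing a string gives you code points (so the fixed-width UTF-32-like case), the "4th character" is slippery as soon as combining characters show up:

```python
s = "cafe\u0301s"   # "cafés" written with a decomposed é
print(s)             # cafés: 5 user-perceived characters
print(len(s))        # 6 code points
print(s[4])          # the combining accent alone, not the "s" you might expect
```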