Why do some glyphs have a four-digit hex code while other have five digits? #1681

stamminator · 2024-07-31T17:15:08Z

stamminator
Jul 31, 2024

I'm troubleshooting an issue where the VS Code integrated terminal does unnecessary line breaks when certain glyphs are present in the output. I've noticed that glyphs with four-digit hex codes don't have the issue, while glyphs with five-digit hex codes do.

I plan to file an issue with vscode, but first, I want to understand how glyphs work better. Why do the hex codes vary in length?

EDIT: I've also noticed that when I enter the hex code literal into an Oh My Posh segment template string rather than pasting the icon itself into the template string, the 4-digit one is correct while the 5-digit one isn't.

I know this is neither the VS Code nor the Oh My Posh repo, but these issues seem conceptually similar.

Answered by Finii

Aug 7, 2024

"In the beginning" there was ASCII, where each character can be expressed as 7 bit number - later expanded to 8 bit.
These 8 bits can be expressed as 2 digit hexadecimal number.
But 256 characters is a bit low, considering all these strange letters with appendices like ä and Ó.
Then we got codepages. Etc.

That all was too limited and finally unicode came up. Version 2 had 40'000 characters, nowadays we have 150'000.
The 'address' of a certain character or glyph is its codepoint. And you see for Version 2 with 40'000 (decimal) codepoints each can be expressed as 4 digit hexadecimal number (40000 decimal = 9C40 hex). But this is no longer possible (150'000 decimal = 249F0 hex)

We are not fr…

View full answer

Finii · 2024-08-07T17:53:41Z

Finii
Aug 7, 2024
Collaborator

"In the beginning" there was ASCII, where each character can be expressed as 7 bit number - later expanded to 8 bit.
These 8 bits can be expressed as 2 digit hexadecimal number.
But 256 characters is a bit low, considering all these strange letters with appendices like ä and Ó.
Then we got codepages. Etc.

That all was too limited and finally unicode came up. Version 2 had 40'000 characters, nowadays we have 150'000.
The 'address' of a certain character or glyph is its codepoint. And you see for Version 2 with 40'000 (decimal) codepoints each can be expressed as 4 digit hexadecimal number (40000 decimal = 9C40 hex). But this is no longer possible (150'000 decimal = 249F0 hex)

We are not free to use any codepoint we like, there are certain ranges that need to be used. And when the useful-for-Nerdfonts 4 digit range filled up we needed to allocate new glyphs a 5 digit number. Well, in principle.
Codepoints are organized in blocks, see for example https://en.wikipedia.org/wiki/Unicode_block
We can only use codepoints in the Private Use Areas or PUA, and you see in the block overview that there is only a very small block in the 4 digit range and two larger blocks in the 5 and 6 digit range.

That much for history. A long time 4 digit code were all what was needed and this lead to some pitfalls.
So for example in many contexts the \u is a prefix for a 4 digit unicode. You see that in your example \uf1442 is the same as \uf144 and 2 and this is what you see. Some systems have \U for 5/6 digit codes, for some systems you must encode the codepoint in UTF16 that you can put as two \u strings. The cheat cheat helps you with these because the calculation is not easy:

Here you see LOL, with codepoint \Uf1214 or \udb84\ude14 (the second form is the UTF16 variant that you get by clicking on the UTF16 box).

I have no clue what VS Code uses or expects.

For the unnecessary line breaks, I guess the line breaks because the terminal calculates the length wrong? But the lines seem to be not full. On the other hand the break is not after the special glyph but later on. You need to know that unicode has the concept of different width or characters; they can be single, double, or ambiguous width. Maybe here is a problem.
Is the line break independent of concrete full line length? Why is it at the closing parens? Etc.

I hope this explains at least some things and helps your research. There are for more peculiarities, especially regarding Windows and how it handles stuff. I am not sure if the same holds for VS Code on other platforms and you do not specify your platform. Windows specifically was late in joining the other OSes that all supported 5 digit unicode characters and still struggles (I believe, I can be wrong here).

stamminator Aug 14, 2024
Author

Yeah, I won't bother raising an issue about conhost, but I will submit an issue to the VS Code repo this week describing the PUA-A unicode rendering issue.

the upper case U takes as many digits as there are, so \UbeefF is not beef and F but some U+BEEFF character and you have to express such kind of strings in roundabout way

Which language is it that uses this syntax for hex literals? YAML solves this by making U always 8 digits (\U0000beefF = 뻯F), which seems to me to have no downsides if you don't count the extra zeros. Anything to avoid having to do UTF-16 surrogate pair math is a win in my book.

Finii Aug 14, 2024
Collaborator

Hmm, maybe I'm mistaken. I had the printf command in mind, but that is exactly as you say. Maybe it was echo?

https://www.man7.org/linux/man-pages/man1/printf.1.html

Finii Aug 14, 2024
Collaborator

Ah, with printf you have the problem with \x at least .... "1 to 2 digits" 😬
And you always have and had this with the octal sequences, which drove me crazy in like 1980 ;-)

Finii Aug 14, 2024
Collaborator

Ah I guess it was \x, here as C/C++ literal ... what a bad design idea. \U was not available back than; \u became standardized only in 99 it seems.

stamminator Aug 14, 2024
Author

Thanks for the confirmation! I knew next to nothing about Unicode two weeks ago, so this has been a great crash course.

which drove me crazy in like 1980 ;-)

My dad was in middle school in 1980 😳

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why do some glyphs have a four-digit hex code while other have five digits? #1681

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Why do some glyphs have a four-digit hex code while other have five digits? #1681

stamminator Jul 31, 2024

Replies: 1 comment · 7 replies

Finii Aug 7, 2024 Collaborator

stamminator Aug 14, 2024 Author

Finii Aug 14, 2024 Collaborator

Finii Aug 14, 2024 Collaborator

Finii Aug 14, 2024 Collaborator

stamminator Aug 14, 2024 Author

stamminator
Jul 31, 2024

Replies: 1 comment 7 replies

Finii
Aug 7, 2024
Collaborator

stamminator Aug 14, 2024
Author

Finii Aug 14, 2024
Collaborator

Finii Aug 14, 2024
Collaborator

Finii Aug 14, 2024
Collaborator

stamminator Aug 14, 2024
Author