Investigate input of overlong UTF-8 sequences #932

jridderbusch · 2024-04-03T12:49:23Z

Enhancement

Description

Investigate how the system behaves when input contains overlong UTF-8 sequences, and check whether string validation can be bypassed. This should be fine, since Java converts all UTF-8 to UTF-16 before exposing it as strings, but it is unclear whether the JSON parser reads the UTF-8 stream directly.

Stakeholders

@sybereal

Solution Proposal and Work Breakdown

illfixit · 2024-04-05T09:03:38Z

We have standard Angular validators for the form fields. They seem to be well tested and handle such symbols correctly.

sybereal · 2024-04-05T09:23:14Z

I believe there may have been a misunderstanding here.

UTF-8's design theoretically allows code points to be represented in different ways. Overlong UTF-8 sequences use more bytes than strictly required, while still decoding to the same code point. For example, the ASCII space character (U+0020) is normally encoded as a single byte 0x20. However, following normal UTF-8 decoding rules, if you decode 0xc0 0xa0, you will also get U+0020 back.¹²
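To make the arithmetic concrete, here is a minimal Java sketch of a deliberately lenient two-byte decoder. The class and method names are made up for illustration and are not taken from this repository or from EDC. The sketch leaves out the minimum-value check that a conforming decoder performs, which is exactly what lets the overlong form through.

```java
// Illustration only: hypothetical class, not project code.
class LenientTwoByteDecoder {

    // Decodes a 110xxxxx 10yyyyyy pair by concatenating the payload bits.
    // It deliberately skips the check that the result is at least U+0080.
    static int decode(int lead, int continuation) {
        if ((lead & 0xE0) != 0xC0 || (continuation & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        return ((lead & 0x1F) << 6) | (continuation & 0x3F);
    }

    // The extra check a conforming decoder applies to two-byte sequences:
    // anything below U+0080 fits in one byte, so two bytes is overlong.
    static boolean isOverlong(int lead, int continuation) {
        return decode(lead, continuation) < 0x80;
    }

    public static void main(String[] args) {
        // Overlong encoding of the space character.
        System.out.printf("U+%04X overlong=%b%n",
                decode(0xC0, 0xA0), isOverlong(0xC0, 0xA0)); // U+0020 overlong=true
        // Regular two-byte encoding of U+00E9 for comparison.
        System.out.printf("U+%04X overlong=%b%n",
                decode(0xC3, 0xA9), isOverlong(0xC3, 0xA9)); // U+00E9 overlong=false
    }
}
```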

The concern is that, if software operates directly on UTF-8-encoded strings, such encodings could potentially be used to bypass validation checks. In the above case of the space character, a validation that checks if a certain input does not contain whitespace may naively look only for the byte 0x20, which can cause it to miss certain occurrences if input is not normalized beforehand.
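As a rough sketch of that bypass, the following Java snippet compares a byte-level whitespace check with one that decodes strictly before validating. `WhitespaceValidation` and its methods are hypothetical names, not code from this repository or from upstream. The strict variant assumes the JDK's built-in UTF-8 decoder configured with `CodingErrorAction.REPORT`; a parser that decodes the byte stream itself would have to be checked separately.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical class, not project code.
class WhitespaceValidation {

    // Naive check on raw bytes: only looks for the single byte 0x20.
    static boolean containsSpaceByte(byte[] utf8) {
        for (byte b : utf8) {
            if (b == 0x20) {
                return true;
            }
        }
        return false;
    }

    // Decodes strictly first, then validates characters. Malformed input,
    // including overlong forms, raises a CharacterCodingException.
    static boolean containsWhitespaceStrict(byte[] utf8) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        String text = decoder.decode(ByteBuffer.wrap(utf8)).toString();
        return text.chars().anyMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        // "a", overlong space (0xC0 0xA0), "b"
        byte[] overlong = {'a', (byte) 0xC0, (byte) 0xA0, 'b'};

        // The byte-level check sees no 0x20 and lets the input through.
        System.out.println(containsSpaceByte(overlong)); // false

        try {
            containsWhitespaceStrict(overlong);
        } catch (CharacterCodingException e) {
            // The strict decoder refuses the input before validation runs.
            System.out.println("rejected: " + e);
        }
    }
}
```

The byte-level check only becomes a problem if a later stage decodes the same bytes leniently; rejecting malformed input at the boundary avoids the two stages disagreeing.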

Since this concerns input validation, I believe it is a backend issue, rather than (just) a frontend issue.

Footnotes

¹ https://stackoverflow.com/a/7113150

² https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

illfixit · 2024-04-05T09:39:20Z

Thank you for the information!

SebastianOpriel · 2024-10-06T08:29:35Z

Is this really an issue of our repo or shall it be addressed in Core EDC? //Cc @efiege

sybereal · 2024-10-18T12:16:02Z

Both, since we would have to investigate the behavior of both upstream and our custom code.

jridderbusch added the kind/enhancement New feature or request label Apr 3, 2024

sybereal changed the title ~~Investigate input of long UTF-8 sequences~~ Investigate input of overlong UTF-8 sequences Apr 8, 2024

jridderbusch transferred this issue from sovity/authority-portal Apr 8, 2024

AbdullahMuk assigned sybereal May 1, 2024

AbdullahMuk added the clean-backlog requires backlog cleaning label May 2, 2024

sybereal assigned ununhexium and unassigned sybereal May 2, 2024

ununhexium removed the clean-backlog requires backlog cleaning label May 29, 2024

ununhexium transferred this issue from sovity/edc-broker-server-extension May 29, 2024
