I am using science-parse for getting sections of PDF papers.
I assumed that the resulting JSON files are encoded in UTF-8,
but have encountered several cases of surrogate pairs encoded as "\uxxxx\uxxxx",
which is characteristic of UTF-16.
In https://en.wikipedia.org/wiki/JSON I found the following paragraph:
JSON exchange in an open ecosystem must be encoded in UTF-8.[18] The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+10000 to U+10FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs, a detail missed by some JSON parsers. For example, to include the Emoji character U+1F602 😂 FACE WITH TEARS OF JOY in JSON:
{ "face": "😂" }
// or
{ "face": "\uD83D\uDE02" }
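To check my understanding of that paragraph, here is a small sketch using only Python's standard `json` module (not science-parse itself; the variable names are just illustrative). It suggests that both forms parse to the same string, and that the escaped form is plain ASCII and therefore still a valid UTF-8 file:

```python
import json

# The same character written two ways, as in the quoted example.
raw_form = '{ "face": "\U0001F602" }'         # literal character in the JSON text
escaped_form = r'{ "face": "\uD83D\uDE02" }'  # ASCII-only escape using a surrogate pair

# A conforming parser combines the escaped pair into a single code point.
raw_value = json.loads(raw_form)["face"]
escaped_value = json.loads(escaped_form)["face"]
same = raw_value == escaped_value
code_point = hex(ord(escaped_value))

# The escaped text contains only ASCII characters, so its UTF-8 bytes are pure ASCII.
is_ascii = escaped_form.encode("utf-8").isascii()

# Python's own serializer can emit either form, controlled by ensure_ascii.
escaped_out = json.dumps({"face": raw_value})                  # escapes non-ASCII by default
raw_out = json.dumps({"face": raw_value}, ensure_ascii=False)  # keeps the literal character

print(same, code_point, is_ascii)
print(escaped_out)
print(raw_out)
```

If science-parse behaves like this, the `\uxxxx\uxxxx` sequences would be an escaping choice made by the JSON writer rather than a sign that the file is UTF-16 encoded, but that is my assumption and I have not confirmed it.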
So, how does science-parse encode these out-of-BMP Unicode code points?
Is this configurable?
Thanks.