Which encoding are science-parse output files written in? #131

Open
fyuval opened this issue Feb 12, 2019 · 0 comments

fyuval commented Feb 12, 2019

I am using science-parse to extract sections from PDF papers.
I assumed that the resulting JSON files are encoded in UTF-8,
but I have encountered several cases of surrogate pairs escaped as "\uxxxx\uxxxx",
which is characteristic of UTF-16.

In https://en.wikipedia.org/wiki/JSON I found the following paragraph:

JSON exchange in an open ecosystem must be encoded in UTF-8.[18] The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+10000 to U+10FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs, a detail missed by some JSON parsers. For example, to include the Emoji character U+1F602 😂 FACE WITH TEARS OF JOY in JSON:

{ "face": "😂" }
// or
{ "face": "\uD83D\uDE02" }
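
To illustrate the quoted paragraph, here is a minimal Python sketch using only the standard `json` module. It does not show anything about science-parse itself. It only assumes the output is standard JSON, in which case the escaped and literal forms should parse to the same string, and a file containing only `\uXXXX` escapes is plain ASCII and therefore still valid UTF-8. The variable names are mine, for illustration.

```python
import json

# Escaped form: the file contains only ASCII characters (a surrogate pair escape).
escaped = '{ "face": "\\uD83D\\uDE02" }'
# Literal form: the character itself, which takes 4 bytes in UTF-8.
literal = '{ "face": "\U0001F602" }'

parsed_escaped = json.loads(escaped)
parsed_literal = json.loads(literal)

# A conforming parser combines the surrogate pair into one code point.
same = parsed_escaped == parsed_literal
code_point = hex(ord(parsed_escaped["face"]))

# Post-processing workaround (not a science-parse option):
# re-serialize with ensure_ascii=False to write the character literally.
reencoded = json.dumps(parsed_escaped, ensure_ascii=False)
utf8_bytes = reencoded.encode("utf-8")
```

If the escaped form is a problem downstream, re-serializing the parsed data as above is one way to normalize the files after the fact, independent of how the tool writes them.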

So, how does science-parse encode these out-of-BMP Unicode code points?
Is this configurable?
Thanks.
