Which encoding are science-parse output files written in? #131

fyuval · 2019-02-12T21:50:51Z

I am using science-parse to extract sections from PDF papers.
I assumed that the resulting JSON files are encoded in UTF-8,
but I have encountered several cases of surrogate pairs written as "\uXXXX\uXXXX" escapes,
which is characteristic of UTF-16.

In https://en.wikipedia.org/wiki/JSON I found the following paragraph:

JSON exchange in an open ecosystem must be encoded in UTF-8. The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+10000 to U+10FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs, a detail missed by some JSON parsers. For example, to include the Emoji character U+1F602 😂 FACE WITH TEARS OF JOY in JSON:

{ "face": "😂" }
// or
{ "face": "\uD83D\uDE02" }
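
To check my understanding of the two forms quoted above, here is a minimal Python sketch using only the standard `json` module. It is not specific to science-parse and says nothing about how science-parse writes its files; the variable names are my own. It only illustrates that a `\uXXXX\uXXXX` escape inside a JSON file is a JSON-level escape, independent of the byte encoding of the file itself:

```python
import json

# The two JSON texts from the Wikipedia example: escaped and literal.
escaped = '{ "face": "\\uD83D\\uDE02" }'
literal = '{ "face": "\U0001F602" }'

a = json.loads(escaped)["face"]
b = json.loads(literal)["face"]

# Both forms decode to the same single code point, U+1F602.
same = (a == b)
code_point = hex(ord(a))          # '0x1f602'

# Encoding that character as UTF-8 gives four bytes.
utf8 = a.encode("utf-8")          # b'\xf0\x9f\x98\x82'

# Python's json.dumps escapes non-ASCII by default (ensure_ascii=True),
# which produces a surrogate-pair escape even though the output is pure ASCII
# and therefore also valid UTF-8.
dumped_ascii = json.dumps({"face": a})
dumped_raw = json.dumps({"face": a}, ensure_ascii=False)

print(same, code_point, utf8)
print(dumped_ascii)
print(dumped_raw)
```

So a file containing only the escaped form is still valid UTF-8 (it is plain ASCII), and a conforming JSON parser should return the same string either way.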

So, how does science-parse encode these out-of-BMP Unicode code points?
Is this configurable?
Thanks.
