Make lexer use state rather than re-scanning previous text after break in input #202

nickd4 · 2018-01-14T04:41:06Z

Firstly, thanks for a really great JSON library. I've tried several others, and this one really speaks to me: I was looking for something clean, minimal and well thought out, and the others I tried were not.

I was concerned about possible quadratic behaviour. If I parse a file containing really gigantic strings (e.g. 1 gigabyte) and pass the input in consistently sized blocks (e.g. 16 megabytes), then every time I pass a new block, the lexer re-scans all the previously received blocks of the string.
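To put rough numbers on the concern, assume a single 1 GiB string arriving in 64 blocks of 16 MiB, and assume the lexer restarts from the beginning of the string each time a new block arrives. Both are simplifying assumptions for illustration, not measurements. The bytes scanned then add up to

$$
\sum_{k=1}^{64} k \cdot 16\,\text{MiB} = \frac{64 \cdot 65}{2} \cdot 16\,\text{MiB} = 33280\,\text{MiB} = 32.5\,\text{GiB},
$$

versus 1 GiB if each byte is scanned once. In general the re-scanning cost grows as $n(n+1)/2$ in the number of blocks $n$.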

As far as I can see, the rest of the system should be able to handle this. (The use case is something like a browser cache, which would keep a map whose keys are filenames and whose string values are the file contents.)

So I made an experimental change in which the lexer uses state variables to pick up where it left off instead of re-scanning the previous input. This worked out quite well. I used a variant of "Duff's device" to achieve this without huge modifications to the existing code. In fact, the logic flow is pretty much identical, except that I merged the handling of "true", "false" and "null", just because I could. I could also clean up the UTF-8 string validation slightly (see comments), but that would be an extra change.
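For readers unfamiliar with the trick, here is a minimal sketch of a resumable string lexer built on a `switch` whose case labels sit inside the scanning loop, which is the Duff's-device-style construct mentioned above. It is not the code from this pull request: `tok_t`, `lexer_t`, `lex_string` and the `resume` field are hypothetical names, and escapes are handled only by skipping the byte after a backslash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not yajl's real lexer API. */
typedef enum { TOK_STRING, TOK_EOF, TOK_ERROR } tok_t;

typedef struct {
    int resume;      /* 0 = between tokens, otherwise the case label to re-enter */
    size_t str_len;  /* raw bytes seen between the quotes, across all chunks */
} lexer_t;

/* Lex one double-quoted string from buf[*off .. len).
 * Returns TOK_EOF when the chunk ends mid-token; calling again with the
 * next chunk (and *off reset to 0) continues from the saved position, so
 * no earlier byte is ever read twice. */
tok_t lex_string(lexer_t *lx, const char *buf, size_t len, size_t *off)
{
    char c;

    assert(lx != NULL && buf != NULL && off != NULL);

    switch (lx->resume) {
    case 0:
        if (*off >= len)
            return TOK_EOF;               /* nothing started yet */
        if (buf[(*off)++] != '"')
            return TOK_ERROR;             /* this sketch only lexes strings */
        lx->str_len = 0;
        for (;;) {
    case 1:                               /* re-entry point: inside the string body */
            if (*off >= len) { lx->resume = 1; return TOK_EOF; }
            c = buf[(*off)++];
            if (c == '"')
                break;                    /* leaves the for loop, not the switch */
            lx->str_len++;
            if (c == '\\') {
    case 2:                               /* re-entry point: just after a backslash */
                if (*off >= len) { lx->resume = 2; return TOK_EOF; }
                ++*off;                   /* skip the escaped byte unexamined */
                lx->str_len++;
            }
        }
        lx->resume = 0;                   /* token complete, ready for the next one */
        return TOK_STRING;
    }
    return TOK_ERROR;                     /* corrupt resume value */
}
```

A caller keeps one `lexer_t` per stream, resets `off` to 0 for each new chunk, and calls `lex_string` again whenever it returns `TOK_EOF`. The control flow reads the same as a one-shot loop; the only additions are the case labels and the saved `resume` value.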

I think this would be an enhancement to the current lexer. What do others think about the idea? For my own case I am happy to use a private fork, but I thought it would be good to contribute it upstream if possible.

nickd4 · 2018-11-06T00:35:17Z

Note that I wasn't experienced with pull requests when I filed this, and I seem to have linked my experimental repository, which contains many unrelated changes. To clarify, the proposed change is limited to the one or two functions that implement the lexer behaviour discussed in the original post.

I'll isolate them and link the pull request later. It does not seem to matter at the moment, since lloyd is inactive, as many people have observed.

I am planning to take over maintainership of this project by informally making available a version that contains (in general) the pull requests filed in this repo. However, it is a large undertaking, since there are many pull requests and I'm not sure I can validate all of them, e.g. those relating to MSVC or embedded applications of the parser. So I plan to start the yajl maintainership project in a few months, when I have a bit more time. If anyone is interested, please mail me: nick "AT" ndcode "DOT" org.

nickd4 added 12 commits January 14, 2018 15:29

Make lexer use state rather than re-scanning previous text after break in input

0a57ebf

Remove a variable that is no longer used since the previous commit

da77780

Add reverse parser and test cases, passes more than half, need to fix strings

50e1688

Implement reverse string parsing, revise rev_cases test suite in a more principled way by reversing the cases test suite character-by-character except inside tokens (this means keys and values are reversed, allow this in yajl_rev_parser.c)

9772943

Add YAJL_SUPPLEMENTARY to parse or generate extra data items after normal ones

1f495ca

Add YAJL_SUPPLEMENTARY support in the reverse parsers

c03da2d

Make reverse parser lookahead after supplementary item by character not token, add resume-after-cancel option, make parser more consistent with reverse parser

b765929

Modify generator to generate "[]" rather than "[\n\n]" and similarly for "{}"

61bd5a0

Add convenience routine yajl_gen_get_offset() which just returns the count that would be returned by yajl_gen_get_buf(), fix bug in reverse parser with last item not being flushed out during yajl_rev_complete_parse() if maybe supplementary

dfe0a19

Add yajl_reset() similar to yajl_gen_reset() allowing to reset the parser

2068f46

Fix bug with closing ] or } losing its whitespace when in map value position

fec49bb
