Make lexer use state rather than re-scanning previous text after break in input #202

Open: nickd4 wants to merge 12 commits into master

nickd4 commented Jan 14, 2018

Firstly, thanks for a really great JSON library. I've tried several others and this one really speaks to me, because I was looking for something clean, minimal and well thought out, unlike the others I tried.

I was concerned about possible quadratic behaviour: if I parse a file containing a really gigantic string (e.g. 1 gigabyte) and pass the input in consistently sized blocks (e.g. 16 megabytes), then every time I pass a new block, the lexer re-scans all the previously received blocks of that string.
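To put rough numbers on this concern, here is a small sketch that counts how many bytes the lexer would examine under each strategy. It is a cost model only, not yajl code; the function names `rescan_cost` and `resume_cost` are made up for illustration, and it assumes the whole input is one token.

```c
#include <stdint.h>

/* Bytes examined when every new block forces a re-scan from the start
 * of the token (hypothetical cost model, not yajl's actual code). */
static uint64_t rescan_cost(uint64_t total, uint64_t block)
{
    uint64_t scanned = 0, received = 0;
    while (received < total) {
        uint64_t left = total - received;
        received += left < block ? left : block;
        scanned += received;   /* the whole token so far is walked again */
    }
    return scanned;
}

/* Bytes examined when the lexer keeps state and only reads new input. */
static uint64_t resume_cost(uint64_t total, uint64_t block)
{
    (void)block;               /* block size no longer matters */
    return total;
}
```

For a 1 GiB string delivered in 16 MiB blocks there are 64 blocks, so re-scanning examines 16 MiB × (1 + 2 + … + 64) = 16 MiB × 2080 = 32.5 GiB, against 1 GiB when resuming from saved state.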

As far as I can see, the rest of the system should be able to handle this. (The use case is something like a browser cache, which would keep a list of entries where the keys are filenames and the string values are file contents.)

So I decided to do an experimental change where the lexer uses state variables to pick up where it left off, instead of re-scanning the previous input. This worked out quite well. I've used a variant of "Duff's device" to achieve this without huge modifications to the existing code. In fact the logic flow is pretty much identical, except that I merged the handling of "true", "false" and "null", just because I could. I could clean up the UTF-8 string validation stuff slightly (see comments) but that would be extra change.
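For readers unfamiliar with the trick, here is a minimal sketch of a resumable lexer in the "Duff's device" style, where `case` labels inside the loop let a `switch` jump back to the exact point at which input ran out. It is not the code from this pull request: every name (`sketch_lexer`, `sketch_lex_string`, `NEXT_CHAR`, the `TOK_*` values) is hypothetical, and it only handles a double-quoted string with backslash escapes, with no UTF-8 or escape validation.

```c
#include <stddef.h>

typedef enum { TOK_STRING, TOK_EOF, TOK_ERROR } sketch_tok;  /* hypothetical */

typedef struct {
    int resume;  /* 0 = start of token, otherwise the case label to re-enter */
} sketch_lexer;

/* Fetch the next byte, or save the resume point and return TOK_EOF.
 * The case label sits inside the do-block, so the switch can jump
 * straight back to this spot on the next call. */
#define NEXT_CHAR(label)                                             \
    do {                                                             \
    case label:                                                      \
        if (*off >= len) { lx->resume = label; return TOK_EOF; }     \
        c = buf[(*off)++];                                           \
    } while (0)

/* Lex one double-quoted string that may be split across buffers.
 * On TOK_EOF, call again with the next buffer and *off reset to 0. */
static sketch_tok sketch_lex_string(sketch_lexer *lx,
                                    const unsigned char *buf, size_t len,
                                    size_t *off)
{
    unsigned char c;

    switch (lx->resume) {
    case 0:
        NEXT_CHAR(1);
        if (c != '"') goto error;       /* expect the opening quote */
        for (;;) {
            NEXT_CHAR(2);
            if (c == '"') break;        /* closing quote ends the token */
            if (c == '\\') {
                NEXT_CHAR(3);           /* skip the escaped byte unexamined */
            }
        }
        lx->resume = 0;
        return TOK_STRING;
    }
error:
    lx->resume = 0;                     /* bad input or bad resume value */
    return TOK_ERROR;
}
```

The only state carried between calls is one integer, so each input byte is examined once however the input is split. A real lexer would also need to carry partial token contents and any decoding state across calls.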

I think this would be an enhancement to the current lexer, what do others think about the idea? In fact, for my case I am happy to use a private fork, but I thought it good to contribute it upstream if possible.

…re principled way by reversing the cases test suite character-by-character except inside tokens (this means keys and values are reversed, allow this in yajl_rev_parser.c)
…ot token, add resume-after-cancel option, make parser more consistent with reverse parser
…count that would be returned by yajl_gen_get_buf(), fix bug in reverse parser with last item not being flushed out during yajl_rev_complete_parse() if maybe supplementary
…readable yajl_gen_get_start_offset()/yajl_gen_get_end_offset() and make it so that the end offset is not necessarily the end of the buffer if a final newline was inserted, add similar yajl_get_start_offset()/yajl_get_end_offset() for parser callbacks

nickd4 commented Nov 6, 2018

Note that I wasn't experienced with pull requests when I filed this, and I seem to have linked to my experimental repository, which contains many unrelated changes. To clarify, the proposed change touches only the one or two lexer functions discussed in the original post.

I'll isolate those changes and link the pull request later. It does not seem to matter at the moment, since lloyd is inactive, as many people have observed.

I am planning to take over maintainership of this project, by informally making a version available that contains (in general) the pull requests filed in this repo. However, it is a large project since there are many pull requests and I'm not sure that I can validate all of them, e.g. those relating to MSVC or embedded applications of the parser. So I plan to start the yajl maintainership project in a few months when I have a bit more time. If anyone is interested, please mail me: nick "AT" ndcode "DOT" org.
