Add support for non-blocking ("async") JSON parsing #57

Closed
cowtowncoder opened this issue Feb 5, 2013 · 39 comments

@cowtowncoder
Member

(migrated from http://jira.codehaus.org/browse/JACKSON-39 -- note, high vote count)


(suggested by Dimitri M on user list)

There are use cases where it'd be good to be able to feed input to parser, instead of trying to provide an input stream for parser to read from. This would cover use cases where input comes in chunks; for example, part of logical document in one chunk, then after a delay (perhaps in a separate request) another one and so forth. In these cases it may be difficult to implement InputStream (Reader etc) abstraction; instead, it would be better if application could feed (push) data to parser.
But if so, parser must be able to indicate cases where no data is YET available (but may become available).

This is similar to how Aalto Xml processor (http://www.cowtowncoder.com/hatchery/aalto/index.html) operates in its async mode. However, since JSON is a much simpler format than XML, implementation might be simpler.

Based on my experiences with Aalto, implementation is a non-trivial thing however. One problem is that even UTF-8 decoding needs to be somewhat aware of chunk boundaries, so in the end a separate parser may be required: this is because the current parser uses blocking to handle these split cases. A smaller problem is that of indicating the "not-yet-available" case – this can probably be handled by introducing a new member in the JsonToken enumeration.
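To illustrate the chunk-boundary problem mentioned above: a multi-byte UTF-8 character can be split between two chunks, so each chunk cannot simply be decoded on its own. The following is a minimal sketch using the JDK's `CharsetDecoder`; the class name `ChunkedUtf8Decoder` is made up for this example and is not part of Jackson or Aalto.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not Jackson code): decodes UTF-8 that arrives in arbitrary chunks.
class ChunkedUtf8Decoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // Bytes of an incomplete multi-byte sequence left over from the previous chunk.
    private ByteBuffer pending = ByteBuffer.allocate(0);

    String feed(byte[] chunk) {
        // Prepend the leftover bytes so a split character becomes whole again.
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
        in.put(pending).put(chunk);
        in.flip();
        // One byte never yields more than one char, so this capacity is sufficient.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        // endOfInput=false: a trailing partial sequence is left unread, not reported as an error.
        CoderResult result = decoder.decode(in, out, false);
        if (result.isError()) {
            throw new IllegalArgumentException("malformed UTF-8 input: " + result);
        }
        // Keep whatever the decoder could not consume yet for the next call.
        pending = ByteBuffer.allocate(in.remaining());
        pending.put(in);
        pending.flip();
        out.flip();
        return out.toString();
    }
}
```

Feeding the bytes `6E C3` and then `A9` yields "n" followed by "é", whereas decoding each chunk independently with `new String(chunk, UTF_8)` would produce a replacement character for the split sequence.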

@chhetri

chhetri commented Jun 7, 2013

+1. Please implement it if you can.

@cowtowncoder
Member Author

Heh. Unless I have personal need, or someone pays me to do it, I doubt I'll ever work on this.

But perhaps someone else has the itch. Always willing to help.

@dpratt

dpratt commented Jun 21, 2013

+1

I'm using Aalto right now to parse XML - I've bolted it onto an Akka IO/Scala Delimited Continuations framework that I wrote that allows you to access the API as if it were regular blocking StAX code. I'd love to be able to migrate our JSON handling to the same framework.

@cowtowncoder
Member Author

For what it is worth, I did start writing async parser for Smile backend, as that would be marginally easier.

@dpratt I'd be very interested in learning more about your use of Aalto, including feedback, challenges, things you learnt. Maybe you could send email to my gmail address (tsaloranta)? This could even eventually help with JSON equivalent, if things worked out well. What I really need more than anything else is someone to collaborate with, to make it less likely I write something that does not get used.

@akaigoro

akaigoro commented Aug 5, 2013

I wrote a simple JSON non-blocking parser (as a part of a parser for Javon - Json's extension): https://github.com/rfqu/Javon . I can help to embed it into Jackson codebase.

@cowtowncoder
Member Author

We are always happy to accept contributions, if you want to tackle this problem. In case of Jackson, interface would need to be via JsonParser, to work together with other components (although there is a significant problem with databinding anyway...).

@akaigoro

akaigoro commented Aug 6, 2013

If you mean com.fasterxml.jackson.core.JsonParser, then it is technically impossible. It contains methods like nextToken(), which would block until input data are available, destroying the idea of non-blocking parser. The interface has to be turned inside out: parser should have method putChar(char nextChar), and transfer parsed data to a next stage which has interface similar to JsonGenerator.

@dpratt

dpratt commented Aug 6, 2013

Why not have nextToken return the (already defined and standard Jackson
token type) NOT_AVAILABLE when more input is required?


@akaigoro

akaigoro commented Aug 6, 2013

It is defined, but not used. And I cannot see an easy way to use it: when should the consumer wake up and try to read the next token again? It should be notified in some way, but then it's natural to pass the new token with the notification. Then, how do you think the consumer should save and restore its state if it calls nextToken() from recursive methods?

@dpratt

dpratt commented Aug 6, 2013

I've used the same pattern quite successfully using Aalto - basically, you
have to build up an Iteratee-like structure that parses a non-blocking
stream of events. Then, you have another bit that feeds the parser on
demand (when bytes are available), translates those to tokens, and feeds
the iteratee with the tokens. At each feed step, the iteratee either
signals that it's done or returns an object with the capability to consume
another token. Using these primitives, you can build up some very
sophisticated parsers - I'm about to publish a library that does this for
XML (using Aalto). The equivalent for JSON would actually be quite a bit
more simple to implement.
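The Iteratee-like structure described above can be sketched as follows. This is only an illustration of the general idea (each feed step either completes or returns something that can consume the next token); the types `Step`, `Done`, `CollectUntil` and `IterateeDemo` are hypothetical and do not come from Aalto, Jackson, or the library mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
interface Step<T, R> {
    boolean isDone();
    R result();               // valid only when isDone() is true
    Step<T, R> feed(T token); // valid only when isDone() is false
}

// Terminal step: holds the final result and accepts no more tokens.
final class Done<T, R> implements Step<T, R> {
    private final R value;
    Done(R value) { this.value = value; }
    public boolean isDone() { return true; }
    public R result() { return value; }
    public Step<T, R> feed(T token) { throw new IllegalStateException("already done"); }
}

// Collects tokens until the terminator is seen, then completes with the collected list.
final class CollectUntil implements Step<String, List<String>> {
    private final String terminator;
    private final List<String> seen;
    CollectUntil(String terminator, List<String> seen) {
        this.terminator = terminator;
        this.seen = seen;
    }
    public boolean isDone() { return false; }
    public List<String> result() { throw new IllegalStateException("not done yet"); }
    public Step<String, List<String>> feed(String token) {
        if (token.equals(terminator)) {
            return new Done<>(seen);
        }
        List<String> next = new ArrayList<>(seen);
        next.add(token);
        return new CollectUntil(terminator, next);
    }
}

final class IterateeDemo {
    // Feeds whatever tokens are currently available; the returned step carries all state,
    // so the caller can come back later with more tokens.
    static <T, R> Step<T, R> feedAll(Step<T, R> step, Iterable<T> tokens) {
        for (T token : tokens) {
            if (step.isDone()) break;
            step = step.feed(token);
        }
        return step;
    }
}
```

Because the state lives in the returned `Step` rather than on the call stack, the consumer never has to block in the middle of a recursive method while waiting for input.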


@cowtowncoder
Member Author

@rfqu Please have a look at Aalto, like @dpratt suggested, if you have time -- yes, minor modification is needed, but JsonToken already has NOT_AVAILABLE (I did anticipate need for it); and the other part for feeding input is not a big addition. I understand that you are thinking of push style interface (SAX); but while this can be useful for some use cases, there is not much value for adding such an interface in Jackson in my opinion. For a new stand-alone library it is not a bad way to go of course, and doing it that way is slightly easier for implementors. But not so much for users.

@dpratt Yes, exactly! There are challenges at higher levels, so data-binding would need to change a lot, from pull- to push model most likely. But at core level it is doable.

@cowtowncoder
Member Author

@rfqu On how caller knows more is available: typically (at least with Aalto), caller both feeds new data and iterates; so it will feed a chunk of bytes, then iterate over tokens until there are no more available. Then get new data (possibly waiting, via NIO callbacks).

I wrote a bit about using Aalto in non-blocking mode 2.5 years ago, see:

http://www.cowtowncoder.com/blog/archives/2011/03/entry_451.html
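The feed-then-iterate cycle described above might look roughly like the sketch below. It uses a toy tokenizer that understands only arrays of non-negative integers; `ToyAsyncTokenizer`, its `Token` enum and all method names are assumptions made up for this example, not Jackson or Aalto API. The point is the shape of the caller's loop: pull tokens until `NOT_AVAILABLE`, then feed more bytes (or signal end of input) and continue.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer for illustration only: handles '[', ']', ',', whitespace and
// non-negative integers. All names here are hypothetical.
class ToyAsyncTokenizer {
    enum Token { START_ARRAY, END_ARRAY, NUMBER, NOT_AVAILABLE, END_OF_INPUT }

    private byte[] buf = new byte[0];
    private int pos = 0;
    private boolean endOfInput = false;
    // Digits of a number that may continue in the next chunk.
    private final StringBuilder digits = new StringBuilder();
    private long numberValue;

    void feedInput(byte[] chunk) {
        if (pos < buf.length) throw new IllegalStateException("previous chunk not fully consumed");
        buf = chunk;
        pos = 0;
    }

    void endOfInput() { endOfInput = true; }

    long numberValue() { return numberValue; }

    Token nextToken() {
        while (pos < buf.length) {
            char c = (char) buf[pos];
            if (c >= '0' && c <= '9') {
                digits.append(c);
                pos++;
                continue;
            }
            if (digits.length() > 0) {
                // A non-digit ends the pending number; the delimiter is handled on the next call.
                return finishNumber();
            }
            pos++;
            if (c == '[') return Token.START_ARRAY;
            if (c == ']') return Token.END_ARRAY;
            if (c == ',' || c == ' ' || c == '\n' || c == '\r' || c == '\t') continue;
            throw new IllegalStateException("unexpected character: " + c);
        }
        // Out of bytes: either wait for more, or flush what we have at end of input.
        if (!endOfInput) return Token.NOT_AVAILABLE;
        return digits.length() > 0 ? finishNumber() : Token.END_OF_INPUT;
    }

    private Token finishNumber() {
        numberValue = Long.parseLong(digits.toString());
        digits.setLength(0);
        return Token.NUMBER;
    }
}

class FeedLoopDemo {
    static List<String> parse(String... chunks) {
        ToyAsyncTokenizer p = new ToyAsyncTokenizer();
        List<String> events = new ArrayList<>();
        int next = 0;
        while (true) {
            ToyAsyncTokenizer.Token t = p.nextToken();
            if (t == ToyAsyncTokenizer.Token.END_OF_INPUT) break;
            if (t == ToyAsyncTokenizer.Token.NOT_AVAILABLE) {
                // In a real application this is where control would return to the I/O
                // layer (for example an NIO callback) until more bytes arrive.
                if (next < chunks.length) {
                    p.feedInput(chunks[next++].getBytes(StandardCharsets.US_ASCII));
                } else {
                    p.endOfInput();
                }
                continue;
            }
            events.add(t == ToyAsyncTokenizer.Token.NUMBER ? "NUMBER(" + p.numberValue() + ")" : t.name());
        }
        return events;
    }
}
```

For example, `FeedLoopDemo.parse("[1, 2", "2, 333]")` reports the number 22 as a single token even though its digits arrive in two different chunks.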

@dpratt

dpratt commented Aug 6, 2013

I can say that having at the very least a JsonParser implementation that
only implemented nextToken() (and perhaps just returns defaults for all the
other methods) would be incredibly useful for me. I have several cases in
which I need to parse very large JSON documents (on the order of 8-15 megs
in memory as a raw byte stream) and the ability to do it incrementally
would be massively helpful.

We could even implement a JsonParser2 interface that extends JsonParser and
adds the push methods to consume new byte arrays.


@akaigoro

akaigoro commented Aug 8, 2013

I looked at Aalto and found that, to save its state, the parser defines several dozen constants. It is no fun to program that way.
Then, your JsonParser delivers a stream of tokens and so is actually a scanner. Scanners work closely with parsers, and their interaction should not bother their user. You, however, ask me to implement a scanner alone against a given interface while the corresponding parser is also not implemented, and I am not sure the proposed interface is the right choice.
In short, I was not going to make a contribution to your project, and you did not inspire me. I am still ready to help if you decide to use my code, but I don't want to restructure it myself to follow your style.

@cowtowncoder
Member Author

@rfqu You are free not to contribute, I was just outlining what kind of contribution would make sense, based on YOUR offer to help. I don't know where you get that pissy attitude however; that is unnecessary.
My job is not to "inspire" you; this is voluntary collaboration (or not) of peers. While I appreciate your offer for the code, it is very important that any code fits within general design of library. This does not seem to be the case here.

I wish you best of luck with your projects however; and given that it can do non-blocking parsing the way you like it, I hope interested users find it.

As to Aalto: it is just a (conceptually!) simple state machine; and if there was a state machine generator to use, it'd be much simpler to write. I don't greatly care if something is simple for implementors to write; pull-style is much easier for users to use and that is what counts most to me. But as I said, you are free to explore other options that are more to your liking.

@dpratt

dpratt commented Aug 9, 2013

@cowtowncoder to pop the stack on this conversation, I just wanted to insert that I've been really impressed with Aalto so far. It's super fast, and works exactly the way I want it to. I've actually given up on my original idea of using continuations with an XMLStreamReader shim - Scala's implementation of continuations imposes too much of a burden on clients, and it was really hard to make something that I'd be willing to use as a day-to-day parser. I've gone a different track of using a variant of Iteratees. It works really well and is fairly speedy, but it still suffers from a few performance issues, and I don't know if any of them are really addressable.

  • One of the nice things about XMLStreamReader is that it has a minimal amount of object allocations - the raw byte streams aren't translated to strings until you actually ask for the values of either text elements or attributes. Unfortunately, when I'm generating XML events for an Iteratee, I sort of have to implement a lightweight version of what XMLEventReader does, since I don't know if clients will ever ultimately need the information at each token or not. The actual parser may or may not consume the token until after the underlying stream reader has moved on. This means that there's a bunch of extra likely unneeded object allocations on every parser event. Ideally, what I'd love to do is just have the ability to grab the raw byte[] or char[] subsequence for the entire token and propagate that up, and only convert it to objects when asked. Do you have any plans to make an XMLEventReader implementation for aalto?
  • On the JSON side, I'm considering writing a simple scanner for JSON initially that uses a Decoder to translate from incremental byte[] to char[], and then scanning the char[] streams as they come in. Later on, if needed I can optimize the UTF-8 byte[] -> char[] conversion in a domain specific way. I'll keep you posted on my progress. I heavily suspect that anything I do won't be up to the legendary performance levels of the actual Jackson tokenizer, but I'd like to at least get something working before I go down the optimization rabbit hole.

Since you've done this before, is there anything in specific I should be concerned about w.r.t. processing byte[] streams?

@akaigoro

akaigoro commented Aug 9, 2013

I split the Javon project into Javon itself and an independent pure JSON project, which is now at https://github.com/rfqu/df-codec
@dpratt I believe https://github.com/rfqu/df-codec/tree/master/json/src/com/github/rfqu/codec/json/pushparser is what you are looking for. I am ready to add any features you find worth adding.

@cowtowncoder
Member Author

@dpratt I think we should continue discussion (very good feedback btw) at Aalto users list at http://tech.groups.yahoo.com/group/aalto-xml-interest/
Also please do file an RFE for event reader implementation; it might be easy enough to do, and if so there might be performance improvements. And if not, at least more convenience.

@rfqu Thank you for the link & good luck -- I honestly think it is good to have different impls, approaches and wish you good luck with your work on Javon.

@testn

testn commented Aug 17, 2015

Is there a way to get this supported? I saw something implemented in jackson-smile.

@cowtowncoder
Member Author

@testn Yes, by someone with lots of time to spend on implementing and testing it. Smile unfortunately only has skeleton (or, scaffolding), not full implementation. I know how it can be done (see aalto-xml for earlier implementation I did for xml), but at this point do not have time or personal need, nor a customer willing to pay for development costs. But anyone else is free to have a stab with it, it is definitely doable.

@testn

testn commented Aug 17, 2015

Can you describe a bit how it should be done? Maybe I will take a stab at it.

@cowtowncoder
Member Author

The way I did it with Aalto:

https://github.com/FasterXML/aalto-xml

was to basically do two things:

  1. Provide non-blocking content feeder interface, to use instead of blocking InputStream or Reader
  2. Implement state-machine driven parser/decoder to keep byte-accurate state of things; and return NOT_AVAILABLE token if there isn't enough content to FULLY decode the event (unlike with blocking, there is no option of lazily decoding things)

Of these, (1) is trivially simple, although I had to rewrite it a bit to support alternatives like feeding byte[] vs feeding ByteBuffer. Still, it just needs to be accessible by caller (code that instantiates JsonParser), and allow doing three things: checking whether decoder has more data to work on, and if not, either feeding more data, or indicating that no more data is available (end of input).
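As a rough sketch of part (1), a feeder only needs the three operations listed above. The interface and class below are hypothetical illustrations written for this thread; the names `InputFeeder` and `ByteArrayFeederSketch` and the exact signatures are assumptions, not the API that Aalto or Jackson exposes.

```java
// Hypothetical interface sketching the three operations a caller needs.
interface InputFeeder {
    // True when the decoder has consumed everything fed so far and input has not ended.
    boolean needMoreInput();
    // Hands the decoder the bytes data[offset .. offset+length-1]; the array is not copied.
    void feedInput(byte[] data, int offset, int length);
    // Signals that no more data will ever arrive.
    void endOfInput();
}

// Minimal byte[]-backed implementation; the decoder side reads via nextByte().
class ByteArrayFeederSketch implements InputFeeder {
    private byte[] data = new byte[0];
    private int pos;
    private int end;
    private boolean closed;

    public boolean needMoreInput() {
        return pos >= end && !closed;
    }

    public void feedInput(byte[] newData, int offset, int length) {
        if (closed) throw new IllegalStateException("input already ended");
        if (pos < end) throw new IllegalStateException((end - pos) + " bytes still undecoded");
        data = newData;
        pos = offset;
        end = offset + length;
    }

    public void endOfInput() {
        closed = true;
    }

    // Decoder-side accessors.
    int available() { return end - pos; }

    boolean isClosed() { return closed; }

    byte nextByte() {
        if (pos >= end) throw new IllegalStateException("no input available");
        return data[pos++];
    }
}
```

Refusing a new chunk while old bytes are still undecoded keeps the sketch free of buffer copying; a variant that accepts `ByteBuffer` input would follow the same three-operation shape.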

Second part is the complicated part: there needs to be state associated with every possible point in decoding, at byte-accurate level. Not just within tokens, but since this has to assume multi-byte UTF-8 decoding as well, within UTF-8 characters. An alternative could be to separate UTF-8 decoding, and this might simplify things a bit, but with some performance overhead. If so, parser would work with chars, which is slightly easier (but not a whole lot, due to surrogate characters).

Regardless, amount of state to track with JSON is less than with XML, so that's a slight simplification.
In addition whereas in XML you can return partial text segments, in JSON String values are atomic (at least via Jackson tokens).
For state itself, it is not enough to just have a state (or, some sort of state/sub-state, main state being current token, sub-state within that context), but also there's need to track some of accumulated content, such as partially decoded UTF-8 character (if any). And possibly decoded value; although here existing components like TextBuffer may work with no modifications.
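To make the "state plus accumulated content" point concrete, here is a minimal hand-written sketch of the innermost layer only: a UTF-8 decoder whose state (how many continuation bytes are still expected) and accumulated content (the partially decoded code point) survive between chunks. The class name `Utf8StateMachine` is made up for this example, and the sketch deliberately skips checks a production decoder needs, such as rejecting overlong encodings.

```java
// Hypothetical illustration: resumable UTF-8 decoding with explicit state.
class Utf8StateMachine {
    // Sub-state: number of continuation bytes still expected (0 means between characters).
    private int bytesNeeded;
    // Accumulated content: bits of the code point decoded so far.
    private int pendingCodePoint;

    // Decodes as many complete characters as the chunk allows and appends them to out.
    // A character cut off at the end of the chunk stays in the fields above.
    void feed(byte[] chunk, StringBuilder out) {
        for (byte raw : chunk) {
            int b = raw & 0xFF;
            if (bytesNeeded == 0) {
                if (b < 0x80) {
                    out.append((char) b);           // single-byte (ASCII) character
                } else if ((b & 0xE0) == 0xC0) {
                    pendingCodePoint = b & 0x1F;    // lead byte of a 2-byte sequence
                    bytesNeeded = 1;
                } else if ((b & 0xF0) == 0xE0) {
                    pendingCodePoint = b & 0x0F;    // lead byte of a 3-byte sequence
                    bytesNeeded = 2;
                } else if ((b & 0xF8) == 0xF0) {
                    pendingCodePoint = b & 0x07;    // lead byte of a 4-byte sequence
                    bytesNeeded = 3;
                } else {
                    throw new IllegalArgumentException("invalid UTF-8 lead byte: " + b);
                }
            } else {
                if ((b & 0xC0) != 0x80) {
                    throw new IllegalArgumentException("invalid UTF-8 continuation byte: " + b);
                }
                pendingCodePoint = (pendingCodePoint << 6) | (b & 0x3F);
                if (--bytesNeeded == 0) {
                    // appendCodePoint emits a surrogate pair for supplementary characters.
                    out.appendCodePoint(pendingCodePoint);
                }
            }
        }
    }

    boolean inMiddleOfCharacter() {
        return bytesNeeded > 0;
    }
}
```

A full non-blocking JSON parser would need the same kind of bookkeeping one level up as well: which token is being decoded, how far into it the decoder is, and the text accumulated so far.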

So... yeah, it is a bit involved. I'm sure there are other approaches too, but in general state machine approach should work well. Others would probably use a state machine library or compiler; it could simplify the task. I am not sure whether it would, but since you'd be starting from scratch it might make sense.

@wtracy

wtracy commented May 15, 2016

Hello,

I'm interested in asynchronous I/O too, but from a different angle. I still want the traditional synchronous Jackson API, but internally I want Jackson to do something conceptually like this:

Token getNextToken() {
  startReadingDataIntoBuffer();
  blockUntilWeCanReadAToken();
  //data continues streaming into buffer in the background
  //until buffer is full or EOF is reached
  return parseAToken();
}

I guess my big question is: Could this pattern make Jackson faster by better saturating whatever I/O connection I'm using? I'm trying to stream some fairly large JSON files, and my gut feeling is that I'm not saturating my data connection because of the time the CPU spends on parsing between read() calls.

Is there anyone around more knowledgeable than me who can comment on both whether my hunch is correct, and, if so, whether any possible performance gains are likely to be worth the engineering effort? If anyone has any ideas for resources I could look at (I already spent half an hour with Google without luck) or benchmarks I could write to answer my own question, I'm all ears.

William Tracy

@akaigoro

@wtracy as far as I understand, you want to continue loading data while parsing. That is, to load data and parse it in parallel. If so, the simplest way is to start a separate thread for data loading, and to use a circular buffer to connect the loading and parsing threads. Probably, the loading thread should also do token recognition and fill the buffer with tokens, not characters.
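A minimal sketch of that arrangement, assuming a plain `InputStream` as the data source: a loader thread reads chunks and hands them to the parsing thread through a bounded queue. It uses the JDK's `ArrayBlockingQueue` as the bounded buffer and passes raw byte chunks rather than tokens; the class `BackgroundLoader` and its `pump` method are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical illustration: overlap reading and parsing with a loader thread.
class BackgroundLoader {
    // Sentinel marking end of input; compared by identity, never by content.
    private static final byte[] EOF = new byte[0];

    static void pump(InputStream in, int chunkSize, int capacity, Consumer<byte[]> parser) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(capacity);
        Thread loader = new Thread(() -> {
            try {
                byte[] buf = new byte[chunkSize];
                int n;
                while ((n = in.read(buf)) != -1) {
                    // put() blocks when the queue is full, so the loader cannot run
                    // arbitrarily far ahead of the parser.
                    queue.put(Arrays.copyOf(buf, n));
                }
                queue.put(EOF);
            } catch (IOException | InterruptedException e) {
                // Sketch only: a real implementation must report the failure to the consumer.
                queue.clear();
                queue.offer(EOF);
            }
        });
        loader.setDaemon(true); // do not keep the JVM alive if the parser fails
        loader.start();
        try {
            // take() blocks until the loader has produced the next chunk.
            for (byte[] chunk = queue.take(); chunk != EOF; chunk = queue.take()) {
                parser.accept(chunk);
            }
            loader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for input", e);
        }
    }
}
```

Whether this actually improves throughput depends on the workload and would need to be measured.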

@cowtowncoder
Member Author

@wtracy isn't that just like the existing regular blocking parser? This is how InputStreams work: they block until some content is available, and always return at least one byte. JsonParser, in turn, reads enough content to decode one token (although it may buffer more) and decodes that.
So I am not sure what actual change you would be suggesting.

Now, I am guessing like @rfqu that you may be wishing to continue loading in the background. If so, I think that multi-threaded background reading should reside outside Jackson core, and abstracted out behind InputStream. Synchronization could be handled by usual blocking behavior. I would not count on this actually helping a lot, unless you have significant parallelism (thousands of threads); my experience (and also what I have read over time) has been that blocking I/O is pretty efficient at what it does for low to medium concurrency, at least to low hundreds of threads. Non-blocking or pipelined I/O starts to make sense when there are thousands of connections with long(er) lifetime but with bursty/low-average throughput at any given point.

@wtracy

wtracy commented May 16, 2016

@cowtowncoder you've convinced me that I probably based this whole idea on
an incorrect concept of how BufferedInputStream (or whatever Jackson uses
internally) works. I was concerned that it was loading a burst of data on
the first read() call, then letting the hardware sit idle until another
read() call requested more data than was left in the buffer.

I think I should at least spend some more time understanding the internals
of InputStream and friends before I try to pursue this any farther. Thanks
for taking the time to talk to me, everybody.

William

@cowtowncoder
Member Author

Quick note: I am working on non-blocking parsing for Smile format, and hope to follow it up here for JSON, to be included in 2.9.0.

@aduchate

Actson: https://github.com/michel-kraemer/actson is an async implementation of a JSON parser

@cowtowncoder
Member Author

Thank you for sharing this. API looks similar, probably due to common roots via Aalto xml parser.
Makes sense to me.

@cowtowncoder
Member Author

Fwtw, Smile codec has fully functioning non-blocking implementation; JsonParser now exposes necessary methods (for byte array -backed input).

@sdeleuze

sdeleuze commented May 29, 2017 via email

@cowtowncoder
Member Author

@sdeleuze I am working on JSON async right now, and one thing I have to decide is whether to release one more pr (2.9.0.pr4), or take my chances with 2.9.0 final. If I get to push pr4 within a week or so (hopefully next weekend; probably no sooner), with json non-blocking, would that allow you to test it? Ideally I would have last pr as close as possible to eventual release, with only smaller fixes and new features. But there is some to releasing too.

@sdeleuze

sdeleuze commented May 31, 2017 via email

@cowtowncoder
Member Author

And with latest commit, non-blocking parser now works well enough to pass jackson-benchmarks test, which suggests speed is actually within 5% of blocking-parser's speed!

I still need to work a bit on non-standard features (comments, various quoting alternatives), as well as just porting more tests. But things are looking good.

@cowtowncoder cowtowncoder changed the title from Add support for non-blocking ("async") parsing to Add support for non-blocking ("async") JSON parsing on Jun 3, 2017
@cowtowncoder cowtowncoder added this to the 2.9.0.pr4 milestone Jun 3, 2017
cowtowncoder added a commit that referenced this issue Jun 3, 2017
…complete wrt non-standard features, but functional
@cowtowncoder
Member Author

@sdeleuze Finally got pr4 out (should be at Maven Central now)

@sdeleuze

sdeleuze commented Jun 17, 2017 via email

@cowtowncoder
Member Author

@sdeleuze Sounds good -- I tried to add reasonable testing, but that'd only help verify it works the way I want to, not that it is good or fit :)

@sdeleuze

@cowtowncoder We are going to try to leverage this during the week and send you feedback cc @poutsma.

@mmimica

mmimica commented Aug 1, 2017

Using actson to provide async parsing for jackson: https://github.com/mmimica/async-jackson


9 participants