
Lazy Parsing #33

Open
zekronium opened this issue Dec 16, 2023 · 9 comments

@zekronium

Hi,

My understanding is that the key benefit of SIMD is that we can "progressively" parse a stream of JSON, hence the tape-like reading implementation.

However, in benchmarks with varying stop points, i.e. parsing only 20/50/80% of the JSON and exiting early, the throughput seems to be almost the same and correlates directly with the size of the JSON.

If I pre-pad the array the way the library does, I get a more realistic result, with throughput varying depending on how deep the parsing goes, but the overall throughput still stays roughly the same. Is there a lot of pre-parsing going on?
[Screenshot: benchmark results chart, 2023-12-15]

The bars are different sizes of JSON.
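
For context, the early-exit benchmark described above can be sketched roughly as follows. This is a minimal JMH-style sketch, not the benchmark's actual code; the `org.simdjson.SimdJsonParser`/`JsonValue` calls are assumed from the library's README and the snippets later in this thread, and `twitter.json` and the `stopAfter` parameter are illustrative only.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.simdjson.JsonValue;
import org.simdjson.SimdJsonParser;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

@State(Scope.Benchmark)
public class EarlyExitBenchmark {

    @Param({"0.2", "0.5", "0.8"})   // stop after visiting 20/50/80% of the array
    double stopAfter;

    byte[] json;
    SimdJsonParser parser;
    int limit;

    @Setup
    public void setup() throws IOException {
        json = Files.readAllBytes(Paths.get("twitter.json")); // example input file
        parser = new SimdJsonParser();

        // Count the array elements once so the benchmark can stop at a fraction of them.
        JsonValue doc = parser.parse(json, json.length);
        Iterator<JsonValue> it = doc.get("statuses").arrayIterator();
        int total = 0;
        while (it.hasNext()) {
            it.next();
            total++;
        }
        limit = (int) (total * stopAfter);
    }

    @Benchmark
    public int parseAndExitEarly() {
        // parse() processes the whole document regardless of how far the traversal
        // below goes; only the walk over the resulting values stops early.
        JsonValue doc = parser.parse(json, json.length);
        Iterator<JsonValue> tweets = doc.get("statuses").arrayIterator();
        int visited = 0;
        int defaults = 0;
        while (tweets.hasNext() && visited < limit) {
            JsonValue user = tweets.next().get("user");
            if (user.get("default_profile").asBoolean()) {
                defaults++;   // touch a nested field so the work isn't optimized away
            }
            visited++;
        }
        return defaults;
    }
}
```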

@piotrrzysko
Member

Hi, currently, lazy parsing is not supported. However, I've been thinking about it, and it will likely be the next feature that I add to the library.

There are at least two approaches to lazy/limited parsing:

1. Schema-based, where we define a POJO class with the fields that are important to us, and we skip parsing those that are not defined in the class:

   ```java
   List<Tweet> tweets = parser.parse(bytes);
   ```

2. On-demand, where we parse fields only when they are accessed:

   ```java
   Iterator<JsonValue> tweets = jsonValue.get("statuses").arrayIterator();
   while (tweets.hasNext()) {
       JsonValue tweet = tweets.next();
       JsonValue user = tweet.get("user");
       if (user.get("default_profile").asBoolean()) { // parse the default_profile field here
           System.out.println(user.get("screen_name").asString());
       }
   }
   ```

I'd like to support both approaches, but I don't know yet which will be implemented first.
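
To make option 1 above concrete, the schema-based approach would pair `parser.parse(bytes)` with user-defined classes that declare only the fields of interest. A hypothetical sketch (the `Tweet`/`User` classes and their fields are illustrative, not an existing API):

```java
// Hypothetical schema classes for option 1: only the declared fields would be
// materialized by the parser; everything else in the input could be skipped.
class User {
    public String screen_name;
    public boolean default_profile;
}

class Tweet {
    public String created_at;
    public User user;
    // fields not declared here (entities, geo, ...) would not be parsed
}

// Usage would then mirror the snippet above:
// List<Tweet> tweets = parser.parse(bytes);
```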

@zekronium
Author

There are already a lot of libraries that parse to POJOs well, handle missing fields, and tune and optimize parsing in many different ways. That doesn't mean they beat SIMD, of course, but for me SIMD stands for the ability to parse a large file/JSON in a streaming manner. Apart from jsoniter, not many libraries can do that, and especially not this fast.

I think a key point about stream parsing is that InputStream must be supported, since in many frameworks, especially high-performance ones, accessing a byte[] might be expensive or even impossible (e.g. Netty's UnsafeBuffer), so that must be taken into account.
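
As a hedged illustration of that constraint (not part of the library): when the payload sits in a Netty `ByteBuf`, obtaining a byte[] for the parser may require a copy, which is exactly the overhead a streaming/InputStream API could avoid. The helper name `toByteArray` is made up for this example.

```java
import io.netty.buffer.ByteBuf;

final class ByteBufBytes {

    // Sketch: extracting a byte[] from a Netty buffer before handing it to the parser.
    static byte[] toByteArray(ByteBuf buf) {
        int length = buf.readableBytes();
        if (buf.hasArray()) {
            // Heap buffer: a backing array exists, but it may be shared and offset,
            // so a defensive copy of just the readable region is taken here.
            byte[] copy = new byte[length];
            System.arraycopy(buf.array(), buf.arrayOffset() + buf.readerIndex(), copy, 0, length);
            return copy;
        }
        // Direct (off-heap) buffer: there is no backing array at all,
        // so an explicit copy into heap memory is unavoidable.
        byte[] copy = new byte[length];
        buf.getBytes(buf.readerIndex(), copy);
        return copy;
    }
}
```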

@piotrrzysko
Member

Okay, so you're referring to the feature described in #19, rather than on-demand field parsing. Nevertheless, both of these features are important, and I have them on the roadmap. However, I don't know yet when they will be delivered.

@piotrrzysko
Member

Hi @zekronium. Would you mind sharing the code of your benchmark?

@zekronium
Author

I can invite you to the repository

@piotrrzysko
Member

OK, so please add me to the repository.

@zekronium
Author

Invite sent

@zekronium
Author

Did it help? Have you also noticed in my benchmark, where I serialize to a map manually and build the full structure, how much slower it actually is compared to the same task in fastjson or even Jackson?
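
For reference, "serialize to a map manually and build the full structure" means eagerly copying every node of the parsed document into Java collections, roughly like the sketch below. It assumes `JsonValue` exposes `objectIterator()`/`isObject()`/`isLong()`-style accessors alongside the calls shown earlier in this thread; treat those exact names as assumptions rather than the library's confirmed API.

```java
import org.simdjson.JsonValue;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class JsonToMap {

    // Recursively copies a parsed JsonValue into plain Java collections.
    // Building the full structure like this touches every node, so it gives up
    // most of the benefit of lazy/partial traversal.
    static Object toJava(JsonValue value) {
        if (value.isObject()) {                        // accessor names are assumptions
            Map<String, Object> map = new HashMap<>();
            Iterator<Map.Entry<String, JsonValue>> fields = value.objectIterator();
            while (fields.hasNext()) {
                Map.Entry<String, JsonValue> field = fields.next();
                map.put(field.getKey(), toJava(field.getValue()));
            }
            return map;
        }
        if (value.isArray()) {
            List<Object> list = new ArrayList<>();
            Iterator<JsonValue> elements = value.arrayIterator();
            while (elements.hasNext()) {
                list.add(toJava(elements.next()));
            }
            return list;
        }
        if (value.isString()) {
            return value.asString();
        }
        if (value.isBoolean()) {
            return value.asBoolean();
        }
        if (value.isNull()) {
            return null;
        }
        if (value.isLong()) {
            return value.asLong();
        }
        return value.asDouble();                       // numeric fallback for this sketch
    }
}
```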

@piotrrzysko
Member

It did, although I haven't started working on this yet. Regarding serializing to a map, I'll look into it.
