
Lazy Parsing #33

Open
zekronium opened this issue Dec 16, 2023 · 9 comments

@zekronium

Hi,

My understanding is that the key benefit of SIMD is that we can "progressively" parse a stream of JSON, hence the tape-like reading implementation.

However, in benchmarks with varying stop points, i.e. parsing only 20/50/80% of the JSON and exiting early, the throughput seems to be almost the same and correlates directly with the size of the JSON.

If I pre-pad the array the way the library does, I get a more realistic result, with throughput varying depending on how deep the parsing goes, but the overall throughput still stays roughly the same. Is there a lot of pre-parsing going on?
[Screenshot: benchmark results chart, 2023-12-15]

The bars are different sizes of JSON.
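
For context, the early-exit benchmark described above can be sketched roughly as follows. This is a minimal JMH-style sketch, not the benchmark's actual code; the `org.simdjson.SimdJsonParser`/`JsonValue` calls are assumed from the library's README and the snippets later in this thread, and `twitter.json` and the `stopAfter` parameter are illustrative only.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.simdjson.JsonValue;
import org.simdjson.SimdJsonParser;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

@State(Scope.Benchmark)
public class EarlyExitBenchmark {

    @Param({"0.2", "0.5", "0.8"})   // stop after visiting 20/50/80% of the array
    double stopAfter;

    byte[] json;
    SimdJsonParser parser;
    int limit;

    @Setup
    public void setup() throws IOException {
        json = Files.readAllBytes(Paths.get("twitter.json")); // example input file
        parser = new SimdJsonParser();

        // Count the array elements once so the benchmark can stop at a fraction of them.
        JsonValue doc = parser.parse(json, json.length);
        Iterator<JsonValue> it = doc.get("statuses").arrayIterator();
        int total = 0;
        while (it.hasNext()) {
            it.next();
            total++;
        }
        limit = (int) (total * stopAfter);
    }

    @Benchmark
    public int parseAndExitEarly() {
        // parse() processes the whole document regardless of how far the traversal
        // below goes; only the walk over the resulting values stops early.
        JsonValue doc = parser.parse(json, json.length);
        Iterator<JsonValue> tweets = doc.get("statuses").arrayIterator();
        int visited = 0;
        int defaults = 0;
        while (tweets.hasNext() && visited < limit) {
            JsonValue user = tweets.next().get("user");
            if (user.get("default_profile").asBoolean()) {
                defaults++;   // touch a nested field so the work isn't optimized away
            }
            visited++;
        }
        return defaults;
    }
}
```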

@piotrrzysko
Member

Hi, currently, lazy parsing is not supported. However, I've been thinking about it, and it will likely be the next feature that I add to the library.

There are at least two approaches to lazy/limited parsing:

1. Schema-based, where we define a POJO class with the fields that are important to us, and we skip parsing those that are not defined in the class:

   ```java
   List<Tweet> tweets = parser.parse(bytes);
   ```

2. On-demand, where we parse fields only when they are accessed:

   ```java
   Iterator<JsonValue> tweets = jsonValue.get("statuses").arrayIterator();
   while (tweets.hasNext()) {
       JsonValue tweet = tweets.next();
       JsonValue user = tweet.get("user");
       if (user.get("default_profile").asBoolean()) { // parse the default_profile field here
           System.out.println(user.get("screen_name").asString());
       }
   }
   ```

I'd like to support both approaches, but I don't know yet which will be implemented first.
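
To make option 1 above concrete, the schema-based approach would pair `parser.parse(bytes)` with user-defined classes that declare only the fields of interest. A hypothetical sketch (the `Tweet`/`User` classes and their fields are illustrative, not an existing API):

```java
// Hypothetical schema classes for option 1: only the declared fields would be
// materialized by the parser; everything else in the input could be skipped.
class User {
    public String screen_name;
    public boolean default_profile;
}

class Tweet {
    public String created_at;
    public User user;
    // fields not declared here (entities, geo, ...) would not be parsed
}

// Usage would then mirror the snippet above:
// List<Tweet> tweets = parser.parse(bytes);
```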

@zekronium
Author

There are already a lot of libraries that parse to POJOs well, handle missing fields, and tune and optimize parsing in many different ways. That doesn't mean they beat SIMD, of course, but for me SIMD stands for the ability to parse a large file/JSON in a streaming manner. Apart from jsoniter, not many libraries can do that, and especially not this fast.

I think a key point about stream parsing is that InputStream must be supported, since in many frameworks, especially high-performance ones, accessing a byte[] might be expensive or even impossible (e.g. Netty's UnsafeBuffer), so that must be taken into account.
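
As a hedged illustration of that constraint (not part of the library): when the payload sits in a Netty `ByteBuf`, obtaining a byte[] for the parser may require a copy, which is exactly the overhead a streaming/InputStream API could avoid. The helper name `toByteArray` is made up for this example.

```java
import io.netty.buffer.ByteBuf;

final class ByteBufBytes {

    // Sketch: extracting a byte[] from a Netty buffer before handing it to the parser.
    static byte[] toByteArray(ByteBuf buf) {
        int length = buf.readableBytes();
        if (buf.hasArray()) {
            // Heap buffer: a backing array exists, but it may be shared and offset,
            // so a defensive copy of just the readable region is taken here.
            byte[] copy = new byte[length];
            System.arraycopy(buf.array(), buf.arrayOffset() + buf.readerIndex(), copy, 0, length);
            return copy;
        }
        // Direct (off-heap) buffer: there is no backing array at all,
        // so an explicit copy into heap memory is unavoidable.
        byte[] copy = new byte[length];
        buf.getBytes(buf.readerIndex(), copy);
        return copy;
    }
}
```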

@piotrrzysko
Member

Okay, so you're referring to the feature described in #19, rather than on-demand field parsing. Nevertheless, both of these features are important, and I have them on the roadmap. However, I don't know yet when they will be delivered.

@piotrrzysko
Member

Hi @zekronium. Would you mind sharing the code of your benchmark?

@zekronium
Author

I can invite you to the repository

@piotrrzysko
Member

OK, so please add me to the repository.

@zekronium
Author

Invite sent

@zekronium
Author

Did it help? Have you also noticed in my benchmark, where I serialize to a map manually and build the full structure, how much slower it actually is compared to the same task in fastjson or even Jackson?
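
For reference, "serialize to a map manually and build the full structure" means eagerly copying every node of the parsed document into Java collections, roughly like the sketch below. It assumes `JsonValue` exposes `objectIterator()`/`isObject()`/`isLong()`-style accessors alongside the calls shown earlier in this thread; treat those exact names as assumptions rather than the library's confirmed API.

```java
import org.simdjson.JsonValue;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class JsonToMap {

    // Recursively copies a parsed JsonValue into plain Java collections.
    // Building the full structure like this touches every node, so it gives up
    // most of the benefit of lazy/partial traversal.
    static Object toJava(JsonValue value) {
        if (value.isObject()) {                        // accessor names are assumptions
            Map<String, Object> map = new HashMap<>();
            Iterator<Map.Entry<String, JsonValue>> fields = value.objectIterator();
            while (fields.hasNext()) {
                Map.Entry<String, JsonValue> field = fields.next();
                map.put(field.getKey(), toJava(field.getValue()));
            }
            return map;
        }
        if (value.isArray()) {
            List<Object> list = new ArrayList<>();
            Iterator<JsonValue> elements = value.arrayIterator();
            while (elements.hasNext()) {
                list.add(toJava(elements.next()));
            }
            return list;
        }
        if (value.isString()) {
            return value.asString();
        }
        if (value.isBoolean()) {
            return value.asBoolean();
        }
        if (value.isNull()) {
            return null;
        }
        if (value.isLong()) {
            return value.asLong();
        }
        return value.asDouble();                       // numeric fallback for this sketch
    }
}
```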

@piotrrzysko
Member

It did, although I haven't started working on this yet. Regarding serializing to a map, I'll look into it.
