JSON-to-table adapter #2975

Draft: wants to merge 105 commits into main
Conversation

@rbasralian (Contributor) commented Oct 7, 2022

High-throughput JSON adapter from DHE. Supports parallel processing of JSON records, handling of nested values, and expansion of array nodes. Also supports writing data from nested array nodes to separate tables.

Major differences from DHE:

  • Support subtables (parsing an array field of one node into a separate table)
  • Ability to process messages synchronously (helpful in tests; required for subtables)

It could use another API layer that allows the DynamicTableWriter and the JSONToTableWriter adapter to be defined at the same time (this is not a problem in DHE because the table writer is essentially defined by the XML schema).

We could also create something that generates an adapter based on a JSON schema.
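As a rough illustration of that schema-driven idea (the class name and the type mapping below are assumptions for this sketch, not Deephaven's actual rules), a generator might start by deriving column types from a JSON-schema-like "properties" map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive table column types from a JSON-schema-style
// property->type map, as a starting point for schema-driven adapter generation.
public class SchemaToColumns {
    public static Map<String, Class<?>> columnsFor(Map<String, String> properties) {
        final Map<String, Class<?>> columns = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            switch (e.getValue()) {
                case "integer": columns.put(e.getKey(), long.class); break;
                case "number":  columns.put(e.getKey(), double.class); break;
                case "boolean": columns.put(e.getKey(), boolean.class); break;
                default:        columns.put(e.getKey(), String.class); break;
            }
        }
        return columns;
    }
}
```

A real generator would also need to recurse into `object` and `array` properties to produce nested-field and subtable definitions.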

@rcaudy rcaudy added the json label Oct 10, 2022
@rcaudy rcaudy added this to the Oct 2022 milestone Oct 10, 2022
@devinrsmith (Member) left a comment:
This is a very large changeset; I have not gone over all of it, and I would like to get some perspective before digging into the nitty-gritty details.

The ability to take JSON and turn it into a table is a great feature. (I would very much like it not to be gated behind Kafka, as it currently is outside of this PR.) Raffi - are you the driving force behind wanting to get this feature into DHC, or is there a DHC user request?

My first concern is that we have bifurcated paths for parsing JSON: Kafka JSON vs. extensions-json. In an ideal world, Kafka JSON would be able to explicitly use/share this code path.

One of the parts that makes the above challenging is that this PR exhausts to a DynamicTableWriter. Instead, I think exhausting to a stream table makes a lot of sense; that makes it possible for the end user to consume it as a stream, a ring, or an append table. (I can't speak to the inner workings of the Kafka JSON parsing, but I do have some experience w/ stream table creation.)

There are also some language and interface choices in this PR that make it more "enterprise" focused: "Database", "ingester", "write to disk". I'd like to make sure that we are exposing an appropriate interface for the DHC side.

I'm also concerned about the builders for creating the translation from JsonNode -> Table columns. Again, I haven't dug into the nitty-gritty, and I know that we need to handle complex nested cases, but this is a very important user-facing part that I'll want to take a closer look at.

I'm wondering if there may be a two-tiered approach we might take to simplify this process. A layer that is explicitly centered around JsonNode, with mapping from JsonPointer to type / WritableChunk. And then a user-friendly layer on top of it.
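To make the two-tier idea concrete, here is a hedged sketch (all class and method names are hypothetical, and a plain Map stands in for a parsed JsonNode and for the WritableChunk destinations):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of two tiers: a low-level registry keyed by
// JSON-pointer-style paths (standing in for JsonPointer -> WritableChunk
// writers), and a friendlier layer that registers top-level fields by name.
public class TwoTier {
    private final Map<String, Function<Map<String, Object>, Object>> extractors = new HashMap<>();

    // Low-level tier: register an extractor for one pointer path.
    public void register(String pointer, Function<Map<String, Object>, Object> fn) {
        extractors.put(pointer, fn);
    }

    // User-friendly tier: a top-level field is just the pointer "/name".
    public void addField(String name) {
        register("/" + name, record -> record.get(name));
    }

    public Object extract(String pointer, Map<String, Object> record) {
        return extractors.get(pointer).apply(record);
    }
}
```

The appeal of this split is that nested and array handling only has to exist in the lower tier, while the user-facing builder stays small.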

engine/time/src/main/java/io/deephaven/time/DateTime.java (outdated, resolved)
extensions/json/build.gradle (outdated, resolved)
Comment on lines +21 to +25
// Using io.confluent dependencies requires code in the toplevel build.gradle to add their maven repository.
// Note: the -ccs flavor is provided by confluent as their community edition. It is equivalent to the maven central
// version, but has a different version to make it easier to keep confluent dependencies aligned.
api 'org.apache.kafka:kafka-clients:7.1.1-ccs'
api 'io.confluent:kafka-avro-serializer:7.1.1'
Member:
I'm confused about why json needs a dependency on kafka. Maybe it will become clear as I continue through the PR?

Contributor (Author):
I think it's only because of io.deephaven.kafka.ingest.JsonNodeUtil; I meant to move that here and make Kafka depend on this module, but haven't gotten around to it yet.

Contributor (Author):
OK, moved that class and flipped the dependencies.

project(path: ':configs')


Classpaths.inheritSlf4j(project, 'slf4j-simple', 'runtimeOnly')
Member:
We should not depend on slf4j-simple (or any other concrete logging implementation) at runtime from a java-library. It's OK to do it for testRuntimeOnly.

Contributor (Author):
fixed

Comment on lines 55 to 58
// 'node==null || node.isMissingNode()' is OK here, because missing keys
// are allowed (which implicitly means null values for those keys are allowed).
// only explicit null values are disallowed.
if (!allowNullValues && node != null && node.isNull()) {
Member:
This comment doesn't line up w/ the code?

It's theoretically possible that a user/app would want to allow a missing field but disallow an explicit null.

Contributor (Author):
The problem before was that !allowMissingKeys && allowNullValues didn't work: if you have a missing key, the node variable will just be null already (because we are calling this with the result of node.get(key), which generally returns null if the key is missing).

But if the key is there with an explicit null value, then you will have a non-null node reference (so node != null), but it will be a com.fasterxml.jackson.databind.node.NullNode (so node.isNull()).
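The distinction can be sketched without a Jackson dependency using a plain Map; NULL_NODE below is a hypothetical stand-in for com.fasterxml.jackson.databind.node.NullNode (where node.get(key) returns null for a missing key but a NullNode for an explicit JSON null):

```java
import java.util.Map;

// Stand-in sketch of the missing-key vs explicit-null distinction: a null
// lookup result means the key is missing, while NULL_NODE means the key was
// present with an explicit JSON null.
public class NullVsMissing {
    public static final Object NULL_NODE = new Object(); // plays the role of Jackson's NullNode

    // Mirrors the guard under review: only an explicit null trips the check;
    // a missing key passes (which implicitly allows null for that key).
    public static boolean violates(Map<String, Object> obj, String key, boolean allowNullValues) {
        final Object node = obj.get(key); // null here means the key is missing
        return !allowNullValues && node != null && node == NULL_NODE;
    }
}
```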

Comment on lines 25 to 26
* @param ingestTime The time when this message was finished processing by its ingester and was ready to be written
* to disk.
Member:
We aren't writing messages to disk, are we?

Contributor (Author):
Nope, that's just from the original files; fixed this and a few others.

Comment on lines 19 to 21
/**
* The builder configures a factory for StringToTableWriterAdapters that accept JSON strings and writes a table.
*/
Member:
Do the interfaces actually require creating strings? (I thought that the kafka one operated on bytes or input stream...)

I think forcing json to wash through strings (if that indeed is the case) is a design limitation. I'd love for this json parsing logic to be used by kafka json.

Contributor (Author):
Currently they do; the original code in DHE received messages from another library that only provided string message content, but I don't think it will be too tough to change this to take bytes/streams/pre-parsed JsonNodes.

Comment on lines 1217 to 1223
try {
// this will unfortunately take MAX_WAIT_MILLIS to finish
adapter.waitForProcessing(MAX_WAIT_MILLIS);
Assert.fail("Expected timeout exception did not occur");
} catch (final TimeoutException ex) {
// expected;
}
Member:
IMO, we need a better way to test this.

Contributor (Author):
agreed...

I renamed this to testWaitForProcessingTimeout() since that's all it's really testing.
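One hedged sketch of a cheaper way to exercise the timeout path: make the wait duration a parameter, so a test can pass a few milliseconds instead of the production MAX_WAIT_MILLIS. The Waiter class and its members below are illustrative stand-ins, not the PR's actual adapter:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative stand-in for the adapter's "wait for outstanding work" logic.
// The latch represents pending processing; waitForProcessing times out only
// while work is still outstanding.
public class Waiter {
    private final CountDownLatch pending = new CountDownLatch(1);

    public void waitForProcessing(long timeoutMillis) throws TimeoutException, InterruptedException {
        if (!pending.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("work still outstanding after " + timeoutMillis + "ms");
        }
    }

    public void markDone() {
        pending.countDown();
    }
}
```

With the timeout injected, the test above could pass a tiny value and still assert the TimeoutException without stalling for the full production wait.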

import java.io.IOException;
import java.util.concurrent.TimeoutException;

public interface DataToTableWriterAdapter {
Member:
This interface seems more generic than TableWriter - could probably be renamed?

Contributor (Author):
renamed to AsynchronousDataIngester

Comment on lines 26 to 28
/**
* Shut down the adapter. This <b>must not run {@link #cleanup cleanup}</b>; that is handled by the StreamContext.
*/
Member:
What's StreamContext?

Contributor (Author):
StreamContext was in the original code. The point of this was "don't call cleanup() from just anywhere if you're writing to a DHE DIS". I updated it to be more generic and, hopefully, to explain the problem.

devinrsmith and others added 27 commits February 12, 2024 14:47
…s when constructing adapter instead of in MutableInts when running field processors (which only worked with multithreading because the field processors happened to always run in the same order).