
JSON configuration and Jackson Streaming Object Processor #5225

Merged · 56 commits · Jun 18, 2024

Conversation

@devinrsmith (Member) commented Mar 6, 2024

This PR adds a declarative JSON configuration object that lets users specify the schema of a JSON message. It is meant to have good out-of-the-box defaults, while still allowing power users to adjust the finer parsing details (should this int field be parseable from a string? should null values be allowed? what if a field is missing? etc.). The JSON configuration layer is not tied to any specific implementation; it is introspectable, and could have alternative implementations with other parsing backends. (I could imagine a DHE use case with code generation based on the JSON configuration, somewhat like the DHE Avro ObjectProcessor code generator.)

Out of the box, there's an ObjectProcessor implementation based on the Jackson streaming APIs; that is, the data flows from byte[]s (or InputStream, relevant for very large files) to the output WritableChunks without the intermediate Jackson databind layer (TreeNode). This avoids the large layer of allocation that our current Kafka json_spec layer relies on. The ObjectProcessor layer means this can be used anywhere that exposes ObjectProcessor layers and wants 1-to-1 record-to-row processing (currently, Kafka).

Part of #5222

@devinrsmith (Member, Author) commented:

This PR is currently lacking a lot of documentation. I know!

In addition to reviews, I'm hoping to have some field testing of the APIs.

Here's an example script that shows a lot of the primitive functionality:

from datetime import datetime

from deephaven.table import Table
from deephaven.json import (
    string_,
    bool_,
    char_,
    byte_,
    short_,
    int_,
    long_,
    float_,
    double_,
    instant_,
    big_integer_,
    big_decimal_,
    array_,
    tuple_,
    object_,
    any_,
)
from deephaven.stream import blink_to_append_only
from deephaven.json import JsonValueType
from deephaven.json.table import json_table, source


def table(options: JsonValueType, content: str, multi_value_support: bool = False) -> Table:
    return blink_to_append_only(
        json_table(options, [source(content=content)], multi_value_support=multi_value_support)
    )


def weather() -> Table:
    return blink_to_append_only(
        json_table(
            {
                "features": [
                    {"properties": {"timestamp": datetime, "textDescription": str}}
                ]
            },
            [source(url="https://api.weather.gov/stations/KNYC/observations")],
        )
    )


for i, example in enumerate(
    [
        table(bool, "true"),
        table(int, "42"),
        table(float, "42.42"),
        table(str, '"foo bar"'),
        table(datetime, '"2009-02-13T23:31:30.123456789Z"'),
        table((bool, int), "[true, 42]"),
        table({"foo": int}, '{"foo": 42}'),
        table([int], "[1, 2, 4, 8, null]"),
        table(
            [{"name": str, "xy": (float, float)}],
            '[{"name": "foo", "xy": [1.1, 2.2]}, {"name": "bar", "xy": [-3, 42]}]',
        ),
        table(bool_(), "false"),
        table(char_(), '"X"'),
        table(byte_(), "42"),
        table(short_(), "42"),
        table(int_(), "42"),
        table(long_(allow_decimal=True), "42.42"),
        table(long_(allow_string=True), '"42"'),
        table(float_(), "42.42"),
        table(double_(), "42.42"),
        table(instant_(), '"2009-02-13T23:31:30.123456789Z"'),
        table(instant_(number_format="ns"), "1234567890123456789"),
        table(big_integer_(), "1234567890123456789999999999999999999999"),
        table(
            big_decimal_(),
            "1234567890123456789999999999999999999999.9876543210000000001234",
        ),
        table(tuple_((bool_(), int_())), "[true, 42]"),
        table(object_({"foo": int_(), "bar": long_()}), '{"foo": 42, "bar": 43}'),
        table(
            array_(object_({"foo": int_(), "bar": long_()})),
            '[{"foo": 42, "bar": 43}, {"foo": 44, "bar": 45}]',
        ),
        table(any_(), '[[[[], [[], [], [{"foo": 42}]]]]]'),
        table(int_(), "1 2 3 4", multi_value_support=True),
        weather(),
    ]
):
    globals()[f"example_{i}"] = example

@devinrsmith (Member, Author) commented:

Here's a fun one:

from deephaven.stream import blink_to_append_only
from deephaven.json import JsonValueType
from deephaven.json.table import json_table, source


deephaven_core_issues = blink_to_append_only(json_table(
    [
        {
            "number": int,
            "title": str,
            "user": {"login": str},
            "labels": [{"name": str}],
            "state": str,
            "assignee": {"login": str},
            "milestone": {"title": str},
            "comments": int,
            "body": str,
        }
    ],
    [
        source(url="https://api.github.com/repos/deephaven/deephaven-core/issues?per_page=100&page=1"),
        source(url="https://api.github.com/repos/deephaven/deephaven-core/issues?per_page=100&page=2"),
        source(url="https://api.github.com/repos/deephaven/deephaven-core/issues?per_page=100&page=3"),
        source(url="https://api.github.com/repos/deephaven/deephaven-core/issues?per_page=100&page=4"),
        source(url="https://api.github.com/repos/deephaven/deephaven-core/issues?per_page=100&page=5"),
    ],
)).sort_descending(["number"])

@chipkent (Member) left a comment:

Partial review.

Review threads (resolved): py/server/deephaven/stream/kafka/consumer.py, py/server/deephaven/json/jackson.py
Comment on lines 21 to 23
# todo: should be on the classpath by default, but doesn't have to be
# todo: would be nice if this code could live in the JSON jar
_JProvider = jpy.get_type("io.deephaven.json.jackson.JacksonProvider")
TODOs.

@jmao-denver should take a look; he has a pattern for dealing with possibly-missing libs that may be appropriate here.
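One common "possibly missing lib" pattern is to defer the jpy.get_type lookup until first use, so importing the module doesn't fail when the jar is absent. A hypothetical sketch of that idea (not Deephaven's actual pattern; the injectable get_type hook exists only so the sketch is testable without a JVM):

```python
# Hypothetical sketch: defer the jpy.get_type call until the Java type is
# first needed, and fail with a clearer message if the jar is missing.
def lazy_jtype(class_name, get_type=None):
    cache = []

    def resolve():
        if not cache:
            lookup = get_type
            if lookup is None:
                import jpy  # imported only when the type is first needed
                lookup = jpy.get_type
            try:
                cache.append(lookup(class_name))
            except Exception as e:
                raise RuntimeError(
                    f"{class_name} is not on the classpath; "
                    "is the corresponding jar installed?"
                ) from e
        return cache[0]

    return resolve
```

With this shape, `_JProvider = lazy_jtype("io.deephaven.json.jackson.JacksonProvider")` would only touch the classpath when first called.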

Comment on lines 41 to 43
# todo: not on the classpath by default
# todo: would be nice if this code could live in the BSON jar
_JProvider = jpy.get_type("io.deephaven.bson.jackson.JacksonBsonProvider")

TODOs.

@jmao-denver should take a look; he has a pattern for dealing with possibly-missing libs that may be appropriate here.

* @return the date-time formatter
*/
@Default
public DateTimeFormatter dateTimeFormatter() {

We're going to want to be able to parse "the way DateTimeUtils does".

*/
@Immutable
@BuildableStyle
public abstract class LocalDateOptions extends ValueOptionsSingleValueBase<LocalDate> {

Should we have LocalTime? ZonedDateTime?

*/
@Immutable
@BuildableStyle
public abstract class ObjectKvOptions extends ValueOptionsRestrictedUniverseBase {

I think we should consider whether we can produce a guide like:
https://immutables.github.io/immutable.html

@jmao-denver (Contributor) commented:

A more general question: should we support JSON Schema?

JSON Schema is a widely-adopted standard for defining the structure and constraints of JSON data. It facilitates validation, documentation, and interoperability, making it an essential tool for developers working with JSON. By using JSON Schema, you can ensure that your JSON data conforms to the expected format and constraints, thus improving the robustness and reliability of your applications.

@devinrsmith (Member, Author) replied:

> A more general question: should we support JSON Schema?
>
> JSON Schema is a widely-adopted standard for defining the structure and constraints of JSON data. It facilitates validation, documentation, and interoperability, making it an essential tool for developers working with JSON. By using JSON Schema, you can ensure that your JSON data conforms to the expected format and constraints, thus improving the robustness and reliability of your applications.

It should be relatively "straightforward" for us to implement a "JSON Schema to json Value" adapter in the future if we find value in it. This was considered for the Java side of things, but JSON Schema being a "standard" means there are a bunch of competing implementations and different versions, and none of the Java ones seemed to fit the bill for what we want to use it for.
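For a feel of what such an adapter could look like, here is a hypothetical sketch (not part of the PR; the function name and its very limited JSON Schema coverage are assumptions) that maps a small subset of JSON Schema onto the Python-native shorthand the earlier examples pass to json_table (bool, int, float, str, dicts for objects, single-element lists for arrays):

```python
# Hypothetical sketch: translate a tiny subset of JSON Schema into the
# Python-native type shorthand accepted by deephaven.json.
_SCALARS = {"boolean": bool, "integer": int, "number": float, "string": str}


def from_json_schema(schema: dict):
    t = schema.get("type")
    if t in _SCALARS:
        return _SCALARS[t]
    if t == "object":
        # {"properties": {...}} becomes a {field_name: value_type} dict
        return {name: from_json_schema(sub)
                for name, sub in schema.get("properties", {}).items()}
    if t == "array":
        # {"items": {...}} becomes a single-element list
        return [from_json_schema(schema["items"])]
    raise ValueError(f"unsupported JSON Schema type: {t!r}")
```

For example, `from_json_schema({"type": "object", "properties": {"foo": {"type": "integer"}}})` yields `{"foo": int}`, which matches the shorthand used in the example scripts above. A real adapter would also need to handle $ref, nullability, formats (e.g. date-time), and draft differences.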

@devinrsmith changed the title from "JSON configuration and Object Processor" to "JSON configuration and Jackson Streaming Object Processor" on Jun 13, 2024.
@jmao-denver (Contributor) previously approved these changes Jun 13, 2024:

The Python changes LGTM.

@rcaudy (Member) previously approved these changes Jun 14, 2024.
@jmao-denver (Contributor) previously approved these changes Jun 14, 2024:

The Python changes LGTM.

- <root level="warn">
+ <root level="info">
What is the consequence of this change? Where do these logs go? Who sees them? What gets lost?

@devinrsmith (Member, Author) replied:
Thanks; I didn't mean to commit this. I do think this should be done. There was discussion earlier about missing logs from the embedded server. It doesn't go to the console, only the web logs. I'll re-open this as a separate PR.

Review threads (resolved): py/server/deephaven/json/jackson.py, py/server/deephaven/json/__init__.py
@devinrsmith dismissed stale reviews from jmao-denver and rcaudy via 7e81e41, June 14, 2024 23:28.
@chipkent (Member) left a comment:

LGTM

@devinrsmith merged commit 7150848 into deephaven:main on Jun 18, 2024 (15 checks passed) and deleted the json-config branch June 18, 2024 15:22. The github-actions bot locked and limited the conversation to collaborators Jun 18, 2024.
@deephaven-internal (Contributor) commented:

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#236

6 participants