Documents with anonymous top level array are not properly decoded #111

Open
mikereinhold opened this issue Oct 4, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@mikereinhold

According to JSON standards (RFC 4627, ECMA-404, and RFC 8259), an array is a legal top-level JSON text.

According to the Amazon Ion Hive SerDe documentation:

Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets.

Based on this, I would expect the Amazon Ion Hive SerDe to correctly read and decode JSON files whose top-level value is an (anonymous) array.

For example:
[{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]
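As a quick sanity check of the example above, here is a minimal Python sketch using only the standard `json` module. It confirms that the document parses as a top-level array and that each element has the shape `struct<a:string,b:int,c:boolean>` used in the table definition in this issue. The helper name `matches_struct` is hypothetical and exists only for illustration; this says nothing about how the Ion Hive SerDe itself parses the file.

```python
import json

# The example document from this issue, verbatim.
text = '[{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]'

doc = json.loads(text)

# The top-level value is an array, not an object.
is_top_level_array = isinstance(doc, list)


def matches_struct(item):
    """Check one element against struct<a:string,b:int,c:boolean> (hypothetical helper)."""
    return (
        isinstance(item, dict)
        and isinstance(item.get("a"), str)
        # bool is a subclass of int in Python, so exclude it explicitly.
        and isinstance(item.get("b"), int)
        and not isinstance(item.get("b"), bool)
        and isinstance(item.get("c"), bool)
    )


all_match = all(matches_struct(item) for item in doc)
print(is_top_level_array, len(doc), all_match)
```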

However, the Ion Hive SerDe does not properly interpret these files:

Table definition:

CREATE EXTERNAL TABLE `top_level_array_test`(
  `array` array<struct<a:string,b:int,c:boolean>>
)
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'ion.encoding'='TEXT', 
  'ion.fail_on_overflow'='false',
  'ion.ignore_malformed'='false'
) 
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION
  '...'

However, querying this table returns no results, and the SerDe passes no input bytes to the execution engine.

In my testing, the OpenX JSON SerDe correctly handles similar data files.

mikereinhold added the bug label on Oct 4, 2024
@rmarrowstone

Hi! It is true that Ion is a superset of JSON, but it doesn't follow that JSON Arrays should necessarily be treated as Rows/Structs by the Ion SerDe. I understand why it seems implied, but it's not a given.
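To make the distinction concrete, here is a rough Python sketch of reshaping a top-level array into one object per line, so that each element becomes its own top-level value. It rests on an assumption this thread does not confirm: that the SerDe maps each top-level value to one row. The function name `array_to_rows` is hypothetical, and a table over the reshaped data would presumably declare columns `a`, `b`, and `c` directly instead of a single `array` column.

```python
import json


def array_to_rows(text):
    """Split a top-level JSON array into one JSON value per line (hypothetical helper)."""
    doc = json.loads(text)
    if not isinstance(doc, list):
        raise ValueError("expected a top-level JSON array")
    # Each element is serialized on its own line as a separate top-level value.
    return "\n".join(json.dumps(item) for item in doc)


rows = array_to_rows(
    '[{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]'
)
print(rows)
```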

We don't have any plans for active development on the Hive SerDe, but other ecosystem integrations (namely Trino) are in flight. In what engine/deployment are you using the Hive SerDe? Trino? AWS Athena? Spark? Something else?
