Documents with anonymous top level array are not properly decoded #111

mikereinhold · 2024-10-04T16:57:37Z

According to JSON standards (RFC 4627, ECMA-404, and RFC 8259), an array is a legal top-level JSON text.

According to the Amazon Ion Hive SerDe documentation:

"Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets."

Based on this, I would expect JSON files with a top-level (anonymous) array to be understood and decoded correctly by the Amazon Ion Hive SerDe.

For example:
[{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]

However, the Ion Hive SerDe does not interpret these files correctly.

Table definition:

CREATE EXTERNAL TABLE `top_level_array_test`(
  `array` array<struct<a:string,b:int,c:boolean>>
)
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'ion.encoding'='TEXT', 
  'ion.fail_on_overflow'='false',
  'ion.ignore_malformed'='false'
) 
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION
  '...'

However, querying this table returns no results, and the SerDe reports no input bytes to the execution engine.

In my testing, the OpenX JSON SerDe correctly handles similar data files.


rmarrowstone · 2024-10-10T17:35:02Z

Hi! It is true that Ion is a superset of JSON, but it doesn't follow that JSON Arrays should necessarily be treated as Rows/Structs by the Ion SerDe. I understand why it seems implied, but it's not a given.
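To illustrate the distinction, here is a minimal sketch of a possible preprocessing workaround. It assumes (this is not confirmed anywhere in this thread) that the SerDe maps each top-level value in a file to one row, so a top-level array is a single value rather than a sequence of rows. The helper name `array_to_top_level_values` is hypothetical and not part of any library.

```python
import json


def array_to_top_level_values(text: str) -> str:
    """Rewrite a JSON document whose top-level value is an array
    as one JSON value per line (one line per array element)."""
    doc = json.loads(text)
    if not isinstance(doc, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep each element on a single line.
    lines = (json.dumps(item, separators=(",", ":")) for item in doc)
    return "\n".join(lines) + "\n"


# The sample document from the issue description.
sample = '[{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]'
flattened = array_to_top_level_values(sample)
print(flattened, end="")
# {"a":"b","b":123,"c":true}
# {"a":"z","b":456,"c":false}
```

Under the same assumption, a table over the rewritten files would declare the columns `a`, `b`, and `c` directly instead of a single `array` column.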

We don't have any plans for active development on the Hive SerDe but other ecosystem integrations (namely Trino) are in-flight. In what engine/deployment are you using the Hive SerDe? Trino? AWS Athena? Spark? Something else?

mikereinhold added the "bug" label on Oct 4, 2024
