
Investigate performance of FixedSizedList/List types in Parquet files #40

Open · sjperkins opened this issue on Jul 13, 2023 · 2 comments
Labels: question (Further information is requested)

sjperkins (Member) commented on Jul 13, 2023:

We use FixedSizeListArrays and ListArrays to represent tensor and variably shaped data, respectively. In the Apache Arrow columnar format, these types are simply views over a flat buffer of values, with an additional offsets array per nesting level in the ListArray case.
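As a rough illustration of that layout (a minimal sketch assuming pyarrow; the values and list sizes are made up), both list types can be constructed as views over the same flat buffer:

```python
import pyarrow as pa

# A single flat buffer of values.
values = pa.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], type=pa.float64())

# FixedSizeListArray: a view interpreting the buffer as rows of length 3.
fixed = pa.FixedSizeListArray.from_arrays(values, 3)

# ListArray: the same buffer plus an offsets array giving each row's extent,
# here two rows of lengths 2 and 4.
offsets = pa.array([0, 2, 6], type=pa.int32())
variable = pa.ListArray.from_arrays(offsets, values)
```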

Arrow doesn't map 1:1 to Parquet, which means that reading (and writing?) these nested structures can be inefficient compared to I/O on primitive types. Relevant issue and comments:

To be honest parquet's tag line could be "It's good enough". You can almost certainly do 2-3x better than parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance.

Arrow-based fixed-size lists of primitive values (e.g. tensors) shouldn't be converted to nested parquet data; instead they are better stored as BYTE_ARRAY in parquet (while I think it'd be important, sadly there is no fixed-size BYTE_ARRAY in the parquet spec, so it'll still be slightly slower than possible). Also, some fast paths for never-null data (data that was not marked as non-nullable when it was saved) can be useful too, but that's all.

So if optimal performance is desired for Parquet I/O on nested, tensor-type data, it sounds as if casting between the List types and Fixed Size Binary types (pyarrow.binary with a fixed length) might be an easy fix, should performance prove to be a problem.
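A minimal sketch of what that workaround might look like, assuming pyarrow and NumPy; the tensor shape, column name, and file path are illustrative, and whether this actually beats the nested encoding would need to be benchmarked:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data: four tensors of shape (3,), float64.
tensors = np.arange(12, dtype=np.float64).reshape(4, 3)
width = tensors.dtype.itemsize * tensors.shape[1]  # 24 bytes per row

# Pack each tensor into a fixed-width binary value instead of a nested list.
packed = pa.array([row.tobytes() for row in tensors], type=pa.binary(width))
pq.write_table(pa.table({"tensor": packed}), "tensors.parquet")

# Read back and reinterpret the bytes as float64 tensors.
column = pq.read_table("tensors.parquet").column("tensor")
restored = np.stack(
    [np.frombuffer(buf, dtype=np.float64) for buf in column.to_pylist()]
)
assert np.array_equal(restored, tensors)
```

The cost of this approach is that the column loses its logical type in the Parquet file (it becomes opaque bytes), so the tensor dtype and shape have to be tracked out of band, e.g. in schema metadata.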

sjperkins added the question (Further information is requested) label on Jul 13, 2023

sjperkins (Member, Author) commented:
/cc @bennahugo @orliac for informational purposes

bennahugo commented:
I think, to first order, from my standpoint, we should compare a Parquet-backed dataset and its queries against what can be achieved with casacore for a reasonably sizeable simulated MK+ database: 8k channels, 4 s dump rate, with dask accumulation and filtering operators. My idea is that we start working on incorporating this into something like ratt-ru/shadems and a set of notebooks for plotting MK+ data sets, so that we can more easily commission the telescope.
