
Support RLE_DICTIONARY encoding with sink_parquet() #8442

Closed
talawahtech opened this issue Apr 23, 2023 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@talawahtech

talawahtech commented Apr 23, 2023

Problem description

The default parquet library used by Polars (parquet2) does not support dictionary encoding with run-length-encoded indices (RLE_DICTIONARY) for data pages. As a result, parquet files created from sorted, low-cardinality data are much larger than they need to be, e.g. 4,028,562 KiB vs. 17 KiB (see below).

PyArrow supports RLE_DICTIONARY, so files created with write_parquet(use_pyarrow=True) benefit from this feature; however, use_pyarrow is not available with sink_parquet().
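A back-of-envelope calculation (mine, not taken from the parquet files themselves) shows why the gap is so large: PLAIN encoding stores every INT32 verbatim at 4 bytes per value, so the dataset's lower bound is already close to the observed 4,028,562 KiB, while the dictionary payload for 100 distinct values is only 400 bytes plus a handful of run headers.

```python
# Rough size estimate for the repro below: 100 distinct values, each
# repeated 10 million times.
num_entries = 10_000_000          # repeats per value, as in the repro below
num_values = 100                  # range(250, 350)

# PLAIN INT32: every value costs 4 bytes, regardless of repetition.
plain_bytes = 4 * num_entries * num_values
print(f"PLAIN lower bound: {plain_bytes // 1024:,d} KiB")   # 3,906,250 KiB

# RLE_DICTIONARY: the dictionary page only needs the distinct values;
# the run-length-encoded indices add a small, near-constant overhead.
dict_bytes = 4 * num_values
print(f"dictionary payload: {dict_bytes} bytes")
```

The remaining few percent between the 3,906,250 KiB estimate and the observed 4,028,562 KiB is page headers and repetition/definition levels.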

Generate Files:

import polars as pl

num_entries = 10_000_000
df = pl.select(pl.concat([pl.repeat(x, n=num_entries) for x in range(250, 350)]).alias("latency"))

rg_size = 64 * 1024 * 1024
df.write_parquet("file.parq", row_group_size=rg_size, compression="uncompressed")
df.write_parquet("file.zstd.parq", row_group_size=rg_size, compression="zstd")
df.write_parquet("file.pya.parq", row_group_size=rg_size, compression="uncompressed", use_pyarrow=True)

Print Metadata and File Size

import os
import pyarrow.parquet as pq

metadata = pq.read_metadata("file.parq")
file_size = int(os.path.getsize("file.parq") / 1024)

print(f"File: size: {file_size:,d} KiB")
print(metadata)
print(metadata.row_group(0).column(0))

Default Uncompressed

File: size: 4,028,562 KiB

<pyarrow._parquet.FileMetaData object at 0x7fa28d7cbbf0>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 1
  num_rows: 1000000000
  num_row_groups: 14
  format_version: 2.6
  serialized_size: 1488

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa2636790b0>
  file_offset: 294655007
  file_path: 
  physical_type: INT32
  num_values: 71428571
  path_in_schema: latency
  is_stats_set: False
  statistics:
    None
  compression: UNCOMPRESSED
  encodings: ('PLAIN', 'RLE')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 294655003
  total_uncompressed_size: 294655003

Default Compressed

File: size: 122,741 KiB

<pyarrow._parquet.FileMetaData object at 0x7fa25ac5f0b0>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 1
  num_rows: 1000000000
  num_row_groups: 14
  format_version: 2.6
  serialized_size: 1406

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa2636794d0>
  file_offset: 8972431
  file_path: 
  physical_type: INT32
  num_values: 71428571
  path_in_schema: latency
  is_stats_set: False
  statistics:
    None
  compression: ZSTD
  encodings: ('PLAIN', 'RLE')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 8972427
  total_uncompressed_size: 294655003

PyArrow Uncompressed

File: size: 17 KiB

<pyarrow._parquet.FileMetaData object at 0x7fa28d7cbbf0>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 1
  num_rows: 1000000000
  num_row_groups: 15
  format_version: 2.6
  serialized_size: 1465

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa28d7cbd70>
  file_offset: 1030
  file_path: 
  physical_type: INT32
  num_values: 67108864
  path_in_schema: latency
  is_stats_set: False
  statistics:
    None
  compression: UNCOMPRESSED
  encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 46
  total_compressed_size: 1026
  total_uncompressed_size: 1026
talawahtech added the enhancement label on Apr 23, 2023
@talawahtech (Author)

talawahtech commented May 20, 2024

Confirming that the native write_parquet/sink_parquet functions now support RLE_DICTIONARY. The uncompressed test file size has dropped from 4,028,562 KiB to 15 KiB (23 KiB with statistics). Validated with the following code:

import polars as pl

rg_size = 10_000_000
df = pl.select(pl.concat([pl.repeat(x, n=rg_size) for x in range(250, 350)]).alias("latency"))
df.write_parquet("file.parq", row_group_size=rg_size, compression="uncompressed")

pl.Config.set_streaming_chunk_size(rg_size)
lf = pl.scan_parquet("file.parq").rename({"latency": "s-latency"})
lf.sink_parquet("file-sink.parq", row_group_size=rg_size, compression="uncompressed")

I assume #16125 is the source of the fix. Thanks @thalassemia! Closing.
