
Support RLE_DICTIONARY encoding with sink_parquet() #8442

Closed
talawahtech opened this issue Apr 23, 2023 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@talawahtech

talawahtech commented Apr 23, 2023

Problem description

The default parquet library used by Polars (parquet2) does not support dictionary encoding with run-length-encoded indices (RLE_DICTIONARY) for data pages. As a result, parquet files created from sorted, low-cardinality data are much larger than they need to be, e.g. 4,028,562 KiB vs. 17 KiB (see below).

PyArrow supports RLE_DICTIONARY, so files created with write_parquet(use_pyarrow=True) benefit from this feature; however, use_pyarrow is not available with sink_parquet().
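A back-of-envelope calculation (mine, not taken from the parquet files themselves) shows why the gap is so large: PLAIN encoding stores every INT32 verbatim at 4 bytes per value, so the dataset's lower bound is already close to the observed 4,028,562 KiB, while the dictionary payload for 100 distinct values is only 400 bytes plus a handful of run headers.

```python
# Rough size estimate for the repro below: 100 distinct values, each
# repeated 10 million times.
num_entries = 10_000_000          # repeats per value, as in the repro below
num_values = 100                  # range(250, 350)

# PLAIN INT32: every value costs 4 bytes, regardless of repetition.
plain_bytes = 4 * num_entries * num_values
print(f"PLAIN lower bound: {plain_bytes // 1024:,d} KiB")   # 3,906,250 KiB

# RLE_DICTIONARY: the dictionary page only needs the distinct values;
# the run-length-encoded indices add a small, near-constant overhead.
dict_bytes = 4 * num_values
print(f"dictionary payload: {dict_bytes} bytes")
```

The remaining few percent between the 3,906,250 KiB estimate and the observed 4,028,562 KiB is page headers and repetition/definition levels.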

Generate Files:

import polars as pl

num_entries = 10_000_000
df = pl.select(pl.concat([pl.repeat(x, n=num_entries) for x in range(250, 350)]).alias("latency"))

rg_size = 64 * 1024 * 1024
df.write_parquet("file.parq", row_group_size=rg_size, compression="uncompressed")
df.write_parquet("file.zstd.parq", row_group_size=rg_size, compression="zstd")
df.write_parquet("file.pya.parq", row_group_size=rg_size, compression="uncompressed", use_pyarrow=True)

Print Metadata and File Size

import os
import pyarrow.parquet as pq

metadata = pq.read_metadata("file.parq")
file_size = int(os.path.getsize("file.parq") / 1024)

print(f"File: size: {file_size:,d} KiB")
print(metadata)
print(metadata.row_group(0).column(0))

Default Uncompressed

File: size: 4,028,562 KiB

<pyarrow._parquet.FileMetaData object at 0x7fa28d7cbbf0>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 1
  num_rows: 1000000000
  num_row_groups: 14
  format_version: 2.6
  serialized_size: 1488

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa2636790b0>
  file_offset: 294655007
  file_path: 
  physical_type: INT32
  num_values: 71428571
  path_in_schema: latency
  is_stats_set: False
  statistics:
    None
  compression: UNCOMPRESSED
  encodings: ('PLAIN', 'RLE')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 294655003
  total_uncompressed_size: 294655003

Default Compressed

File: size: 122,741 KiB

<pyarrow._parquet.FileMetaData object at 0x7fa25ac5f0b0>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 1
  num_rows: 1000000000
  num_row_groups: 14
  format_version: 2.6
  serialized_size: 1406

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa2636794d0>
  file_offset: 8972431
  file_path: 
  physical_type: INT32
  num_values: 71428571
  path_in_schema: latency
  is_stats_set: False
  statistics:
    None
  compression: ZSTD
  encodings: ('PLAIN', 'RLE')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 8972427
  total_uncompressed_size: 294655003

PyArrow Uncompressed

File: size: 17 KiB

<pyarrow._parquet.FileMetaData object at 0x7fa28d7cbbf0>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 1
  num_rows: 1000000000
  num_row_groups: 15
  format_version: 2.6
  serialized_size: 1465

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa28d7cbd70>
  file_offset: 1030
  file_path: 
  physical_type: INT32
  num_values: 67108864
  path_in_schema: latency
  is_stats_set: False
  statistics:
    None
  compression: UNCOMPRESSED
  encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 46
  total_compressed_size: 1026
  total_uncompressed_size: 1026
talawahtech added the enhancement label on Apr 23, 2023
@talawahtech (Author)

talawahtech commented May 20, 2024

Confirming that the native write_parquet/sink_parquet functions now support RLE_DICTIONARY. The uncompressed test file size has dropped from 4,028,562 KiB to 15 KiB (23 KiB with statistics). Validated with the following code:

import polars as pl

rg_size = 10_000_000
df = pl.select(pl.concat([pl.repeat(x, n=rg_size) for x in range(250, 350)]).alias("latency"))
df.write_parquet("file.parq", row_group_size=rg_size, compression="uncompressed")

pl.Config.set_streaming_chunk_size(rg_size)
lf = pl.scan_parquet("file.parq").rename({"latency": "s-latency"})
lf.sink_parquet("file-sink.parq", row_group_size=rg_size, compression="uncompressed")

I assume #16125 is the source of the fix. Thanks @thalassemia! Closing.
