Parquet metadata written can be corrupt due to Buffer comparison for DECIMAL type #137

Open
MatthewJin-at opened this issue Aug 15, 2024 · 0 comments


Steps to reproduce

The following program reproduces the problem with @dsnp/parquetjs 1.7.0:

import {ParquetSchema, ParquetWriter} from '@dsnp/parquetjs';

export async function writeCorruptParquetFile() {
    const parquetWriter = await ParquetWriter.openFile(
        new ParquetSchema({
            num: {type: 'DECIMAL' as const, precision: 20},
        }),
        'corrupt.parquet',
    );

    // Big-endian byte encodings of 10^15 + 127, 10^15 + 128, and 10^15 + 129
    const buffer127 = Buffer.from([0x00, 0x00, 0x03, 0x8d, 0x7e, 0xa4, 0xc6, 0x80, 0x7f]);
    const buffer128 = Buffer.from([0x00, 0x00, 0x03, 0x8d, 0x7e, 0xa4, 0xc6, 0x80, 0x80]);
    const buffer129 = Buffer.from([0x00, 0x00, 0x03, 0x8d, 0x7e, 0xa4, 0xc6, 0x80, 0x81]);

    await parquetWriter.appendRow({
        num: buffer127,
    });
    await parquetWriter.appendRow({
        num: buffer128,
    });
    await parquetWriter.appendRow({
        num: buffer129,
    });

    await parquetWriter.close();
}

Expected behaviour

The file should have 3 numbers with accurate rowgroup metadata.

Actual behaviour

The program writes a file with 3 numbers but corrupt metadata: the row group's max_value is the 10^15 + 128 value instead of the 10^15 + 129 value.

➜  parquet meta corrupt.parquet

File path:  corrupt.parquet
Created by: @dsnp/parquetjs
Properties: (none)
Schema:
message root {
  required binary num (DECIMAL(20,0));
}


Row group 0:  count: 3  37.00 B records  start: 4  total(compressed): 111 B total(uncompressed):186 B
--------------------------------------------------------------------------------
     type      encodings count     avg size   nulls   min / max
num  BINARY    _ BB_     3         37.00 B    0       "1000000000000127" / "1000000000000128"

➜  parquet check-stats corrupt.parquet
corrupt.parquet has corrupt stats: Max should be >= all values.

Any logs, error output, etc?

n/a

Any other comments?

I think this is because the library compares Buffers with < and > instead of Buffer.compare() when computing the min_value/max_value statistics.
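
A minimal sketch of why this would explain the output, assuming the statistics are tracked with a plain `>` comparison (I have not confirmed this against the library source). When `>` is applied to two Buffers, JavaScript coerces each one to a string via `toString()`, which decodes the bytes as UTF-8. The trailing bytes 0x80 and 0x81 are not valid UTF-8 on their own, so both decode to the replacement character U+FFFD, and the 128 and 129 buffers become equal strings. The helper names `maxByOperator` and `maxByCompare` are hypothetical and only exist for this illustration.

```typescript
import { Buffer } from 'node:buffer';

// Same byte patterns as in the reproduction: 10^15 + 127 .. 10^15 + 129.
const prefix = [0x00, 0x00, 0x03, 0x8d, 0x7e, 0xa4, 0xc6, 0x80];
const buffer127 = Buffer.from([...prefix, 0x7f]);
const buffer128 = Buffer.from([...prefix, 0x80]);
const buffer129 = Buffer.from([...prefix, 0x81]);

// Hypothetical helper: running max using the `>` operator.
// `>` coerces both Buffers to UTF-8 decoded strings before comparing.
const maxByOperator = (values: Buffer[]): Buffer =>
    values.reduce((max, v) => (v > max ? v : max));

// Hypothetical helper: running max using a bytewise comparison.
const maxByCompare = (values: Buffer[]): Buffer =>
    values.reduce((max, v) => (Buffer.compare(v, max) > 0 ? v : max));

const values = [buffer127, buffer128, buffer129];

// 0x80 and 0x81 both decode to U+FFFD, so the two strings are identical.
console.log(buffer128.toString() === buffer129.toString());

// The operator-based max stops at the 128 buffer; the bytewise max reaches 129.
console.log(maxByOperator(values).toString('hex'));
console.log(maxByCompare(values).toString('hex'));
```

Note that Buffer.compare() orders bytes as unsigned values, so it is only a drop-in fix for non-negative values encoded with the same length. Negative DECIMAL values in two's complement form, or buffers of different lengths, would need a comparison that accounts for sign and padding.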

@wilwade wilwade added this to the Q2/Q3 2024 Improvements milestone Aug 19, 2024