Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41321: [C++][Parquet] More strict Parquet level checking #41346

Merged
merged 14 commits into from
May 21, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Apr 23, 2024

Rationale for this change

In #41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict.

Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches

What changes are included in this PR?

More strict parquet checkings on Level

Are these changes tested?

Already exists test, maybe we can introduce parquet file as test file

Are there any user-facing changes?

More strict checkings

@mapleFU
Copy link
Member Author

mapleFU commented Apr 23, 2024

cc @emkornfield @pitrou Would you mind take a look? This patch checks the counting is right in RecordReader

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Apr 29, 2024
@@ -1028,12 +1030,17 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>,
// and number of values to read. This function is called before reading values.
void ReadLevels(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels,
int64_t* num_def_levels, int64_t* values_to_read) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to have a int64_t* num_def_levels if it's always going to be equal to batch_size? Can we remove this argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_size = std::min(batch_size, this->available_values_current_page());

The batch_size would be adjusted by the code here.

*values_read = this->ReadValues(values_to_read, values);
ARROW_DCHECK_GE(values_to_read, *values_read);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this occur on a corrupted file? If so, it should be an exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right, this can occur on a corrupt file!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh-no, we cannot...If values_read is too less, it might be a corrupt file, but when values_to_read < *values_read, ReadValues returns too many...

@rouault
Copy link
Contributor

rouault commented May 7, 2024

@mapleFU Gentle ping if this PR could be moved forward. This would help me making progress on fuzzing my own code, which is now stopped by crashes in libarrow/libparquet.

@@ -1045,6 +1052,7 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>,

// Not present for non-repeated fields
if (this->max_rep_level_ > 0 && rep_levels != nullptr) {
DCHECK_GT(this->max_rep_level_, 0);
int64_t num_rep_levels = this->ReadRepetitionLevels(batch_size, rep_levels);
if (def_levels != nullptr && *num_def_levels != num_rep_levels) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be num_rep_levels != batch_size instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it can 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emmm batch_size is being limit here, not check that

@mapleFU
Copy link
Member Author

mapleFU commented May 7, 2024

Sorrying for delay I just back from holidays, will address the comments tomorrow.

@mapleFU
Copy link
Member Author

mapleFU commented May 8, 2024

Sigh, adding checks is easy, but maybe reasoning them would be a little tricky, and 77fc23f adds some ad-hoc checkings. Not ad-hoc checkings are written everywhere in our system. Maybe I can summarize them as:

  1. Levels Count / num_values not matches the num_values() in Page Header. LevelDecoder currently doesn't checks this 😭. But we will check the decoded def-level should equal to rep-level, and we can check available_values_current_page()
  2. storing value not matches the level, like in ReadBatch, we parsed that we have 10 non-null values, but only reads 5. This should be regard as "value not matches level"

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 8, 2024
@mapleFU mapleFU force-pushed the more-checks-for-read-levels branch from 0f49c8d to 2cf9591 Compare May 8, 2024 08:31
@mapleFU mapleFU changed the title GH-41321: [C++][Parquet] More strict Parquet value number checking GH-41321: [C++][Parquet] More strict Parquet level checking May 8, 2024
}
} else if (this->max_def_level_ > 0) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch is useless...

@mapleFU mapleFU requested review from pitrou and emkornfield May 8, 2024 09:23
@mapleFU mapleFU force-pushed the more-checks-for-read-levels branch from fbafc33 to db35d8b Compare May 8, 2024 09:24
@mapleFU mapleFU force-pushed the more-checks-for-read-levels branch 3 times, most recently from 70f1b66 to b08803a Compare May 16, 2024 16:46
@mapleFU mapleFU force-pushed the more-checks-for-read-levels branch from b08803a to f50a393 Compare May 16, 2024 16:52
@mapleFU mapleFU requested a review from pitrou May 16, 2024 16:52
@mapleFU mapleFU requested a review from pitrou May 16, 2024 18:00
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 17, 2024
@mapleFU
Copy link
Member Author

mapleFU commented May 17, 2024

@github-actions crossbow submit -g cpp

Copy link

Revision: 3d58afe

Submitted crossbow builds: ursacomputing/crossbow @ actions-c74132eedf

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 17, 2024
@mapleFU
Copy link
Member Author

mapleFU commented May 20, 2024

Will merge in one day if no negative comments, since we collect 1 +1 vote

@mapleFU mapleFU merged commit 1f07404 into apache:main May 21, 2024
32 of 33 checks passed
@mapleFU mapleFU removed the awaiting change review Awaiting change review label May 21, 2024
@mapleFU mapleFU deleted the more-checks-for-read-levels branch May 21, 2024 10:38
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1f07404.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ache#41346)

### Rationale for this change

In apache#41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict.

**Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches**

### What changes are included in this PR?

More strict parquet checkings on Level

### Are these changes tested?

Already exists test, maybe we can introduce parquet file as test file

### Are there any user-facing changes?

More strict checkings

* GitHub Issue: apache#41321

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants