-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41321: [C++][Parquet] More strict Parquet level checking #41346
Conversation
cc @emkornfield @pitrou Would you mind take a look? This patch checks the counting is right in RecordReader |
cpp/src/parquet/column_reader.cc
Outdated
@@ -1028,12 +1030,17 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>, | |||
// and number of values to read. This function is called before reading values. | |||
void ReadLevels(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels, | |||
int64_t* num_def_levels, int64_t* values_to_read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it useful to have a int64_t* num_def_levels
if it's always going to be equal to batch_size
? Can we remove this argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
batch_size = std::min(batch_size, this->available_values_current_page());
The batch_size
would be adjusted by the code here.
cpp/src/parquet/column_reader.cc
Outdated
*values_read = this->ReadValues(values_to_read, values); | ||
ARROW_DCHECK_GE(values_to_read, *values_read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this occur on a corrupted file? If so, it should be an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right, this can occur on a corrupt file!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh-no, we cannot...If values_read
is too less, it might be a corrupt file, but when values_to_read < *values_read
, ReadValues
returns too many...
@mapleFU Gentle ping if this PR could be moved forward. This would help me making progress on fuzzing my own code, which is now stopped by crashes in libarrow/libparquet. |
cpp/src/parquet/column_reader.cc
Outdated
@@ -1045,6 +1052,7 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>, | |||
|
|||
// Not present for non-repeated fields | |||
if (this->max_rep_level_ > 0 && rep_levels != nullptr) { | |||
DCHECK_GT(this->max_rep_level_, 0); | |||
int64_t num_rep_levels = this->ReadRepetitionLevels(batch_size, rep_levels); | |||
if (def_levels != nullptr && *num_def_levels != num_rep_levels) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be num_rep_levels != batch_size
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it can 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emmm batch_size is being limit here, not check that
Sorrying for delay I just back from holidays, will address the comments tomorrow. |
Sigh, adding checks is easy, but maybe reasoning them would be a little tricky, and 77fc23f adds some ad-hoc checkings. Not ad-hoc checkings are written everywhere in our system. Maybe I can summarize them as:
|
0f49c8d
to
2cf9591
Compare
} | ||
} else if (this->max_def_level_ > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch is useless...
fbafc33
to
db35d8b
Compare
70f1b66
to
b08803a
Compare
b08803a
to
f50a393
Compare
@github-actions crossbow submit -g cpp |
Revision: 3d58afe Submitted crossbow builds: ursacomputing/crossbow @ actions-c74132eedf |
Will merge in one day if no negative comments, since we collect 1 +1 vote |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1f07404. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…ache#41346) ### Rationale for this change In apache#41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict. **Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches** ### What changes are included in this PR? More strict parquet checkings on Level ### Are these changes tested? Already exists test, maybe we can introduce parquet file as test file ### Are there any user-facing changes? More strict checkings * GitHub Issue: apache#41321 Lead-authored-by: mwish <[email protected]> Co-authored-by: mwish <[email protected]> Signed-off-by: mwish <[email protected]>
Rationale for this change
In #41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict.
Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches
What changes are included in this PR?
More strict parquet checkings on Level
Are these changes tested?
Already exists test, maybe we can introduce parquet file as test file
Are there any user-facing changes?
More strict checkings