PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197
---
BTW, do we need to add an extra histogram for `pair<def_level, rep_level>` if both exist?
---
I don't think so. But I might not be following what you are suggesting (I gave two examples from Arrow below on usage).
---
If we store def_levels and rep_levels separately, how can we derive the number of nulls at each level precisely?
---
I am thinking of supporting pushdown of filters like `IS_NULL` or `IS_NOT_NULL` on nested fields, so I want to make sure this can satisfy that use case. Maybe we don't need a precise null_count for each level, but it would be great to be able to answer yes or no to the filters above.
---
It might pay to illustrate exact queries, but if this is just answering the question "is there any null element at a particular nesting level", I think the definition-level histogram by itself gives that information.

Take nested lists where both the lists and their elements can be nullable at each level. IIRC, the definition levels would be as follows:
0 - null top-level list
1 - empty top-level list
2 - null nested list
3 - empty nested list
4 - null leaf element
5 - present leaf element

So if the query is for top-level list `is null`, one could prune when `def_level[0] == 0`. For `is not null`, one could prune if `def_level[0] == num_values` from the page (i.e. all values are null). I believe similar logic holds for `def_level[2]`, but it could get more complicated depending on the semantics of whether a null top-level element should imply the nested list is also null, or whether an empty list implies the nested list should be considered null (but that should still be derivable using histogram indices 0, 1 and 2).

One thing the joint histogram (pairs of rep/def level counts) could give you is the number of first list elements that are null, but I'm not sure how useful that is. I would need to think about other queries the joint histogram would enable (or, if you have more examples of supported queries, we can figure out if one is needed).
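To make the pruning logic above concrete, here is a minimal sketch in Python. It assumes a per-page definition-level histogram laid out as in the 6-level example above (`def_level_histogram[i]` counts values with definition level `i`); the function names and the histogram variable are hypothetical, not part of the spec.

```python
# Minimal sketch of IS NULL / IS NOT NULL pruning on the top-level list,
# assuming the 6-level layout from the example above. All names here are
# illustrative, not spec fields.

def can_skip_is_null(def_level_histogram: list[int]) -> bool:
    # No value has definition level 0, so no top-level list is null and
    # an IS NULL predicate on the top-level list matches nothing.
    return def_level_histogram[0] == 0

def can_skip_is_not_null(def_level_histogram: list[int], num_values: int) -> bool:
    # Every value has definition level 0, so all top-level lists are null
    # and an IS NOT NULL predicate matches nothing.
    return def_level_histogram[0] == num_values

# Example: a page with 100 values, none of which are null top-level lists.
hist = [0, 10, 5, 5, 20, 60]
assert can_skip_is_null(hist)              # no level-0 values -> skip page
assert not can_skip_is_not_null(hist, 100) # some values are non-null
```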
---
(Not related to this line, but to `ColumnMetaData` in general.) For completeness, we might also want to add `unencoded_byte_array_data_bytes` and `num_entries` for the dictionary page (if present) to the ColumnMetaData, i.e., `dictionary_unencoded_byte_array_data_bytes` and `num_dictionary_entries`.

This way, readers could plan how much memory the dictionary of a column chunk will take. That can help with decisions such as whether to load the dictionary up-front to perform pre-filtering on it, and it also helps to right-size the buffer that will hold the dictionary.

I'm not suggesting that this is a must-have for this commit, or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.
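As an illustration, a reader might use the proposed fields roughly as below. The field names follow the suggestion above (they are not adopted in this PR or present in the spec), and the per-entry overhead is an assumed constant.

```python
from types import SimpleNamespace

PER_ENTRY_OVERHEAD = 16  # assumed bookkeeping bytes per entry, purely illustrative

def estimate_dictionary_memory(column_meta) -> int:
    # Both attributes below are the *proposed* fields, not existing ones.
    raw_bytes = column_meta.dictionary_unencoded_byte_array_data_bytes
    entries = column_meta.num_dictionary_entries
    return raw_bytes + entries * PER_ENTRY_OVERHEAD

# Example with stand-in metadata: a 1 MiB dictionary of 50k entries.
meta = SimpleNamespace(dictionary_unencoded_byte_array_data_bytes=1 << 20,
                       num_dictionary_entries=50_000)
print(estimate_dictionary_memory(meta))  # 1048576 + 800000 bytes
```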
---
I agree this could be useful, but my preference is to leave it out for now, as it hasn't come up in discussion before; we can always add it as a follow-up.
---
These fields are already present in the page header, but it requires an inefficient extra hop to read them.
---
I'll take a stab (and likely be wrong 😉). I believe the discussion has assumed the histograms form a matrix, with the row index being the page number and the column index being the level. Assuming that, what you have defined would be the row-major ordering, where elements of the same row are contiguous in memory, as in a C matrix. What @JFinis and @pitrou seem to prefer is the opposite, where elements of the same column are contiguous in memory, as in a Fortran matrix.
---
Yeah, I think I got lost in whether "column-major" meant "page-major" or "histogram-major". I'll let @JFinis and @pitrou chime in if they want this reversed.
---
Well, I guess I'm confused by the terminology as well. Perhaps we can drop "row-major" and "column-major" entirely when discussing this? :-)
Personally, what I mean is that the levels of a given page should be contiguous in memory. I suppose that can be called "page-major" but I might be mistaken.
(If it were a 2D array or list, you would index first by page and then by level.)
---
Agreed on moving away from the "major" terminology and just assessing whether the comment matches what you want (which I think it does).

The other "major" option would be something like:

> Element 0 is the first element of the histogram for the first page. Element 1 is the first element of the histogram for the second page. The element at index ``num_pages`` is the second element of the histogram for the first page.
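To make the two flattened orderings concrete, here is a minimal sketch; `histogram`, `num_levels`, and `num_pages` are illustrative names, not spec fields.

```python
# Minimal sketch of indexing one flattened list of per-page, per-level
# histogram counts under the two orderings discussed above.

def count_page_contiguous(histogram, num_levels, page, level):
    # Ordering currently in the PR: all level counts for one page are
    # adjacent, i.e. index first by page, then by level.
    return histogram[page * num_levels + level]

def count_level_contiguous(histogram, num_pages, page, level):
    # Alternate ordering quoted above: all pages' counts for one level
    # are adjacent, i.e. index first by level, then by page.
    return histogram[level * num_pages + page]

# Two pages, three levels; counts chosen arbitrarily for the demo.
page_major = [1, 2, 3,   # page 0: levels 0..2
              4, 5, 6]   # page 1: levels 0..2
level_major = [1, 4,     # level 0: pages 0..1
               2, 5,     # level 1: pages 0..1
               3, 6]     # level 2: pages 0..1

assert count_page_contiguous(page_major, num_levels=3, page=1, level=2) == 6
assert count_level_contiguous(level_major, num_pages=2, page=1, level=2) == 6
```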
---
Yes, it does.
That said, I don't have a strong opinion on this anymore. I got confused by the "column-major" terminology and mistakenly assumed that Parquet columns were involved. Actually they're not :-)
---
I'm listing an option. I agree there is tension between the size-estimation use case and the filtering use case; it really depends on what we want to favor (the plus side of favoring size estimation is that it is a more obvious representation).
---
Understood. And I'm just voicing a preference 😉 I don't need to squeeze every last microsecond out of size estimation, so am happy to yield to those with more pressing performance constraints.
---
I would expect the per-level representation to be slightly superior, as it is more useful for filtering. Filtering is a process that might lead to most pages being skipped, so the overall query time might be super short in this case. The most extreme case would be a point look-up where only a single row in a single page survives the filters. In that case, the performance of actually performing the filtering on the page index might have a measurable impact.

In contrast, for the size-estimation case, we're estimating the size because we're planning to read the page. That reading will take orders of magnitude longer, so it is not too important to avoid every possible cache miss.

That being said, we're talking about micro-optimizations here. Even though my gut feeling is that the other ordering would be superior, I don't mind this order. We're not creating hundreds of lists anymore; that's the most important point for performance.
---
OK, will leave as is, unless we get strong data on the trade-offs here.
---
Speaking of data, as will surprise no one, the current set of changes has a huge impact on the column index size (at least for the cases with thousands of pages with short histograms). I can post an update to the index size tables if there's interest.