GH-44010: [C++] Add arrow::RecordBatch::MakeStatisticsArray()
#44252
base: main
@pitrou @ianmcook What do you think about this? The statistics schema https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R86-R95 is compact, but it may be complex to build because it uses many nested types.
const std::shared_ptr<DataType>& operator()(const int64_t&) { return int64(); }
const std::shared_ptr<DataType>& operator()(const uint64_t&) { return uint64(); }
const std::shared_ptr<DataType>& operator()(const double&) { return float64(); }
const std::shared_ptr<DataType>& operator()(const std::string&) { return utf8(); }
I may have forgotten a bit, but we don't distinguish "bytes" from "utf8" in stats?
Ah, we didn't discuss it...
Let's discuss it in #44579.
We can assume "utf8" here for now.
Can we add a `// TODO(GH-44579)` here?
Yes. I should have added it...
I've added it.
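For readers following this thread, here is a minimal sketch of the visitor pattern being discussed, written against the standard library only. `StatValue`, `TypeNameVisitor`, and `TypeNameOf` are hypothetical stand-ins (the real visitor in this PR returns `std::shared_ptr<DataType>`, and the real value type lives in `arrow::ArrayStatistics`); the sketch returns type names as strings so that it compiles without Arrow.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical stand-in for the statistics value type; the real
// definition in arrow::ArrayStatistics may differ.
using StatValue = std::variant<bool, int64_t, uint64_t, double, std::string>;

// One overload per variant alternative. std::visit picks the overload
// that matches the alternative currently held by the variant.
struct TypeNameVisitor {
  const char* operator()(const bool&) const { return "bool"; }
  const char* operator()(const int64_t&) const { return "int64"; }
  const char* operator()(const uint64_t&) const { return "uint64"; }
  const char* operator()(const double&) const { return "float64"; }
  // TODO(GH-44579): "bytes" vs "utf8" is undecided; assume "utf8" for now.
  const char* operator()(const std::string&) const { return "utf8"; }
};

// Returns the (illustrative) Arrow type name for a statistics value.
inline std::string TypeNameOf(const StatValue& value) {
  return std::visit(TypeNameVisitor{}, value);
}
```

The string overload is the only place where the "bytes" versus "utf8" question shows up, which is why a TODO there is enough until GH-44579 is resolved.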
@@ -80,6 +80,24 @@ struct ArrowArray {
  void* private_data;
};

# define ARROW_STATISTICS_KEY_AVERAGE_BYTE_WIDTH_EXACT "ARROW:average_byte_width:exact"
I don't know whether `constexpr std::string_view` or this macro is better.
We can't use `constexpr` because this header may be used by C programs.
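A small sketch of how the two approaches can coexist: the macro stays in the C-compatible header, and C++ callers that prefer a typed constant can wrap it on their side. The macro name and value are taken from the diff above; `kAverageByteWidthExact` is a hypothetical name that is not part of this PR, and the sketch assumes C++17.

```cpp
#include <string_view>

// C-compatible definition: a plain string literal macro works in both
// C and C++ translation units.
#ifndef ARROW_STATISTICS_KEY_AVERAGE_BYTE_WIDTH_EXACT
#  define ARROW_STATISTICS_KEY_AVERAGE_BYTE_WIDTH_EXACT "ARROW:average_byte_width:exact"
#endif

// C++-only convenience wrapper (hypothetical, not part of the header).
// A string_view built from a literal is usable in constant expressions.
inline constexpr std::string_view kAverageByteWidthExact =
    ARROW_STATISTICS_KEY_AVERAGE_BYTE_WIDTH_EXACT;

// Compile-time check that the wrapper sees the expected key prefix.
static_assert(kAverageByteWidthExact.substr(0, 6) == "ARROW:",
              "statistics keys are expected to start with ARROW:");
```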
statistics.nth_statistics = 0;
statistics.start_new_column = true;
statistics.nth_column = std::nullopt;
statistics.key = ARROW_STATISTICS_KEY_ROW_COUNT_EXACT;
So the row count is also handled as a statistic 🤔?
The statistics array will be passed to the consumer before the consumer receives a record batch, so the row count may be useful for the consumer.
But DuckDB doesn't have a row count in its BaseStatistics...: https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
So this may not be useful...
Let's keep this for now to demonstrate table/record batch level statistics.
It's a convenience function that converts `arrow::ArrayStatistics` in an `arrow::RecordBatch` to `arrow::Array` for the Arrow C data interface.
Will take a careful pass tonight
Generally LGTM, but I'm not an expert on the C ABI and data layer.
// Statistics schema doesn't define static dense union type for
// values. Each statistics schema has a dense union type that has
// needed value types. The following block collects these types.
So actually this is logically a "set" prepared for items?
Right.
If the same type appears more than once, only the first occurrence is used.
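The "set that keeps the first occurrence" behavior can be sketched as follows. This is only an illustration with standard containers: `FindOrAddType` is a hypothetical helper, types are represented by name strings rather than `std::shared_ptr<DataType>`, and the returned index merely plays the role that a dense union type code would play.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Appends type_name to types if it is not present yet and returns its
// position. A repeated type maps to the index of its first occurrence,
// so the vector behaves like an insertion-ordered set.
inline int FindOrAddType(std::vector<std::string>& types,
                         const std::string& type_name) {
  auto it = std::find(types.begin(), types.end(), type_name);
  if (it != types.end()) {
    return static_cast<int>(it - types.begin());
  }
  types.push_back(type_name);
  return static_cast<int>(types.size() - 1);
}
```

A linear search is acceptable here under the assumption that only a handful of distinct value types ever appear in one statistics array.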
};
using OnStatistics =
    std::function<Status(const EnumeratedStatistics& enumerated_statistics)>;
Status EnumerateStatistics(const RecordBatch& record_batch, OnStatistics on_statistics) {
So actually this is for two-phase building: one pass for types and one pass for data?
Right.
I think that it's one of the complexities of this schema.
So I sent https://lists.apache.org/thread/0c9jftkspvj7yw1lpo73s3vtp6vfjqv8 to the mailing list, but nobody agreed with it. So this complexity seems to be acceptable...
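To make the two-pass idea concrete, here is a rough sketch using only the standard library. Everything in it is hypothetical: `Stat`, `Enumerate`, `Built`, and `BuildTwoPass` are simplified stand-ins for `EnumeratedStatistics`, `EnumerateStatistics()`, and the builder logic in this PR, the callback returns `void` instead of `Status`, and the key strings are used only as labels.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <variant>
#include <vector>

using Value = std::variant<int64_t, double>;

struct Stat {
  std::optional<int32_t> nth_column;  // nullopt means record-batch level
  std::string key;
  Value value;
};

using OnStat = std::function<void(const Stat&)>;

// Replays the same statistics on every call, so it can be run twice.
inline void Enumerate(const std::vector<Stat>& stats, const OnStat& on_stat) {
  for (const auto& stat : stats) on_stat(stat);
}

struct Built {
  std::vector<std::size_t> types;   // distinct value types, first-seen order
  std::vector<int8_t> type_codes;   // one code per statistic, indexing types
};

inline Built BuildTwoPass(const std::vector<Stat>& stats) {
  Built out;
  // Pass 1: collect the distinct value types that actually occur.
  Enumerate(stats, [&](const Stat& stat) {
    const std::size_t type = stat.value.index();
    if (std::find(out.types.begin(), out.types.end(), type) == out.types.end()) {
      out.types.push_back(type);
    }
  });
  // Pass 2: with the type list fixed, emit one type code per statistic.
  Enumerate(stats, [&](const Stat& stat) {
    auto it = std::find(out.types.begin(), out.types.end(), stat.value.index());
    out.type_codes.push_back(static_cast<int8_t>(it - out.types.begin()));
  });
  return out;
}
```

The second pass exists because the union type has to be known before any value can be appended, which is the complexity mentioned above.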
Rationale for this change
The statistics schema for the Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map and union). So a reusable implementation that builds a statistics array is useful.
What changes are included in this PR?
`arrow::RecordBatch::MakeStatisticsArray()` is a convenience function that converts `arrow::ArrayStatistics` in an `arrow::RecordBatch` to `arrow::Array` for the Arrow C data interface.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
`arrow::ArrayStatistics` to `arrow::Array` for the Arrow C data interface #44010