-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41301: [C++] Extract the kernel loops used for PrimitiveTakeExec and generalize to any fixed-width type #41373
Conversation
29b582d
to
d47fb55
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm skeptical that you want to reuse this for Filter, unless you add Gather methods for batch selection. For Filter performance, it is essential to write out ranges of selected values at a time, not one value at a time. I don't know if that's what you have in mind.
Other than that, some assorted comments.
cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc
Outdated
Show resolved
Hide resolved
I want to expand the set of |
3dd6c56
to
6d769d3
Compare
Nice! Can you also post numbers obtained with the |
Looking at both bloaty -d symbols -C full -n 0 \
HEAD-vector_selection_take_internal.cc.o HEAD-vector_selection_internal.cc.o -- \
MAIN-vector_selection_take_internal.cc.o MAIN-vector_selection_internal.cc.o FILE SIZE VM SIZE
-------------- --------------
[NEW] +32.8Ki [NEW] +32.7Ki arrow::compute::internal::FixedWidthTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
[NEW] +204 [NEW] +168 GCC_except_table272
[NEW] +189 [NEW] +104 arrow::Status arrow::Status::NotImplemented<char const (&) [38], arrow::DataType const&>(char const (&&&) [38], arrow::DataType const&&&) Details
+183% +168 +300% +168 GCC_except_table170
+43% +164 +47% +164 GCC_except_table20
[NEW] +156 [NEW] +120 GCC_except_table302
[NEW] +146 [NEW] +76 GCC_except_table99
+40% +109 +71% +144 GCC_except_table24
+72% +143 +66% gg +108 GCC_except_table13
[NEW] +126 [NEW] +56 GCC_except_table60
[NEW] +108 [NEW] +72 GCC_except_table138
+169% +108 +257% +72 GCC_except_table172
[NEW] +108 [NEW] +72 GCC_except_table225
[NEW] +107 [NEW] +72 GCC_except_table64
+195% +107 +360% +72 GCC_except_table87
[NEW] +96 [NEW] +60 GCC_except_table111
+160% +96 +250% +60 GCC_except_table151
+25% +95 +18% +60 GCC_except_table17
[NEW] +95 [NEW] +60 GCC_except_table72
[NEW] +95 [NEW] +60 GCC_except_table81
+116% +88 +220% +88 GCC_except_table192
[NEW] +84 [NEW] +48 GCC_except_table188
[NEW] +76 [NEW] +40 GCC_except_table107
+96% +68 +189% +68 GCC_except_table10
[NEW] +68 [NEW] +32 GCC_except_table102
[NEW] +68 [NEW] +32 GCC_except_table114
[NEW] +68 [NEW] +32 GCC_except_table144
[NEW] +68 [NEW] +32 GCC_except_table197
[NEW] +68 [NEW] +32 GCC_except_table223
[NEW] +67 [NEW] +32 GCC_except_table43
[NEW] +67 [NEW] +32 GCC_except_table53
[NEW] +67 [NEW] +32 GCC_except_table90
[NEW] +64 [NEW] +28 GCC_except_table164
[NEW] +64 [NEW] +28 GCC_except_table199
[NEW] +64 [NEW] +28 GCC_except_table207
[NEW] +64 [NEW] +28 GCC_except_table213
[NEW] +63 [NEW] +28 GCC_except_table28
[NEW] +60 [NEW] +24 GCC_except_table128
[NEW] +60 [NEW] +24 GCC_except_table155
[NEW] +60 [NEW] +24 GCC_except_table298
[NEW] +59 [NEW] +24 GCC_except_table36
[NEW] +56 [NEW] +20 GCC_except_table182
[NEW] +56 [NEW] +20 GCC_except_table190
[NEW] +56 [NEW] +20 GCC_except_table209
[NEW] +52 [NEW] +16 GCC_except_table133
[NEW] +52 [NEW] +16 GCC_except_table134
[NEW] +52 [NEW] +16 GCC_except_table162
[NEW] +52 [NEW] +16 GCC_except_table168
[NEW] +52 [NEW] +16 GCC_except_table179
[NEW] +52 [NEW] +16 GCC_except_table194
+100% +52 +100% +16 GCC_except_table202
[NEW] +52 [NEW] +16 GCC_except_table218
[NEW] +52 [NEW] +16 GCC_except_table221
[NEW] +52 [NEW] +16 GCC_except_table277
[NEW] +52 [NEW] +16 GCC_except_table279
[NEW] +51 [NEW] +16 GCC_except_table38
[NEW] +50 [NEW] +16 GCC_except_table4
+92% +48 +300% +48 GCC_except_table132
[NEW] +47 [NEW] +12 GCC_except_table54
[NEW] +42 [NEW] +16 lCPI223_1
[NEW] +40 [NEW] +16 lJTI2_8
[NEW] +40 [NEW] +16 lJTI2_9
+25% +32 +53% +32 GCC_except_table84
+0.4% +32 +0.4% +32 ltmp9
+30% +31 -5.9% -4 GCC_except_table85
[NEW] +31 [NEW] +5 lJTI170_0
[NEW] +30 [NEW] +4 lJTI170_1
+44% +28 +100% +28 GCC_except_table171
-6.4% -7 +70% +28 GCC_except_table35
+32% +24 +60% +24 GCC_except_table177
+46% +24 +150% +24 GCC_except_table178
+38% +20 +125% +20 GCC_except_table217
+26% +15 +100% +16 GCC_except_table100
+19% +12 +43% +12 GCC_except_table206
+19% +12 +43% +12 GCC_except_table212
+14% +8 +40% +8 GCC_except_table181
+5.6% +4 +11% +4 GCC_except_table216
+2.9% +2 +7.4% +2 typeinfo name for arrow::StructArray
+1.2% +2 +2.6% +2 typeinfo name for std::__1::__shared_ptr_emplace<arrow::ChunkedArray, std::__1::allocator<arrow::ChunkedArray> >
-6.8% -4 -16.7% -4 GCC_except_table83
-13.3% -8 -33.3% -8 GCC_except_table149
-17.9% -12 -37.5% -12 GCC_except_table41
-20.7% -12 -50.0% -12 GCC_except_table8
-23.9% -16 -50.0% -16 GCC_except_table52
-26.3% -20 -50.0% -20 GCC_except_table215
[DEL] -30 [DEL] -4 lJTI169_1
-27.3% -30 -46.9% -30 lJTI2_0
[DEL] -31 [DEL] -5 lJTI169_0
+2.9% +3 -47.1% -32 GCC_except_table34
[DEL] -41 [DEL] -16 lJTI25_0
[DEL] -42 [DEL] -16 lCPI222_1
-40.7% -44 -61.1% -44 GCC_except_table173
[DEL] -46 [DEL] -12 GCC_except_table7
[DEL] -47 [DEL] -12 GCC_except_table55
-0.6% -48 -0.6% -48 arrow::compute::internal::FSLTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
[DEL] -51 [DEL] -16 GCC_except_table37
[DEL] -51 [DEL] -16 GCC_except_table51
[DEL] -51 [DEL] -16 GCC_except_table61
[DEL] -52 [DEL] -16 GCC_except_table135
[DEL] -52 [DEL] -16 GCC_except_table148
[DEL] -52 [DEL] -16 GCC_except_table161
[DEL] -52 [DEL] -16 GCC_except_table203
[DEL] -52 [DEL] -16 GCC_except_table285
[DEL] -52 [DEL] -16 GCC_except_table287
[DEL] -52 [DEL] -16 GCC_except_table288
[DEL] -55 [DEL] -20 GCC_except_table33
[DEL] -55 [DEL] -20 GCC_except_table40
[DEL] -56 [DEL] -20 GCC_except_table208
[DEL] -56 [DEL] -20 GCC_except_table214
-48.3% -58 -46.2% -24 GCC_except_table2
[DEL] -60 [DEL] -24 GCC_except_table127
[DEL] -60 [DEL] -24 GCC_except_table306
[DEL] -63 [DEL] -28 GCC_except_table29
[DEL] -64 [DEL] -28 GCC_except_table163
[DEL] -64 [DEL] -28 GCC_except_table180
[DEL] -68 [DEL] -32 GCC_except_table113
[DEL] -68 [DEL] -32 GCC_except_table145
-22.1% -72 -28.1% -72 GCC_except_table11
[DEL] -75 [DEL] -40 GCC_except_table59
[DEL] -76 [DEL] -40 GCC_except_table108
[DEL] -76 [DEL] -40 GCC_except_table205
[DEL] -76 [DEL] -40 GCC_except_table211
-61.3% -92 -79.3% -92 GCC_except_table9
-20.2% -95 -15.0% -60 GCC_except_table16
[DEL] -95 [DEL] -60 GCC_except_table71
[DEL] -95 [DEL] -60 GCC_except_table82
[DEL] -95 [DEL] -60 GCC_except_table98
[DEL] -96 [DEL] -60 GCC_except_table110
-61.5% -96 -71.4% -60 GCC_except_table152
-64.3% -99 -76.2% -64 GCC_except_table86
[DEL] -100 [DEL] -64 GCC_except_table131
[DEL] -100 [DEL] -64 GCC_except_table174
[DEL] -103 [DEL] -24 typeinfo for arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl
[DEL] -107 [DEL] -72 GCC_except_table65
[DEL] -108 [DEL] -72 GCC_except_table139
[DEL] -108 [DEL] -72 GCC_except_table226
[DEL] -113 [DEL] -28 arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl::Finish()
[DEL] -120 [DEL] -48 GCC_except_table222
[DEL] -127 [DEL] -48 vtable for arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl
[DEL] -132 [DEL] -60 GCC_except_table198
-24.1% -132 -25.8% -132 GCC_except_table21
[DEL] -133 [DEL] -8 arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>::Init()
[DEL] -134 [DEL] -64 GCC_except_table89
[DEL] -136 [DEL] -64 GCC_except_table101
-26.2% -136 -28.1% -136 GCC_except_table18
[DEL] -137 [DEL] -16 typeinfo for arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>
[DEL] -137 [DEL] -58 typeinfo name for arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl
[DEL] -140 [DEL] -68 GCC_except_table189
-64.8% -140 -72.2% -104 GCC_except_table193
[DEL] -149 [DEL] -96 arrow::FixedSizeBinaryArray::~FixedSizeBinaryArray()
-74.5% -152 -90.5% -152 GCC_except_table280
[DEL] -156 [DEL] -120 GCC_except_table310
[DEL] -169 [DEL] -48 vtable for arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>
-50.1% -208 -54.7% -208 GCC_except_table22
[DEL] -221 [DEL] -100 typeinfo name for arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>
[DEL] -252 [DEL] -8 arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>::~Selection()
-2.1% -256 -2.1% -256 ltmp5
[DEL] -312 [DEL] -240 GCC_except_table169
-82.5% -316 -90.8% -316 GCC_except_table25 -8.9% -352 -9.2% -352 arrow::compute::internal::PopulateTakeKernels(std::__1::vector<arrow::compute::internal::SelectionKernelData, std::__1::allocator<arrow::compute::internal::SelectionKernelData> >*)
[DEL] -464 [DEL] -304 arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl::~FSBSelectionImpl()
-0.8% -677 [ = ] 0 [Unmapped]
[DEL] -5.83Ki [DEL] -5.72Ki arrow::compute::internal::FSBTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
[DEL] -23.9Ki [DEL] -23.8Ki arrow::compute::internal::PrimitiveTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
-0.1% -504 +0.5% +1.45Ki TOTAL UPDATE: Last 3 commits lead to:
|
Impact on origin/main: Before this commit -0.1% -504 +0.5% +1.45Ki TOTAL After this commit -0.1% -392 +0.5% +1.27Ki TOTAL
In a follow-up PR I'm adding Status return type out of necessity, so this change is not only for binary size reasons. -0.1% -528 +0.5% +1.21Ki TOTAL
cpp/src/arrow/util/macros.h
Outdated
@@ -117,6 +125,7 @@ | |||
#define ARROW_PREFETCH(addr) | |||
#define ARROW_RESTRICT | |||
#define ARROW_COMPILER_ASSUME(expr) | |||
#define ARROW_COMPILER_UNREACHABLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it use https://en.cppreference.com/w/cpp/utility/unreachable
with an adequate feature test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will undo this for now and merge. I added this for very minor binary size gains. I will tackle this in the next PR.
I've run the Take micro-benchmarks locally with this (AMD Zen 2, gcc 12.3.0). The changes a bit all over the place and show that compilers are generally capricious and difficult to steer towards "optimal" code :-)
|
Do you want me to copy and paste the code so that EDIT: The things that the PR improves are much faster. The |
No, it's not important. I was just posting the results for information. |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 4f89097. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 20 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
I want to instantiate this primitive operation in other scenarios (e.g. the optimized version of
Take
that handles chunked arrays) and extend the sub-classes ofGatherCRTP
with different member functions that re-use theWriteValue
function generically (any fixed-width type and even bit-wide booleans).When taking these improvements to
Filter
I will also re-use the "gather" concept and parameterize it by bitmaps/boolean-arrays instead of selection vectors (indices
) liketake
does. So gather is not a "renaming of take" but rather a generalization oftake
andfilter
do in Arrow with different representations of what should be gathered from thevalues
array.What changes are included in this PR?
Take
implementation for values/indices without nullsTake
into a single oneAre these changes tested?
Take
guarantees null values are zeroed out