Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41301: [C++] Extract the kernel loops used for PrimitiveTakeExec and generalize to any fixed-width type #41373
GH-41301: [C++] Extract the kernel loops used for PrimitiveTakeExec and generalize to any fixed-width type #41373
Changes from 21 commits
27d5efb
0817d59
35c409d
bc75bfa
9ee75fb
b0614a0
1fdc401
b122382
97be842
d77c758
47ef815
5f635aa
54ca9a4
9f5d834
8a089c8
8629277
efaeb9c
8f884c6
a64a582
c31e5be
075508c
ca6937c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
ARROW_FORCE_INLINE
desirable here? Generally, the compiler is able to make more informed heuristics than us (but not always admittedly). This function is rather complex so I'm not sure force-inlining it would actually be beneficial (and could blow up code generation sizes).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I checked code and benchmarks (a month or more ago).
Here's the overall approach with this header.
GatherBaseCRTP
is a template that generates these templates:The usage should be as follows. On the same function, the specific
Gather
instance should be instantiated and the method should be called on it:What this achieves?
Gather<kValueWidthInBits, ...> gather{...}
isn't really the initialization of a portion of the memory in the stack, but of registers. Because the lifetime ofgather
is confined to a single function and all the method calls on it are inlined, the member ofgather
can live in registers from initialization to the end of the kernel loop. I did this to achieve the same performance we had when these loops were hand-written (more than once!) invector_selection_take_internal.cc
.This idea of keeping variables in registers is mentioned in database implementation reports (Efficiently Compiling Efficient Query Plans for Modern Hardware) and it's not just compiler-brain perfectionism (https://llvm.org/docs/Passes.html#sroa-scalar-replacement-of-aggregates).
By force-inlining the
Execute
methods I'm actually reducing code size (together with the fact that I instantiateGather
and these specificExecute
calls from a single place). Without inliningExecute
there would be more binary code to deal with construction/destruction, loads to/from memory.To make sure we only ever instantiate these templates once, I could wrap this code in templates outside the inline namespace. The templates would contain the initialization of
Gather<...> gather
and the inlined calls to the two inlinedExecute
functions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to build that functional façade to these classes before we concretely have more than two instantiations of these templates? I could create an issue to track this concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know using registers is critical for performance. I'm just surprised that the compiler wouldn't cache member variables in registers in a (non-inlined) method's body. Adding a couple memory lookups at the start and end of such a heavy function should not yield a noticeable slowdown.
Just because a compiler feels better with some particular organization of the source code doesn't mean we've unlocked a robust performance improvement. We may just have hit a sweeter spot in this optimizer's behaviour.
In any case, we can proceed with this. This is helper code that should only be used in a couple kernels, so the potential code size increase is hopefully (!) minor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the inline namespace annotation as well as it seems to be a red-herring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without inlining
Execute
, there has to be athis
pointer pointing to the struct laid out in memory and call toWrite*
have to load members through thethis
pointer. Not an impossible optmization, but wasn't happening.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Compilers are annoying :-)