GH-41687: [C++] bit_util: Trying to remove pre-compute table #41690
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit e2f513e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit e2f513e. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
// static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128};
static constexpr uint8_t GetBitMask(uint8_t index) {
  // DCHECK(index >= 0 && index <= 7);
  return static_cast<uint8_t>(1) << index;
The compiler will generate code that performs the << on a 32-bit integer. A more honest implementation (in the sense that it gives more freedom to the compiler [1]):
return static_cast<uint8_t>(0x1 << index);
// and
return static_cast<uint8_t>(~(0x1 << index));
And since indices in Arrow are rarely uint8_t, I would keep the index type unconstrained like this:
template <typename T>
static constexpr uint8_t GetBitmask2(T index) {
return static_cast<uint8_t>(0x1 << index);
}
template <typename T>
static constexpr uint8_t GetFlippedBitmask2(T index) {
return static_cast<uint8_t>(~(0x1 << index));
}
[1] might matter more for rustc than clang, because C/C++ compilers already have a lot of freedom even when your code contains many type constraints.
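A minimal sketch for checking the templated helpers suggested above against the lookup table they replace. The kBitmask values come from the diff; the kFlippedBitmask table is an assumption (written here as the bitwise complements of kBitmask), and MatchesTables is a hypothetical helper name used only for this check.

```cpp
#include <cstdint>

// Table from the diff, plus an assumed complement table for comparison.
static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128};
static constexpr uint8_t kFlippedBitmask[] = {254, 253, 251, 247, 239, 223, 191, 127};

template <typename T>
static constexpr uint8_t GetBitmask2(T index) {
  // The shift is evaluated on int; the cast keeps only the low 8 bits.
  return static_cast<uint8_t>(0x1 << index);
}

template <typename T>
static constexpr uint8_t GetFlippedBitmask2(T index) {
  return static_cast<uint8_t>(~(0x1 << index));
}

// Hypothetical helper: compares the computed masks with both tables.
constexpr bool MatchesTables() {
  for (int i = 0; i < 8; ++i) {
    if (GetBitmask2(i) != kBitmask[i]) return false;
    if (GetFlippedBitmask2(i) != kFlippedBitmask[i]) return false;
  }
  return true;
}

static_assert(MatchesTables(), "computed masks must match the tables");
// Because the shift happens at int width, index 8 is still well-defined:
// 1 << 8 is 256, which the cast truncates to 0.
static_assert(GetBitmask2(8) == 0, "shift is performed on int, then narrowed");
```

Since everything is constexpr, the comparison runs at compile time and costs nothing at run time.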
cpp/src/arrow/util/bit_util.h
- static inline void SetBit(uint8_t* bits, int64_t i) { bits[i / 8] |= kBitmask[i % 8]; }
+ static inline void SetBit(uint8_t* bits, int64_t i) { bits[i / 8] |= GetBitMask(i % 8); }
One advantage that the kBitmask implementation had over this one is that memory access with a negative index is UB (kBitmask[i % 8] reads out of bounds when i % 8 is negative), so the compiler was assuming here that i >= 0 to come up with the bitmask.
We can bring that UB back (yes, UB is a Good Thing™️) by using ARROW_COMPILER_ASSUME(i >= 0). The generated assembly for SetBit with the non-negative assumption is much shorter and doesn't need a conditional mov (or csel on ARM) instruction.
SetBit2(unsigned char*, long): # @SetBit2(unsigned char*, long)
mov rcx, rsi
lea rax, [rsi + 7]
test rsi, rsi
cmovns rax, rsi
mov edx, eax
and edx, 248
sub ecx, edx
mov edx, 1
shl edx, cl
sar rax, 3
or byte ptr [rdi + rax], dl
ret
SetBit2NNeg(unsigned char*, long): # @SetBit2NNeg(unsigned char*, long)
mov ecx, esi
and cl, 7
mov al, 1
shl al, cl
shr rsi, 3
or byte ptr [rdi + rsi], al
ret
All the experiments on Godbolt -> https://godbolt.org/z/Ez974vE3d
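For readers who do not follow the link, here is a rough sketch of what the two functions in the assembly listing could look like. The names SetBit2 and SetBit2NNeg come from the listing; the bodies are an assumption, and SKETCH_ASSUME is a hypothetical stand-in for ARROW_COMPILER_ASSUME so that the snippet compiles without Arrow headers.

```cpp
#include <cstdint>

// Hypothetical stand-in for ARROW_COMPILER_ASSUME (not Arrow's definition).
#if defined(__clang__)
#define SKETCH_ASSUME(expr) __builtin_assume(expr)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(expr) ((expr) ? static_cast<void>(0) : __builtin_unreachable())
#elif defined(_MSC_VER)
#define SKETCH_ASSUME(expr) __assume(expr)
#else
#define SKETCH_ASSUME(expr) static_cast<void>(0)
#endif

// No assumption: i / 8 and i % 8 must also handle negative i,
// which is where the extra fix-up instructions come from.
inline void SetBit2(uint8_t* bits, int64_t i) {
  bits[i / 8] |= static_cast<uint8_t>(0x1 << (i % 8));
}

// With the assumption, a negative i is UB again, so the compiler is free
// to lower the division and remainder to a plain shift and mask.
inline void SetBit2NNeg(uint8_t* bits, int64_t i) {
  SKETCH_ASSUME(i >= 0);
  bits[i / 8] |= static_cast<uint8_t>(0x1 << (i % 8));
}
```

Both functions behave the same for every valid (non-negative) index; only the generated code is expected to differ.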
This is OK because whatever the C++ standard defines as the expected behavior for negative i is bogus in the context of Arrow and SetBit, since a negative i is already poison.
On my M1 macOS with -O3:
Current:
Before:
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit 2eb38e4. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 2eb38e4. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
No idea why compiling failed in conbench.
Thanks @felipecrv, I've applied the suggestions, but conbench failed. Maybe it will work later.
cpp/src/arrow/util/bit_util.h
// DCHECK(index >= 0 && index <= 7);
ARROW_COMPILER_ASSUME(index >= 0 && index <= 7);
Putting the assumption here doesn't have the same effect, because signed_index % 8 happens before the call to GetBitMask and GetFlippedBitMask.
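To make the ordering issue concrete: signed / and % truncate toward zero in C++, so they only reduce to a shift and a mask when the index is known to be non-negative at the point where they are evaluated. The sketch below illustrates that with compile-time checks; all four function names are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Byte and bit position as the caller computes them: signed division and remainder.
constexpr int64_t ByteIndexSigned(int64_t i) { return i / 8; }
constexpr int64_t BitIndexSigned(int64_t i) { return i % 8; }

// The cheaper forms the compiler may use once it knows i >= 0.
constexpr int64_t ByteIndexNonNeg(int64_t i) {
  return static_cast<int64_t>(static_cast<uint64_t>(i) >> 3);
}
constexpr int64_t BitIndexNonNeg(int64_t i) { return i & 7; }

// For non-negative indices both forms agree.
static_assert(ByteIndexSigned(19) == ByteIndexNonNeg(19), "19 / 8 == 19 >> 3");
static_assert(BitIndexSigned(19) == BitIndexNonNeg(19), "19 % 8 == 19 & 7");

// For negative indices they do not, so without an assumption on i itself
// the compiler has to keep the extra fix-up code for / and %.
static_assert(BitIndexSigned(-3) == -3, "remainder keeps the sign of the dividend");
static_assert(BitIndexNonNeg(-3) == 5, "masking does not");
static_assert(ByteIndexSigned(-3) == 0, "division truncates toward zero");
```

An assumption placed inside the mask helper only constrains the already-computed remainder, which is why it would need to sit in the caller, before i / 8 and i % 8.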
I don't fully understand this. Does signed_index % 8 happening before the call to GetBitMask mean that the code in GetBitMask cannot benefit from the assumption? Or do you mean the compiler might analyze % 8 and derive the assumption itself?
Or, from the godbolt link, do you mean changing ClearBit and SetBit to be UB if i < 0?
You're right, I tried it in godbolt. I'll add this.
ARROW_COMPILER_ASSUME(index >= 0 && index <= 7) is superfluous because shifts are already UB in C when the shift amount is negative or not smaller than the bit-width of the integer.
Aha, I used ARROW_COMPILER_ASSUME(index >= 0 && index <= 7) like a DCHECK here. Removed it.
@ursabot please benchmark
Benchmark runs are scheduled for commit 270e1a2. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 270e1a2. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
@ursabot please benchmark
Benchmark runs are scheduled for commit 4a9bae8. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 4a9bae8. There were 65 benchmark results indicating a performance regression.
The full Conbench report has more details.
kBitmask table ? #41687