Replies: 1 comment
-
The effect I was seeing where likely_if_innermost was substantially worse in some cases was almost entirely a quirk of how NoAsserts works, which I'll fix shortly. With that fixed, likely vs likely_if_innermost barely matters. Using likely_if_innermost instead of likely increases runtime by 1.3% and decreases code size by 0.8%. So that answers question 1. We still have the is-this-the-right-heuristic question though.
-
This was brought up in several dev meetings. I finally did a large experiment to try to settle the question. The surprising answer is that switching it to likely_if_innermost increased code size and runtime on average (by 5% and 3% respectively).
To understand why, it's helpful to consider a small example with typical GuardWithIf usage:
Before loop partitioning it looks like this:
Note that the outer if, whose condition is marked 'likely', has no else case. This means that if you partition the loop such that all iterations where the condition is true are in the steady state, and all iterations where it's false are in an epilogue, and the partition goes perfectly, then the epilogue is a no-op, so you get this:
Loop partitioning as directed by that likely has simply reduced the extent of the yi loop, without increasing code size. It's basically the same code as if you had said .never_partition(y).
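To make the "epilogue is a no-op" point concrete, here is a minimal plain C++ sketch of the loop shapes involved. It is not Halide's lowered IR or its actual output; the function names, and the idea of recording visited rows, are hypothetical and only there so the two forms can be compared.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Guarded form: every (yo, yi) iteration tests the tail condition.
// The if has no else, so iterations past the end simply do nothing.
std::vector<int> rows_guarded(int height, int factor) {
    assert(height >= 0 && factor > 0);
    std::vector<int> visited;
    int outer_extent = (height + factor - 1) / factor;  // ceil(height / factor)
    for (int yo = 0; yo < outer_extent; yo++) {
        for (int yi = 0; yi < factor; yi++) {
            int y = yo * factor + yi;
            if (y < height) {  // stands in for the condition marked likely
                visited.push_back(y);
            }
        }
    }
    return visited;
}

// Inner loop partitioned: the steady state covers exactly the yi values
// for which the condition holds. The epilogue would be an empty loop
// body, so it is dropped and only the loop extent changes.
std::vector<int> rows_inner_partitioned(int height, int factor) {
    assert(height >= 0 && factor > 0);
    std::vector<int> visited;
    int outer_extent = (height + factor - 1) / factor;
    for (int yo = 0; yo < outer_extent; yo++) {
        int steady_extent = std::min(factor, height - yo * factor);
        for (int yi = 0; yi < steady_extent; yi++) {
            visited.push_back(yo * factor + yi);  // no if needed here
        }
    }
    return visited;
}
```

Both functions visit the same rows in the same order, and the second contains only one copy of the loop body, which is why this partition costs nothing in code size.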
So with this kind of scheduling idiom, using likely doesn't hurt, and there are cases in the codebase I'm looking at where it helps (for reasons that aren't entirely clear to me).
But now consider a case where the split factor is a constant:
Before loop partitioning we have:
It's much the same. After loop partitioning we get:
Uh oh.
Digging into this more, it's happening because we try to partition loops from the outside in. In the first, more complicated case, we first try to partition the loop over yo. This fails because the IR is complicated, so we try to partition the loop over yi. This succeeds, and is what we want in this instance. In the second case we try to partition the loop over yo, and succeed. So we get a steady-state group of 8 scanlines, and a tail group of 8 scanlines with the if still in it, in separate pieces of code.
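The outer-loop partition described above can be sketched in plain C++ as follows. Again this is an illustration of the loop shape, not compiler output: the constant `kFactor`, the function names, and the row-recording body are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

constexpr int kFactor = 8;  // hypothetical constant split factor

// Reference: guarded loop nest with no partitioning at all.
std::vector<int> rows_guarded_const(int height) {
    assert(height >= 0);
    std::vector<int> visited;
    int outer_extent = (height + kFactor - 1) / kFactor;
    for (int yo = 0; yo < outer_extent; yo++) {
        for (int yi = 0; yi < kFactor; yi++) {
            int y = yo * kFactor + yi;
            if (y < height) visited.push_back(y);
        }
    }
    return visited;
}

// Outer loop partitioned: yo is split into a steady state where the
// condition is known to hold for every yi, and a tail that keeps the if.
// The loop body now appears twice, which is the code-size cost.
std::vector<int> rows_outer_partitioned(int height) {
    assert(height >= 0);
    std::vector<int> visited;
    int outer_extent = (height + kFactor - 1) / kFactor;
    int steady_groups = height / kFactor;  // groups that are completely full

    // Steady state: first copy of the body, no if.
    for (int yo = 0; yo < steady_groups; yo++) {
        for (int yi = 0; yi < kFactor; yi++) {
            visited.push_back(yo * kFactor + yi);
        }
    }
    // Tail: second copy of the body, with the if still in it.
    for (int yo = steady_groups; yo < outer_extent; yo++) {
        for (int yi = 0; yi < kFactor; yi++) {
            int y = yo * kFactor + yi;
            if (y < height) visited.push_back(y);
        }
    }
    return visited;
}
```

With a hypothetical height of 12, the steady state handles rows 0 to 7 and the tail iterates over a full group of 8 while only rows 8 to 11 pass the guard. The result is identical to the unpartitioned loop, but the body is duplicated and the tail still pays for the test on every iteration.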
So our heuristic for deciding which loops to split is not so good in this case. You can manually control it with .partition, and indeed if you say .never_partition(yo) you get good code.
We partition loops from the outside in because there are other cases where it's much better. If you're doing a blur on an input with a boundary condition, you probably want to partition the loop over output vectors, not the loop over the kernel taps. If you're doing something on a GPU, you probably want to partition the GPU block loop, so you get the same control flow within each warp, as opposed to partitioning the thread loop, which would lead to warp divergence.
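As a rough illustration of the blur case, here is a plain C++ sketch of a 3-tap sum with a clamp-to-edge boundary condition, partitioned over the output loop rather than over the taps. The kernel, the boundary condition, and all names are assumptions chosen for the example, not taken from any particular pipeline.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Read with clamp-to-edge. Requires a non-empty input.
int clamped_read(const std::vector<int>& in, int x) {
    int n = static_cast<int>(in.size());
    assert(n > 0);
    return in[std::max(0, std::min(x, n - 1))];
}

// Unpartitioned: every tap of every output pays for the clamp.
std::vector<int> blur_clamped(const std::vector<int>& in) {
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    for (int x = 0; x < n; x++) {
        out[x] = clamped_read(in, x - 1) + clamped_read(in, x) + clamped_read(in, x + 1);
    }
    return out;
}

// Partitioned over the output loop: only the prologue and epilogue
// clamp; the steady state reads the input directly.
std::vector<int> blur_partitioned(const std::vector<int>& in) {
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    int steady_begin = std::min(1, n);
    int steady_end = std::max(steady_begin, n - 1);
    for (int x = 0; x < steady_begin; x++) {  // prologue
        out[x] = clamped_read(in, x - 1) + clamped_read(in, x) + clamped_read(in, x + 1);
    }
    for (int x = steady_begin; x < steady_end; x++) {  // steady state, no clamps
        out[x] = in[x - 1] + in[x] + in[x + 1];
    }
    for (int x = steady_end; x < n; x++) {  // epilogue
        out[x] = clamped_read(in, x - 1) + clamped_read(in, x) + clamped_read(in, x + 1);
    }
    return out;
}
```

Partitioning the tap loop instead would leave a boundary test inside every output iteration, which is why the outer loop is the better candidate in this kind of code.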
The next questions are