OIIO::bitcast adjustments #4101
Conversation
Looking at the generated assembly, gcc, clang, and msvc all generate the optimal code when doing something productive with the result, so I'm all for this code simplification.
More info: This report says that msvc will be worse after this PR, but that seems to be the case only when the result of the cast is not used. Once it's used to do something, msvc produces optimal code (verified only for the float->int cast): https://godbolt.org/z/rP4Kvrs6o
Followup question: If the memcpy already is optimal (let's assume it is unless we have proof), why still have two different code paths? If we don't have examples of __builtin_bit_cast being better, should we get rid of that and only do the memcpy? I get worried that supporting alternative code paths could one day result in one of the paths developing a regression that we don't notice as easily.
I'm assuming that eventually this will all be replaced by std::bit_cast when C++20 is our floor. I did think about what you mention -- looking at godbolt, even the memcpy appears to completely optimize away (for the simple examples I tested), so maybe that's the one solution we need and it's not worth the extra complexity? I guess I was figuring that __builtin_bit_cast would be the best the gcc/clang based compilers could do, possibly employing tricks that would generate better code in more complex situations than I could easily test with godbolt. I think the philosophy I was following was "if the compiler provides a built-in for this specific purpose, use it, and if not, fall back on the memcpy trick." There are two code paths, I guess, but not for the same compiler?
My light understanding of the builtins is that clang/gcc will replace the memcpy with the builtin memcpy if it knows the size at compile time (yes in our case here) and the size isn't too large (also yes for us). That explains the inlining and why we don't see a function call to memcpy. I haven't looked at the code, but it wouldn't surprise me if the builtin bit_cast is implemented with a memcpy under the hood just so there are fewer code paths to maintain, as suggested by this discussion. I've been bitten enough times to be wary of optimizing code if there's no measured benefit: I suspect the two code paths at best will offer no benefit, and if they do in fact diverge in generated assembly, I would expect the newer bit_cast version to be more likely to contain bugs, at least in the near term, than the older and more widely used memcpy. Switching to C++20's bit_cast is cleaner, but unless it gives some measured improvement (maybe with constexpr?), it's once again probably not worth doing until we raise our minimum C++ version to 20 and can then outright replace OIIO::bitcast with std::bit_cast.
To be clear, this isn't much of a "two code paths" case -- it always uses __builtin_bit_cast for the compilers where it exists, and falls back on a hokey memcpy where it doesn't (where we've checked, this doesn't seem to be a perf problem). Using std::bit_cast (when it's available) could also vary in its exact implementation platform to platform, though I bet clang/gcc will simply define it to call __builtin_bit_cast. I'm applying the following heuristic, roughly speaking: prefer std where it exists and is performant; where not, prefer an idiom that directly expresses what we want (in this case, __builtin_bit_cast) when available, rather than clumsily making a multi-statement work-alike. I think your concern is that last part: if direct idioms only exist on some platforms, it's just not worth the extra complexity. I can remove the __builtin_bit_cast if you think it just adds unnecessary complexity here, and we can just stand on the memcpy until the minimum is C++20 and we can rely on directly calling std::bit_cast.
Yes, I do worry this is just introducing unneeded complexity with no measured benefit. I would go with the further simplification of just doing memcpy until we replace this all with bit_cast. But the PR is already technically fine, it's simpler than before, and I already approved it, so I'm fine with it being merged as is if you want. I'm merely trying to point out some things to consider, but in the end I don't think it's a big deal either way.
Please hold on a second. I wanted to revisit why these cast conditional compilations were added. I believe they came from SIMD optimizations done in Open Shading Language. Please look at the codegen difference between a simd loop using the intrinsics-based bitcast vs the memcpy-based one here: 5 instructions (bitcast based) vs 226 instructions (memcpy based)! Perhaps in scalar mode compilers might transform that memcpy to something else. But if that transformation doesn't happen, then the memcpy takes addresses, which disqualifies the values in the SROA (Scalar Replacement of Aggregates) optimization pass, which means they must exist on the stack. Existing on the stack disqualifies them from being kept in packed SIMD registers, and lots of code is generated to pull data out of SIMD registers to the stack, exercise the memcpy in scalar, then put the results back into SIMD registers.
ICX 2024.0 does ok with it, though: 5 instructions whether using the intrinsic, pun_bitcast, or memcpy. Could we just get the ICC conditional compilation added back into this PR, as relying on the memcpy version does bugger the vectorization.
Interesting, so the worry is that a consumer of OIIO (maybe OSL?) might be using OIIO::bitcast in their own code, compiled with icc and making use of openmp-simd vectorization? If so, that would indeed be a bad performance regression. Do we still want to support ICC? When you compile with it (see the compiler message in the godbolt link Alex posted), it emits a remark that ICC is deprecated.
If we do want to still support this use case, I would suggest wrapping that code in an icc-specific preprocessor check. Otherwise, it's nice to hear that all the other compilers are still doing the right thing with just memcpy.
Yeah, kicking myself for not remembering how many of these situations we've been through before, where we must use a bit of a baroque construct because the straightforward one that works fine for scalar code turns out to cause loops to fail to auto-vectorize. IIRC, the only current important OSL user of batch + icc is Pixar, and Stephen told me not too long ago that they were phasing out icc. We can ask him for confirmation, but I bet it might even already be safe to not care about icc code generation, at least for future versions (master/main) of OIIO & OSL.
@lgritz, not everyone moves to the latest compiler versions quickly; it would be good to give it a year or two of overlap for the entire ecosystem to support icx properly. And some features like the vectorizer optimization report are a bit more mature in icc, so even if we don't release a build with icc, it's still of use during development.
I see, ok, let me revise this PR and see what you think.
Use gcc __builtin_bit_cast when available.
Get rid of specializations -- they are not needed, as verified by godbolt.
Signed-off-by: Larry Gritz <[email protected]>
Revision, after realizing that the memcpy trick + icc will fail to
@lgritz, thanks for adding that ICC support back in. A note on terms: "auto-vectorization" is when a compiler decides to vectorize a loop on its own. My example used "explicit" SIMD vectorization, where the programmer tells the compiler "it's safe to perform operations in the loops with SIMD parallelism" with a pragma. Also, the PR description says "Use gcc __builtin_bit_cast when available," but now it doesn't actually do that? I guess the net result is that previously more compilers took the intrinsic-based code path, and now all but ICC will take the memcpy approach and rely on compiler transformations to occur before vectorization.
Aside: Stephen says his place has made the transition to icx. But I'm inclined to cater to Alex's desire to preserve icc compatibility for his own use, if he finds that it provides him with better tools to analyze how to vectorize code than does icx, even if icx is the preferred deployment compiler.
This looks good to me.
@AlexMWells I fixed the auto-vectorize language, thanks for the correction. I will fix up the details of the commit message when I do the merge.
vectorize loops containing it:

* Restore the intrinsics, just for icc, just for the float<->int/uint varieties (still axe the ones involving doubles).
* Remove the __builtin_bit_cast clauses; it seems to provide no benefit for the compilers that support it (icc doesn't, and that's the one that seems to need the extra hints).
* Add comments explaining why this is all the case, and also reminding us NOT to switch to C++20 std::bit_cast in the future without also testing whether using it prevents vectorization.

Signed-off-by: Larry Gritz <[email protected]>
Looks good to me. I'll let @AlexMWells chime in on whether he's ok with dropping the doubles.
OSL doesn't use bitcast with double anywhere.
Looks good. I would have left the doubles in for completeness, but as Larry mentioned, they're unused by OSL (and hopefully there are enough breadcrumbs here if someone does stumble into it).
Verified that the memcpy trick works fine (even when used in vectorized loops) for gcc, clang, MSVS, icx. So change the specialized versions using intrinsics so they are only used for icc, which fails to vectorize loops when the memcpy trick is used. Also get rid of unused double specializations.

---------

Signed-off-by: Larry Gritz <[email protected]>
Signed-off-by: Peter Kovář <[email protected]>