
OIIO::bitcast adjustments #4101

Merged (2 commits) Jan 5, 2024

Conversation


@lgritz lgritz commented Dec 31, 2023

Use gcc __builtin_bit_cast when available.

Get rid of specializations -- they are not needed, as verified by godbolt.

@ThiagoIze ThiagoIze left a comment


Looking at the generated assembly, gcc, clang, and msvc all generate optimal code when the result of the cast is actually used, so I'm all for this code simplification.

More info: this report says that msvc will be worse after this PR, but that seems to be the case only when the result of the cast is unused. Once it's used to do something, msvc produced optimal code (verified only for the float->int cast): https://godbolt.org/z/rP4Kvrs6o

Followup question: if the memcpy is already optimal (let's assume it is unless we have proof otherwise), why still have two different code paths? If we don't have examples of __builtin_bit_cast being better, should we get rid of that and only do the memcpy? I worry that supporting alternative code paths could one day result in one of the paths developing a regression that we don't notice as easily.


lgritz commented Jan 1, 2024

I'm assuming that eventually, this will all be replaced by std::bit_cast when C++20 is our floor.

I did think about what you mention -- looking at godbolt, even the memcpy appears to optimize away completely (for the simple examples I tested), so maybe that's the one solution we need and the extra complexity isn't worth it? I guess I was figuring that __builtin_bit_cast would be the best the gcc/clang based compilers could do, possibly employing tricks that would generate better code in more complex situations than I could easily test with godbolt. The philosophy I was following was "if the compiler provides a builtin for this specific purpose, use it, and if not, fall back on the memcpy trick." There are two code paths, I guess, but not for the same compiler?

@ThiagoIze (Collaborator)

My light understanding of the builtins is that clang/gcc will replace the memcpy call with their builtin inline memcpy if the size is known at compile time (yes in our case) and the size isn't too large (also yes for us). That explains the inlining and why we don't see a function call to memcpy. I haven't looked at the code, but it wouldn't surprise me if the builtin bit_cast is implemented with a memcpy under the hood just so there are fewer code paths to maintain, as suggested by this discussion.

I've been bitten enough times to be wary of optimizing code if there's no measured benefit: I suspect the two code paths at best will offer no benefit and if they do in fact diverge in generated assembly, I would expect the newer bit_cast version to be more likely to contain bugs, at least in the near-term, than the older and more used memcpy.

Switching to C++20's bit_cast is cleaner, but unless it gives some measured improvement (maybe with constexpr?), it's once again probably not worth doing until we raise our minimum C++ version to 20 and can then outright replace OIIO::bitcast with std::bit_cast.


lgritz commented Jan 2, 2024

To be clear, this isn't much of a "two code paths" case -- it always uses __builtin_bit_cast for the compilers where it exists, and falls back on a hokey memcpy where it doesn't (where we've checked, this doesn't seem to be a perf problem). Using std::bit_cast (when it's available) could also vary in its exact implementation from platform to platform, though I bet clang/gcc will simply define it to call __builtin_bit_cast.

I'm applying the following heuristic, roughly speaking: prefer std where it exists and is performant; where not, prefer an idiom that directly expresses what we want (in this case, __builtin_bit_cast) when available, rather than clumsily making a multi-statement work-alike; if direct idioms only exist on some platforms, #if them when available and fall back on the clunkier one when not, unless doing so introduces per-platform variability of results or is a readability/maintenance nightmare.

I think your concern is that last part: that it's just not worth the #if complexity, since we've verified that memcpy seems not to carry a performance penalty. Whereas I'm saying that this is a first step toward eventually just using std::bit_cast, and thought it might be worth the bet that __builtin_bit_cast has a higher chance than the memcpy trick of always being optimized under some future circumstance.

I can remove the __builtin_bit_cast if you think it just adds unnecessary complexity here, and we can stand on that until the minimum is C++20 and we can rely on calling std::bit_cast directly.

@ThiagoIze (Collaborator)

Yes, I do worry this is just introducing unneeded complexity with no measured benefit. I would go with the further simplification of just doing memcpy until we replace this all with bit_cast. But the PR is already technically fine, it's simpler than before, and I already approved it, so I'm fine with it being merged as is if you want. I'm merely trying to point out some things to consider but in the end I don't think it's a big deal either way.

@AlexMWells (Collaborator)

Please hold on a second. I wanted to revisit why these cast conditional compilations were added. I believe they came from SIMD optimizations done in Open Shading Language.

Please look at the codegen difference between a simd loop using the intrinsics based bitcast vs the memcpy based one here:
https://godbolt.org/z/EaaPhrqEq

5 instructions (bitcast based) vs 226 instructions (memcpy based)!

Perhaps in scalar mode compilers might transform that memcpy into something else. But if that transformation doesn't happen, the memcpy takes addresses, which disqualifies the values from the SROA (Scalar Replacement of Aggregates) optimization pass, meaning they must exist on the stack. Existing on the stack disqualifies them from being kept in packed SIMD registers, so the compiler emits lots of code to pull data out of SIMD registers onto the stack, run the memcpy in scalar, and then put the results back into SIMD registers.

@AlexMWells (Collaborator)

ICX 2024.0 does OK with it, though: 5 instructions whether using the intrinsic, pun_bitcast, or memcpy.
https://godbolt.org/z/8nozPb38d
And trying gcc 11 and clang 10+, they look OK as well.

Could we just get the ICC conditional compilation added back into this PR? Relying on the memcpy version really does bugger the vectorization.

@ThiagoIze (Collaborator)

Interesting, so the worry is that a consumer of OIIO (maybe OSL?) might be using OIIO::bitcast in their own code, compiled with icc and making use of openmp-simd autovectorization? If so, that would indeed be a bad performance regression.

Do we still want to support ICC? When you compile with it (see the compiler message in the godbolt link Alex posted) it'll say:

icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.

If we do want to still support this use case, I would suggest wrapping that code in a #if icc (whatever the define for that is) along with a comment saying that this is required for that compiler when used with openmp simd generation, and possibly a link to this PR.

Otherwise, it's nice to hear that all the other compilers are still doing the right thing with just memcpy.


lgritz commented Jan 3, 2024

Yeah, kicking myself for not remembering how many of these situations we've been through before, where we must use a bit of a baroque construct because the straightforward one that works fine for scalar code turns out to cause loops to fail to auto-vectorize.

IIRC, the only current important OSL user of batch + icc is Pixar, and Stephen told me not too long ago that they were phasing out icc. We can ask him for confirmation, but I bet it might already be safe not to care about icc code generation, at least for future versions (master/main) of OIIO & OSL.

@AlexMWells (Collaborator)

@lgritz, not everyone moves to the latest compiler versions quickly; it would be good to give a year or two of overlap for the entire ecosystem to support icx properly. And some features, like the vectorizer optimization report, are a bit more mature in icc, so even if we don't release builds with icc, it's still of use during development.


lgritz commented Jan 3, 2024

I see, ok, let me revise this PR and see what you think.

Use gcc __builtin_bit_cast when available.

Get rid of specializations -- they are not needed, as verified by
godbolt.

Signed-off-by: Larry Gritz <[email protected]>

lgritz commented Jan 3, 2024

Revision, after realizing that the memcpy trick + icc will fail to
auto-vectorize loops containing it:

  • Restore the intrinsics, just for icc, just for the float<->int/uint
    varieties (still axe the ones involving doubles).

  • Remove the __builtin_bit_cast clauses; they seem to provide no benefit
    for the compilers that support it (icc doesn't, and that's the one
    that seems to need the extra hints).

  • Add comments explaining why this is all the case and also reminding
    us NOT to switch to C++20 std::bit_cast in the future without also
    testing whether using it prevents auto-vectorization.

@AlexMWells (Collaborator)

@lgritz, thanks for adding that ICC support back in.

Note on terms: "auto-vectorization" is when a compiler decides to vectorize a loop on its own. My example used "explicit" SIMD vectorization, where the programmer tells the compiler "it's safe to perform the operations in this loop with SIMD parallelism" via a #pragma omp simd. And ICC could still vectorize with memcpy, but produced really sub-optimal results.

Also, the PR description says "Use gcc __builtin_bit_cast when available," but it no longer actually does that? I guess the net result is that previously more compilers took the intrinsic-based code path, and now all but ICC will take the memcpy approach and rely on compiler transformations occurring before vectorization.


lgritz commented Jan 3, 2024

Aside: Stephen says his place has made the transition to icx.

But I'm inclined to cater to Alex's desire to preserve icc compatibility for his own use, if he finds that it provides him with better tools to analyze how to vectorize code than does icx, even if icx is the preferred deployment compiler.

@ThiagoIze ThiagoIze left a comment


This looks good to me.


lgritz commented Jan 5, 2024

@AlexMWells I fixed the auto-vectorize language, thanks for the correction. I will fix up the details of the commit message when I do the merge.

Revision, after realizing that the memcpy trick + icc will fail to
vectorize loops containing it:

* Restore the intrinsics, just for icc, just for the float<->int/uint
  varieties (still axe the ones involving doubles).

* Remove the __builtin_bit_cast clauses; they seem to provide no benefit
  for the compilers that support it (icc doesn't, and that's the one
  that seems to need the extra hints).

* Add comments explaining why this is all the case and also reminding
  us NOT to switch to C++20 std::bit_cast in the future without also
  testing whether using it prevents auto-vectorization.

Signed-off-by: Larry Gritz <[email protected]>
@ThiagoIze ThiagoIze left a comment


Looks good to me. I'll let @AlexMWells chime in on whether he's ok with dropping the doubles.


lgritz commented Jan 5, 2024

OSL doesn't use bitcast with double anywhere.

@AlexMWells (Collaborator)

Looks good. I would have left the doubles in for completeness, but as Larry mentioned they're unused by OSL (and hopefully there are enough breadcrumbs here if someone does stumble into it).

@lgritz lgritz merged commit d3db2f5 into AcademySoftwareFoundation:master Jan 5, 2024
25 checks passed
@lgritz lgritz deleted the lg-bitcast branch January 5, 2024 23:42
1div0 pushed a commit to 1div0/OpenImageIO that referenced this pull request Feb 24, 2024
Verified that the memcpy trick works fine (even when used in vectorized
loops) for gcc, clang, MSVC, and icx. So change the specialized versions
using intrinsics to be used only for icc, which fails to vectorize loops
when the memcpy trick is used.

Also get rid of unused double specializations.

---------

Signed-off-by: Larry Gritz <[email protected]>
Signed-off-by: Peter Kovář <[email protected]>