
OIIO::bitcast adjustments #4101

Merged (2 commits) Jan 5, 2024

Conversation


@lgritz lgritz commented Dec 31, 2023

Use gcc __builtin_bit_cast when available.

Get rid of specializations -- they are not needed, as verified by godbolt.

@ThiagoIze ThiagoIze left a comment


Looking at the generated assembly, gcc, clang, and msvc all generate optimal code when the result of the cast is actually used, so I'm all for this code simplification.

More info: this report says that msvc will be worse after this PR, but that seems to be the case only when the result of the cast is unused. Once it's used to do something, msvc produced optimal code (verified only for the float->int cast): https://godbolt.org/z/rP4Kvrs6o

Followup question: if the memcpy is already optimal (let's assume it is unless we have proof otherwise), why still have two different code paths? If we don't have examples of __builtin_bit_cast being better, should we get rid of that and only do the memcpy? I worry that supporting alternative code paths could one day result in one of the paths developing a regression that we don't notice as easily.


lgritz commented Jan 1, 2024

I'm assuming that eventually, this will all be replaced by std::bit_cast when C++20 is our floor.

I did think about what you mention -- looking at godbolt, even the memcpy appears to optimize away completely (for the simple examples I tested), so maybe that's the one solution we need and the extra complexity isn't worth it? I guess I was figuring that __builtin_bit_cast would be the best the gcc/clang based compilers could do, possibly employing tricks that would generate better code in more complex situations than I could easily test with godbolt. The philosophy I was following was "if the compiler provides a builtin for this specific purpose, use it, and if not, fall back on the memcpy trick." There are two code paths, I guess, but not for the same compiler?

@ThiagoIze (Collaborator)

My light understanding of the builtins is that clang/gcc will replace the memcpy call with their builtin inline memcpy if the size is known at compile time (yes in our case) and the size isn't too large (also yes for us). That explains the inlining and why we don't see a function call to memcpy. I haven't looked at the code, but it wouldn't surprise me if the builtin bit_cast is implemented with a memcpy under the hood just so there are fewer code paths to maintain, as suggested by this discussion.

I've been bitten enough times to be wary of optimizing code if there's no measured benefit: I suspect the two code paths at best will offer no benefit and if they do in fact diverge in generated assembly, I would expect the newer bit_cast version to be more likely to contain bugs, at least in the near-term, than the older and more used memcpy.

Switching to C++20's bit_cast is cleaner, but unless it gives some measured improvement (maybe with constexpr?), it's once again probably not worth doing until we raise our minimum C++ version to 20 and can then outright replace OIIO::bitcast with std::bit_cast.


lgritz commented Jan 2, 2024

To be clear, this isn't much of a "two code paths" case -- it always uses __builtin_bit_cast for the compilers where it exists, and falls back on a hokey memcpy where it doesn't (where we've checked, this doesn't seem to be a perf problem). Using std::bit_cast (when it's available) could also vary in its exact implementation from platform to platform, though I bet clang/gcc will simply define it to call __builtin_bit_cast.

I'm applying the following heuristic, roughly speaking: prefer std where it exists and is performant; where not, prefer an idiom that directly expresses what we want (in this case, __builtin_bit_cast) when available, rather than clumsily making a multi-statement work-alike; if direct idioms only exist on some platforms, #if them when available and fall back on the clunkier one when not, unless doing so introduces per-platform variability of results or is a readability/maintenance nightmare.

I think your concern is that last part: that it's just not worth the #if complexity, since we've verified that memcpy seems not to carry a performance penalty. Whereas I'm saying that this is a first step toward eventually just using std::bit_cast, and thought it might be worth the bet that __builtin_bit_cast has a higher chance than the memcpy trick of always being optimized under some future circumstance.

I can remove the __builtin_bit_cast if you think it just adds unnecessary complexity here, and we can stand on that until the minimum is C++20 and we can rely on calling std::bit_cast directly.

@ThiagoIze (Collaborator)

Yes, I do worry this is just introducing unneeded complexity with no measured benefit. I would go with the further simplification of just doing memcpy until we replace this all with bit_cast. But the PR is already technically fine, it's simpler than before, and I already approved it, so I'm fine with it being merged as is if you want. I'm merely trying to point out some things to consider but in the end I don't think it's a big deal either way.

@AlexMWells (Collaborator)

Please hold on a second. I wanted to revisit why these cast conditional compilations were added. I believe they came from SIMD optimizations done in Open Shading Language.

Please look at the codegen difference between a simd loop using the intrinsics based bitcast vs the memcpy based one here:
https://godbolt.org/z/EaaPhrqEq

5 instructions (bitcast based) vs 226 instructions (memcpy based)!

Perhaps in scalar mode compilers might transform that memcpy into something else. But if that transformation doesn't happen, the memcpy takes addresses, which disqualifies the values from the SROA (Scalar Replacement of Aggregates) optimization pass, meaning they must exist on the stack. Existing on the stack disqualifies them from being kept in packed SIMD registers, so the compiler emits lots of code to pull data out of SIMD registers onto the stack, run the memcpy in scalar, and then put the results back into SIMD registers.

@AlexMWells (Collaborator)

ICX 2024.0 does OK with it, though: 5 instructions whether using the intrinsic, pun_bitcast, or memcpy.
https://godbolt.org/z/8nozPb38d
And trying gcc 11 and clang 10+, they look OK as well.

Could we just get the ICC conditional compilation added back into this PR? Relying on the memcpy version really does bugger the vectorization.

@ThiagoIze (Collaborator)

Interesting, so the worry is that a consumer of OIIO (maybe OSL?) might be using OIIO::bitcast in their own code, compiled with icc and making use of openmp-simd autovectorization? If so, that would indeed be a bad performance regression.

Do we still want to support ICC? When you compile with it (see the compiler message in the godbolt link Alex posted) it'll say:

icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.

If we do want to still support this use case, I would suggest wrapping that code in a #if icc (whatever the define for that is) along with a comment saying that this is required for that compiler when used with openmp simd generation, and possibly a link to this PR.

Otherwise, it's nice to hear that all the other compilers are still doing the right thing with just memcpy.


lgritz commented Jan 3, 2024

Yeah, kicking myself for not remembering how many of these situations we've been through before, where we must use a bit of a baroque construct because the straightforward one that works fine for scalar code turns out to cause loops to fail to auto-vectorize.

IIRC, the only current important OSL user of batch + icc is Pixar, and Stephen told me not too long ago that they were phasing out icc. We can ask him for confirmation, but I bet it might already be safe not to care about icc code generation, at least for future versions (master/main) of OIIO & OSL.

@AlexMWells (Collaborator)

@lgritz, not everyone moves to the latest compiler versions quickly; it would be good to give a year or two of overlap for the entire ecosystem to support icx properly. And some features, like the vectorizer optimization report, are a bit more mature in icc, so even if we don't release builds with icc, it's still of use during development.


lgritz commented Jan 3, 2024

I see, ok, let me revise this PR and see what you think.

Use gcc __builtin_bit_cast when available.

Get rid of specializations -- they are not needed, as verified by
godbolt.

Signed-off-by: Larry Gritz <[email protected]>

lgritz commented Jan 3, 2024

Revision, after realizing that the memcpy trick + icc will fail to
auto-vectorize loops containing it:

  • Restore the intrinsics, just for icc, just for the float<->int/uint
    varieties (still axe the ones involving doubles).

  • Remove the __builtin_bit_cast clauses; they seem to provide no benefit
    for the compilers that support it (icc doesn't, and that's the one
    that seems to need the extra hints).

  • Add comments explaining why this is all the case and also reminding
    us NOT to switch to C++20 std::bit_cast in the future without also
    testing whether using it prevents auto-vectorization.

@AlexMWells (Collaborator)

@lgritz, thanks for adding that ICC support back in.

Note on terms: "auto-vectorization" is when a compiler decides to vectorize a loop on its own. My example used "explicit" SIMD vectorization, where the programmer tells the compiler "it's safe to perform the operations in this loop with SIMD parallelism" via a #pragma omp simd. And ICC could still vectorize with memcpy, but produced really sub-optimal results.

Also, the PR description says "Use gcc __builtin_bit_cast when available," but it no longer actually does that? I guess the net result is that previously more compilers took the intrinsic-based code path, and now all but ICC will take the memcpy approach and rely on compiler transformations occurring before vectorization.


lgritz commented Jan 3, 2024

Aside: Stephen says his place has made the transition to icx.

But I'm inclined to cater to Alex's desire to preserve icc compatibility for his own use, if he finds that it provides him with better tools to analyze how to vectorize code than does icx, even if icx is the preferred deployment compiler.

@ThiagoIze ThiagoIze left a comment


This looks good to me.


lgritz commented Jan 5, 2024

@AlexMWells I fixed the auto-vectorize language, thanks for the correction. I will fix up the details of the commit message when I do the merge.

Revision, after realizing that the memcpy trick + icc will fail to
vectorize loops containing it:

* Restore the intrinsics, just for icc, just for the float<->int/uint
  varieties (still axe the ones involving doubles).

* Remove the __builtin_bit_cast clauses; they seem to provide no benefit
  for the compilers that support it (icc doesn't, and that's the one
  that seems to need the extra hints).

* Add comments explaining why this is all the case and also reminding
  us NOT to switch to C++20 std::bit_cast in the future without also
  testing whether using it prevents auto-vectorization.

Signed-off-by: Larry Gritz <[email protected]>
@ThiagoIze ThiagoIze left a comment


Looks good to me. I'll let @AlexMWells chime in on whether he's ok with dropping the doubles.


lgritz commented Jan 5, 2024

OSL doesn't use bitcast with double anywhere.

@AlexMWells (Collaborator)

Looks good. I would have left the doubles in for completeness, but as Larry mentioned they're unused by OSL (and hopefully there are enough breadcrumbs here if someone does stumble into it).

@lgritz lgritz merged commit d3db2f5 into AcademySoftwareFoundation:master Jan 5, 2024
25 checks passed
@lgritz lgritz deleted the lg-bitcast branch January 5, 2024 23:42
1div0 pushed a commit to 1div0/OpenImageIO that referenced this pull request Feb 24, 2024
Verified that the memcpy trick works fine (even when used in vectorized
loops) for gcc, clang, MSVC, and icx. So change the specialized versions
using intrinsics to be used only for icc, which fails to vectorize loops
when the memcpy trick is used.

Also get rid of unused double specializations.

---------

Signed-off-by: Larry Gritz <[email protected]>
Signed-off-by: Peter Kovář <[email protected]>