Clang produces individual memclr calls when setting large data structures #62813
Comments
Updating with Richard Smith’s response:
Can confirm using a constexpr variable allows clang to clear the whole struct then set individual non-zero values to members. |
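For readers who want to try the workaround, here is a minimal sketch of the pattern. The struct, its members, and the names `ChangedState`, `kResetState`, and `ResetChangedState` are hypothetical stand-ins, since the reporter's actual definitions are not reproduced in this thread. Whether a given Clang version and target emits one bulk clear plus a few stores for this exact code has not been verified here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the large, mostly-zero struct in the report.
struct ChangedState {
  std::uint32_t counters[64];
  std::uint32_t version;
  bool dirty;
};

// Assumption: the reset value lives in a constexpr variable, so the compiler
// sees a constant aggregate rather than a temporary built member by member.
constexpr ChangedState kResetState = {{}, 3, false};

// Reset by assigning from the constexpr value.
void ResetChangedState(ChangedState &state) {
  state = kResetState;
}
```

Comparing the `-Oz` output of this function against a version that assigns from a braced temporary should show whether the workaround applies to a particular toolchain.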
It's probably still worth tracking this as a general optimization, even if you've adopted a workaround for your specific usage. |
Ok did some digging and I think I have a little more idea of what's going on. Here I'm just looking at one member:
So when using an initializer via the macro, the IR clang emits is an
With the default opt pipeline, SROA will transform it into something like this:
which eventually gets transformed to
Then this will get lowered to what we see in the original bug description:
It looks like SROA is just doing its job of splitting the larger This is contrary to the IR using
Here, clang emits the same
which eventually gets transformed to
Not an SROA expert, but it looks to me that the pass isn't doing anything wrong here since it should be perfectly legal to split the alloca up. I see these potential ways to address this:
|
Not sure where the memset(0) over the whole struct is coming from... I'd have to look a bit more. But I'm not sure if clang generates a memset like that in all relevant cases.
There's a pass ordering issue here; if you run memcpyopt, the extra alloca that's still hanging around after -O2. Maybe this would be better if SROA had some handling for memset.
Probably clang could be a bit smarter in a few ways. We have some code to try to emit a copy from a global constant in AggExprEmitter::EmitArrayInit, but I don't think we do something equivalent for structs. And we could maybe try to optimize structs that contain a lot of zeros to avoid storing the zeros one-by-one. And maybe if we wanted to be really clever, we could avoid the temporary alloca altogether, but it's a little tricky given C++ semantics require the temporary in the general case. |
When compiling the following
with
Clang produces a GlobalStateManager::ResetChangedState that's over 800 bytes in size. The function is composed of individual calls to __aeabi_memclr in the form:
for every member in the structs that is zero. GCC however produces something significantly smaller:
Since very few members get initialized to non-zero values, GCC instead memsets both members then sets individual members to non-zero values. At -Oz, it would be nice if Clang was able to produce something like this, since just replacing the Clang definition with the GCC one saves ~750 bytes of text just for this one function.
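The strategy attributed to GCC above can be written out by hand as a rough illustration. This is a sketch under assumptions: `PortState` and both reset functions are hypothetical names, not the reporter's code, and the struct is kept trivially copyable so that `memset` is well-defined on it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical mostly-zero struct; the real one in the report is much larger.
struct PortState {
  std::uint32_t counters[64];
  std::uint32_t version;
  bool enabled;
};

static_assert(std::is_trivially_copyable<PortState>::value,
              "memset is only safe on trivially copyable types");

// Clear the whole object once, then store only the non-zero members.
void ResetByMemset(PortState &state) {
  std::memset(&state, 0, sizeof(state));
  state.version = 3;
}

// Same result expressed as an aggregate initializer, which is the form the
// optimizer would ideally lower to the memset-then-store sequence.
void ResetByInitializer(PortState &state) {
  state = PortState{{}, 3, false};
}
```

Both functions leave every member with the same value, so the difference is purely in how much code the compiler emits for each.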