Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LLD] [COFF] Zero-intialization & proper constructor invocation in COFF's Symbol #98447

Merged

Conversation

vg0204
Copy link
Contributor

@vg0204 vg0204 commented Jul 11, 2024

It happened due to lld's COFF linker multiple regression tests failure. It got reliably reproduced after the needed intialization of isUsedinRegularObject bit in the Symbol's ctor, but not handled at replaceSymbol API properly while creating a specific symbol to insert in symbol table.

So, now while creating the specific symbol using replaceSymbol, by explicitly setting the value of isUsedinRegularObject to newly created symbol around the ctor call of symbol would solve the regression failure

@llvmbot
Copy link
Collaborator

llvmbot commented Jul 11, 2024

@llvm/pr-subscribers-platform-windows

@llvm/pr-subscribers-lld-coff

Author: Vikash Gupta (vg0204)

Changes

It happened due to lld's COFF linker multiple regression tests failure. The bug found can be identified as a Heisenbug, which appears only in Release mode, not in Debug mode. It also vanishes when BUG is being tried to be outputed out or isolated. This behaviour cannot be seen at all in OS below Ubuntu24.

Eventually, the bug was about not handling exported symbols properly, when dealing with defining undefined symbols for COFF linker. Its specifically about improper forwarding of isUsedinRegObject boolean while invoking replaceSymbol() API, so hence explicitly doing it resolved issue.


Full diff: https://github.com/llvm/llvm-project/pull/98447.diff

1 Files Affected:

  • (modified) lld/COFF/Symbols.h (+2)
diff --git a/lld/COFF/Symbols.h b/lld/COFF/Symbols.h
index 5ef46f5af6a6c..d33989841c501 100644
--- a/lld/COFF/Symbols.h
+++ b/lld/COFF/Symbols.h
@@ -499,8 +499,10 @@ void replaceSymbol(Symbol *s, ArgT &&... arg) {
   assert(static_cast<Symbol *>(static_cast<T *>(nullptr)) == nullptr &&
          "Not a Symbol");
   bool canInline = s->canInline;
+  bool isUsedInRegularObj = s->isUsedInRegularObj;
   new (s) T(std::forward<ArgT>(arg)...);
   s->canInline = canInline;
+  s->isUsedInRegularObj = isUsedInRegularObj;
 }
 } // namespace coff
 

@tschuett tschuett requested a review from MaskRay July 11, 2024 09:04
@arsenm
Copy link
Contributor

arsenm commented Jul 11, 2024

Needs test?

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 11, 2024

I don't know exactly, as this patch solved some failing test cases(bug) when ran in Ubuntu 24 (for the first time in psdb pipeline), but it was not needed up until ubuntu 22.

@mstorsjo
Copy link
Member

FWIW, I can't reproduce such an issue with Ubuntu 24.04. I tested setting up Ubuntu 24.04, installed the build-essential packages (giving me GCC 13), built the latest git main of llvm-project (-DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=lld -DLLVM_ENABLE_TARGETS="X86;ARM;AArch64") and ran check-lld and it all passed just fine.

@MaskRay
Copy link
Member

MaskRay commented Jul 12, 2024

Request changes unless there is a reproduce and evidence that isUsedInRegularObj does resolve the issue.

I am more familiar with the ELF port where there used to be some issues related to isUsedInRegularObj, but the issues could all be reliably caught. I am not sure COFF port has Heisenbug since we don't perform parallel symbol resolution.

If you believe there is an issue isUsedInRegularObj, perhaps asan can expose the issue.

Copy link
Member

@MaskRay MaskRay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 13, 2024

Request changes unless there is a reproduce and evidence that isUsedInRegularObj does resolve the issue.

I am more familiar with the ELF port where there used to be some issues related to isUsedInRegularObj, but the issues could all be reliably caught. I am not sure COFF port has Heisenbug since we don't perform parallel symbol resolution.

If you believe there is an issue isUsedInRegularObj, perhaps asan can expose the issue.

Point noted!

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 13, 2024

FWIW, I can't reproduce such an issue with Ubuntu 24.04. I tested setting up Ubuntu 24.04, installed the build-essential packages (giving me GCC 13), built the latest git main of llvm-project (-DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=lld -DLLVM_ENABLE_TARGETS="X86;ARM;AArch64") and ran check-lld and it all passed just fine.

As I looked into it again, what are you saying is true, but I am getting this issue while trying to build in a private repo, which gets updated with main upstream on regular basis. Seeing how things procced there, it would better to decide if this patch is really needed or not.

@arsenm
Copy link
Contributor

arsenm commented Jul 15, 2024

As I looked into it again, what are you saying is true, but I am getting this issue while trying to build in a private repo, which gets updated with main upstream on regular basis. Seeing how things procced there, it would better to decide if this patch is really needed or not.

It would be easier to evaluate with a test. Any behavior change needs a test

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 15, 2024

Also. I have a query regarding the replaceSymbol() function used for other linker like wasm or MachO. All of them used the reference forwarding, then why they need explicit re-assignment of boolean variables. If so, shouldn't it be uniformly done for all such variables in every version of replaceSymbol(). If anyone could comment on this.

@rnk
Copy link
Collaborator

rnk commented Jul 15, 2024

I think there's a real bug here.

It seems extremely sketchy that we don't initialize isUsedInRegularObj in the Symbol constructor, but that must be intentional. I'm pretty sure the compiler is entitled to zero initialize the whole object during the constructor, and that may be what @vg0204 is observing with some new GCC version on Ubuntu 24. And, with that toolchain, the bug is reliably observable in the regression test suite.

If we zero initialize the member in all builds, does that cause tests to fail, and then does this patch fix it? I think that's the next thing to try here.

@rnk
Copy link
Collaborator

rnk commented Jul 15, 2024

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 16, 2024

I think there's a real bug here.

It seems extremely sketchy that we don't initialize isUsedInRegularObj in the Symbol constructor, but that must be intentional. I'm pretty sure the compiler is entitled to zero initialize the whole object during the constructor, and that may be what @vg0204 is observing with some new GCC version on Ubuntu 24. And, with that toolchain, the bug is reliably observable in the regression test suite.

If we zero initialize the member in all builds, does that cause tests to fail, and then does this patch fix it? I think that's the next thing to try here.

Will do that right away!

rnk added a commit to rnk/llvm-project that referenced this pull request Jul 16, 2024
Before this change, SymbolTable::insert set this boolean to a meaningful
value before calling the appropriate Symbol constructor via
replaceSymbol. I believe this is UB, it is probably valid for a compiler
to zero out a bitfield in the constructor, rather than carefully
preserving the single bit that existed prevoiusly.

After this change, the constructor does the obvious thing, which is to
zero initialize this field, and we explicitly copy the bitfield around
the constructor call in replaceSymbol. This should be an alternative
solution to the problems described in llvm#98447 .
@vg0204
Copy link
Contributor Author

vg0204 commented Jul 18, 2024

As mentioned in your commit, it really solves the problem. The regression failure can be reliably observed when isUsedDefinedInRegObj bit is zero-intialized in all builds (Ubuntu24 & below) and also identical in nature. Now, if this patch is applied on top of the zero intialization, the issue gets resolved for all builds. Exactly both of this is being done in your mentioned commit, thus exposing & solving the problem uniformly across different Ubuntu builds.

@vg0204 vg0204 changed the title [LLD] [Build Issue] Ubuntu24 build failure due to LLD's COFF regression [LLD] [COFF] Ubuntu24 build failure due to LLD's COFF regression Jul 18, 2024
@vg0204
Copy link
Contributor Author

vg0204 commented Jul 19, 2024

PING!

@mstorsjo
Copy link
Member

I also reproduced, that if we as zeroing of this variable in the constructor, the issue appears, and this change fixes it.

But I would prefer if we’d add such zeroing in the constructor, to get consistent and reliable behavior across all toolchains. I’m not sure that it’s deliberately omitted, it seems like an oversight to me.

@rnk
Copy link
Collaborator

rnk commented Jul 19, 2024

Yeah, I put together that patch locally, but I wanted to give @vg0204 the opportunity to add the zero initialization so they get credit for the fix. :)

So, yes, that is the next step, add isUsedByRegularObj(false) to the ctor, and then we can merge it.

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 22, 2024

Appreciate that @rnk, already made the necessary changes locally as well while testing out your suggestion on my toolchain of ubuntu24 where this issues had been intially observed. So, would push it up the changes soon, just after testing all regression suite one!

Copy link
Member

@mstorsjo mstorsjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

The PR description/subject should probably be updated a bit before landing this though, now that the issue is better understood. (Although whoever merges it probably also can touch up the commit message.)

@vg0204
Copy link
Contributor Author

vg0204 commented Jul 22, 2024

Will do the needful, before the merge!

@vg0204 vg0204 changed the title [LLD] [COFF] Ubuntu24 build failure due to LLD's COFF regression [LLD] [COFF] Zero-intialization & proper constructor invocation in COFF's Symbol Jul 22, 2024
…ymbol class

It happened due to lld's COFF linker multiple regression tests failure.
It got reliably reproduced after the needed intialization of
isUsedinRegularObject bit in the Symbol's ctor, but not handled at
replaceSymbol API properly while creating a specific symbol to insert in
symbol table.

Eventually, the issue was about not handling exported symbols properly,
when dealing with defining undefined symbols for COFF linker. It got
resolved by explicit setting up the isUsedinRegularObject bit around the
ctor invocation of symbol class in replaceSymbol API.
@vg0204 vg0204 force-pushed the vg0204/build-failure-ubuntu24-COFF-linker-regression branch from 0a3e465 to 205dbe7 Compare July 22, 2024 09:57
@vg0204 vg0204 merged commit 45c0dec into llvm:main Jul 22, 2024
7 checks passed
sgundapa pushed a commit to sgundapa/upstream_effort that referenced this pull request Jul 23, 2024
…FF's Symbol (llvm#98447)

It happened due to lld's COFF linker multiple regression tests failure.
It got reliably reproduced after the needed intialization of
isUsedinRegularObject bit in the Symbol's ctor, but not handled at
replaceSymbol API properly while creating a specific symbol to insert in
symbol table.

So, now while creating the specific symbol using replaceSymbol, by
explicitly setting the value of isUsedinRegularObject to newly created
symbol around the ctor call of symbol would solve the regression failure
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
…FF's Symbol (#98447)

It happened due to lld's COFF linker multiple regression tests failure.
It got reliably reproduced after the needed intialization of
isUsedinRegularObject bit in the Symbol's ctor, but not handled at
replaceSymbol API properly while creating a specific symbol to insert in
symbol table.

So, now while creating the specific symbol using replaceSymbol, by
explicitly setting the value of isUsedinRegularObject to newly created
symbol around the ctor call of symbol would solve the regression failure
@vg0204 vg0204 deleted the vg0204/build-failure-ubuntu24-COFF-linker-regression branch August 6, 2024 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants