-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tapchannel - [bug]: potential deadlock when signing second level HTLCs #1114
Comments
Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706 It's gated on lack of aux signer there, instead of blob and aux signer. |
Another hunch here is that we may be missing proper cancel/quit channels in this context. Which is then preventing a proper shutdown of a channel. Another relevant part I saw in the goroutine stack dump when trimming it a bit:
Full goroutine trace is here: https://0bin.net/paste/H8EBWoac#1FnM7HsS3kPWLeRlPisbgVPeU1fux+-HeTCA9eytFgs |
Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74 but then don't select on it for the send in |
the receive should have a sigjob? Is the use it uniform (quit time out flow) |
@jharveyb will provide a context update |
I think if the Blob is taproot-assets/tapchannel/aux_leaf_signer.go Line 269 in 881ecaf
I agree that this seems like the most likely root cause. I've been comparing the behavior for aux sigs against what happens for htlc sigs. AFAICT, the htlcsig jobs are made in the same order as the aux sigs, and then submitted for processing here: Those jobs are processed here: Where we do skip a job if the cancel channel was closed, and skip all jobs if Jobs submission also exits early on cancel or quit: It seems like jobs would be cancelled by the submitter, in case of other errors: And that cancel channel is shared amongst all jobs. For the aux sigs, the flow is different in a few ways. Firstly, the job submission func returns immediately: Within the job processing goroutine, we only check for taproot-assets/tapchannel/aux_leaf_signer.go Line 257 in 881ecaf
And we never check the job's cancel channel. So updating the tapd side to catch those signals seems like a good first fix. I think the proper handling of Another bit I'm confused about is the sorting of sig jobs. AFAICT they are generated in order of (incoming HTLCs, outbound HTLCs), submitted to other goroutines, sorted in-place by output index, and then waited on. I think the sorting after submission could cause extra problems with signal handling. Example:
I think adding the cancel and quit handling on both sides would address this, but I'm confused as to why we don't sort the jobs right after generation so that the wait order matches the processing order. |
This is great debugging! |
Ok, following on from review in #1118 :
My current change is a working fix, but I think there are some better options:
To pursue that second option, we'll need a follow-up PR in I'll add these changes to lightningnetwork/lnd#9074 and then #1118 . |
@jharveyb looking at this more closely, I don't think we need either of those solutions, as the validator is never expected to close that channel. See lightningnetwork/lnd#9074 (comment). |
Writing back offline discussion: Sync'd with Oli, and I agree that we can solve this by:
Then for
Will update the existing PRs in-place. |
Given the above, do we have an answer to this question: #1118 (review)? |
Also re context cancel, looks to be thread safe: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/context/context.go;l=665-677 |
From the stack trace, is it certain that |
Background
A user has noticed that occasionally the daemon may freeze up when signing second level HTLCs.
We we're able to obtain a trace to confirm that the daemon was indeed deadlocked, with
lnd
waiting for a channel response fromtapd
, whiletapd
is attempting to send a response tolnd
:Here's a diagram from Clade based on the trace above:
From the above, we can see that we have circular waiting dependancy.
As is, the channel created for the transfer of sigs is always buffered: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L58-L76
We're then blocking here in
tapd
:taproot-assets/tapchannel/aux_leaf_signer.go
Lines 265 to 273 in 0551a3f
One thing to note is that in
tapd
, this is the send when there's no aux blob for a channel.Expected behavior
Buffered channel send never blocks.
Actual behavior
Buffered channel send blocks. Potentially there's some underlying mutation here.
The text was updated successfully, but these errors were encountered: