
tapchannel - [bug]: potential deadlock when signing second level HTLCs #1114

Closed
Roasbeef opened this issue Aug 29, 2024 · 13 comments · Fixed by #1118
@Roasbeef
Member

Background

A user has noticed that occasionally the daemon may freeze up when signing second level HTLCs.

We were able to obtain a trace confirming that the daemon was indeed deadlocked: lnd is waiting for a channel response from tapd, while tapd is blocked attempting to send a response to lnd:

goroutine 2042 [chan receive, 44 minutes]:
github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).SignNextCommitment(0xc006d40360)
	github.com/lightningnetwork/[email protected]/lnwallet/channel.go:4706 +0x124c
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTx(0xc00280ec08)
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:2534 +0x6a
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTxOrFail(0xc00280ec08)
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:2485 +0x1f
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).handleUpstreamMsg(0xc00280ec08, {0x389c7b0, 0xc00faed740})
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:2367 +0xa96
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager(0xc00280ec08)
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:1415 +0x1596
created by github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).Start in goroutine 1680
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:541 +0x47d
goroutine 1324695 [chan send, 44 minutes]:
github.com/lightninglabs/taproot-assets/tapchannel.(*AuxLeafSigner).processAuxSigBatch(0xc005532930, 0xc0049c1008, 0xc00583e800, {0xc0178b0008, 0x2, 0x2?})
	github.com/lightninglabs/[email protected]/tapchannel/aux_leaf_signer.go:269 +0x1eb
created by github.com/lightninglabs/taproot-assets/tapchannel.(*AuxLeafSigner).SubmitSecondLevelSigBatch in goroutine 2042
	github.com/lightninglabs/[email protected]/tapchannel/aux_leaf_signer.go:104 +0xcf

Here's a diagram from Claude based on the trace above:

graph TD
    subgraph "goroutine 2042 (lnd)"
        E[htlcManager]
        D[handleUpstreamMsg]
        C[updateCommitTxOrFail]
        B[updateCommitTx]
        A[SignNextCommitment]
        
        E --> D
        D --> C
        C --> B
        B -->|"chan receive (44 min)"| A
    end

    subgraph "goroutine 1324695 (taproot-assets)"
        G[SubmitSecondLevelSigBatch]
        F[processAuxSigBatch]
        
        G -->|"chan send (44 min)"| F
    end

    G -->|Creates| F
    G -->|Waiting for| A
    E -->|Waiting for| G

    classDef waiting fill:#FF9999,stroke:#333,stroke-width:2px,color:#000;
    classDef normal fill:#D9EAD3,stroke:#333,stroke-width:1px,color:#000;
    class A,F waiting;
    class B,C,D,E,G normal;

From the above, we can see that we have a circular waiting dependency.

As is, the channel created for the transfer of sigs is always buffered: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L58-L76

We're then blocking here in tapd:

// If there is no commit blob, this isn't a custom channel. We
// still need to signal the job as done though, even if we don't
// have a signature to return.
if sigJob.CommitBlob.IsNone() {
	sigJob.Resp <- lnwallet.AuxSigJobResp{
		HtlcIndex: sigJob.HTLC.HtlcIndex,
	}
	continue
}

One thing to note is that in tapd, this is the send that happens when there's no aux blob for a channel.

Expected behavior

Buffered channel send never blocks.

Actual behavior

Buffered channel send blocks. Potentially there's some underlying mutation here.
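
For reference, a send on a buffered channel only blocks once its buffer is full. A minimal standalone illustration (not project code) of the only way a "buffered" send can hang:

package main

import "fmt"

func main() {
	// Capacity 1: the first send completes without any receiver.
	ch := make(chan int, 1)
	ch <- 1
	fmt.Println("first send done")

	// A second send would block until something drains the buffer;
	// that is the only way a send on a buffered channel can hang.
	select {
	case ch <- 2:
		fmt.Println("second send done")
	default:
		fmt.Println("second send would block: buffer is full")
	}
}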

@Roasbeef
Member Author

Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706

It's gated on the lack of an aux signer there, instead of on both the blob and the aux signer.
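
For illustration, a rough standalone sketch of the suggested gating; option is a tiny stand-in for lnd's fn.Option, and the names auxSigner and commitBlob are hypothetical, not the exact variables in channel.go:

package main

import "fmt"

// option is a minimal stand-in for an optional value, just enough to
// show the gating condition.
type option[T any] struct{ ok bool }

func (o option[T]) IsSome() bool { return o.ok }

func main() {
	// Hypothetical state for a plain (non-custom) channel: an aux
	// signer is configured globally, but this channel has no blob.
	auxSigner := option[string]{ok: true}
	commitBlob := option[[]byte]{ok: false}

	// The current gating dispatches aux sig jobs whenever an aux
	// signer exists; the suggestion is to also require the channel's
	// commitment blob.
	if auxSigner.IsSome() && commitBlob.IsSome() {
		fmt.Println("dispatch aux sig jobs to tapd")
	} else {
		fmt.Println("skip aux sig jobs: not a custom channel")
	}
}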

@Roasbeef
Member Author

Another hunch here is that we may be missing proper cancel/quit channels in this context, which is then preventing a proper shutdown of the channel.

Another relevant part I saw in the goroutine stack dump when trimming it a bit:

goroutine 1555 [semacquire, 1055 minutes]:
sync.runtime_Semacquire(0xc0053ad098?)
	runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc00161c3c0?)
	sync/waitgroup.go:116 +0x48
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Stop(0x20324a0?)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:837 +0x195
github.com/lightningnetwork/lnd/contractcourt.(*ChainArbitrator).ResolveContract(0xc000370d88, {{0xea, 0x94, 0x46, 0x3f, 0x1d, 0xd2, 0x74, 0xbb, 0x99, ...}, ...})
	github.com/lightningnetwork/[email protected]/contractcourt/chain_arbitrator.go:527 +0x21c
github.com/lightningnetwork/lnd/contractcourt.newActiveChannelArbitrator.func5()
	github.com/lightningnetwork/[email protected]/contractcourt/chain_arbitrator.go:463 +0x78
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).stateStep(0xc002a97888, 0xd1ac8, 0x4, 0x0?)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:1290 +0x1129
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).advanceState(0xc002a97888, 0xd1ac8, 0x4, 0x0)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:1615 +0x165
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).channelAttendant(0xc002a97888, 0xd1a96)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:2855 +0x5ac
created by github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Start in goroutine 507
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:570 +0x66e

Full goroutine trace is here: https://0bin.net/paste/H8EBWoac#1FnM7HsS3kPWLeRlPisbgVPeU1fux+-HeTCA9eytFgs

@Roasbeef
Member Author

Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74

but then we don't select on it for the send in tapd, nor for the receive in lnd.
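
As a sketch of what catching those signals could look like, here is a minimal standalone example; auxJob, auxResp, and respondOrBail are stand-ins, and the assumption that the job carries a Cancel channel and the signer a quit channel mirrors the discussion above rather than the exact lnd/tapd types:

package main

import (
	"fmt"
	"time"
)

// auxResp and auxJob are stand-ins for lnwallet.AuxSigJobResp and
// lnwallet.AuxSigJob, reduced to the fields the pattern needs.
type auxResp struct {
	HtlcIndex uint64
}

type auxJob struct {
	HtlcIndex uint64
	Resp      chan auxResp
	Cancel    chan struct{}
}

// respondOrBail sends the response, but gives up if the job is
// cancelled or the signer is shutting down, so a stuck receiver can
// never wedge this goroutine forever.
func respondOrBail(job auxJob, quit chan struct{}) bool {
	select {
	case job.Resp <- auxResp{HtlcIndex: job.HtlcIndex}:
		return true

	case <-job.Cancel:
		// The submitter cancelled the batch.
		return false

	case <-quit:
		// The signer itself is shutting down.
		return false
	}
}

func main() {
	quit := make(chan struct{})
	job := auxJob{
		HtlcIndex: 7,
		// Unbuffered on purpose: nobody is receiving, so a bare
		// send here would block forever.
		Resp:   make(chan auxResp),
		Cancel: make(chan struct{}),
	}

	// Simulate a shutdown arriving while the response is pending.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(quit)
	}()

	fmt.Println("responded:", respondOrBail(job, quit))
}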

@dstadulis dstadulis added this to the v0.4.2 milestone Aug 29, 2024
@Roasbeef Roasbeef removed their assignment Sep 3, 2024
@dstadulis
Collaborator

The receive should have a sig job?

Is the usage uniform (quit/timeout flow)?

@dstadulis
Collaborator

@jharveyb will provide a context update

@jharveyb
Collaborator

jharveyb commented Sep 5, 2024

Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706

It's gated on the lack of an aux signer there, instead of on both the blob and the aux signer.

I think if the Blob is None, the tapd behavior is safe; the sigjob resp would have a None SigBlob, so later on when the sigs are sent back to tapd to be packed into a tlv Blob, the signature would be skipped. Though that is also the point where we are blocking:

sigJob.Resp <- lnwallet.AuxSigJobResp{

Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74

but then we don't select on it for the send in tapd, nor for the receive in lnd.

I agree that this seems like the most likely root cause. I've been comparing the behavior for aux sigs against what happens for htlc sigs. AFAICT, the htlcsig jobs are made in the same order as the aux sigs, and then submitted for processing here:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/channel.go#L4625

Those jobs are processed here:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/sigpool.go#L191

Where we do skip a job if the cancel channel was closed, and skip all jobs if quit is received by that worker (not sure where that signal would come from).

Job submission also exits early on cancel or quit:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/sigpool.go#L278

It seems like jobs would be cancelled by the submitter, in case of other errors:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/channel.go#L4651

And that cancel channel is shared amongst all jobs.

For the aux sigs, the flow is different in a few ways. Firstly, the job submission func returns immediately:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/channel.go#L4628

Within the job processing goroutine, we only check for Quit once, instead of at every point where we try to send on a channel, and we never check the job's cancel channel. So updating the tapd side to catch those signals seems like a good first fix.

I think the proper handling of Quit should be unit testable from the tapd repo.

Another bit I'm confused about is the sorting of sig jobs. AFAICT they are generated in order of (incoming HTLCs, outbound HTLCs), submitted to other goroutines, sorted in-place by output index, and then waited on.

I think the sorting after submission could cause extra problems with signal handling. Example:

  • There are 100 jobs, and the first job in auxSigBatch will be the last job in that slice after BIP 69 sorting.
  • tapd receives a Quit signal while processing that first job; a cancel signal is sent, and a sigjob resp with an error.
  • lnd is now waiting on a response for a job that will never get processed and hangs. The error sent by tapd will also never be received.

I think adding the cancel and quit handling on both sides would address this, but I'm confused as to why we don't sort the jobs right after generation so that the wait order matches the processing order.
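
To make the ordering concern concrete, here is a standalone sketch of sorting the jobs before submission so that the wait order matches the processing order; sigJob is a reduced stand-in, not the real lnwallet type:

package main

import (
	"fmt"
	"sort"
)

// sigJob is a stand-in for lnwallet.AuxSigJob, reduced to the fields
// relevant to the ordering question.
type sigJob struct {
	OutputIndex int
	Resp        chan struct{}
}

func main() {
	// Jobs are generated in (incoming, outgoing) HTLC order...
	jobs := []sigJob{
		{OutputIndex: 7, Resp: make(chan struct{}, 1)},
		{OutputIndex: 2, Resp: make(chan struct{}, 1)},
		{OutputIndex: 5, Resp: make(chan struct{}, 1)},
	}

	// ...but here we sort by output index *before* the slice is
	// handed to the worker, so the order we wait on responses
	// matches the order the worker processes jobs.
	sort.Slice(jobs, func(i, j int) bool {
		return jobs[i].OutputIndex < jobs[j].OutputIndex
	})

	// Worker: processes jobs in slice order.
	go func() {
		for _, j := range jobs {
			j.Resp <- struct{}{}
		}
	}()

	// Submitter: waits in the same slice order, so a failure on job
	// i is observed before we block waiting on job i+1.
	for _, j := range jobs {
		<-j.Resp
		fmt.Println("got response for output", j.OutputIndex)
	}
}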

@dstadulis
Collaborator

This is great debugging!
Should we resolve the job sorting question next?

@jharveyb
Collaborator

OK, following on from the review in #1118:

  • The cancel signal can now be sent from both tapd and lnd. This requires changes in both systems to make this safe.
  • Using a raw channel to send the cancel signal is a bit unwieldy, as we need to avoid a double close that would cause a panic.

My current change is a working fix, but I think there are some better options:

  • Use a context.Context with a CancelFunc. Under the hood, this is a channel guarded by a mutex, which ensures the CancelFunc has a side effect exactly once. A bit heavy-handed here, as we don't need any context inheritance.

  • Use a sync.Once to close the channel. This will save us from a possible double close, without adding the overhead of the atomic Load() used in the context package.

To pursue that second option, we'll need a follow-up PR in lnd to change the types lnwallet.AuxSigJob and lnwallet.SignJob. That can be followed by a PR in tapd to use the new job type.
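
A minimal sketch of that second option, assuming a small wrapper type rather than the final shape of lnwallet.AuxSigJob:

package main

import (
	"fmt"
	"sync"
)

// cancelSignal wraps a channel with a sync.Once so that either side
// (lnd or tapd) can cancel a job batch without risking a panic from a
// double close.
type cancelSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newCancelSignal() *cancelSignal {
	return &cancelSignal{ch: make(chan struct{})}
}

// Cancel closes the underlying channel at most once, no matter how
// many callers race on it.
func (c *cancelSignal) Cancel() {
	c.once.Do(func() {
		close(c.ch)
	})
}

// Done exposes the channel for use in select statements.
func (c *cancelSignal) Done() <-chan struct{} {
	return c.ch
}

func main() {
	sig := newCancelSignal()

	// Both sides may cancel; only the first close takes effect.
	sig.Cancel()
	sig.Cancel()

	<-sig.Done()
	fmt.Println("batch cancelled exactly once, no panic")
}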

I'll add these changes to lightningnetwork/lnd#9074 and then #1118 .

@guggero
Member

guggero commented Sep 11, 2024

@jharveyb looking at this more closely, I don't think we need either of those solutions, as the validator is never expected to close that channel. See lightningnetwork/lnd#9074 (comment).

@jharveyb
Collaborator

Writing up the offline discussion:

Sync'd with Oli, and I agree that we can solve this by:

  • Having tapd skip all pending jobs if any job errs.
  • Never close the cancel channel.

Then for lnd:

  • Skip all pending jobs if any job errs.
  • Leave in the job sorting, as the above behavior relies on a shared job ordering.

Will update the existing PRs in-place.
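
A rough standalone sketch of the agreed tapd behavior (skip the remaining jobs once one errs, while still answering every response channel); auxJob, auxResp, and processBatch are illustrative stand-ins, not the real processAuxSigBatch:

package main

import (
	"errors"
	"fmt"
)

// auxResp and auxJob are stand-ins for lnwallet.AuxSigJobResp and
// lnwallet.AuxSigJob, reduced to what the sketch needs.
type auxResp struct {
	HtlcIndex uint64
	Err       error
}

type auxJob struct {
	HtlcIndex uint64
	Resp      chan auxResp
}

// processBatch signs jobs in order; once any job fails, all remaining
// jobs are answered with that error instead of being signed, and the
// shared cancel channel is never closed from this side.
func processBatch(batch []auxJob, sign func(auxJob) error) {
	var batchErr error
	for _, job := range batch {
		if batchErr == nil {
			batchErr = sign(job)
		}

		job.Resp <- auxResp{HtlcIndex: job.HtlcIndex, Err: batchErr}
	}
}

func main() {
	batch := []auxJob{
		{HtlcIndex: 0, Resp: make(chan auxResp, 1)},
		{HtlcIndex: 1, Resp: make(chan auxResp, 1)},
		{HtlcIndex: 2, Resp: make(chan auxResp, 1)},
	}

	// Fail on the second job; the third job is skipped but still
	// answered, so the submitter never hangs waiting for it.
	go processBatch(batch, func(j auxJob) error {
		if j.HtlcIndex == 1 {
			return errors.New("signing failed")
		}
		return nil
	})

	for _, job := range batch {
		fmt.Printf("htlc %d: err=%v\n", job.HtlcIndex, (<-job.Resp).Err)
	}
}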

@Roasbeef
Member Author

Given the above, do we have an answer to this question: #1118 (review)?

@Roasbeef
Member Author

Also re context cancel, looks to be thread safe: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/context/context.go;l=665-677

@jharveyb
Collaborator

Given the above, do we have an answer to this question: #1118 (review)?

From the stack trace, is it certain that tapd was hanging on that send, or just that that's where that goroutine happened to be during shutdown?
