
tapchannel - [bug]: potential deadlock when signing second level HTLCs #1114

Closed
Roasbeef opened this issue Aug 29, 2024 · 13 comments · Fixed by #1118
@Roasbeef
Member

Background

A user has noticed that occasionally the daemon may freeze up when signing second level HTLCs.

We were able to obtain a trace confirming that the daemon was indeed deadlocked: lnd is waiting for a channel response from tapd, while tapd is blocked attempting to send a response to lnd:

goroutine 2042 [chan receive, 44 minutes]:
github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).SignNextCommitment(0xc006d40360)
	github.com/lightningnetwork/[email protected]/lnwallet/channel.go:4706 +0x124c
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTx(0xc00280ec08)
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:2534 +0x6a
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTxOrFail(0xc00280ec08)
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:2485 +0x1f
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).handleUpstreamMsg(0xc00280ec08, {0x389c7b0, 0xc00faed740})
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:2367 +0xa96
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager(0xc00280ec08)
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:1415 +0x1596
created by github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).Start in goroutine 1680
	github.com/lightningnetwork/[email protected]/htlcswitch/link.go:541 +0x47d
goroutine 1324695 [chan send, 44 minutes]:
github.com/lightninglabs/taproot-assets/tapchannel.(*AuxLeafSigner).processAuxSigBatch(0xc005532930, 0xc0049c1008, 0xc00583e800, {0xc0178b0008, 0x2, 0x2?})
	github.com/lightninglabs/[email protected]/tapchannel/aux_leaf_signer.go:269 +0x1eb
created by github.com/lightninglabs/taproot-assets/tapchannel.(*AuxLeafSigner).SubmitSecondLevelSigBatch in goroutine 2042
	github.com/lightninglabs/[email protected]/tapchannel/aux_leaf_signer.go:104 +0xcf

Here's a diagram from Claude based on the trace above:

graph TD
    subgraph "goroutine 2042 (lnd)"
        E[htlcManager]
        D[handleUpstreamMsg]
        C[updateCommitTxOrFail]
        B[updateCommitTx]
        A[SignNextCommitment]
        
        E --> D
        D --> C
        C --> B
        B -->|"chan receive (44 min)"| A
    end

    subgraph "goroutine 1324695 (taproot-assets)"
        G[SubmitSecondLevelSigBatch]
        F[processAuxSigBatch]
        
        G -->|"chan send (44 min)"| F
    end

    G -->|Creates| F
    G -->|Waiting for| A
    E -->|Waiting for| G

    classDef waiting fill:#FF9999,stroke:#333,stroke-width:2px,color:#000;
    classDef normal fill:#D9EAD3,stroke:#333,stroke-width:1px,color:#000;
    class A,F waiting;
    class B,C,D,E,G normal;

From the above, we can see that we have a circular waiting dependency.

As is, the channel created for the transfer of sigs is always buffered: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L58-L76

We're then blocking here in tapd:

// If there is no commit blob, this isn't a custom channel. We
// still need to signal the job as done though, even if we don't
// have a signature to return.
if sigJob.CommitBlob.IsNone() {
	sigJob.Resp <- lnwallet.AuxSigJobResp{
		HtlcIndex: sigJob.HTLC.HtlcIndex,
	}
	continue
}

One thing to note is that in tapd, this is the send that happens when there's no aux blob for a channel.

Expected behavior

Buffered channel send never blocks.

Actual behavior

Buffered channel send blocks. Potentially there's some underlying mutation here.
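
For reference, a send on a buffered channel only blocks once its buffer is full. A minimal standalone illustration (not project code) of the only way a "buffered" send can hang:

package main

import "fmt"

func main() {
	// Capacity 1: the first send completes without any receiver.
	ch := make(chan int, 1)
	ch <- 1
	fmt.Println("first send done")

	// A second send would block until something drains the buffer;
	// that is the only way a send on a buffered channel can hang.
	select {
	case ch <- 2:
		fmt.Println("second send done")
	default:
		fmt.Println("second send would block: buffer is full")
	}
}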

@Roasbeef
Member Author

Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706

It's gated on the lack of an aux signer there, instead of on both the blob and the aux signer.
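
For illustration, a rough standalone sketch of the suggested gating; option is a tiny stand-in for lnd's fn.Option, and the names auxSigner and commitBlob are hypothetical, not the exact variables in channel.go:

package main

import "fmt"

// option is a minimal stand-in for an optional value, just enough to
// show the gating condition.
type option[T any] struct{ ok bool }

func (o option[T]) IsSome() bool { return o.ok }

func main() {
	// Hypothetical state for a plain (non-custom) channel: an aux
	// signer is configured globally, but this channel has no blob.
	auxSigner := option[string]{ok: true}
	commitBlob := option[[]byte]{ok: false}

	// The current gating dispatches aux sig jobs whenever an aux
	// signer exists; the suggestion is to also require the channel's
	// commitment blob.
	if auxSigner.IsSome() && commitBlob.IsSome() {
		fmt.Println("dispatch aux sig jobs to tapd")
	} else {
		fmt.Println("skip aux sig jobs: not a custom channel")
	}
}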

@Roasbeef
Member Author

Another hunch here is that we may be missing proper cancel/quit channels in this context, which is then preventing a proper shutdown of the channel.

Another relevant part I saw in the goroutine stack dump when trimming it a bit:

goroutine 1555 [semacquire, 1055 minutes]:
sync.runtime_Semacquire(0xc0053ad098?)
	runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc00161c3c0?)
	sync/waitgroup.go:116 +0x48
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Stop(0x20324a0?)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:837 +0x195
github.com/lightningnetwork/lnd/contractcourt.(*ChainArbitrator).ResolveContract(0xc000370d88, {{0xea, 0x94, 0x46, 0x3f, 0x1d, 0xd2, 0x74, 0xbb, 0x99, ...}, ...})
	github.com/lightningnetwork/[email protected]/contractcourt/chain_arbitrator.go:527 +0x21c
github.com/lightningnetwork/lnd/contractcourt.newActiveChannelArbitrator.func5()
	github.com/lightningnetwork/[email protected]/contractcourt/chain_arbitrator.go:463 +0x78
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).stateStep(0xc002a97888, 0xd1ac8, 0x4, 0x0?)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:1290 +0x1129
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).advanceState(0xc002a97888, 0xd1ac8, 0x4, 0x0)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:1615 +0x165
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).channelAttendant(0xc002a97888, 0xd1a96)
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:2855 +0x5ac
created by github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Start in goroutine 507
	github.com/lightningnetwork/[email protected]/contractcourt/channel_arbitrator.go:570 +0x66e

Full goroutine trace is here: https://0bin.net/paste/H8EBWoac#1FnM7HsS3kPWLeRlPisbgVPeU1fux+-HeTCA9eytFgs

@Roasbeef
Member Author

Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74

but then we don't select on it for the send in tapd, nor for the receive in lnd.
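
As a sketch of what catching those signals could look like, here is a minimal standalone example; auxJob, auxResp, and respondOrBail are stand-ins, and the assumption that the job carries a Cancel channel and the signer a quit channel mirrors the discussion above rather than the exact lnd/tapd types:

package main

import (
	"fmt"
	"time"
)

// auxResp and auxJob are stand-ins for lnwallet.AuxSigJobResp and
// lnwallet.AuxSigJob, reduced to the fields the pattern needs.
type auxResp struct {
	HtlcIndex uint64
}

type auxJob struct {
	HtlcIndex uint64
	Resp      chan auxResp
	Cancel    chan struct{}
}

// respondOrBail sends the response, but gives up if the job is
// cancelled or the signer is shutting down, so a stuck receiver can
// never wedge this goroutine forever.
func respondOrBail(job auxJob, quit chan struct{}) bool {
	select {
	case job.Resp <- auxResp{HtlcIndex: job.HtlcIndex}:
		return true

	case <-job.Cancel:
		// The submitter cancelled the batch.
		return false

	case <-quit:
		// The signer itself is shutting down.
		return false
	}
}

func main() {
	quit := make(chan struct{})
	job := auxJob{
		HtlcIndex: 7,
		// Unbuffered on purpose: nobody is receiving, so a bare
		// send here would block forever.
		Resp:   make(chan auxResp),
		Cancel: make(chan struct{}),
	}

	// Simulate a shutdown arriving while the response is pending.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(quit)
	}()

	fmt.Println("responded:", respondOrBail(job, quit))
}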

@dstadulis dstadulis added this to the v0.4.2 milestone Aug 29, 2024
@Roasbeef Roasbeef removed their assignment Sep 3, 2024
@dstadulis
Collaborator

The receive should have a sig job?

Is the usage uniform (quit/timeout flow)?

@dstadulis
Collaborator

@jharveyb will provide a context update

@jharveyb
Collaborator

jharveyb commented Sep 5, 2024

Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706

It's gated on the lack of an aux signer there, instead of on both the blob and the aux signer.

I think if the Blob is None, the tapd behavior is safe; the sigjob resp would have a None SigBlob, so later on when the sigs are sent back to tapd to be packed into a tlv Blob, the signature would be skipped. Though that is also the point where we are blocking:

sigJob.Resp <- lnwallet.AuxSigJobResp{

Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74

but then we don't select on it for the send in tapd, nor for the receive in lnd.

I agree that this seems like the most likely root cause. I've been comparing the behavior for aux sigs against what happens for htlc sigs. AFAICT, the htlcsig jobs are made in the same order as the aux sigs, and then submitted for processing here:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/channel.go#L4625

Those jobs are processed here:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/sigpool.go#L191

Where we do skip a job if the cancel channel was closed, and skip all jobs if quit is received by that worker (not sure where that signal would come from).

Job submission also exits early on cancel or quit:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/sigpool.go#L278

It seems like jobs would be cancelled by the submitter, in case of other errors:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/channel.go#L4651

And that cancel channel is shared amongst all jobs.

For the aux sigs, the flow is different in a few ways. Firstly, the job submission func returns immediately:

https://github.com/lightningnetwork/lnd/blob/1b353b0bfd58a6dfeafc2878111404048826b41d/lnwallet/channel.go#L4628

Within the job processing goroutine, we only check for Quit once, instead of at every point where we try to send on a channel, and we never check the job's cancel channel. So updating the tapd side to catch those signals seems like a good first fix.

I think the proper handling of Quit should be unit testable from the tapd repo.

Another bit I'm confused about is the sorting of sig jobs. AFAICT they are generated in order of (incoming HTLCs, outbound HTLCs), submitted to other goroutines, sorted in-place by output index, and then waited on.

I think the sorting after submission could cause extra problems with signal handling. Example:

  • There are 100 jobs, and the first job in auxSigBatch will be the last job in that slice after BIP 69 sorting.
  • tapd receives a Quit signal while processing that first job; a cancel signal is sent, and a sigjob resp with an error.
  • lnd is now waiting on a response for a job that will never get processed and hangs. The error sent by tapd will also never be received.

I think adding the cancel and quit handling on both sides would address this, but I'm confused as to why we don't sort the jobs right after generation so that the wait order matches the processing order.
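
To make the ordering concern concrete, here is a standalone sketch of sorting the jobs before submission so that the wait order matches the processing order; sigJob is a reduced stand-in, not the real lnwallet type:

package main

import (
	"fmt"
	"sort"
)

// sigJob is a stand-in for lnwallet.AuxSigJob, reduced to the fields
// relevant to the ordering question.
type sigJob struct {
	OutputIndex int
	Resp        chan struct{}
}

func main() {
	// Jobs are generated in (incoming, outgoing) HTLC order...
	jobs := []sigJob{
		{OutputIndex: 7, Resp: make(chan struct{}, 1)},
		{OutputIndex: 2, Resp: make(chan struct{}, 1)},
		{OutputIndex: 5, Resp: make(chan struct{}, 1)},
	}

	// ...but here we sort by output index *before* the slice is
	// handed to the worker, so the order we wait on responses
	// matches the order the worker processes jobs.
	sort.Slice(jobs, func(i, j int) bool {
		return jobs[i].OutputIndex < jobs[j].OutputIndex
	})

	// Worker: processes jobs in slice order.
	go func() {
		for _, j := range jobs {
			j.Resp <- struct{}{}
		}
	}()

	// Submitter: waits in the same slice order, so a failure on job
	// i is observed before we block waiting on job i+1.
	for _, j := range jobs {
		<-j.Resp
		fmt.Println("got response for output", j.OutputIndex)
	}
}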

@dstadulis
Collaborator

This is great debugging!
Should we resolve the job sorting question next?

@jharveyb
Collaborator

OK, following on from the review in #1118:

  • The cancel signal can now be sent from both tapd and lnd. This requires changes in both systems to make this safe.
  • Using a raw channel to send the cancel signal is a bit unwieldy, as we need to avoid a double close that would cause a panic.

My current change is a working fix, but I think there are some better options:

  • Use a context.Context with a CancelFunc. Under the hood, this is a channel guarded by a mutex, which ensures the CancelFunc has a side effect exactly once. A bit heavy-handed here, as we don't need any context inheritance.

  • Use a sync.Once to close the channel. This will save us from a possible double close, without adding the overhead of the atomic Load() used in the context package.

To pursue that second option, we'll need a follow-up PR in lnd to change the types lnwallet.AuxSigJob and lnwallet.SignJob. That can be followed by a PR in tapd to use the new job type.
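
A minimal sketch of that second option, assuming a small wrapper type rather than the final shape of lnwallet.AuxSigJob:

package main

import (
	"fmt"
	"sync"
)

// cancelSignal wraps a channel with a sync.Once so that either side
// (lnd or tapd) can cancel a job batch without risking a panic from a
// double close.
type cancelSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newCancelSignal() *cancelSignal {
	return &cancelSignal{ch: make(chan struct{})}
}

// Cancel closes the underlying channel at most once, no matter how
// many callers race on it.
func (c *cancelSignal) Cancel() {
	c.once.Do(func() {
		close(c.ch)
	})
}

// Done exposes the channel for use in select statements.
func (c *cancelSignal) Done() <-chan struct{} {
	return c.ch
}

func main() {
	sig := newCancelSignal()

	// Both sides may cancel; only the first close takes effect.
	sig.Cancel()
	sig.Cancel()

	<-sig.Done()
	fmt.Println("batch cancelled exactly once, no panic")
}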

I'll add these changes to lightningnetwork/lnd#9074 and then #1118 .

@guggero
Member

guggero commented Sep 11, 2024

@jharveyb looking at this more closely, I don't think we need either of those solutions, as the validator is never expected to close that channel. See lightningnetwork/lnd#9074 (comment).

@jharveyb
Collaborator

Writing up the offline discussion:

Sync'd with Oli, and I agree that we can solve this by:

  • Having tapd skip all pending jobs if any job errs.
  • Never close the cancel channel.

Then for lnd:

  • Skip all pending jobs if any job errs.
  • Leave in the job sorting, as the above behavior relies on a shared job ordering.

Will update the existing PRs in-place.
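
A rough standalone sketch of the agreed tapd behavior (skip the remaining jobs once one errs, while still answering every response channel); auxJob, auxResp, and processBatch are illustrative stand-ins, not the real processAuxSigBatch:

package main

import (
	"errors"
	"fmt"
)

// auxResp and auxJob are stand-ins for lnwallet.AuxSigJobResp and
// lnwallet.AuxSigJob, reduced to what the sketch needs.
type auxResp struct {
	HtlcIndex uint64
	Err       error
}

type auxJob struct {
	HtlcIndex uint64
	Resp      chan auxResp
}

// processBatch signs jobs in order; once any job fails, all remaining
// jobs are answered with that error instead of being signed, and the
// shared cancel channel is never closed from this side.
func processBatch(batch []auxJob, sign func(auxJob) error) {
	var batchErr error
	for _, job := range batch {
		if batchErr == nil {
			batchErr = sign(job)
		}

		job.Resp <- auxResp{HtlcIndex: job.HtlcIndex, Err: batchErr}
	}
}

func main() {
	batch := []auxJob{
		{HtlcIndex: 0, Resp: make(chan auxResp, 1)},
		{HtlcIndex: 1, Resp: make(chan auxResp, 1)},
		{HtlcIndex: 2, Resp: make(chan auxResp, 1)},
	}

	// Fail on the second job; the third job is skipped but still
	// answered, so the submitter never hangs waiting for it.
	go processBatch(batch, func(j auxJob) error {
		if j.HtlcIndex == 1 {
			return errors.New("signing failed")
		}
		return nil
	})

	for _, job := range batch {
		fmt.Printf("htlc %d: err=%v\n", job.HtlcIndex, (<-job.Resp).Err)
	}
}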

@Roasbeef
Member Author

Given the above, do we have an answer to this question: #1118 (review)?

@Roasbeef
Member Author

Also re context cancel, looks to be thread safe: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/context/context.go;l=665-677

@jharveyb
Collaborator

Given the above, do we have an answer to this question: #1118 (review)?

From the stack trace, is it certain that tapd was hanging on that send, or just that that's where that goroutine happened to be during shutdown?
