Honor deadman timer, fixes #1093 #1094

parsley42 · 2022-07-28T16:29:56Z

NOTE: The original description of this PR is outdated, but left here for reference. See the discussion for how the PR was updated.

Run make pr-prep from the root of the repository to run formatting, linting and tests.

This patch modifies the main run loop for the socket mode connection. Based on reading the comments, it appears the intention was for run() to return context.Cancelled in the event of a cancelled context, or nil in the event of a different error. This based on the comment:

// wg.Wait() finishes only after any of the above go routines finishes.

... but based on the use of the wait group, it appears that ALL of the goroutines would have to finish, causing the the deadman time to be ignored. This patch removes the wait group (and wg.Wait()) and instead selects on either 1) context cancelled, or 2) timer expired.

I believe this preserves the original intent of the code - but I could be wrong! In any case, this patch fixes #1093 for me, and my robot still runs correctly AFAICS. Ping to @xNok (the original PR author) to have a look.

xNok · 2022-07-29T10:02:33Z

I didn't write that part, but the wait group is supposed to be the right approach.

Yes, ALL of the goroutines would have to finish, but this is normal because you want a gracefull shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully.

Ideally, when the deadman triggers, you should cancel the context and wait for all goroutine to terminate, then reconnect.

As to why it doesn't work properly, I don't know yet 😞

parsley42 · 2022-07-29T12:29:01Z

Yes, ALL of the goroutines would have to finish, but this is normal because you want a gracefull shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully.

Ideally, when the deadman triggers, you should cancel the context and wait for all goroutine to terminate, then reconnect.

Ah, ok, thanks for the - hah! - context. Looking at the code, I see what you mean - so, the root of the problem is that one of the items being waited on is ReadJSON, which doesn't take a context. I'm going to mark the PR as draft and work on adding back the waitGroup and trying a different approach.

parsley42 · 2022-07-29T13:11:12Z

Ok @xNok and @mumoshu - this much smaller PR fixes things for me, and may be "good enough". In short, the runMessageReceiver thread eventually blocks in ReadJSON on the websocket, which doesn't take a context and so doesn't exit gracefully on cancel. Now, all my PR does it to remove that thread from the waitGroup and update some comments. How does this look?

mumoshu · 2022-07-30T08:30:54Z

@parsley42 Hey! Thanks for the fix.
I was rereading the code I've written but missed one thing- It would be great if you could clarify it if possible.
So, In L145-147 we have:

		case <-ctx.Done():
			// Detect when the connection is dead and try close connection.
			if err = conn.Close(); err != nil {

which is running in another goroutine.
I think it is supposed to chime in to close the connection on a context cancellation which in turn fails the ReadJSON attempt on the reader(=websocket.Conn) asap.

Is it that the connection close doesn't result in immediately failing ReadJSON hence your fix is needed?

parsley42 · 2022-07-30T10:18:14Z

Is it that the connection close doesn't result in immediately failing ReadJSON hence your fix is needed?

@mumoshu Thanks for having a look! Yes, that's my assessment of the issue. I'll walk you through my analysis.

The core robot code (github.com/lnxjedi/gopherbot) has been running very stable with RTM for many years, but a couple weeks after migrating to socket mode, one of our robots stopped responding to commands, and in the slack app I got a dispatch_failed error when I used the robot's slash command - this led me to suspect the connection had gone bad. Since people were trying to get work done, I didn't take further time to troubleshoot (a netstat would have been good at this point), I just restarted the robot service. When I restarted the robot process, all was well again.

This led me to look at the code for a ping timeout, since I knew slack was sending a ping at regular intervals. To artificially suppress the ping arriving from slack, I used the local firewall to block inbound traffic on the port, then watched the slack logs. It took several minutes for the connection to reset. I repeated the experiment, then sent my SIGUSR1 (forces an abort/stack dump) during the long hang, and found a thread blocking in ReadJSON. Now seeing your comments, I recognize that a full block like this would prevent a graceful tcp connection shutdown. I'm trying to think of a way to better test. Of course, if the vm/container/whatever on the Slack end just goes away, the result would be the same. It would probably be good to reconnect quickly on ping timeout, even in the event that the original connection can't be closed gracefully.

mumoshu · 2022-07-31T01:57:35Z

@parsley42 Thanks for your confirmation!

I read your experience and analysis and went back to gorilla/websocket code. I now got to think that conn.Close is no more than just closing TCP connection without coordination with the opposite side of the connection.

https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L342-L346

ReadJSON may fail immediately only if the opposite of the connection(=the websocket server) send us a WebSocket "Close" message, which doesn't happen in this scenario.

This is a gorilla/websocket issue that describes a not exactly same but related problem: gorilla/websocket#474

If we were to gracefully stop the bot when the Slack's websocket server vm/container/whatever and the TCP connection in-between is still alive, we might be able to start a normal close procedure by, instead of just Closeing it, writing a control message, like c.WriteControl(websocket.CloseMessage, websocket.FormatCloseMessage(websocket.CloseNormalClosure, "") so that the server will hopefully send us a close message that may fail read and its consumer ReadJSON to fail immediately.

https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L411-L413

But-

if the vm/container/whatever on the Slack end just goes away, the result would be the same

This is really a good point. If the Slack end disappeared without notifying us, this normal close procedure will end up with a read/connection timeout. We need to handle the timeout in a graceful stop scenario anyway.

That said, I believe your fix is perfectly valid and also the best way to handle the timeout issue!

parsley42 · 2022-09-09T12:54:10Z

@kanata2 Can you have a look at this and let me know if you have any questions?

kanata2 · 2022-09-09T12:57:06Z

Sorry for slow response. I'll confirm this weekend.

kanata2

@parsley42
I have read discussions in this PR and agree with you, so will merge this.

@xNok @mumoshu
Thanks for your reviews.

parsley42 added 2 commits July 28, 2022 11:53

Fail on context cancelled, retry on timeout

b69d0ea

Update comment, remove reference to timeout

d96792e

parsley42 requested a review from kanata2 July 28, 2022 16:29

parsley42 marked this pull request as draft July 29, 2022 12:29

parsley42 removed the request for review from kanata2 July 29, 2022 12:29

Add back the waitgroup, just don't wait on ReadJSON

9521295

parsley42 marked this pull request as ready for review July 31, 2022 12:44

parsley42 requested review from ollieparsley and zchee July 31, 2022 12:48

kanata2 requested review from kanata2 and removed request for ollieparsley August 1, 2022 02:13

kanata2 added needs review SocketMode about SocketMode bugfix [PR] bugfix labels Aug 1, 2022

kanata2 removed the needs review label Sep 11, 2022

kanata2 approved these changes Sep 11, 2022

View reviewed changes

kanata2 merged commit 6f5eda2 into slack-go:master Sep 11, 2022

kanata2 mentioned this pull request Nov 8, 2022

Socketmode panic, write on closed channel #1125

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Honor deadman timer, fixes #1093 #1094

Honor deadman timer, fixes #1093 #1094

parsley42 commented Jul 28, 2022 •

edited

Loading

xNok commented Jul 29, 2022

parsley42 commented Jul 29, 2022

parsley42 commented Jul 29, 2022

mumoshu commented Jul 30, 2022

parsley42 commented Jul 30, 2022 •

edited

Loading

mumoshu commented Jul 31, 2022 •

edited

Loading

parsley42 commented Sep 9, 2022

kanata2 commented Sep 9, 2022

kanata2 left a comment

Honor deadman timer, fixes #1093 #1094

Honor deadman timer, fixes #1093 #1094

Conversation

parsley42 commented Jul 28, 2022 • edited Loading

xNok commented Jul 29, 2022

parsley42 commented Jul 29, 2022

parsley42 commented Jul 29, 2022

mumoshu commented Jul 30, 2022

parsley42 commented Jul 30, 2022 • edited Loading

mumoshu commented Jul 31, 2022 • edited Loading

parsley42 commented Sep 9, 2022

kanata2 commented Sep 9, 2022

kanata2 left a comment

Choose a reason for hiding this comment

parsley42 commented Jul 28, 2022 •

edited

Loading

parsley42 commented Jul 30, 2022 •

edited

Loading

mumoshu commented Jul 31, 2022 •

edited

Loading