-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Honor deadman timer, fixes #1093 #1094
Conversation
I didn't write that part, but the wait group is supposed to be the right approach. Yes, ALL of the goroutines would have to finish, but this is normal because you want a gracefull shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully. Ideally, when the deadman triggers, you should cancel the context and wait for all goroutine to terminate, then reconnect. As to why it doesn't work properly, I don't know yet 😞 |
Ah, ok, thanks for the - hah! - context. Looking at the code, I see what you mean - so, the root of the problem is that one of the items being waited on is |
Ok @xNok and @mumoshu - this much smaller PR fixes things for me, and may be "good enough". In short, the |
@parsley42 Hey! Thanks for the fix.
which is running in another goroutine. Is it that the connection close doesn't result in immediately failing |
@mumoshu Thanks for having a look! Yes, that's my assessment of the issue. I'll walk you through my analysis. The core robot code (github.com/lnxjedi/gopherbot) has been running very stable with RTM for many years, but a couple weeks after migrating to socket mode, one of our robots stopped responding to commands, and in the slack app I got a This led me to look at the code for a ping timeout, since I knew slack was sending a ping at regular intervals. To artificially suppress the ping arriving from slack, I used the local firewall to block inbound traffic on the port, then watched the slack logs. It took several minutes for the connection to reset. I repeated the experiment, then sent my |
@parsley42 Thanks for your confirmation! I read your experience and analysis and went back to gorilla/websocket code. I now got to think that conn.Close is no more than just closing TCP connection without coordination with the opposite side of the connection. https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L342-L346
This is a gorilla/websocket issue that describes a not exactly same but related problem: gorilla/websocket#474 If we were to gracefully stop the bot when the Slack's websocket server vm/container/whatever and the TCP connection in-between is still alive, we might be able to start a normal close procedure by, instead of just https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L411-L413 But-
This is really a good point. If the Slack end disappeared without notifying us, this normal close procedure will end up with a read/connection timeout. We need to handle the timeout in a graceful stop scenario anyway. That said, I believe your fix is perfectly valid and also the best way to handle the timeout issue! |
@kanata2 Can you have a look at this and let me know if you have any questions? |
Sorry for slow response. I'll confirm this weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@parsley42
I have read discussions in this PR and agree with you, so will merge this.
NOTE: The original description of this PR is outdated, but left here for reference. See the discussion for how the PR was updated.
make pr-prep
from the root of the repository to run formatting, linting and tests.This patch modifies the main run loop for the socket mode connection. Based on reading the comments, it appears the intention was for run() to return
context.Cancelled
in the event of a cancelled context, ornil
in the event of a different error. This based on the comment:... but based on the use of the wait group, it appears that ALL of the goroutines would have to finish, causing the the deadman time to be ignored. This patch removes the wait group (and wg.Wait()) and instead selects on either 1) context cancelled, or 2) timer expired.
I believe this preserves the original intent of the code - but I could be wrong! In any case, this patch fixes #1093 for me, and my robot still runs correctly AFAICS. Ping to @xNok (the original PR author) to have a look.