Add option for dynamic backoff #258
Conversation
@MattBrittan This PR is an alternative to #252. It uses a simpler approach with a "stateless" backoff which doesn't need a reset.
Really sorry it's taken me so long to get back to you on this (and your previous PR).
I just about accepted this as-is, but I would really like to cover as many cases as possible, and I feel that one small change would significantly improve the flexibility (and ensure we don't hammer the broker with connection attempts).
autopaho/net.go (Outdated)
@@ -106,10 +107,11 @@ func establishServerConnection(ctx context.Context, cfg ClientConfig, firstConne
 		// Delay before attempting another connection
 		select {
-		case <-time.After(cfg.ConnectRetryDelay):
+		case <-time.After(cfg.ReconnectBackoff(attempt)):
How would you feel about moving this to the top of the for loop? (The standard backoff could then return 0 if attempt is 0, or even better, calculate the delay based on the time of the previous call.) This would provide the ability to avoid rapid connection loops where there is a client with a duplicate ID or a corrupt message that's continually being resent upon connection.
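For illustration, here is a minimal sketch of the loop shape I have in mind (connectWithBackoff and the connect callback are placeholders for this discussion, not the actual autopaho code):

package example

import (
	"context"
	"time"
)

// connectWithBackoff sketches the suggestion: the delay sits at the top of the
// loop, so every attempt (including the first one after a dropped connection)
// is rate limited by the backoff function.
func connectWithBackoff(ctx context.Context, backoff func(attempt int) time.Duration, connect func(context.Context) error) error {
	for attempt := 0; ; attempt++ {
		select {
		case <-time.After(backoff(attempt)): // a standard backoff could return 0 when attempt == 0
		case <-ctx.Done():
			return ctx.Err()
		}
		if err := connect(ctx); err == nil {
			return nil // connected successfully
		}
		// connection failed; loop around and back off before the next attempt
	}
}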
I understand the idea but I'm not sure how changing it to the start would fix:
- duplicate IDs
- a corrupt message that's continually being resent upon connection
Having an initial delay would impact all connection attempts. That means one would slow down the ramp-up for mass reconnects (e.g. in the benign case of a broker restart) just to cover the case that one MIGHT have a mass issue with duplicate IDs.
Otherwise one would need to track the reason(s) the connection was disconnected (goodbye, come back later, duplicate ID, etc.).
If something happens in a loop, the issue would still be there, just slower (something one could also cover with throttling on the client side).
Regarding "duplicate IDs":
Here the actual issue seems to be more how one was able to create duplicate client IDs in the first place (and by that actually duplicate credentials, which is even worse).
So the actual problem would, IMHO, be on the provisioning side.
Regarding "corrupt messages":
If I understand this case correctly, the corrupt message would lead to a disconnect.
This doesn't sound recoverable, as it would happen regardless of the backoff value, just faster or slower.
Wouldn't it be better to block clients which wreak havoc until the rollout is fixed?
The point I'm trying to make is that these issues seem to have other origins and I'm not sure the backoff is the right place to cover them. One could introduce complexity into the backoff, but my assumption is that one will always hit unexpected issues which one can't cover, as one would have to change backoff strategies dynamically.
So having a default policy with solid defaults might give one a better "coverage" for problems.
To cut it short:
Moving the backoff to the start of the loop is fine with me, but I'd like to avoid introducing any deep complexity in the backoff.
I understand the idea but I'm not sure how changing it to the start would fix:
It will not fix the root cause of either issue. The aim is to rate limit the connection attempts so that we don't end up in a tight connect loop when such an issue occurs. I don't think there is a perfect way of dealing with (or even detecting) these issues in the library (there is a bug or configuration issue), but I do feel that it's worth attempting to mitigate the impact (which includes load on the broker and cost on metered connections).
I feel that users will expect a parameter called ReconnectBackoff to limit reconnection attempt frequency whatever the reason for the connection attempt (i.e. failed attempt, or rapid loss of connection). These issues do sometimes happen in production (I have seen them numerous times in the issues here, in the python paho issues, and on Stack Overflow) and have the potential to be costly (some network links are still expensive, and a connection loop could result in a significant amount of traffic that will not always be detected quickly).
My thought for the initial PR is that the backoff code would include:
if attempt == 0 { return 0 }
which will effectively deliver an identical result to your current PR. However, doing this would then allow the introduction of something like (treat this as pseudocode!):
// NewConstantBackoff creates a new backoff with a constant delay (regardless of the attempt).
func NewConstantBackoff(delay time.Duration) Backoff {
	var previousAttempt time.Time
	return func(attempt int) time.Duration {
		t := previousAttempt
		previousAttempt = time.Now()
		if attempt == 0 { // delay is the minimum time between connection attempts
			return delay - time.Since(t) // May be negative but that's fine
		}
		return delay
	}
}
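For context, the Backoff type used above is assumed here to be simply a function from the attempt number to a delay (the exact definition is not part of this sketch):

// Backoff is the assumed shape of the type used in the sketch above: a function
// from the attempt number to the delay to wait before that attempt.
type Backoff func(attempt int) time.Duration

With that shape, NewConstantBackoff could be passed directly to the ReconnectBackoff field shown in the earlier diff, e.g. ReconnectBackoff: NewConstantBackoff(5 * time.Second).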
I'd like to avoid to introduce any deep complexity in the backoff.
Minimising complexity is also my aim (not sure how successful I've been!). I considered a range of options, and to me this seems like the simplest approach overall. Here is what the V3 client does for comparison.
The point I'm trying to make is that one would need to create a custom backoff implementation before encountering one of the issues mentioned above, which is not what usually happens.
So the code would be prepared for a use case where people would need a new deployment (to change the backoff implementation). But a new deployment should fix the actual problem (which is not the backoff).
So I'm not really convinced, but that's just me. Nonetheless, I pushed another commit to this PR adding the possibility to have a delay before the first connection attempt.
(The force push was needed since I also rebased the PR on latest master.)
Force-pushed from c00e2ce to 5bf8c1e
Force-pushed from 5bf8c1e to 29212b8
Thanks very much for this - sorry it's taken so long!
My aim here is to implement the "delay between connection attempts will be a minimum of ConnectRetryDelay" type rule into the main library (as I believe it's what users would expect). I did not want to ask you to make more modifications than necessary to get this in, as I'm conscious it's taken a while.
This PR introduces the ability to use a dynamic backoff for reconnect attempts.
Its purpose is just to replace the current constant value with a more versatile option.
By default, a backoff strategy is used which always returns the same backoff value, mimicking the previous behaviour.
In addition, there is an exponential backoff available which will increase the max value on each attempt, up to a configurable limit.
The created backoff is a random value between a configured non-zero minimum and the increasing max (up to the configured limit). Using a random value from an increasing range helps spread client reconnect attempts.
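As a rough illustration of the idea, a randomised exponential backoff could look like the sketch below (this is simplified and not the exact implementation in the PR; the function name, shift cap, and overflow handling are illustrative assumptions):

package example

import (
	"math/rand"
	"time"
)

// newExponentialBackoff returns a backoff function: each attempt doubles the
// upper bound (capped at max) and picks a random delay between min and that
// bound, which spreads reconnect attempts across many clients.
func newExponentialBackoff(min, max time.Duration) func(attempt int) time.Duration {
	return func(attempt int) time.Duration {
		shift := attempt
		if shift > 30 { // cap the shift to avoid overflowing time.Duration
			shift = 30
		}
		upper := min << shift
		if upper > max || upper <= 0 {
			upper = max
		}
		if upper <= min {
			return min
		}
		// random value in [min, upper), so retries don't synchronise
		return min + time.Duration(rand.Int63n(int64(upper-min)))
	}
}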
Testing
Existing code has been changed to use the same duration as before but now uses the constant backoff implementation instead. All tests pass.
Additionally, there are dedicated tests for the new backoff implementations.
Closing issues
Closes issue #249