
Add option for dynamic backoff #258

Conversation

ViToni
Contributor

@ViToni ViToni commented Jul 5, 2024

This PR introduces the ability to use a dynamic backoff for reconnect attempts.
Its purpose is simply to replace the current constant value with a more versatile option.

By default, a backoff strategy is used which returns the same backoff value, mimicking the previous behaviour.
In addition, an exponential backoff is available which increases the maximum value on each attempt, up to a configurable limit.
The created backoff is a random value between a configured non-zero minimum and the increasing maximum (up to the configured limit). Using a random value from an increasing range helps spread client reconnect attempts.
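For illustration, a sketch of what the exponential variant amounts to (the names, signature, and doubling factor here are assumptions, not the PR's actual code; it assumes 0 < minDelay <= maxDelay):

import (
	"math/rand"
	"time"
)

// NewExponentialBackoff returns a function mapping the attempt number to a
// random delay in [minDelay, upper], where upper grows exponentially with
// each attempt (doubling is assumed here) and is capped at maxDelay.
// Illustrative sketch only; the merged code may use different names.
func NewExponentialBackoff(minDelay, maxDelay time.Duration) func(attempt int) time.Duration {
	return func(attempt int) time.Duration {
		upper := minDelay << attempt // doubles with every attempt
		if upper <= 0 || upper > maxDelay {
			upper = maxDelay // cap at the configured limit (also guards shift overflow)
		}
		// A random point in [minDelay, upper] spreads reconnects across clients.
		return minDelay + time.Duration(rand.Int63n(int64(upper-minDelay)+1))
	}
}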

Testing

Existing code has been changed to use the same duration as before, but now via the constant backoff implementation. All tests pass.

Additionally, there are dedicated tests for:

  • Constant backoff
  • Exponential backoff

Closing issues

Closes issue #249

@ViToni
Contributor Author

ViToni commented Jul 5, 2024

@MattBrittan This PR is an alternative to #252. This PR uses a simpler approach with a "stateless" backoff which doesn't need a reset.

Contributor

@MattBrittan MattBrittan left a comment

Really sorry it's taken me so long to get back to you on this (and your previous PR).

I just about accepted this as-is, but I would really like to cover as many cases as possible, and feel that one small change would significantly improve the flexibility (and ensure we don't hammer the broker with attempts).

autopaho/net.go Outdated
@@ -106,10 +107,11 @@ func establishServerConnection(ctx context.Context, cfg ClientConfig, firstConne

// Delay before attempting another connection
select {
-case <-time.After(cfg.ConnectRetryDelay):
+case <-time.After(cfg.ReconnectBackoff(attempt)):
Contributor

How would you feel about moving this to the top of the for loop? (The standard backoff could then return 0 if attempt is 0, or even better, calculate the delay based on the time of the previous call.) This would provide the ability to avoid rapid connection loops where there is a client with a duplicate ID or a corrupt message that's continually being resent upon connection.
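As a minimal sketch of what that restructure could look like (simplified; connectWithBackoff and attemptConnection are hypothetical stand-ins, while ClientConfig and ReconnectBackoff are taken from the diff above):

import (
	"context"
	"time"
)

// Sketch: the delay sits at the top of the loop, so every attempt is rate
// limited, including reconnects after a rapid loss of connection.
func connectWithBackoff(ctx context.Context, cfg ClientConfig) error {
	for attempt := 0; ; attempt++ {
		select {
		case <-time.After(cfg.ReconnectBackoff(attempt)): // 0 for attempt == 0 keeps the current behaviour
		case <-ctx.Done():
			return ctx.Err()
		}
		if err := attemptConnection(ctx, cfg); err == nil { // attemptConnection is hypothetical
			return nil
		}
	}
}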

Contributor Author

@ViToni ViToni commented Aug 6, 2024

I understand the idea, but I'm not sure how moving it to the start would fix:

  • duplicate ID
  • corrupt message that's continually being resent upon connection

Having an initial delay would impact all connection attempts. That means one would slow down the ramp-up of mass reconnects (e.g. in the benign case of a broker restart) just for the case that one MIGHT have a mass issue with duplicate IDs.
Otherwise one would need to track the reason(s) the connection was dropped (goodbye, come back later, duplicate ID, etc.).
If something happens in a loop, the issue would still be there, just slower (something one could also cover with throttling on the client side).

Regarding "duplicate IDs":
Here the actual issue seems to be more how one was able to create duplicate client IDs (and by that actually dulicate credentials which is even worse).
So the actual problem would be IMHO on the provisioning side.

Regarding "corrupt messages":
If I understand this case correctly, the corrupt message would lead to a disconnect.
This sounds not recoverable as it would happen regardless of the backoff value, just faster / slower.
Wouldn't it be better to block clients which run havoc until the rollout is fixed.

The point I'm trying to make is that these issues seem to have other origins, and I'm not sure the backoff is the right place to cover them. One could introduce complexity into the backoff, but my assumption is that one will always hit unexpected issues which one can't cover, as one would have to change backoff strategies dynamically.
So having a default policy with solid defaults might give one better "coverage" for problems.

To cut it short:
Moving the backoff to the start of the loop is fine with me, but I'd like to avoid introducing any deep complexity into the backoff.

Contributor

@MattBrittan MattBrittan commented Aug 6, 2024

I understand the idea, but I'm not sure how moving it to the start would fix:

It will not fix the root cause of either issue. The aim is to rate limit the connection attempts so that we don't end up in a tight connect loop when such an issue occurs. I don't think there is a perfect way of dealing with (or even detecting) these issues in the library (there is a bug or configuration issue), but I do feel that it's worth attempting to mitigate the impact (which includes load on the broker and cost on metered connections).

I feel that users will expect a parameter called ReconnectBackoff to limit reconnection attempt frequency whatever the reason for the connection attempt (i.e. a failed attempt, or rapid loss of connection). These issues do sometimes happen in production (I have seen them numerous times in the issues here, in the Python Paho issues, and on Stack Overflow) and have the potential to be costly (some network links are still expensive, and a connection loop could result in a significant amount of traffic and will not always be detected quickly).

My thought for the initial PR is that the backoff code would include:

if attempt == 0 { return 0 }

which will effectively deliver an identical result to your current PR. However, doing this would then allow the introduction of something like (treat this as pseudocode!):

// Creates a new backoff with constant delay (regardless of the attempt).
// Here Backoff is taken to be func(attempt int) time.Duration.
func NewConstantBackoff(delay time.Duration) Backoff {
	var previousAttempt time.Time
	return func(attempt int) time.Duration {
		t := previousAttempt
		previousAttempt = time.Now()
		if attempt == 0 { // delay is the minimum time between connection attempts
			return delay - time.Since(t) // May be negative but that's fine
		}
		return delay
	}
}
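A quick usage sketch of the behaviour this gives (using NewConstantBackoff from above; a caller would clamp a negative result to zero):

b := NewConstantBackoff(5 * time.Second)
_ = b(0)                    // first ever call: large negative result, i.e. no wait needed
_ = b(1)                    // failed attempt: the full 5s delay
time.Sleep(2 * time.Second) // connection held for ~2s, then dropped
_ = b(0)                    // new cycle: roughly 3s, topping the gap back up to the 5s minimum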

I'd like to avoid introducing any deep complexity into the backoff.

Minimising complexity is also my aim (not sure how successful I've been!). I considered a range of options, and to me this seems like the simplest approach overall. Here is what the V3 client does for comparison.

Contributor Author

The point I'm trying to make is that one would need to create a custom backoff implementation before encountering one of the issues mentioned above, which is not what usually happens.
So the code would be prepared for a use case where people would need a new deployment (to change the backoff implementation). But a new deployment should fix the actual problem (which is not the backoff).

So I'm not really convinced, but that's just me. Nonetheless, I pushed another commit to this PR adding the possibility of a delay before the first connection attempt.
(The force push was needed since I also rebased the PR on latest master.)

@ViToni ViToni force-pushed the feature/add_backoff_strategy__functional branch from c00e2ce to 5bf8c1e on August 9, 2024 14:00
@ViToni ViToni force-pushed the feature/add_backoff_strategy__functional branch from 5bf8c1e to 29212b8 on August 9, 2024 14:14
@MattBrittan MattBrittan merged commit f1fe38b into eclipse-paho:master Aug 11, 2024
1 check passed
@MattBrittan
Contributor

Thanks very much for this - sorry it's taken so long!

The point I'm trying to make is that one would need to create a custom backoff implementation before encountering one of the issues mentioned above, which is not what usually happens.

My aim here is to implement the "delay between connection attempts will be a minimum of ConnectRetryDelay" type rule in the main library (as I believe it's what users would expect). I did not want to ask you to make more modifications than necessary to get this in, as I'm conscious it's taken a while.

@ViToni ViToni deleted the feature/add_backoff_strategy__functional branch August 13, 2024 10:15