
Add option for dynamic backoff #258

Conversation

ViToni
Contributor

@ViToni ViToni commented Jul 5, 2024

This PR introduces the ability to use a dynamic backoff for reconnect attempts.
Its purpose is simply to replace the current constant value with a more versatile option.

By default, a backoff strategy is used which returns the same backoff value, mimicking the previous behaviour.
In addition, an exponential backoff is available which increases the maximum value on each attempt, up to a configurable limit.
The created backoff is a random value between a configured non-zero minimum and the increasing maximum (up to the configured limit). Using a random value from an increasing range helps spread client reconnect attempts.
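For illustration, a sketch of what the exponential variant amounts to (the names, signature, and doubling factor here are assumptions, not the PR's actual code; it assumes 0 < minDelay <= maxDelay):

import (
	"math/rand"
	"time"
)

// NewExponentialBackoff returns a function mapping the attempt number to a
// random delay in [minDelay, upper], where upper grows exponentially with
// each attempt (doubling is assumed here) and is capped at maxDelay.
// Illustrative sketch only; the merged code may use different names.
func NewExponentialBackoff(minDelay, maxDelay time.Duration) func(attempt int) time.Duration {
	return func(attempt int) time.Duration {
		upper := minDelay << attempt // doubles with every attempt
		if upper <= 0 || upper > maxDelay {
			upper = maxDelay // cap at the configured limit (also guards shift overflow)
		}
		// A random point in [minDelay, upper] spreads reconnects across clients.
		return minDelay + time.Duration(rand.Int63n(int64(upper-minDelay)+1))
	}
}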

Testing

Existing code has been changed to use the same duration as before, but now via the constant backoff implementation. All tests pass.

Additionally, there are dedicated tests for:

  • Constant backoff
  • Exponential backoff

Closing issues

Closes issue #249

@ViToni
Contributor Author

ViToni commented Jul 5, 2024

@MattBrittan This PR is an alternative to #252. This PR uses a simpler approach with a "stateless" backoff which doesn't need a reset.

Contributor

@MattBrittan MattBrittan left a comment

Really sorry it's taken me so long to get back to you on this (and your previous PR).

I just about accepted this as-is, but I would really like to cover as many cases as possible, and feel that one small change would significantly improve the flexibility (and ensure we don't hammer the broker with attempts).

autopaho/net.go Outdated
@@ -106,10 +107,11 @@ func establishServerConnection(ctx context.Context, cfg ClientConfig, firstConne

// Delay before attempting another connection
select {
-case <-time.After(cfg.ConnectRetryDelay):
+case <-time.After(cfg.ReconnectBackoff(attempt)):
Contributor

How would you feel about moving this to the top of the for loop? (The standard backoff could then return 0 if attempt is 0, or even better, calculate the delay based on the time of the previous call.) This would provide the ability to avoid rapid connection loops where there is a client with a duplicate ID or a corrupt message that's continually being resent upon connection.
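As a minimal sketch of what that restructure could look like (simplified; connectWithBackoff and attemptConnection are hypothetical stand-ins, while ClientConfig and ReconnectBackoff are taken from the diff above):

import (
	"context"
	"time"
)

// Sketch: the delay sits at the top of the loop, so every attempt is rate
// limited, including reconnects after a rapid loss of connection.
func connectWithBackoff(ctx context.Context, cfg ClientConfig) error {
	for attempt := 0; ; attempt++ {
		select {
		case <-time.After(cfg.ReconnectBackoff(attempt)): // 0 for attempt == 0 keeps the current behaviour
		case <-ctx.Done():
			return ctx.Err()
		}
		if err := attemptConnection(ctx, cfg); err == nil { // attemptConnection is hypothetical
			return nil
		}
	}
}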

Contributor Author

@ViToni ViToni commented Aug 6, 2024

I understand the idea, but I'm not sure how moving it to the start would fix:

  • duplicate ID
  • corrupt message that's continually being resent upon connection

Having an initial delay would impact all connection attempts. That means one would slow down the ramp-up of mass reconnects (e.g. in the benign case of a broker restart) just for the case that one MIGHT have a mass issue with duplicate IDs.
Otherwise one would need to track the reason(s) the connection was dropped (goodbye, come back later, duplicate ID, etc.).
If something happens in a loop, the issue would still be there, just slower (something one could also cover with throttling on the client side).

Regarding "duplicate IDs":
Here the actual issue seems to be more how one was able to create duplicate client IDs (and by that actually dulicate credentials which is even worse).
So the actual problem would be IMHO on the provisioning side.

Regarding "corrupt messages":
If I understand this case correctly, the corrupt message would lead to a disconnect.
This sounds not recoverable as it would happen regardless of the backoff value, just faster / slower.
Wouldn't it be better to block clients which run havoc until the rollout is fixed.

The point I'm trying to make is that these issues seem to have other origins, and I'm not sure the backoff is the right place to cover them. One could introduce complexity into the backoff, but my assumption is that one will always hit unexpected issues which one can't cover, as one would have to change backoff strategies dynamically.
So having a default policy with solid defaults might give one better "coverage" for problems.

To cut it short:
Moving the backoff to the start of the loop is fine with me, but I'd like to avoid introducing any deep complexity into the backoff.

Contributor

@MattBrittan MattBrittan commented Aug 6, 2024

I understand the idea, but I'm not sure how moving it to the start would fix:

It will not fix the root cause of either issue. The aim is to rate limit the connection attempts so that we don't end up in a tight connect loop when such an issue occurs. I don't think there is a perfect way of dealing with (or even detecting) these issues in the library (there is a bug or configuration issue), but I do feel that it's worth attempting to mitigate the impact (which includes load on the broker and cost on metered connections).

I feel that users will expect a parameter called ReconnectBackoff to limit reconnection attempt frequency whatever the reason for the connection attempt (i.e. a failed attempt, or rapid loss of connection). These issues do sometimes happen in production (I have seen them numerous times in the issues here, in the Python Paho issues, and on Stack Overflow) and have the potential to be costly (some network links are still expensive, and a connection loop could result in a significant amount of traffic and will not always be detected quickly).

My thought for the initial PR is that the backoff code would include:

if attempt == 0 { return 0 }

which will effectively deliver an identical result to your current PR. However, doing this would then allow the introduction of something like (treat this as pseudocode!):

// Creates a new backoff with constant delay (regardless of the attempt).
// Here Backoff is taken to be func(attempt int) time.Duration.
func NewConstantBackoff(delay time.Duration) Backoff {
	var previousAttempt time.Time
	return func(attempt int) time.Duration {
		t := previousAttempt
		previousAttempt = time.Now()
		if attempt == 0 { // delay is the minimum time between connection attempts
			return delay - time.Since(t) // May be negative but that's fine
		}
		return delay
	}
}
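A quick usage sketch of the behaviour this gives (using NewConstantBackoff from above; a caller would clamp a negative result to zero):

b := NewConstantBackoff(5 * time.Second)
_ = b(0)                    // first ever call: large negative result, i.e. no wait needed
_ = b(1)                    // failed attempt: the full 5s delay
time.Sleep(2 * time.Second) // connection held for ~2s, then dropped
_ = b(0)                    // new cycle: roughly 3s, topping the gap back up to the 5s minimum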

I'd like to avoid introducing any deep complexity into the backoff.

Minimising complexity is also my aim (not sure how successful I've been!). I considered a range of options, and to me this seems like the simplest approach overall. Here is what the V3 client does for comparison.

Contributor Author

The point I'm trying to make is that one would need to create a custom backoff implementation before encountering one of the issues mentioned above, which is not what usually happens.
So the code would be prepared for a use case where people would need a new deployment (to change the backoff implementation). But a new deployment should fix the actual problem (which is not the backoff).

So I'm not really convinced, but that's just me. Nonetheless, I pushed another commit to this PR adding the possibility of a delay before the first connection attempt.
(The force push was needed since I also rebased the PR on latest master.)

@ViToni ViToni force-pushed the feature/add_backoff_strategy__functional branch from c00e2ce to 5bf8c1e on August 9, 2024 14:00
@ViToni ViToni force-pushed the feature/add_backoff_strategy__functional branch from 5bf8c1e to 29212b8 on August 9, 2024 14:14
@MattBrittan MattBrittan merged commit f1fe38b into eclipse-paho:master Aug 11, 2024
1 check passed
@MattBrittan
Contributor

Thanks very much for this - sorry it's taken so long!

The point I'm trying to make is that one would need to create a custom backoff implementation before encountering one of the issues mentioned above, which is not what usually happens.

My aim here is to implement the "delay between connection attempts will be a minimum of ConnectRetryDelay" type rule in the main library (as I believe it's what users would expect). I did not want to ask you to make more modifications than necessary to get this in, as I'm conscious it's taken a while.

@ViToni ViToni deleted the feature/add_backoff_strategy__functional branch August 13, 2024 10:15