Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove min_isr #789

Merged
merged 3 commits into from
Oct 1, 2024
Merged

Remove min_isr #789

merged 3 commits into from
Oct 1, 2024

Conversation

viktorerlingsson
Copy link
Member

@viktorerlingsson viktorerlingsson commented Sep 24, 2024

Setting minimum in sync replicas (min_isr) > 0 with the current implementation will block LavinMQ from starting. The reason for this is that LavinMQ does some file actions on startup (cleanup, truncate, etc - for example the actions taken here). With min_isr set, these actions needs to be replicated to at least n (min_isr) number of nodes before LavinMQ can continue. However, the listener for the replication server has not been started yet when these actions are taken, which makes it impossible for any followers to connect to the leader, and therefore makes it impossible to replicate the actions to the required number of nodes. This leaves the leader node stuck on waiting for actions to be replicated to followers who are themselves waiting for the leader to accept connections.

Using clustering without min_isr will work the same way as if min_isr would be set to 0: The leader will not wait for followers to connect/reconnect before accepting connections and data. If followers are connected and in-sync, they will be kept in-sync for as long as possible. But if a follower disconnects, the leader node will not wait for it to come back online before accepting new connections and data. This means that it is possible to have data that isn't synced to any follower.

Several workarounds for this where taken into consideration, but we deemed none of them to be good enough. Which is why we decided to remove min_isr for now, until we have a better idea of how to implement it.

WHAT is this pull request doing?

Removes settings and code related to min_isr.

The current implementation has been saved as a backup in https://github.com/cloudamqp/lavinmq/tree/min_isr_backup.

HOW can this pull request be tested?

Run specs

Copy link
Member

@carlhoerberg carlhoerberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit needs a lot more explanation

@viktorerlingsson
Copy link
Member Author

The commit needs a lot more explanation

Something like this? Or are you looking for something else?

Setting minimum in sync replicas (min_isr) > 0 with the current implementation will block LavinMQ from starting. The reason for this is that LavinMQ does some file actions on startup (cleanup, truncate, etc - for example the actions taken here). With min_isr set, these actions needs to be replicated to at least n (min_isr) number of nodes before LavinMQ can continue. However, the listener for the replication server has not been started yet when these actions are taken, which makes it impossible for any followers to connect to the leader, and therefore makes it impossible to replicate the actions to the required number of nodes. This leaves the leader node stuck on waiting for actions to be replicated to followers who are themselves waiting for the leader to accept connections.

Several workarounds for this where taken into consideration, but we deemed none of them to be good enough. Which is why we decided to remove min_isr for now, until we have a better idea of how to implement it.

@dentarg
Copy link
Member

dentarg commented Sep 29, 2024

What can happen without a properly working min_isr functionality? I think that should be spelled out

@viktorerlingsson
Copy link
Member Author

What can happen without a properly working min_isr functionality? I think that should be spelled out

Sure, added this to the comment:
"Using clustering without min_isr will work the same way as if min_isr would be set to 0: The leader will not wait for followers to connect/reconnect before accepting connections and data. If followers are connected and in-sync, they will be kept in-sync for as long as possible. But if a follower disconnects, the leader node will not wait for it to come back online before accepting new connections and data. This means that it is possible to have data that isn't synced to any follower."

@viktorerlingsson viktorerlingsson merged commit ddf5871 into main Oct 1, 2024
23 of 25 checks passed
@viktorerlingsson viktorerlingsson deleted the remove_min_isr branch October 1, 2024 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants