Fix race condition in leader election for OCP #455
Thanks for your PR,
To skip the vendors CIs use one of:
|
Pull Request Test Coverage Report for Build 5257619445
💛 - Coveralls
Marking the node as draining (via the node annotation) is done during leader election. In this case, I think getDrainLock should be allowed to continue if the node already has the draining annotation, rather than relying on the drainable attr. In such edge cases we may end up with more than one node draining, but that is better than a deadlock. Driving this whole process from the controller should solve all of these cases.
In OCP we have two steps to prepare a node before we drain it. First, we win the leader election and annotate the node with Draining; then we pause the MCP and mark the node as MCP paused.

If the config_daemon is reset between the first step and the second, it gets stuck: another node takes the leader election but does not mark itself as Draining, because there is already a node draining, while the node already marked as Draining tries to get the drain lock again but cannot, because the other node holds it.

With this change, if the node is already marked as Draining or MCP paused, it does not try to take the lock again and just continues after the reset.

Signed-off-by: Sebastian Sch <[email protected]>
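The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the annotation key, the state values, and the `needsDrainLock` helper are hypothetical names chosen for the example, and the operator's real identifiers may differ.

```go
package main

import "fmt"

// Hypothetical annotation key and state values, used for illustration only.
const (
	stateAnnotation = "example.io/drain-state"
	stateIdle       = "Idle"
	stateDraining   = "Draining"
	stateMCPPaused  = "MCPPaused"
)

// needsDrainLock reports whether the daemon still has to win the leader
// election before it may drain the node. A node that was already marked as
// Draining or MCP paused before a daemon reset held the lock earlier, so it
// skips the election and resumes from where it stopped.
func needsDrainLock(annotations map[string]string) bool {
	switch annotations[stateAnnotation] {
	case stateDraining, stateMCPPaused:
		return false
	default:
		return true
	}
}

func main() {
	for _, state := range []string{stateIdle, stateDraining, stateMCPPaused} {
		node := map[string]string{stateAnnotation: state}
		fmt.Printf("state=%s needsDrainLock=%v\n", state, needsDrainLock(node))
	}
}
```

Skipping the election for an already-marked node means that, in rare cases, more than one node could be draining at once, which is the trade-off accepted here in exchange for avoiding the deadlock.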
Hi @adrianchiris, please take another look when you have time :)
LGTM