Fix race condition in leader election for OCP #455

SchSeba · 2023-06-11T13:25:42Z

In OCP we prepare a node for draining in two steps. First we win the leader election and annotate the node as Draining; then we pause the MCP and mark the node as MCP paused.

If the config_daemon is restarted between the first step and the second, it gets stuck: another node wins the leader election but does not mark itself as Draining, because a node is already draining. Meanwhile, the node that carries the Draining annotation tries to take the drain lock again, but the other node holds it.

With this change, a node that already has the Draining or MCP paused annotation does not try to take the lock again and simply continues after the restart.
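The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the annotation key, the state values, and the function name `shouldAcquireDrainLock` are all hypothetical stand-ins for whatever pkg/daemon/daemon.go actually uses.

```go
package main

import "fmt"

// Hypothetical annotation key and state values, chosen for illustration only.
// The names used by the real operator may differ.
const (
	stateAnnotation = "example.io/state"
	stateIdle       = "Idle"
	stateDraining   = "Draining"
	stateMCPPaused  = "Draining_MCP_Paused"
)

// shouldAcquireDrainLock reports whether the daemon still needs to win the
// leader election before draining. A node that was already marked as
// Draining or MCP paused before a restart skips the lock and resumes.
func shouldAcquireDrainLock(annotations map[string]string) bool {
	switch annotations[stateAnnotation] {
	case stateDraining, stateMCPPaused:
		return false
	default:
		// Covers Idle, unknown values, and a missing annotation.
		return true
	}
}

func main() {
	fmt.Println(shouldAcquireDrainLock(map[string]string{stateAnnotation: stateIdle}))      // true
	fmt.Println(shouldAcquireDrainLock(map[string]string{stateAnnotation: stateDraining}))  // false
	fmt.Println(shouldAcquireDrainLock(map[string]string{stateAnnotation: stateMCPPaused})) // false
}
```

Treating a missing annotation the same as Idle keeps the normal path unchanged: only a node that had already started the drain sequence bypasses the election.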

github-actions · 2023-06-11T13:25:54Z

Thanks for your PR,
To run vendors CIs use one of:

/test-all: To run all tests for all vendors.
/test-e2e-all: To run all E2E tests for all vendors.
/test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

/skip-all: To skip all tests for all vendors.
/skip-e2e-all: To skip all E2E tests for all vendors.
/skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
Best regards.

coveralls · 2023-06-11T15:08:25Z

Pull Request Test Coverage Report for Build 5257619445

0 of 11 (0.0%) changed or added relevant lines in 1 file are covered.
16 unchanged lines in 4 files lost coverage.
Overall coverage decreased (-0.2%) to 25.767%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/daemon/daemon.go	0	11	0.0%

Files with Coverage Reduction	New Missed Lines	%
api/v1/helper.go	3	41.92%
pkg/utils/openshift_context.go	3	10.34%
pkg/daemon/daemon.go	4	42.68%
controllers/sriovibnetwork_controller.go	6	62.26%

Totals
Change from base Build 5143456346:	-0.2%
Covered Lines:	1957
Relevant Lines:	7595

💛 - Coveralls

pkg/daemon/daemon.go

adrianchiris · 2023-06-12T15:15:36Z

If the config_daemon is restarted between the first step and the second, it gets stuck: another node wins the leader election but does not mark itself as Draining, because a node is already draining. Meanwhile, the node that carries the Draining annotation tries to take the drain lock again, but the other node holds it.

Marking the node as draining (via a node annotation) is done during leader election.
What may happen is that another node is elected leader while its caches are outdated, and then it also marks its own node as draining.

In this case, I think getDrainLock should allow the node to continue if it already has the draining annotation, rather than rely on the drainable attribute.

In such edge cases more than one node may drain at the same time, but that is better than a deadlock.

Driving this whole process from a controller should solve all of these cases;
#427 is a good start.
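The decision suggested here for the elected leader could look roughly like the sketch below. It is an illustration under assumptions, not the actual getDrainLock logic: `canProceedWithDrain`, the `isDrainingState` helper, and the state strings are hypothetical, and node state is modeled as a plain map instead of being read from the API server.

```go
package main

import "fmt"

// isDrainingState treats both hypothetical in-progress states as "draining".
func isDrainingState(state string) bool {
	return state == "Draining" || state == "Draining_MCP_Paused"
}

// canProceedWithDrain decides what the elected leader may do.
// nodeStates maps node name to the value of its state annotation.
func canProceedWithDrain(self string, nodeStates map[string]string) bool {
	// A node that is already marked as draining always continues, even if
	// another node is marked too. This may let two nodes drain at once,
	// which is preferred over a deadlock.
	if isDrainingState(nodeStates[self]) {
		return true
	}
	// Otherwise wait while any other node is draining.
	for name, state := range nodeStates {
		if name != self && isDrainingState(state) {
			return false
		}
	}
	return true
}

func main() {
	states := map[string]string{"node-a": "Draining", "node-b": "Idle"}
	fmt.Println(canProceedWithDrain("node-a", states)) // true: resumes its own drain
	fmt.Println(canProceedWithDrain("node-b", states)) // false: node-a is draining

	states["node-a"] = "Idle" // node-a finished
	fmt.Println(canProceedWithDrain("node-b", states)) // true
}
```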

pkg/daemon/daemon.go

In OCP we prepare a node for draining in two steps. First we win the leader election and annotate the node as Draining; then we pause the MCP and mark the node as MCP paused.

If the config_daemon is restarted between the first step and the second, it gets stuck: another node wins the leader election but does not mark itself as Draining, because a node is already draining. Meanwhile, the node that carries the Draining annotation tries to take the drain lock again, but the other node holds it.

With this change, a node that already has the Draining or MCP paused annotation does not try to take the lock again and simply continues after the restart.

Signed-off-by: Sebastian Sch <[email protected]>

SchSeba · 2023-06-13T15:51:53Z

Hi @adrianchiris please take another look when you have time :)

adrianchiris

LGTM

SchSeba requested review from e0ne and zeeke June 11, 2023 13:25

SchSeba force-pushed the fix_lock_race branch from e81d9ad to bee5a5a Compare June 11, 2023 15:01

adrianchiris reviewed Jun 12, 2023

adrianchiris reviewed Jun 12, 2023

adrianchiris reviewed Jun 13, 2023

adrianchiris reviewed Jun 13, 2023

SchSeba force-pushed the fix_lock_race branch from bee5a5a to f3845a7 Compare June 13, 2023 15:45

adrianchiris approved these changes Jun 13, 2023

adrianchiris merged commit e84d536 into k8snetworkplumbingwg:master Jun 14, 2023
