Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: systemd: make corosync wait for sbd-start to complete #74

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vvidic
Copy link

@vvidic vvidic commented Apr 14, 2019

If sbd fails to start it prevents pacemaker from starting, but corosync
continues to start without errors. This generates a quorum vote for the
current node although the sbd and pacemaker components are not alive.

@wenningerk
Copy link

wenningerk commented Apr 15, 2019

Are you sure that this is working? We had a lengthy discussion about similar topics some time ago on the clusterlabs-list. Currently sbd is being started as part-of corosync which makes it start/stop in parallel providing that there is never sbd without corosync.
Actually I don't see an issue behind corosync and sbd starting up while pacemaker is not.
I would rather see it as a feature that the corosync-instance can be used as a vote-provider.
What is important is that pacemaker never starts up without sbd coming up properly - which
we achieved with the most recent changes in this field.

Do you have a scenario in mind where this behavior (assuming it is working as expected)
would really be desirable/needed?

Where I think we would need a similar pattern is when we look at pacemaker-remote. There it
is fatal that pacemaker-remote still can come up while sbd is failing. I remember having played
around with it and I didn't manage to get the desired behavior easily. (sbd starting as part-of
pacemaker-remote but pacemaker-remote at least not surviving without sbd coming up
properly)

@vvidic
Copy link
Author

vvidic commented Apr 15, 2019

The sbd.service file in 1.4.0 release has:

[Unit]
Before=pacemaker.service
Before=dlm.service

[Install]
RequiredBy=corosync.service
RequiredBy=pacemaker.service
RequiredBy=dlm.service

Not sure why is corosync missing from Before list?

@wenningerk
Copy link

wenningerk commented Apr 15, 2019

Because corosync doesn't do anything critical that might lead to split-brain if running unwatched by sbd ...
And iirc PartOf and Before don't play together as expected. (Well - expected - from the meaning of the words they anyway already kind of contradict each other ...)
Critical part that needs to be observed by sbd at any time is pacemaker (where there is room for improvement btw. anyway ...).
And as said before even if sbd has issues coming up it might be seen desirable that a node can still serve as quorum-participant although starting of resources on that node is gonna be prevented of course.

Not saying our current way to use systemd enforcement of the startup-sequence and -requirements is ideal. As already discussed in the thread I was referring to before a target something like ready-for-resource-manager sounds interesting. But again in my current world sbd & corosync would still be part of that target ...

@vvidic
Copy link
Author

vvidic commented Apr 15, 2019

I see, but than the line RequiredBy=corosync.service is not needed if sbd is only to be used with pacemaker?

@wenningerk
Copy link

Well sounds reasonable but iirc that still was needed for some part of the synchronized stopping/starting of corosync & sbd. But I might be wrong here - have to refresh / proof wrong my memory ;-)

@vvidic
Copy link
Author

vvidic commented Apr 15, 2019

Don't know, but if sbd and corosync start in parallel than there is no real dependency between them because they can start in any order. Instead starting or stopping pacemaker will take sbd with it (and fail pacemaker if sbd start fails).

@wenningerk
Copy link

Didn't have time to dive into that again yet.
But adding a reference to previous discussion to start with #39 Startup Issues

@knet-ci-bot
Copy link

Can one of the admins verify this patch?

@wenningerk
Copy link

As corosync is contributes to quorum-voting I guess it should be observed by sbd in some way.
Not saying it is perfect as it currently is though ...
We've meanwhile added periodic liveness-checks of corosync to sbd and a proper detection of graceful shutdown is in discussion.
I'd rather like to track these activities under a different title and close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants