Run evacuation asynchronously, so it can evacuate 30+ vms #13

Open
wants to merge 1 commit into
base: master

dawiddeja
Contributor

This brings back asynchronous evacuation of VMs. Without it, the timeout is reached when there are a lot of instances on the dead host.

Also, move the 'wait for nova to update its internal state' part outside the fencing script, which should cover the simultaneous compute and controller failure problem. Even if we want to resolve that another way, the evacuation itself cannot run inside the fencing script, because for a lot of VMs it can take a very long time.

Also fixing 'simultaneous controller and compute fail' problem
dawiddeja referenced this pull request May 25, 2015
Since fence_compute is already a Python module, there is no reason to call
a subprocess Thread for running Nova CLI but rather we should directly use
the stable Nova client API.
@beekhof
Owner

beekhof commented May 27, 2015

What about this instead?

diff --git a/pcmk/fence_compute b/pcmk/fence_compute
index 34c2ed1..be88639 100755
--- a/pcmk/fence_compute
+++ b/pcmk/fence_compute
@@ -116,6 +116,10 @@ def set_power_status(_, options):
         on_shared_storage = True
     else:
         on_shared_storage = False
+
+    if os.fork():
+        return
+
     _host_evacuate(options["--plug"], on_shared_storage)

     return
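
A minimal sketch of the fork-and-return idea in the diff above, written as a standalone helper. `run_in_background` is a hypothetical name and not part of fence_compute. Unlike the diff, where the child returns from `set_power_status` once the evacuation finishes, this sketch assumes the child should exit with `os._exit()` so it never runs the rest of the caller's code.

```python
import os


def run_in_background(task, *args):
    """Fork once: the parent returns the child's PID, the child runs task and exits.

    Hypothetical helper for illustration only (POSIX only, relies on os.fork).
    """
    pid = os.fork()
    if pid:
        # Parent: return immediately so the caller is not blocked by the task.
        return pid

    # Child: run the long task, then exit without returning into the caller.
    status = 1
    try:
        task(*args)
        status = 0
    finally:
        # os._exit skips cleanup handlers that belong to the parent process.
        os._exit(status)
```

With this shape the parent can report success right away while the child keeps working. It does not by itself address whether returning early is safe.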

@mdbooth

mdbooth commented May 27, 2015

Unfortunately I don't think this is safe due to a race condition in Nova following evacuate. Until this race is fixed, the logic we need from pacemaker is:

  • After evacuate, compute must not come up until evacuate completes
  • On normal start, compute must come up immediately

Due to a design deficiency in Nova, it is not possible to distinguish between an instance in the middle of a rebuild and an instance being evacuated. This means that in order to reliably detect that instances have been evacuated, we need the additional context that an evacuation is in progress, and we are waiting for its completion. I don't believe this context exists when enabling a resource, so I don't believe we can reliably do this check there. I believe the only place it exists is in the fence script itself, which therefore must block until all instances have been evacuated. If this hits a timeout, I think we have to extend the timeout. If it's possible to programmatically extend the timeout when we detect liveness, that could be more robust. This sucks, but we can improve it when Nova is fixed.
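
The "block until evacuated, extend the timeout on liveness" idea could look roughly like the sketch below. Everything here is an assumption for illustration: `wait_for_evacuation` and `count_pending` are hypothetical names, and `count_pending` stands in for whatever check reports how many instances are still waiting to be evacuated. Whether Pacemaker lets a fence agent extend its own timeout is not settled here. The sketch only shows the progress-based deadline logic.

```python
import time


def wait_for_evacuation(count_pending, idle_timeout=60.0, poll_interval=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Block until count_pending() reaches 0.

    Returns True when nothing is pending, or False if idle_timeout seconds
    pass with no decrease in the pending count. count_pending is a
    hypothetical callable supplied by the caller.
    """
    last = count_pending()
    deadline = clock() + idle_timeout
    while last > 0:
        if clock() >= deadline:
            # No progress within the idle window: give up.
            return False
        sleep(poll_interval)
        current = count_pending()
        if current < last:
            # Progress observed: treat it as liveness and push the deadline out.
            deadline = clock() + idle_timeout
        last = current
    return True
```

The deadline is measured from the last observed progress rather than from the start, so a host with many instances is not failed just for taking long, only for stalling.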

@aspiers

aspiers commented Jan 15, 2018

@beekhof Is it worth revisiting this, or shall we just close?
