Run evacuation asynchronously, so it can evacuate 30+ vms #13

dawiddeja · 2015-05-22T10:05:06Z

This brings back asynchronous evacuation of VMs. Without it, the timeout is reached when there are a lot of instances on the dead host.

Also, move the 'wait for nova to update its internal state' part outside the fencing script, so that it should also cover the simultaneous compute and controller failure problem. Even if we want to resolve that another way, evacuation itself cannot run inside the fencing script, because for a lot of VMs it can take a very long time.

Also fixes the 'simultaneous controller and compute failure' problem.

Since fence_compute is already a Python module, there is no reason to run the Nova CLI through a subprocess thread; we should use the stable Nova client API directly.
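
For illustration, evacuating a host through the client API rather than the CLI might look roughly like the sketch below. It is not the code from this PR. The function name `host_evacuate` and the injected `nova` object are assumptions; `nova` is expected to behave like a python-novaclient v2 client, whose `servers.list(search_opts=...)` and `servers.evacuate(...)` calls are used here. The `on_shared_storage` argument is only accepted by older compute API microversions, so treat its use as an assumption about the deployment.

```python
def host_evacuate(nova, host, on_shared_storage):
    """Request evacuation of every server on `host`.

    `nova` is assumed to be a python-novaclient v2 client (or any object
    exposing the same `servers.list` / `servers.evacuate` calls).
    Returns the IDs of the servers for which evacuation was requested.
    """
    # List instances on the failed host across all tenants.
    servers = nova.servers.list(search_opts={"host": host, "all_tenants": 1})

    requested = []
    for server in servers:
        # Let the scheduler pick the target host (no explicit `host=`).
        nova.servers.evacuate(server, on_shared_storage=on_shared_storage)
        requested.append(server.id)
    return requested
```

Passing the client in as a parameter keeps authentication and client construction out of the sketch and makes the function easy to exercise with a stand-in object.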

beekhof · 2015-05-27T03:02:09Z

What about this instead?

diff --git a/pcmk/fence_compute b/pcmk/fence_compute
index 34c2ed1..be88639 100755
--- a/pcmk/fence_compute
+++ b/pcmk/fence_compute
@@ -116,6 +116,10 @@ def set_power_status(_, options):
         on_shared_storage = True
     else:
         on_shared_storage = False
+
+    if os.fork():
+        return
+
     _host_evacuate(options["--plug"], on_shared_storage)

     return
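
For readers unfamiliar with the pattern in the diff, here is a minimal standalone sketch of fork-and-return, assuming a POSIX system. The helper name `run_in_background` is hypothetical and not part of fence_compute. Unlike the diff, where the child returns into the agent's normal flow after evacuating, this sketch has the child leave via `os._exit` so it never runs the parent's remaining code; that is a design choice of the sketch, not something stated in the thread.

```python
import os


def run_in_background(task, *args):
    """Fork; the parent returns the child's PID at once, the child runs `task`.

    The child always terminates with os._exit so that it never falls
    through into whatever the parent does after this call.
    """
    pid = os.fork()
    if pid:
        # Parent: do not wait for the long-running work.
        return pid

    # Child: run the task, then exit without running parent cleanup handlers.
    status = 1
    try:
        task(*args)
        status = 0
    finally:
        os._exit(status)
```

The sketch only shows the process mechanics. It does nothing to guarantee that the background work has finished before anything else depends on it.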

mdbooth · 2015-05-27T09:10:43Z

Unfortunately I don't think this is safe due to a race condition in Nova following evacuate. Until this race is fixed, the logic we need from pacemaker is:

After evacuate, compute must not come up until evacuate completes
On normal start, compute must come up immediately

Due to a design deficiency in Nova, it is not possible to distinguish between an instance in the middle of a rebuild and an instance being evacuated. This means that in order to reliably detect that instances have been evacuated, we need the additional context that an evacuation is in progress and that we are waiting for its completion. I don't believe this context exists when enabling a resource, so I don't believe we can reliably do this check there. I believe the only place it exists is in the fence script itself, which therefore must block until all instances have been evacuated. If this hits a timeout, I think we have to extend the timeout. If it's possible to programmatically extend the timeout when we detect liveness, that could be more robust. This sucks, but we can improve it when Nova is fixed.
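
The "block until done, but extend the timeout while there is liveness" idea can be sketched as a polling loop whose deadline is pushed back each time progress is observed. Everything here is hypothetical: `wait_for_evacuation`, the `count_pending` callable (which would have to report how many instances are still being evacuated), and the default intervals are illustrative names and values, not part of fence_compute or Nova. The sketch also says nothing about whether Pacemaker allows a fence agent to extend its own timeout.

```python
import time


def wait_for_evacuation(count_pending, idle_timeout=300.0, poll_interval=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Block until count_pending() reports 0 instances left to evacuate.

    The deadline moves forward whenever the pending count drops, so the
    wait is only abandoned after `idle_timeout` seconds without progress.
    Returns True when evacuation completed, False when it stalled.
    """
    last = count_pending()
    deadline = clock() + idle_timeout
    while last > 0:
        if clock() >= deadline:
            return False  # no progress for too long
        sleep(poll_interval)
        current = count_pending()
        if current < last:
            # Liveness detected: extend the deadline.
            deadline = clock() + idle_timeout
        last = current
    return True
```

With this shape, a large host that keeps making progress is never cut off by a fixed limit, while a genuinely stuck evacuation still fails after a bounded idle period.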

aspiers · 2018-01-15T16:55:27Z

@beekhof Is it worth revisiting this, or shall we just close?

Commit 1cc5bff: Run evacuation asynchronously, so it can evacuate 30+ vms

dawiddeja referenced this pull request May 25, 2015: Use Novaclient API for fence_compute (2c1f9ab)
