From 53d7d54d5a33857c3331f0dbc44eadf6d092c90c Mon Sep 17 00:00:00 2001
From: "Gao,Yan"
Date: Tue, 10 Mar 2015 16:02:33 +0100
Subject: [PATCH] Fix: crmd: Reset stonith failcount to recover transitioner when the node rejoins

The crmd transitioner could not recover from "Too many failures to
fence".

Steps to reproduce:
1. Two-node cluster with stonith, for example using IPMI.
2. Node-1 has a complete power outage for a couple of minutes. The IPMI
   device is also without power, which causes the fencing to fail.
3. Node-2 tries to fence node-1 several times but fails.
4. Node-2 reports "Too many failures to fence node-1 (11), giving up".
5. The power returns and node-1 boots up normally.
6. Node-1 rejoins the cluster, but resources are not started on it.

Expected result:
The stonith failcount for node-1 should be reset and resources should be
started on node-1.

Actual result:
Node-2 still logs "Too many failures to fence" and resources are not
started on node-1.
---
 crmd/callbacks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index eae222324b3..cb1134e9be8 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -204,6 +204,9 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
             if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
                 crm_info("Node return implies stonith of %s (action %d) completed", node->uname,
                          down->id);
+
+                st_fail_count_reset(node->uname);
+
                 erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
                 erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
                 /* down->confirmed = TRUE; Only stonith-ng returning should imply completion */