From 53d7d54d5a33857c3331f0dbc44eadf6d092c90c Mon Sep 17 00:00:00 2001
From: "Gao,Yan" <ygao@suse.com>
Date: Tue, 10 Mar 2015 16:02:33 +0100
Subject: [PATCH] Fix: crmd: Reset stonith failcount to recover transitioner
 when the node rejoins

The crmd transitioner could not recover after giving up with "Too many
failures to fence", even once the target node had rejoined the cluster.

Steps to reproduce:

1. Two-node cluster with stonith, for example using IPMI.
2. Node-1 has a complete power outage for a couple of minutes. The
IPMI device is also without power, which causes fencing to fail.
3. Node-2 tries to fence node-1 several times but fails.
4. Node-2 reports "Too many failures to fence node-1 (11), giving up".
5. The power returns and node-1 boots up normally.
6. Node-1 rejoins the cluster, but resources are not started on it.

Expected result:
The stonith failcount for node-1 should be reset and resources should
be started on node-1.

Actual result:
Node-2 still logs "Too many failures to fence" and resources are not
started on node-1.
---
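Note: the failure mode described above can be illustrated with a minimal
sketch of a per-node fencing fail counter. This is not Pacemaker's
implementation. All names below (fence_failed, too_many_fence_failures,
fence_fail_count_reset, fail_table) are hypothetical, and the limit of 10
is an assumption chosen only to be consistent with the "(11), giving up"
log message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names and limits; not Pacemaker's actual data structures. */
#define MAX_NODES 16
#define MAX_NAME 64
#define FENCE_FAIL_LIMIT 10 /* assumed: give up once the count exceeds this */

struct fail_entry {
    char uname[MAX_NAME];
    int count;
};

static struct fail_entry fail_table[MAX_NODES];
static int fail_entries = 0;

/* Look up the counter for a node name; NULL if none has been recorded. */
static struct fail_entry *find_entry(const char *uname)
{
    for (int i = 0; i < fail_entries; i++) {
        if (strcmp(fail_table[i].uname, uname) == 0) {
            return &fail_table[i];
        }
    }
    return NULL;
}

/* Record one failed fencing attempt. Returns the new count, or -1 if the
 * table is full. */
int fence_failed(const char *uname)
{
    assert(uname != NULL);
    struct fail_entry *e = find_entry(uname);

    if (e == NULL) {
        if (fail_entries == MAX_NODES) {
            return -1;
        }
        e = &fail_table[fail_entries++];
        snprintf(e->uname, sizeof(e->uname), "%s", uname);
        e->count = 0;
    }
    return ++e->count;
}

/* Non-zero once the node has failed fencing more than the assumed limit. */
int too_many_fence_failures(const char *uname)
{
    struct fail_entry *e = find_entry(uname);
    return e != NULL && e->count > FENCE_FAIL_LIMIT;
}

/* Clear the counter, e.g. when the node is seen alive again. */
void fence_fail_count_reset(const char *uname)
{
    struct fail_entry *e = find_entry(uname);
    if (e != NULL) {
        e->count = 0;
    }
}
```

In this sketch, after eleven failed attempts too_many_fence_failures("node-1")
stays true until fence_fail_count_reset("node-1") is called. Without a reset
on rejoin, the counter never drops, which mirrors the "Actual result" above.
The st_fail_count_reset(node->uname) call added by this patch plays the
analogous role when the node returns.
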
 crmd/callbacks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index eae222324b3..cb1134e9be8 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -204,6 +204,9 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
             if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
                 crm_info("Node return implies stonith of %s (action %d) completed", node->uname,
                          down->id);
+
+                st_fail_count_reset(node->uname);
+
                 erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
                 erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
                 /* down->confirmed = TRUE; Only stonith-ng returning should imply completion */