Fix: crmd: Reset stonith failcount to recover transitioner when the node rejoins

CRMd transitioner could not recover from "Too many failures to fence".

Steps to reproduce:

1. Two-node cluster with stonith, for example using IPMI.
2. Node-1 has a complete power outage for a couple of minutes. The
   IPMI device is also without power, which causes fencing to fail.
3. Node-2 tries to fence node-1 several times but fails each time.
4. Node-2 reports "Too many failures to fence node-1 (11), giving up".
5. The power returns and node-1 boots up normally.
6. Node-1 rejoins the cluster, but resources are not started on it.

Expected result:
The stonith failcount for node-1 should be reset and resources should
be started on node-1.

Actual result:
Node-2 still logs "Too many failures to fence" and resources are not
started on node-1.
gao-yan committed Mar 10, 2015
1 parent 72223f6 commit 53d7d54
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions crmd/callbacks.c
@@ -204,6 +204,9 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
         if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
             crm_info("Node return implies stonith of %s (action %d) completed", node->uname,
                      down->id);
+
+            st_fail_count_reset(node->uname);
+
             erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
             erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
             /* down->confirmed = TRUE; Only stonith-ng returning should imply completion */
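
The effect of the change can be illustrated with a small standalone model. This is only a sketch and not the crmd implementation: the table layout, the `sketch_*` function names, and the `fence_pending` flag are all hypothetical, and the threshold of 10 is an assumption inferred from the "(11), giving up" log message. It shows a per-node failure counter that blocks further fencing attempts once it exceeds the threshold, and that is cleared when a node with a pending fence action is seen alive again.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 8            /* hypothetical table size */
#define MAX_FENCE_FAILURES 10  /* assumed threshold, inferred from "(11), giving up" */

/* Hypothetical per-node record of failed fencing attempts. */
struct fail_entry {
    char uname[64];
    int count;
};

static struct fail_entry fail_table[MAX_NODES];

/* Find the entry for uname; if create is non-zero, allocate one when missing.
 * Returns NULL when the node is unknown (and not created) or the table is full. */
static struct fail_entry *lookup(const char *uname, int create)
{
    struct fail_entry *free_slot = NULL;

    assert(uname != NULL && uname[0] != '\0');
    for (int i = 0; i < MAX_NODES; i++) {
        if (fail_table[i].uname[0] == '\0') {
            if (free_slot == NULL) {
                free_slot = &fail_table[i];
            }
        } else if (strcmp(fail_table[i].uname, uname) == 0) {
            return &fail_table[i];
        }
    }
    if (create && free_slot != NULL) {
        snprintf(free_slot->uname, sizeof(free_slot->uname), "%s", uname);
        free_slot->count = 0;
        return free_slot;
    }
    return NULL;
}

/* Record one failed fencing attempt against a node. */
void sketch_fail_count_increment(const char *uname)
{
    struct fail_entry *e = lookup(uname, 1);

    if (e != NULL) {
        e->count++;
    }
}

/* Forget all recorded fencing failures for a node. */
void sketch_fail_count_reset(const char *uname)
{
    struct fail_entry *e = lookup(uname, 0);

    if (e != NULL) {
        e->count = 0;
    }
}

/* Non-zero once the node has failed to fence more than the threshold allows. */
int sketch_too_many_failures(const char *uname)
{
    struct fail_entry *e = lookup(uname, 0);

    return e != NULL && e->count > MAX_FENCE_FAILURES;
}

/* Simplified stand-in for the membership callback: a node that comes back
 * while a fence action against it is still pending gets its counter cleared. */
void sketch_peer_update(const char *uname, int alive, int fence_pending)
{
    if (alive && fence_pending) {
        sketch_fail_count_reset(uname);
    }
}
```

In this model, eleven calls to `sketch_fail_count_increment("node-1")` make `sketch_too_many_failures("node-1")` return non-zero, which corresponds to step 4 above. A later `sketch_peer_update("node-1", 1, 1)` clears the counter, so the node is no longer treated as unfenceable after it rejoins.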