The initial error that we faced during our online storage upgrade:

```
Command: x Exascaler Install: apply_lustre_params,create_udev_rules,email,emf_agent,emf_node_manager,ha,hosts,ipmi,kdump,logging,lustre,lvm,mdt_backup,modprobe,nics,ntp,os,ost_pools,restart_network,serial,start_cluster,sync_exa_toml (Config ver. 1) failed
User: api
Job: x es-install --steps start_cluster on node5 failed
Step: x Run config-pacemaker on node5 failed (took: 12s 534ms 171us 586ns)
Result (Error): Bad Exit Code: 1.
Started: 2024-02-07T03:26:16.158Z
Ended: 2024-02-07T03:26:28.692Z
Stdout: Running Command: config-pacemaker --unmanaged-emf
Stderr: x Command has failed.
Code: exit status: 1
Stdout: INFO: cib.commit: committed '5e8558de-1ceb-46c2-bd70-1ab4d8504c9f' shadow CIB to the cluster
Stderr: WARNING: DC lost during wait
```

Basically, the source of our problems is shown below (case 3: DC election or voting during cluster recalculation):

```
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
Designated Controller is: es-2-virt1
0
[root@es-1-virt1 ~]# crm cluster stop
INFO: The cluster stack stopped on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: Could not connect to controller: Connection refused
error: Command failed: Connection refused
102
[root@es-1-virt1 ~]# crm cluster start
INFO: The cluster stack started on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: No reply received from controller before timeout (1000ms)
error: Command failed: Connection timed out
124
```

Potentially, we have an infinite loop in dc_waiter, but that would also mean that Pacemaker itself is stuck in the same state, and in the worst case the wait should not last longer than 'dc-deadtime'.
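To illustrate the waiting behaviour described above, here is a minimal sketch of a DC waiter that polls `crmadmin -D -t 1` and gives up after a fixed deadline instead of looping forever. It is not the actual dc_waiter code from this commit. The names `probe_dc` and `wait_for_dc`, the polling interval, and the 20-second default (chosen to mirror what is assumed to be Pacemaker's default `dc-deadtime`) are all assumptions made for illustration.

```python
import subprocess
import time


def probe_dc():
    """Run the same query as in the transcript and return its exit code.

    In the transcript: 0 means a DC was reported, 102 means the controller
    refused the connection, 124 means the query timed out.
    """
    result = subprocess.run(
        ["crmadmin", "-D", "-t", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def wait_for_dc(probe=probe_dc, deadline_s=20.0, interval_s=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Poll until a DC is reported or the deadline passes.

    deadline_s is a hypothetical parameter; 20 s is an assumed stand-in
    for the cluster's dc-deadtime. Returns True if a DC was seen.
    """
    end = clock() + deadline_s
    while True:
        if probe() == 0:
            return True          # DC elected, safe to continue
        if clock() >= end:
            return False         # bounded wait: never loop forever
        sleep(interval_s)
```

A caller could treat a `False` result as a hard failure of the `start_cluster` step, or retry once more, depending on how strict the installer should be. The probe, clock, and sleep functions are injectable only so the loop can be exercised without a running cluster.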