Corosync/Pacemaker metrics issue on one node in cluster #245

Open
ivicavujovic opened this issue Feb 22, 2024 · 5 comments
Labels
bug Something isn't working need research prio/high

Comments

@ivicavujovic

Hi, we have several three-node clusters running Corosync/Pacemaker. ha_cluster_exporter is installed on all of them and works fine, except on one node in one cluster, where opening the metrics URL returns this error:

An error has occurred while serving metrics:

collected metric "ha_cluster_corosync_member_votes" { label:<name:"local" value:"false" > label:<name:"node" value:"NR" > label:<name:"node_id" value:"32566" > gauge:<value:3 > } was collected before with the same name and label values

I checked all nodes in the cluster, and all of them have different IDs:

  • node1
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="2"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32636"} 3
ha_cluster_corosync_member_votes{local="true",node="node1.infra.env",node_id="1"} 1
  • node2
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="1"} 1
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="3"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32652"} 2

Logs are very similar on all of the nodes:

level=info msg="Starting ha_cluster_exporter (version=1.3.0+git.1653405719.2a65dfc, branch=HEAD, revision=2a65dfc015e614e53f34effbd0847cc20317b952)"
level=info msg="Build context (go=go1.16.15, user=runner@fv-az341-182, date=20220524-15:44:13)"
level=warn msg="Reading config file failed" err="Config File \"ha_cluster_exporter\" Not Found in \"[/ /root/.config /etc /usr/etc]\""
level=info msg="Default config values will be used"
level=warn msg="Registration failure" err="could not initialize 'sbd' collector: '/usr/sbin/sbd' does not exist"
level=warn msg="Registration failure" err="could not initialize 'drbd' collector: '/sbin/drbdsetup' does not exist"
level=info msg="pacemaker collector registered."
level=info msg="corosync collector registered."
level=info msg="Serving metrics on :9664/metrics"
level=warn msg="Reading web config file failed" err="stat /etc/ha_cluster_exporter.web.yaml: no such file or directory"
level=info msg="Default web config or commandline values will be used"
level=info msg="TLS is disabled." http2=false

All nodes have the same configuration (OS, HDD, RAM, CPU) and are built and provisioned using Puppet configuration management.

The service file is very simple:

[Unit]
Description=Prometheus ha_cluster_exporter
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root

ExecStart=/usr/local/bin/ha_cluster_exporter

ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=always

[Install]
WantedBy=multi-user.target

For the love of God, I cannot find what the issue could be here. Did we misconfigure something, or did we miss a step? Nothing special is set; we just install the exporter and run it.

The OS is Debian 11 and the exporter version is 1.3.3 (the same issue occurs with older versions too).

@maomaoaichirou

This is a bug.

@stefanotorresi
Member

Thanks for your bug report. This is definitely not supposed to happen.
Could you please report the output of corosync-quorumtool -p on both nodes?

@ivicavujovic
Author

Yes, here is the output from all three nodes:

root@node1:~# corosync-quorumtool -p
Quorum information
------------------
Date:             Tue Mar  5 14:11:15 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1.149
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR node1.infra.env (local)
         2          1         NR XXXX:YYYY:ZZZZ:QQQQ::62%32695
         3          1         NR XXXX:YYYY:ZZZZ:QQQQ::63%32695
root@node2:~# corosync-quorumtool -p
Quorum information
------------------
Date:             Tue Mar  5 14:11:56 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          2
Ring ID:          1.149
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR XXXX:YYYY:ZZZZ:QQQQ::61%32620
         2          1         NR node2.infra.env (local)
         3          1         NR XXXX:YYYY:ZZZZ:QQQQ::63%32620
root@node3:~# corosync-quorumtool -p
Quorum information
------------------
Date:             Tue Mar  5 14:12:31 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          3
Ring ID:          1.149
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR XXXX:YYYY:ZZZZ:QQQQ::61%32728
         2          1         NR XXXX:YYYY:ZZZZ:QQQQ::62%32728
         3          1         NR node3.infra.env (local)
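
One reading of the tables above, offered as a hypothesis rather than a confirmed diagnosis: the peer names are IPv6 addresses ending in `%<number>`, and a parser whose whitespace matches can run across line breaks could pair that trailing number with the next row. The Python sketch below applies a made-up `LOOSE` pattern (not taken from the exporter source) to the node3 table. It yields two entries with the same `node="NR"`, `node_id="32728"` labels, which is the shape of the duplicate in the reported error. `STRICT` is an equally hypothetical line-anchored pattern that keeps one entry per row. Both assume the Qdevice column is always present, as it is in this output.

```python
import re
from collections import Counter

# Membership table as printed on node3 (addresses redacted as in the report).
NODE3_MEMBERSHIP = """\
    Nodeid      Votes    Qdevice Name
         1          1         NR XXXX:YYYY:ZZZZ:QQQQ::61%32728
         2          1         NR XXXX:YYYY:ZZZZ:QQQQ::62%32728
         3          1         NR node3.infra.env (local)
"""

# HYPOTHETICAL pattern, not the exporter's: id, votes, optional qdevice flag,
# name, optional "(local)". "\s+" also matches newlines, and the name class
# excludes ":" and "%", so matching can resume in the middle of an IPv6 name.
LOOSE = re.compile(r"(\d+)\s+(\d+)\s+(?:(\w+)\s+)?([\w.-]+)(?:\s(\(local\)))?")

# HYPOTHETICAL alternative: anchored to a single line, separators limited to
# spaces and tabs, and the name taken as one whole non-space token.
STRICT = re.compile(
    r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\S+)[ \t]+(\S+)(?:[ \t]+(\(local\)))?[ \t]*$",
    re.MULTILINE,
)


def parse(pattern, text):
    """Return (node_id, votes, name, is_local) for every match of pattern."""
    return [
        (m.group(1), int(m.group(2)), m.group(4), m.group(5) is not None)
        for m in pattern.finditer(text)
    ]


def duplicate_label_sets(members):
    """Return the (node, node_id, local) label sets that occur more than once."""
    counts = Counter(
        (name, node_id, is_local) for node_id, _votes, name, is_local in members
    )
    return [labels for labels, n in counts.items() if n > 1]


if __name__ == "__main__":
    for label, pattern in (("loose", LOOSE), ("strict", STRICT)):
        members = parse(pattern, NODE3_MEMBERSHIP)
        print(label, members, "duplicates:", duplicate_label_sets(members))
```

Applied to the node1 table, the same loose pattern gives a single `NR` entry with 3 votes and drops node 3, which resembles the metrics listed for node1 in the first comment. Whether the exporter actually parses the output this way would need to be checked against its source.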

@stefanotorresi
Member

Thanks, I will look into it sometime over the coming weeks and let you know.

@ivicavujovic
Author

Thanks a lot for the effort.
