Proposal: Extend network bonding metrics #1604

Open

hoffie opened this issue Feb 15, 2020 · 9 comments

Comments
@hoffie
Contributor

hoffie commented Feb 15, 2020

node_exporter currently exposes details about network bonding, which is great. To monitor more failure cases, we would need additional metrics that we haven't found in node_exporter yet:

  • An association between network devices and bonding devices: This would be useful to detect mixed-speed ports in the same bond (which is considered a misconfiguration in our case).
  • LACP (802.3ad) link aggregation status: Currently, there seems to be no way to detect if LACP has been negotiated between the machine and the switch on all ports.

Essentially, we would need the following metrics:

# HELP node_bonding_slave_info Contains association between bonding master device and associated slaves. Value is always 1.
# TYPE node_bonding_slave_info gauge
## from /sys/class/net/<master>/bonding/slaves
# - alert: LinuxPlatformBondMixedPortSpeeds
#   expr: min(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) != max(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) and on(instance, master) node_bonding_active == node_bonding_slaves
node_bonding_slave_info{master="bond0",device="eno1"} 1
node_bonding_slave_info{master="bond0",device="eno2"} 1
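
For illustration, the parsing side of this is minimal. Here is a sketch of a procfs-style helper in Go; the package and function names are illustrative, not an existing procfs API:

package sysfs

import (
	"os"
	"path/filepath"
	"strings"
)

// BondingSlaves returns the slave device names for each bonding master,
// read from /sys/class/net/<master>/bonding/slaves.
func BondingSlaves(masters []string) (map[string][]string, error) {
	slaves := make(map[string][]string)
	for _, master := range masters {
		p := filepath.Join("/sys/class/net", master, "bonding", "slaves")
		b, err := os.ReadFile(p)
		if err != nil {
			return nil, err
		}
		// The file is a single space-separated line, e.g. "eno1 eno2".
		slaves[master] = strings.Fields(string(b))
	}
	return slaves, nil
}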

# HELP node_bonding_ad_num_ports Number of visible LACP/AD ports
# TYPE node_bonding_ad_num_ports gauge
## from /sys/devices/virtual/net/<master>/bonding/ad_num_ports
## - alert: BondLacpPortsDegraded
##   expr: node_bonding_ad_num_ports != node_bonding_slaves
node_bonding_ad_num_ports{master="bond0"} 2

It could also make sense to add some more information at the same time. So far, we haven't needed these for our alerting, but they may be useful nevertheless:

# HELP node_bonding_mode Kind of bonding (e.g. 1 for active-backup, 4 for lacp/ad)
# TYPE node_bonding_mode gauge
## from /sys/class/net/<master>/bonding/mode
## might be useful to have different monitoring for active-backup, lacp, etc.
node_bonding_mode{master="bond0"} 4

# HELP node_bonding_ad_info Contains the LACP/AD partner_mac. Value is always 1.
# TYPE node_bonding_ad_info gauge
## from /sys/devices/virtual/net/<master>/bonding/ad_partner_mac
## might be useful for detecting bad lacp configs (zero MAC) or building stats about switch MACs
node_bonding_ad_info{master="bond0",partner_mac="00:00:00:00:00:00"} 1

# HELP node_bonding_ad_port_partner_oper_port_state Bitfield denoting LACP/AD port state for the partner
# TYPE node_bonding_ad_port_partner_oper_port_state gauge
## from /sys/class/net/bond0/slave_eno1/bonding_slave/ad_partner_oper_port_state
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno2"} 61

# HELP node_bonding_ad_port_actor_oper_port_state Bitfield denoting LACP/AD port state for the actor
# TYPE node_bonding_ad_port_actor_oper_port_state gauge
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno2"} 61
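
For reference, these values are bitfields whose flags follow the standard LACP port state layout (as in the kernel's bond_3ad.h): bit 0 LACP activity, bit 1 LACP timeout, bit 2 aggregation, bit 3 synchronization, bit 4 collecting, bit 5 distributing, bit 6 defaulted, bit 7 expired. A small Go sketch decoding the example value 61, assuming that layout:

package main

import "fmt"

// LACP port state flags per IEEE 802.3ad (assumed layout, see above).
const (
	lacpActivity    = 1 << 0
	lacpTimeout     = 1 << 1
	aggregation     = 1 << 2
	synchronization = 1 << 3
	collecting      = 1 << 4
	distributing    = 1 << 5
	defaulted       = 1 << 6
	expired         = 1 << 7
)

func main() {
	state := 61 // 0b00111101, the example value above
	fmt.Println("in sync:", state&synchronization != 0)   // true
	fmt.Println("collecting:", state&collecting != 0)     // true
	fmt.Println("distributing:", state&distributing != 0) // true
	fmt.Println("expired:", state&expired != 0)           // false
}

In an alerting rule, a single bit can be extracted with integer arithmetic, e.g. floor(node_bonding_ad_port_partner_oper_port_state / 8) % 2 == 0 to catch a partner that is not in sync.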

We currently use a shell script and node_exporter's textfile collector to fill this gap. However, I think it would be useful to support these metrics out of the box, especially since only /sys files need to be read.

I'd volunteer to work on PRs against procfs and node_exporter. I would suggest adding this to the existing bonding_linux.go, as it is closely related. It looks like this would also imply converting the existing bonding collector to procfs.

Related: #841

@SuperQ @discordianfish @pgier What do you think? Does it make sense in general? Should we implement only the first two metrics, or take the more generic approach? Do the suggested names make sense?

@SuperQ
Member

SuperQ commented Feb 18, 2020

Most of that makes sense to me. The only thing I would probably leave out is node_bonding_ad_info; the MAC address labels might cause cardinality issues.

One other minor labeling suggestion would be to call it master_device to avoid ambiguity.

It might also be useful to make node_bonding_mode an enum type:

node_bonding_mode{master_device="bond0",mode="balance-rr"} 0
node_bonding_mode{master_device="bond0",mode="active-backup"} 1
node_bonding_mode{master_device="bond0",mode="802.3ad"} 0
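
For illustration, here's a sketch of how such an enum metric could be emitted with the Prometheus Go client (collector wiring omitted; the mode list follows the kernel bonding documentation, the rest is an assumption):

package collector

import "github.com/prometheus/client_golang/prometheus"

var bondingModeDesc = prometheus.NewDesc(
	"node_bonding_mode",
	"Bonding mode of the master device (1 for the configured mode, 0 otherwise).",
	[]string{"master_device", "mode"}, nil,
)

var bondingModes = []string{
	"balance-rr", "active-backup", "balance-xor", "broadcast",
	"802.3ad", "balance-tlb", "balance-alb",
}

// emitBondingMode sends one sample per known mode: 1 for the configured
// mode, 0 for all others, matching the example above.
func emitBondingMode(ch chan<- prometheus.Metric, master, activeMode string) {
	for _, mode := range bondingModes {
		v := 0.0
		if mode == activeMode {
			v = 1.0
		}
		ch <- prometheus.MustNewConstMetric(
			bondingModeDesc, prometheus.GaugeValue, v, master, mode,
		)
	}
}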

@AlexZzz
Contributor

AlexZzz commented Mar 2, 2020

We're using bonding too; it would be great to add the aggregator ID for every link as a gauge (from /sys/class/net/bond0/lower_eth3/bonding_slave/ad_aggregator_id):

node_bonding_aggregator_id{master_device="bond0",device="eth3"} 1

Differing aggregator IDs can occur, for example, if the switch is misconfigured and has only one link in the aggregate; sometimes this also happens due to software failures.

It might also be good to add the active aggregator ID (from /sys/class/net/bond2/bonding/ad_aggregator):

node_bonding_active_aggregator_id{master_device="bond0"} 1

@nacc

nacc commented Jul 28, 2020

Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?

@discordianfish
Member

I don't think anything was done in that regard. Feel free to submit a PR. But please note that any procfs additions or changes should go into https://github.com/prometheus/procfs

hoffie added a commit to hoffie/node_exporter that referenced this issue Aug 12, 2020
@hoffie
Contributor Author

hoffie commented Aug 12, 2020

> Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?

I'm very sorry, but I still haven't gotten around to finishing the work on this. I started moving the existing parsing into procfs. I'm sharing my work in progress here, but it's far from complete: it needs rebasing onto more recent changes, and the procfs changes are hacked into vendor/ instead of being made in the appropriate project:
https://github.com/hoffie/node_exporter/tree/bonding

If anyone wants to pick this up, feel free to (maybe leave a short comment here). I'm still interested in this, but cannot promise when I'll be able to finish it.

@frenkye

frenkye commented Feb 18, 2022

We came across this issue on our side today; it would be nice if the aggregator ID info could be implemented.

In the example below, the interface with aggregator ID 3 has dropped out of the bond aggregation:

$ cat /sys/class/net/bond0/bonding/slaves
enp96s0f0 enp216s0f0
$ cat /sys/class/net/bond0/bonding/ad_aggregator
2
$ cat /sys/class/net/enp96s0f0/bonding_slave/ad_aggregator_id
2
$ cat /sys/class/net/enp216s0f0/bonding_slave/ad_aggregator_id
3
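
A standalone Go sketch of this check, reading only the two sysfs locations above (for illustration; in a real setup the collector would export the raw IDs, and a PromQL rule such as node_bonding_aggregator_id != on(instance, master_device) group_left() node_bonding_active_aggregator_id would do the comparison):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readSysfs returns the trimmed contents of a sysfs file, or "" on error.
func readSysfs(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

func main() {
	bond := "bond0"
	base := filepath.Join("/sys/class/net", bond, "bonding")
	active := readSysfs(filepath.Join(base, "ad_aggregator"))
	for _, slave := range strings.Fields(readSysfs(filepath.Join(base, "slaves"))) {
		id := readSysfs(filepath.Join("/sys/class/net", slave, "bonding_slave", "ad_aggregator_id"))
		if id != active {
			// e.g. enp216s0f0 with aggregator ID 3 vs. active aggregator 2
			fmt.Printf("%s: aggregator %s is not the active aggregator %s\n", slave, id, active)
		}
	}
}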

@bewing

bewing commented Mar 16, 2022

I came across this today looking into monitoring switch-side misconfigurations (LACP bond with no active members on the host side). I think I have some time next week to look at extending the procfs module to collect these statistics and the bonding collector to export them.

@rexagod
Contributor

rexagod commented May 28, 2024

@bewing Are you working on this? If not, I can take this up.

@bewing

bewing commented May 28, 2024

I started on this a year ago and got as far as opening some issues in related projects, identifying the need to flesh out the test fixtures in #2347 so as to improve procfs via prometheus/procfs#439 and make the metrics available.

The working directories are lost to the ether, and surely the ground under them has moved. It's back to a fresh start at this point, but maybe the actual code changes in the procfs pull request are still useful. I have not had time to work on this, and I encourage others to make the attempt if they can.
