Proposal: Extend network bonding metrics #1604

Open

hoffie opened this issue Feb 15, 2020 · 9 comments

Comments
@hoffie
Contributor

hoffie commented Feb 15, 2020

node_exporter currently exposes details about network bonding, which is great. To monitor more failure cases, we would need additional metrics that we haven't found in node_exporter yet:

  • An association between network devices and bonding devices: This would be useful to detect mixed-speed ports in the same bond (which is considered a misconfiguration in our case).
  • LACP (802.3ad) link aggregation status: Currently, there seems to be no way to detect if LACP has been negotiated between the machine and the switch on all ports.

Essentially, we would need the following metrics:

# HELP node_bonding_slave_info Contains association between bonding master device and associated slaves. Value is always 1.
# TYPE node_bonding_slave_info gauge
## from /sys/class/net/<master>/bonding/slaves
# - alert: LinuxPlatformBondMixedPortSpeeds
#   expr: min(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) != max(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) and on(instance, master) node_bonding_active == node_bonding_slaves
node_bonding_slave_info{master="bond0",device="eno1"} 1
node_bonding_slave_info{master="bond0",device="eno2"} 1
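
For illustration, the parsing side of this is minimal. Here is a sketch of a procfs-style helper in Go; the package and function names are illustrative, not an existing procfs API:

package sysfs

import (
	"os"
	"path/filepath"
	"strings"
)

// BondingSlaves returns the slave device names for each bonding master,
// read from /sys/class/net/<master>/bonding/slaves.
func BondingSlaves(masters []string) (map[string][]string, error) {
	slaves := make(map[string][]string)
	for _, master := range masters {
		p := filepath.Join("/sys/class/net", master, "bonding", "slaves")
		b, err := os.ReadFile(p)
		if err != nil {
			return nil, err
		}
		// The file is a single space-separated line, e.g. "eno1 eno2".
		slaves[master] = strings.Fields(string(b))
	}
	return slaves, nil
}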

# HELP node_bonding_ad_num_ports Number of visible LACP/AD ports
# TYPE node_bonding_ad_num_ports gauge
## from /sys/devices/virtual/net/<master>/bonding/ad_num_ports
## - alert: BondLacpPortsDegraded
##   expr: node_bonding_ad_num_ports != node_bonding_slaves
node_bonding_ad_num_ports{master="bond0"} 2

It could also make sense to add some more information at the same time. So far, we haven't needed these for our alerting, but they may be useful nevertheless:

# HELP node_bonding_mode Kind of bonding (e.g. 1 for active-backup, 4 for lacp/ad)
# TYPE node_bonding_mode gauge
## from /sys/class/net/<master>/bonding/mode
## might be useful to have different monitoring for active-backup, lacp, etc.
node_bonding_mode{master="bond0"} 4

# HELP node_bonding_ad_info Contains the LACP/AD partner_mac. Value is always 1.
# TYPE node_bonding_ad_info gauge
## from /sys/devices/virtual/net/<master>/bonding/ad_partner_mac
## might be useful for detecting bad lacp configs (zero MAC) or building stats about switch MACs
node_bonding_ad_info{master="bond0",partner_mac="00:00:00:00:00:00"} 1

# HELP node_bonding_ad_port_partner_oper_port_state Bitfield denoting LACP/AD port state for the partner
# TYPE node_bonding_ad_port_partner_oper_port_state gauge
## from /sys/class/net/bond0/slave_eno1/bonding_slave/ad_partner_oper_port_state
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno2"} 61

# HELP node_bonding_ad_port_actor_oper_port_state Bitfield denoting LACP/AD port state for the actor
# TYPE node_bonding_ad_port_actor_oper_port_state gauge
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno2"} 61
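
For reference, these values are bitfields whose flags follow the standard LACP port state layout (as in the kernel's bond_3ad.h): bit 0 LACP activity, bit 1 LACP timeout, bit 2 aggregation, bit 3 synchronization, bit 4 collecting, bit 5 distributing, bit 6 defaulted, bit 7 expired. A small Go sketch decoding the example value 61, assuming that layout:

package main

import "fmt"

// LACP port state flags per IEEE 802.3ad (assumed layout, see above).
const (
	lacpActivity    = 1 << 0
	lacpTimeout     = 1 << 1
	aggregation     = 1 << 2
	synchronization = 1 << 3
	collecting      = 1 << 4
	distributing    = 1 << 5
	defaulted       = 1 << 6
	expired         = 1 << 7
)

func main() {
	state := 61 // 0b00111101, the example value above
	fmt.Println("in sync:", state&synchronization != 0)   // true
	fmt.Println("collecting:", state&collecting != 0)     // true
	fmt.Println("distributing:", state&distributing != 0) // true
	fmt.Println("expired:", state&expired != 0)           // false
}

In an alerting rule, a single bit can be extracted with integer arithmetic, e.g. floor(node_bonding_ad_port_partner_oper_port_state / 8) % 2 == 0 to catch a partner that is not in sync.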

We currently use a shell script and node_exporter's textfile collector to fill this gap. However, I think it would be useful to support these metrics out of the box, especially since only /sys files need to be read.

I'd volunteer to work on PRs against procfs and node_exporter. I would suggest adding this to the existing bonding_linux.go, as it is closely related. It looks like this would also imply converting the existing bonding collector to procfs.

Related: #841

@SuperQ @discordianfish @pgier What do you think? Does it make sense in general? Should we implement only the first two metrics, or take the more generic approach? Do the suggested names make sense?

@SuperQ
Member

SuperQ commented Feb 18, 2020

Most of that makes sense to me. The only thing I would probably leave out is node_bonding_ad_info; the MAC address labels might cause cardinality issues.

One other minor labeling suggestion would be to call it master_device to avoid ambiguity.

It might also be useful to make node_bonding_mode an enum type:

node_bonding_mode{master_device="bond0",mode="balance-rr"} 0
node_bonding_mode{master_device="bond0",mode="active-backup"} 1
node_bonding_mode{master_device="bond0",mode="802.3ad"} 0
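
For illustration, here's a sketch of how such an enum metric could be emitted with the Prometheus Go client (collector wiring omitted; the mode list follows the kernel bonding documentation, the rest is an assumption):

package collector

import "github.com/prometheus/client_golang/prometheus"

var bondingModeDesc = prometheus.NewDesc(
	"node_bonding_mode",
	"Bonding mode of the master device (1 for the configured mode, 0 otherwise).",
	[]string{"master_device", "mode"}, nil,
)

var bondingModes = []string{
	"balance-rr", "active-backup", "balance-xor", "broadcast",
	"802.3ad", "balance-tlb", "balance-alb",
}

// emitBondingMode sends one sample per known mode: 1 for the configured
// mode, 0 for all others, matching the example above.
func emitBondingMode(ch chan<- prometheus.Metric, master, activeMode string) {
	for _, mode := range bondingModes {
		v := 0.0
		if mode == activeMode {
			v = 1.0
		}
		ch <- prometheus.MustNewConstMetric(
			bondingModeDesc, prometheus.GaugeValue, v, master, mode,
		)
	}
}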

@AlexZzz
Contributor

AlexZzz commented Mar 2, 2020

We're using bonding too; it would be great to add the aggregator ID for every link as a gauge (from /sys/class/net/bond0/lower_eth3/bonding_slave/ad_aggregator_id):

node_bonding_aggregator_id{master_device="bond0",device="eth3"} 1

Differing aggregator IDs can occur, for example, if the switch is misconfigured and has only one link in the aggregate; sometimes this also happens due to software failures.

It might also be good to add the active aggregator ID (from /sys/class/net/bond2/bonding/ad_aggregator):

node_bonding_active_aggregator_id{master_device="bond0"} 1

@nacc

nacc commented Jul 28, 2020

Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?

@discordianfish
Member

I don't think anything was done in that regard. Feel free to submit a PR. But please note that any procfs additions or changes should go into https://github.com/prometheus/procfs

hoffie added a commit to hoffie/node_exporter that referenced this issue Aug 12, 2020
@hoffie
Contributor Author

hoffie commented Aug 12, 2020

> Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?

I'm very sorry, but I still haven't gotten around to finishing the work on this. I started moving the existing parsing into procfs. I'm sharing my work in progress here, but it's far from complete: it needs rebasing onto more recent changes, and the procfs changes are hacked into vendor/ instead of being made in the appropriate project:
https://github.com/hoffie/node_exporter/tree/bonding

If anyone wants to pick this up, feel free to (maybe leave a short comment here). I'm still interested in this, but cannot promise when I'll be able to finish it.

@frenkye

frenkye commented Feb 18, 2022

We came across this issue on our side today; it would be nice if the aggregator ID info could be implemented.

In the example below, the interface with aggregator ID 3 has dropped out of the bond aggregation:

$ cat /sys/class/net/bond0/bonding/slaves
enp96s0f0 enp216s0f0
$ cat /sys/class/net/bond0/bonding/ad_aggregator
2
$ cat /sys/class/net/enp96s0f0/bonding_slave/ad_aggregator_id
2
$ cat /sys/class/net/enp216s0f0/bonding_slave/ad_aggregator_id
3
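
A standalone Go sketch of this check, reading only the two sysfs locations above (for illustration; in a real setup the collector would export the raw IDs, and a PromQL rule such as node_bonding_aggregator_id != on(instance, master_device) group_left() node_bonding_active_aggregator_id would do the comparison):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readSysfs returns the trimmed contents of a sysfs file, or "" on error.
func readSysfs(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

func main() {
	bond := "bond0"
	base := filepath.Join("/sys/class/net", bond, "bonding")
	active := readSysfs(filepath.Join(base, "ad_aggregator"))
	for _, slave := range strings.Fields(readSysfs(filepath.Join(base, "slaves"))) {
		id := readSysfs(filepath.Join("/sys/class/net", slave, "bonding_slave", "ad_aggregator_id"))
		if id != active {
			// e.g. enp216s0f0 with aggregator ID 3 vs. active aggregator 2
			fmt.Printf("%s: aggregator %s is not the active aggregator %s\n", slave, id, active)
		}
	}
}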

@bewing

bewing commented Mar 16, 2022

I came across this today looking into monitoring switch-side misconfigurations (LACP bond with no active members on the host side). I think I have some time next week to look at extending the procfs module to collect these statistics and the bonding collector to export them.

@rexagod
Contributor

rexagod commented May 28, 2024

@bewing Are you working on this? If not, I can take this up.

@bewing

bewing commented May 28, 2024

I started on this a year ago and got as far as opening some issues in related projects, identifying the need to flesh out the test fixtures in #2347 so as to improve procfs via prometheus/procfs#439 and make the metrics available.

The working directories are lost to the ether, and surely the ground under them has moved. It's back to a fresh start at this point, but maybe the actual code changes in the procfs pull request are still useful. I have not had time to work on this, and I encourage others to make the attempt if they can.
