Add new sysfs class for Amazon Elastic Fabric Adapter #515

Open. perifaws wants to merge 1 commit into master from feature/amazon-efa-sysfs

@perifaws perifaws commented May 8, 2023

This change adds a new sysfs class for reading metrics from the Amazon Elastic Fabric Adapter (EFA). It is based on the existing Infiniband class.

EFA is supported on a variety of Amazon EC2 instance types (list here) and is relevant to HPC and distributed ML training workloads in the same way Infiniband is.

I also wrote an associated node_exporter collector to validate this change. Happy to provide sample output on request. Thanks!

Related to the Prometheus Google Groups thread: https://groups.google.com/g/prometheus-developers/c/MEal59mDebs/m/ZQBU1f0hCAAJ

@perifaws perifaws force-pushed the feature/amazon-efa-sysfs branch 3 times, most recently from f09883d to c4ad75e Compare May 8, 2023 17:48
@matthiasr

Can you please add some unit tests with examples of what the /sys structure looks like? Otherwise this code will be impossible to maintain with confidence.

@dcbw
Contributor

dcbw commented May 17, 2023

What's EFA specific about the collector? I can't see anywhere that it checks the PCI device ID or something like that for an Amazon VID/PID. Looks like it just looks in the normal infiniband directories?

E.g., if I have a random Mellanox IB device, will this collector ignore it?

Member

@SuperQ SuperQ left a comment

Please add tests.

4 participants