-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total #871
cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total #871
Conversation
1f8b425
to
f02f372
Compare
Thanks! Can you add a fixture example for a second |
Could we consider not using the term |
collector/cpu_linux.go
Outdated
@@ -74,7 +74,7 @@ func NewCPUCollector() (Collector, error) { | |||
cpuCoreThrottle: prometheus.NewDesc( | |||
prometheus.BuildFQName(namespace, cpuCollectorSubsystem, "core_throttles_total"), | |||
"Number of times this cpu core has been throttled.", | |||
[]string{"core"}, nil, | |||
[]string{"node", "core"}, nil, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we call this package
instead of node.
Besie @SuperQ's comment, LGTM! |
Sorry, for the late reply. I was thinking about the package vs node discussion and wanted to try a more complicated system first. So I've tested a dual-socket AMD EPYC system (multi-chip module cpus) and I came to the conclusion that my commit is wrong. It only works on Intel x86_64 because (at the moment) the NUMA node number and the I.e. we can't iterate over the NUMA node number and then their local CPU / core_ids if we want this to work in the general case, too. (Important: AMD EPYC does not have the I've created a gist to show the details (first Intel Haswell, then AMD EPYC). Please take a look at the EPYC output to see the difference. Instead we should iterated the physical_package_ids and their local core_ids. And, yes, the label should be called |
@knweiss Thanks for all the research to figure this out. |
One additional request, I'd like to start including |
7296f5a
to
5d46d30
Compare
5d46d30
to
77e1560
Compare
I've tried to simplify and fix the code. This a first version (without much testing!). @SuperQ I've added CHANGELOG entries. |
CHANGELOG.md
Outdated
@@ -2,10 +2,10 @@ | |||
|
|||
**Breaking changes** | |||
|
|||
* [CHANGE] | |||
* [CHANGE] Rename `node_package_throttle` label `node` to `package`. #871 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be just listed under the breaking changes section, we typically list one item per PR.
77e1560
to
8394d3c
Compare
👍 LGTM |
This commit fixes the node_cpu_core_throttles_total metrics on multi-socket systems as the core_ids are the same for each package. I.e. we need to count them seperately. Rename the node_package_throttles_total metric label `node` to `package`. Reorganize the sys.ttar archive and use the same symlinks as the Linux kernel. Also, the new fixtures now use a dual-socket dual-core cpu w/o HT/SMT (node0: cpu0+1, node1: cpu2+3) as well as processor-less (memory-only) NUMA node 'node2' (this is a very rare case). Signed-off-by: Karsten Weiss <[email protected]>
8394d3c
to
4a07295
Compare
Use the direct path /sys/devices/system/cpu/cpu[0-9]* (without symlinks) instead of /sys/bus/cpu/devices/cpu[0-9]*. The latter path also does not exist e.g. on RHEL 6.9's kernel. Signed-off-by: Karsten Weiss <[email protected]>
Signed-off-by: Karsten Weiss <[email protected]>
4a07295
to
709bce6
Compare
In the meantime I did some more testing on CentOS 6.9 and 7.4 machines with no, partial, and full core+package throttle information in /sys. Both Intel and AMD. So far it works fine and I even caught a couple of machines with a throttle count != 0 (both core and package). YMMV. |
Signed-off-by: Karsten Weiss <[email protected]>
Looking good. One note, we intentionally had |
@SuperQ It would be very useful to have fixtures from all the various cpu types, archs, and kernel versions and an e2e test that tries them all one after the other. This would allow making changes with more confidence... On the other hand it would, of course, require more effort to keep all the e2e reference files up-to-date. |
Yea, I think this is the case where we should move all of this code over to the |
Let's merge it, we can use the updated procfs later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
prometheus#871) * cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total This commit fixes the node_cpu_core_throttles_total metrics on multi-socket systems as the core_ids are the same for each package. I.e. we need to count them seperately. Rename the node_package_throttles_total metric label `node` to `package`. Reorganize the sys.ttar archive and use the same symlinks as the Linux kernel. Also, the new fixtures now use a dual-socket dual-core cpu w/o HT/SMT (node0: cpu0+1, node1: cpu2+3) as well as processor-less (memory-only) NUMA node 'node2' (this is a very rare case). Signed-off-by: Karsten Weiss <[email protected]> * cpu: Use the direct /sys path to the cpu files. Use the direct path /sys/devices/system/cpu/cpu[0-9]* (without symlinks) instead of /sys/bus/cpu/devices/cpu[0-9]*. The latter path also does not exist e.g. on RHEL 6.9's kernel. Signed-off-by: Karsten Weiss <[email protected]> * cpu: Reverse core+package throttle processing order Signed-off-by: Karsten Weiss <[email protected]> * cpu: Add documentation URLs Signed-off-by: Karsten Weiss <[email protected]>
This fixes the
node_cpu_core_throttles_total
metrics on multi-socket systems as explained in detail in my comment to PR #836. As thecore_ids
are the same for each node (aka processor or package) we need to count them seperately. Otherwise we only count the last processor's cpu core throttles.Refactor the
core_throttles_count
handling and move it from the cpu loopto the NUMA node loop.
Reorganize the
sys.ttar
files and use the same symlinks as the Linuxkernel.