-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metric collection via njmon #2279
base: main
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
3f83468
to
d4804d6
Compare
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0927ab9252f9453b8e89b5dc23beba41 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 29m 17s |
This patch adds a new role to manage "njmon". This tool allows to collect metrics on the hypervisor so that we can check how the resources were used on it. By toggling `cifmw_monitoring` boolean to `true`, you will enable its installation at the very beginning of the reproducer.yml run, as well as the graphs generation at the very end of the run. This can help understanding potential issues, for instance related to memory shortage (oom-killer), slow I/O (disks) or clogged CPU (overprovisioning). One of the advantages of njmon, in addition to being really small and light, is its capacity to send data to a remote InfluxDB. This would allow you to display the graphs in realtime in Grafana, for instance.
An example of the metrics is visible in the molecule output, for instance here: You can then click on the various button at the top to see the generated graphs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments but the problem with this PR is not its scope, which i strongly support, but rather the tool of choice.
Honestly I don't like this tool: it doesn't provide the bare minimum requirement for a modern supply chain requirement (no gpg signature on downloading and even no checksum for downloaded file), it's clearly written for a pre systemd world and it feels poorly integrated in the system.
We can find a better solution for the problem, probably something already present in EL repo or, at least, well packaged and, above all, modern. :)
- name: Build njmon | ||
community.general.make: | ||
chdir: "{{ cifmw_njmon_basedir }}/tmp/njmon" | ||
target: binary | ||
params: | ||
OSNAME: "{{ ansible_facts['distribution'] | lower }}" | ||
OSVERSION: "{{ ansible_facts['distribution_version'] }}" | ||
HW: "{{ ansible_facts['architecture'] }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a huge nope for me. We shouldn't build binaries by ourselves if not strictly required (such as some python native library, if any). In the same packages they are providing binary for RHEL9, njmon_RHEL9_x86_64_v83
we should use that.
- name: Get njmon archive | ||
ansible.builtin.unarchive: | ||
dest: "{{ cifmw_njmon_basedir }}/tmp/njmon" | ||
remote_src: true | ||
src: "{{ cifmw_njmon_repository }}/{{ cifmw_njmon_archive }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no gpg signature and hash to compare releases is a red flag for me
- name: Run njmon | ||
register: _start_njmon | ||
ansible.builtin.command: | ||
chdir: "{{ cifmw_njmon_output_dir }}" | ||
cmd: "njmon -a {{ cifmw_njmon_basedir }}/artifacts/njmon_opts.txt" | ||
failed_when: | ||
- _start_njmon.rc not in [0, 99] | ||
changed_when: _start_njmon.rc != 99 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check its manual, but how it can collect valuable metrics without being root?
/hold need to be discussed with other ci-framework members |
(brain dropping before I forget things)
Those tools can write on disk for later visualization, which is interesting in many cases. In the njmon case, we can get the resource usage directly from the zuul logs. Keeping this functionality seems a good idea imho. Another solution would be Node Exporter, but it requires some more things than just one single service, and more configuration. Also not sure about the footprint. Let's discuss that topic once we get more ppl on board. At least, it sparked some discussion and interest on the metric topic :) |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This patch adds a new role to manage "njmon". This tool allows to
collect metrics on the hypervisor so that we can check how the resources
were used on it.
By toggling
cifmw_monitoring
boolean totrue
, you will enable itsinstallation at the very beginning of the reproducer.yml run, as well as
the graphs generation at the very end of the run.
This can help understanding potential issues, for instance related to
memory shortage (oom-killer), slow I/O (disks) or clogged CPU
(overprovisioning).
One of the advantages of njmon, in addition to being really small and
light, is its capacity to send data to a remote InfluxDB. This would
allow you to display the graphs in realtime in Grafana, for instance.