Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metric collection via njmon #2279

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Add metric collection via njmon #2279

wants to merge 1 commit into from

Conversation

cjeanner
Copy link
Collaborator

This patch adds a new role to manage "njmon". This tool allows to
collect metrics on the hypervisor so that we can check how the resources
were used on it.

By toggling cifmw_monitoring boolean to true, you will enable its
installation at the very beginning of the reproducer.yml run, as well as
the graphs generation at the very end of the run.

This can help understanding potential issues, for instance related to
memory shortage (oom-killer), slow I/O (disks) or clogged CPU
(overprovisioning).

One of the advantages of njmon, in addition to being really small and
light, is its capacity to send data to a remote InfluxDB. This would
allow you to display the graphs in realtime in Grafana, for instance.

Copy link
Contributor

openshift-ci bot commented Aug 29, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Aug 29, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cjeanner. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cjeanner cjeanner force-pushed the njmon branch 2 times, most recently from 3f83468 to d4804d6 Compare August 29, 2024 08:47
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0927ab9252f9453b8e89b5dc23beba41

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 29m 17s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 14m 26s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 27m 08s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 47s
✔️ cifmw-pod-pre-commit SUCCESS in 6m 56s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 29s
cifmw-molecule-njmon FAILURE in 4m 29s

This patch adds a new role to manage "njmon". This tool allows to
collect metrics on the hypervisor so that we can check how the resources
were used on it.

By toggling `cifmw_monitoring` boolean to `true`, you will enable its
installation at the very beginning of the reproducer.yml run, as well as
the graphs generation at the very end of the run.

This can help understanding potential issues, for instance related to
memory shortage (oom-killer), slow I/O (disks) or clogged CPU
(overprovisioning).

One of the advantages of njmon, in addition to being really small and
light, is its capacity to send data to a remote InfluxDB. This would
allow you to display the graphs in realtime in Grafana, for instance.
@cjeanner
Copy link
Collaborator Author

An example of the metrics is visible in the molecule output, for instance here:
https://logserver.rdoproject.org/79/2279/d468184914fc9944da8cd68873f78812aabe69cd/github-check/cifmw-molecule-njmon/4955177/ci-framework-data/artifacts/njmon_stats/np0004868171_20240829_0631.html

You can then click on the various button at the top to see the generated graphs.
The job is gathering just 30 seconds, to allow building the graph, so it's not that "beautiful", but at least you can get an idea of the data it collects.

Copy link
Contributor

@rebtoor rebtoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but the problem with this PR is not its scope, which i strongly support, but rather the tool of choice.

Honestly I don't like this tool: it doesn't provide the bare minimum requirement for a modern supply chain requirement (no gpg signature on downloading and even no checksum for downloaded file), it's clearly written for a pre systemd world and it feels poorly integrated in the system.

We can find a better solution for the problem, probably something already present in EL repo or, at least, well packaged and, above all, modern. :)

Comment on lines +33 to +40
- name: Build njmon
community.general.make:
chdir: "{{ cifmw_njmon_basedir }}/tmp/njmon"
target: binary
params:
OSNAME: "{{ ansible_facts['distribution'] | lower }}"
OSVERSION: "{{ ansible_facts['distribution_version'] }}"
HW: "{{ ansible_facts['architecture'] }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge nope for me. We shouldn't build binaries by ourselves if not strictly required (such as some python native library, if any). In the same packages they are providing binary for RHEL9, njmon_RHEL9_x86_64_v83 we should use that.

Comment on lines +27 to +31
- name: Get njmon archive
ansible.builtin.unarchive:
dest: "{{ cifmw_njmon_basedir }}/tmp/njmon"
remote_src: true
src: "{{ cifmw_njmon_repository }}/{{ cifmw_njmon_archive }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no gpg signature and hash to compare releases is a red flag for me

Comment on lines +73 to +80
- name: Run njmon
register: _start_njmon
ansible.builtin.command:
chdir: "{{ cifmw_njmon_output_dir }}"
cmd: "njmon -a {{ cifmw_njmon_basedir }}/artifacts/njmon_opts.txt"
failed_when:
- _start_njmon.rc not in [0, 99]
changed_when: _start_njmon.rc != 99
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check its manual, but how it can collect valuable metrics without being root?

@rebtoor
Copy link
Contributor

rebtoor commented Aug 29, 2024

/hold need to be discussed with other ci-framework members

@cjeanner
Copy link
Collaborator Author

(brain dropping before I forget things)
Among packages alternatives, I can list:

  • collectd (but it's probably too big for the usage, and its configuration isn't that straightforward)
  • munin

Those tools can write on disk for later visualization, which is interesting in many cases. In the njmon case, we can get the resource usage directly from the zuul logs. Keeping this functionality seems a good idea imho.

Another solution would be Node Exporter, but it requires some more things than just one single service, and more configuration. Also not sure about the footprint.

Let's discuss that topic once we get more ppl on board. At least, it sparked some discussion and interest on the metric topic :)

@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants