This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

feat!: implement slurm snap to enable observability metrics collection for COS #30

Merged 4 commits on Jul 15, 2024

Conversation

NucciTheBoss
Member

@NucciTheBoss NucciTheBoss commented Jul 12, 2024

Description

This PR takes a first stab at implementing the Slurm snap package within the slurmctld operator. The snap contains the Prometheus exporter that collects the metrics fed into COS.

This PR also adds a high-level interface for invoking scontrol. scontrol enables us to easily control the cluster without needing to make constant updates to the Slurm configuration.
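For readers unfamiliar with the idea, here is a minimal sketch of what a high-level scontrol interface could look like. It is not the code from this PR: the function names (`build_scontrol_command`, `scontrol`, `reconfigure`, `set_node_state`), the `runner` parameter, and the assumption that the binary is reachable as plain `scontrol` are all hypothetical. A snap install may expose the command under a different name.

```python
import subprocess
from typing import Callable, List, Optional

# Assumption: scontrol is on PATH under this name. A snap-packaged Slurm
# may expose it under a different command name; adjust as needed.
SCONTROL = "scontrol"


def build_scontrol_command(*args: str, binary: str = SCONTROL) -> List[str]:
    """Build the argv list for an scontrol invocation without running it."""
    return [binary, *args]


def scontrol(*args: str, runner: Callable = subprocess.run, binary: str = SCONTROL) -> str:
    """Run scontrol with the given arguments and return its stdout.

    `runner` is injectable so the wrapper can be exercised without a
    Slurm installation. With the default runner, a non-zero exit raises
    subprocess.CalledProcessError.
    """
    result = runner(
        build_scontrol_command(*args, binary=binary),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def reconfigure(**kwargs) -> str:
    """Ask the Slurm daemons to re-read their configuration."""
    return scontrol("reconfigure", **kwargs)


def set_node_state(node: str, state: str, reason: Optional[str] = None, **kwargs) -> str:
    """Update a node's state, e.g. DRAIN or RESUME, without editing slurm.conf."""
    args = ["update", f"NodeName={node}", f"State={state}"]
    if reason:
        args.append(f"Reason={reason}")
    return scontrol(*args, **kwargs)
```

With a wrapper like this, an operator could call `set_node_state("node1", "DRAIN", reason="maintenance")` from an event handler instead of rewriting the Slurm configuration and restarting services.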

How was the code tested?

Locally on my Ubuntu 24.04 Noble Numbat workstation.

Related issues and/or tasks

N/A

Checklist

  • I am the author of these changes, or I have the rights to submit them.
  • I have added the relevant changes to the README and/or documentation.
  • I have self reviewed my own code.
  • All requested changes and/or review comments have been resolved.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Signed-off-by: Jason C. Nucciarone <[email protected]>
Keeps legacy code from the old slurmctld_ops that still has
not been enabled on the new manager. More focused on ensuring
that observability works.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Still using some methods provided by the legacy manager
that we don't have time to refactor this pulse.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Contributor

@jedel1043 jedel1043 left a comment


LGTM! We can merge against experimental, then do further work to polish it there.

@NucciTheBoss NucciTheBoss marked this pull request as ready for review July 15, 2024 16:23
@NucciTheBoss
Member Author

Merging as is. I verified that the Slurm snap works within slurmctld and slurmd. I am still fixing outstanding issues with NHC (will rebase into experimental) and need to fix certain things with the juju-systemd-notices daemon.

@NucciTheBoss NucciTheBoss merged commit d9849ce into charmed-hpc:experimental Jul 15, 2024
4 of 5 checks passed
2 participants