Skip to content
This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

Commit

Permalink
james longdev
Browse files Browse the repository at this point in the history
These changes make a number of modifications to the slurm charms.

* Removes no longer used interfaces
* remove slurmctld dependency on slurmdbd and slurmd
* support partitions with 0 nodes
* refactor relation data
* recreate how configs are written
* support user supplied partition configuration
* start modeling config
* add type checking
* remove unused code
* support partition events and slurmd node events
* remove upgrade-charm hook as it did nothing
* use systemd and apt charm libs
* rename interfaces
* update readme to reflect new interface names and minimal deployment using only slurmctld and slurmd
* address pr feedback
  • Loading branch information
jamesbeedy committed Jun 25, 2024
1 parent 81d201d commit da62269
Show file tree
Hide file tree
Showing 24 changed files with 1,371 additions and 2,267 deletions.
20 changes: 16 additions & 4 deletions .github/workflows/ci.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Run tests
uses: get-woke/woke-action@v0
with:
Expand All @@ -34,7 +34,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install dependencies
run: python3 -m pip install tox
- name: Run linters
Expand All @@ -45,12 +45,23 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install dependencies
run: python3 -m pip install tox
- name: Run tests
run: tox -e unit

type-check:
name: Static type checking
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install dependencies
run: python3 -m pip install tox
- name: Run tests
run: tox -e type

integration-test:
strategy:
fail-fast: true
Expand All @@ -63,9 +74,10 @@ jobs:
- inclusive-naming-check
- lint
- unit-test
- type-check
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup operator environment
uses: charmed-kubernetes/actions-operator@main
with:
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,5 @@ __pycache__/
.idea
.vscode/
version
.mypy_cache
.ruff_cache
8 changes: 1 addition & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,13 +24,7 @@ This operator should be used with Juju 3.x or greater.
```shell
$ juju deploy slurmctld --channel edge
$ juju deploy slurmd --channel edge
$ juju deploy slurmdbd --channel edge
$ juju deploy mysql --channel 8.0/edge
$ juju deploy mysql-router slurmdbd-mysql-router --channel dpe/edge
$ juju integrate slurmctld:slurmd slurmd:slurmd
$ juju integrate slurmdbd-mysql-router:backend-database mysql:database
$ juju integrate slurmdbd:database slurmdbd-mysql-router:database
$ juju integrate slurmctld:slurmdbd slurmdbd:slurmdbd
$ juju integrate slurmctld:slurmd slurmd:slurmctld
```

## Project & Community
Expand Down
134 changes: 30 additions & 104 deletions charmcraft.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -22,28 +22,15 @@ links:
source:
- https://github.com/charmed-hpc/slurmctld-operator

peers:
slurmctld-peer:
interface: slurmctld-peer
requires:
slurmd:
interface: slurmd
slurmdbd:
interface: slurmdbd
slurmrestd:
interface: slurmrestd
influxdb-api:
interface: influxdb-api
elasticsearch:
interface: elasticsearch
fluentbit:
interface: fluentbit
provides:
prolog-epilog:
interface: prolog-epilog
grafana-source:
interface: grafana-source
scope: global

assumes:
- juju
Expand All @@ -58,163 +45,102 @@ bases:
channel: "22.04"
architectures: [amd64]

parts:
charm:
build-packages: [git]
charm-python-packages: [setuptools]

# Create a version file and pack it into the charm. This is dynamically generated
# as part of the build process for a charm to ensure that the git revision of the
# charm is always recorded in this version file.
version-file:
plugin: nil
build-packages:
- git
override-build: |
VERSION=$(git -C $CRAFT_PART_SRC/../../charm/src describe --dirty --always)
echo "Setting version to $VERSION"
echo $VERSION > $CRAFT_PART_INSTALL/version
stage:
- version

config:
options:
custom-slurm-repo:
type: string
default: ""
description: >
Use a custom repository for Slurm installation.
This can be set to the Organization's local mirror/cache of packages and
supersedes the Omnivector repositories. Alternatively, it can be used to
track a `testing` Slurm version, e.g. by setting to
`ppa:omnivector/osd-testing`.

Note: The configuration `custom-slurm-repo` must be set *before*
deploying the units. Changing this value after deploying the units will
not reinstall Slurm.
cluster-name:
type: string
default: osd-cluster
description: >
description: |
Name to be recorded in database for jobs from this cluster.
This is important if a single database is used to record information from
multiple Slurm-managed clusters.

default-partition:
type: string
default: ""
description: >
description: |
Default Slurm partition. This is only used if defined, and must match an
existing partition.
custom-config:
slurm-conf-parameters:
type: string
default: ""
description: >
User supplied Slurm configuration.
This value supplements the charm supplied `slurm.conf` that is used for
Slurm Controller and Compute nodes.
description: |
User supplied Slurm configuration as a multiline string.
Example usage:
$ juju config slurmcltd custom-config="FirstJobId=1234"
proctrack-type:
type: string
default: proctrack/cgroup
description: >
Identifies the plugin to be used for process tracking on a job step
basis.
cgroup-config:
$ juju config slurmcltd slurm-conf-parameters="$(cat additional.conf)"

cgroup-parameters:
type: string
default: |
CgroupAutomount=yes
ConstrainCores=yes
description: >
Configuration content for `cgroup.conf`.
description: |
User supplied configuration for `cgroup.conf`.
health-check-params:
default: ""
type: string
description: >
description: |
Extra parameters for NHC command.
This option can be used to customize how NHC is called, e.g. to send an
e-mail to an admin when NHC detects an error set this value to
e-mail to an admin when NHC detects an error set this value to.
`-M [email protected]`.

health-check-interval:
default: 600
type: int
description: Interval in seconds between executions of the Health Check.

health-check-state:
default: "ANY,CYCLE"
type: string
description: Only run the Health Check on nodes in this state.

acct-gather-frequency:
type: string
default: "task=30"
description: >
Accounting and profiling sampling intervals for the acct_gather plugins.
Note: A value of `0` disables the periodic sampling. In this case, the
accounting information is collected when the job terminates.

Example usage:
$ juju config slurmcltd acct-gather-frequency="task=30,network=30"
acct-gather-custom:
type: string
default: ""
description: >
User supplied `acct_gather.conf` configuration.
This value supplements the charm supplied `acct_gather.conf` file that is
used for configuring the acct_gather plugins.

actions:
show-current-config:
description: >
description: |
Display the currently used `slurm.conf`.
Note: This file only exists in `slurmctld` charm and is automatically
distributed to all compute nodes by Slurm.

Example usage:
$ juju run-action slurmctld/leader --format=json --wait | jq .[].results.slurm.conf | xargs -I % -0 python3 -c 'print(%)'

```bash
juju run slurmctld/leader show-current-config \
--quiet --format=json | jq .[].results.slurm.conf | xargs -I % -0 python3 -c 'print(%)'
```

drain:
description: >
description: |
Drain specified nodes.
Example usage:
$ juju run-action slurmctld/leader drain nodename=node-[1,2] reason="Updating kernel"
$ juju run slurmctld/leader drain nodename="node-[1,2]" reason="Updating kernel"
params:
nodename:
type: string
description: The nodes to drain, using the Slurm format, e.g. `node-[1,2]`.
description: The nodes to drain, using the Slurm format, e.g. `"node-[1,2]"`.
reason:
type: string
description: Reason to drain the nodes.
required:
- nodename
- reason
resume:
description: >
description: |
Resume specified nodes.
Note: Newly added nodes will remain in the `down` state until configured,
with the `node-configured` action.

Example usage: $ juju run-action slurmctld/leader resume nodename=node-[1,2]
Example usage: $ juju run slurmctld/leader resume nodename="node-[1,2]"
params:
nodename:
type: string
description: >
The nodes to resume, using the Slurm format, e.g. `node-[1,2]`.
description: |
The nodes to resume, using the Slurm format, e.g. `"node-[1,2]"`.
required:
- nodename

influxdb-info:
description: >
Get InfluxDB info.
This action returns the host, port, username, password, database, and
retention policy regarding to InfluxDB.
8 changes: 3 additions & 5 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,3 @@
ops==2.*
influxdb==5.3.1
jinja2==3.1.3
distro
pycryptodome
ops==2.14.0
distro==1.9.0
pycryptodome==3.20.0
Loading

0 comments on commit da62269

Please sign in to comment.