
Commit

Merge pull request #12 from CLIP-HPC/more_e2e
Add support for SLURM < 21.08.x, improve error handling, end to end testing.
pja237 authored Jun 1, 2022
2 parents d6088fb + b738a6a commit 85fe394
Showing 61 changed files with 2,892 additions and 488 deletions.
8 changes: 6 additions & 2 deletions Makefile
@@ -28,7 +28,7 @@ config=cmd/goslmailer/goslmailer.conf.annotated_example cmd/gobler/gobler.conf
# can be replaced with go test ./... construct
testdirs=$(sort $(dir $(shell find ./ -name *_test.go)))

all: list test build test_endly install
all: list test build get_endly test_endly install

list:
@echo "================================================================================"
@@ -67,10 +67,14 @@ test:
@echo "********************************************************************************"
go test -v -count=1 ./...

get_endly:
endly_linux_$(endly_version).tar.gz:
curl -L -O https://github.com/viant/endly/releases/download/v$(endly_version)/endly_linux_$(endly_version).tar.gz

test_e2e/endly:
tar -C test_e2e/ -xzf endly_linux_$(endly_version).tar.gz

get_endly: endly_linux_$(endly_version).tar.gz test_e2e/endly

test_endly:
cd test_e2e
./endly
63 changes: 50 additions & 13 deletions README.md
@@ -1,10 +1,9 @@
# goslmailer

> **Warning**
> Currently goslmailer will only work on SLURM >= 21.08.x
>
> Work in progress to support older versions of SLURM is tracked here:
https://github.com/CLIP-HPC/goslmailer/issues/4
> **Info**
> Now also works with SLURM < 21.08
>
> For templating differences between slurm>21.08 and slurm<21.08 see [templating guide](./templates/README.md)
## Drop-in notification delivery solution for slurm that can do...

@@ -21,7 +20,7 @@ https://github.com/CLIP-HPC/goslmailer/issues/4
**Goslmailer** (GoSlurmMailer) is a drop-in replacement [MailProg](https://slurm.schedmd.com/slurm.conf.html#OPT_MailProg) for [slurm](https://slurm.schedmd.com/).


With goslmailer configured as the slurm mailer,
With goslmailer configured as the slurm mailer,

```
MailProg = /usr/bin/goslmailer
@@ -43,13 +42,44 @@ To support future additional receiver schemes, a [connector package](connectors/

## Installation

### Build

#### Quick version, without end to end testing

```
git clone https://github.com/CLIP-HPC/goslmailer.git
make test
make build
make install
```

#### Slightly more involved, with end to end testing:

Prerequisites:

1. generated RSA keypair (passwordless) (`ssh-keygen -t rsa`)
2. `ssh $USER@localhost` must work without password

Known caveats:

* redhat/centos: must have lsb_release binary installed, package: `redhat-lsb-core`
* ubuntu 22: `set enable-bracketed-paste off` present in `~/.inputrc`
* maybe/maybe not, depends if you see failed tests: `export TERM=dumb` in `~/.bashrc` :)

```
# downloads endly binary and runs endly tests
make
```


### goslmailer

* place binary to the path of your liking
* place [goslmailer.conf](cmd/goslmailer/goslmailer.conf.annotated_example) here: `/etc/slurm/goslmailer.conf`
* place [goslmailer.conf](cmd/goslmailer/goslmailer.conf.annotated_example) here: `/etc/slurm/goslmailer.conf` (default path)
* OR: anywhere else, but then run the binary with `GOSLMAILER_CONF=/path/to/gosl.conf` in environment
* point slurm `MailProg` to the binary
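
The lookup order in the list above (default path, overridable through the environment) could be sketched in Go roughly as follows. This is a minimal illustration under assumptions, not goslmailer's actual code: the names `defaultConf` and `resolveConfigPath` are hypothetical, and only the `GOSLMAILER_CONF` variable and the `/etc/slurm/goslmailer.conf` default come from the README.

```go
package main

import (
	"fmt"
	"os"
)

// defaultConf is the default location named in the README.
const defaultConf = "/etc/slurm/goslmailer.conf"

// resolveConfigPath is a hypothetical helper. It returns envValue when it is
// non-empty and falls back to the default path otherwise.
func resolveConfigPath(envValue string) string {
	if envValue != "" {
		return envValue
	}
	return defaultConf
}

func main() {
	// GOSLMAILER_CONF is the environment variable mentioned in the README.
	path := resolveConfigPath(os.Getenv("GOSLMAILER_CONF"))
	fmt.Println("using config:", path)
}
```

Keeping the decision in a pure function that takes the variable's value makes the fallback easy to test without touching the process environment.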

### gobler
### gobler

* place binary to the path of your liking
* place [gobler.conf](cmd/gobler/gobler.conf) to the path of your liking
@@ -74,7 +104,7 @@ See each connector details below...

## Spooling and throttling of messages - gobler service

In high-throughput clusters or in situations where job/message spikes are common, it might not be advisable to try to send all of the incoming messages as they arrive.
In high-throughput clusters or in situations where job/message spikes are common, it might not be advisable to try to send all of the incoming messages as they arrive.
For these environments goslmailer can be configured to spool messages from certain connectors on disk, to be later processed by the **gobler** service.


@@ -86,7 +116,7 @@ On startup, gobler reads its config file and spins-up a `connector monitor` for

`connector monitor` in turn spins up 3 goroutines: `monitor`, `picker` and `numSenders` x `sender`.

* **monitor** :
* **monitor** :
* every `monitorT` seconds (or milliseconds) scans the `spoolDir` for new messages and sends them to the **picker**

* **picker** :
@@ -105,7 +135,7 @@ On startup, gobler reads its config file and spins-up a `connector monitor` for

## Connectors

### default connector
### default connector

Specifies which receiver scheme is the default, used when the user didn't specify `--mail-user` and slurm sent a bare username.

@@ -125,6 +155,13 @@ With connector parameters, you can:
* template message body
* allowList the recipients

To make sure that mutt properly renders the HTML email, add the following lines to `/etc/Muttrc.local`:

```
# Local configuration for Mutt.
set content_type="text/html"
```

See [annotated configuration example](cmd/goslmailer/goslmailer.conf.annotated_example)

---
@@ -135,9 +172,9 @@ Sends **1on1** or **group chat** messages about jobs via [telegram messenger app

![Telegram card](./images/telegram.png)

Prerequisites for the telegram connector:
Prerequisites for the telegram connector:

1. a telegram bot must be created and
1. a telegram bot must be created and
2. the bot daemon service **tgslurmbot** must be running.

Site admins can [create a telegram bot](https://core.telegram.org/bots#6-botfather) by messaging [botfather](https://t.me/botfather).
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
v2.1.5
v2.2.0
5 changes: 4 additions & 1 deletion cmd/goslmailer/goslmailer.go
@@ -67,7 +67,10 @@ func main() {

// get job statistics based on the SLURM_JOB_ID from slurmEnv struct
// only if job is END or FAIL(?)
job.GetJobStats(log, ic.CmdParams.Subject, cfg.Binpaths)
err = job.GetJobStats(ic.CmdParams.Subject, cfg.Binpaths, log)
if err != nil {
log.Fatalf("Unable to retrieve job stats. Error: %v", err)
}

// generate hints based on SlurmEnv and JobStats (e.g. "too much memory requested" or "walltime << requested queue")
// only if job is END or fail(?)
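
In this hunk `GetJobStats` now returns an error, which `main` treats as fatal. The general pattern, polling a data source a bounded number of times and wrapping the underlying error with `%w` so the caller can still inspect it, can be sketched as below. Everything here is illustrative: `pollUntil` and `errSacct` are hypothetical names, and the stub closures stand in for the real sacct call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSacct is a hypothetical sentinel error standing in for a failed sacct call.
var errSacct = errors.New("sacct failed")

// pollUntil calls fetch up to maxTries times, sleeping between attempts,
// until the returned state equals want. A fetch error is wrapped with %w so
// callers can match it with errors.Is. If the state never matches, the last
// observed state is returned without an error.
func pollUntil(fetch func() (string, error), want string, maxTries int, wait time.Duration) (string, error) {
	state := ""
	for i := 0; i < maxTries; i++ {
		s, err := fetch()
		if err != nil {
			return "", fmt.Errorf("failed to get job stats: %w", err)
		}
		state = s
		if state == want {
			break
		}
		time.Sleep(wait)
	}
	return state, nil
}

func main() {
	calls := 0
	// Stub data source: reports RUNNING twice, then COMPLETED.
	fetch := func() (string, error) {
		calls++
		if calls < 3 {
			return "RUNNING", nil
		}
		return "COMPLETED", nil
	}
	state, err := pollUntil(fetch, "COMPLETED", 5, time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(state, "after", calls, "calls")

	// A failing source: the wrapped error can still be matched by the caller.
	_, err = pollUntil(func() (string, error) { return "", errSacct }, "COMPLETED", 5, time.Millisecond)
	fmt.Println("is sacct error:", errors.Is(err, errSacct))
}
```

Passing the wait interval as a parameter keeps the retry logic testable without real delays.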
114 changes: 100 additions & 14 deletions internal/slurmjob/getjobcontext.go
@@ -1,9 +1,11 @@
package slurmjob

import (
"errors"
"fmt"
"log"
"os"
"regexp"
"sort"
"strings"
"time"
@@ -111,7 +113,8 @@ func IsJobFinished(jobState string) bool {
"FAILED",
"COMPLETED",
"OUT_OF_MEMORY",
"TIMEOUT":
"TIMEOUT",
"Mixed":
return true
}
return false
@@ -121,36 +124,119 @@ func (j *JobContext) IsJobFinished() bool {
return IsJobFinished(j.SlurmEnvironment.SLURM_JOB_STATE)
}

// parseSubjectLine parses a mail subject line and returns a partially filled SlurmEnvironment struct.
// It returns an error if parsing fails.
func parseSubjectLine(subject string) (*SlurmEnvironment, error) {
rJob, _ := regexp.Compile(`^Slurm Job_id=(?P<JobId>\d+) Name=(?P<JobName>.*) (?P<MailType>\w+), \w+ time .+?(?:, (?P<JobState>\w+), ExitCode (?P<ExitCode>\d))?$`)
asJob, _ := regexp.Compile(`^Slurm Array Summary Job_id=.+ \((\d+)\) Name=(?P<JobName>.*) (?P<MailType>\w+)(?:, (?P<JobState>\w+), \w+ \[(?P<ExitCode>.+)\])?$`)
aJob, _ := regexp.Compile(`^Slurm Array Task Job_id=(?P<JobArrayId>\d+)_(?P<JobArrayIndex>\d+) \((?P<JobId>\d+)\) Name=(?P<JobName>.*) (?P<MailType>\w+), \w+ time .+?(?:, (?P<JobState>\w+), ExitCode (?P<ExitCode>\d))?$`)

env := new(SlurmEnvironment)
var jobId string
var jobState string
var mailType string
var jobName string
if strings.Contains(subject, "Slurm Array Summary Job_id=") {
matches := asJob.FindStringSubmatch(subject)
if matches == nil {
return nil, errors.New(("Invalid subject line: " + subject))
}
jobId = matches[1]
jobName = matches[2]
mailType = matches[3]
jobState = matches[4]
if jobState == "" {
jobState = "PENDING"
}
} else if strings.Contains(subject, "Slurm Array Task Job_id") {
matches := aJob.FindStringSubmatch(subject)
if matches == nil {
return nil, errors.New(("Invalid subject line: " + subject))
}
env.SLURM_ARRAY_JOB_ID = matches[1]
env.SLURM_ARRAY_TASK_ID = matches[2]
jobId = matches[3]
jobName = matches[4]
mailType = matches[5]
jobState = matches[6]
if jobState == "" {
jobState = "RUNNING"
}
} else {
matches := rJob.FindStringSubmatch(subject)
if matches == nil {
return nil, errors.New(("Invalid subject line: " + subject))
}
jobId = matches[1]
jobName = matches[2]
mailType = matches[3]
jobState = matches[4]
if jobState == "" {
jobState = "RUNNING"
}

}
env.SLURM_JOBID = jobId
env.SLURM_JOB_ID = jobId
env.SLURM_JOB_MAIL_TYPE = mailType
env.SLURM_JOB_STATE = jobState
env.SLURM_JOB_NAME = jobName
return env, nil
}

func (j *JobContext) UpdateEnvVarsFromMailSubject(subject string) error {
env, err := parseSubjectLine(subject)
if err != nil {
return err
}
j.SlurmEnvironment = *env
return nil
}

// Get additional job statistics from external source (e.g. jobinfo or sacct)
func (j *JobContext) GetJobStats(log *log.Logger, subject string, paths map[string]string) {
log.Print("Start retrieving job stats")
log.Printf("%#v", j.SlurmEnvironment)
func (j *JobContext) GetJobStats(subject string, paths map[string]string, l *log.Logger) error {
l.Print("Start retrieving job stats")
l.Printf("%#v", j.SlurmEnvironment)

// SLURM < 21.08.x does not set any SLURM environment variables, so we parse the mail subject line to retrieve the job ID and get all other information from sacct
if j.SlurmEnvironment.SLURM_JOBID == "" {
err := j.UpdateEnvVarsFromMailSubject(subject)
if err != nil {
return err
}
}
jobId := j.SlurmEnvironment.SLURM_JOBID
if strings.Contains(subject, "Slurm Array Summary Job_id=") {
j.MailSubject = fmt.Sprintf("Job Array Summary %s (%s-%s)", j.SlurmEnvironment.SLURM_ARRAY_JOB_ID, j.SlurmEnvironment.SLURM_ARRAY_TASK_MIN, j.SlurmEnvironment.SLURM_ARRAY_TASK_MAX)
//jobId = fmt.Sprintf("%s_%s", j.SlurmEnvironment.SLURM_ARRAY_JOB_ID, j.SlurmEnvironment.SLURM_ARRAY_TASK_ID)
j.MailSubject = fmt.Sprintf("Job Array Summary %s_*", j.SlurmEnvironment.SLURM_ARRAY_JOB_ID)
} else if strings.Contains(subject, "Slurm Array Task Job_id") {
jobId = j.SlurmEnvironment.SLURM_ARRAY_JOB_ID
j.MailSubject = fmt.Sprintf("Job Array Task %s", jobId)

} else {
j.MailSubject = fmt.Sprintf("Job %s", jobId)

}
if j.SlurmEnvironment.SLURM_ARRAY_JOB_ID != "" {
jobId = j.SlurmEnvironment.SLURM_ARRAY_JOB_ID
}
log.Printf("Fetch job info %s", jobId)
j.JobStats = *GetSacctMetrics(jobId, log, paths)
l.Printf("Fetch job info %s", jobId)
jobStats, err := GetSacctMetrics(jobId, paths, l)
if err != nil {
return err
}
j.JobStats = *jobStats
counter := 0
for !IsJobFinished(j.JobStats.State) && j.JobStats.State != j.SlurmEnvironment.SLURM_JOB_STATE && counter < 5 {
time.Sleep(2 * time.Second)
j.JobStats = *GetSacctMetrics(jobId, log, paths)
jobStats, err = GetSacctMetrics(jobId, paths, l)
if err != nil {
return fmt.Errorf("failed to get job stats: %w", err)
}
j.JobStats = *jobStats
counter += 1
}
if j.JobStats.State == "RUNNING" {
log.Print("Update job with live stats")
updateJobStatsWithLiveData(&j.JobStats, jobId, log, paths)
l.Print("Update job with live stats")
updateJobStatsWithLiveData(&j.JobStats, jobId, paths, l)
}
log.Printf("Finished retrieving job stats")
l.Printf("Finished retrieving job stats")
return nil
}
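
`parseSubjectLine` above reads capture groups by position (`matches[1]`, `matches[2]`, ...), although the patterns already name most of them. As an alternative sketch, the groups can be looked up by name with `Regexp.SubexpIndex` (available since Go 1.15), so that editing the pattern does not silently shift fields. The type `parsedSubject`, the function `parseRegularSubject`, and the sample subject strings are hypothetical. The pattern here also uses `\d+` for the exit code, an assumption made for this sketch that differs from the `\d` in the commit.

```go
package main

import (
	"fmt"
	"regexp"
)

// parsedSubject holds the fields pulled out of a mail subject (illustrative type).
type parsedSubject struct {
	JobID, JobName, MailType, JobState, ExitCode string
}

// subjectRe matches the subject of a regular (non-array) job mail.
// Assumption for this sketch: the exit code may span several digits (\d+).
var subjectRe = regexp.MustCompile(`^Slurm Job_id=(?P<JobId>\d+) Name=(?P<JobName>.*) (?P<MailType>\w+), \w+ time .+?(?:, (?P<JobState>\w+), ExitCode (?P<ExitCode>\d+))?$`)

// parseRegularSubject extracts the named groups from a subject line.
// Groups that did not take part in the match come back as empty strings.
func parseRegularSubject(subject string) (*parsedSubject, error) {
	m := subjectRe.FindStringSubmatch(subject)
	if m == nil {
		return nil, fmt.Errorf("invalid subject line: %q", subject)
	}
	// get looks a group up by name instead of by position.
	get := func(name string) string { return m[subjectRe.SubexpIndex(name)] }
	p := &parsedSubject{
		JobID:    get("JobId"),
		JobName:  get("JobName"),
		MailType: get("MailType"),
		JobState: get("JobState"),
		ExitCode: get("ExitCode"),
	}
	// Same fallback as the commit: no state in the subject means the job is running.
	if p.JobState == "" {
		p.JobState = "RUNNING"
	}
	return p, nil
}

func main() {
	// Hypothetical subject lines, shaped to fit the pattern above.
	for _, s := range []string{
		"Slurm Job_id=1052 Name=test.sh Ended, Run time 00:01:05, COMPLETED, ExitCode 0",
		"Slurm Job_id=7 Name=my job Began, Queued time 00:00:02",
	} {
		p, err := parseRegularSubject(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", *p)
	}
}
```

Compiling the pattern once with `regexp.MustCompile` at package level also surfaces a malformed pattern at startup, whereas `regexp.Compile` with a discarded error would leave a nil `*Regexp` behind.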