Submitit jobs die with no error on cluster with SLURM 19.05 #1762

Open
mihdalal opened this issue Feb 10, 2024 · 1 comment

mihdalal commented Feb 10, 2024

I have been dealing with a particularly strange submitit problem that I am having trouble understanding: every job I launch through submitit dies after 7-10 hours without reporting an error. This only happens on our cluster running SLURM 19.05; on a different cluster running SLURM 20.11 the same jobs run fine for the entire allotted time. Are there specific SLURM settings that submitit needs in order to work? Is submitit incompatible with SLURM 19.05? Note that the problem is specific to jobs launched through submitit: I can launch sbatch jobs manually just fine, and srun also works on this cluster.

Here is a minimal reproducible example:

launch_script:

import submitit

slurm_additional_parameters = {
    "partition": "russ_reserved",
    "time": "3-00:00:00",
    "gpus": 1,
    "cpus_per_gpu": 20,
    "mem": 62,
}

def test():
    while True:
        pass

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="test_cluster_log")
# set the job name, then pass all parameters through to sbatch
slurm_additional_parameters["job_name"] = "test_cluster"
executor.update_parameters(slurm_additional_parameters=slurm_additional_parameters)
job = executor.submit(test)  # will run test() as a SLURM job
print(job.job_id)  # ID of your job

output:

slurmstepd: error: *** STEP 250338.0 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 250338 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
submitit WARNING (2024-02-10 03:41:37,635) - Bypassing signal SIGCONT
submitit WARNING (2024-02-10 03:41:37,636) - Bypassing signal SIGTERM
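
The two slurmstepd lines above say the step and the job were cancelled by SLURM because of a node failure, which suggests the cancellation came from the scheduler side rather than from the Python process. As a minimal sketch for collecting such events across many log files, the snippet below pulls the job ID, node, timestamp, and reason out of lines with this shape. The regular expression and the `parse_cancellations` helper are illustrative assumptions based only on the two lines shown here, not part of submitit or SLURM.

```python
import re

# Matches slurmstepd cancellation lines shaped like the two in the log above.
# The pattern is an assumption derived from those lines only.
CANCEL_RE = re.compile(
    r"\*\*\* (?P<kind>STEP|JOB) (?P<id>[\d.]+) ON (?P<node>\S+) "
    r"CANCELLED AT (?P<when>\S+) DUE TO (?P<reason>[A-Z ]+?)(?:,| \*\*\*)"
)


def parse_cancellations(log_text):
    """Return one dict per slurmstepd cancellation line found in log_text."""
    return [m.groupdict() for m in CANCEL_RE.finditer(log_text)]


sample_log = """\
slurmstepd: error: *** STEP 250338.0 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
slurmstepd: error: *** JOB 250338 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
"""

events = parse_cancellations(sample_log)
for event in events:
    print(event["kind"], event["id"], event["node"], event["when"], event["reason"])
```

Running this over every log in the submitit folder would show whether the failures cluster on particular nodes, which may help when asking the cluster admins to check the slurmctld log.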

submitit version: 1.5.1
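
One way to cross-check how SLURM itself recorded the end of the job is to query the accounting database with `sacct`. The sketch below only builds the command line for job 250338 from the log above; it assumes job accounting is enabled on the cluster, and `build_sacct_command` is a hypothetical helper name. A `State` of `NODE_FAIL` would be consistent with the "DUE TO NODE FAILURE" message.

```python
import shlex


def build_sacct_command(job_id):
    """Build an sacct invocation reporting how SLURM recorded the job's end."""
    fields = ["JobID", "State", "ExitCode", "NodeList", "Start", "End", "Elapsed"]
    return ["sacct", "-j", str(job_id), "--format=" + ",".join(fields)]


cmd = build_sacct_command(250338)
# Print the command so it can be pasted into a shell on the login node.
print(shlex.join(cmd))
```

The list form in `cmd` can also be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where `sacct` is available.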

@hannaribaspeeters

Hi, did you manage to solve this issue? I am encountering the same problem.
