
Allow proper resuming of slurm jobs on job failure #777

Closed
wants to merge 7 commits

Conversation

rayg1234
Collaborator

@rayg1234 rayg1234 commented Jul 19, 2024

T195276678

This fixes 2 issues that occur on a node failure (very frequent on the FAIR cluster):

  1. W&B uses a unique timestamp id as the unique id of the run, so when the job goes down, W&B does not know how to resume because it does not know which id to use.
  2. On a node failure, no explicit checkpoint is saved, so we need to tell the trainer to attempt to resume from the last available checkpoint. This is different from the preemption logic.
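For issue 1, the fix amounts to giving W&B a run id that is stable across requeues. A minimal sketch, assuming a hypothetical `wandb_init_kwargs` helper (the real PR wires this through the trainer's logger; `SLURM_JOB_ID` is the standard Slurm environment variable):

```python
import os


def wandb_init_kwargs(timestamp_id):
    # Hypothetical helper: use the Slurm job id as the W&B run id when
    # running under Slurm, so a requeued job resumes the same run via
    # resume="allow"; local runs fall back to the timestamp id.
    run_id = os.environ.get("SLURM_JOB_ID") or timestamp_id
    return {"id": run_id, "resume": "allow"}


# usage: wandb.init(project="my-project", **wandb_init_kwargs(timestamp_id))
```

With `resume="allow"`, `wandb.init` continues an existing run if one with that id exists, and otherwise starts a new one, which is exactly the behavior wanted after a node failure.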

The main changes here are:

  • Create a `unique_id` function to encapsulate the logic for when to use the Slurm id and when to use the timestamp id (for local runs)
  • Update W&B to use the Slurm id (via `unique_id`) as the job id when it exists; this allows W&B to auto-resume on failures
  • Update the checkpoint logic in task.py to attempt to load the last checkpoint from the known checkpoint dir
  • Refactor the paths so that all the data from a run is consolidated in one place, i.e. data is in:
    • run_dir/<unique_id>/logs (previously: run_dir/logs/timestamp_id)
    • run_dir/<unique_id>/checkpoints (previously: run_dir/checkpoints/timestamp_id)
    • run_dir/<unique_id>/results (previously: run_dir/results/timestamp_id)
  • Slurm logs are now in the same dir as the W&B logs (previously they were in different places)
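The id and path changes above can be sketched as follows. This is a hypothetical reconstruction, not the PR's actual code: the function names `unique_id` and `run_paths` follow the description, and the timestamp format is an assumption.

```python
import os
import time
from pathlib import Path


def unique_id() -> str:
    # Prefer the Slurm job id so a requeued job resolves to the same id
    # (and hence the same run directory); fall back to a timestamp id
    # for local runs, where there is no Slurm environment.
    return os.environ.get("SLURM_JOB_ID") or time.strftime("%Y-%m-%d-%H-%M-%S")


def run_paths(run_dir: str) -> dict:
    # Consolidated layout: logs, checkpoints, and results for one run
    # all live under run_dir/<unique_id>/.
    base = Path(run_dir) / unique_id()
    return {name: base / name for name in ("logs", "checkpoints", "results")}
```

Keying every per-run directory on the same id is what lets a restarted Slurm job find its own logs and checkpoints again.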
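The resume-from-last-checkpoint step could look like the sketch below, assuming a hypothetical `*.pt` naming scheme for checkpoint files (the PR's task.py logic is not shown here):

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    # After a node failure no explicit checkpoint is written, so the
    # trainer scans the known checkpoint dir and resumes from the file
    # with the newest modification time, or returns None to start fresh.
    ckpts = sorted(Path(checkpoint_dir).glob("*.pt"), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None
```

Because the checkpoint dir is derived from `unique_id`, a requeued job scans the same directory its previous incarnation wrote to.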

@rayg1234 rayg1234 changed the title add helpers Allow proper resuming of slurm jobs on job failure Jul 19, 2024
@rayg1234
Collaborator Author

Replaced by #803

@rayg1234 rayg1234 closed this Aug 13, 2024