Fix job resume on non-preemption type failures #803
Merged
mshuaibii previously approved these changes Aug 13, 2024
mshuaibii approved these changes Aug 13, 2024
See T192775876.
Slurm jobs that are preempted currently call a checkpoint callback and are requeued automatically by Slurm. Given a timestamp_id (unique job id) and a checkpoint, the job can resume correctly and continue the same run on wandb.
However, if a job experiences a "node failure" or another kind of failure that does not trigger the checkpoint callback, the Slurm job lacks the timestamp_id and checkpoint info it needs to resume properly, so it ends up starting a new job (and a new wandb run) instead.
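To illustrate the gap described above, here is a minimal sketch of one way a requeued job could recover its timestamp_id and latest checkpoint without relying on the checkpoint callback. It is not taken from this PR: the helper names, the JSON file layout, and the idea of keying the record by Slurm job id (on the assumption that a requeued job keeps its job id) are all hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Optional


def resume_state_path(run_dir: Path, slurm_job_id: str) -> Path:
    """Hypothetical location of the per-job resume record."""
    return run_dir / f"resume_{slurm_job_id}.json"


def save_resume_state(
    run_dir: Path, slurm_job_id: str, timestamp_id: str, checkpoint_path: str
) -> None:
    """Record the run identity and latest checkpoint; call after every checkpoint save."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = resume_state_path(run_dir, slurm_job_id)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(
        json.dumps({"timestamp_id": timestamp_id, "checkpoint_path": checkpoint_path})
    )
    # Atomic rename, so a node failure mid-write cannot leave a half-written record.
    os.replace(tmp, path)


def load_resume_state(run_dir: Path, slurm_job_id: str) -> Optional[dict]:
    """Return the saved record, or None when this job id has never checkpointed."""
    path = resume_state_path(run_dir, slurm_job_id)
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

At startup, a job could read its id from the SLURM_JOB_ID environment variable and call load_resume_state; a None result means a fresh run, anything else means resume from the recorded checkpoint under the recorded timestamp_id.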
Failure example
Example runs that failed and restarted as a new job from scratch when they should have resumed from the last checkpoint:
This PR modifies
Testing
We can use
scontrol requeue <job_id>
to simulate a "node failure":
Single-node job requeue
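For scripted tests, the requeue command above can be wrapped in a small helper. This is only a convenience sketch: the function names are hypothetical, and the only Slurm call it relies on is the scontrol requeue <job_id> command already shown.

```python
import shutil
import subprocess
from typing import List


def build_requeue_command(job_id: str) -> List[str]:
    """Build the argument list for `scontrol requeue <job_id>`."""
    if not job_id or any(ch.isspace() for ch in job_id):
        raise ValueError(f"invalid Slurm job id: {job_id!r}")
    return ["scontrol", "requeue", job_id]


def requeue_job(job_id: str) -> subprocess.CompletedProcess:
    """Requeue a job to simulate a failure that skips the checkpoint callback."""
    if shutil.which("scontrol") is None:
        raise RuntimeError("scontrol not found on PATH; run this on a Slurm node")
    # check=True raises CalledProcessError if scontrol exits with a non-zero status.
    return subprocess.run(
        build_requeue_command(job_id), check=True, capture_output=True, text=True
    )
```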
Example job that has been requeued where it did not trigger the checkpoint callback:
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-13-00-08-32-test_resume_requeue
Multi-node job requeue
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-13-20-20-16-test_resume_requeue_5
Note: in this case, if the last checkpoint was saved at step N and the most recent step written to wandb is N+K, the job will resume from step N and re-log the K steps that wandb already has, producing warnings like the following:
198 wandb: WARNING (User provided step: 544 is less than current step: 676. Dropping entry: {'train/grad_norm': 10.882038116455078, '_timestamp': 1723508783.0449212}).
It is very important to have deterministic training, so that when these steps are repeated the job ends up in the same state it would have reached without the interruption.
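The step arithmetic can be made concrete with a toy model. The class below is not wandb code; it is a simplified stand-in that, like the warning above, drops any entry whose step is lower than the current step. All names are hypothetical, and the step numbers 544 and 676 are taken from the warning.

```python
from typing import Dict, List


class MonotonicStepLog:
    """Toy logger that only accepts non-decreasing step numbers."""

    def __init__(self) -> None:
        self.current_step = 0
        self.entries: Dict[int, Dict[str, float]] = {}
        self.dropped: List[int] = []

    def log(self, step: int, metrics: Dict[str, float]) -> bool:
        """Store metrics for `step`; drop them if the step is behind the log."""
        if step < self.current_step:
            self.dropped.append(step)
            return False
        self.entries.setdefault(step, {}).update(metrics)
        self.current_step = step
        return True


def replay_after_resume(log: MonotonicStepLog, resume_step: int, end_step: int) -> int:
    """Re-log steps resume_step..end_step and return how many entries were dropped."""
    before = len(log.dropped)
    for step in range(resume_step, end_step + 1):
        log.log(step, {"train/grad_norm": 0.0})
    return len(log.dropped) - before


# First attempt logs up to step 676, then the job is requeued.
log = MonotonicStepLog()
for step in range(1, 677):
    log.log(step, {"train/grad_norm": 0.0})

# The resumed job starts logging again at step 544 and trains on to step 700.
dropped = replay_after_resume(log, resume_step=544, end_step=700)
```

In this model the resumed job drops the 132 entries for steps 544 through 675 and starts contributing new data at step 676. The dropped entries are harmless only if the replayed steps reproduce the original values, which is why determinism matters here.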
Pre-emption
Manual preemptions can be triggered by starting a job on a low-priority partition (e.g. scavenge) and then starting a new job that targets the node the first job is running on (e.g. using srun -w <node_id> ...).
Example job that was preempted where the checkpoint callback was triggered:
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-12-23-40-48-test_resume_node_preemption?nw=nwuserrgao