-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove destroy_process_group() from finally wrapper as it can hang #884
Conversation
Codecov ReportAttention: Patch coverage is
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @misko!
… into fix_distutils_cleanup
@@ -666,7 +666,9 @@ def run_relaxations(self, split="val"): | |||
) | |||
gather_results["chunk_idx"] = np.cumsum( | |||
[gather_results["chunk_idx"][i] for i in idx] | |||
)[:-1] # np.split does not need last idx, assumes n-1:end | |||
)[ | |||
:-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really wierd linter formatting but 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the exact linting error... Also agree very weird, I think the IDE is injecting it sometimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for investigating and figuring this out!
destroy_process_group() hangs and returns when a OOM CUDA error is raised internally. Instead of allowing the error to propagate it gets stuck in a never ending process and has a NCCL timeout.
Instead of calling destroy_process_group() when we know an exception already occurred, lets just crash out with the original exception.