remove destroy_process_group() from finally wrapper as it can hang #884

Merged: 6 commits into main from fix_distutils_cleanup, Oct 23, 2024

@misko (Collaborator) commented Oct 21, 2024

destroy_process_group() can hang instead of returning when a CUDA OOM error has been raised internally. Rather than letting the error propagate, the process gets stuck and only ends when it hits an NCCL timeout.

Instead of calling destroy_process_group() when we know an exception has already occurred, let's just crash out with the original exception.
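
A minimal sketch of the idea, not the actual change in utils.py: the context manager below runs cleanup only on the success path, so an exception raised in the body propagates right away. The names `distributed_context`, `setup`, and `cleanup` are hypothetical and exist only for this illustration; in a real job `cleanup` would presumably wrap `torch.distributed.destroy_process_group()`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def distributed_context(
    setup: Callable[[], None],
    cleanup: Callable[[], None],
) -> Iterator[None]:
    """Run setup, yield to the caller, and clean up only on success.

    Hypothetical helper: `setup` and `cleanup` are stand-ins for
    process-group initialization and teardown.
    """
    setup()
    # Deliberately no try/finally here. If the body of the `with` block
    # raises, the exception is re-raised at this yield and leaves the
    # generator, so cleanup() is never reached and a teardown call that
    # might block cannot hide the original error.
    yield
    # Only reached when the body finished without an exception.
    cleanup()
```

With this shape, a failing body such as one that raises `RuntimeError("CUDA out of memory")` surfaces that error directly, while a successful run still tears the process group down.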

@misko added the bug (Something isn't working) and patch (Patch version release) labels Oct 21, 2024

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 78.57143% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/fairchem/core/common/utils.py | 78.57% | 6 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/fairchem/core/common/utils.py | 68.79% <78.57%> (+0.04%) ⬆️ |

@lbluque (Collaborator) previously approved these changes Oct 22, 2024

thanks @misko!

@@ -666,7 +666,9 @@ def run_relaxations(self, split="val"):
                 )
                 gather_results["chunk_idx"] = np.cumsum(
                     [gather_results["chunk_idx"][i] for i in idx]
-                )[:-1]  # np.split does not need last idx, assumes n-1:end
+                )[
+                    :-1
Collaborator

this is really weird linter formatting but 🤷

Collaborator Author

That was the exact linting error... Also agree very weird, I think the IDE is injecting it sometimes

@rayg1234 (Collaborator) previously approved these changes Oct 23, 2024

thanks for investigating and figuring this out!

@misko added this pull request to the merge queue Oct 23, 2024
Merged via the queue into main with commit 712511f Oct 23, 2024
9 checks passed
@misko deleted the fix_distutils_cleanup branch October 23, 2024 17:29