-
Notifications
You must be signed in to change notification settings - Fork 859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll/han/alltoallv Bugfixes #12806
coll/han/alltoallv Bugfixes #12806
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: e6c34d3: coll/han/allreduce: squelch a particularly noisy l...
cb82581: coll/han/alltoallv: handle errors better
e3e9e1f: coll/han/alltoallv: correct loop condition on empt...
577968e: coll/han: alltoallv should fall-back if transform ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
A status update: These commits are good as they are, but I think I have found evidence of possible data corruption. I am working to identify and fix the issue, and then I'll add them to this PR. |
e6c34d3
to
c1ddbd5
Compare
c1ddbd5
to
4fafbfe
Compare
Hello! The Git Commit Checker CI bot found a few problems with this PR: a50519c: coll/han/alltoallv: Correct swapped send/recv data...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
18a5ada
to
bf267b3
Compare
@bosilca At long last this is ready for review again. Thank you for your patience. I have made a rather exhaustive test suite. I am trying to figure out the best place to add it. Perhaps to MTT I suppose. Any other suggestions? |
if (rc) { break; }; | ||
} | ||
if (rc) { | ||
opal_output_verbose(1, mca_coll_han_component.han_output, | ||
"Failed in alltoallv_sendrecv_w_direct_for_debugging: jloop=%d, rc=%d\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what debugging ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That function is inefficient and not used. However if you doubt that something is correct in all the business logic of the main sendrecv_w, then it is a easy code change to use this function to double-check.
Is there a better way to keep such a function but also keep it out of the way?
We take path 2 here, to avoid possible race conditions between the | ||
cancel and the possibility of a matching recv. | ||
|
||
After realizing that the convertor may decide to not fully pack the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly confused about the meaning of this ? The convertor might decide to avoid stopping in the middle of a predefined type. Is this what this comment is about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's what I intended to say. I observed a strange type consisting of 1 char and 1 int, and the convertor refused to use the last 3 bytes of the buffer, which threw off my original counting scheme.
Thank you George. I will be on a vacation for the next week and then I will address your comments! |
62dcda2
to
afe6275
Compare
@bosilca I believe I have addressed all your comments. Do you have any further review? |
please squash before merge. |
afe6275
to
77adf3b
Compare
Fix several bugs in the original implementation: - Fallback if transform is in-place - Fix bug related to empty messages - Introduce barrier before exit during error - Handle datatypes with negative lower bounds - Fix a completion-ordering bug - Cleanup some unused variables Signed-off-by: Luke Robison <[email protected]>
Allow the user to change the packing buffer size and how many maximum buffers will be allocated. Currently only han's alltoallv uses the buffers. Signed-off-by: Luke Robison <[email protected]>
77adf3b
to
69ac92b
Compare
This series of commits fixes several errors, and changes behavior on error conditions:
While debugging an application which exposed (2) above, I found that the rank which first experienced the error tried to cleanly exit with an error return code. However doing so while other ranks are accessing the shared memory region exposed by SMSC caused all the other local ranks to segfault. This hampered my progress during debug. To address it, I have moved the barrier that we previously only passed during successful exits, to be passed during all exit conditions. This will hang the execution instead of segfaulting other local ranks. I think this is better but I welcome other perspectives here.