-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consensus protocol #39
base: main
Are you sure you want to change the base?
consensus protocol #39
Conversation
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
1 similar comment
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
resolved conflict (cc @Sergei-Lebedev) |
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One general comment, I think we also need to remove TORCH_UCC_CHECK
in https://github.com/facebookresearch/torch_ucc/blob/main/src/torch_ucc_comm.cpp#L186, so application won't error out when UCC TIMEOUT is returned. Not sure how CI test passes through it.
When I tested the PR internally with some GPU training jobs (32 GPUs), I see some strange value of states (randomly).
Still debugging this, look like some garbage value from somewhere? |
9151606
to
37229ce
Compare
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
37229ce
to
91ddeaa
Compare
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
params.datatype = ucp_dt_make_contig(sizeof(torch_ucc_rank_state_t)); | ||
params.user_data = (void*)state; | ||
params.cb.send = [](void* request, ucs_status_t status, void* user_data) { | ||
torch_ucc_rank_state_t * state = (torch_ucc_rank_state_t*)user_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
is this required? we don't use state
at all here since we do delete user_data
next line. or do you intent to do delete state
?
Summary: Pull Request resolved: facebookresearch#39 Differential Revision: D31765741 Pulled By: kingchc fbshipit-source-id: 3932a76964b2d2a3412cb60796994e572041d690
Summary: Pull Request resolved: facebookresearch#39 Differential Revision: https://www.internalfb.com/diff/D31765741?entry_point=27 Pulled By: kingchc fbshipit-source-id: eb987398fc22f12200273934513cbdc24b5bc93f
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Summary: Pull Request resolved: facebookresearch#39 Reviewed By: srinivas212 Differential Revision: D31765741 Pulled By: kingchc fbshipit-source-id: 71dceaa79bab323cc07547289932496f31976f3b
Summary: Pull Request resolved: facebookresearch#39 Reviewed By: srinivas212 Differential Revision: D31765741 Pulled By: kingchc fbshipit-source-id: 140ac43108b8a9f97fdda622dc8910a0677afebd
if (work->request_->status == UCC_ERR_TIMED_OUT) { | ||
check_communicator_status(work->rank, work->comm_id, work->seq_num_, | ||
work->eps); | ||
} | ||
work->finalize(eptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sergei-Lebedev - I just realized the CI tests are failing because I suggested to let progress thread aborts immediately in check_communicator_status
and hence the main thread is not able to catch the exception. I think moving work->finalize(eptr)
before check_communicator_status
should fix it. WDYT?
if (work->request_->status == UCC_ERR_TIMED_OUT) { | |
check_communicator_status(work->rank, work->comm_id, work->seq_num_, | |
work->eps); | |
} | |
work->finalize(eptr); | |
work->finalize(eptr); | |
if (work->request_->status == UCC_ERR_TIMED_OUT) { | |
check_communicator_status(work->rank, work->comm_id, work->seq_num_, | |
work->eps); | |
} |
Summary: Pull Request resolved: facebookresearch#39 Reviewed By: srinivas212 Differential Revision: D31765741 Pulled By: kingchc fbshipit-source-id: b34a63a86d1de06ee214b882797e37c376f5ce03
@kingchc has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
No description provided.