coll: add coll_group to collective interfaces #7103

Open. hzhou wants to merge 27 commits into main from 2408_coll_group.
Conversation

@hzhou (Contributor) commented Aug 16, 2024

Pull Request Description

Make all (most) collective algorithms able to work within a subgroup.

  • Replace subcomms with a lightweight MPIR_Subgroup.
  • MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator, which makes it overly complex and inefficient to use for this purpose.
  • A communicator is a very complex object that carries all kinds of machinery such as attributes, hints, caches, context ids, etc. Maintaining sub-communicators compounds these complexities and is expensive. The eventual goal of this effort is to remove sub-communicators.
  • This PR provides a few examples.
  • Passing MPIR_SUBGROUP_NONE as the coll_group argument preserves the existing collective semantics, i.e. a whole-communicator collective.
typedef struct MPIR_Subgroup {
    int size;
    int rank;
    int *proc_table;
} MPIR_Subgroup; 
mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
                                                                 ^^^^^^^^^^
  • Replacing subcomm usage with subgroup
MPIR_Bcast(buf, count, datatype, root, node_comm, MPIR_ERR_NONE);

becomes

MPIR_Bcast(buf, count, datatype, root, comm, MPIR_SUBGROUP_NODE, MPIR_ERR_NONE);
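
As a hedged illustration (not code from this PR), a hierarchical composition could then chain two subgroup collectives on the parent communicator instead of calling into node_roots_comm/node_comm; MPIR_SUBGROUP_NODE_CROSS is an assumed name for the crossnode subgroup, and root handling is simplified:

/* sketch only: crossnode phase followed by intranode phase; assumes root == 0,
 * a real composition must map root into each subgroup; error checks elided */
mpi_errno = MPIR_Bcast(buf, count, datatype, 0, comm, MPIR_SUBGROUP_NODE_CROSS, errflag);
mpi_errno = MPIR_Bcast(buf, count, datatype, 0, comm, MPIR_SUBGROUP_NODE, errflag);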

One of the goals of this PR is to make all MPIR-layer intra-collectives coll_group aware.

  • inter-communicator collectives only work with MPIR_SUBGROUP_NONE (there are no group inter-collectives)
  • compositional algorithms (e.g. _smp) only work with MPIR_SUBGROUP_NONE; algorithm selection needs to make sure not to create a recursive compositional situation
  • all non-compositional intra algorithms should work with any coll_group, so they should not directly access (size, rank)-related fields in the communicator structure
    [skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2408_coll_group branch 10 times, most recently from 295e2e0 to 736c1d1 Compare August 23, 2024 03:19
@hzhou (Contributor, Author) commented Aug 23, 2024

test:mpich/ch4/most
test:mpich/ch3/most

Only ch4-ofi-shm has failures: 42 failures in allgather2

@hzhou hzhou force-pushed the 2408_coll_group branch 3 times, most recently from cefbfd9 to d8d297a Compare August 24, 2024 22:12
@hzhou (Contributor, Author) commented Aug 24, 2024

test:mpich/ch4/most
test:mpich/ch3/most

@hzhou (Contributor, Author) commented Aug 25, 2024

test:mpich/ch4/most
test:mpich/ch3/most

1 failure - ch4-ucx-external - [coll.01376 - ./coll/reduce 10 MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=release_gather MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE=131072 MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE=32768 MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS=4 MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL=8 MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE=knomial_2](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3401/jenkins_configure=external,label=ubuntu22.04_review/testReport/junit/(root)/coll/01376_____coll_reduce_10__MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM_release_gather_MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE_131072_MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE_32768_MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS_4_MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL_8_MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE_knomial_2/)

@hzhou hzhou marked this pull request as ready for review August 25, 2024 01:38
@hzhou hzhou requested a review from raffenet August 25, 2024 01:38
@hzhou (Contributor, Author) commented Aug 26, 2024

Trying to get a clean test:

test:mpich/ch4/ucx

@raffenet (Contributor) commented Sep 6, 2024

  • MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator, which makes it overly complex and inefficient to use for this purpose.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

@hzhou (Contributor, Author) commented Sep 6, 2024

  • MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator, which makes it overly complex and inefficient to use for this purpose.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

MPIR_Group is not associated with any communicator, unlike MPIR_Subgroup, which lives inside a communicator.

When we use MPIR_Group for communication:

  1. we need to find a communicator to provide its communication context
  2. we need to translate the lpids used in MPIR_Group into ranks in that communicator
  3. so it is more cumbersome to use and can easily confuse developers who are not familiar with it
  4. also, MPIR_Group represents and is constrained by MPI_Group, so it is less flexible to adapt to our internal usages

@hzhou hzhou force-pushed the 2408_coll_group branch 2 times, most recently from b3477a3 to b6cecd0 Compare September 12, 2024 22:56
@hzhou (Contributor, Author) commented Sep 12, 2024

test:mpich/ch4/most
test:mpich/ch3/most

Only 2 timeouts in ch4-ofi-default due to congestion:

datatype.01767 - ./datatype/large_type_sendrec 2 33 
coll.00127 - ./coll/gather_big 8

It does not take many instructions to calculate pof2 on the fly. Use of a
hard-coded pof2 prevents collective algorithms from being used for a
non-trivial coll_group.
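
For reference, a minimal sketch of such an on-the-fly computation; the helper name is illustrative, not necessarily what this commit uses:

/* largest power of two that is <= n (assumes n >= 1); a handful of shifts
 * and compares per call, so caching the value in the communicator buys little */
static inline int coll_pof2(int n)
{
    int pof2 = 1;
    while (pof2 <= n / 2) {
        pof2 <<= 1;
    }
    return pof2;
}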
Add a lightweight struct to describe sub-groups of a communicator. It is
intended to replace the subcomms.

Preset a set of reserved subgroups to simplify common usages such as the
intranode group and the crossnode group. Since we only expect a limited
number of dynamic subgroups and they should always be push/pop'ed within
their scope, we don't need many dynamic slots.
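
A hedged sketch of the reserved-slot layout this describes; only MPIR_SUBGROUP_NONE, MPIR_SUBGROUP_NODE, and MPIR_SUBGROUP_THREADCOMM are named elsewhere in this PR, the other identifiers below are assumptions:

/* illustrative layout of indices into comm->subgroups[]; dynamic subgroups
 * occupy a few slots after the reserved ones and are always pushed/popped
 * within the scope that uses them */
enum {
    MPIR_SUBGROUP_NONE = 0,     /* trivial group: the whole communicator */
    MPIR_SUBGROUP_NODE,         /* intranode group */
    MPIR_SUBGROUP_NODE_CROSS,   /* crossnode group (assumed name) */
    MPIR_SUBGROUP_THREADCOMM,   /* threadcomm collectives */
    MPIR_SUBGROUP_NUM_RESERVED, /* dynamic slots start here (assumed name) */
};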
Group collectives will have a non-trivial coll_group that alters the
effective rank and size of the communicator. These macros and functions
facilitate that.
Add coll_group, an index into comm->subgroups[], to all collectives except
neighborhood collectives. Assuming the device-layer collectives are not able
to handle a non-trivial coll_group, always fall back when
coll_group != MPIR_SUBGROUP_NONE, for now.

Also normalize the code style to use the fallback label. We should always
fall back to the MPIR impl routines rather than to the netmod routines
(composition_beta). The composition_beta path may itself fall back in the
future when netmod collectives become fancier, which would result in an
infinite loop.
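
A minimal sketch of the fallback-label style under the assumptions above; the function name is a placeholder, and the signature mirrors the MPIR_Bcast_impl call shown in the description:

/* placeholder sketch: device-layer Bcast that falls back to the MPIR impl
 * routine (never to a netmod/composition routine) for non-trivial coll_group */
int device_bcast_sketch(void *buffer, MPI_Aint count, MPI_Datatype datatype,
                        int root, MPIR_Comm * comm, int coll_group,
                        MPIR_Errflag_t errflag)
{
    int mpi_errno = MPI_SUCCESS;

    if (coll_group != MPIR_SUBGROUP_NONE) {
        /* device compositions are not coll_group aware for now */
        goto fallback;
    }

    /* ... device-optimized composition would go here ... */
    return mpi_errno;

  fallback:
    mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
    return mpi_errno;
}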
Make csel coll_group aware.
Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This
allows compositional collectives under threadcomm.
Call MPIR_Comm_is_parent_comm to prevent recursively entering
compositional algorithms such as the _smp algorithms. Check coll_group
as well, since we will switch to using subgroups rather than subcomms.
Also check num_external directly for a trivial comm. Subcomms and
comm->hierarchy_kind will be removed in the future.
Use MPIR_COLL_RANK_SIZE if the algorithm is topology-neutral.

Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology-dependent.
It adds an assertion that coll_group == MPIR_SUBGROUP_NONE, since a
coll_group may alter the topology assumptions.

Intercomm does not work with a non-zero coll_group.
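
A hedged sketch of what these two macros might expand to, assuming the comm->subgroups[] array introduced earlier and the existing rank/local_size fields of MPIR_Comm; the exact expansion in the PR may differ:

/* illustrative expansion only */
#define MPIR_COLL_RANK_SIZE(comm_, coll_group_, rank_, size_) \
    do { \
        if ((coll_group_) == MPIR_SUBGROUP_NONE) { \
            (rank_) = (comm_)->rank; \
            (size_) = (comm_)->local_size; \
        } else { \
            (rank_) = (comm_)->subgroups[coll_group_].rank; \
            (size_) = (comm_)->subgroups[coll_group_].size; \
        } \
    } while (0)

/* topology-dependent algorithms assert a trivial coll_group first */
#define MPIR_COLL_RANK_SIZE_NO_GROUP(comm_, coll_group_, rank_, size_) \
    do { \
        MPIR_Assert((coll_group_) == MPIR_SUBGROUP_NONE); \
        (rank_) = (comm_)->rank; \
        (size_) = (comm_)->local_size; \
    } while (0)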
Replace the usage of subcomms with subgroups.
When root is not local rank 0, instead of adding an extra intra-node
send/recv or bcast, construct an inter-group that includes the root
process.
Directly use information from MPIR_Process rather than from nodecomm in
MPIR_Process.

One step toward removing subcomms.
Now that we may run collectives on subgroups, we can't pre-prune the
csel trees based on communicator size or topology since that may change for
subgroups.

I don't think the performance benefit from the tree pruning is significant -- it
only saves a couple of levels of tree descent. But if we later decide
the efficiency from pruning is important, we can easily prune the trees at
the subgroup level and save the pruned trees to the MPIR_Group structure.
Use a single "cached_tree" rather than 3 different fields for each tree
type.
The topology-aware tree utilities need to check coll_group to obtain correct
world ranks.
Some algorithms, e.g. Allgather recexch, cache comm-size-related info in
the communicator and thus won't work with a non-trivial coll_group. Add a
restriction so they fall back when coll_group != MPIR_SUBGROUP_NONE.
All subgroup collectives should use the same tag within the parent
collective. This is because all processes in the communicator have to
agree on the tag to use, but group collectives may not involve all
processes. It is okay to use the same tag as long as the group
collectives are always issued in order. This is the case since all group
collectives are spawned under a parent collective, which has to obey the
non-overlapping rule.
Because the compiler can't figure out the arithmetic, it warns:
    ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0
    [-Wstringop-overflow=]

Refactor to suppress the warning and improve readability.
Commit ba1b4dd left an empty branch
that should be removed.
Update this code to use coll_group and apply some whitespace changes.
* stage, ranks exchange data within groups of size k in rounds with
* increasing distance (k, k^2, ...). Lastly, those in the main stage
* disperse the result back to the excluded ranks. Setting k according
* to the network hierarchy (e.g., the number of NICs in a node) can
@hzhou (Contributor, Author) commented on the quoted diff hunk:
I wonder how these escaped the whitespace checker in the original PR.

@hzhou (Contributor, Author) commented Nov 11, 2024

@zhenggb72 What is your suggested path forward?

@zhenggb72 (Collaborator) commented Nov 11, 2024

@zhenggb72 What is your suggested path forward?

I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you do with the subgroup is up to you. I don't know much about the motivation and use cases of subgroups, but there are a few possible solutions: for a subgroup, you can choose to skip CSEL and go straight to the MPIR auto or fallback algorithms, or you can choose not to prune the tree and use the global tree.

@hzhou hzhou removed the 4.3.0b1 label Nov 11, 2024
@hzhou (Contributor, Author) commented Nov 11, 2024

We are delaying this PR to 4.4.
To help merge this PR, we'll need more performance measurements to quantify the memory and performance impact, and more input to ensure everyone is on board.
