coll: add coll_group to collective interfaces #7103

Open. hzhou wants to merge 27 commits into main from 2408_coll_group.
Conversation

@hzhou (Contributor) commented Aug 16, 2024

Pull Request Description

Make all (most) collective algorithms able to work within a subgroup.

  • Replace subcomms with a lightweight MPIR_Subgroup.
  • MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator, which makes it overly complex and inefficient to use for this purpose.
  • A communicator is a very complex object that carries all kinds of machinery such as attributes, hints, caches, context ids, etc. Maintaining sub-communicators compounds these complexities and is expensive. The eventual goal of this effort is to remove sub-communicators.
  • This PR provides a few examples.
  • Passing MPIR_SUBGROUP_NONE as the coll_group argument preserves the existing collective semantics, i.e. a whole-communicator collective.
typedef struct MPIR_Subgroup {
    int size;
    int rank;
    int *proc_table;
} MPIR_Subgroup; 
mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
                                                                 ^^^^^^^^^^
  • Replacing subcomm usage with subgroup
MPIR_Bcast(buf, count, datatype, root, node_comm, MPIR_ERR_NONE);

becomes

MPIR_Bcast(buf, count, datatype, root, comm, MPIR_SUBGROUP_NODE, MPIR_ERR_NONE);
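
As a hedged illustration (not code from this PR), a hierarchical composition could then chain two subgroup collectives on the parent communicator instead of calling into node_roots_comm/node_comm; MPIR_SUBGROUP_NODE_CROSS is an assumed name for the crossnode subgroup, and root handling is simplified:

/* sketch only: crossnode phase followed by intranode phase; assumes root == 0,
 * a real composition must map root into each subgroup; error checks elided */
mpi_errno = MPIR_Bcast(buf, count, datatype, 0, comm, MPIR_SUBGROUP_NODE_CROSS, errflag);
mpi_errno = MPIR_Bcast(buf, count, datatype, 0, comm, MPIR_SUBGROUP_NODE, errflag);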

One of the goals of this PR is to make all MPIR-layer intra-collectives coll_group aware.

  • inter-communicator collectives only work with MPIR_SUBGROUP_NONE (there are no group inter-collectives)
  • compositional algorithms (e.g. _smp) only work with MPIR_SUBGROUP_NONE; algorithm selection needs to make sure not to create a recursive compositional situation
  • all non-compositional intra algorithms should work with any coll_group, so they should not directly access (size, rank)-related fields in the communicator structure
    [skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2408_coll_group branch 10 times, most recently from 295e2e0 to 736c1d1 Compare August 23, 2024 03:19
@hzhou (Contributor, Author) commented Aug 23, 2024

test:mpich/ch4/most
test:mpich/ch3/most

Only ch4-ofi-shm has failures: 42 failures in allgather2

@hzhou hzhou force-pushed the 2408_coll_group branch 3 times, most recently from cefbfd9 to d8d297a Compare August 24, 2024 22:12
@hzhou (Contributor, Author) commented Aug 24, 2024

test:mpich/ch4/most
test:mpich/ch3/most

@hzhou (Contributor, Author) commented Aug 25, 2024

test:mpich/ch4/most
test:mpich/ch3/most

1 failure - ch4-ucx-external - [coll.01376 - ./coll/reduce 10 MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=release_gather MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE=131072 MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE=32768 MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS=4 MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL=8 MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE=knomial_2](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3401/jenkins_configure=external,label=ubuntu22.04_review/testReport/junit/(root)/coll/01376_____coll_reduce_10__MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM_release_gather_MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE_131072_MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE_32768_MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS_4_MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL_8_MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE_knomial_2/)

@hzhou hzhou marked this pull request as ready for review August 25, 2024 01:38
@hzhou hzhou requested a review from raffenet August 25, 2024 01:38
@hzhou (Contributor, Author) commented Aug 26, 2024

Trying to get a clean test:

test:mpich/ch4/ucx

@raffenet (Contributor) commented Sep 6, 2024

  • MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator, which makes it overly complex and inefficient to use for this purpose.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

@hzhou (Contributor, Author) commented Sep 6, 2024

  • MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator, which makes it overly complex and inefficient to use for this purpose.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

MPIR_Group is not associated with any communicator, unlike MPIR_Subgroup, which lives inside a communicator.

When we use MPIR_Group for communication:

  1. we need to find a communicator to provide its communication context
  2. we need to translate the lpids used in MPIR_Group into ranks in that communicator
  3. so it is more cumbersome to use and can easily confuse developers who are not familiar with it
  4. also, MPIR_Group represents and is constrained by MPI_Group, so it is less flexible to adapt to our internal usages

@hzhou hzhou force-pushed the 2408_coll_group branch 2 times, most recently from b3477a3 to b6cecd0 Compare September 12, 2024 22:56
@hzhou (Contributor, Author) commented Sep 12, 2024

test:mpich/ch4/most
test:mpich/ch3/most

Only 2 timeouts in ch4-ofi-default due to congestion:

datatype.01767 - ./datatype/large_type_sendrec 2 33 
coll.00127 - ./coll/gather_big 8

It does not take many instructions to calculate pof2 on the fly. Use of a
hard-coded pof2 prevents collective algorithms from being used for a
non-trivial coll_group.
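
For reference, a minimal sketch of such an on-the-fly computation; the helper name is illustrative, not necessarily what this commit uses:

/* largest power of two that is <= n (assumes n >= 1); a handful of shifts
 * and compares per call, so caching the value in the communicator buys little */
static inline int coll_pof2(int n)
{
    int pof2 = 1;
    while (pof2 <= n / 2) {
        pof2 <<= 1;
    }
    return pof2;
}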
Add a lightweight struct to describe sub-groups of a communicator. It is
intended to replace the subcomms.

Preset a set of reserved subgroups to simplify common usages such as the
intranode group and the crossnode group. Since we only expect a limited
number of dynamic subgroups and they should always be push/pop'ed within
their scope, we don't need many dynamic slots.
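
A hedged sketch of the reserved-slot layout this describes; only MPIR_SUBGROUP_NONE, MPIR_SUBGROUP_NODE, and MPIR_SUBGROUP_THREADCOMM are named elsewhere in this PR, the other identifiers below are assumptions:

/* illustrative layout of indices into comm->subgroups[]; dynamic subgroups
 * occupy a few slots after the reserved ones and are always pushed/popped
 * within the scope that uses them */
enum {
    MPIR_SUBGROUP_NONE = 0,     /* trivial group: the whole communicator */
    MPIR_SUBGROUP_NODE,         /* intranode group */
    MPIR_SUBGROUP_NODE_CROSS,   /* crossnode group (assumed name) */
    MPIR_SUBGROUP_THREADCOMM,   /* threadcomm collectives */
    MPIR_SUBGROUP_NUM_RESERVED, /* dynamic slots start here (assumed name) */
};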
Group collectives will have a non-trivial coll_group that alters the
effective rank and size of the communicator. These macros and functions
facilitate that.
Add coll_group, an index into comm->subgroups[], to all collectives except
neighborhood collectives. Assuming the device-layer collectives are not able
to handle a non-trivial coll_group, always fall back when
coll_group != MPIR_SUBGROUP_NONE, for now.

Also normalize the code style to use the fallback label. We should always
fall back to the MPIR impl routines rather than to the netmod routines
(composition_beta). The composition_beta path may itself fall back in the
future when netmod collectives become fancier, which would result in an
infinite loop.
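
A minimal sketch of the fallback-label style under the assumptions above; the function name is a placeholder, and the signature mirrors the MPIR_Bcast_impl call shown in the description:

/* placeholder sketch: device-layer Bcast that falls back to the MPIR impl
 * routine (never to a netmod/composition routine) for non-trivial coll_group */
int device_bcast_sketch(void *buffer, MPI_Aint count, MPI_Datatype datatype,
                        int root, MPIR_Comm * comm, int coll_group,
                        MPIR_Errflag_t errflag)
{
    int mpi_errno = MPI_SUCCESS;

    if (coll_group != MPIR_SUBGROUP_NONE) {
        /* device compositions are not coll_group aware for now */
        goto fallback;
    }

    /* ... device-optimized composition would go here ... */
    return mpi_errno;

  fallback:
    mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
    return mpi_errno;
}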
Make csel coll_group aware.
Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This
allows compositional collectives under threadcomm.
Call MPIR_Comm_is_parent_comm to prevent recursively entering
compositional algorithms such as the _smp algorithms. Check coll_group
as well, since we will switch to using subgroups rather than subcomms.
Also check num_external directly for a trivial comm. Subcomms and
comm->hierarchy_kind will be removed in the future.
Use MPIR_COLL_RANK_SIZE if the algorithm is topology-neutral.

Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology-dependent.
It adds an assertion that coll_group == MPIR_SUBGROUP_NONE, since a
coll_group may alter the topology assumptions.

Intercomm does not work with a non-zero coll_group.
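
A hedged sketch of what these two macros might expand to, assuming the comm->subgroups[] array introduced earlier and the existing rank/local_size fields of MPIR_Comm; the exact expansion in the PR may differ:

/* illustrative expansion only */
#define MPIR_COLL_RANK_SIZE(comm_, coll_group_, rank_, size_) \
    do { \
        if ((coll_group_) == MPIR_SUBGROUP_NONE) { \
            (rank_) = (comm_)->rank; \
            (size_) = (comm_)->local_size; \
        } else { \
            (rank_) = (comm_)->subgroups[coll_group_].rank; \
            (size_) = (comm_)->subgroups[coll_group_].size; \
        } \
    } while (0)

/* topology-dependent algorithms assert a trivial coll_group first */
#define MPIR_COLL_RANK_SIZE_NO_GROUP(comm_, coll_group_, rank_, size_) \
    do { \
        MPIR_Assert((coll_group_) == MPIR_SUBGROUP_NONE); \
        (rank_) = (comm_)->rank; \
        (size_) = (comm_)->local_size; \
    } while (0)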
Replace the usage of subcomms with subgroups.
When root is not local rank 0, instead of adding an extra intra-node
send/recv or bcast, construct an inter-group that includes the root
process.
Directly use information from MPIR_Process rather than from nodecomm in
MPIR_Process.

One step toward removing subcomms.
Now that we may run collectives on subgroups, we can't pre-prune the
csel trees based on communicator size or topology since that may change for
subgroups.

I don't think the performance benefit from the tree pruning is significant -- it
only saves a couple of levels of tree descent. But if we later decide
the efficiency from pruning is important, we can easily prune the trees at
the subgroup level and save the pruned trees to the MPIR_Group structure.
Use a single "cached_tree" rather than 3 different fields for each tree
type.
The topology-aware tree utilities need to check coll_group to obtain correct
world ranks.
Some algorithms, e.g. Allgather recexch, cache comm-size-related info in
the communicator and thus won't work with a non-trivial coll_group. Add a
restriction so they fall back when coll_group != MPIR_SUBGROUP_NONE.
All subgroup collectives should use the same tag within the parent
collective. This is because all processes in the communicator have to
agree on the tag to use, but group collectives may not involve all
processes. It is okay to use the same tag as long as the group
collectives are always issued in order. This is the case since all group
collectives are spawned under a parent collective, which has to obey the
non-overlapping rule.
Because the compiler can't figure out the arithmetic, it warns:
    ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0
    [-Wstringop-overflow=]

Refactor to suppress the warning and improve readability.
Commit ba1b4dd left an empty branch
that should be removed.
Update this code to use coll_group and apply some whitespace changes.
* stage, ranks exchange data within groups of size k in rounds with
* increasing distance (k, k^2, ...). Lastly, those in the main stage
* disperse the result back to the excluded ranks. Setting k according
* to the network hierarchy (e.g., the number of NICs in a node) can
@hzhou (Contributor, Author) commented on the quoted diff hunk:
I wonder how these escaped the whitespace checker in the original PR.

@hzhou (Contributor, Author) commented Nov 11, 2024

@zhenggb72 What is your suggested path forward?

@zhenggb72 (Collaborator) commented Nov 11, 2024

@zhenggb72 What is your suggested path forward?

I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you do with the subgroup is up to you. I don't know much about the motivation and use cases of subgroups, but there are a few possible solutions: for a subgroup, you can choose to skip CSEL and go straight to the MPIR auto or fallback algorithms, or you can choose not to prune the tree and use the global tree.

@hzhou hzhou removed the 4.3.0b1 label Nov 11, 2024
@hzhou (Contributor, Author) commented Nov 11, 2024

We are delaying this PR to 4.4.
To help merge this PR, we'll need more performance measurements to quantify the memory and performance impact, and more input to ensure everyone is on board.
