Nonblocking Collectives #456

Open
wants to merge 27 commits into
base: main
4b9f48d
Adding first draft for nb collective intro, broadcast, and test
manjugv Oct 26, 2020
bb13dd6
Update: Addressing feedback from WG - Oct 26
manjugv Oct 26, 2020
df261dd
Adding NB blocking a2a; minor updates
manjugv Jan 21, 2022
48a012c
Adding collective wait
manjugv Jan 19, 2023
d3a7ac9
Adding shmem barrier all nonblocking
manjugv Mar 13, 2023
a3c8b15
Fixing description of SHMEM Collective Wait
manjugv Mar 31, 2023
1becb00
Change collective wait to shmem_req_wait
manjugv Mar 31, 2023
1cbd79a
Fix description of request param
manjugv Mar 31, 2023
f98cdf7
Add req object to NB barrier arguments
manjugv Sep 14, 2023
cc72331
Remove tagged collectives
manjugv Sep 28, 2023
a972738
Add missing 32 and 64 bit interfaces; minor edits
manjugv Sep 29, 2023
db8528c
Update content/nb_collectives_intro.tex
manjugv Sep 29, 2023
558e0d3
Update content/shmem_alltoall_nb.tex
manjugv Sep 29, 2023
ef3ecf1
Update content/shmem_barrier_all_nb.tex
manjugv Sep 29, 2023
db79fcb
Update content/shmem_broadcast_nb.tex
manjugv Sep 29, 2023
7d74314
Update content/shmem_broadcast_nb.tex
manjugv Sep 29, 2023
2355bf8
Remove 32 and 64 bit nb variants
manjugv Sep 30, 2023
b15c727
Minor edits: Remove active set mention
manjugv Sep 30, 2023
030a019
Make shmem_req_test/wait functions neutral
manjugv Sep 30, 2023
82e5d54
Update content/nb_collectives_intro.tex
Apr 12, 2024
fb0b482
Update content/shmem_alltoall_nb.tex
Apr 12, 2024
930b39c
Update content/shmem_broadcast_nb.tex
Apr 12, 2024
96f0b97
Update content/shmem_barrier_all_nb.tex
Apr 12, 2024
ff96cf5
Update content/shmem_collective_test.tex
Apr 12, 2024
ef49e8f
Update content/shmem_collective_wait.tex
Apr 12, 2024
e40b263
Minor update to nb_collectives_intro
May 10, 2024
02a539a
Inclusion of text from PR #507
Aug 23, 2024
11 changes: 9 additions & 2 deletions content/execution_model.tex
@@ -32,13 +32,20 @@ \subsection{Progress of OpenSHMEM Operations}\label{subsec:progress}

 The \openshmem model assumes that computation and communication are naturally
 overlapped. \openshmem programs are expected to exhibit progression of
-communication both with and without \openshmem calls. Consider a \ac{PE} that is
+communication both with and without \openshmem calls. For point-to-point
+operations, consider a \ac{PE} that is
 engaged in a computation with no \openshmem calls. Other \acp{PE} should be able
 to communicate (e.g., \OPR{put}, \OPR{get}, \OPR{atomic}, etc.) and
 complete communication operations with that computationally-bound \ac{PE}
 without that \ac{PE} issuing any explicit \openshmem calls. One-sided \openshmem
 communication calls involving that \ac{PE} should progress regardless of when
-that \ac{PE} next engages in an \openshmem call.
+that \ac{PE} next engages in an \openshmem call. Similarly,
+for nonblocking collectives, consider the \acp{PE} that are part of a team
+issuing a nonblocking collective and overlapping collective completion with
+computation. Once a nonblocking collective operation is initiated by
+all of the \acp{PE} in the team of the collective, any \ac{PE} in the team must
+eventually observe completion through a call to \FUNC{shmem\_req\_test} or a
+call to \FUNC{shmem\_req\_wait}.
 
 \parimpnotes{
 An \openshmem implementation for hardware that does not provide
9 changes: 9 additions & 0 deletions content/library_constants.tex
@@ -84,6 +84,15 @@
See Section~\ref{subsec:shmem_ctx_create} for more detail about its use.
\tabularnewline \hline
%%
\LibConstDecl{SHMEM\_REQ\_INVALID} &
A value corresponding to an invalid request handle.
This value can be used to initialize or update request handles to indicate
that they do not reference a valid request.
When managed in this way, applications can use an equality comparison
to test whether a given request handle references a valid request.
See Section~\ref{subsec:nb_coll} for more detail about its use.
\tabularnewline \hline
%%
\LibConstDecl{SHMEM\_SIGNAL\_SET} &
An integer constant expression corresponding to the signal update set operation.
See Section~\ref{subsec:shmem_put_signal} and
29 changes: 29 additions & 0 deletions content/nb_collectives_intro.tex
@@ -0,0 +1,29 @@
An \openshmem nonblocking collective operation, like a blocking collective
operation, is a group communication operation among the
participants of the team. All \acp{PE} in the team are required to call the
collective operation and each collective operation must be initiated in the same
order across all \acp{PE} while the execution may be performed in any order.

\begin{enumerate}

\item Invocation semantics: Upon invocation of a nonblocking collective routine,
the operation is initiated and the routine returns without ensuring completion. All \acp{PE} in the team
must call this routine with identical arguments.

\item Collective Types: The nonblocking variants supported include the alltoall
and broadcast collectives. All other collective operations such as
reductions, collect, fcollect, barrier, barrier all, alltoalls, sync, and sync all will not have nonblocking variants.

Collaborator
@davidozog Sep 29, 2023
Should sync_all be supported? My hunch is it seems possibly useful...

\item Completion semantics: \openshmem programs can learn the status of the collective operations
using the \FUNC{shmem\_req\_test} routine. The operation is completed after
a call to \FUNC{shmem\_req\_test} or a call to \FUNC{shmem\_req\_wait}.

Contributor
Similar to what @kwaters4 mentioned. Add something like "Completion of the operation can be observed through one or more calls to \FUNC{shmem_req_test} or a single call to \FUNC{shmem_req_wait}."

\item Threads: When using SHMEM\_THREAD\_MULTIPLE, an \openshmem program
must not invoke multiple collective operations on the same team concurrently
from different threads.

\end{enumerate}

Note: Like other nonblocking \openshmem operations, the implementations are
expected to asynchronously progress the collective operations. The guidance on
asynchronous progress is provided in Section \ref{subsec:progress}.
123 changes: 123 additions & 0 deletions content/shmem_alltoall_nb.tex
@@ -0,0 +1,123 @@
\apisummary{
Exchanges a fixed amount of contiguous data blocks between all pairs
of \acp{PE} participating in the collective routine.
Contributor
Since we mention later that "All PEs in the provided team must participate in the collective.", can we just say that here, instead of keeping it in the abstract?
i.e. "Exchanges a fixed amount of contiguous data blocks between all PEs on the specified team."

}

\begin{apidefinition}

%% C11
\begin{C11synopsis}
int @\FuncDecl{shmem\_alltoall\_nb}@(shmem_team_t team, TYPE *dest, const TYPE
*source, size_t nelems, shmem_req_h *request);
\end{C11synopsis}
where \TYPE{} is one of the standard \ac{RMA} types specified by Table \ref{stdrmatypes}.

\begin{Csynopsis}
\end{Csynopsis}
\begin{CsynopsisCol}
int @\FuncDecl{shmem\_\FuncParam{TYPENAME}\_alltoall\_nb}@(shmem_team_t team,
TYPE *dest, const TYPE *source, size_t nelems, shmem_req_h *request);
\end{CsynopsisCol}
where \TYPE{} is one of the standard \ac{RMA} types and has a corresponding \TYPENAME{} specified by Table \ref{stdrmatypes}.

\begin{CsynopsisCol}
int @\FuncDecl{shmem\_alltoallmem\_nb}@(shmem_team_t team, void *dest, const
void *source, size_t nelems, shmem_req_h *request);
\end{CsynopsisCol}

\begin{apiarguments}

\apiargument{IN}{team}{A valid \openshmem team handle.}%

\apiargument{OUT}{dest}{Symmetric address of a data object large enough to receive
the combined total of \VAR{nelems} elements from each \ac{PE} in the
team.
The type of \dest{} should match that implied in the SYNOPSIS section.}
\apiargument{IN}{source}{Symmetric address of a data object that contains \VAR{nelems}
elements of data for each \ac{PE} in the team, ordered according to
destination \ac{PE}.
The type of \source{} should match that implied in the SYNOPSIS section.}
\apiargument{IN}{nelems}{
The number of elements to exchange for each \ac{PE}.
For \FUNC{shmem\_alltoallmem\_nb} it represents bytes.
}
\apiargument{OUT}{request}{An opaque request handle identifying the collective
operation.}

\end{apiarguments}

\apidescription{
The \FUNC{shmem\_alltoall\_nb} routines are collective routines. All
\acp{PE} in the provided team must participate in the collective. If
\VAR{team} compares equal to \LibConstRef{SHMEM\_TEAM\_INVALID} or is
otherwise invalid, the behavior is undefined.

{\bf Invocation and completion}: A call to the nonblocking alltoall routine initiates the operation and returns
immediately without necessarily completing the operation. On success,
an opaque request handle is created and returned. The
operation is completed after a call to \FUNC{shmem\_req\_test} or
a call to \FUNC{shmem\_req\_wait}. When the operation is complete, the request handle
is deallocated and cannot be reused.
Contributor
@kwaters4 Jul 31, 2024
To my understanding, this is currently worded in a way that both shmem_req_test and shmem_req_wait guarantee completion.

I believe the correct language exists in the intro on lines 17 and 18.

\item Completion semantics: \openshmem programs can learn the status of the collective operations using the \FUNC{shmem\_req\_test} routine and can be completed using the \FUNC{shmem\_req\_wait} routine.

Reply:
@kwaters4 line 17 and 18 of the intro (completion semantics) states:

\item Completion semantics: \openshmem programs can learn the status of the collective operations using the \FUNC{shmem\_req\_test} routine. The operation is completed after a call to \FUNC{shmem\_req\_test} or a call to \FUNC{shmem\_req\_wait}.

This is the same as here other than the ability to observe the status of the collective. I believe the text you are quoting is older.

Contributor
@kwaters4 Aug 14, 2024
Maybe I misunderstand.

I thought shmem_req_test only checks the status of the collective, and the program then continues even if the nonblocking call is incomplete, while shmem_req_wait waits and guarantees completion.

Reply:
shmem_req_test checks for completion and if complete, will deallocate the request handle, set the handle to SHMEM_REQ_INVALID, and returns 0.

I feel like this wording may be the confusing part:

The operation is completed after a call to \FUNC{shmem\_req\_test} or a call to \FUNC{shmem\_req\_wait}.

Maybe this wording is more clear:
Completion of the operation can be observed through one or more calls to \FUNC{shmem\_req\_test} or a single call to \FUNC{shmem\_req\_wait}.


Although the invocation and completion semantics of nonblocking alltoall differ
from those of blocking alltoall, the data exchange semantics are similar.

{\bf Data exchange semantics}:
In this routine, each \ac{PE}
participating in the operation exchanges \VAR{nelems} data elements
with all other \acp{PE} participating in the operation.
The size of a data element is:
\begin{itemize}
\item 8 bits for \FUNC{shmem\_alltoallmem\_nb}
\item \FUNC{sizeof}(\TYPE{}) for alltoall routines taking typed \VAR{source} and \VAR{dest}
\end{itemize}

The data being sent and received are
stored in a contiguous symmetric data object. The total size of each \ac{PE}'s
\VAR{source} object and \VAR{dest} object is \VAR{nelems} times the size of
an element
times \VAR{N}, where \VAR{N} equals the number of \acp{PE} participating
in the operation.
The \VAR{source} object contains \VAR{N} blocks of data
(where the size of each block is defined by \VAR{nelems}) and each block of data
is sent to a different \ac{PE}.

The same \dest{} and \source{}
arrays and the same value of \VAR{nelems}
must be passed by all \acp{PE} that participate in the collective.

Given a \ac{PE} \VAR{i} that is the \kth \ac{PE}
participating in the operation and a \ac{PE}
\VAR{j} that is the \lth \ac{PE}
participating in the operation,

\ac{PE} \VAR{i} sends the \lth block of its \VAR{source} object to
the \kth block of
the \VAR{dest} object of \ac{PE} \VAR{j}.


Like data exchange semantics, the entry and completion
criteria of blocking and nonblocking alltoall are similar.

{\bf Entry criteria}: Before any \ac{PE} calls a \FUNC{shmem\_alltoall\_nb} routine,
Contributor
This entry criteria is confusing - if we have the same entry criteria as blocking operations - then the users are supposed to make sure the src/dst buffers are available at the point of entry - meaning the users are expected to sync with themselves before launching the nb collective operation.

AFAIU - the src/dst needs to be available when everyone in the team reaches the state - which can be determined during runtime - without the need for explicit sync before launching the nb collective operation.

Please clarify the intended semantics.

Collaborator Author
Naveen, my intent is to keep it consistent with the semantics of blocking collectives. I wanted to improve the description. I don’t see how the current wording is violating that. If the wording doesn’t reflect that, then we can fix it.

the following condition must be ensured:
\begin{itemize}
\item The \VAR{dest} data object on all \acp{PE} in the team is
ready to accept the \FUNC{shmem\_alltoall\_nb} data.
\end{itemize}
Otherwise, the behavior is undefined.

{\bf Completion criteria}: Upon completion, the following is true for
the local \ac{PE}:
\begin{itemize}
\item Its \VAR{dest} symmetric data object is completely updated and
the data has been copied out of the \VAR{source} data object.
\end{itemize}
}

\apireturnvalues{
Zero on successful local completion. Nonzero otherwise.
}

\end{apidefinition}
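
The block placement rule in the data exchange semantics above can be checked with a small single-process model. This is only an illustrative sketch in plain C, not an OpenSHMEM program: it makes no library calls, the names `model_fill_source`, `model_alltoall`, `model_check`, `NPES`, and `NELEMS` are hypothetical, and the two-dimensional arrays merely stand in for the symmetric `source` and `dest` objects of each PE in the team.

```c
#include <assert.h>

#define NPES   4  /* assumed team size N, chosen for illustration */
#define NELEMS 2  /* assumed number of elements per block */

/* source[r] and dest[r] model the symmetric objects of the PE with team
 * rank r. Each holds NPES blocks of NELEMS elements, i.e. the total size
 * is NELEMS * sizeof(long) * NPES bytes per object. */
static long source[NPES][NPES * NELEMS];
static long dest[NPES][NPES * NELEMS];

/* Encode (sender rank k, destination rank l, element index e) in each value. */
void model_fill_source(void)
{
    for (int k = 0; k < NPES; k++)
        for (int l = 0; l < NPES; l++)
            for (int e = 0; e < NELEMS; e++)
                source[k][l * NELEMS + e] = 100L * k + 10L * l + e;
}

/* The k-th PE sends the l-th block of its source object to the k-th block
 * of the dest object of the l-th PE. */
void model_alltoall(void)
{
    for (int k = 0; k < NPES; k++)
        for (int l = 0; l < NPES; l++)
            for (int e = 0; e < NELEMS; e++)
                dest[l][k * NELEMS + e] = source[k][l * NELEMS + e];
}

/* Verify that block k of rank j's dest came from rank k and was addressed to j. */
void model_check(void)
{
    for (int j = 0; j < NPES; j++)
        for (int k = 0; k < NPES; k++)
            for (int e = 0; e < NELEMS; e++)
                assert(dest[j][k * NELEMS + e] == 100L * k + 10L * j + e);
}
```

Calling `model_fill_source()`, `model_alltoall()`, and `model_check()` in that order exercises the rule. With the encoding used here, block 3 of the `dest` object of rank 1 begins with the value 310, that is, the data rank 3 addressed to rank 1.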

103 changes: 103 additions & 0 deletions content/shmem_broadcast_nb.tex
@@ -0,0 +1,103 @@
\apisummary{
Broadcasts a block of data from one \ac{PE} to one or more destination
\acp{PE}.
}

\begin{apidefinition}

%% C11
\begin{C11synopsis}
int @\FuncDecl{shmem\_broadcast\_nb}@(shmem_team_t team, TYPE *dest, const TYPE
*source, size_t nelems, int PE_root, shmem_req_h *request);
\end{C11synopsis}
where \TYPE{} is one of the standard \ac{RMA} types specified by Table \ref{stdrmatypes}.

%% C/C++
\begin{Csynopsis}
\end{Csynopsis}
\begin{CsynopsisCol}
int @\FuncDecl{shmem\_\FuncParam{TYPENAME}\_broadcast\_nb}@(shmem_team_t team, TYPE
*dest, const TYPE *source, size_t nelems, int PE_root, shmem_req_h *request);
\end{CsynopsisCol}
where \TYPE{} is one of the standard \ac{RMA} types and has a corresponding \TYPENAME{} specified by Table \ref{stdrmatypes}.

\begin{CsynopsisCol}
int @\FuncDecl{shmem\_broadcastmem\_nb}@(shmem_team_t team, void *dest, const void
*source, size_t nelems, int PE_root, shmem_req_h *request);
\end{CsynopsisCol}

\begin{apiarguments}

\apiargument{IN}{team}{The team over which to perform the operation.}%

\apiargument{OUT}{dest}{Symmetric address of destination data object.
The type of \dest{} should match that implied in the SYNOPSIS section.}
\apiargument{IN}{source}{Symmetric address of the source data object.
The type of \source{} should match that implied in the SYNOPSIS section.}
\apiargument{IN}{nelems}{
The number of elements in \source{} and \dest{} arrays.
For \FUNC{shmem\_broadcastmem\_nb}, elements are bytes.
}
\apiargument{IN}{PE\_root}{Zero-based ordinal of the \ac{PE}, with respect to
the team, from which the data is copied.}
\apiargument{OUT}{request}{An opaque request handle identifying the collective
operation.}


\end{apiarguments}

\apidescription{
\openshmem nonblocking broadcast routines are collective routines over a
valid \openshmem team.
They copy the \source{} data object on the \ac{PE} specified by
\VAR{PE\_root} to the \dest{} data object on the \acp{PE}
participating in the collective operation.
The same \dest{} and \source{} data objects and the same value of
\VAR{PE\_root} must be passed by all \acp{PE} participating in the
collective operation.

A call to the nonblocking broadcast routine initiates the operation and returns
immediately without necessarily completing the operation. On success,
an opaque request handle is created and returned. The
operation is completed after a call to \FUNC{shmem\_req\_test} or a
call to \FUNC{shmem\_req\_wait}. When the operation is complete, the request handle
is deallocated and cannot be reused.
Contributor
@kwaters4 Jul 31, 2024
To my understanding, this is currently worded in a way that both shmem_req_test and shmem_req_wait guarantee completion.

I believe the correct language exists in the intro on lines 17 and 18.

\item Completion semantics: \openshmem programs can learn the status of the collective operations using the \FUNC{shmem\_req\_test} routine and can be completed using the \FUNC{shmem\_req\_wait} routine.


Like blocking broadcast, before any \ac{PE} calls a broadcast routine, the following
conditions must be ensured:
\begin{itemize}
\item The \dest{} array on all \acp{PE} participating in the broadcast
is ready to accept the broadcast data.
\item All \acp{PE} in the \VAR{team} argument must participate in
the operation.
\item If the \VAR{team} compares equal to \LibConstRef{SHMEM\_TEAM\_INVALID} or is
otherwise invalid, the behavior is undefined.
\item \ac{PE} numbering is relative to the team. The specified
root \ac{PE} must be a valid \ac{PE} number for the team,
between \CONST{0} and \VAR{N$-$1}, where \VAR{N} is the size of
the team.
\end{itemize}
Otherwise, the behavior is undefined.

Upon completion of a nonblocking broadcast routine, the following are true for the local
\ac{PE}:
\begin{itemize}
\item The \dest{} data object is updated.
\item If the local \ac{PE} is \VAR{PE\_root}, the data has been copied
out of the \source{} data object.
Collaborator
Move this change to a separate PR and ask section committee to add it. @davidozog

\end{itemize}
}


\apireturnvalues{
Zero on success and nonzero otherwise.
}

\apinotes{
Team handle error checking and integer return codes are currently undefined.
Implementations may define these behaviors as needed, but programs should
ensure portability by doing their own checks for invalid team handles and for
\LibConstRef{SHMEM\_TEAM\_INVALID}.
}

\end{apidefinition}
35 changes: 35 additions & 0 deletions content/shmem_collective_test.tex
@@ -0,0 +1,35 @@
\apisummary{
The routine outputs the status of the operation identified by the request.
}

\begin{apidefinition}

\begin{Csynopsis}
int @\FuncDecl{shmem\_req\_test}@(shmem_req_h *request);
\end{Csynopsis}

\begin{apiarguments}

\apiargument{IN}{request}{Request handle}

\end{apiarguments}

\apidescription{
A call to \FUNC{shmem\_req\_test} returns immediately. If the
operation identified by the request is completed, it returns
zero, and the request object is deallocated and set to \LibConstRef{SHMEM\_REQ\_INVALID}.
If the operation is not completed, it returns a positive integer.
If the request object is not valid (i.e., it is set to \LibConstRef{SHMEM\_REQ\_INVALID}),
no operation is performed and a negative value is returned.

In a multithreaded environment, \FUNC{shmem\_req\_test} may be called
concurrently by different threads only on different request objects. It is the responsibility
of the \openshmem user to ensure that proper synchronization is used to
prevent race conditions or deadlock.
}

\apireturnvalues{
On success returns zero, otherwise returns a nonzero integer.
}

\end{apidefinition}
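
The return-value behavior described for \FUNC{shmem\_req\_test} can be illustrated with a toy single-process model. This is a sketch under stated assumptions, not an implementation of the proposed routine: all `toy_` names and `TOY_REQ_INVALID` are hypothetical stand-ins, a request "completes" after a fixed number of polls, and "not yet complete" is assumed to be reported as a strictly positive value so that it cannot be confused with zero.

```c
#include <assert.h>

/* Hypothetical stand-ins for the request handle type and invalid constant. */
typedef int toy_req_h;
#define TOY_REQ_INVALID (-1)
#define TOY_MAX_REQS 8

static int toy_polls_left[TOY_MAX_REQS]; /* polls remaining until completion */

/* Start a modeled operation in the given slot that needs `polls`
 * unsuccessful tests before it reports completion. */
toy_req_h toy_req_start(int slot, int polls)
{
    assert(slot >= 0 && slot < TOY_MAX_REQS && polls >= 0);
    toy_polls_left[slot] = polls;
    return slot;
}

/* Modeled test semantics:
 *    0 : operation complete; the handle is released and set to invalid
 *   >0 : operation still in progress (assumed positive in this sketch)
 *   <0 : handle was already invalid; nothing is done */
int toy_req_test(toy_req_h *request)
{
    if (*request == TOY_REQ_INVALID)
        return -1;
    if (toy_polls_left[*request] > 0) {
        toy_polls_left[*request]--;
        return 1;
    }
    *request = TOY_REQ_INVALID;
    return 0;
}

/* Poll until completion, counting the units of work overlapped with the
 * operation. Returns that count, or -1 if the handle was invalid. */
int toy_overlap(toy_req_h *request)
{
    int work_units = 0;
    int rc;
    while ((rc = toy_req_test(request)) > 0)
        work_units++; /* application computation would go here */
    return rc == 0 ? work_units : -1;
}
```

In this model, a request started with three pending polls lets `toy_overlap` perform three units of work before the handle is reset to `TOY_REQ_INVALID`. A further `toy_req_test` on that handle returns a negative value.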
38 changes: 38 additions & 0 deletions content/shmem_collective_wait.tex
@@ -0,0 +1,38 @@
\apisummary{
The routine waits until a operation identified by a request
Contributor
nit: "an operation" instead of "a operation"

object completes.
}

\begin{apidefinition}

\begin{Csynopsis}
int @\FuncDecl{shmem\_req\_wait}@(shmem_req_h *request);
\end{Csynopsis}

\begin{apiarguments}

\apiargument{IN}{request}{Request handle}

\end{apiarguments}

\apidescription{

The \FUNC{shmem\_req\_wait} routine blocks until the operation identified by
the request object has completed. When the operation is completed,
\FUNC{shmem\_req\_wait} returns
zero, and the request object is deallocated and set to \LibConstRef{SHMEM\_REQ\_INVALID}.
If the request object is not valid (i.e., it is set to
\LibConstRef{SHMEM\_REQ\_INVALID}), no operation is performed and a negative
value is returned.

In a multithreaded environment, \FUNC{shmem\_req\_wait} may be called concurrently by different
threads only on different request objects. It is the responsibility of the
\openshmem user to ensure that proper synchronization is used to prevent race
conditions or deadlock.
}

\apireturnvalues{
On success returns zero, otherwise returns a negative integer.
}

\end{apidefinition}
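
One way to read the relationship between the two completion routines is that waiting behaves like testing repeatedly until the operation completes. The sketch below models that reading in a single process. It is an assumption made for illustration, not the proposed implementation: the `toy_` names and `TOY_REQ_INVALID` are hypothetical, only one operation is modeled at a time, and a real library may block in other ways than polling.

```c
#include <assert.h>

/* Hypothetical stand-ins for the request handle type and invalid constant. */
typedef int toy_req_h;
#define TOY_REQ_INVALID (-1)

static int toy_steps_left; /* progress steps left for the one modeled operation */

/* Start the single modeled operation; it completes after `steps` polls. */
toy_req_h toy_start(int steps)
{
    assert(steps >= 0);
    toy_steps_left = steps;
    return 0;
}

/* Nonblocking check: 0 when complete (handle invalidated), positive while
 * in progress, negative if the handle is already invalid. */
int toy_req_test(toy_req_h *request)
{
    if (*request == TOY_REQ_INVALID)
        return -1;
    if (toy_steps_left > 0) {
        toy_steps_left--;
        return 1;
    }
    *request = TOY_REQ_INVALID;
    return 0;
}

/* Blocking wait modeled as polling: returns 0 once the operation has
 * completed and the handle has been invalidated, or a negative value if
 * the handle was invalid on entry. */
int toy_req_wait(toy_req_h *request)
{
    if (*request == TOY_REQ_INVALID)
        return -1;
    while (toy_req_test(request) > 0) {
        /* spin until the modeled operation completes */
    }
    return 0;
}
```

Under this model, a single `toy_req_wait` call on a fresh handle returns zero and leaves the handle equal to `TOY_REQ_INVALID`, and a second call on the same handle returns a negative value without doing anything.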
17 changes: 17 additions & 0 deletions main_spec.tex
@@ -383,6 +383,23 @@ \subsubsection{\textbf{SHMEM\_COLLECT, SHMEM\_FCOLLECT}}\label{subsec:shmem_coll
\subsubsection{\textbf{SHMEM\_REDUCTIONS}}\label{subsec:shmem_reductions}
\input{content/shmem_reductions.tex}

\newpage
\subsection{Nonblocking Collective Routines}\label{subsec:nb_coll}
\input{content/nb_collectives_intro.tex}

\subsubsection{\textbf{SHMEM\_BROADCAST\_NB}}\label{subsec:shmem_broadcast_nb}
\input{content/shmem_broadcast_nb.tex}

\subsubsection{\textbf{SHMEM\_ALLTOALL\_NB}}\label{subsec:shmem_alltoall_nb}
\input{content/shmem_alltoall_nb.tex}

\subsubsection{\textbf{SHMEM\_REQ\_TEST}}\label{subsec:shmem_collective_test}
\input{content/shmem_collective_test.tex}

\subsubsection{\textbf{SHMEM\_REQ\_WAIT}}\label{subsec:shmem_collective_wait}
\input{content/shmem_collective_wait.tex}




