
Per PE Fence #314

Open
manjugv opened this issue Nov 24, 2019 · 16 comments · May be fixed by #315
@manjugv
Collaborator

manjugv commented Nov 24, 2019

Problem:

To achieve ordering for RMA operations between two PEs, OpenSHMEM programs have to use shmem_fence. The shmem_fence routine, however, orders RMA operations from the calling PE to all other PEs. Though this achieves the intended result, it is an expensive operation on unordered networks.
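For reference, a minimal sketch of the status quo (standard OpenSHMEM routines; the variable names are illustrative and flag is assumed to be a symmetric uint64_t):

  /* Order a data put ahead of a flag update to the SAME target PE:
   * today, shmem_fence() must order outstanding puts to EVERY PE. */
  shmem_putmem(dst, src, nbytes, target_pe);
  shmem_fence();                          /* orders puts to all PEs */
  shmem_atomic_set(flag, 1, target_pe);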

Proposal:

Introduce PE-specific ordering routines that order RMA operations from the calling PE to a given target PE:

shmem_pe_fence(target_pe)
shmem_ctx_pe_fence(ctx, target_pe)
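As a sketch only (the exact C prototypes are not settled by this issue; these simply mirror the existing shmem_fence/shmem_ctx_fence signatures with a target PE argument added):

  void shmem_pe_fence(int target_pe);
  void shmem_ctx_pe_fence(shmem_ctx_t ctx, int target_pe);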

Impact on Users:

This provides a lightweight alternative to shmem_fence when an OpenSHMEM program needs an ordering guarantee only from the calling PE to a specific PE. There is no change in semantics for the current shmem_fence interface.

Impact on Implementations:

Implementations will have to provide the new interface shmem_pe_fence(target_pe) and its context variant shmem_ctx_pe_fence(ctx, target_pe).
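For a sense of the implementation burden, a minimal conforming fallback sketch (assuming the names above): an implementation without per-PE ordering support could simply map the new routines onto the existing fence, since ordering to all PEs subsumes ordering to one PE.

  #include <shmem.h>

  void shmem_ctx_pe_fence(shmem_ctx_t ctx, int target_pe) {
    (void)target_pe;        /* no finer-grain mechanism available */
    shmem_ctx_fence(ctx);   /* ordering to all PEs implies ordering to one */
  }

  void shmem_pe_fence(int target_pe) {
    shmem_ctx_pe_fence(SHMEM_CTX_DEFAULT, target_pe);
  }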

@manjugv
Collaborator Author

manjugv commented Nov 24, 2019

PR #315 describes the interface proposed to address this issue.

@manjugv manjugv linked a pull request Nov 25, 2019 that will close this issue
@nspark
Contributor

nspark commented Dec 23, 2019

I think this is an interesting feature, but users will need some additional information about the implementation to understand the cases in which shmem_pe_fence might be advantageous over shmem_fence. Specifically, the user application should be able to query when shmem_pe_fence is equivalent to shmem_fence and when it is a lighter-weight operation.

Consider a message-passing example in which a PE is sending, in order, data and a flag (like shmem_put_signal) to multiple PEs. One would likely select different fencing behavior depending on whether shmem_pe_fence is equivalent to shmem_fence. For example:

void message_exchange(void *dst, const void *src, size_t nbytes,
                      uint64_t *flags, int *targets, int num_targets) {

  // fence_pe < fence
  for (int i = 0; i < num_targets; i++)
    shmem_putmem(dst, src, nbytes, targets[i]);
  for (int i = 0; i < num_targets; i++) {
    shmem_pe_fence(targets[i]);
    shmem_atomic_set(flags, 1, targets[i]);
  }

  // fence_pe == fence
  for (int i = 0; i < num_targets; i++)
    shmem_putmem(dst, src, nbytes, targets[i]);
  shmem_fence();
  for (int i = 0; i < num_targets; i++)
    shmem_atomic_set(flags, 1, targets[i]);
}
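For concreteness, the kind of query being described might look roughly like this (a hypothetical sketch; shmemx_query_pe_fence_cost is an invented routine, not part of any spec, and the variable names follow the example above):

  int pe_fence_is_cheaper = 0;
  shmemx_query_pe_fence_cost(&pe_fence_is_cheaper);  /* invented query routine */

  if (pe_fence_is_cheaper) {
    for (int i = 0; i < num_targets; i++) {
      shmem_pe_fence(targets[i]);                    /* proposed in this issue */
      shmem_atomic_set(flags, 1, targets[i]);
    }
  } else {
    shmem_fence();                                   /* one global fence */
    for (int i = 0; i < num_targets; i++)
      shmem_atomic_set(flags, 1, targets[i]);
  }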

@jdinan
Collaborator

jdinan commented Dec 24, 2019

@nspark Thanks for the nice example. I agree, the performance query will be helpful. Using shmem_putmem_nbi might also help to illustrate the potential performance improvement --- finer-grain fencing will enable pipelining of the two loops in the network. That is, an implementation that waits for completion of some, rather than all, of the puts may start issuing sets sooner and do a better job of hiding the fence latency.
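As a sketch of that variant (assuming the proposed shmem_pe_fence; everything else is standard OpenSHMEM, with variable names following the earlier example):

  // With non-blocking puts plus per-PE fencing, the signal to target i can be
  // issued as soon as the puts to target i are ordered, rather than waiting
  // on the puts to all targets.
  for (int i = 0; i < num_targets; i++)
    shmem_putmem_nbi(dst, src, nbytes, targets[i]);
  for (int i = 0; i < num_targets; i++) {
    shmem_pe_fence(targets[i]);
    shmem_atomic_set(flags, 1, targets[i]);
  }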

@naveen-rn
Contributor

naveen-rn commented Jan 11, 2020

  for (int i = 0; i < num_targets; i++) {
    shmem_pe_fence(targets[i]);
    shmem_atomic_set(flags, 1, targets[i]);
  }

If shmem_fence == shmem_pe_fence, shmem_pe_fence(targets[0]) will be expensive, but shouldn't the subsequent fences inside the loop be no-ops (just a single conditional check)? In general, I see a potential performance regression caused by introducing extra conditional checks, but not from any heavy network operations. Unnecessary memory fences might be another issue if the target PEs are on the same node.

That said, I see shmem_pe_fence/quiet not as routines for obtaining finer granularity in memory ordering (use contexts for that usage model) or as replacements for contexts, but as hints that implementations can use to take advantage of particular network semantics.

@jdinan
Collaborator

jdinan commented Jan 12, 2020

@naveen-rn Agreed that this optimization can be used. IIUC, it will require a flag to be set in RMA/AMO operations to track whether the PE is still in the fenced/quieted state. In a multithreaded PE, the flag would require atomics. So, it seems like there is a tradeoff between the performance of per-PE fence/quiet and the RMA/AMO rate for small messages.
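A rough sketch of that flag-based fast path, just to make the tradeoff concrete (all names are invented; this is not any particular implementation, and a fully thread-safe version needs more care):

  #include <stdatomic.h>
  #include <stdbool.h>

  extern void do_network_fence(void);       /* hypothetical expensive ordering op */

  static atomic_bool ctx_has_unordered_ops; /* per-context "dirty" flag */

  void example_fence(void) {
    /* fast path: nothing issued since the last fence/quiet */
    if (!atomic_load(&ctx_has_unordered_ops))
      return;
    atomic_store(&ctx_has_unordered_ops, false);  /* clear before fencing */
    do_network_fence();
  }

  void example_on_rma_issue(void) {
    /* every RMA/AMO issue marks the context dirty again; this atomic store is
     * the overhead paid on the small-message path */
    atomic_store(&ctx_has_unordered_ops, true);
  }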

For the NVSHMEM or NUMA case I mentioned, it may not be so straightforward. It will be platform dependent whether O(N) finer-grain (e.g., GPU- or socket-local) fence/quiet operations will be cheaper than an O(1) system-level operation.

@shamisp
Contributor

shamisp commented Jan 12, 2020

@nspark - Users would have to be really careful about how they use it. If you use it without understanding the system topology, performance will suffer. In the case of shared memory, we would have to place a "global" memory barrier regardless of which routine is used. If you have to do more than one per-PE fence/quiet, you are better off using the regular fence/quiet.

@bcernohous

bcernohous commented Jan 20, 2020

(I'm late to this.)

The example above can pretty much be replaced by put-with-signal. It's more interesting if there are multiple puts being fenced.

What about non-blocking fences? Has that been part of this discussion?

  // fence_pe < fence
  for (int i = 0; i < num_targets; i++) {
    shmem_putmem_nb(dstdata, srcdata, size, targets[i]);
    shmem_putmem_nb(dstdata1, srcdata1, size1, targets[i]);
    shmem_putmem_nb(dstdata2, srcdata2, size2, targets[i]);
    shmem_pe_fence_nb(targets[i]);
    shmem_atomic_set(flag, 1, targets[i]);
  }

Actually, if it's non-blocking, does it matter as much whether it's shmem_pe_fence or shmem_fence?

DMAPP did a "bundled" put where you could put multiple times with a flag on the last put, guaranteed to be delivered after the "bundle":

/* dmapp_bput_nbi - Non-blocking implicit bundled Put
 *
 * Parameters:
 * IN  target_addr     Address of target buffer
 *     target_seg      Segment descriptor of target buffer
 *     target_pe       Target PE
 *     source_addr     Address of source buffer
 *     nelems          Number of elements to be transferred
 *     type            Type of elements to be transferred
 *     flag_addr       Address of (8-byte) target flag location (must be non-NULL for the last call in a bundle)
 *     flag_seg        Segment descriptor for flag buffer (if NULL, assumed to be the same as target_seg)
 *     flag_value      64-bit user defined flag value
 *

@shamisp
Contributor

shamisp commented Jan 21, 2020

@bcernohous - Does it have to be a put to the same PE, or can it be to different PEs?

/* dmapp_bput_nbi - Non-blocking implicit bundled Put
 *
 * Parameters:
 * IN  target_addr     Address of target buffer
 *     target_seg      Segment descriptor of target buffer
 *     target_pe       Target PE
 *     source_addr     Address of source buffer
 *     nelems          Number of elements to be transferred
 *     type            Type of elements to be transferred
 *     flag_addr       Address of (8-byte) target flag location (must be non-NULL for the last call in a bundle)
 *     flag_seg        Segment descriptor for flag buffer (if NULL, assumed to be the same as target_seg)
 *     flag_value      64-bit user defined flag value
 *

@bcernohous

It was pretty limited: the same PE. And since it was internally an FMA chain, the chain would be broken by non-bundled puts.

I'm not sure it was ever used, but it was a requested experimental feature. I'm not proposing it for SHMEM; I was just reminded of it by the per-PE fence/flag example. There's an implicit per-PE fence on the last bundled put, similar to put-with-signal.

I'm really looking for clarification on the use case(s) driving this feature. The example was basically put-with-signal.

@nspark
Contributor

nspark commented Jan 21, 2020

What about non-blocking fences? Has that been part of this discussion?

No, but I also don't think there's anything about shmem_fence that requires it to be blocking. The meat of what is said about shmem_fence is that it "assures ordering". One way of assuring ordering is to call shmem_quiet (and block); others might count ACKs for the issued puts. Even so, one might not need to block. An internal flag could be set that indicates a fence has been issued (and maybe even captures the list of PEs with outstanding operations). A subsequent put (to one of those PEs with outstanding operations) could then block in the put call (and not in the earlier fence), waiting for the expected number of ACKs.
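As a purely illustrative sketch of that idea (every name below is invented; no particular implementation is being described):

  #include <stddef.h>
  #include <stdint.h>

  #define MAX_PES 4096                      /* hypothetical bound */

  extern void make_progress(void);          /* hypothetical progress call */
  extern void issue_put(void *dst, const void *src, size_t n, int pe);

  typedef struct {
    uint64_t puts_issued;                   /* puts issued to this PE */
    uint64_t acks_received;                 /* updated by the progress engine */
    uint64_t fence_watermark;               /* ACKs required before the next op */
  } pe_state_t;

  static pe_state_t pe_state[MAX_PES];

  void example_pe_fence(int pe) {
    /* record how many ACKs must arrive before ordering is satisfied;
     * nothing blocks here */
    pe_state[pe].fence_watermark = pe_state[pe].puts_issued;
  }

  void example_put(void *dst, const void *src, size_t n, int pe) {
    /* block here, not in the fence, and only if ordering is still outstanding */
    while (pe_state[pe].acks_received < pe_state[pe].fence_watermark)
      make_progress();
    issue_put(dst, src, n, pe);
    pe_state[pe].puts_issued++;
  }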

@nspark
Contributor

nspark commented Jan 21, 2020

The example above can pretty much be replaced by put-with-signal. It's more interesting if there are multiple puts being fenced.

That wasn't the point of the example, which acknowledged that shmem_put_signal could be used there. The point is that applications (or their authors) will need to have some general sense of the relative cost of shmem_ctx_pe_fence vs. shmem_ctx_fence. If they're the same operation (or have nearly the same cost), the application will almost certainly choose a different synchronization pattern than if they're not. This could be tested empirically, but I would rather have the implementation inform the developer somehow.

Going further, for N == shmem_n_pes(), an application should not expect that time(shmem_fence) = N · time(shmem_pe_fence). It would be nice, but it is unrealistic. I also don't think an implementation could really give consistent, meaningful relative costs for these operations. Where there is a difference, an application will likely have to be tuned empirically to select one over the other.

@shamisp
Contributor

shamisp commented Jan 21, 2020

It was pretty limited: the same PE. And since it was internally an FMA chain, the chain would be broken by non-bundled puts.

I'm not sure it was ever used, but it was a requested experimental feature. I'm not proposing it for SHMEM; I was just reminded of it by the per-PE fence/flag example. There's an implicit per-PE fence on the last bundled put, similar to put-with-signal.

I'm really looking for clarification on the use case(s) driving this feature. The example was basically put-with-signal.

This feature is an integral part of the Verbs and IBTA specs (post list). It is actually the default API for sending messages. It definitely has performance benefits and provides some opportunities to reduce the number of memory barriers. It also comes with some overheads.
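For reference, the "post list" shape being referred to looks roughly like this in Verbs (a sketch only; the QP, memory registration, buffers, and exchanged remote addresses/rkeys are assumed to have been set up elsewhere):

  #include <infiniband/verbs.h>

  /* Two RDMA writes (data, then flag) chained into one ibv_post_send call,
   * so the provider sees the whole "plan" at once. */
  struct ibv_sge data_sge = { .addr = (uintptr_t)src_data,
                              .length = data_len, .lkey = mr->lkey };
  struct ibv_sge flag_sge = { .addr = (uintptr_t)src_flag,
                              .length = sizeof(uint64_t), .lkey = mr->lkey };

  struct ibv_send_wr flag_wr = {
    .wr_id = 2, .sg_list = &flag_sge, .num_sge = 1,
    .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma.remote_addr = remote_flag_addr, .wr.rdma.rkey = remote_rkey,
  };
  struct ibv_send_wr data_wr = {
    .wr_id = 1, .next = &flag_wr, .sg_list = &data_sge, .num_sge = 1,
    .opcode = IBV_WR_RDMA_WRITE,
    .wr.rdma.remote_addr = remote_data_addr, .wr.rdma.rkey = remote_rkey,
  };

  struct ibv_send_wr *bad_wr = NULL;
  if (ibv_post_send(qp, &data_wr, &bad_wr))   /* posts the whole chain at once */
    handle_error();                           /* hypothetical error handler */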

@shamisp
Contributor

shamisp commented Jan 21, 2020

What about non-blocking fences? Has that been part of this discussion?

No, but I also don't think there's anything about shmem_fence that requires it to be blocking. The meat of what is said about shmem_fence is that it "assures ordering". One way of assuring ordering is to call shmem_quiet (and block); others might count ACKs for the issued puts. Even so, one might not need to block. An internal flag could be set that indicates a fence has been issued (and maybe even captures the list of PEs with outstanding operations).

Something like a post-list of requests is much more powerful. The user provides a clear "plan" of the next steps to the library, and the library can optimize them. It is difficult to optimize around a fence.

@shamisp
Contributor

shamisp commented Jan 21, 2020

The example above can pretty much be replaced by put-with-signal. It's more interesting if there are multiple puts being fenced.

That wasn't the point of the example, which acknowledged that shmem_put_signal could be used there. The point is that applications (or their authors) will need to have some general sense of the relative cost of shmem_ctx_pe_fence vs. shmem_ctx_fence. If they're the same operation (or have nearly the same cost), the application will almost certainly choose a different synchronization pattern than if they're not. This could be tested empirically, but I would rather have the implementation inform the developer somehow.

Going further, for N == shmem_n_pes(), an application should not expect that time(shmem_fence) = N · time(shmem_pe_fence). It would be nice, but it is unrealistic. I also don't think an implementation could really give consistent, meaningful relative costs for these operations. Where there is a difference, an application will likely have to be tuned empirically to select one over the other.

@nspark - the cost might be runtime dependent, since with each allocation you can get a different layout of nodes and, as a result, the cost will be highly inconsistent. The cost of the operation will depend on the transports used and on locality.

@bcernohous

Something like a post-list of requests is much more powerful. The user provides a clear "plan" of the next steps to the library, and the library can optimize them. It is difficult to optimize around a fence.

I agree it's difficult to optimize, which is why I'd like the use case spelled out. I question the value of tracking per-PE completions (cost) vs. a per-PE fence (benefits) without hardware support.

@manjugv
Copy link
Collaborator Author

manjugv commented Jan 24, 2020

@bcernohous the main use case driving this interface is very close to what Nick has posted. The DMAPP interface you mention can be used for it; in fact, as was pointed out, it is supported in Verbs as well.

I chose the per-PE fence interface because it covers the use case Nick posted, and it also extends the semantics to AMOs and memory operations. Besides that, for network paths where packets do not arrive at the destination in order, this provides a potentially low-cost option compared to an all-PE fence. Also, for such network paths, this can be leveraged in multi-NIC scenarios, i.e., you need to order traffic to one PE (1 * M * HCAs) rather than to multiple PEs (NPEs * M * HCAs).

Though this is advantageous for OpenSHMEM over p2p/network transports, it is not necessarily advantageous for OpenSHMEM over shared memory (the granularity of the memory barrier is much coarser on shared memory systems). As folks have pointed out, it can have variable performance and, with the wrong usage, can hurt overall application performance. In the working group, we discussed the idea of a performance query. I'm not sure yet how to design and standardize that query.

@jdinan jdinan added this to the OpenSHMEM 1.5 milestone Jan 31, 2020
@jdinan jdinan modified the milestones: OpenSHMEM 1.5, OpenSHMEM 1.6 Mar 4, 2020
@manjugv manjugv self-assigned this Jun 4, 2020
@jdinan jdinan removed this from the OpenSHMEM 1.6 milestone Jan 19, 2023