
Blocking fetching AMOs are unaffected by fence/quiet #518

Closed

Conversation

davidozog
Collaborator

Summary of changes

This change clarifies that blocking fetching AMOs are unaffected by fence/quiet, while blocking non-fetching AMOs are affected by fence/quiet. Previously, Table 14 in OpenSHMEM v1.5 left this ambiguous.
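For concreteness, here is a minimal sketch of my reading of the distinction (the variable names are illustrative, not from the proposal text):

#include <shmem.h>

static int counter = 0;   /* symmetric */

void example(int pe) {
  shmem_atomic_inc(&counter, pe);           /* blocking non-fetching AMO: its remote
                                               completion is ordered/ensured by a
                                               later fence or quiet */
  shmem_quiet();                            /* completes the increment above */

  int v = shmem_atomic_fetch(&counter, pe); /* blocking fetching AMO: already complete
                                               when it returns, so fence/quiet have no
                                               additional effect on it */
  (void)v;
}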

@davidozog davidozog added this to the OpenSHMEM 1.6 milestone Jul 11, 2024
@davidozog davidozog self-assigned this Jul 11, 2024
@jdinan
Collaborator

jdinan commented Aug 22, 2024

There is actually an unresolved memory model issue with this proposal. Specifically, given a litmus test like this:

#include <shmem.h>
#include <assert.h>

void foo(int pe) {
  static int x = 0;  /* symmetric variable targeted by the AMOs */
  int y = shmem_atomic_fetch_inc(&x, pe);
  int z = shmem_atomic_fetch_inc(&x, pe);
  assert(z > y);     /* must the second fetch observe the first increment? */
}

and assuming no overflow, do we require the assertion to be true? Relaxed order atomics (e.g. std::memory_order_relaxed or GPU native atomics) don't guarantee this. So we may want to (or maybe we already did) define fence/quiet as enforcing this ordering for blocking atomics. At the moment, the memory order semantics of OpenSHMEM AMOs remains unspecified.

@davidozog
Collaborator Author

... do we require the assertion to be true? Relaxed order atomics (e.g. std::memory_order_relaxed or GPU native atomics) don't guarantee this. So we may want to (or maybe we already did) define fence/quiet as enforcing this ordering for blocking atomics. At the moment, the memory order semantics of OpenSHMEM AMOs remains unspecified.

Yeah, great question. I might agree that we do not require the assertion to be true in this example (at least if relaxed atomics are allowed), but this proposal says that a fence here does not fix it regardless. So if we do want a fence to force the ordering, we should reconsider this proposal and/or fix up the memory model first.
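To make that concrete, here is my sketch of the fenced variant (not from the proposal text); under the proposed table entry, the fence does not order the two blocking fetching AMOs, so it would not by itself make the assertion safe:

#include <shmem.h>
#include <assert.h>

void foo_fenced(int pe) {
  static int x = 0;
  int y = shmem_atomic_fetch_inc(&x, pe);
  shmem_fence();   /* per this proposal: no effect on blocking fetching AMOs */
  int z = shmem_atomic_fetch_inc(&x, pe);
  assert(z > y);   /* still not guaranteed if the AMOs are relaxed */
}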

I think a related source of confusion is this sentence in Section 9.7:

The [fetching AMO] routines return after the data has been fetched from the target PE and delivered to the calling PE

so a naive user might think the second call isn't even initiated until y is set; this feels more like "acquire-release" than "relaxed" semantics to me...

Also, just to clarify/simplify this example, can we assume that there are no other atomics happening on pe at &x during foo? Is there a synchronization with the target pe before foo is called?

Pinging @wrrobin and @stewartl318 who might be interested and will have to cover for me at the ballot (I'll be chairing the HOTI tutorials ;).

@jdinan
Collaborator

jdinan commented Aug 22, 2024

Given that this is a minor change that treads into the space of clarifying unspecified behavior, I'm tempted to ask that we table it for OpenSHMEM 1.6 so that we can try to address the issue in the memory model holistically. I suspect that we'll eventually want to define blocking fetching AMOs as ordered (this will be painful on GPUs) and introduce new AMOs that allow users to choose the ordering model (i.e. can choose relaxed for performance). What do you think?
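Purely as a strawman, something like the following is what I have in mind; none of these names or types exist in OpenSHMEM today, and they are invented only for illustration:

/* Hypothetical sketch only; not part of any OpenSHMEM specification. */
typedef enum {
  SHMEM_AMO_ORDER_RELAXED,  /* no ordering among AMOs issued by this PE */
  SHMEM_AMO_ORDER_ORDERED   /* program order preserved, as in the litmus test */
} shmem_amo_order_t;

int shmem_atomic_fetch_inc_explicit(int *dest, int pe, shmem_amo_order_t order);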

@wrrobin
Collaborator

wrrobin commented Aug 22, 2024

@jdinan - AMOs with user options for choosing ordering and scope are definitely a good idea. But even for those, the default option needs to be clarified. Are you suggesting using fence/quiet for both blocking and non-blocking fetching AMOs to ensure ordering in the default case?

@stewartl318
Collaborator

I am firmly on the side of having the default behavior be relaxed.

First of all, Jim's example could be implemented with a single fetch_and_add(2), and then the caller can decide the ordering.
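Something like this (my reading of the suggestion; one round trip, and the caller assigns the two values locally):

#include <shmem.h>

void foo_combined(int pe) {
  static int x = 0;
  int y = shmem_atomic_fetch_add(&x, 2, pe);  /* one AMO instead of two */
  int z = y + 1;                              /* caller decides which value is "first" */
  /* z > y holds trivially, with no reliance on AMO ordering */
  (void)z;
}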

Second, my (dim) understanding of the memory models is that the atomics with ordering (seq_cst, release, acquire, acquire-release) have to do with making sure that data buffers are written before an associated flag (release) and that a flag is checked before a buffer is read (acquire). They only really apply to the sequencing of operations from a particular source. We already have a flag API (put with signal) and that one should be "release". We also have shmem_fence to enforce order of delivery.
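As a sketch of that pattern (the buffer size, flag, and PE choice are illustrative):

#include <shmem.h>
#include <stdint.h>

static long buf[8];        /* symmetric data buffer */
static uint64_t flag = 0;  /* symmetric signal word */

void producer(const long *src, int target_pe) {
  /* The data is deposited before the signal becomes visible ("release"-like). */
  shmem_put_signal(buf, src, 8, &flag, 1, SHMEM_SIGNAL_SET, target_pe);
}

void consumer(void) {
  shmem_signal_wait_until(&flag, SHMEM_CMP_EQ, 1);  /* "acquire"-like wait */
  /* buf is now safe to read */
}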

My suspicion is that most individual non-fetching AMOs are there to support lock-free modifications of remote buffers, and the lock freedom is against other PEs, hardly ever against yourself. Relaxed is the most performant choice.

For fetching atomics, it seems likely to me that again the use case is almost always contention with other PEs, not with yourself, so again the ordering doesn't matter.

When ordering of atomics from the same PE matters, it will usually involve different addresses, where you want the target to see the updates in a particular order. shmem_fence does that.
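For instance (a sketch under that assumption, with illustrative symmetric variables):

#include <shmem.h>

static int first = 0, second = 0;  /* symmetric, two different addresses */

void ordered_updates(int pe) {
  shmem_atomic_set(&first, 1, pe);
  shmem_fence();                     /* deliver the update to 'first' ... */
  shmem_atomic_set(&second, 1, pe);  /* ... before the update to 'second' */
  /* A target that observes second == 1 can rely on seeing first == 1. */
}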

I am perfectly comfortable with the idea that shmem doesn't enforce ordering of fetching atomics and I certainly wouldn't want to slow down the common cases in support of such a corner case.

@wrrobin
Collaborator

wrrobin commented Aug 23, 2024

Thanks @stewartl318 for your thoughts. This sounds good to me as well. But how do we clarify it in the spec? Should we then update the AMO Section 9.7 text as Dave pointed out above?
Also, from the user's point of view, how would they expect different behavior from blocking and non-blocking versions of fetching AMOs? It looks like we are in favor of replacing the existing routines with ones that give users options to specify the behavior. Let me know if I am missing anything in this regard.

@jdinan
Collaborator

jdinan commented Aug 23, 2024

@stewartl318 I think that's a very reasonable proposal. It may break backwards compatibility, and we'll need to wrestle with that. Regarding this proposal, my understanding is that it's being put forward as a simple bugfix to a table that's believed to be inconsistent. IIUC, the goal is not to clarify the memory model. However, any change we make here will have memory model implications/complications. So my suggestion is that we take some additional time to craft a proper clarification to the memory model and delay this change to OpenSHMEM 1.7.
