RMA WG 01 03 2019

James Dinan edited this page Jan 4, 2019 · 3 revisions

Agenda

  1. Wait/test changes to put-with-signal proposal (Naveen)

Attendees

  • Naveen (Cray)
  • Jim, Dave, Wasi (Intel)
  • Akil, Anshuman (NVIDIA)
  • Swaroop (ORNL)
  • Manju (Mellanox)

Open Action Items

  • None

New Action Items

  • Prepare proposals for upcoming meeting (Jan. 14 cutoff)

Notes

Naveen - 2 questions about put w/signal:

  1. Do we need to specify that put w/signal requires using shmem_wait/test to check the signal value?

    • Naveen: shmem_wait/test text already has a note to implementors that looks somewhat complete for put w/signal, but not sure.
    • Manju: We need to specify that shmem_wait must be used with put/signal. Anshuman's proposal should clarify other atomicity requirements.
    • Jim: Put/signal + barrier should work; need to be careful about specifying a strict shmem_wait requirement.
    • M: Ordering is a concern, requires using shmem_wait/test interface.
    • N: Put/signal shouldn't need anything new, it's not different from other routines that accompany shmem_wait.
    • M: Put/signal is a special operation distinct from others, and requires shmem_wait/test for portable full completion.
    • J: What if user does shmem_barrier_all? When is it valid for the target to read signal word directly? Need a well-defined memory model for this.
    • M: It's always correct to use shmem_wait. The barrier_all may also work, but target side may not be fully complete, still need shmem_wait after barrier to be sure.
    • J: Barrier must work here for SHMEM to work. Specifying shmem_wait only is too restrictive.
    • M: Not so; for example, we don't overspecify quiet by saying it's redundant immediately after a barrier.
    • J: The point is there are other ways to achieve put/signal semantics than wait/test; we shouldn't over-specify.
    • M: Using shmem_wait to check signal is always correct, so that should be specified.
    • N: This isn't similarly specified anywhere else in the spec, why only for put/signal? Isn't partial update description enough?
  2. Does the signal operation need to be atomic?

    • Jim: 1) This could be a note to implementors that it might need to be an atomic operation (more flexible option) or 2) Specify that it must be atomic w/ respect to SHMEM AMO semantics.
    • N: Looks like an implementation detail, does not need to be specified. If PCI, use atomics. Otherwise, maybe not.
    • M: Memory model should clarify this. But requiring an atomic op helps avoid partial updates.
    • J: AMOs, scalar put, and signal would be included in this. By saying put/signal must be compatible with wait/test, implementation applies an atomic where necessary.
    • M: Counting puts might be different, don't have to look at multiple memory locations.
    • J: Counters may be limited to 1000s or 10,000s; they work similarly to a signal value being incremented. Triggered ops enable more counters.
    • M: Likes the counter as an opaque object, but it need not be.
    • J: It's better as an opaque object, gives implementation more optimization opportunities.
    • J: E.g. local sorting in ISx can be done independently of communication. Signal can have extra info, like the offset for written data. Can't do that with counting put.
    • M: Multiple counters might be able to handle that and other cases; but yes, the signal value can capture more general info.
    • J: Better to start with signal, since it's been supported by Cray SHMEM, could do counting puts or something more general later.
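The ordering and partial-update concerns discussed above can be illustrated with a small shared-memory model. This is only a sketch under stated assumptions, not OpenSHMEM code and not the proposal's specified behavior: two threads stand in for the initiator and target PEs, and `target_mem_t`, `model_put_signal`, `model_wait_until_ne`, and `run_put_signal_model` are hypothetical names invented for illustration. The signal word is updated atomically with release ordering after the payload is written, and the target polls it with acquire ordering, which is roughly the guarantee a wait/test-style check would have to provide. The signal value here carries the number of elements written, in the spirit of the "signal can have extra info" point.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_LEN 8

/* Hypothetical stand-in for the target's data buffer and signal word. */
typedef struct {
    long data[PAYLOAD_LEN];
    _Atomic uint64_t signal;
} target_mem_t;

/* Model of put-with-signal: write the payload, then publish the signal.
 * The release store keeps the payload writes ordered before the signal. */
static void model_put_signal(target_mem_t *t, const long *src,
                             size_t nelems, uint64_t sig)
{
    memcpy(t->data, src, nelems * sizeof(long));
    atomic_store_explicit(&t->signal, sig, memory_order_release);
}

/* Model of a wait-until-not-equal check on the signal word. The atomic
 * acquire load never observes a partially updated signal, and makes the
 * payload visible once the new signal value is seen. */
static uint64_t model_wait_until_ne(target_mem_t *t, uint64_t cmp)
{
    uint64_t v;
    while ((v = atomic_load_explicit(&t->signal, memory_order_acquire)) == cmp) {
        /* spin */
    }
    return v;
}

/* Initiator side: send 1..PAYLOAD_LEN and signal with the element count. */
static void *initiator(void *arg)
{
    target_mem_t *t = arg;
    long src[PAYLOAD_LEN];
    for (int i = 0; i < PAYLOAD_LEN; i++)
        src[i] = i + 1;
    model_put_signal(t, src, PAYLOAD_LEN, (uint64_t)PAYLOAD_LEN);
    return NULL;
}

/* Target side: wait on the signal, then sum the delivered elements.
 * Returns -1 if the initiator thread could not be started. */
long run_put_signal_model(void)
{
    target_mem_t t;
    pthread_t th;
    long sum = 0;

    memset(t.data, 0, sizeof t.data);
    atomic_init(&t.signal, 0);
    if (pthread_create(&th, NULL, initiator, &t) != 0)
        return -1;

    uint64_t count = model_wait_until_ne(&t, 0);
    for (uint64_t i = 0; i < count; i++)
        sum += t.data[i];

    pthread_join(th, NULL);
    return sum;
}
```

Calling `run_put_signal_model()` from a test harness or a `main` function returns the sum of the delivered payload. If the signal word were instead read with a plain, non-atomic load, the model would have no ordering guarantee between the payload and the signal, which is the case the barrier-versus-wait discussion is trying to pin down.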

Naveen: (conclusion / clarification)

  1. Need to specify that put w/signal should use shmem_wait/test for portability/compatibility.
  2. Do not need to specify atomic op requirement for signal operation.

Naveen will clean up and squash pull request and prepare for January 28th meeting.

Jim:

  • No meeting next Thursday unless someone lets Jim know one is needed. Should be able to handle put/signal through GitHub from here.
  • Please get proposals/special ballot readings together before the cutoff on the 14th, and get out an early announcement.
  • 2 weeks from now will have memory model discussion with Anshuman.