Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mixed-atomicity memory model: only relaxed reads #42

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gasche
Copy link
Member

@gasche gasche commented Jun 19, 2024

Rendered.

This is a revision of the general model proposed in #41, where only non-atomic/relaxed reads are permitted on atomic locations, not relaxed writes. The model is simpler, and avoids the issues pointed out by @stedolan in #41 -- but it may still have problems!

@kayceesrk
Copy link

Thanks for working on this.

Assuming that the definition below is descriptive,

The only difference with atomic reads is that the thread frontier is not updated. Such a relaxed read gets the "current" value of the store (as defined by the sequential ordering between atomic writes), but without any synchronization with the atomic writers.

I do not understand why the implementation section says

We assume that the current compilation strategy is enough to realize this memory model, where relaxed reads are compiled exactly like non-atomic reads.

If relaxed reads are compiled exactly like non-atomic reads, why is it that a relaxed read on an atomic location guaranteed not to see older values? In particular, on Arm, we use the compilation scheme in Table 5(b), where the loads are compiled to plain loads. In the semantics, atomic locations only hold one/current value, and so the semantics does not even have a way to talk about older values. However, without any synchronization, why are we guaranteed to see this most recent write?

@gasche
Copy link
Member Author

gasche commented Jun 20, 2024

I know very little about the compilation-scheme side of weak memory models, so it is very possible that (if I understand your point correctly) the proposed model cannot be implemented efficiently in practice.

I can think of other models that are reasonably natural, and allow relaxed-read to observe older values. In these models, atomic locations carry histories which are timestamp-indexed sets of not just values, but (frontier, value) pairs. Would one of the informal description below sound reasonable to implement to you ?

  1. A relaxed read may observe any (frontier, value) pair that is at or after its own frontier-timestamp for this location, it updates its frontier with the new timestamp but without any other form of synchronization. In other word, the reader updates its view of the location, but not of any other location. (Updating the frontier with the new timestamp guarantees that we will not read a strictly-older value later.)

  2. A relaxed read may observe any (frontier, value) pair that is at or after its own frontier-timestamp for this location, and it updates its frontier with the location frontier at this timestamp. In other words, the relaxed read synchronizes with all writers at or before the timestamp it observes, but not later writers.

(I can easily try to work out a formal description of either approaches, and see what happens.)

Remarks:

  • If we want to avoid amnesic behaviors where we later read an older value, we need stronger guarantees for relaxed reads than for non-atomic reads as those do allow amnesic behaviors. (@stedolan commented in A proposal for a mixed-atomicity memory model #41 that we want to rule out amnesic behaviors, and that sounds convenient for programming but I am not sure it is a strict requirement for the model to work.) This intuitively suggest that relaxed reads may require a different compilation scheme than non-atomic reads.

  • (this is only about the formalism, and more of a comment about future proposals than the current discussion) The mathematics suggest that, if we have one frontier per timestamp, those frontiers should be increasing monotonically with the timestamps. In the current/trunk model, timestamps are totally ordered (they are rational numbers), so naturally per-timestamp frontiers will be monotonically increasing. If we consider relaxed writes again, relaxed writes have to synchronize with all writes older than their timestamp to preserve this property. This does not allow behaviors where, for example, two threads would perform relaxed writes independently, without observing each other in the process. It would be natural to consider weaker models where relaxed writes need not synchronize with each other, but this requires changing the set of timestamps to be a partially-ordered set.

@stedolan
Copy link
Contributor

Thanks for the update! This model makes more sense to me, but I agree with @kayceesrk that it would need a bit more implementation work than the RFC suggests.

In particular, a program consisting entirely of Atomic.get_relaxed and Atomic.set would have entirely SC semantics, since neither of those operations really care about frontiers. On ARM, achieving this requires work at both loads and stores.

So the options then are to weaken the model or strengthen the implementation until they line up. To decide between these, it might be helpful to look carefully at some intended uses of Atomic.get_relaxed to see what ordering requirements they actually have, and what the cost of fencing is.

@stedolan commented in #41 that we want to rule out amnesic behaviors, and that sounds convenient for programming but I am not sure it is a strict requirement for the model to work.

Right, agreed, the various convenient properties of the current OCaml model are not strict requirements. It might end up being a good idea to break some of these properties to get better performance in some cases, but I want to make sure we don't do so by accident.

Having said that, the property of coherence of atomics is generally considered an important one for writing programs - it's one of the few guarantees made even by C++ relaxed atomics.

@kayceesrk
Copy link

So the options then are to weaken the model or strengthen the implementation until they line up. To decide between these, it might be helpful to look carefully at some intended uses of Atomic.get_relaxed to see what ordering requirements they actually have, and what the cost of fencing is.

If we take the view that we want to compile relaxed reads as cheaply as possible (as in kcas library where fenceless_getis a non-atomic read %field0: ocaml-multicore/kcas@950f7ec#diff-3130266185c669ae1127fe3df54e5f5687e7300fd7e48954dcfaaa01e4af83e3R12), and given that atomic writes are at least as strong as non-atomic writes (is this true?), then the semantics we would get is

A relaxed read may observe any (frontier, value) pair that is at or after its own frontier-timestamp for this location , and it updates its frontier with the location frontier at this timestamp. In other words, the relaxed read synchronizes with all writers at or before the timestamp it observes, but not later writers.

This is similar to non-atomic reads, where no frontier is updated. This might be acceptable as the uses of relaxed read are expected to be at places where you do cheap reads until you observe a value that you care about, and then you'd follow this with an atomic operation to synchronise.

@gasche
Copy link
Member Author

gasche commented Jun 20, 2024

One property that your proposal does not have, if I understand correctly, is coherence as formulated by @stedolan: if I do a relaxed read at some newer timestamp, I don't learn anything new in my frontier, so if I do another relaxed read I may now observe an older value. (This is what I called an "amnesic behavior" above.)

I am fine with this, I can write a precise description of the model to have something concrete to discuss, and I believe that current practitioners do indeed want to have the cheapest possible read rather than stronger guarantees, but @stedolan points out that coherence is an important property and that the proposed model (if I understand correctly) would be strictly weaker than C11 relaxed atomics.

@gasche
Copy link
Member Author

gasche commented Jun 20, 2024

(If we don't update our frontier at all, I don't think we need to have a frontier at each timestamp, so I can simplify the presentation a bit: atomic locations are modelled by a (frontier, history) pair, where atomic reads always observe the latest value in history and synchronize the frontier, and non-atomic reads observe any value at their own frontier or later and don't synchronize anything.)

@OlivierNicole
Copy link

If I understand correctly @stedolan’s comments, the underlying hardware is coherent, so observation of values with non-monotonic timestamps can only be due to compiler optimisations.

Isn’t there a world where we have maximally cheap Atomic.fenceless_get, i.e., compiled like a non-atomic load, but with certain compiler barriers to ensure coherence? E.g., to ban example 2 from the PLDI paper, forbidding write forwarding to relaxed loads (but not to non-atomic ones)?

@gasche
Copy link
Member Author

gasche commented Jun 20, 2024

To decide between these, it might be helpful to look carefully at some intended uses of Atomic.get_relaxed to see what ordering requirements they actually have, and what the cost of fencing is.

I think that we can find all existing uses of fenceless_get using Github code search: search url.

The use-cases that I understand well enough are those were a relaxed read follows, or is followed by, an unconditional compare-and-set operation. I think that those would work well in basically any model, in the sense that they should behave like an atomic read -- there is a discussions of this in the text of the RFC.

For example:

https://github.com/ocaml-multicore/kcas/blob/240981e0ef9e9d5de6801aa8a15b62f73b7a37af/src/kcas/kcas.ml#L307-L312

(in general kcas.ml is full of comments about why "fenceless_get" is safe, generally the justification is that we know there was a fence before or there will be one after.)

Other use-cases that are more mysterious to me are when there is a conditional check on the relaxed-read return value, and an atomic write/cas only in one of the branches. For example, the following is part of a hashtable implementation that supports resizing concurrently with other operations:

https://github.com/ocaml-multicore/picos/blob/51895fb3169ffaba2219d02324ad429ccf174f88/lib/picos_htbl/picos_htbl.ml#L130-L138

I don't understand the code enough to guess whether the caller makes any synchronization assumption about the result of this function.

@kayceesrk
Copy link

kayceesrk commented Jun 21, 2024

The second code example is indeed interesting in that no synchronization follows the relaxed read. I guess the reasoning would be that publication safety holds, even for non-atomics. So if the relaxed read returns B (Resize spine_r), then the value read is guaranteed to be well-initialised. It is guaranteed to be one of the values that was written to this Atomic location.

(This is what I called an "amnesic behavior" above.). I am fine with this,

I would be ok with the amnesic behaviour as well.

One property that your proposal does not have, if I understand correctly, is coherence as formulated by @stedolan

I'm a bit confused here as Stephen suggested that coherence is broken by mixing atomic and relaxed writes to the same atomic location. Here we're choosing to only do relaxed reads. Given that non-atomic reads do exhibit amenesic behaviour, I would argue that it is ok for relaxed reads to do the same.

Are we overlooking something @stedolan?

@gadmm
Copy link

gadmm commented Jun 21, 2024

  1. In low-level terms, the cheapest implementation of fenceless_get would let you take advantage of dependency ordering, since those are already needed to enforce memory safety (publication safety). Ideally the memory model would tell you that you see at least the initialising writes. Now the memory model does not really talk about publication safety since initialising writes are not modelled, and initial values come out of thin air (see Explain mapping between OCaml memory model and C ocaml#10995 for a longer discussion).

    Would it be useful and/or necessary to first clarify the business around initial writes in the memory model?

    I am especially thinking about the following scenario: in the C implementation of the runtime, the initialising write is not always the first write. Similarly on the OCaml side, you might want to construct a value by mutation before publishing it, and so you would like to reason that the writes prior to publishing are initialising writes.

  2. I second the question about what it affords us to have "compiler barriers" for fenceless_get, in addition to this. (In hindsight this was the meaning of my last questions at Explain mapping between OCaml memory model and C ocaml#10995.) If I adapt my example:

    let x = ref 0
    let t1 r = (* thread 1 *)
      x := 1 ;
      Atomic.set r x
    let t2 r = (* thread 2 *)
      print_int (Atomic.fenceless_get r)
    let _ = 
      let r = Atomic.make (ref 1) in
      {{ t1 r || t2 r }}

    Note that compilers barriers are necessary at least to avoid repeatedly reading a stale value (e.g. hoisting a relaxed read out of a loop). AFAIU the C standard has specific wording to this purpose (https://eel.is/c++draft/atomics.order#11), and AFAIU this is not part of the formal memory model but comes in addition to it (let me know if I misunderstood this).

@kayceesrk
Copy link

kayceesrk commented Jun 22, 2024

the cheapest implementation of fenceless_get would let you take advantage of dependency ordering, since those are already needed to enforce memory safety (publication safety).

I don't follow how dependency ordering leads to a cheap implementation of fenceless_get. Can you explain?

Would it be useful and/or necessary to first clarify the business around initial writes in the memory model?

I think this would be useful. Unlike the memory model presented in the paper, implementation has to deal with 3 different cases for writes to non-atomic locations -- initialising writes, non-initialising writes to integer fields, and non-initialising writes to pointer fields (caml_modify) 1. For atomic fields, we distinguish between initialising writes and non-initialising writes (Atomic.set). It will be useful to extend the model distinguishing initialising and publishing writes, and then define and prove publication safety.

in the C implementation of the runtime, the initialising write is not always the first write.

Do you mean the case where caml_alloc initially writes Val_unit to the fields, but the user code then assigns the field values?

Note that compilers barriers are necessary at least to avoid repeatedly reading a stale value (e.g. hoisting a relaxed read out of a loop).

I am interested in thinking about how to formally specify this. Linux Kernel Memory Model has a READ_ONCE primitive to prevent compiler optimisation on load2. It would be useful to think about how to encode such a property formally for relaxed read, but also to atomics in general. The OCaml memory model today only prescribes a subset of allowed behaviours. This permits compiler optimisations such that the program exhibits fewer behaviours than permitted by the memory model. However, applying optimisations that lead to fewer behaviours on atomic locations is not always sensible. Adapting the example from LKMM2, the according to the memory model

while Atomic.get a = 0 do Domain.cpu_relax () done

can be optimised to

let tmp = Atomic.get a in
while tmp = 0 do Domain.cpu_relax () done

which is not what was intended.

ocaml/ocaml#12713 also has a similar flavour to the example above. There, the compiler had a bug -- full CSE on atomic locations led to more behaviours than what was permitted by the memory model. However, optimising the initial example by eliminating redundant load is permitted by the model3 but does not match with the intention.

Given that the notion of a compiler barrier does not exist in the model today, and it is not possible currently to express even the intended semantics for atomics formally, should a compiler barrier for relaxed read be necessarily a pre-condition for including that feature in the compiler? Informally, we can say that the compiler will not optimise any operations on atomic locations, including relaxed read.

Footnotes

  1. https://github.com/ocaml/ocaml/pull/10972#issuecomment-1030596278

  2. https://www.kernel.org/doc/Documentation/memory-barriers.txt 2

  3. https://github.com/ocaml/ocaml/issues/12713#issuecomment-1793870709

@polytypic
Copy link

polytypic commented Jun 22, 2024

For example, the following is part of a hashtable implementation that supports resizing concurrently with other operations:

https://github.com/ocaml-multicore/picos/blob/51895fb3169ffaba2219d02324ad429ccf174f88/lib/picos_htbl/picos_htbl.ml#L130-L138

I don't understand the code enough to guess whether the caller makes any synchronization assumption about the result of this function.

In that case the B (Resize _) value/state is terminal. IOW, the target location does not change after that value has been successfully written to the location.

Also, the runtime / compiler does not currently expose a sequentially consistent atomic read from an array, which is why Multicore_magic.Atomic_array does not expose such an operation. You could hack together something like that by performing dummy atomic reads, but I didn't bother to do that.

Also, note that every operation on the hash table begins with an Atomic.get. I mentioned about this with @stedolan recently, but my experience implementing non-trivial lock-free (non-blocking) data structures is that most of the atomic accesses inside a lock-free operation can be relaxed. Typically you only need to ensure sequential consistency / linearizabilty once. After / before that is done the other accesses can be relaxed without changing the semantics (linearizability) of the entire operation.

@gasche
Copy link
Member Author

gasche commented Jun 22, 2024

Thanks (everyone and) @polytypic for your clarifications here and the other thread, this is very helpful.

@gadmm
Copy link

gadmm commented Jun 24, 2024

I don't follow how dependency ordering leads to a cheap implementation of fenceless_get. Can you explain?

I am rather saying that, in low level terms, if you want to avoid all fences, then you should still be able to rely on dependency ordering (since you already do for memory safety), and conversely there is no other synchronization principle available (AFAIU).

Do you mean the case where caml_alloc initially writes Val_unit to the fields, but the user code then assigns the field values?

This sort of case indeed, since the block initialized to Val_unit is not memory-safe in general. (We are also leaving aside the fact that the allocator itself is written in C, etc.).

Given that the notion of a compiler barrier does not exist in the model today, and it is not possible currently to express even the intended semantics for atomics formally, should a compiler barrier for relaxed read be necessarily a pre-condition for including that feature in the compiler? Informally, we can say that the compiler will not optimise any operations on atomic locations, including relaxed read.

This makes sense if it is not possible/desirable to state the liveness property formally (and even then the compiler barrier is more of an implementation detail than something in the model). For the record the informal liveness property in the C/C++ memory model is the following: "The implementation should make atomic stores visible to atomic loads, and atomic loads should observe atomic stores, within a reasonable amount of time."

@kayceesrk kayceesrk added the memory-model Memory model related discussions label Jun 25, 2024
@gasche
Copy link
Member Author

gasche commented Jun 27, 2024

In my role as a scribe / operational semanticist, here is my attempt to translate the different models that I think are easy/natural to describe starting from the current operational semantics.

"latest, no sync"

As in the current PR: relaxed reads always access the sequentially-latest value, but they don't synchronize at all.

Store       S := a ↦ H ⊎ A ↦ (F, x)
History     H := t ↦ x
Frontier    F := a ↦ t
timestamp   t
value x

—————————————————————————————————————————————— Write-AT
(Fᴬ, y); F ——<A:write x>—→ (Fᴬ ⊔ F, x); Fᴬ ⊔ F

————————————————————————————————————————— Read-AT
(Fᴬ, x); F ——<A:read-at x>—→ (Fᴬ, x); Fᴬ ⊔ F

——————————————————————————————————————— Read-relaxed
(Fᴬ, x); F ——<A:read-na x>—→ (Fᴬ, x); F

"older, sync"

Relaxed reads may access an older value, but they synchronize with its writer (and all sequentially-before writers):

Store                 S := a ↦ h ⊎ A ↦ H
Non-atomic history    h := t ↦ x
Atomic history        H := t ↦ (F, x)
Frontier              F := l ↦ t

t = latest(H)    H(t) = (Fᴬ, y) 
t' > t     F' := (Fᴬ ⊔ F)[A ↦ t']
——————————————————————————————————————— Write-AT
H; F ——<A:write x>—→ H[t' ↦ F', x]; F'

t = latest(H)    H(t) = (Fᴬ, x)
———————————————————————————————— Read-AT
H; F ——<A:read-at x>—→ H; Fᴬ ⊔ F

t ≥ F(A)         H(t) = (Fᴬ, x)
———————————————————————————————— Read-relaxed
H; F ——<A:read-na x>—→ H; Fᴬ ⊔ F

(Invariant: the frontiers are monotonically increasing in atomic histories. The same invariant also holds for the proposal below.)

"older, loc sync only"

Relaxed reads may access older values, and they only update the frontier for that one atomic location -- the minimal amount to rule out amnesic behaviors.

(types: same as "older, sync")
Store                 S := a ↦ h ⊎ A ↦ H
Non-atomic history    h := t ↦ x
Atomic history        H := t ↦ (F, x)
Frontier              F := l ↦ t

t = latest(H)   H(t) = (Fᴬ, y)
t' > t     F' := (Fᴬ ⊔ F)[A ↦ t']
——————————————————————————————————————— Write-AT
H; F ——<A:write x>—→ H[t' ↦ F', x]; F'

t = latest(H)   H(t) = (Fᴬ, x)
————————————————————————————————— Read-AT
H; F ——<A:read-at x>—→ H; Fᴬ ⊔ F

t ≥ F(A)        H(t) = (Fᴬ, x)
——————————————————————————————————— Read-relaxed
H; F ——<A:read-na x>—→ H; F[A ↦ t]

"older, no sync"

Relaxed reads may access older values, with no synchronization at all. (This allows amnesic behaviors, even in absence of relaxed-write.)

Store                 S := a ↦ H ⊎ A ↦ (F, H)
History               H := t ↦ x
Frontier              F := l ↦ t

t' > latest(H)   F' := (Fᴬ ⊔ F)[A ↦ t']
——————————————————————————————————————————————— Write-AT
(Fᴬ, H); F ——<A:write x>—→ (F', H[t' ↦ x]); F'

t = latest(H)    H(t) = x
———————————————————————————————————————————— Read-AT
(Fᴬ, H); F ——<A:read-at x>—→ (Fᴬ, H); Fᴬ ⊔ F

t ≥ F(A)         H(t) = x
——————————————————————————————————————— Read-relaxed
(Fᴬ, H); F ——<A:read-na x>—→ (Fᴬ, H); F

@gasche
Copy link
Member Author

gasche commented Jun 27, 2024

My current understanding from your implementation-side discussions is as follows:

  • "older, no sync" is the weakest model, and it would probably be validated by the !(Obj.magic a : 'a ref) implementation, but it lacks coherence which makes @stedolan nervous
  • either "latest, no sync" and "older, sync" (they are incomparable) would probably require strictly more barriers in hardware
  • "older, loc sync only" is an intermediate model that guarantees per-location coherence but not much else; I don't find it very pleasant formally, but maybe it could be a candidate for a model that could be enforced by restricting compiler optimizations around Atomic.get_relaxed (in the specification, not just the implementation), without changing the compilation scheme.

@kayceesrk
Copy link

Thanks Gabriel. An important question here is which of these models can we actually implement. It may not be possible to have precisely the guarantees that are offered by these models in hardware and we're likely to be "at least as strong as" the operational model.

From a performance perspective, I do think we need "older, no sync", which can be implemented as you have pointed out. My question/confusion about about coherence from an earlier message (#42 (comment)) still remains (I think):

I'm a bit confused here as Stephen suggested that coherence is broken by mixing atomic and relaxed writes to the same atomic location. Here we're choosing to only do relaxed reads. Given that non-atomic reads do exhibit amenesic behaviour, I would argue that it is ok for relaxed reads to do the same.

Are we overlooking something @stedolan?

@gasche
Copy link
Member Author

gasche commented Oct 28, 2024

We discussed this at the maintainer meeting today. @stedolan clarified that his remark on reading old values later is not a criticism of the proposal itself, but rather a remark to point out that it is weaker than relaxed reads in the C11 memory model. (So they are "very relaxed" reads in this sense.)

The next step is for me to update the RFC with the "older, no sync" model, so that we have a clear document. Then someone who knows about this stuff should think about the whether the proposed implementation, of compiling those very-relaxed reads exactly like non-atomic reads, is in fact correct with respect to the memory model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
memory-model Memory model related discussions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants