Deterministic execution checks #2

gardnervickers · 2019-10-22T02:34:44Z

In the StrangeLoop FoundationDB talk, Will Wilson mentions the need for re-running simulations with the same deterministic seed and comparing the execution in order to detect sources of nondeterminism which have sneaked into the application.

It would be good to support this, if even at a basic level. One idea that has been floated is to use Tracing to collect a globally ordered set of execution events. This could serve as the basis for comparison to ensure that for multiple runs of the same seed, execution histories are identical.

davidbarsky · 2019-10-22T16:46:22Z

I'm not sure that tracing is the ideal mechanism for tracing executions of simulations because tracing is primarily meant for collecting instrumentation data that can be arbitrarily queried over. This isn't to say that it can't be used for that, but I think pushing tracing towards this use-case might be more effort than getting each simulated component to insert an enum describing the operation to a global/lazy-static'd vec.

gardnervickers · 2019-10-23T04:25:14Z

@davidbarsky I think that makes sense, the reason I was drawn to using tracing was that I wanted to allow users to interleave their own application specific events into an execution history. I agree that it's less than ideal to force tracing to serve this purpose.

I think instead, we could achieve the same thing by using a global epoch which events can incorporate. Something like {timestep}-{epoch} where timestep is the mock time value. The reason we can't use mock time alone is that multiple events can fire in the span of a single timestep.

davidbarsky · 2019-10-24T15:11:31Z

We spoke on Discord about this, but to summarize: I think a priority queue of events will be sufficient. The events will be inserted into the priority queue during a simulation run. "Priority" of an event is a monotonic instant/integer that allows a user/this system to determine a total ordering of events during a given simulation.

This approach is outlined in greater detail here: https://lobste.rs/s/igiolo/learning_build_distributed_systems#c_nlpl7r

bIgBV · 2019-10-24T17:35:27Z

I think the same approach would work for not only network events but IO events as well, and will let us simulate storage failures.

A good implementation strategy would be to have each mock Async* object maintain it's own event priority queue. This can then be used with a global fault injector to drop events/delay them to induce failures.

thisismiller · 2020-05-15T23:45:09Z

The easy way to do this is to, right before the test ends, print/log the random number generator's next value. If two runs produce the same final random number, they they're very likely to have identical executions. FoundationDB calls this the unseed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deterministic execution checks #2

Deterministic execution checks #2

gardnervickers commented Oct 22, 2019

davidbarsky commented Oct 22, 2019

gardnervickers commented Oct 23, 2019

davidbarsky commented Oct 24, 2019

bIgBV commented Oct 24, 2019

thisismiller commented May 15, 2020

Deterministic execution checks #2

Deterministic execution checks #2

Comments

gardnervickers commented Oct 22, 2019

davidbarsky commented Oct 22, 2019

gardnervickers commented Oct 23, 2019

davidbarsky commented Oct 24, 2019

bIgBV commented Oct 24, 2019

thisismiller commented May 15, 2020