This repository has been archived by the owner on Aug 12, 2022. It is now read-only.

Deterministic execution checks #2

Open
gardnervickers opened this issue Oct 22, 2019 · 5 comments

Comments

@gardnervickers
Member

In the Strange Loop FoundationDB talk, Will Wilson mentions the need to re-run simulations with the same deterministic seed and compare the executions, in order to detect sources of nondeterminism that have sneaked into the application.

It would be good to support this, even if only at a basic level. One idea that has been floated is to use tracing to collect a globally ordered set of execution events. This could serve as the basis for comparison, ensuring that execution histories are identical across multiple runs of the same seed.

@davidbarsky
Member

I'm not sure that tracing is the ideal mechanism for recording the execution of simulations, because tracing is primarily meant for collecting instrumentation data that can be arbitrarily queried. This isn't to say that it can't be used for that, but I think pushing tracing towards this use case might be more effort than having each simulated component push an enum describing the operation onto a global/lazy-static'd vec.
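
A rough sketch of what the global vec of operation enums could look like, using only the standard library. The `Op` variants and the `record` / `take_history` helpers are made up for illustration and are not part of any existing crate here.

```rust
use std::sync::Mutex;

/// Hypothetical operation kinds; a real simulation would define its own.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Op {
    Connect { from: u32, to: u32 },
    Send { from: u32, to: u32, len: usize },
    TimerFired { id: u64 },
}

/// Process-wide execution history. `Mutex::new` and `Vec::new` are both
/// `const fn` (Rust 1.63+), so a plain `static` works without a
/// lazy-static crate.
static HISTORY: Mutex<Vec<Op>> = Mutex::new(Vec::new());

/// Called by each simulated component when it performs an operation.
pub fn record(op: Op) {
    HISTORY.lock().unwrap().push(op);
}

/// Removes and returns everything recorded so far, leaving the
/// global history empty for the next run.
pub fn take_history() -> Vec<Op> {
    let mut guard = HISTORY.lock().unwrap();
    std::mem::take(&mut *guard)
}
```

Two runs with the same seed would each end with a call to `take_history()`, and the two resulting vectors could then be compared with `==`.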

@gardnervickers
Member Author

@davidbarsky I think that makes sense. The reason I was drawn to tracing was that I wanted to let users interleave their own application-specific events into an execution history. I agree that it's less than ideal to force tracing to serve this purpose.

I think instead, we could achieve the same thing by using a global epoch which events can incorporate. Something like {timestep}-{epoch} where timestep is the mock time value. The reason we can't use mock time alone is that multiple events can fire in the span of a single timestep.
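
To make the `{timestep}-{epoch}` idea concrete, here is a minimal sketch. `EventId` and `EventClock` are hypothetical names, and the choice to never reset the epoch between timesteps is an assumption made for simplicity.

```rust
use std::fmt;

/// Hypothetical event identifier: the mock time plus a global counter.
/// Field order matters: the derived `Ord` compares `timestep` first,
/// then `epoch`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct EventId {
    pub timestep: u64,
    pub epoch: u64,
}

impl fmt::Display for EventId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}-{}", self.timestep, self.epoch)
    }
}

/// Hands out ids. The epoch is bumped on every event and never reset
/// (an assumption), so two events in the same timestep still get
/// distinct, ordered ids.
#[derive(Debug, Default)]
pub struct EventClock {
    timestep: u64,
    epoch: u64,
}

impl EventClock {
    /// Moves mock time forward; earlier values are ignored so that
    /// time never goes backwards.
    pub fn advance_to(&mut self, timestep: u64) {
        self.timestep = self.timestep.max(timestep);
    }

    /// Returns the id for the next event and bumps the epoch.
    pub fn next_id(&mut self) -> EventId {
        let id = EventId { timestep: self.timestep, epoch: self.epoch };
        self.epoch += 1;
        id
    }
}
```

Because the ids are totally ordered, user-supplied events tagged with an `EventId` can be merged into the same history as events produced by the simulation itself.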

@davidbarsky
Member

We spoke on Discord about this, but to summarize: I think a priority queue of events will be sufficient. The events will be inserted into the priority queue during a simulation run. "Priority" of an event is a monotonic instant/integer that allows a user/this system to determine a total ordering of events during a given simulation.

This approach is outlined in greater detail here: https://lobste.rs/s/igiolo/learning_build_distributed_systems#c_nlpl7r
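
As a sketch of the priority queue idea (not the design from the linked comment, just one possible shape): events are keyed by a mock instant, with an insertion sequence number as a tie-breaker so the ordering is total. `Event`, `EventQueue`, and `push_at` are hypothetical names.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// An event in the simulation. The derived `Ord` compares `instant`,
/// then `seq`, so no two events compare equal in practice.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Event {
    pub instant: u64,
    pub seq: u64,
    pub description: String,
}

#[derive(Debug, Default)]
pub struct EventQueue {
    next_seq: u64,
    // `BinaryHeap` is a max-heap; `Reverse` makes the earliest event pop first.
    heap: BinaryHeap<Reverse<Event>>,
}

impl EventQueue {
    /// Inserts an event scheduled for the given mock instant.
    pub fn push_at(&mut self, instant: u64, description: &str) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Reverse(Event {
            instant,
            seq,
            description: description.to_string(),
        }));
    }

    /// Removes and returns the earliest event, if any.
    pub fn pop(&mut self) -> Option<Event> {
        self.heap.pop().map(|Reverse(event)| event)
    }

    /// Empties the queue, returning all events in their total order.
    pub fn drain_in_order(&mut self) -> Vec<Event> {
        std::iter::from_fn(|| self.pop()).collect()
    }
}
```

The drained vector is the execution history for one run, so comparing the vectors from two runs of the same seed is a direct determinism check.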

@bIgBV

bIgBV commented Oct 24, 2019

I think the same approach would work for not only network events but IO events as well, and will let us simulate storage failures.

A good implementation strategy would be to have each mock Async* object maintain its own event priority queue. This can then be used with a global fault injector that drops or delays events to induce failures.
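
Something along these lines, as a very rough sketch. `MockStream`, `FaultInjector`, and the fixed drop/delay rule are all invented for illustration; a real injector would presumably draw its decisions from the seeded RNG instead.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// What the injector decides to do with an event.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fault {
    Deliver,
    Drop,
    Delay(u64),
}

/// Hypothetical shared injector with a fixed, deterministic rule:
/// every 3rd event is dropped, every other even-numbered one is
/// delayed by 5 ticks.
#[derive(Debug, Default)]
pub struct FaultInjector {
    seen: u64,
}

impl FaultInjector {
    pub fn decide(&mut self) -> Fault {
        self.seen += 1;
        if self.seen % 3 == 0 {
            Fault::Drop
        } else if self.seen % 2 == 0 {
            Fault::Delay(5)
        } else {
            Fault::Deliver
        }
    }
}

/// A mock IO object that owns its own event priority queue.
#[derive(Debug, Default)]
pub struct MockStream {
    seq: u64,
    // (due tick, insertion sequence, payload); earliest due tick pops first.
    queue: BinaryHeap<Reverse<(u64, u64, String)>>,
}

impl MockStream {
    /// Schedules an event, letting the injector drop or delay it.
    pub fn schedule(&mut self, due: u64, payload: &str, faults: &mut FaultInjector) {
        let due = match faults.decide() {
            Fault::Drop => return,
            Fault::Delay(ticks) => due + ticks,
            Fault::Deliver => due,
        };
        self.queue.push(Reverse((due, self.seq, payload.to_string())));
        self.seq += 1;
    }

    /// Returns the next payload if it is due at mock time `now`.
    pub fn poll(&mut self, now: u64) -> Option<String> {
        let ready = matches!(self.queue.peek(), Some(Reverse((due, _, _))) if *due <= now);
        if ready {
            self.queue.pop().map(|Reverse((_, _, payload))| payload)
        } else {
            None
        }
    }
}
```

Since the injector's decisions are themselves deterministic, the same seed should still produce the same sequence of drops and delays on every run.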

@thisismiller

The easy way to do this is to print/log the random number generator's next value right before the test ends. If two runs produce the same final random number, then they're very likely to have had identical executions. FoundationDB calls this the unseed.
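
A toy illustration of the unseed check. This is not FoundationDB's code: `SimRng` is a small SplitMix64 generator standing in for whatever seeded RNG the simulation uses, and `run_simulation` with its `leak_at_step` parameter is a made-up stand-in for a real test run in which some nondeterministic branch consumes one extra random number.

```rust
/// SplitMix64, used here only as a stand-in for the simulation's seeded RNG.
pub struct SimRng {
    state: u64,
}

impl SimRng {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical simulation: every step draws one random number.
/// `leak_at_step` models nondeterminism sneaking in, e.g. a branch on
/// wall-clock time that causes one extra draw at that step.
/// Returns the unseed: the RNG's next value once the run is over.
pub fn run_simulation(seed: u64, steps: u32, leak_at_step: Option<u32>) -> u64 {
    let mut rng = SimRng::new(seed);
    for step in 0..steps {
        let _decision = rng.next_u64() % 4;
        if leak_at_step == Some(step) {
            rng.next_u64(); // the extra, unplanned draw
        }
    }
    rng.next_u64()
}
```

Two clean runs with the same seed return the same unseed, while a run with a leaked draw ends with the RNG in a different state and therefore a different unseed. The check is cheap, but it only catches divergence that changes how many random numbers were consumed.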
