
Create a standard API / specification for testing workloads #13

thisismiller opened this issue May 16, 2020 · 2 comments

@thisismiller

Once a simulation framework exists, the next question is what to do with it. FoundationDB coupled randomized fault injection with specification-based testing to obtain a high level of coverage.

Specifically, there is a workload interface, through which "tester" actors run one or more workloads in parallel. Workloads are either antagonistic, e.g. kill N processes during the test, or a specification, e.g. back up and restore some data and assert that they're the same. Orthogonal workloads can be combined to exercise different failure scenarios or types of work in the same test.
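
To make that interface concrete, here is a minimal Rust sketch of what such a workload trait might look like; the trait, its method names, and the BoxFuture alias are all hypothetical, loosely inspired by FoundationDB's workload interface rather than taken from it.

```rust
use std::future::Future;
use std::pin::Pin;

/// Boxed-future alias so the trait stays object-safe without async-trait.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical workload interface; all names are illustrative.
trait Workload {
    /// Populate whatever initial state the workload needs.
    fn setup(&mut self) -> BoxFuture<'_, ()>;
    /// Perform the work under test, or act as the antagonist.
    fn run(&mut self) -> BoxFuture<'_, ()>;
    /// Check invariants once every workload's run phase has finished.
    fn verify(&mut self) -> BoxFuture<'_, bool>;
}
```

Under this shape, an antagonistic workload would do its damage in run() and trivially pass verify(), while a specification workload does real work in run() and checks it in verify().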

There is an alternative direction available, which is to allow an easy way to specify the exact fault injections that will happen in a test. Millions of Tiny Databases went in this direction instead, which allowed them to generate a comprehensive set of simulated tests, model checking style.

@davidbarsky

I was considering sketching out something closer to the Simnet approach for this, where we'd generate actions that'd be executed by the simulator. The actions could be API calls or things like network partitions, failures, timeouts, or delays. I think something like Blip implements something similar. What do you think?
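
A rough Rust sketch of what those generated actions could look like; every variant here is an illustrative guess, not a real Simnet or Blip API:

```rust
use std::time::Duration;

/// Hypothetical actions a simulator could execute, mixing API calls
/// against the system under test with injected network faults.
enum Action {
    /// An API call to issue against the system under test.
    Call { key: String, value: Vec<u8> },
    /// Partition one set of hosts away from another.
    Partition { group_a: Vec<String>, group_b: Vec<String> },
    /// Delay message delivery between two hosts.
    Delay { from: String, to: String, by: Duration },
    /// Kill a process outright.
    Fail { host: String },
}
```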

@thisismiller

thisismiller commented May 16, 2020

I think the best answer I can give is "that might work". It's far enough away from FoundationDB's model that I can't comment strongly on whether it would work, but I'm also skeptical that it would work well when extended to a full system. I think I understand the appeal, because it also means that one could try to minimize test failures. The allocator fuzzing is a simple example, and I think the technique applies well there, specifically because it's easy to define an interpreter when each action is atomic and synchronous. For a full distributed system, it's not immediately clear to me how to extend that model to encompass asynchronous and concurrent operations, multi-stage work, and fault injection therein. If you do manage to work out a design that resolves these potential issues, you'll still need to find a good way to separately specify a set of work and verification actions that you can compose together in a test.

To give things a name, I'll call the design of the allocator example the "interpreter-style".

Asynchronous operations

As a concrete example, let's use FoundationDB's Increment workload, which is a basic verification of the database. The workload increments some keys randomly, applies the same increments to an in-memory map, and then verifies that the two are the same at the end. This sounds simple enough, and you could do this in an interpreter style as well. One could define an Increment operation that increments a random key, and a Verify operation that does the database-to-in-memory-map comparison. If one were trying to write an elastic binary tree in Rust and compare it against std::collections::BTreeMap for verification, then I could see how this style of approach could work well and result in a very effective fuzzer.
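
For instance, a minimal interpreter-style sketch, assuming the fuzzer supplies a flat sequence of operations (all types are hypothetical, and a second BTreeMap stands in for the system under test):

```rust
use std::collections::BTreeMap;

/// Hypothetical operations a fuzzer would generate for this workload.
enum Op {
    /// Increment a (randomly chosen) key in the database and the model.
    Increment { key: u32 },
    /// Compare the database against the in-memory model.
    Verify,
}

fn interpret(ops: &[Op], db: &mut BTreeMap<u32, u64>) {
    let mut model: BTreeMap<u32, u64> = BTreeMap::new();
    for op in ops {
        match op {
            Op::Increment { key } => {
                *db.entry(*key).or_insert(0) += 1; // system under test
                *model.entry(*key).or_insert(0) += 1; // in-memory model
            }
            Op::Verify => assert_eq!(*db, model),
        }
    }
}
```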

However, there's no concurrency provided in this do-one-operation-and-wait-for-it-to-complete model. Ideally, I would want to be able to verify that if multiple increments are performed concurrently, then the result is still correct. The Increment workload spawns actorCount copies of the incrementClient actor to provide concurrent operations. How would this work in the libfuzzer model? We could have the Increment operation start an increment, but then what drives the completion? It seems we would still need the event loop running to advance the async work we begin, and the interpreter-style main loop would become a spawned task of its own. Even without our async work, the database itself needs to be able to perform background activities that have nothing directly to do with our test (such as data redistribution). It feels to me like the main interpreter loop no longer driving the execution of the code would lower the chance of failure minimization working, which I think is the main goal of this design. A quick google for a concurrent fuzzer turns up two research projects, Cuzz and ConFuzz, which suggests that this might be a difficult direction. It's possible that even without minimization, the interpreter API could be a useful design to focus on operations and incremental verification.
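
To make the shape of the problem concrete, here is a sketch (using tokio, with a Mutex-wrapped BTreeMap as a stand-in for the database; all names illustrative) of the concurrency that an interpreter main loop would have to coexist with:

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// One of `actor_count` concurrent clients, mirroring the incrementClient
/// actor; a locked map stands in for the real async database API.
async fn increment_client(db: Arc<Mutex<BTreeMap<u32, u64>>>, iters: u32) {
    for i in 0..iters {
        *db.lock().unwrap().entry(i % 8).or_insert(0) += 1;
    }
}

#[tokio::main]
async fn main() {
    let db = Arc::new(Mutex::new(BTreeMap::new()));
    let actor_count = 4;
    let clients: Vec<_> = (0..actor_count)
        .map(|_| tokio::spawn(increment_client(db.clone(), 100)))
        .collect();
    // The event loop must keep running to drive these tasks (and any
    // background database work), so an interpreter main loop would have
    // to become just another spawned task rather than the thing in control.
    for c in clients {
        c.await.unwrap();
    }
}
```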

Multi-stage work

For some tests, the interpreter operation that would have to be defined would be monolithic. To use backup tests in FoundationDB as an example, the "start a backup" flow involves multiple serial transactions, with each transaction being its own meaty chunk of logic. This code is BUGGIFY'd so that in simulation it sometimes spontaneously restarts from the top. I'm not quite sure how to phrase something similar in the interpreter model.
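
Roughly, the shape is something like the following sketch, where buggify() and the three transaction stubs are hypothetical placeholders (and the rand crate is assumed):

```rust
/// Stand-in for FoundationDB's BUGGIFY: fires randomly, only in simulation.
fn buggify() -> bool {
    rand::random::<f64>() < 0.05
}

fn write_backup_config() -> Result<(), ()> { Ok(()) } // transaction 1 (stub)
fn submit_backup_tasks() -> Result<(), ()> { Ok(()) } // transaction 2 (stub)
fn mark_backup_started() -> Result<(), ()> { Ok(()) } // transaction 3 (stub)

/// Multiple serial transactions, with BUGGIFY points that sometimes
/// spontaneously restart the whole flow from the top.
fn start_backup() -> Result<(), ()> {
    loop {
        write_backup_config()?;
        if buggify() { continue; } // restart from the top
        submit_backup_tasks()?;
        if buggify() { continue; } // restart from the top
        mark_backup_started()?;
        return Ok(());
    }
}
```

Collapsing all of start_backup into a single interpreter op would hide the restarts and intermediate states from the fuzzer, which is exactly the difficulty.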

Even operations that feel simple and atomic at first glance might not be so when one examines the details. The Increment operation described above, if applied to FoundationDB, would be 8 RPCs and 1 disk write for the commit alone. Reading a key could be one RPC, or it could require going back through the full bootstrap protocol of discovering where a key range and its storage servers exist in the cluster. When using abstractions that provide simple, high-level APIs, I'm unsure how to design an interpreter API that still permits the full concurrency, and thus test coverage, of all the operations the abstraction performs.

Fault injection

With the goal of also having faults driven from the interpreter, I think it's harder to describe when and where to inject a fault. Considering the case of a read above, how do I describe when to drop a packet? The amount of data sent/received by a single get request can vary wildly. If I drop the first packet after a DropPacket interpreter op, then there will exist multi-step operations that never see faults. How does one fault-inject background async work equally? How do I schedule BUGGIFYs in this mode?
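
To illustrate the ambiguity, a DropPacket-style op might look like this (a hypothetical sketch of the problem, not a proposal):

```rust
use std::time::Duration;

/// A fault op says *that* something should happen, but not *which* of the
/// many packets a single high-level operation (or background work) produces
/// it should happen to.
enum FaultOp {
    /// Drop the next packet sent anywhere; multi-packet operations would
    /// only ever see faults on their first packet.
    DropNextPacket,
    /// Drop each packet independently with probability p; p must stay very
    /// small if packet loss can trigger a large amount of extra work.
    DropPacketsWithProbability(f64),
    /// Delay the next packet instead of dropping it.
    DelayNextPacket(Duration),
}
```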

To return to Millions of Tiny Databases, Physalia is described as being tested with fault injection explicitly interwoven in the execution of the consensus group. This doesn't seem to give me a large hint on how to apply it to higher levels of abstraction. Being able to predict the work performed from a single stage of a consensus protocol on one node does seem easier than predicting the work done by one FoundationDB client for a read. Maybe there's a good answer that I'm missing, as there are some downsides to just blindly dropping N% of packets as well (N has to be kept very small if any component exists such that packet loss causes a large amount of work).

Composition

Cycle is a workload that tests data consistency where

  • setup() initializes a database with key-value pairs, such that each value is one of the other used keys and together they all form a cycle
  • run() randomly performs mutations of the cycle, taking A->B->C->D and transforming it into A->C->B->D
  • verify() reads all the key-value pairs and verifies that the cycle is still a cycle of the same length

It's important to note that this is a workload that will pass, given any prefix of the commits done to the database.
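
As a sketch of why that prefix property holds, assuming an in-memory map stands in for one database transaction (all names hypothetical):

```rust
use std::collections::BTreeMap;

/// One mutation: transform A -> B -> C -> D into A -> C -> B -> D, as a
/// single "transaction". Each swap preserves the cycle invariant, so any
/// prefix of swaps leaves a valid cycle of the same length.
fn swap(db: &mut BTreeMap<u32, u32>, a: u32) {
    let b = db[&a];
    let c = db[&b];
    let d = db[&c];
    db.insert(a, c);
    db.insert(c, b);
    db.insert(b, d);
}

/// verify(): follow the pointers and confirm we return to the start only
/// after visiting every key, i.e. the cycle still has the same length.
fn is_cycle(db: &BTreeMap<u32, u32>, start: u32) -> bool {
    let mut cur = start;
    let mut steps = 0;
    loop {
        cur = db[&cur];
        steps += 1;
        if cur == start {
            return steps == db.len();
        }
        if steps > db.len() {
            return false;
        }
    }
}
```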

BackupToDBCorrectness is a workload that

  • run() starts the async replication stream from database A to database B, and optionally copies the data back to A at the end of the workload
  • setup() and verify() do nothing

Individually, Cycle verifies properties about database consistency, and BackupToDBCorrectness verifies things about the Disaster Recovery API. If you run both of them at the same time, then you can verify that the DR keeps data consistent. Modular specification testing means more coverage with less work. In this sense, you could look at BackupToDBCorrectness like a simulation test fixture, and one might need to treat it as one under the interpreter model.
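
Reusing the hypothetical Workload trait sketched earlier, composition might look roughly like this, with join_all from the futures crate running every run() phase concurrently:

```rust
/// Run a set of composed workloads (e.g. Cycle plus BackupToDBCorrectness)
/// against the same simulated cluster; each checks only its own invariants.
async fn run_test(workloads: &mut [Box<dyn Workload>]) -> bool {
    for w in workloads.iter_mut() {
        w.setup().await;
    }
    // All run() phases execute concurrently, so specification workloads
    // see whatever chaos the antagonistic workloads create.
    futures::future::join_all(workloads.iter_mut().map(|w| w.run())).await;
    let mut ok = true;
    for w in workloads.iter_mut() {
        ok &= w.verify().await;
    }
    ok
}
```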

Is there a point to having different tests? Why not just always run all verification work with all faults? Well, sometimes they conflict. Let's look at some antagonistic workloads:

ConfigureDatabase is a workload that issues random database configuration changes. e.g. "Change from keeping 3 copies of all data to 2 copies of all data".

Attrition is a workload that kills N machines over a span of time, as long as the coded model of availability says that the database should remain available after each kill.

Unfortunately, these two can't be combined (easily). Attrition doesn't understand that the replication factor of the database's configuration might not match the actual data replication, because ConfigureDatabase recently changed the configuration.

This mix-and-match of correctness checking and antagonistic workloads lets one have a wide variety of simulation test configurations to uncover different sorts of bugs, and it's a thing I'd like to see kept in whatever the API is for performing test work in a simulation framework.
