Skip to content

Is Deadlock Possible?

Ted Dunning edited this page Aug 16, 2016 · 2 revisions

The issue is kind of subtle and it isn't clear that we have or do not have a potential for deadlock in the listener.

This comes up when we talk about how much ordering you want in terms of updates. For highest throughput, you want lots o' parallelism, of course, but that runs the risk of losing important ordering information.

To deal with this, we allow for multiple partitions in the stream topic that handles the update messages. Then we assign transactions to partitions using the volume name for maximum preservation of ordering or using a directory name for more parallelism with some guarantees or just a file name for maximum throughput.

The issue comes when you move a file from one ordering domain to another. If we are using directory as the partition key, then renaming a file within a single directory is no big deal, but renaming the file from one directory to another is a bit trickier. In particular, we want all writes to the original file to succeed before the rename and all writes to the new file to succeed after the rename.

But that won't necessarily work if the rename and one or the other writes are in different threads because the different threads might run through operations faster or slower.

To deal with this, we split the rename into two messages, rename_from and rename_to. The rename_from message uses the source directory as the key and the rename_to uses the destination directory. By doing this, we can guarantee that the writes to the original file occur in the same thread and before the rename_from message. Likewise, all of the writes to the new file name occur after the rename_to event. All that remains is to guarantee that the rename_to event occurs after the rename_from event. This shows the basic idea:

Simple rename state sequence

In the change listener, every partition of the stream is given a separate thread and the thread that executes the rename_to event waits until a semaphore is hit by the rename_from event. If both are executed in the same thread, this works by default.

If you have multiple threads and they are happening in tight synchrony (by whatever accident) then you get something like this:

Crossing renames

Each of the rename_to events will wait for the rename_from events and all will be good. Trouble might begin, however, when things get skewed. A rename_to event might well be processed before the corresponding rename_from event. If the rename_from event eventually happens in the other thread, all will be fine. But it is plausible to imagine that there is a time warping between the threads such that two rename_to events could cause threads to wait, thus blocking the rename_from events from being processed.

What we need is either:

a) a simple proof that deadlock cannot happen even with many threads and complicated sequences of renames.

OR

b) tests that show when deadlock is possible

A Simple Proof

By design, all of our events are emitted by a single thread that emits rename-from and rename-to events as a consecutive pair. Moreover, different pairs have a defined ordering in that the events these pairs are never intertwined in the execution trace of the single thread generating the events.

Once emitted, however, these events can go into multiple partitions of the change notification stream. This transfer weakens the ordering constraints that applied to the original event sequence. The key question, then, is whether this weakening can introduce cyclic dependencies between the threads that process these partitions and thus result in a deadlock.

Modeling execution of events

To describe this situation, we refer to the events in a single partition as the potentially semi-infinite sequence , and likewise with other partitions.

We can show the originally generated sequence of events as an interleaving of all of these sequences

In this sequence, we have marked coordination points between the partitions by putting two events in parentheses. This construction of the full sequence from the interleaving of the partitions is the reverse of how things happen in practice, but it makes it simpler to show the relationship in the ordering of events.

Abstract execution

The abstract execution of the message stream by a multi-threaded program can be represented an arbitrary, ordered interleaving of the events in the partitions. The re-ordering of events in this final execution can be caused both by the message transport mechanism (MapR Streams in our case) or by the multi-threaded nature of the program receiving the messages. Thus, the following is a valid execution of the original stream shown above

Lines have been added here to highlight pairs from the original. Notice how event has slipped into the middle of the pair as has event . Also, note how the first pair is flipped. This kind of event rearrangement can result in pairs overlapping or being entirely inverted, but there can be constraints that limit the rearrangement because of the requirement that each partitions events be processed in order. In our example above, the first pair contains and the second contains . The ordering of these events cannot be changed in any valid interleaving because it would contradict the ordering of these events in the B partition so these pairs can overlap, but they cannot be completely reordered.

Concrete Execution

A concrete execution of any valid abstract execution can be derived by a sequential simulation program that walks through the actions of the threads of control that actually execute the events. This is done by assigning every partition to some thread at each event in the abstract execution. The assignment of partitions to threads can change at any time so we can't have a universal assignment that stays constant. The result of running the simulation on the abstract execution is the concrete execution of the events.

The simulation can model the process of the threads stepping through the abstract execution event by event. The simulation takes as input the abstract execution, the total number of threads, and the assignment of partitions to threads at each event in the abstract execution. While the number of threads is conceptually unbounded, only as many threads as there are events in the abstract execution can possibly be involved in the simulation.

Initially, all threads are in a pool of unblocked threads. All events can be marked as blocking or non-blocking and are initially marked as non-blocking except for events that are the first of a pair which are marked blocking. Simulation proceeds by removing an unblocked thread from the pool and simulating its behavior until it reaches the end of all events (terminates successfully) or encounters a blocking event that is in a partition assigned to the thread. Threads that encounter a blocking event are attached to the blocking event and simulation starts again with the selection of a thread from the unblocked pool.

Whenever a thread encounters an event in a partition it is processing that is the second event of a pair, the corresponding first event of the pair is marked as non-blocking and any thread that was attached to the first event of the pair is moved to the unblocked pool. When a thread runs out of events to process, it is marked as terminated.

The simulation terminates when there are no threads left in the unblocked pool. At this point, all threads must have terminated successfully or been blocked.

If this simulation program can be shown to always reach a state where all threads are successful, then the concrete execution is said to be deadlock free. If all possible concrete executions of an abstract execution are deadlock free, then that abstract execution is said to be deadlock free.

Our question, then, is whether all abstract executions constructed as described above are deadlock free.

Deadlocks are trivially possible

Consider the sequence . This can have an abstract execution like this:

If all partitions are assigned to the same thread, then that thread will pause when it encounters a1 and the simulation will end because there are no other threads to continue execution. Furthermore, no finite amount of look-ahead can allow that thread to realize that it will eventually unblock itself.

In contrast, if every partition is assigned to a unique thread single event look-ahead can guarantee that the process will never block. This can be proven by induction on the number of blocked threads.

In the trivial case of one blocked threads, then obviously every other thread must have traversed all events and the other event in the pair blocking the blocked thread must have been processed by some one of those other threads which is a contradiction or the other event in the pair must belong to the blocked thread. If the unblocking event belongs to the blocked thread, however, these events must have been adjacent in the original abstract sequence (they are a pair, after all) and thus no other event from the blocked thread can be between them. That means that the blocked thread will see the unblocking event with 1 event look-ahead and can unblock itself.

If there are n+1 blocked threads, then there must be some thread that is blocked on the earliest transaction. The unblocking event for that thread must occur later in the concrete execution sequence and must belong to some other thread (otherwise one event lookahead would win as before). The other thread that owns the unblocking event must also itself be blocked some time before the unblocking event. That means that the concrete execution looks like this:

$12$

Moreover, since we know by induction that n threads cannot result in deadlock, there can be no cycle dependencies among the n blocked threads other than the first blocked thread. Also, since to be blocked the n+1 threads must have a cycle of dependencies and since there is no cycle of length m < n+1 (by induction again), if the addition of the earliest blocked thread cannot result in a deadlock, then the induction is complete.

Resolution of the deadlock

Sandbox

, we maintain a state of either running or paused for each thread. The state of paused threads includes the event where they were paused. As we process each event, if the event is not part of a pair and the corresponding thread is running, then we rename the event. If the event is not part of a pair and the corresponding thread is currently paused, we defer the renaming of the event until the thread is returned to the running state. When we hit the first element of a pair, the corresponding thread is put into the paused state. When we hit the second element of a pair, the thread associated with the first element of that pair is switched to

A concrete execution can be said to be deadlock free if A deadlock can occur if there is some original stream that has no

Clearly, no such cycles are possible in the original single-threaded event sequence because there is a total order for all events. All events in a single partition will preserve this total order.

Notes

Total ordering defined by abstract trace

Total ordering => no dead-lock, even if total ordering not known (need argument)

De-interleaving gives invariant form for all concrete traces which is same as de-interleaved abstract trace

Execution directly of de-interleaved trace cannot fail because of existence of total ordering

Second argument, total ordering => no crossing dependencies in de-interleaved form

Third argument, consider events in a single thread to be connected by a directional link. Events connected by a synchronization link to be linked bidirectionally. (Total ordering => no loops) and (deadlock => loop) => (total ordering => no deadlock