Interrupts may cause deadlocks with threaded runtime #393

Open · sberkun opened this issue Mar 14, 2024 · 3 comments

@sberkun (Collaborator) commented Mar 14, 2024

Currently, the threaded runtime uses mutexes to maintain exclusive access to key data structures (namely, environments). On embedded systems, Lingua Franca also supports scheduling physical actions from interrupts, via lf_schedule. However, the following sequence of events may occur:

  • Core 1 acquires the mutex (i.e. LF_MUTEX_LOCK(&env->mutex))
  • Core 1 begins modifying the environment
  • An interrupt occurs on core 1. The interrupt handler calls lf_schedule, which also modifies the environment.
  • Interrupt handler finishes.
  • Core 1 finishes modifying environment
  • Core 1 releases mutex

In this case, the mutex was unhelpful, and the environment could be corrupted by the interrupt handler. The situation is even worse if lf_schedule tries to acquire the mutex (is this the case? I could not find an LF_MUTEX_LOCK line). Then deadlock occurs inside the interrupt handler.
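To make the failure mode concrete, here is a minimal sketch of the two bad outcomes. All names are illustrative (this is not the actual runtime code, and pthreads stands in for the platform mutex):

```c
#include <pthread.h>

// Illustrative stand-in for an LF environment and its mutex.
typedef struct {
    pthread_mutex_t mutex;
    // ... event queue, current tag, etc. ...
} environment_t;

static environment_t env;

// Runs in thread context on core 1.
void worker_body(void) {
    pthread_mutex_lock(&env.mutex);   // core 1 now owns the mutex
    // ... modify the event queue ...  <-- interrupt fires here
    pthread_mutex_unlock(&env.mutex);
}

// Runs in interrupt context on core 1, preempting worker_body.
void timer_isr(void) {
    // If lf_schedule takes the lock here, it can never succeed: the owner
    // is the very thread this ISR preempted, and that thread cannot resume
    // until the ISR returns. Deadlock.
    // If lf_schedule does NOT take the lock, it mutates the environment
    // concurrently with the interrupted worker instead. Corruption.
    // lf_schedule(action, 0, NULL);
}
```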

There are a few potential solutions to this:

  • (1) Specify that mutexes should disable interrupts on embedded platforms. This is heavy-handed, but should work. The main concern is that interrupts would become disabled when any reaction is running, since reactions acquire a reactor-level mutex (here). Interrupts would therefore only be enabled when all cores are sleeping.
  • (2) Create a separate platform API for regular mutexes, and mutexes that disable interrupts. (i.e. there would be lf_mutex_init and lf_mutex_init_with_interrupts). The environments would be locked with "interrupt mutexes", and reactors locked with regular mutexes. On most platforms, the two APIs would just be equivalent. However, this would be a 3rd type of locking-related API (in addition to regular mutexes and critical sections), which may clutter the platform API.
  • (3) Use critical sections instead of mutexes for environments. Very similar to above.
  • (4) Create a separate function, lf_schedule_external, for scheduling from interrupts. Instead of modifying the environment, this function could have its own queue for external events. Then whenever an event occurs, the scheduler would need to transfer external events from the external event queue onto the regular event queue. The advantage of this approach is that interrupts only need to be disabled while the external event queue is being modified (see the sketch after this list). The main issue with this approach would be "waking up" the scheduler whenever the external event queue has something new; currently, waking up is handled by _lf_cond_timedwait, which assumes a mutex.
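For concreteness, option (4) could be backed by a single-producer/single-consumer ring buffer, which needs no lock at all on the ISR side. Note that lf_schedule_external, the buffer layout, and the drain function below are hypothetical names, not existing runtime APIs:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define EXT_QUEUE_SIZE 32  // must be a power of two

typedef struct { void* action; long long extra_delay; } ext_event_t;

// Single-producer (ISR) / single-consumer (scheduler) ring buffer.
static ext_event_t ext_queue[EXT_QUEUE_SIZE];
static atomic_size_t ext_head;  // written only by the consumer
static atomic_size_t ext_tail;  // written only by the producer

// Hypothetical API: called from an interrupt handler. Never blocks.
bool lf_schedule_external(void* action, long long extra_delay) {
    size_t tail = atomic_load_explicit(&ext_tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&ext_head, memory_order_acquire);
    if (tail - head == EXT_QUEUE_SIZE) return false;  // queue full
    ext_queue[tail % EXT_QUEUE_SIZE] =
        (ext_event_t){.action = action, .extra_delay = extra_delay};
    atomic_store_explicit(&ext_tail, tail + 1, memory_order_release);
    return true;
}

// Called by the scheduler (with env->mutex already held) to move
// external events onto the regular event queue.
void drain_external_events(void) {
    size_t head = atomic_load_explicit(&ext_head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&ext_tail, memory_order_acquire);
    while (head != tail) {
        ext_event_t e = ext_queue[head % EXT_QUEUE_SIZE];
        // ... insert e onto the real event queue, e.g. via lf_schedule ...
        (void)e;
        head++;
    }
    atomic_store_explicit(&ext_head, head, memory_order_release);
}
```

This still leaves the wake-up problem noted above: _lf_cond_timedwait pairs its condition variable with a mutex that an ISR cannot safely take, so signaling the scheduler from interrupt context would need some platform-specific mechanism.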
@edwardalee (Contributor) commented:

This is a major issue for embedded platforms. Thanks for raising it. A couple of comments:

  • (1) Reactions will acquire a mutex when running only if the mutex field is set, which I believe currently only occurs if there is a watchdog. But since watchdogs are likely to be used in the most timing-critical applications, this solution is unattractive: it will disrupt timing most in exactly the applications that are most sensitive to timing issues.
  • (2) This solution looks workable to me. It adds only one function to the platform API, and the default implementation is simple.
  • (3) Critical sections are, effectively, a global mutex. While this will work, it will create some interference between enclaves. Within each enclave, every event queue operation would have to acquire a global mutex.
  • (4) I can't think of a solution to the wake up problem.

In summary, (2) looks like the best solution to me.
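To illustrate why the addition is small, here is a rough sketch of what the proposed API might look like. Everything beyond the existing lf_mutex_* calls is a guess: the wrapper type, the LF_EMBEDDED_PLATFORM guard, and the availability of lf_disable_interrupts_nested / lf_enable_interrupts_nested from the platform layer are all assumptions.

```c
// Hypothetical wrapper type; on most platforms it is just a plain mutex.
typedef struct {
    lf_mutex_t mutex;
} lf_interrupt_mutex_t;

// Default implementation simply delegates to the existing mutex API.
int lf_mutex_init_with_interrupts(lf_interrupt_mutex_t* m) {
    return lf_mutex_init(&m->mutex);
}

int lf_mutex_lock_with_interrupts(lf_interrupt_mutex_t* m) {
#ifdef LF_EMBEDDED_PLATFORM            // hypothetical guard
    // Mask interrupts on this core before taking the lock, so an ISR can
    // never observe the mutex held by the thread it preempted.
    lf_disable_interrupts_nested();    // assumed platform hook
#endif
    return lf_mutex_lock(&m->mutex);
}

int lf_mutex_unlock_with_interrupts(lf_interrupt_mutex_t* m) {
    int result = lf_mutex_unlock(&m->mutex);
#ifdef LF_EMBEDDED_PLATFORM
    lf_enable_interrupts_nested();     // assumed platform hook
#endif
    return result;
}
```

On a multicore embedded target, contending for the lock with interrupts masked degenerates to spinning, but the wait stays bounded and ISR-safe.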

@erlingrj (Collaborator) commented Mar 15, 2024

I think we should use this opportunity to map this out entirely. As I see it there are actually two distinct synchronization needs:

  1. Synchronize among worker threads.
  2. Synchronize access to event queue.

How to do (1) and (2) depends on the following questions:
A. Do we use the threaded runtime?
B. Do we have any physical actions?
C. Which asynchronous contexts are there (threads, ISRs, or both)?

We can make a matrix of this:

Sync need | Threaded | Physical Actions | Async context | Sync mechanism
(1)       | no       | -                | -             | None
(2)       | no       | no               | -             | None
(2)       | no       | yes              | threads       | mutex
(2)       | no       | yes              | ISR           | disable
(2)       | no       | yes              | both          | disable+mutex
----------|----------|------------------|---------------|---------------
(1)       | yes      | -                | -             | mutex
(2)       | yes      | no               | -             | mutex
(2)       | yes      | yes              | threads       | mutex
(2)       | yes      | yes              | ISR           | disable+mutex
(2)       | yes      | yes              | both          | disable+mutex

So basically, how to implement the synchronization depends on the program: whether it has physical actions, and if so, who can schedule events on them. For the threaded runtime, it makes sense that all synchronization belonging to (1) should use the platform API with lf_mutex_lock etc. directly. But any synchronization belonging to (2) should go through a new level of indirection (which I believe lf_critical_section_enter might serve as?). This new level of indirection should either disable interrupts, or acquire a mutex, or both, depending on the answers to (B) and (C).
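A hypothetical sketch of that indirection, where the LF_SCHEDULES_FROM_ISR and LF_SCHEDULES_FROM_THREADS flags are invented here to stand in for whatever the code generator would emit per program (or, with enclaves, per environment), and the interrupt-masking hooks are assumed from the platform layer:

```c
// Hypothetical per-program flags answering questions (B) and (C):
//   LF_SCHEDULES_FROM_ISR     - physical actions scheduled from ISRs
//   LF_SCHEDULES_FROM_THREADS - physical actions scheduled from threads

int lf_critical_section_enter(environment_t* env) {
#ifdef LF_SCHEDULES_FROM_ISR
    lf_disable_interrupts_nested();        // assumed platform hook
#endif
#if defined(LF_SCHEDULES_FROM_THREADS) || defined(LF_THREADED)
    return lf_mutex_lock(&env->mutex);
#else
    (void)env;
    return 0;
#endif
}

int lf_critical_section_exit(environment_t* env) {
#if defined(LF_SCHEDULES_FROM_THREADS) || defined(LF_THREADED)
    int result = lf_mutex_unlock(&env->mutex);
#else
    (void)env;
    int result = 0;
#endif
#ifdef LF_SCHEDULES_FROM_ISR
    lf_enable_interrupts_nested();         // assumed platform hook
#endif
    return result;
}
```

With this scheme, a program with no ISR-scheduled physical actions pays nothing for interrupt masking, which is the point about not sacrificing performance below.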

I think this aligns with your suggestions, but without sacrificing performance (due to globally disabling interrupts) unless the program absolutely requires it. If we bring enclaves into the mix, then the new level of indirection should be implemented differently for each enclave, depending on whether it has physical actions and who is scheduling on it.

@sberkun (Collaborator, Author) commented Oct 7, 2024

#488 may be a fix; I started testing it before I graduated but unfortunately never finished.
