Interrupts may cause deadlocks with threaded runtime #393

Open · sberkun opened this issue Mar 14, 2024 · 3 comments

@sberkun (Collaborator) commented Mar 14, 2024

Currently, the threaded runtime uses mutexes to maintain exclusive access to key data structures (namely, environments). On embedded systems, Lingua Franca also supports scheduling physical actions from interrupts, via lf_schedule. However, the following sequence of events may occur:

  • Core 1 acquires the mutex (i.e. LF_MUTEX_LOCK(&env->mutex))
  • Core 1 begins modifying the environment
  • An interrupt occurs on core 1. The interrupt handler calls lf_schedule, which also modifies the environment.
  • Interrupt handler finishes.
  • Core 1 finishes modifying environment
  • Core 1 releases mutex

In this case, the mutex was unhelpful, and the environment could be corrupted by the interrupt handler. The situation is even worse if lf_schedule tries to acquire the mutex (is this the case? I could not find an LF_MUTEX_LOCK line). Then deadlock occurs inside the interrupt handler.
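To make the failure mode concrete, here is a minimal sketch of the two bad outcomes. All names are illustrative (this is not the actual runtime code, and pthreads stands in for the platform mutex):

```c
#include <pthread.h>

// Illustrative stand-in for an LF environment and its mutex.
typedef struct {
    pthread_mutex_t mutex;
    // ... event queue, current tag, etc. ...
} environment_t;

static environment_t env;

// Runs in thread context on core 1.
void worker_body(void) {
    pthread_mutex_lock(&env.mutex);   // core 1 now owns the mutex
    // ... modify the event queue ...  <-- interrupt fires here
    pthread_mutex_unlock(&env.mutex);
}

// Runs in interrupt context on core 1, preempting worker_body.
void timer_isr(void) {
    // If lf_schedule takes the lock here, it can never succeed: the owner
    // is the very thread this ISR preempted, and that thread cannot resume
    // until the ISR returns. Deadlock.
    // If lf_schedule does NOT take the lock, it mutates the environment
    // concurrently with the interrupted worker instead. Corruption.
    // lf_schedule(action, 0, NULL);
}
```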

There are a few potential solutions to this:

  • (1) Specify that mutexes should disable interrupts on embedded platforms. This is heavy-handed, but should work. The main concern is that interrupts would become disabled when any reaction is running, since reactions acquire a reactor-level mutex (here). Interrupts would therefore only be enabled when all cores are sleeping.
  • (2) Create a separate platform API for regular mutexes, and mutexes that disable interrupts. (i.e. there would be lf_mutex_init and lf_mutex_init_with_interrupts). The environments would be locked with "interrupt mutexes", and reactors locked with regular mutexes. On most platforms, the two APIs would just be equivalent. However, this would be a 3rd type of locking-related API (in addition to regular mutexes and critical sections), which may clutter the platform API.
  • (3) Use critical sections instead of mutexes for environments. Very similar to above.
  • (4) Create a separate function, lf_schedule_external, for scheduling from interrupts. Instead of modifying the environment, this function could have its own queue for external events. Then whenever an event occurs, the scheduler would need to transfer external events from the external event queue onto the regular event queue. The advantage of this approach is that interrupts only need to be disabled while the external event queue is being modified (see the sketch after this list). The main issue with this approach would be "waking up" the scheduler whenever the external event queue has something new; currently, waking up is handled by _lf_cond_timedwait, which assumes a mutex.
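For concreteness, option (4) could be backed by a single-producer/single-consumer ring buffer, which needs no lock at all on the ISR side. Note that lf_schedule_external, the buffer layout, and the drain function below are hypothetical names, not existing runtime APIs:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define EXT_QUEUE_SIZE 32  // must be a power of two

typedef struct { void* action; long long extra_delay; } ext_event_t;

// Single-producer (ISR) / single-consumer (scheduler) ring buffer.
static ext_event_t ext_queue[EXT_QUEUE_SIZE];
static atomic_size_t ext_head;  // written only by the consumer
static atomic_size_t ext_tail;  // written only by the producer

// Hypothetical API: called from an interrupt handler. Never blocks.
bool lf_schedule_external(void* action, long long extra_delay) {
    size_t tail = atomic_load_explicit(&ext_tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&ext_head, memory_order_acquire);
    if (tail - head == EXT_QUEUE_SIZE) return false;  // queue full
    ext_queue[tail % EXT_QUEUE_SIZE] =
        (ext_event_t){.action = action, .extra_delay = extra_delay};
    atomic_store_explicit(&ext_tail, tail + 1, memory_order_release);
    return true;
}

// Called by the scheduler (with env->mutex already held) to move
// external events onto the regular event queue.
void drain_external_events(void) {
    size_t head = atomic_load_explicit(&ext_head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&ext_tail, memory_order_acquire);
    while (head != tail) {
        ext_event_t e = ext_queue[head % EXT_QUEUE_SIZE];
        // ... insert e onto the real event queue, e.g. via lf_schedule ...
        (void)e;
        head++;
    }
    atomic_store_explicit(&ext_head, head, memory_order_release);
}
```

This still leaves the wake-up problem noted above: _lf_cond_timedwait pairs its condition variable with a mutex that an ISR cannot safely take, so signaling the scheduler from interrupt context would need some platform-specific mechanism.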
@edwardalee (Contributor) commented:

This is a major issue for embedded platforms. Thanks for raising it. A couple of comments:

  • (1) Reactions will acquire a mutex when running only if the mutex field is set, which I believe currently only occurs if there is a watchdog. But since watchdogs are likely to be used in the most timing-critical applications, this solution is unattractive: it will disrupt timing most in exactly the applications that are most sensitive to timing issues.
  • (2) This solution looks workable to me. It adds only one function to the platform API, and the default implementation is simple.
  • (3) Critical sections are, effectively, a global mutex. While this will work, it will create some interference between enclaves. Within each enclave, every event queue operation would have to acquire a global mutex.
  • (4) I can't think of a solution to the wake up problem.

In summary, (2) looks like the best solution to me.
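To illustrate why the addition is small, here is a rough sketch of what the proposed API might look like. Everything beyond the existing lf_mutex_* calls is a guess: the wrapper type, the LF_EMBEDDED_PLATFORM guard, and the availability of lf_disable_interrupts_nested / lf_enable_interrupts_nested from the platform layer are all assumptions.

```c
// Hypothetical wrapper type; on most platforms it is just a plain mutex.
typedef struct {
    lf_mutex_t mutex;
} lf_interrupt_mutex_t;

// Default implementation simply delegates to the existing mutex API.
int lf_mutex_init_with_interrupts(lf_interrupt_mutex_t* m) {
    return lf_mutex_init(&m->mutex);
}

int lf_mutex_lock_with_interrupts(lf_interrupt_mutex_t* m) {
#ifdef LF_EMBEDDED_PLATFORM            // hypothetical guard
    // Mask interrupts on this core before taking the lock, so an ISR can
    // never observe the mutex held by the thread it preempted.
    lf_disable_interrupts_nested();    // assumed platform hook
#endif
    return lf_mutex_lock(&m->mutex);
}

int lf_mutex_unlock_with_interrupts(lf_interrupt_mutex_t* m) {
    int result = lf_mutex_unlock(&m->mutex);
#ifdef LF_EMBEDDED_PLATFORM
    lf_enable_interrupts_nested();     // assumed platform hook
#endif
    return result;
}
```

On a multicore embedded target, contending for the lock with interrupts masked degenerates to spinning, but the wait stays bounded and ISR-safe.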

@erlingrj (Collaborator) commented Mar 15, 2024

I think we should use this opportunity to map this out entirely. As I see it there are actually two distinct synchronization needs:

  1. Synchronize among worker threads.
  2. Synchronize access to event queue.

How to do (1) and (2) depends on the following questions:
A. Do we use the threaded runtime?
B. Do we have any physical actions?
C. Which asynchronous contexts are there (threads, ISRs, or both)?

We can make a matrix of this:

Sync need | Threaded | Physical Actions | Async context | Sync mechanism
(1)       | no       | -                | -             | None
(2)       | no       | no               | -             | None
(2)       | no       | yes              | threads       | mutex
(2)       | no       | yes              | ISR           | disable
(2)       | no       | yes              | both          | disable+mutex
----------|----------|------------------|---------------|---------------
(1)       | yes      | -                | -             | mutex
(2)       | yes      | no               | -             | mutex
(2)       | yes      | yes              | threads       | mutex
(2)       | yes      | yes              | ISR           | disable+mutex
(2)       | yes      | yes              | both          | disable+mutex

So basically, how to implement the synchronization depends on the program: whether it has physical actions, and if so, who can schedule events on them. For the threaded runtime, it makes sense that all synchronization belonging to (1) should use the platform API with lf_mutex_lock etc. directly. But any synchronization belonging to (2) should go through a new level of indirection (which I believe lf_critical_section_enter might serve as?). This new level of indirection should either disable interrupts, or acquire a mutex, or both, depending on the answers to (B) and (C).
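A hypothetical sketch of that indirection, where the LF_SCHEDULES_FROM_ISR and LF_SCHEDULES_FROM_THREADS flags are invented here to stand in for whatever the code generator would emit per program (or, with enclaves, per environment), and the interrupt-masking hooks are assumed from the platform layer:

```c
// Hypothetical per-program flags answering questions (B) and (C):
//   LF_SCHEDULES_FROM_ISR     - physical actions scheduled from ISRs
//   LF_SCHEDULES_FROM_THREADS - physical actions scheduled from threads

int lf_critical_section_enter(environment_t* env) {
#ifdef LF_SCHEDULES_FROM_ISR
    lf_disable_interrupts_nested();        // assumed platform hook
#endif
#if defined(LF_SCHEDULES_FROM_THREADS) || defined(LF_THREADED)
    return lf_mutex_lock(&env->mutex);
#else
    (void)env;
    return 0;
#endif
}

int lf_critical_section_exit(environment_t* env) {
#if defined(LF_SCHEDULES_FROM_THREADS) || defined(LF_THREADED)
    int result = lf_mutex_unlock(&env->mutex);
#else
    (void)env;
    int result = 0;
#endif
#ifdef LF_SCHEDULES_FROM_ISR
    lf_enable_interrupts_nested();         // assumed platform hook
#endif
    return result;
}
```

With this scheme, a program with no ISR-scheduled physical actions pays nothing for interrupt masking, which is the point about not sacrificing performance below.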

I think this aligns with your suggestions, but without sacrificing performance (due to globally disabling interrupts) unless the program absolutely requires it. If we bring enclaves into the mix, then the new level of indirection should be implemented differently for each enclave, depending on whether it has physical actions and who is scheduling on it.

@sberkun (Collaborator, Author) commented Oct 7, 2024

#488 may be a fix; I started testing it before I graduated but unfortunately never finished.
