
Problem with tag assignment for asynchronous events #33

Open · oowekyala opened this issue Oct 21, 2022 · 4 comments
@oowekyala
Collaborator

There was a bug in the C++ runtime, and it can theoretically also occur in Rust (I wasn't able to reproduce it with an unmodified runtime; it depends on thread interleaving).

Possible faulty execution

  • An async thread reads the current time and computes the tag T for its new event. The thread is parked before the event is put into the event queue.
  • The scheduler continues executing reactions (e.g. from a timer) until tag T is exceeded.
  • The async thread wakes up and pushes the event into the queue.
  • The scheduler then sees an event that was scheduled for the past. That was the bug in C++, and it would currently crash the Rust runtime with an assertion failure.
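The interleaving above can be simulated deterministically. This is a hedged sketch, not runtime code: `simulate_race` is a made-up name, and plain integers stand in for the real (time, microstep) tags.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Returns (tag of the async event, latest tag processed by the scheduler).
fn simulate_race() -> (u64, u64) {
    let (tx, rx) = mpsc::channel::<u64>();

    // "Async thread": reads the clock and computes tag T for its event...
    let producer = thread::spawn(move || {
        let tag = 5u64; // tag T, computed from the current physical time
        // ...but is parked before the event reaches the queue:
        thread::sleep(Duration::from_millis(50));
        tx.send(tag).unwrap(); // pushed only after the scheduler moved on
    });

    // Scheduler: keeps executing reactions (e.g. from a timer) past tag T.
    let mut latest_processed = 0u64;
    for timer_tag in [2u64, 4, 6, 8] {
        latest_processed = timer_tag;
    }

    producer.join().unwrap();
    let async_tag = rx.recv().unwrap();
    (async_tag, latest_processed)
}

fn main() {
    let (async_tag, latest) = simulate_race();
    // The async event now lies in the scheduler's past -- the bug described above.
    assert!(async_tag < latest);
    println!("event tag {async_tag} < latest processed tag {latest}: stale event");
}
```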

C++ fix

In C++ there is a global event queue and a global mutex protecting it. The fix is to put the time reading and the pushing of the event in the same critical section.

Rust

In Rust the event queue is split:

  • The scheduler owns the only reference to the global, sorted queue. This is where events are popped from for execution.
  • Each async thread uses a channel Sender to push events to the scheduler asynchronously. The Receiver end maintains an unsorted buffer of events that the scheduler thread periodically flushes into the main queue. Events pushed through the Sender have already been assigned a tag.

We can assume Sender/Receiver communicate atomically.
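The split described above might be sketched roughly as follows, using `std::sync::mpsc`; the `Scheduler` type and its methods are illustrative, not the actual runtime's API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::mpsc::{channel, Receiver};

type Tag = u64; // stands in for a (time, microstep) pair

struct Scheduler {
    sorted: BinaryHeap<Reverse<Tag>>, // scheduler-owned global queue, min-first
    rx: Receiver<Tag>,                // unsorted buffer fed by async Senders
}

impl Scheduler {
    /// Drain every buffered async event into the sorted queue.
    fn flush(&mut self) {
        while let Ok(tag) = self.rx.try_recv() {
            self.sorted.push(Reverse(tag));
        }
    }

    /// Flush the buffer, then pop the earliest event.
    fn pop_next(&mut self) -> Option<Tag> {
        self.flush();
        self.sorted.pop().map(|Reverse(t)| t)
    }
}

fn demo() -> Vec<Tag> {
    let (tx, rx) = channel();
    let mut sched = Scheduler { sorted: BinaryHeap::new(), rx };
    tx.send(7).unwrap(); // async threads push pre-tagged events,
    tx.send(3).unwrap(); // without ever blocking on the scheduler
    sched.sorted.push(Reverse(5)); // e.g. a timer event already queued
    let mut order = Vec::new();
    while let Some(t) = sched.pop_next() {
        order.push(t);
    }
    order
}

fn main() {
    assert_eq!(demo(), vec![3, 5, 7]); // events come out in tag order
}
```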

Possible solutions for the Rust runtime

Global mutex

We could reproduce the C++ solution by introducing a mutex to guard the receiver and sender. However, this would defeat part of the purpose of using channels, which is that the async sender thread never has to block when sending.

Let the scheduler assign tags

Another solution would be to let the scheduler thread assign tags to asynchronous events. There are several possible problems with this:

  • This relies on the assumption that reaction execution times are negligible. A long-running reaction could significantly delay the tag assignment for an asynchronous event, which would compromise the runtime's real-time capabilities; the lag can, however, be measured and reported.
  • Async events would be "bucketed" into fewer tags than if they were assigned tags asynchronously. This could make more events simultaneous than necessary.

Mixed solution

We could use the asynchronously assigned tag as long as it is greater than the latest processed tag. If it isn't, we're in the problematic situation described above and can do one of the following:

  • crash,
  • drop the event and go on, or
  • assign the latest processed tag plus one microstep and go on,

reporting to the user in any case that something went wrong.

None of these looks very appealing in the general case. Maybe the behavior should be selectable:

  • globally, e.g. with a compile-time feature flag, or
  • per individual action, with an annotation in the source.
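The mixed approach could be sketched like this; `Tag`, `Policy`, and `admit` are hypothetical names for illustration, not the runtime's API.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Tag {
    time: u64,
    microstep: u32, // derived Ord compares (time, microstep) lexicographically
}

enum Policy {
    Crash,
    Drop,
    Adjust,
}

/// Decide what happens when an asynchronously tagged event reaches the
/// scheduler; `latest` is the latest processed tag. Returns the tag to use,
/// or None if the event is dropped.
fn admit(event_tag: Tag, latest: Tag, policy: Policy) -> Option<Tag> {
    if event_tag > latest {
        return Some(event_tag); // common case: the tag is still in the future
    }
    // The tag is in the past: apply the configured policy (and report it).
    match policy {
        Policy::Crash => panic!("async event scheduled in the past"),
        Policy::Drop => None,
        Policy::Adjust => Some(Tag { time: latest.time, microstep: latest.microstep + 1 }),
    }
}

fn main() {
    let latest = Tag { time: 10, microstep: 0 };
    let stale = Tag { time: 8, microstep: 0 };
    assert_eq!(admit(stale, latest, Policy::Drop), None);
    assert_eq!(
        admit(stale, latest, Policy::Adjust),
        Some(Tag { time: 10, microstep: 1 })
    );
    // A tag that is still in the future is used as-is under any policy.
    assert_eq!(
        admit(Tag { time: 12, microstep: 0 }, latest, Policy::Crash),
        Some(Tag { time: 12, microstep: 0 })
    );
}
```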
@lhstrh
Member

lhstrh commented Oct 21, 2022

I think reassigning the tag of the new event is the only reasonable option. We should think of it as a transaction: if the race occurs and the tag of the new event is wrong, we roll back, get a new tag, and attempt to insert it again.
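A minimal sketch of this retry-as-transaction idea, with the clock, the latest-processed-tag check, and the commit abstracted as closures; all names are hypothetical.

```rust
/// Repeatedly take a fresh tag until it is ahead of the scheduler, then commit.
fn insert_with_retry(
    mut now: impl FnMut() -> u64,       // physical clock, yielding candidate tags
    latest_processed: impl Fn() -> u64, // the scheduler's latest processed tag
    mut push: impl FnMut(u64),          // commit: insert into the event queue
) -> u64 {
    loop {
        let tag = now();
        // Commit only if the tag is still ahead of the scheduler.
        if tag > latest_processed() {
            push(tag);
            return tag;
        }
        // Otherwise "roll back" (nothing was pushed) and retry with a new tag.
    }
}

fn demo() -> (u64, Vec<u64>) {
    let mut t = 3u64;
    let mut queue = Vec::new();
    // Clock yields 5, 7, 9, ...; the scheduler has already processed tag 6,
    // so the first attempt (tag 5) is stale and the retry commits tag 7.
    let tag = insert_with_retry(|| { t += 2; t }, || 6, |x| queue.push(x));
    (tag, queue)
}

fn main() {
    assert_eq!(demo(), (7, vec![7]));
}
```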

@lhstrh
Member

lhstrh commented Oct 21, 2022

Note that whatever tag is obtained for the scheduled physical action is uncertain, anyway.

@oowekyala
Collaborator Author

Ok, I'll implement this.

For the record, I could not reproduce the bug without adding a thread::sleep in the middle of the critical section, in the runtime's code (not the LF program's). I suspect this bug is mostly theoretical...

@lhstrh
Member

lhstrh commented Nov 7, 2022

These kinds of bugs are load-dependent and might surface only rarely, yet I wouldn't call them theoretical, because that wrongly suggests they cannot really happen in deployment.
