Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arch: POSIX: Fix race with unused threads #17614

Merged
merged 1 commit into from
Jul 19, 2019

Conversation

aescolar
Copy link
Member

@aescolar aescolar commented Jul 17, 2019

Fix a race which seems to have been presenting itself
very sporadically on loaded systems.
The race seems to have caused tests/kernel/sched/schedule_api
to fail at random on native_posix.

The case is a bit convoluted:
When the kernel calls z_new_thread(), the POSIX arch saves
the new thread entry call in that new Zephyr thread stack
together with a bit of extra info for the POSIX arch.
And spawns a new pthread (posix_thread_starter()) which
will eventually (after the Zephyr kernel swapped to it),
call that entry function.
(Note that in principle a thread spawned by pthreads may
be arbitrarily delayed)
The POSIX arch does not try to synchronize to that new
pthread (because why should it) until the first time the
Zephyr kernel tries to swap to that thread.
But, the kernel may never try to swap to it.
And therefore that threat's posix_thread_starter() may never
have got to run before the thread was aborted, and its
Zephyr stack reused for something else by the Zephyr app.

As posix_thread_starter() was relaying on looking into that
thread stack, it may now be looking into another thread stack
or anything else.

So, this commit fixes it by having posix_thread_starter()
get the input it always needs not from the Zephyr stack,
but from its own pthread_create() parameter pointing to a
structure kept by the POSIX arch.

Note that if the thread was aborted before reaching that point
posix_thread_starter() will NOT call the Zephyr thread entry
function, but just cleanup.

With this change all "asynchronous" parts of the POSIX arch
should relay only on the POSIX arch own structures.

Signed-off-by: Alberto Escolar Piedras [email protected]


Fixes #17613

@aescolar aescolar added bug The issue is a bug, or the PR is fixing a bug area: native port Host native arch port (native_sim) labels Jul 17, 2019
@aescolar aescolar changed the title arch: POSIX: Fix possible race with unused threads arch: POSIX: Fix race with unused threads Jul 17, 2019
Fix a race which seems to have been presenting itself
very sporadically on loaded systems.
The race seems to have caused tests/kernel/sched/schedule_api
to fail at random on native_posix.

The case is a bit convoluted:
When the kernel calls z_new_thread(), the POSIX arch saves
the new thread entry call in that new Zephyr thread stack
together with a bit of extra info for the POSIX arch.
And spawns a new pthread (posix_thread_starter()) which
will eventually (after the Zephyr kernel swapped to it),
call that entry function.
(Note that in principle a thread spawned by pthreads may
be arbitrarily delayed)
The POSIX arch does not try to synchronize to that new
pthread (because why should it) until the first time the
Zephyr kernel tries to swap to that thread.
But, the kernel may never try to swap to it.
And therefore that thread's posix_thread_starter() may never
have got to run before the thread was aborted, and its
Zephyr stack reused for something else by the Zephyr app.

As posix_thread_starter() was relaying on looking into that
thread stack, it may now be looking into another thread stack
or anything else.

So, this commit fixes it by having posix_thread_starter()
get the input it always needs not from the Zephyr stack,
but from its own pthread_create() parameter pointing to a
structure kept by the POSIX arch.

Note that if the thread was aborted before reaching that point
posix_thread_starter() will NOT call the Zephyr thread entry
function, but just cleanup.

With this change all "asynchronous" parts of the POSIX arch
should relay only on the POSIX arch own structures.

Signed-off-by: Alberto Escolar Piedras <[email protected]>
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. This is one of those spurious failures that @marc-hb likes to worry about. As schedule_api is one of the tests that tends to fail spuriously on qemu, santicheck will retry failures and suppress this kind of thing.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 17, 2019

spurious failures that @marc-hb likes to worry about

See PR #17107

@aescolar
Copy link
Member Author

@andyross @marc-hb : It took me a while to decide myself to dig into it. Because it was failing on all platforms on CI I was just assuming there must have been something rotten with that test (apart from its dependency on the qemu host timing). Neither valgrind, nor the address sanitizer were able to pinpoint the problem, and a quick debug did not reveal anything. Only a couple of days ago I got annoyed enough with it to start digging deeper.

Copy link
Collaborator

@wopu-ot wopu-ot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ioannisg ioannisg requested a review from nashif July 18, 2019 14:24
@marc-hb
Copy link
Collaborator

marc-hb commented Jul 18, 2019

spurious failures that @marc-hb likes to worry about

Actually, PR #17107 is all about [q]emu and this is specific to native_posix, correct? So they're not related.

I haven't been testing native_posix much because it uses the random toolchain of the host by design which makes it a non-starter from a reproducible builds perspective. This seems fine: I think very few people if any are interested in signing native_posix builds or security around them in general, it's not in their purpose.

Off-topic; zephyrproject-rtos/sdk-ng#81 is one better place.

@aescolar aescolar merged commit fece690 into zephyrproject-rtos:master Jul 19, 2019
@aescolar aescolar deleted the races_np branch July 19, 2019 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: native port Host native arch port (native_sim) bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

POSIX arch: occasional failures of tests/kernel/sched/schedule_api on CI
5 participants