
[1.0.2] SHiP: Fix hang on shutdown #816

Merged 8 commits on Sep 24, 2024
Conversation

@heifner (Member) commented Sep 24, 2024:

state_history_plugin (SHiP) would sometimes hang on shutdown. The nodeos process would get stuck trying to lock a mutex of the main application io_context strand while destroying a boost::beast::websocket::stream.

Created a test that reproduces the failure (thanks @taokayan for #813). The failure was not reliably repeatable, but it did fail on every CI/CD run for at least one of the platforms. See for example: https://github.com/AntelopeIO/spring/actions/runs/11017331583/job/30595880305

Modified SHiP boost::beast::websocket to use the SHiP thread instead of the main application thread executor strand.

Resolves #815
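
For context, a minimal sketch of the idea behind the fix, assuming a dedicated SHiP io_context (names here are illustrative, not the plugin's actual code): the websocket stream's default executor comes from the socket it wraps, so constructing the socket on the SHiP thread's io_context keeps stream teardown off the main application executor.

```cpp
#include <boost/asio.hpp>
#include <boost/beast/websocket.hpp>

namespace ws = boost::beast::websocket;
using tcp    = boost::asio::ip::tcp;

// Hypothetical helper: build the stream from a socket bound to the SHiP
// io_context, so the stream's default executor belongs to the SHiP thread
// rather than the main application thread.
ws::stream<tcp::socket> make_ship_stream(boost::asio::io_context& ship_ioc) {
   return ws::stream<tcp::socket>{tcp::socket{ship_ioc}};
}
```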

@heifner heifner added the OCI Work exclusive to OCI team label Sep 24, 2024
@heifner heifner linked an issue Sep 24, 2024 that may be closed by this pull request
@ericpassmore ericpassmore added the bug The product is not working as was intended. label Sep 24, 2024
@ericpassmore (Contributor) commented:

Note:start
category: Other
component: SHiP
summary: Fix hang on shutdown.
Note:end

@spoonincode (Member) left a comment:

> Modified SHiP boost::beast::websocket to use the SHiP thread instead of the main application thread executor strand

I think really what this changes is what I'd call the "default executor" of the websocket (I am not sure what the proper terminology is). My understanding was that the default executor was never used based on how the code is structured because the bound executor for every async call on the websocket was a use_awaitable, and that will use the coroutine's executor (which is the strand created from ship thread, in this case). The websocket certainly uses the ship thread and strand prior to this change. But if this does fix the problem there must be some other case where the default executor matters. And I do think it's good to make the default executor match up.
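
For readers following the executor discussion, a hedged sketch of the distinction being made (illustrative names, not the plugin code): a completion token's associated executor takes precedence over the stream's default executor, which is why `use_awaitable` calls complete on the coroutine's executor.

```cpp
#include <boost/asio.hpp>
#include <boost/beast/core/flat_buffer.hpp>
#include <boost/beast/websocket.hpp>

namespace asio = boost::asio;
namespace ws   = boost::beast::websocket;
using tcp      = asio::ip::tcp;

asio::awaitable<void> session_loop(ws::stream<tcp::socket>& stream) {
   boost::beast::flat_buffer buffer;
   // use_awaitable carries the coroutine's executor, so this completion runs
   // wherever co_spawn launched the coroutine -- the stream's default
   // executor (stream.get_executor()) is bypassed.
   co_await stream.async_read(buffer, asio::use_awaitable);
}

// Spawning on the SHiP strand means every async op above completes on that
// strand, regardless of which io_context the stream was constructed from.
void start_session(asio::strand<asio::io_context::executor_type> ship_strand,
                   ws::stream<tcp::socket>& stream) {
   asio::co_spawn(ship_strand, session_loop(stream), asio::detached);
}
```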

```diff
@@ -101,7 +101,7 @@ struct state_history_plugin_impl {
    void create_listener(const std::string& address) {
       const boost::posix_time::milliseconds accept_timeout(200);
       // connections set must only be modified by main thread; run listener on main thread to avoid needing another post()
-      fc::create_listener<Protocol>(app().get_io_service(), _log, accept_timeout, address, "", [this](Protocol::socket&& socket) {
+      fc::create_listener<Protocol>(thread_pool.get_executor(), _log, accept_timeout, address, "", [this](Protocol::socket&& socket) {
```
@spoonincode (Member) commented on the diff:
There will need to be additional changes below here, otherwise connections will be modified on the ship thread, which is not allowed.

But that's really still not enough to get it 100% correct if the intent is to make the default executor of the stream be what we want -- the stream should be using a per-stream strand instead of the thread's executor. The problem is that fc::create_listener doesn't give access to the variant of async_accept() that allows creating the new socket with a different default executor than the listening socket. This is a real shortcoming, and without refactoring fc::create_listener I'm not sure offhand how to fix it.
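
For reference, a sketch of the asio facility being alluded to, under the assumption that the listener were refactored to use it (names are illustrative): the acceptor's async_accept overload that takes an executor constructs the accepted socket with that executor, so each connection can get a fresh strand as its default executor.

```cpp
#include <boost/asio.hpp>

namespace asio = boost::asio;
using tcp = asio::ip::tcp;

// Hypothetical accept step: the accepted socket is created on a brand-new
// per-connection strand instead of inheriting the listening socket's executor.
void accept_one(tcp::acceptor& acceptor, asio::io_context& ship_ioc) {
   auto per_conn_strand = asio::make_strand(ship_ioc);
   acceptor.async_accept(per_conn_strand,
      [](boost::system::error_code ec, auto socket) {
         // 'socket' is a basic_stream_socket rebound to the per-connection
         // strand; a websocket stream built from it gets that strand as its
         // default executor.
         if (!ec) { /* hand the socket off to a session */ }
      });
}
```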

@heifner (Member, Author):
In this case there is only one SHiP thread and its implicit strand. I agree that it would be better if fc::create_listener took a strand.

@spoonincode (Member):
Yeah, for 1.0.x that's fine, but we should really get it right on main -- the code is currently structured with the assumption that increasing ship threads 'just works'.

@heifner (Member, Author):
Seems passing a strand works.

Review threads on tests/ship_kill_client_test.py: resolved.
```cpp
fc::create_listener<Protocol>(strand, _log, accept_timeout, address, "", [this, strand](Protocol::socket&& socket) {
   boost::asio::post(app().get_io_service(), [this, strand, socket{std::move(socket)}]() mutable {
      catch_and_log([this, &socket, &strand]() {
         connections.emplace(new session(std::move(socket), std::move(strand), chain_plug->chain(),
```
@spoonincode (Member) commented on the change:
The changes here seem to make only a single strand for all connections. And then this single strand is moved here? Doesn't look right. I think what you had before was better even though it still didn't make the websocket's default executor be a strand.
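
For comparison, a sketch of the per-connection variant being suggested here (plain asio, illustrative names; not the actual fc::create_listener callback shape): the strand is created fresh for each accepted socket rather than moved from a single shared copy.

```cpp
#include <boost/asio.hpp>

namespace asio = boost::asio;
using tcp = asio::ip::tcp;

// Illustrative accept callback: each connection gets its own strand, so
// nothing shared is moved-from and multiple SHiP threads remain safe.
void on_accept(tcp::socket&& socket, asio::io_context& ship_ioc) {
   auto conn_strand = asio::make_strand(ship_ioc); // fresh strand per connection
   // ... construct the session from std::move(socket) and conn_strand ...
}
```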

```cpp
      }, _log));
// connections set must only be modified by main thread; run listener on ship thread so sockets use default executor of the ship thread
fc::create_listener<Protocol>(thread_pool.get_executor(), _log, accept_timeout, address, "", [this](Protocol::socket&& socket) {
   boost::asio::post(app().get_io_service(), [this, socket{std::move(socket)}]() mutable {
```
@spoonincode (Member):
Thinking about this more, I think this creates a (fantastically remote) possibility of the socket being destroyed after its executor is destroyed. This occurs when this callback is post()ed after the main thread is stopped, then the plugin is destroyed, and then appbase is destroyed, which destroys any pending callbacks. Obviously something we can tolerate for 1.0.x, but another good reason to improve create_listener.
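
To make the ordering concrete, a deliberately contrived sketch of the hazard (assumed names; the real path runs through appbase destroying its pending handlers):

```cpp
#include <boost/asio.hpp>
#include <memory>

namespace asio = boost::asio;
using tcp = asio::ip::tcp;

int main() {
   auto ship_ioc = std::make_unique<asio::io_context>(); // stands in for the SHiP thread pool
   asio::io_context main_ioc;                            // stands in for app().get_io_service()

   // The queued handler owns a socket whose executor lives in ship_ioc;
   // main_ioc is never run, so the handler stays pending.
   asio::post(main_ioc, [s = tcp::socket{*ship_ioc}]() mutable { /* never runs */ });

   ship_ioc.reset(); // SHiP pool torn down first...
   // ...then main_ioc's destructor destroys the still-pending handler, and
   // with it a socket that has outlived its execution context.
}
```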

Linked issue: SHiP hang on exit (#815)