
[1.0.2] SHiP: Fix hang on shutdown #816

Merged 8 commits on Sep 24, 2024
Conversation

@heifner (Member) commented Sep 24, 2024:

state_history_plugin (SHiP) would sometimes hang on shutdown. The nodeos process would get stuck trying to lock a mutex of the main application io_context strand while destroying a boost::beast::websocket::stream.

Created a test that reproduces the failure (thanks @taokayan for #813). The failure was not reliably repeatable, but it did fail on every CI/CD run for at least one of the platforms. See for example: https://github.com/AntelopeIO/spring/actions/runs/11017331583/job/30595880305

Modified SHiP boost::beast::websocket to use the SHiP thread instead of the main application thread executor strand.

Resolves #815
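
For context, a minimal sketch of the idea behind the fix, assuming a dedicated SHiP io_context (names here are illustrative, not the plugin's actual code): the websocket stream's default executor comes from the socket it wraps, so constructing the socket on the SHiP thread's io_context keeps stream teardown off the main application executor.

```cpp
#include <boost/asio.hpp>
#include <boost/beast/websocket.hpp>

namespace ws = boost::beast::websocket;
using tcp    = boost::asio::ip::tcp;

// Hypothetical helper: build the stream from a socket bound to the SHiP
// io_context, so the stream's default executor belongs to the SHiP thread
// rather than the main application thread.
ws::stream<tcp::socket> make_ship_stream(boost::asio::io_context& ship_ioc) {
   return ws::stream<tcp::socket>{tcp::socket{ship_ioc}};
}
```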

@heifner heifner added the OCI Work exclusive to OCI team label Sep 24, 2024
@heifner heifner linked an issue Sep 24, 2024 that may be closed by this pull request
@ericpassmore ericpassmore added the bug The product is not working as was intended. label Sep 24, 2024
@ericpassmore (Contributor) commented:

Note:start
category: Other
component: SHiP
summary: Fix hang on shutdown.
Note:end

@spoonincode (Member) left a comment:

> Modified SHiP boost::beast::websocket to use the SHiP thread instead of the main application thread executor strand

I think really what this changes is what I'd call the "default executor" of the websocket (I am not sure what the proper terminology is). My understanding was that the default executor was never used based on how the code is structured because the bound executor for every async call on the websocket was a use_awaitable, and that will use the coroutine's executor (which is the strand created from ship thread, in this case). The websocket certainly uses the ship thread and strand prior to this change. But if this does fix the problem there must be some other case where the default executor matters. And I do think it's good to make the default executor match up.
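
For readers following the executor discussion, a hedged sketch of the distinction being made (illustrative names, not the plugin code): a completion token's associated executor takes precedence over the stream's default executor, which is why `use_awaitable` calls complete on the coroutine's executor.

```cpp
#include <boost/asio.hpp>
#include <boost/beast/core/flat_buffer.hpp>
#include <boost/beast/websocket.hpp>

namespace asio = boost::asio;
namespace ws   = boost::beast::websocket;
using tcp      = asio::ip::tcp;

asio::awaitable<void> session_loop(ws::stream<tcp::socket>& stream) {
   boost::beast::flat_buffer buffer;
   // use_awaitable carries the coroutine's executor, so this completion runs
   // wherever co_spawn launched the coroutine -- the stream's default
   // executor (stream.get_executor()) is bypassed.
   co_await stream.async_read(buffer, asio::use_awaitable);
}

// Spawning on the SHiP strand means every async op above completes on that
// strand, regardless of which io_context the stream was constructed from.
void start_session(asio::strand<asio::io_context::executor_type> ship_strand,
                   ws::stream<tcp::socket>& stream) {
   asio::co_spawn(ship_strand, session_loop(stream), asio::detached);
}
```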

```diff
@@ -101,7 +101,7 @@ struct state_history_plugin_impl {
    void create_listener(const std::string& address) {
       const boost::posix_time::milliseconds accept_timeout(200);
       // connections set must only be modified by main thread; run listener on main thread to avoid needing another post()
-      fc::create_listener<Protocol>(app().get_io_service(), _log, accept_timeout, address, "", [this](Protocol::socket&& socket) {
+      fc::create_listener<Protocol>(thread_pool.get_executor(), _log, accept_timeout, address, "", [this](Protocol::socket&& socket) {
```
@spoonincode (Member) commented on the diff:
There will need to be additional changes below here, otherwise connections will be modified on the ship thread, which is not allowed.

But that's really still not enough to get it 100% correct if the intent is to make the default executor of the stream be what we want -- the stream should be using a per-stream strand instead of the thread's executor. The problem is that fc::create_listener doesn't give access to the variant of async_accept() that allows creating the new socket with a different default executor than the listening socket. This is a real shortcoming, and without refactoring fc::create_listener I'm not sure offhand how to fix it.
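
For reference, a sketch of the asio facility being alluded to, under the assumption that the listener were refactored to use it (names are illustrative): the acceptor's async_accept overload that takes an executor constructs the accepted socket with that executor, so each connection can get a fresh strand as its default executor.

```cpp
#include <boost/asio.hpp>

namespace asio = boost::asio;
using tcp = asio::ip::tcp;

// Hypothetical accept step: the accepted socket is created on a brand-new
// per-connection strand instead of inheriting the listening socket's executor.
void accept_one(tcp::acceptor& acceptor, asio::io_context& ship_ioc) {
   auto per_conn_strand = asio::make_strand(ship_ioc);
   acceptor.async_accept(per_conn_strand,
      [](boost::system::error_code ec, auto socket) {
         // 'socket' is a basic_stream_socket rebound to the per-connection
         // strand; a websocket stream built from it gets that strand as its
         // default executor.
         if (!ec) { /* hand the socket off to a session */ }
      });
}
```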

@heifner (Member, Author):
In this case there is only one SHiP thread and its implicit strand. I agree that it would be better if fc::create_listener took a strand.

@spoonincode (Member):
Yeah, for 1.0.x that's fine, but we should really get it right on main -- the code is currently structured with the assumption that increasing ship threads 'just works'.

@heifner (Member, Author):
Seems passing a strand works.

Review threads on tests/ship_kill_client_test.py: resolved.
```cpp
fc::create_listener<Protocol>(strand, _log, accept_timeout, address, "", [this, strand](Protocol::socket&& socket) {
   boost::asio::post(app().get_io_service(), [this, strand, socket{std::move(socket)}]() mutable {
      catch_and_log([this, &socket, &strand]() {
         connections.emplace(new session(std::move(socket), std::move(strand), chain_plug->chain(),
```
@spoonincode (Member) commented on the change:
The changes here seem to make only a single strand for all connections. And then this single strand is moved here? Doesn't look right. I think what you had before was better even though it still didn't make the websocket's default executor be a strand.
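
For comparison, a sketch of the per-connection variant being suggested here (plain asio, illustrative names; not the actual fc::create_listener callback shape): the strand is created fresh for each accepted socket rather than moved from a single shared copy.

```cpp
#include <boost/asio.hpp>

namespace asio = boost::asio;
using tcp = asio::ip::tcp;

// Illustrative accept callback: each connection gets its own strand, so
// nothing shared is moved-from and multiple SHiP threads remain safe.
void on_accept(tcp::socket&& socket, asio::io_context& ship_ioc) {
   auto conn_strand = asio::make_strand(ship_ioc); // fresh strand per connection
   // ... construct the session from std::move(socket) and conn_strand ...
}
```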

```cpp
      }, _log));
// connections set must only be modified by main thread; run listener on ship thread so sockets use default executor of the ship thread
fc::create_listener<Protocol>(thread_pool.get_executor(), _log, accept_timeout, address, "", [this](Protocol::socket&& socket) {
   boost::asio::post(app().get_io_service(), [this, socket{std::move(socket)}]() mutable {
```
@spoonincode (Member):
Thinking about this more, I think this creates a (fantastically remote) possibility of the socket being destroyed after its executor is destroyed. This occurs when this callback is post()ed after the main thread is stopped, then the plugin is destroyed, and then appbase is destroyed, which destroys any pending callbacks. Obviously something we can tolerate for 1.0.x, but another good reason to improve create_listener.
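
To make the ordering concrete, a deliberately contrived sketch of the hazard (assumed names; the real path runs through appbase destroying its pending handlers):

```cpp
#include <boost/asio.hpp>
#include <memory>

namespace asio = boost::asio;
using tcp = asio::ip::tcp;

int main() {
   auto ship_ioc = std::make_unique<asio::io_context>(); // stands in for the SHiP thread pool
   asio::io_context main_ioc;                            // stands in for app().get_io_service()

   // The queued handler owns a socket whose executor lives in ship_ioc;
   // main_ioc is never run, so the handler stays pending.
   asio::post(main_ioc, [s = tcp::socket{*ship_ioc}]() mutable { /* never runs */ });

   ship_ioc.reset(); // SHiP pool torn down first...
   // ...then main_ioc's destructor destroys the still-pending handler, and
   // with it a socket that has outlived its execution context.
}
```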

Linked issue: SHiP hang on exit (#815)