cleanup and define consistent pattern for handling nodeos plugin shutdown #842

spoonincode · 2023-01-27T22:29:04Z

My impression is that we do not have a good handle on the threading semantics during shutdown of nodeos. The lingering issues we've seen such as AntelopeIO/leap#646 seem to be good evidence to back this assertion up.

I've long wondered if we really need a two-stage shutdown: first we get all the various plugin threads to stop -- thus, critically, ensuring no callbacks are being run anywhere -- and then destroy the plugins (and their members). Such a pattern is very asio-like and easy to reason on because it doesn't require any shutdown specific handling inside various callbacks such as "keep an object alive until shutdown is complete" or other shenanigans -- just rely on C++ to dtor all the objects naturally which is easy to reason about the safety if you're guaranteed no callbacks are in flight.

It turns out that appbase already seems to define a two stage shutdown. All plugins get their plugin_shutdown() called on them before all plugins are destroyed. The shutdown & destruction happens in reverse order to startup (which honors plugins' defined dependencies).

Surprisingly (to me), many plugins are already pretty good about just doing a minimal amount of work (such as stopping threads) in their plugin_shutdown(). But some, like chain_plugin, are significant violators of this proposed pattern.
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/chain_plugin/chain_plugin.cpp#L1110-L1121
In particular destroying chain::controller here at this stage of shutdown is rather alarming. It's almost even a smoking gun for shutdown grief if it weren't for chain_plugin fortuitously being the last plugin shutdown.

But more problematic is due to the lack of a cohesive and consistent pattern our code has become littered with what I'm going to call shutdown sprinkles. These are little bits of code sprinkled about that seem to be doing something to make shutdown safer: checking if the app is currently shutting down to short circuit something, trying to increase the lifetime of some shared_ptr somewhere; either via weak_ptrs or asio queuing, etc. In some cases it's not clear how such sprinkles could have any side effect at all! In other cases it appears overly conservative (likely due to lack in confidence how shutdown behaves, again due to lack of a consistent pattern). For some examples of shutdown sprinkles see the section below.

shutdown sprinkle examples and commentary

producer_plugin

producer_plugin's plugin_shutdown seems to have two great examples. First, consider the timer cancel
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/producer_plugin/producer_plugin.cpp#L1043-L1054
Recall that by the time plugin_shutdown() is called, the main event loop has been stopped and drained/cleared. It's impossible for callbacks to be run on the main event loop once plugins' shutdown has started. _timer is created on the main event loop, and all of _timer.async_wait()s are also being handled on the main thread. It seems impossible for this cancel() to have any side effect because the main event loop will never be pumped again.

An important caveat: if the async_wait() had a callback wrapped to an executor on a different thread, there would still be an opportunity it could have a side effect. For example, if the async_wait() was wrapped to the _thread_pool executor.

Another interesting sprinkle is this one
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/producer_plugin/producer_plugin.cpp#L1060
The comment appears to be inaccurate. Again, by this point the queue has already been drained/cleared and will never be pumped again. But that doesn't stop you from submitting work on to the queue still. And that appears to be what this line is doing: just wedging a reference to the shared_ptr that will cause the plugin_impl to live until after all the plugin's have been both shutdown & destroyed. I assume this is related to the circular dependency of chain_plugin & producer_plugin.

trace_api_plugin & http async handlers

trace_api_plugin makes uses of the ability to handle its HTTP RPC requests within http_plugin's thread pool.

trace_api_plugin requires http_plugin
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/trace_api_plugin/include/eosio/trace_api/trace_api_plugin.hpp#L12
This guarantees that http_plugin is started before trace_api_plugin, but also that trace_api_plugin::plugin_shutdown() is before http_plugin::plugin_shutdown(), and that trace_api_plugin::~trace_api_plugin() is before http_plugin::~http_plugin(). Importantly this means that http_plugin's threads will be stopped before trace_api_plugin::~trace_api_plugin(). It should never be possible for an http async callback to enter trace_api_plugin after trace_api_plugin::~trace_api_plugin() has started. Yet, trace_api_plugin always tries to lock a weak_ptr on itself for every http request
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/trace_api_plugin/trace_api_plugin.cpp#L298-L304
I suspect this is just being overly conservative.

http_plugin

https://github.com/AntelopeIO/leap/blob/df6ffbb016282d8f4a9f76ab3db04e52816d3c12/plugins/http_plugin/http_plugin.cpp#L102-L109
This check may be something to do with the the previous appbase behavior which drained its queue before calling plugins' shutdown. Consider in the proposed pattern that once all plugin's shutdowns are completed that no callbacks can be executing, and no plugins have started their destruction. Such a sprinkle would be unnecessary as there is no chance a straggling late callback may access a destroyed object.

I'm reluctant to make definitive sweeping recommendations because a deep dive through the code may yield some surprising interactions that make these impossible to implement. But I'm going to try anyway.

plugin_shutdown()s should only stop plugin specific threads, with few exceptions. Absolutely no destruction of any objects.
Do not rely on shared_ptr ownership to keep alive objects to varying degrees during shutdown.
- More generally, lean heavily on standard C++ destruction ordering.
Aggressively remove any shutdown sprinkles.
Remove any circular plugin dependencies as these without a doubt cause shutdown sprinkles ([1.0] Fix incorrect threshold calculation in setFinalizers #668 for a known one)
Refactor appbase to not use a static for its lifecycle ([1.0 -> main] Use other_branch_latest_time for liveness #650 is blocker)
Reflect further on how exceptions during plugin_shutdown() and destruction are handled.
- Can we make it a rule that exceptions during shutdown & dtor not be allowed to communicate error codes through main?
Reflect further on how exceptions during plugin_initialize() & plugin_startup() are handled.
- Some plugins, like producer_plugin, manually call plugin_shutdown() themselves when their plugin_startup() fails. Why?
- [greg: see https://github.com/Ensure that plugin_shutdown() is always called after plugin_startup() appbase#25]
Clearly define and document appbase's semantics of init/startup/shutdown/dtor and their rules w.r.t. exceptions. Some of these you can discover just by reading the code but they need to be clearly documented. For example:
- If a plugin throws during plugin_startup(), what other plugins get plugin_shutdown()?
- If a plugin throws during plugin_shutdown(), what happens -- are other plugins still shutdown even after that exception occurs?

The text was updated successfully, but these errors were encountered:

bhazzard · 2023-04-20T18:07:27Z

Create new issues for refinement for any items that are not actionable now.

bhazzard · 2023-04-27T20:04:29Z

Split into:

greg7mdp · 2023-05-23T20:18:26Z

Refactor appbase to not use a static for its lifecycle

@spoonincode The main use of a static is to allow the app() call which returns the application, as in:

my->state_dir = app().data_dir() / sd;

auto& net = app().get_plugin<net_plugin>();

my->incoming_transaction_ack_subscription = app().get_channel<compat::channels::transaction_ack>().subscribe(...

Would you like to remove the static variable, at the cost of passing and storing a variable containing a reference to appbase::application?

It would also be a breaking change in appbase.

greg7mdp · 2023-05-24T19:34:02Z

Can we make it a rule that exceptions during shutdown & dtor not be allowed to communicate error codes through main?

@spoonincode What you you mean by that? Currently when we have an exceptions during shutdown or dtor, we log a message on std::cerr. Do you think we should stop doing that? What would be the rationale for that?

bhazzard · 2023-06-22T17:14:32Z

Not urgent, but blocked awaiting feedback from Matt

enf-ci-bot added the triage label Jan 27, 2023

heifner removed the triage label Jan 30, 2023

bhazzard added the 👍 lgtm label Apr 20, 2023

bhazzard mentioned this issue Apr 27, 2023

Proposal: a consistent and safe way to handle nodeos "shutdown sprinkles" AntelopeIO/leap#1098

Closed

4 tasks

bhazzard changed the title ~~cleanup and define consistent pattern for handling nodeos plugin shutdown~~ nodeos plugin shutdown consistency Apr 27, 2023

bhazzard changed the title ~~nodeos plugin shutdown consistency~~ cleanup and define consistent pattern for handling nodeos plugin shutdown Apr 27, 2023

bhazzard mentioned this issue Apr 27, 2023

Improve Nodeos Plugin Shutdown Patterns AntelopeIO/leap#1099

Closed

5 tasks

greg7mdp self-assigned this May 22, 2023

This was referenced May 24, 2023

Ensure that plugin_shutdown() is always called after plugin_startup() AntelopeIO/appbase#25

Closed

Ensure that plugin_shutdown() is always called. AntelopeIO/appbase#26

Merged

greg7mdp referenced this issue in AntelopeIO/leap Jun 5, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

97aa420

greg7mdp referenced this issue in AntelopeIO/leap Jun 5, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

a35e793

greg7mdp referenced this issue in AntelopeIO/leap Jun 5, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

2bd8818

greg7mdp referenced this issue in AntelopeIO/leap Jun 5, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

4869d63

greg7mdp linked a pull request Jun 5, 2023 that will close this issue

Cleanup and define consistent pattern for leap plugin startup/shutdown. AntelopeIO/leap#1231

Draft

greg7mdp referenced this issue in AntelopeIO/leap Jun 8, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

839f5c9

greg7mdp referenced this issue in AntelopeIO/leap Jun 12, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

42ca516

greg7mdp referenced this issue in AntelopeIO/leap Jun 12, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

ce96bf9

greg7mdp referenced this issue in AntelopeIO/leap Jun 14, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

42e8824

greg7mdp referenced this issue in AntelopeIO/leap Jun 15, 2023

Merge branch 'main' of github.com:AntelopeIO/leap into gh-672

2ac21bc

heifner mentioned this issue Sep 28, 2024

Drain io_context after shutdown of plugins AntelopeIO/appbase#34

Draft

arhag transferred this issue from AntelopeIO/leap Sep 30, 2024

arhag added the tech-debt label Sep 30, 2024

arhag added this to the Spring v1.1.0-rc1 milestone Sep 30, 2024

arhag unassigned greg7mdp Sep 30, 2024

This was referenced Sep 30, 2024

[1.0.2] SHiP: Fix hang on exit #845

Merged

[1.0.2 -> main] SHiP: Fix hang on exit #852

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cleanup and define consistent pattern for handling nodeos plugin shutdown #842

cleanup and define consistent pattern for handling nodeos plugin shutdown #842

spoonincode commented Jan 27, 2023 •

edited by greg7mdp

Loading

producer_plugin

trace_api_plugin & http async handlers

http_plugin

bhazzard commented Apr 20, 2023

bhazzard commented Apr 27, 2023

greg7mdp commented May 23, 2023 •

edited

Loading

greg7mdp commented May 24, 2023

bhazzard commented Jun 22, 2023

cleanup and define consistent pattern for handling nodeos plugin shutdown #842

cleanup and define consistent pattern for handling nodeos plugin shutdown #842

Comments

spoonincode commented Jan 27, 2023 • edited by greg7mdp Loading

producer_plugin

trace_api_plugin & http async handlers

http_plugin

bhazzard commented Apr 20, 2023

bhazzard commented Apr 27, 2023

greg7mdp commented May 23, 2023 • edited Loading

greg7mdp commented May 24, 2023

bhazzard commented Jun 22, 2023

spoonincode commented Jan 27, 2023 •

edited by greg7mdp

Loading

greg7mdp commented May 23, 2023 •

edited

Loading