Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup and define consistent pattern for handling nodeos plugin shutdown #842

Open
spoonincode opened this issue Jan 27, 2023 · 5 comments · May be fixed by AntelopeIO/leap#1231
Open

cleanup and define consistent pattern for handling nodeos plugin shutdown #842

spoonincode opened this issue Jan 27, 2023 · 5 comments · May be fixed by AntelopeIO/leap#1231

Comments

@spoonincode
Copy link
Member

spoonincode commented Jan 27, 2023

My impression is that we do not have a good handle on the threading semantics during shutdown of nodeos. The lingering issues we've seen such as AntelopeIO/leap#646 seem to be good evidence to back this assertion up.

I've long wondered if we really need a two-stage shutdown: first we get all the various plugin threads to stop -- thus, critically, ensuring no callbacks are being run anywhere -- and then destroy the plugins (and their members). Such a pattern is very asio-like and easy to reason on because it doesn't require any shutdown specific handling inside various callbacks such as "keep an object alive until shutdown is complete" or other shenanigans -- just rely on C++ to dtor all the objects naturally which is easy to reason about the safety if you're guaranteed no callbacks are in flight.

It turns out that appbase already seems to define a two stage shutdown. All plugins get their plugin_shutdown() called on them before all plugins are destroyed. The shutdown & destruction happens in reverse order to startup (which honors plugins' defined dependencies).

Surprisingly (to me), many plugins are already pretty good about just doing a minimal amount of work (such as stopping threads) in their plugin_shutdown(). But some, like chain_plugin, are significant violators of this proposed pattern.
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/chain_plugin/chain_plugin.cpp#L1110-L1121
In particular destroying chain::controller here at this stage of shutdown is rather alarming. It's almost even a smoking gun for shutdown grief if it weren't for chain_plugin fortuitously being the last plugin shutdown.

But more problematic is due to the lack of a cohesive and consistent pattern our code has become littered with what I'm going to call shutdown sprinkles. These are little bits of code sprinkled about that seem to be doing something to make shutdown safer: checking if the app is currently shutting down to short circuit something, trying to increase the lifetime of some shared_ptr somewhere; either via weak_ptrs or asio queuing, etc. In some cases it's not clear how such sprinkles could have any side effect at all! In other cases it appears overly conservative (likely due to lack in confidence how shutdown behaves, again due to lack of a consistent pattern). For some examples of shutdown sprinkles see the section below.

shutdown sprinkle examples and commentary

producer_plugin

producer_plugin's plugin_shutdown seems to have two great examples. First, consider the timer cancel
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/producer_plugin/producer_plugin.cpp#L1043-L1054
Recall that by the time plugin_shutdown() is called, the main event loop has been stopped and drained/cleared. It's impossible for callbacks to be run on the main event loop once plugins' shutdown has started. _timer is created on the main event loop, and all of _timer.async_wait()s are also being handled on the main thread. It seems impossible for this cancel() to have any side effect because the main event loop will never be pumped again.

An important caveat: if the async_wait() had a callback wrapped to an executor on a different thread, there would still be an opportunity it could have a side effect. For example, if the async_wait() was wrapped to the _thread_pool executor.

Another interesting sprinkle is this one
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/producer_plugin/producer_plugin.cpp#L1060
The comment appears to be inaccurate. Again, by this point the queue has already been drained/cleared and will never be pumped again. But that doesn't stop you from submitting work on to the queue still. And that appears to be what this line is doing: just wedging a reference to the shared_ptr that will cause the plugin_impl to live until after all the plugin's have been both shutdown & destroyed. I assume this is related to the circular dependency of chain_plugin & producer_plugin.

trace_api_plugin & http async handlers

trace_api_plugin makes uses of the ability to handle its HTTP RPC requests within http_plugin's thread pool.

trace_api_plugin requires http_plugin
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/trace_api_plugin/include/eosio/trace_api/trace_api_plugin.hpp#L12
This guarantees that http_plugin is started before trace_api_plugin, but also that trace_api_plugin::plugin_shutdown() is before http_plugin::plugin_shutdown(), and that trace_api_plugin::~trace_api_plugin() is before http_plugin::~http_plugin(). Importantly this means that http_plugin's threads will be stopped before trace_api_plugin::~trace_api_plugin(). It should never be possible for an http async callback to enter trace_api_plugin after trace_api_plugin::~trace_api_plugin() has started. Yet, trace_api_plugin always tries to lock a weak_ptr on itself for every http request
https://github.com/AntelopeIO/leap/blob/cab88df475fe63d654f8020c65a6d04a6f071fd5/plugins/trace_api_plugin/trace_api_plugin.cpp#L298-L304
I suspect this is just being overly conservative.

http_plugin

https://github.com/AntelopeIO/leap/blob/df6ffbb016282d8f4a9f76ab3db04e52816d3c12/plugins/http_plugin/http_plugin.cpp#L102-L109
This check may be something to do with the the previous appbase behavior which drained its queue before calling plugins' shutdown. Consider in the proposed pattern that once all plugin's shutdowns are completed that no callbacks can be executing, and no plugins have started their destruction. Such a sprinkle would be unnecessary as there is no chance a straggling late callback may access a destroyed object.


I'm reluctant to make definitive sweeping recommendations because a deep dive through the code may yield some surprising interactions that make these impossible to implement. But I'm going to try anyway.

  • plugin_shutdown()s should only stop plugin specific threads, with few exceptions. Absolutely no destruction of any objects.
  • Do not rely on shared_ptr ownership to keep alive objects to varying degrees during shutdown.
    • More generally, lean heavily on standard C++ destruction ordering.
  • Aggressively remove any shutdown sprinkles.
  • Remove any circular plugin dependencies as these without a doubt cause shutdown sprinkles ([1.0] Fix incorrect threshold calculation in setFinalizers #668 for a known one)
  • Refactor appbase to not use a static for its lifecycle ([1.0 -> main] Use other_branch_latest_time for liveness #650 is blocker)
  • Reflect further on how exceptions during plugin_shutdown() and destruction are handled.
    • Can we make it a rule that exceptions during shutdown & dtor not be allowed to communicate error codes through main?
  • Reflect further on how exceptions during plugin_initialize() & plugin_startup() are handled.
  • Clearly define and document appbase's semantics of init/startup/shutdown/dtor and their rules w.r.t. exceptions. Some of these you can discover just by reading the code but they need to be clearly documented. For example:
    • If a plugin throws during plugin_startup(), what other plugins get plugin_shutdown()?
    • If a plugin throws during plugin_shutdown(), what happens -- are other plugins still shutdown even after that exception occurs?
@bhazzard
Copy link

Create new issues for refinement for any items that are not actionable now.

@bhazzard bhazzard changed the title cleanup and define consistent pattern for handling nodeos plugin shutdown nodeos plugin shutdown consistency Apr 27, 2023
@bhazzard bhazzard changed the title nodeos plugin shutdown consistency cleanup and define consistent pattern for handling nodeos plugin shutdown Apr 27, 2023
@greg7mdp greg7mdp self-assigned this May 22, 2023
@greg7mdp
Copy link
Contributor

greg7mdp commented May 23, 2023

Refactor appbase to not use a static for its lifecycle

@spoonincode The main use of a static is to allow the app() call which returns the application, as in:

my->state_dir = app().data_dir() / sd;

auto& net = app().get_plugin<net_plugin>();

my->incoming_transaction_ack_subscription = app().get_channel<compat::channels::transaction_ack>().subscribe(...

Would you like to remove the static variable, at the cost of passing and storing a variable containing a reference to appbase::application?

It would also be a breaking change in appbase.

@greg7mdp
Copy link
Contributor

Can we make it a rule that exceptions during shutdown & dtor not be allowed to communicate error codes through main?

@spoonincode What you you mean by that? Currently when we have an exceptions during shutdown or dtor, we log a message on std::cerr. Do you think we should stop doing that? What would be the rationale for that?

@bhazzard
Copy link

Not urgent, but blocked awaiting feedback from Matt

@arhag arhag transferred this issue from AntelopeIO/leap Sep 30, 2024
@arhag arhag added this to the Spring v1.1.0-rc1 milestone Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

6 participants