[1.0.2] SHiP: Fix hang on exit #845

Merged
merged 3 commits into release/1.0 from GH-822-ship-hang on Oct 1, 2024

Conversation

@heifner (Member) commented Sep 30, 2024

This is a temporary fix until issue #842 can be worked. Some alternatives were attempted, but no simple fix could be found that didn't require more work toward #842 than desired for 1.0.x. See AntelopeIO/appbase#34.

Test runs of a hacked-up state_history_plugin with added sleeps caused it to hang on almost every CI/CD run.
Example failures: https://github.com/AntelopeIO/spring/actions/runs/11076070901
With this fix: https://github.com/AntelopeIO/spring/actions/runs/11124229769

Resolves #822
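
For context, the pattern at the heart of the fix is to drain the application's io_context during SHiP shutdown so that any queued handlers still holding references to the plugin run (and release those references) before the plugin is destroyed. A minimal standalone sketch of that pattern, assuming Boost.Asio; ship_like_plugin and the main() scaffolding are illustrative stand-ins, not plugin code:

#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <iostream>
#include <memory>

// Illustrative stand-in for work that still references the plugin at shutdown.
struct ship_like_plugin {
   void on_work() { std::cout << "handler still using the plugin\n"; }
   ~ship_like_plugin() { std::cout << "plugin destroyed\n"; }
};

int main() {
   boost::asio::io_context ioc;   // stand-in for app().executor().get_io_service()
   auto plugin = std::make_shared<ship_like_plugin>();

   // A queued handler keeps the plugin alive until it runs.
   boost::asio::post(ioc, [plugin] { plugin->on_work(); });

   ioc.stop();                    // the application main loop has already exited

   // The fix: restart the stopped io_context and poll until no ready
   // handlers remain, so nothing queued is left holding the plugin.
   ioc.restart();
   while (ioc.poll())
      ;

   plugin.reset();                // now the last reference really goes away
}

The important property is that poll() never blocks, so the shutdown path cannot hang waiting for work that will never arrive.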

…raction of app io_context and plugin threads.
@heifner heifner added the OCI Work exclusive to OCI team label Sep 30, 2024
@heifner heifner linked an issue Sep 30, 2024 that may be closed by this pull request
@heifner heifner changed the title [1.0.2] SHiP: Fix hand on exit [1.0.2] SHiP: Fix hang on exit Sep 30, 2024
app().executor().get_io_service().restart();
while (app().executor().get_io_service().poll())
;
}
Member

It looks like a SIGINT etc here will interrupt it. Probably fine.

I wonder if there is a way to add some sort of sentinel to know that "anything that could be referencing state_history_plugin" has actually been drained, so we don't somehow get in a state where we loop forever here because something else unrelated is post()ing forever. I'm not immediately coming up with a good solution though.
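
One possible shape of that sentinel, purely as an illustration and not part of this PR: have the plugin wrap every handler it posts so completions can be counted, and bound the drain loop on that count instead of on poll() alone. All names below (drain_sentinel, track, drain_ship_work) are hypothetical.

#include <boost/asio/io_context.hpp>
#include <atomic>
#include <memory>
#include <utility>

// Hypothetical sketch: the plugin wraps every handler it posts to the app
// io_context so shutdown can tell when all of *its* work has run, instead of
// polling until the whole context happens to be idle.
class drain_sentinel {
   std::shared_ptr<std::atomic<int>> outstanding_ = std::make_shared<std::atomic<int>>(0);
public:
   // Wrap a handler so that running it decrements the outstanding count.
   template <typename F>
   auto track(F&& f) {
      ++*outstanding_;
      return [count = outstanding_, fn = std::forward<F>(f)]() mutable {
         fn();
         --*count;
      };
   }
   bool drained() const { return *outstanding_ == 0; }
};

// Shutdown drain bounded by the sentinel: an unrelated component that keeps
// post()ing forever can no longer keep us in this loop.
inline void drain_ship_work(boost::asio::io_context& ioc, const drain_sentinel& s) {
   ioc.restart();
   while (!s.drained() && ioc.poll_one())
      ;
}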

Member Author

Maybe just don't do the poll() call in a loop. We are not expecting anything to be posting at this point.

Member Author
@heifner heifner Oct 1, 2024

Interestingly enough, the poll() has to be in a loop. I guess the co_spawn must be posting additional work as it drains. I tried with just one call to poll(), and it didn't work.
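
A reduced illustration of that shutdown drain in the presence of co_spawn'd coroutines (assuming Boost.Asio; session() and the main() scaffolding are made up for the example). In this toy a single poll() may already be enough, but since one call was empirically not enough in the plugin, the loop is the defensive form:

#include <boost/asio/awaitable.hpp>
#include <boost/asio/co_spawn.hpp>
#include <boost/asio/detached.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <boost/asio/use_awaitable.hpp>
#include <iostream>

// Each co_await re-posts the continuation, so finishing the coroutine takes
// several rounds of ready handlers on the io_context.
boost::asio::awaitable<void> session(boost::asio::io_context& ioc) {
   for (int i = 0; i < 3; ++i)
      co_await boost::asio::post(ioc, boost::asio::use_awaitable);
   std::cout << "session finished cleanly\n";
}

int main() {
   boost::asio::io_context ioc;
   boost::asio::co_spawn(ioc, session(ioc), boost::asio::detached);

   ioc.stop();   // pretend the application loop already exited

   // Drain: keep calling poll() until it reports zero executed handlers.
   ioc.restart();
   while (ioc.poll())
      ;
}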

@@ -380,6 +380,14 @@ void state_history_plugin_impl::plugin_shutdown() {
accepted_block_connection.reset();
block_start_connection.reset();
thread_pool.stop();

Member

With the connections torn down, is there any risk pumping the main thread below might apply a block that ship would then not see?

Member Author
@heifner heifner Sep 30, 2024

Good call! It does seem like this could push a block into the controller that would be missed by SHiP.

Member Author
@heifner heifner Sep 30, 2024

Actually, no, I don't think that is possible. Everything that goes through app().executor().post is only pushed into the priority_queue for later execution, so we should probably clear that out again here.
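
A toy model of that two-stage dispatch, with hypothetical names rather than appbase's real API: posting through the application executor only enqueues the function into a priority queue, and nothing runs until the main loop explicitly pops that queue, so draining the io_context alone cannot apply a block behind SHiP's back.

#include <functional>
#include <queue>
#include <utility>

// Toy model with hypothetical names (not appbase's real API).
struct prio_item {
   int                   priority;
   std::function<void()> fn;
   bool operator<(const prio_item& other) const { return priority < other.priority; }
};

class toy_executor {
   std::priority_queue<prio_item> queue_;
public:
   // Stage 1: the io_context handler for app().executor().post only pushes
   // the function into the priority queue; it does not execute it.
   void post(int priority, std::function<void()> fn) {
      queue_.push({priority, std::move(fn)});
   }
   // Stage 2: only the main application loop pops and runs queued work.
   void run_queued() {
      while (!queue_.empty()) {
         auto item = queue_.top();
         queue_.pop();
         item.fn();
      }
   }
   // What the comment suggests for shutdown: discard instead of executing,
   // so no block gets applied after SHiP's connections are torn down.
   void clear() {
      while (!queue_.empty()) queue_.pop();
   }
};

Under that model, the drain in plugin_shutdown only fills the queue; clearing it afterwards, as suggested above, keeps those deferred handlers from ever running.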

Member

I wonder if we even need those connections to be manually torn down

Member Author

Probably not, but that can be cleaned up when we work the real solution.

@heifner heifner merged commit 4648fba into release/1.0 Oct 1, 2024
36 checks passed
@heifner heifner deleted the GH-822-ship-hang branch October 1, 2024 19:25
@ericpassmore ericpassmore added the bug The product is not working as was intended. label Oct 1, 2024
@ericpassmore (Contributor)

Note:start
category: Other
component: SHiP
summary: Fix hang on exit by draining references to SHiP found in the io_service.
Note:end

Labels: bug (The product is not working as was intended.), OCI (Work exclusive to OCI team)
Successfully merging this pull request may close: Test failure: ship_kill_client_test