Delay coredump.service shutdown after scylla.service shutdown #436
Comments
Is this different from https://github.com/scylladb/scylla-enterprise/issues/2648, which was solved via scylladb/scylladb#12757?
No, this case is a coredump that happens during node shutdown, after systemd-coredump.socket is closed. Luckily for us, this coredump also happened in other cases.
I was finally able to reproduce this after patching scylla to delay shutdown and raise SIGSEGV:
So I tried to delay the coredump.service shutdown until after scylla-server.service has stopped, using the following drop-in conf:
Also, I found a GH issue which says that systemd-coredump@.service may get terminated during shutdown (systemd/systemd#7176), so I also added a workaround for this:
I thought we could now capture the coredump correctly with systemd-coredump, but we can't.
I tried again and again with slightly different configurations, but systemd-coredump never worked during shutdown.
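The ordering idea described in this comment can be sketched as below. This is a minimal sketch, not the drop-in actually used in this thread (that file is not shown). It assumes the drop-in is attached to scylla-server.service and orders it after systemd-coredump.socket, relying on systemd stopping units in the reverse of their start order. The helper name `render_ordering_dropin` and the drop-in path are hypothetical. As reported in this comment, ordering alone did not make systemd-coredump capture cores during shutdown.

```python
from pathlib import Path

# Hypothetical drop-in location; the thread does not show the real file.
DROPIN_PATH = Path("/etc/systemd/system/scylla-server.service.d/coredump-order.conf")


def render_ordering_dropin(after_units):
    """Return drop-in text that orders the owning service after the given units.

    systemd stops units in the reverse of their start order, so a unit named
    in After= is stopped only once the owning service has stopped.
    """
    lines = ["[Unit]"]
    lines += [f"After={unit}" for unit in after_units]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the text instead of writing it, so the sketch is safe to run anywhere.
    print(f"# would be written to {DROPIN_PATH}")
    print(render_ordering_dropin(["systemd-coredump.socket"]), end="")
```

After placing such a file on a real node, `systemctl daemon-reload` is needed for systemd to pick it up.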
…down We found that systemd-coredump does not correctly capture coredumps during system reboot or shutdown. As a workaround for this issue, set a coredump file path in kernel.core_pattern during system reboot or shutdown. It will save the core to /var/tmp/core.scylla.$PID.$TIMESTAMP. Fixes scylladb/scylla-machine-image#436
@avikivity Do you have any ideas about this issue?
Opened an issue on the systemd GH: systemd/systemd#28338
@avikivity ping, do you have any ideas?
…down We found that systemd-coredump does not correctly capture coredumps during system reboot or shutdown. As a workaround for this issue, set a coredump file path in kernel.core_pattern during system reboot or shutdown. It will save the core to /var/lib/scylla/shutdown-coredump/. Fixes scylladb/scylla-machine-image#436
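The workaround described in these commit messages, pointing kernel.core_pattern at a plain file path while the system is going down, can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the referenced commits. The function names are hypothetical, and the file name format inside /var/lib/scylla/shutdown-coredump/ is assumed to follow the core.scylla.$PID.$TIMESTAMP form from the earlier commit message. `%p` and `%t` are the kernel's core_pattern specifiers for the PID and the dump time in seconds since the epoch.

```python
from pathlib import Path

# Kernel interface for the core pattern; writing to it requires root.
CORE_PATTERN_FILE = Path("/proc/sys/kernel/core_pattern")


def build_core_pattern(directory, prefix="core.scylla"):
    """Build a core_pattern that writes plain core files into `directory`.

    %p expands to the PID of the dumped process and %t to the time of the
    dump in seconds since the epoch.
    """
    return f"{directory.rstrip('/')}/{prefix}.%p.%t"


def set_core_pattern(pattern, target=CORE_PATTERN_FILE):
    """Write the pattern to `target` (the real kernel file by default)."""
    Path(target).write_text(pattern + "\n")


if __name__ == "__main__":
    # Only print the pattern; call set_core_pattern() as root to apply it.
    print(build_core_pattern("/var/lib/scylla/shutdown-coredump/"))
```

Because the pattern does not start with `|`, the kernel writes the core file itself rather than piping it to systemd-coredump, so the dump should not depend on that service still running.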
Sorry for missing the issue. I often skip over scylla-machine-image because I don't maintain it. I'll look over it now.
I guess the problem is that, even with the dependency, systemd thinks the process is done (not sure why, since the PID still exists while dumping core), so it stops systemd-coredump while the core dump is in progress. Very funky.
@avikivity since we decided not to merge the workaround, what else can we do about this?
Issue description
Recently, in a test that executed a soft reboot of a node, scylla hit an 'aborting on shard' error. It didn't create a coredump due
It looks like the reboot triggers shutdown of the coredump service without waiting for the scylla service to stop. Can we make it stop only after scylla is down?
Impact
No coredump, which makes investigating issues harder.
Installation details
Kernel Version: 5.15.0-1028-aws
Scylla version (or git commit hash): 5.2.0~rc1-20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-05e1d6aa4f71f3f25 (aws: eu-west-1)
Test: longevity-5gb-1h-SoftRebootNodeMonkey-aws-test
Test id: 249f30ed-7007-4b8e-a320-1207ebca5e5d
Test name: scylla-5.2/nemesis/longevity-5gb-1h-SoftRebootNodeMonkey-aws-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 249f30ed-7007-4b8e-a320-1207ebca5e5d
$ hydra investigate show-logs 249f30ed-7007-4b8e-a320-1207ebca5e5d
Logs:
Jenkins job URL