Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the robustness of old ZooKeeper log removal #1040

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

trustin
Copy link
Member

@trustin trustin commented Sep 27, 2024

Motivation:

OldLogRemover in ZooKeeperCommandExecutor currently catches a Throwable when deleting an old log or its log blocks. However, it has two issues doing so:

  • It doesn't handle an exception that's raised when reading the metadata of the old log.
  • Throwable is way too wide exception to catch. Catching a KeeperException whose code is NONODE will be enough.
    • Note that the failure will only transfer the leadership to other replica, rather than stopping the whole replication process.

Modifications:

  • OldLogRemover now catches KeeperException whose code is NONODE only.
  • An attempt to read a missing log node's metadata is now handled properly.
  • Added more detail to the log messages about missing nodes
    • Split deleteLog() into deleteLog() and deleteLogBlock()

Result:

  • The leadership is not transferred anymore when OldLogRemover attempts to retrieve a missing log node's metadata, which is not really a critical issue.
    • Instead, the leadership will be transferred when an exception occurs not because of a missing node.

Motivation:

`OldLogRemover` in `ZooKeeperCommandExecutor` currently catches a
`Throwable` when deleting an old log or its log blocks. However, it has
two issues doing so:

- It doesn't handle an exception that's raised when reading the metadata
  of the old log.
- `Throwable` is way too wide exception to catch. Catching a
  `KeeperException` whose code is `NONODE` will be enough.
  - Note that the failure will only transfer the leadership to other
    replica, rather than stopping the whole replication process.

Modifications:

- `OldLogRemover` now catches `KeeperException` whose code is `NONODE`
  only.
- An attempt to read a missing log node's metadata is now handled
  properly.
- Added more detail to the log messages about missing nodes
  - Split `deleteLog()` into `deleteLog()` and `deleteLogBlock()`

Result:

- The leadership is not transferred anymore when `OldLogRemover`
  attempts to retrieve a missing log node's metadata, which is not
  really a critical issue.
  - Instead, the leadership will be transferred when an exception
    occurs not because of a missing node.
@trustin trustin added the defect label Sep 27, 2024
@trustin trustin added this to the 0.71.0 milestone Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant