Check whether to switch to fail when setting the node to pfail in cron #1061

Open · wants to merge 1 commit into base: unstable

Conversation

enjoy-binbin (Member)

This may speed up the transition to the fail state a bit.
Previously we would only check when we received a pfail/fail
report from others in gossip. If myself holds the last needed
vote, we can switch the node to fail directly here, without
waiting for the next gossip packet.


Signed-off-by: Binbin <[email protected]>
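For context, a paraphrased sketch of the gossip-side path the description refers to, based on how clusterProcessGossipSection() in src/cluster_legacy.c feeds the failure detector today (abbreviated, not the exact code):

```c
/* Abbreviated sketch: a PFAIL/FAIL flag carried in a gossip section only
 * counts as a failure report when the sender is a voting primary, and the
 * quorum check itself lives inside markNodeAsFailingIfNeeded(). */
if (sender && clusterNodeIsVotingPrimary(sender)) {
    if (flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_PFAIL)) {
        /* Record (or refresh) the sender's failure report for this node,
         * then re-check whether the quorum has been reached. */
        clusterNodeAddFailureReport(node, sender);
        markNodeAsFailingIfNeeded(node);
    } else {
        /* The sender can reach the node again: drop its report. */
        clusterNodeDelFailureReport(node, sender);
    }
}
```

Before this change, the cron path only asked the quorum question for single-shard clusters; everywhere else the node waited for the next gossip packet to trigger the path above.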

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.62%. Comparing base (d9c41e9) to head (f294c0b).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1061      +/-   ##
============================================
+ Coverage     70.59%   70.62%   +0.03%     
============================================
  Files           114      114              
  Lines         61673    61673              
============================================
+ Hits          43537    43557      +20     
+ Misses        18136    18116      -20     
Files with missing lines    Coverage Δ
src/cluster_legacy.c        86.10% <100.00%> (-0.04%) ⬇️

... and 16 files with indirect coverage changes

@@ -5066,7 +5066,7 @@ void clusterCron(void) {
             if (!(node->flags & (CLUSTER_NODE_PFAIL | CLUSTER_NODE_FAIL))) {
                 node->flags |= CLUSTER_NODE_PFAIL;
                 update_state = 1;
-                if (server.cluster->size == 1 && clusterNodeIsVotingPrimary(myself)) {
+                if (clusterNodeIsVotingPrimary(myself)) {
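
For readers following the diff: the relaxed condition is safe because markNodeAsFailingIfNeeded() re-checks the quorum itself before flipping PFAIL to FAIL, so the cron path only completes the failure when myself really is the last needed vote. A paraphrased sketch of that function (from src/cluster_legacy.c, with logging and comments trimmed):

```c
/* Paraphrased sketch: flip PFAIL -> FAIL only once a majority of voting
 * primaries (including myself, if I am one) agrees the node is unreachable.
 * If the quorum is not reached yet, this is a no-op and we keep waiting
 * for more gossip, exactly as before. */
void markNodeAsFailingIfNeeded(clusterNode *node) {
    int needed_quorum = (server.cluster->size / 2) + 1;

    if (!nodeTimedOut(node)) return; /* We can still reach it. */
    if (nodeFailed(node)) return;    /* Already marked as FAIL. */

    int failures = clusterNodeFailureReportsCount(node);
    /* Count my own vote if I am a voting primary. */
    if (clusterNodeIsVotingPrimary(myself)) failures++;
    if (failures < needed_quorum) return; /* No agreement from primaries yet. */

    node->flags &= ~CLUSTER_NODE_PFAIL;
    node->flags |= CLUSTER_NODE_FAIL;
    node->fail_time = mstime();

    /* Broadcast the FAIL so the rest of the cluster converges quickly. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
}
```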
Member

This change makes a lot of sense to me. I actually don't fully understand why we special-cased single-shard clusters in b3aaa0a

Member Author

Maybe it was just to reduce the impact, solving a specific (single-shard) problem at the time. @bentotten, would you be able to take a review?

Contributor

For single-shard clusters there is no need to wait for a quorum of 1, so we can directly fail the node here instead of waiting for the original gossip logic (for reference). Unfortunately, prior to that, a single-shard cluster would never properly mark a replica as failed.
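
(For concreteness: the fail quorum is a majority of voting primaries, i.e. server.cluster->size / 2 + 1, so with a single shard the quorum is 1 and the local primary's own PFAIL observation already satisfies it.)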

I believe we still need to wait to achieve a quorum from gossip to mark this node as truly failed, no? If the node uses the information it has from "old gossip" during each cluster cron run, won't this increase the number of calls to this function without increasing the accuracy of the failure report?

Contributor (@bentotten, Sep 20, 2024)

Though the benefit of the node being able to mark the replica as failed as soon as it knows it can't reach that replica in the last-needed-vote scenario also makes sense. @madolson, I know you did the initial investigation into the original bug; please chime in if you have any insights. This was not a scenario I had considered for the original commit, and I don't see any issue.

Member Author

Right, thanks for the input; that is new to me. clusterNodeFailureReportsCount will clean up old failure reports, so I think (hope) it avoids the stale-gossip risk. But you do make a good point; let's see if others have new thoughts.
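
To make the stale-report point concrete, a paraphrased sketch of the report-count path (from src/cluster_legacy.c; the validity window is the node timeout times CLUSTER_FAIL_REPORT_VALIDITY_MULT):

```c
/* Paraphrased sketch: failure reports older than the validity window are
 * dropped before they are counted, so a stale PFAIL report received via
 * gossip long ago cannot be what pushes the cron path over the quorum. */
void clusterNodeCleanupFailureReports(clusterNode *node) {
    listIter li;
    listNode *ln;
    mstime_t maxtime = server.cluster_node_timeout * CLUSTER_FAIL_REPORT_VALIDITY_MULT;
    mstime_t now = mstime();

    listRewind(node->fail_reports, &li);
    while ((ln = listNext(&li)) != NULL) {
        clusterNodeFailReport *report = ln->value;
        if (now - report->time > maxtime) listDelNode(node->fail_reports, ln);
    }
}

int clusterNodeFailureReportsCount(clusterNode *node) {
    clusterNodeCleanupFailureReports(node);
    return listLength(node->fail_reports);
}
```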
