Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix #4336: ingester leaving pool #4340

Merged
merged 4 commits into from
Jan 11, 2024
Merged

Bugfix #4336: ingester leaving pool #4340

merged 4 commits into from
Jan 11, 2024

Conversation

fulmicoton
Copy link
Contributor

@fulmicoton fulmicoton commented Jan 2, 2024

Avoid evicting ingester node from IngesterPool when facing or a transport error.

Before when facing such error case, we were removing the faulty nodes from the pool, and
Nn code path would re-add it to the ingester pool. The ingester pool is also used by the ingest
source resulting in bug #4336.

After this patch:

When facing a transport error, we assume the targetted node is
unreachable and chitchat has just not detected this just yet.

In an ideal world we would inform chitchat about this, but it is a bit
difficult to do codewise.

Instead, we register the leader as unavailable for the span of the
workbench. It will then react as if it was out of the pool for
subsequent retries.

A GetOrCreatedShard will carry the information that the node was
unavailable, and the control plane will attempt to create a shard on a
different node

A timeout on the other hand is treated as a normal retryable error.

Closes #4336

fn last_failure_is_transient(&self) -> bool {
match self.last_failure_opt {
Some(SubworkbenchFailure::IndexNotFound) => false,
Some(SubworkbenchFailure::SourceNotFound) => false,
Some(SubworkbenchFailure::Internal(_)) => false,
Some(SubworkbenchFailure::Internal(_)) => true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

treating internal error as transient.

@fulmicoton fulmicoton force-pushed the issue/4336-ingest-bug branch 3 times, most recently from 531f49a to 59d36e8 Compare January 8, 2024 07:46
@fulmicoton fulmicoton marked this pull request as ready for review January 8, 2024 07:46
@fulmicoton fulmicoton marked this pull request as draft January 8, 2024 07:49
@fulmicoton fulmicoton marked this pull request as ready for review January 8, 2024 09:22
…port error.

Before when facing such error case, we were removing the faulty nodes from the pool, and
Nn code path would re-add it to the ingester pool. The ingester pool is also used by the ingest
source resulting in bug #4336.

After this patch:

When facing a transport error, we assume the targetted node is
unreachable and chitchat has just not detected this just yet.

In an ideal world we would inform chitchat about this, but it is a bit
difficult to do codewise.

Instead, we register the leader as unavailable for the span of the
workbench. It will then react as if it was out of the pool for
subsequent retries.

A GetOrCreatedShard will carry the information that the node was
unavailable, and the control plane will attempt to create a shard on a
different node

A timeout on the other hand is treated as a normal retryable error.

Closes #4336
@fulmicoton fulmicoton changed the title First stab at bugfix. Bugfix. Jan 10, 2024
@fulmicoton fulmicoton changed the title Bugfix. Bugfix #4336: ingester leaving pool Jan 10, 2024
quickwit/quickwit-ingest/src/ingest_v2/workbench.rs Outdated Show resolved Hide resolved
quickwit/quickwit-ingest/src/ingest_v2/workbench.rs Outdated Show resolved Hide resolved
@fulmicoton fulmicoton enabled auto-merge (squash) January 11, 2024 02:06
@fulmicoton fulmicoton merged commit b556868 into main Jan 11, 2024
4 checks passed
@fulmicoton fulmicoton deleted the issue/4336-ingest-bug branch January 11, 2024 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants