Bugfix #4336: ingester leaving pool #4340

fulmicoton · 2024-01-02T15:22:11Z

Avoid evicting an ingester node from the IngesterPool when facing a transport error.

Before, when facing such an error, we removed the faulty node from the pool, and
no code path would re-add it to the ingester pool. The ingester pool is also used by the ingest
source, which resulted in bug #4336.

After this patch:

When facing a transport error, we assume the targeted node is
unreachable and that chitchat has simply not detected it yet.

In an ideal world we would inform chitchat about this, but it is a bit
difficult to do codewise.

Instead, we register the leader as unavailable for the span of the
workbench. That leader is then treated as if it were out of the pool for
subsequent retries.

A GetOrCreatedShard will carry the information that the node was
unavailable, and the control plane will attempt to create a shard on a
different node.

A timeout on the other hand is treated as a normal retryable error.
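
To make the behavior described above concrete, here is a minimal, self-contained sketch. It is not the code from this patch: `PersistError`, `Workbench`, `record_failure`, `candidates`, and the node IDs are hypothetical names chosen for illustration. It only shows the idea that a transport error marks the leader as unavailable in per-request state, a timeout leaves it eligible for retry, and the shared pool is never modified.

```rust
use std::collections::HashSet;

/// Hypothetical error classification; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PersistError {
    /// The request never reached the node (connection refused, reset, ...).
    Transport,
    /// The node did not answer in time; it may still be healthy.
    Timeout,
}

/// Per-request retry state (assumed shape). It is dropped once the request
/// completes, so nothing recorded here outlives the request.
#[derive(Debug, Default)]
pub struct Workbench {
    unavailable_leaders: HashSet<String>,
}

impl Workbench {
    /// Records a failed attempt against `leader_id`.
    /// The shared pool is never touched here.
    pub fn record_failure(&mut self, leader_id: &str, error: PersistError) {
        match error {
            // Assume the node is unreachable for the rest of this request.
            PersistError::Transport => {
                self.unavailable_leaders.insert(leader_id.to_string());
            }
            // A timeout is a plain retryable error: the node stays eligible.
            PersistError::Timeout => {}
        }
    }

    /// Returns the pool members still eligible for the next retry.
    pub fn candidates<'a>(&self, pool: &'a [String]) -> Vec<&'a str> {
        pool.iter()
            .map(String::as_str)
            .filter(|node_id| !self.unavailable_leaders.contains(*node_id))
            .collect()
    }

    /// Leaders to report upstream so that new shards are placed elsewhere.
    pub fn unavailable_leaders(&self) -> Vec<&str> {
        let mut leaders: Vec<&str> = self
            .unavailable_leaders
            .iter()
            .map(String::as_str)
            .collect();
        leaders.sort_unstable();
        leaders
    }
}
```

Because the unavailable set lives in the workbench rather than in the pool, other users of the pool, such as the ingest source, keep seeing the node until cluster membership itself reports it as gone.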

Closes #4336

fulmicoton · 2024-01-02T15:52:36Z

quickwit/quickwit-ingest/src/ingest_v2/workbench.rs

    fn last_failure_is_transient(&self) -> bool {
        match self.last_failure_opt {
            Some(SubworkbenchFailure::IndexNotFound) => false,
            Some(SubworkbenchFailure::SourceNotFound) => false,
-            Some(SubworkbenchFailure::Internal(_)) => false,
+            Some(SubworkbenchFailure::Internal(_)) => true,


Treating internal errors as transient.
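
As a standalone illustration of the change in the diff above, the classification can be sketched as a free function. This is a reduced stand-in, not the real type: only the three variants visible in the diff are modeled, the `String` payload of `Internal` is an assumption, and `is_transient` is a hypothetical name.

```rust
/// Reduced stand-in for the failure enum; only the variants visible in the
/// diff are modeled, and the `Internal` payload type is an assumption.
#[derive(Debug)]
pub enum SubworkbenchFailure {
    IndexNotFound,
    SourceNotFound,
    Internal(String),
}

/// Returns true when retrying the subrequest may succeed.
pub fn is_transient(failure: &SubworkbenchFailure) -> bool {
    match failure {
        // A missing index or source will not appear on retry.
        SubworkbenchFailure::IndexNotFound => false,
        SubworkbenchFailure::SourceNotFound => false,
        // After the change, internal errors are considered worth retrying.
        SubworkbenchFailure::Internal(_) => true,
    }
}
```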

Co-authored-by: Adrien Guillo <[email protected]>

fulmicoton force-pushed the issue/4336-ingest-bug branch from f8ffbb1 to 10748e1 Compare January 3, 2024 10:37

fulmicoton force-pushed the issue/4336-ingest-bug branch 3 times, most recently from 531f49a to 59d36e8 Compare January 8, 2024 07:46

fulmicoton marked this pull request as ready for review January 8, 2024 07:46

fulmicoton marked this pull request as draft January 8, 2024 07:49

fulmicoton force-pushed the issue/4336-ingest-bug branch from 59d36e8 to 83e8c35 Compare January 8, 2024 09:15

fulmicoton requested a review from guilload January 8, 2024 09:21

fulmicoton marked this pull request as ready for review January 8, 2024 09:22

fulmicoton force-pushed the issue/4336-ingest-bug branch from 83e8c35 to cd204d7 Compare January 8, 2024 09:24

fulmicoton force-pushed the issue/4336-ingest-bug branch from cd204d7 to 52f88ff Compare January 8, 2024 09:26

fulmicoton mentioned this pull request Jan 9, 2024

env var override or overriden? #4355

Closed

fulmicoton changed the title ~~First stab at bugfix.~~ Bugfix. Jan 10, 2024

fulmicoton changed the title ~~Bugfix.~~ Bugfix #4336: ingester leaving pool Jan 10, 2024

guilload approved these changes Jan 10, 2024

fulmicoton and others added 3 commits January 11, 2024 11:04

Update quickwit/quickwit-ingest/src/ingest_v2/workbench.rs

e8f8286

Co-authored-by: Adrien Guillo <[email protected]>

Rename ConnectionError->TransportError

715a58d

Merge branch 'main' into issue/4336-ingest-bug

49dc4e7

fulmicoton enabled auto-merge (squash) January 11, 2024 02:06

fulmicoton disabled auto-merge January 11, 2024 05:41

fulmicoton merged commit b556868 into main Jan 11, 2024
4 checks passed

fulmicoton deleted the issue/4336-ingest-bug branch January 11, 2024 05:41
