GH-41334: [C++][Acero] Use per-node basis temp vector stack to mitigate overflow #41335

Merged (20 commits) on May 14, 2024

Conversation

@zanmato1984 (Collaborator) commented Apr 22, 2024

Rationale for this change

The risk of temp vector stack overflow still exists as described in #41334. Many people have agreed on a per-node approach:

  1. It doesn't introduce more performance penalty than the shared stack.
  2. It mitigates the overflow in a natural way, i.e., the total stack size expands linearly with the number of nodes.
  3. It adds no complexity to the existing stack implementation.

The full (but long) story unfolds in the subsequent discussion of this PR. Feel free to scroll down.

What changes are included in this PR?

  1. Change the current shared (per-thread) temp vector stack usage to a per-node basis (see the sketch below).
  2. Make the stack size required by each stack user more explicit.
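
To make the shape of change 1 concrete, here is a minimal, self-contained sketch. It is not Arrow's actual code: all names are invented stand-ins, and the real TempVectorStack API may differ. The point is only that each node owns and sizes its own stack(s) instead of borrowing a single shared, per-thread stack from the query context.

```cpp
// Sketch only -- not Arrow's implementation. All names are invented.
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for a fixed-size temp vector stack.
class StandInTempStack {
 public:
  explicit StandInTempStack(int64_t size) : buffer_(static_cast<size_t>(size)) {}
  int64_t capacity() const { return static_cast<int64_t>(buffer_.size()); }

 private:
  std::vector<uint8_t> buffer_;
};

// Per-node ownership: the node allocates one stack per thread, sized from its
// own, explicit requirement, rather than sharing a query-wide per-thread stack
// with every other node in the plan.
class StandInJoinNode {
 public:
  StandInJoinNode(int64_t stack_size_per_thread, int num_threads) {
    stacks_.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i) {
      stacks_.push_back(std::make_unique<StandInTempStack>(stack_size_per_thread));
    }
  }

  StandInTempStack& stack_for_thread(int i) { return *stacks_[i]; }

 private:
  std::vector<std::unique_ptr<StandInTempStack>> stacks_;
};
```

With this layout the total temp-stack memory grows linearly with the number of nodes, which is the trade-off discussed in the conversation below.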

Are these changes tested?

UT included.

Are there any user-facing changes?

None.

@zanmato1984 (Collaborator, Author) commented:

cc @westonpace @pitrou

@felipecrv (Contributor) left a comment

This looks good to me. I will merge soon unless @westonpace or @bkietz has extra comments.

Thank you for doing the lower/upper bound changes and keeping things robust.

@pitrou (Member) commented Apr 27, 2024 via email

@felipecrv (Contributor) commented:

The stack space might be storing pointers to itself, so it can't grow dynamically and preserve referential integrity.

The proper fix is for consumers to stop using the temp stack and manage their own heap allocations. But given that things currently do use this temp stack, allowing the size to be configured seemed like a requirement that follows from the earlier decision to have a thread-local temporary stack in the first place.

I won't merge before @pitrou comes back from vacation and has the time to explain his alternative solution.

@bkietz (Member) left a comment

There are two approaches we could take with temp vector overflow:

  • Modify TempVectorStack::alloc to allow graceful failure, which in turn requires all consumers of it to propagate that failure or retry with a resized temp stack (a rough sketch of this option follows the list).
  • It's somewhat odd that temp vectors are associated with a QueryContext; this means that join nodes within a query are sharing their statically allocated space. Since each join node guarantees that it only allocates minibatch-sized subsets of the temp stack, I think it makes more sense to move temp stacks from QueryContext to SwissJoin, allowing it to manage the memory it uses.
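
As a rough illustration of the first option only (an assumption about its shape, not Arrow's actual TempVectorStack API), a fallible allocation might look like the following, with every caller forced to propagate the failure:

```cpp
// Illustration of the "graceful failure" option. Not Arrow code; the status
// type and all names here are simplified stand-ins.
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct SimpleStatus {
  bool ok = true;
  std::string message;
  static SimpleStatus OK() { return {}; }
  static SimpleStatus CapacityError(std::string msg) { return {false, std::move(msg)}; }
};

class FallibleTempStack {
 public:
  explicit FallibleTempStack(int64_t size) : buffer_(static_cast<size_t>(size)) {}

  // Report exhaustion instead of asserting, so the caller can retry with a
  // larger stack or fall back to the heap.
  SimpleStatus Alloc(int64_t num_bytes, uint8_t** out) {
    if (top_ + num_bytes > static_cast<int64_t>(buffer_.size())) {
      return SimpleStatus::CapacityError("temp stack exhausted");
    }
    *out = buffer_.data() + top_;
    top_ += num_bytes;
    return SimpleStatus::OK();
  }

  void Release(int64_t num_bytes) { top_ -= num_bytes; }

 private:
  std::vector<uint8_t> buffer_;
  int64_t top_ = 0;
};

// The cost of this option: every consumer of the stack becomes fallible too.
SimpleStatus DoMiniBatchWork(FallibleTempStack& stack) {
  uint8_t* scratch = nullptr;
  SimpleStatus st = stack.Alloc(1024, &scratch);
  if (!st.ok) return st;  // propagate rather than overflow
  // ... use scratch for one mini-batch ...
  stack.Release(1024);
  return SimpleStatus::OK();
}
```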

@zanmato1984 (Collaborator, Author) commented Apr 30, 2024

  • It's somewhat odd that temp vectors are associated with a QueryContext; this means that join nodes within a query are sharing their statically allocated space. Since each join node guarantees that it only allocates minibatch-sized subsets of the temp stack, I think it makes more sense to move temp stacks from QueryContext to SwissJoin, allowing it to manage the memory it uses.

Well, this is exactly what I asked in a previous email thread [1] on the dev mailing list. The answer back then suggested the opposite, though :(

[1] https://lists.apache.org/thread/mm9667vst0jmk127thhsywjr1hvb614h

@bkietz (Member) commented Apr 30, 2024

Sorry for the confusion! I think per-node thread local temp stacks are what was eventually intended but that refactor was not completed.

@pitrou (Member) commented Apr 30, 2024

The stack space might be storing pointers to itself, so it can't grow dynamically and preserve referential integrity.

IIUC, this can be circumvented by using something like a std::list<std::array<uint8_t, N>> where N is a fixed per-chunk size.
(It could also be a std::deque<std::string> if you ensure that each individual string is never resized.)
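
A sketch of that idea (an interpretation of the suggestion, not existing Arrow code): backing the stack with fixed-size chunks lets it grow on demand while every pointer already handed out stays valid, because chunks are appended rather than reallocated.

```cpp
// Sketch of a chunk-list-backed temp stack. Simplified: it only shows growth;
// a real implementation would also track allocation state so space can be
// released and extra chunks dropped when the stack shrinks.
#include <array>
#include <cassert>
#include <cstdint>
#include <list>

template <std::size_t kChunkSize = 64 * 1024>
class ChunkedTempStack {
 public:
  uint8_t* Alloc(std::size_t num_bytes) {
    assert(num_bytes <= kChunkSize);  // one allocation must fit in one chunk
    if (chunks_.empty() || top_ + num_bytes > kChunkSize) {
      chunks_.emplace_back();  // grow by appending; existing chunks never move
      top_ = 0;
    }
    uint8_t* out = chunks_.back().data() + top_;
    top_ += num_bytes;
    return out;
  }

 private:
  std::list<std::array<uint8_t, kChunkSize>> chunks_;  // fixed-size chunks
  std::size_t top_ = 0;  // offset into the most recent chunk
};
```

Because std::list never relocates existing elements when new ones are appended, growing this way preserves the referential integrity that a plain resizable buffer would break.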

@zanmato1984 (Collaborator, Author) commented:

Sorry for the confusion! I think per-node thread local temp stacks are what was eventually intended but that refactor was not completed.

Sorry, I'm still confused. From the discussion in the email thread, it seems the intention at that time was to make the temp stack thread-local and inter-node, i.e., to share the same temp stack instance among all the nodes being executed in a specific thread (sometimes referred to as a "pipeline"). The "unfinished refactoring" back then might have been making more kinds of nodes use the shared temp stack rather than their own.

I would +1 the "per-node temp stack" solution, as I think it achieves the original goal of reducing heap allocation overhead while also reducing the possibility of overflow. Compared to the shared stack, it only introduces a moderate overhead: the memory allocated for each node's stack. I'm asking the above question just to sort out the whole story. Thanks.

@westonpace (Member) commented May 1, 2024

The PR mentions that the max value for this variable should be 64MB. That seems confusing to me. Why wouldn't we just always use 64MB? Is there concern that 64MB per plan is too much RAM overhead?

Sorry, I'm still confused. From the discussion in the email thread, it seems the intention at that time was to make the temp stack thread-local and inter-node, i.e., to share the same temp stack instance among all the nodes being executed in a specific thread (sometimes referred to as a "pipeline"). The "unfinished refactoring" back then might have been making more kinds of nodes use the shared temp stack rather than their own.

This is correct. At the time our (potentially flawed) belief was that the temp stack usage was fairly well bounded and so there didn't seem to be much advantage in every node needing to reinvent the wheel. If the knowledge of how much temp stack space is needed is local to specific nodes (and can vary widely so there is no reasonable default max) then per-node is correct.

I probably missed this in the earlier fixes, but do we actually know why the temp stack is overflowing? Is there some reason to think that a Swiss join node can predict how much temp stack space it needs? Otherwise I'm not sure why moving the temp stack to a per-node basis fixes the root cause.

It's somewhat odd that temp vectors are associated with a QueryContext

I believe the intention was to create something similar to malloc/new but faster (by taking advantage of pre-allocated thread local space).
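
For context, a minimal sketch of that idea (invented names, not the actual implementation): allocation is just a pointer bump over a buffer pre-allocated once per thread, which is what lets it beat malloc/new for short-lived, mini-batch-sized temp vectors.

```cpp
// Sketch of a thread-local bump allocator. Invented names; real code would
// report or mitigate overflow rather than merely assert.
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

constexpr std::size_t kStackSize = 512 * 1024;  // fixed size, shared by everything on the thread

struct Scratch {
  std::vector<uint8_t> buffer = std::vector<uint8_t>(kStackSize);
  std::size_t top = 0;
};

inline thread_local Scratch scratch;

inline uint8_t* TempAlloc(std::size_t num_bytes) {
  // This check is exactly where the GH-41334 overflow risk lives.
  assert(scratch.top + num_bytes <= kStackSize);
  uint8_t* out = scratch.buffer.data() + scratch.top;
  scratch.top += num_bytes;  // no system allocator call on the hot path
  return out;
}

inline void TempRelease(std::size_t num_bytes) {
  assert(num_bytes <= scratch.top);
  scratch.top -= num_bytes;  // LIFO release, like popping a stack frame
}

}  // namespace sketch
```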

this means that join nodes within a query are sharing their statically allocated space

This may be true since node A calls node B calls node C, etc. However, note that the stack itself would behave in the same way in this situation (stack frame A is followed by stack frame B followed by stack frame C).

If this nesting is somehow unbounded (potentially through async recursion), then you will get a stack overflow at some point, regardless of the size of the local temp stack, because you will eventually overflow the real stack (the C program stack).

I think Antoine's suggestion works too (if I understand it correctly), which is simply to allocate a new slab if there isn't space available in the initial slab. The extra slab can be deallocated when the stack shrinks enough that it is no longer needed.

@zanmato1984 (Collaborator, Author) left a comment

cc @pitrou

@@ -600,6 +600,7 @@ struct GrouperFastImpl : public Grouper {
   }

   Status Reset() override {
+    ARROW_DCHECK_EQ(temp_stack_.AllocatedSize(), 0);
@zanmato1984 (Collaborator, Author) commented:

This is not related to the purpose of this PR; it is something left over from #41352 (comment) and too trivial to warrant a separate PR.

 public:
  Status Init(MemoryPool* pool, int64_t size);

  int64_t AllocatedSize() const { return top_; }
@zanmato1984 (Collaborator, Author) commented:

This is not related to the purpose of this PR; it is something left over from #41352 (comment) and too trivial to warrant a separate PR.

@zanmato1984 (Collaborator, Author) left a comment

cc @pitrou

@pitrou merged commit 6c386da into apache:main on May 14, 2024 (38 of 41 checks passed)

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 6c386da.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request May 29, 2024
pitrou pushed a commit that referenced this pull request Jun 26, 2024
### Rationale for this change

The current recursion depth of 64 in the many-join test is too aggressive, so a stack overflow (of the C program stack) may happen on Alpine or Emscripten, causing issues like #43040.

### What changes are included in this PR?

Reduce the recursion depth to 16, which is still strong enough for the purpose of #41335, which introduced this test.

### Are these changes tested?

The change is itself a test.

### Are there any user-facing changes?

None.

* GitHub Issue: #43040

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024