GH-41334: [C++][Acero] Use per-node basis temp vector stack to mitigate overflow #41335
Conversation
Force-pushed from 0a9979e to 6b79553.
This looks good to me. I will merge soon unless @westonpace or @bkietz has extra comments.
Thank you for doing the lower/upper bound changes and keeping things robust.
Am on vacation, but no, this is not the right way to solve this issue. Users should not have to tune obscure parameters to avoid crashes due to inflexible implementation choices. Instead, one should arrange for the temp space to be dynamically allocated as needed.
On 22 April 2024 at 14:45:15 GMT+02:00, Rossi Sun wrote:
> cc @westonpace @pitrou
The stack space might be storing pointers to itself, so it can't grow dynamically and preserve referential integrity. The proper fix is to have things not use the temp stack and do their own heap management; but given that we currently have things using this temp stack, allowing size configuration seemed like a requirement caused by the earlier decision of having a thread-local temporary stack in the first place.

I won't merge before @pitrou comes back from vacation and has the time to explain his alternative solution.
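To see why referential integrity blocks in-place growth, consider a contrived sketch (not Arrow's actual TempVectorStack code): any consumer holding a raw pointer into the stack's buffer is broken the moment the buffer reallocates.

```cpp
#include <cstdint>
#include <vector>

int main() {
  // Hypothetical growable "temp stack" backed by a contiguous buffer.
  std::vector<uint8_t> buffer(64);
  uint8_t* slot = buffer.data();  // a consumer keeps a pointer into the stack

  buffer.resize(1 << 20);  // "grow dynamically": this may reallocate

  // `slot` may now dangle: it can still point at the old allocation, so
  // dereferencing it is undefined behavior. A temp stack whose consumers
  // hold raw pointers therefore cannot simply be resized in place.
  (void)slot;
  return 0;
}
```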
There are two approaches we could take with temp vector overflow:
- Modify TempVectorStack::alloc to allow graceful failure, which in turn requires all consumers of it to propagate that failure or retry with a resized temp stack.
- It's somewhat odd that temp vectors are associated with a QueryContext; this means that join nodes within a query are sharing their statically allocated space. Since each join node guarantees that it only allocates minibatch-sized subsets of the temp stack, I think it makes more sense to move temp stacks from QueryContext to SwissJoin, allowing it to manage the memory it uses. (A sketch of this per-node ownership follows below.)
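A minimal sketch of the second option (hypothetical names throughout; this is not Arrow's actual TempVectorStack or SwissJoin API): each node owns one fixed-size bump-allocator stack per thread, sized for its own minibatch needs, instead of borrowing a shared per-thread stack from the query context.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a temp vector stack: a fixed-size bump
// allocator that is reset between mini-batches.
class TempStack {
 public:
  explicit TempStack(int64_t size) : buffer_(size) {}

  // Returns nullptr instead of crashing when the stack is exhausted.
  uint8_t* Alloc(int64_t n) {
    if (top_ + n > static_cast<int64_t>(buffer_.size())) return nullptr;
    uint8_t* p = buffer_.data() + top_;
    top_ += n;
    return p;
  }

  void Reset() { top_ = 0; }

 private:
  std::vector<uint8_t> buffer_;
  int64_t top_ = 0;
};

// Per-node ownership: the join node creates one stack per thread, sized
// for its own minibatch usage, rather than sharing a per-thread stack
// with every other node in the query.
class JoinNode {
 public:
  JoinNode(size_t num_threads, int64_t stack_size) {
    for (size_t i = 0; i < num_threads; ++i) {
      stacks_.push_back(std::make_unique<TempStack>(stack_size));
    }
  }

  TempStack& StackForThread(size_t thread_index) {
    return *stacks_[thread_index];
  }

 private:
  std::vector<std::unique_ptr<TempStack>> stacks_;
};
```

With this layout, total temp space grows linearly with the number of nodes, and each node can size its stacks from its own worst-case minibatch usage rather than a global guess.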
Well, this is exactly what I asked about in a previous email thread [1] on the dev mailing list. The answer back then suggested the opposite, though :(
[1] https://lists.apache.org/thread/mm9667vst0jmk127thhsywjr1hvb614h
Sorry for the confusion! I think per-node thread-local temp stacks are what was eventually intended, but that refactor was never completed.
IIUC, this can be circumvented by using something like a […]
Sorry, I'm still confused. From the discussion in the email thread, it seems the intention at that time was to make the temp stack thread-local and inter-node, i.e., to share the same temp stack instance among all the nodes executed on a specific thread (sometimes referred to as a "pipeline"). The "unfinished refactoring" back then might have been making more kinds of nodes use the shared temp stack rather than their own.

I would +1 the "per-node temp stack" solution, as I think it achieves the original goal of reducing heap allocation overhead while also reducing the possibility of overflow. Compared to the "shared stack", it only introduces a moderate overhead: the memory allocated for each node's stack.

I'm asking the above question just to sort out the whole story. Thanks.
The PR mentions that the max value for this variable should be 64 MB, which seems confusing to me. Why wouldn't we just always use 64 MB? Is the concern that 64 MB per plan is too much RAM overhead?
This is correct. At the time, our (potentially flawed) belief was that temp stack usage was fairly well bounded, and so there didn't seem to be much advantage in every node needing to reinvent the wheel. If the knowledge of how much temp stack space is needed is local to specific nodes (and can vary widely, so there is no reasonable default max), then per-node is correct.

I probably missed this in the earlier fixes, but do we actually know why the temp stack is overflowing? Is there some reason to think that a swiss join node can predict how much temp stack space it needs? Otherwise I'm not sure why moving the temp stack to a per-node basis fixes the root cause.
I believe the intention was to create something similar to malloc/new but faster (by taking advantage of pre-allocated thread-local space).

This may be true, since node A calls node B, which calls node C, etc. However, note that the program stack itself behaves the same way in this situation (stack frame A is followed by stack frame B, followed by stack frame C). If this nesting is somehow unbounded (potentially through async recursion), then you will get a stack overflow at some point regardless of the size of the temp local stack, because you will eventually overflow the real stack (the C program stack).

I think Antoine's suggestion works too (if I understand it correctly), which is simply to allocate a new slab if there isn't space available in the initial slab. The extra slab can be deallocated when the stack shrinks enough that it is no longer needed.
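A minimal sketch of that slab-chaining idea (again hypothetical, assuming a bump-allocator interface like the sketch above): when the current slab fills up, chain a new heap-allocated slab rather than reallocating, so pointers into older slabs remain valid.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical chained-slab stack: grows by appending slabs, never by
// reallocating, so previously returned pointers stay valid.
class ChainedTempStack {
 public:
  explicit ChainedTempStack(int64_t slab_size) : slab_size_(slab_size) {
    slabs_.emplace_back(new uint8_t[slab_size_]);
  }

  uint8_t* Alloc(int64_t n) {
    assert(n > 0 && n <= slab_size_);  // oversized requests not handled here
    if (top_ + n > slab_size_) {
      // Current slab exhausted: chain a fresh one instead of resizing.
      // Older slabs are untouched, so pointers into them stay valid.
      slabs_.emplace_back(new uint8_t[slab_size_]);
      ++current_;
      top_ = 0;
    }
    uint8_t* p = slabs_[current_].get() + top_;
    top_ += n;
    return p;
  }

  // When the logical stack empties, the extra slabs can be released.
  void Reset() {
    slabs_.resize(1);
    current_ = 0;
    top_ = 0;
  }

 private:
  int64_t slab_size_;
  std::vector<std::unique_ptr<uint8_t[]>> slabs_;
  size_t current_ = 0;
  int64_t top_ = 0;
};
```

A real implementation would also need to pop back across slab boundaries as allocations are released; this sketch only shows the growth path and a bulk reset.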
Force-pushed from 8f3c34e to 4821398.
cc @pitrou
```diff
@@ -600,6 +600,7 @@ struct GrouperFastImpl : public Grouper {
   }

   Status Reset() override {
+    ARROW_DCHECK_EQ(temp_stack_.AllocatedSize(), 0);
```
This is not related to the purpose of this PR; rather, it is something left over from #41352 (comment), and too trivial for its own PR.
```diff
  public:
   Status Init(MemoryPool* pool, int64_t size);

+  int64_t AllocatedSize() const { return top_; }
```
This is not related to the purpose of this PR; rather, it is something left over from #41352 (comment), and too trivial for its own PR.
cc @pitrou
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 6c386da. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-41334: [C++][Acero] Use per-node basis temp vector stack to mitigate overflow (apache#41335)

### Rationale for this change

The risk of temp vector stack overflow still exists as described in apache#41334. Many people have agreed on a per-node basis approach:

> 1) it doesn't introduce more performance penalty than shared stack; 2) it can mitigate the overflow in a natural way, i.e., expanding the stack size linear to the number of nodes; 3) it requires no more complexity to the existing stack implementation.

The full (but long) story is also revealed in the subsequent discussion of this PR. Feel free to scroll down.

### What changes are included in this PR?

1. Change the current shared (per-thread) temp vector stack usage to per-node basis.
2. Make the stack size required by each stack user more explicit.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#41334

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…43042)

### Rationale for this change

The current recursion depth of 64 in the many-join test is too aggressive, so a stack overflow (of the C program stack) may happen on Alpine or Emscripten, causing issues like #43040.

### What changes are included in this PR?

Reduce the recursion depth to 16, which is still strong enough for the purpose of #41335, which introduced this test.

### Are these changes tested?

The change is itself a test change.

### Are there any user-facing changes?

None.

* GitHub Issue: #43040

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>