GH-41334: [C++][Acero] Use per-node basis temp vector stack to mitigate overflow #41335

Merged (20 commits) on May 14, 2024

Conversation

@zanmato1984 (Collaborator) commented Apr 22, 2024

Rationale for this change

The risk of temp vector stack overflow still exists as described in #41334. Many people have agreed on a per-node approach:

  1. It doesn't introduce more performance penalty than the shared stack.
  2. It mitigates the overflow in a natural way, i.e., the total stack size expands linearly with the number of nodes.
  3. It adds no complexity to the existing stack implementation.

The full (but long) story unfolds in the subsequent discussion of this PR. Feel free to scroll down.

What changes are included in this PR?

  1. Change the current shared (per-thread) temp vector stack usage to a per-node basis (see the sketch below).
  2. Make the stack size required by each stack user more explicit.
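
To make the shape of change 1 concrete, here is a minimal, self-contained sketch. It is not Arrow's actual code: all names are invented stand-ins, and the real TempVectorStack API may differ. The point is only that each node owns and sizes its own stack(s) instead of borrowing a single shared, per-thread stack from the query context.

```cpp
// Sketch only -- not Arrow's implementation. All names are invented.
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for a fixed-size temp vector stack.
class StandInTempStack {
 public:
  explicit StandInTempStack(int64_t size) : buffer_(static_cast<size_t>(size)) {}
  int64_t capacity() const { return static_cast<int64_t>(buffer_.size()); }

 private:
  std::vector<uint8_t> buffer_;
};

// Per-node ownership: the node allocates one stack per thread, sized from its
// own, explicit requirement, rather than sharing a query-wide per-thread stack
// with every other node in the plan.
class StandInJoinNode {
 public:
  StandInJoinNode(int64_t stack_size_per_thread, int num_threads) {
    stacks_.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i) {
      stacks_.push_back(std::make_unique<StandInTempStack>(stack_size_per_thread));
    }
  }

  StandInTempStack& stack_for_thread(int i) { return *stacks_[i]; }

 private:
  std::vector<std::unique_ptr<StandInTempStack>> stacks_;
};
```

With this layout the total temp-stack memory grows linearly with the number of nodes, which is the trade-off discussed in the conversation below.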

Are these changes tested?

UT included.

Are there any user-facing changes?

None.

@zanmato1984 (Collaborator, Author) commented:

cc @westonpace @pitrou

@felipecrv (Contributor) left a comment

This looks good to me. I will merge soon unless @westonpace or @bkietz has extra comments.

Thank you for doing the lower/upper bound changes and keeping things robust.

@pitrou (Member) commented Apr 27, 2024 via email

@felipecrv (Contributor) commented:

The stack space might be storing pointers to itself, so it can't grow dynamically and preserve referential integrity.

The proper fix is for consumers to stop using the temp stack and manage their own heap allocations. But given that things currently do use this temp stack, allowing the size to be configured seemed like a requirement that follows from the earlier decision to have a thread-local temporary stack in the first place.

I won't merge before @pitrou comes back from vacation and has the time to explain his alternative solution.

@bkietz (Member) left a comment

There are two approaches we could take with temp vector overflow:

  • Modify TempVectorStack::alloc to allow graceful failure, which in turn requires all consumers of it to propagate that failure or retry with a resized temp stack (a rough sketch of this option follows the list).
  • It's somewhat odd that temp vectors are associated with a QueryContext; this means that join nodes within a query are sharing their statically allocated space. Since each join node guarantees that it only allocates minibatch-sized subsets of the temp stack, I think it makes more sense to move temp stacks from QueryContext to SwissJoin, allowing it to manage the memory it uses.
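
As a rough illustration of the first option only (an assumption about its shape, not Arrow's actual TempVectorStack API), a fallible allocation might look like the following, with every caller forced to propagate the failure:

```cpp
// Illustration of the "graceful failure" option. Not Arrow code; the status
// type and all names here are simplified stand-ins.
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct SimpleStatus {
  bool ok = true;
  std::string message;
  static SimpleStatus OK() { return {}; }
  static SimpleStatus CapacityError(std::string msg) { return {false, std::move(msg)}; }
};

class FallibleTempStack {
 public:
  explicit FallibleTempStack(int64_t size) : buffer_(static_cast<size_t>(size)) {}

  // Report exhaustion instead of asserting, so the caller can retry with a
  // larger stack or fall back to the heap.
  SimpleStatus Alloc(int64_t num_bytes, uint8_t** out) {
    if (top_ + num_bytes > static_cast<int64_t>(buffer_.size())) {
      return SimpleStatus::CapacityError("temp stack exhausted");
    }
    *out = buffer_.data() + top_;
    top_ += num_bytes;
    return SimpleStatus::OK();
  }

  void Release(int64_t num_bytes) { top_ -= num_bytes; }

 private:
  std::vector<uint8_t> buffer_;
  int64_t top_ = 0;
};

// The cost of this option: every consumer of the stack becomes fallible too.
SimpleStatus DoMiniBatchWork(FallibleTempStack& stack) {
  uint8_t* scratch = nullptr;
  SimpleStatus st = stack.Alloc(1024, &scratch);
  if (!st.ok) return st;  // propagate rather than overflow
  // ... use scratch for one mini-batch ...
  stack.Release(1024);
  return SimpleStatus::OK();
}
```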

@zanmato1984 (Collaborator, Author) commented Apr 30, 2024

  • It's somewhat odd that temp vectors are associated with a QueryContext; this means that join nodes within a query are sharing their statically allocated space. Since each join node guarantees that it only allocates minibatch-sized subsets of the temp stack, I think it makes more sense to move temp stacks from QueryContext to SwissJoin, allowing it to manage the memory it uses.

Well, this is exactly what I asked in a previous email thread [1] on the dev mailing list. The answer back then suggested the opposite, though :(

[1] https://lists.apache.org/thread/mm9667vst0jmk127thhsywjr1hvb614h

@bkietz (Member) commented Apr 30, 2024

Sorry for the confusion! I think per-node thread local temp stacks are what was eventually intended but that refactor was not completed.

@pitrou (Member) commented Apr 30, 2024

The stack space might be storing pointers to itself, so it can't grow dynamically and preserve referential integrity.

IIUC, this can be circumvented by using something like a std::list<std::array<uint8_t, N>> where N is a fixed per-chunk size.
(It could also be a std::deque<std::string> if you ensure that each individual string is never resized.)
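
A sketch of that idea (an interpretation of the suggestion, not existing Arrow code): backing the stack with fixed-size chunks lets it grow on demand while every pointer already handed out stays valid, because chunks are appended rather than reallocated.

```cpp
// Sketch of a chunk-list-backed temp stack. Simplified: it only shows growth;
// a real implementation would also track allocation state so space can be
// released and extra chunks dropped when the stack shrinks.
#include <array>
#include <cassert>
#include <cstdint>
#include <list>

template <std::size_t kChunkSize = 64 * 1024>
class ChunkedTempStack {
 public:
  uint8_t* Alloc(std::size_t num_bytes) {
    assert(num_bytes <= kChunkSize);  // one allocation must fit in one chunk
    if (chunks_.empty() || top_ + num_bytes > kChunkSize) {
      chunks_.emplace_back();  // grow by appending; existing chunks never move
      top_ = 0;
    }
    uint8_t* out = chunks_.back().data() + top_;
    top_ += num_bytes;
    return out;
  }

 private:
  std::list<std::array<uint8_t, kChunkSize>> chunks_;  // fixed-size chunks
  std::size_t top_ = 0;  // offset into the most recent chunk
};
```

Because std::list never relocates existing elements when new ones are appended, growing this way preserves the referential integrity that a plain resizable buffer would break.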

@zanmato1984 (Collaborator, Author) commented:

Sorry for the confusion! I think per-node thread local temp stacks are what was eventually intended but that refactor was not completed.

Sorry, I'm still confused. From the discussion in the email thread, it seems the intention at that time was to make the temp stack thread-local and inter-node, i.e., to share the same temp stack instance among all the nodes being executed in a specific thread (sometimes referred to as a "pipeline"). The "unfinished refactoring" back then might have been making more kinds of nodes use the shared temp stack rather than their own.

I would +1 the "per-node temp stack" solution, as I think it achieves the original goal of reducing heap allocation overhead while also reducing the possibility of overflow. Compared to the shared stack, it only introduces a moderate overhead: the memory allocated for each node's stack. I'm asking the above question just to sort out the whole story. Thanks.

@westonpace (Member) commented May 1, 2024

The PR mentions that the max value for this variable should be 64MB. That seems confusing to me. Why wouldn't we just always use 64MB? Is there concern that 64MB per plan is too much RAM overhead?

Sorry, I'm still confused. From the discussion in the email thread, it seems the intention at that time was to make the temp stack thread-local and inter-node, i.e., to share the same temp stack instance among all the nodes being executed in a specific thread (sometimes referred to as a "pipeline"). The "unfinished refactoring" back then might have been making more kinds of nodes use the shared temp stack rather than their own.

This is correct. At the time our (potentially flawed) belief was that the temp stack usage was fairly well bounded and so there didn't seem to be much advantage in every node needing to reinvent the wheel. If the knowledge of how much temp stack space is needed is local to specific nodes (and can vary widely so there is no reasonable default max) then per-node is correct.

I probably missed this in the earlier fixes, but do we actually know why the temp stack is overflowing? Is there some reason to think that a Swiss join node can predict how much temp stack space it needs? Otherwise I'm not sure why moving the temp stack to a per-node basis fixes the root cause.

It's somewhat odd that temp vectors are associated with a QueryContext

I believe the intention was to create something similar to malloc/new but faster (by taking advantage of pre-allocated thread local space).
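
For context, a minimal sketch of that idea (invented names, not the actual implementation): allocation is just a pointer bump over a buffer pre-allocated once per thread, which is what lets it beat malloc/new for short-lived, mini-batch-sized temp vectors.

```cpp
// Sketch of a thread-local bump allocator. Invented names; real code would
// report or mitigate overflow rather than merely assert.
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

constexpr std::size_t kStackSize = 512 * 1024;  // fixed size, shared by everything on the thread

struct Scratch {
  std::vector<uint8_t> buffer = std::vector<uint8_t>(kStackSize);
  std::size_t top = 0;
};

inline thread_local Scratch scratch;

inline uint8_t* TempAlloc(std::size_t num_bytes) {
  // This check is exactly where the GH-41334 overflow risk lives.
  assert(scratch.top + num_bytes <= kStackSize);
  uint8_t* out = scratch.buffer.data() + scratch.top;
  scratch.top += num_bytes;  // no system allocator call on the hot path
  return out;
}

inline void TempRelease(std::size_t num_bytes) {
  assert(num_bytes <= scratch.top);
  scratch.top -= num_bytes;  // LIFO release, like popping a stack frame
}

}  // namespace sketch
```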

this means that join nodes within a query are sharing their statically allocated space

This may be true since node A calls node B calls node C, etc. However, note that the stack itself would behave in the same way in this situation (stack frame A is followed by stack frame B followed by stack frame C).

If this nesting is somehow unbounded (potentially through async recursion), then you will get a stack overflow at some point, regardless of the size of the local temp stack, because you will eventually overflow the real stack (the C program stack).

I think Antoine's suggestion works too (if I understand it correctly), which is simply to allocate a new slab if there isn't space available in the initial slab. The extra slab can be deallocated when the stack shrinks enough that it is no longer needed.

@zanmato1984 (Collaborator, Author) left a comment

cc @pitrou

@@ -600,6 +600,7 @@ struct GrouperFastImpl : public Grouper {
   }

   Status Reset() override {
+    ARROW_DCHECK_EQ(temp_stack_.AllocatedSize(), 0);
@zanmato1984 (Collaborator, Author) commented:

This is not related to the purpose of this PR; it is something left over from #41352 (comment) and too trivial to warrant a separate PR.

 public:
  Status Init(MemoryPool* pool, int64_t size);

  int64_t AllocatedSize() const { return top_; }
@zanmato1984 (Collaborator, Author) commented:

This is not related to the purpose of this PR; it is something left over from #41352 (comment) and too trivial to warrant a separate PR.

@zanmato1984 (Collaborator, Author) left a comment

cc @pitrou

@pitrou merged commit 6c386da into apache:main on May 14, 2024 (38 of 41 checks passed)

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 6c386da.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request May 29, 2024
pitrou pushed a commit that referenced this pull request Jun 26, 2024
### Rationale for this change

The current recursion depth of 64 in the many-join test is too aggressive, so a stack overflow (of the C program stack) may happen on Alpine or Emscripten, causing issues like #43040.

### What changes are included in this PR?

Reduce the recursion depth to 16, which is still strong enough for the purpose of #41335, which introduced this test.

### Are these changes tested?

The change is itself a test.

### Are there any user-facing changes?

None.

* GitHub Issue: #43040

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024