
[C++] Acero hash join ManyJoins test failing on several nightly builds #43040

Closed
jorisvandenbossche opened this issue Jun 25, 2024 · 5 comments

@jorisvandenbossche (Member) commented Jun 25, 2024

The ManyJoins test (part of arrow-acero-hash-join-node-test) added in #41335 seems to be causing nightly failures (cc @zanmato1984).

test-alpine-linux-cpp

Here there is no useful output except for "Segmentation fault", but I assume it comes from HashJoin.ManyJoins because that is the only test without an "OK" (and the last passing build was from the day the PR was merged).

test-ubuntu-22.04-cpp-emscripten

Here there is some log output from the failure:

[ RUN      ] HashJoin.ManyJoins
/build/cpp/debug/arrow-acero-hash-join-node-test.js:138
      throw ex;
      ^

RuntimeError: memory access out of bounds
    at arrow-acero-hash-join-node-test.wasm.dlmalloc (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[31169]:0x2023c95)
    at arrow-acero-hash-join-node-test.wasm.operator new(unsigned long) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[33617]:0x206b5c0)
    at arrow-acero-hash-join-node-test.wasm.std::__2::vector<arrow::compute::ResizableArrayData, std::__2::allocator<arrow::compute::ResizableArrayData>>::__append(unsigned long) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[12664]:0x9e131f)
    at arrow-acero-hash-join-node-test.wasm.std::__2::vector<arrow::compute::ResizableArrayData, std::__2::allocator<arrow::compute::ResizableArrayData>>::resize(unsigned long) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[12663]:0x9e1173)
    at arrow-acero-hash-join-node-test.wasm.arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[12662]:0x9e0dda)
    at arrow-acero-hash-join-node-test.wasm.arrow::acero::JoinResultMaterialize::Append(arrow::compute::ExecBatch const&, int, unsigned short const*, unsigned int const*, unsigned int const*, int*) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[3665]:0x2952d8)
    at invoke_viiiiiiii (/build/cpp/debug/arrow-acero-hash-join-node-test.js:6049:29)
    at arrow-acero-hash-join-node-test.wasm.arrow::acero::JoinProbeProcessor::OnNextBatch(long long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::__2::vector<arrow::compute::KeyColumnArray, std::__2::allocator<arrow::compute::KeyColumnArray>>*) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[3686]:0x29f96b)
    at invoke_viijiii (/build/cpp/debug/arrow-acero-hash-join-node-test.js:6680:29)
    at arrow-acero-hash-join-node-test.wasm.arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) (wasm://wasm/arrow-acero-hash-join-node-test.wasm-0c5ecc4a:wasm-function[3716]:0x2a7bcc)

(cc @joemarshall in case this output gives you any pointers)

@zanmato1984 (Collaborator)

One peculiarity of this test is that it uses more stack space (the C program stack) than usual, somewhat over 2 MB. So far that is my best guess for the cause of this failure.
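
To illustrate the mechanism (a minimal, hypothetical sketch, not Acero's actual code): when a batch is pushed synchronously through a chain of N pipeline nodes, each node's frame and locals stay on the stack while the downstream nodes run, so stack usage grows roughly linearly with the number of chained joins:

```cpp
#include <cstdio>

// Hypothetical model of a chain of pipeline nodes: each node keeps some
// working data in locals and calls its downstream consumer directly, so
// total stack usage is (per-node frame size) * (number of chained nodes).
void ProcessBatch(int remaining_nodes) {
  volatile char scratch[32 * 1024] = {};  // per-node locals; size is illustrative
  if (remaining_nodes > 0) {
    ProcessBatch(remaining_nodes - 1);  // hand the batch to the next node
  }
  (void)scratch[0];  // keep `scratch` alive across the downstream call
}

int main() {
  // 64 chained nodes * 32 KiB of locals each is ~2 MiB of stack, the order
  // of magnitude mentioned above; that fits a typical 8 MiB main-thread
  // stack but can overflow the much smaller default stacks that worker
  // threads get on some platforms (e.g. musl/Alpine) and under Emscripten.
  ProcessBatch(64);
  std::printf("done\n");
  return 0;
}
```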

The surprise is that when I ran this case locally (on my Mac M1) with ASAN enabled, it did report:

==3768==ERROR: AddressSanitizer: stack-overflow on address 0x00016d7ffcc0 (pc 0x0001164330f8 bp 0x00016d8001d0 sp 0x00016d7ffcc0 T3)

Since I don't have an Alpine or Emscripten environment, I'm going to file a PR and run some experiments on the stack size. Is it possible to run the failed tasks on a particular PR? @jorisvandenbossche Any idea? Thank you.

Also, though maybe not directly related to this issue, I'm curious why the existing CI doesn't detect the ASAN failure mentioned above. I'd like to look at some recent ASAN runs in CI or the nightlies. @jorisvandenbossche Do you have any pointers? Thank you again.

@jorisvandenbossche (Member, Author)

> Is it possible to run the failed tasks on a particular PR?

Yes, we can trigger those from a PR (I am not sure whether being a collaborator is sufficient to trigger this or whether you need to be a committer, but in any case I can trigger those builds for you if needed).

> I'm curious why the existing CI doesn't detect this ASAN failure. I'd like to look at some recent ASAN runs in CI or the nightlies. @jorisvandenbossche Do you have any pointers?

I am not very familiar with our ASAN/UBSAN test builds (cc @pitrou).
There is a ubuntu-cpp-sanitizer docker image defined, and it should be run on PRs in the standard CI (see e.g. https://github.com/apache/arrow/actions/runs/9082938790/job/24960515760 from your PR).

@zanmato1984 (Collaborator) commented Jun 25, 2024

Verified that the issue is caused by the many joins taking a large amount of stack space. PR #43042 has been filed to solve this (and includes the experiment results).

The original test was too aggressive (it had 64 joins, whereas 16 is actually enough). Sorry for breaking the nightly builds for such a long time :(
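
For reference, a left-deep chain of N hash joins can be written with Acero's Declaration API roughly as in the sketch below. This is not the actual ManyJoins test code (that lives in the arrow-acero-hash-join-node-test sources); the RunChainedJoins helper, the "key" column, and the choice of a semi join are illustrative assumptions, the semi join being used only so that the output schema stays the same at every level:

```cpp
#include <memory>
#include <vector>

#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/api.h>

// Hypothetical helper: build and run a left-deep chain of `num_joins` hash
// joins of `table` with itself on column "key". Not the real test code.
arrow::Result<std::shared_ptr<arrow::Table>> RunChainedJoins(
    const std::shared_ptr<arrow::Table>& table, int num_joins) {
  using arrow::acero::Declaration;
  using arrow::acero::HashJoinNodeOptions;
  using arrow::acero::TableSourceNodeOptions;

  Declaration chain{"table_source", {},
                    std::make_shared<TableSourceNodeOptions>(table)};
  for (int i = 0; i < num_joins; ++i) {
    Declaration right{"table_source", {},
                      std::make_shared<TableSourceNodeOptions>(table)};
    // A semi join keeps only the left-side columns, so "key" never becomes
    // ambiguous as the chain grows deeper.
    chain = Declaration{"hashjoin",
                        {chain, right},
                        std::make_shared<HashJoinNodeOptions>(
                            arrow::acero::JoinType::LEFT_SEMI,
                            /*left_keys=*/std::vector<arrow::FieldRef>{"key"},
                            /*right_keys=*/std::vector<arrow::FieldRef>{"key"})};
  }
  // Probing such a plan walks the whole chain of join nodes, which is
  // presumably where the per-join stack usage adds up: 64 levels overflowed
  // small stacks, while 16 stays comfortably within them.
  return arrow::acero::DeclarationToTable(std::move(chain));
}
```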

@zanmato1984 (Collaborator)

> Also, though maybe not directly related to this issue, I'm curious why the existing CI doesn't detect the ASAN failure mentioned above. I'd like to look at some recent ASAN runs in CI or the nightlies. @jorisvandenbossche Do you have any pointers? Thank you again.

The Ubuntu ASAN build seems to run fine. I guess the difference might be related to the operating system (mine is macOS). PR #43042 fixes the ASAN error on macOS too, so I'm not going to dig any deeper into this.

pitrou pushed a commit that referenced this issue Jun 26, 2024
### Rationale for this change

The current recursion depth of 64 in the many-join test is too aggressive, so a stack (the C program stack) overflow may happen on Alpine or Emscripten, causing issues like #43040.

### What changes are included in this PR?

Reduce the recursion depth to 16, which is still sufficient for the purpose of #41335, which introduced this test.

### Are these changes tested?

The change is to a test.

### Are there any user-facing changes?

None.

* GitHub Issue: #43040

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou (Member) commented Jun 26, 2024

Issue resolved by pull request #43042.

pitrou closed this as completed Jun 26, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this issue Jul 9, 2024