Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41149: [C++][Acero] Fix asof join race #41614

Merged
merged 13 commits into from
May 14, 2024

Conversation

zanmato1984
Copy link
Collaborator

@zanmato1984 zanmato1984 commented May 10, 2024

Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in #40675 and #41149.

Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of Empty().

Are these changes tested?

Include the UT to reproduce the issue.

Are there any user-facing changes?

None.

This PR contains a "Critical Fix".
In #40675 and #41149 , incorrect results are produced.

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984
Copy link
Collaborator Author

@ursabot please benchmark

@ursabot
Copy link

ursabot commented May 10, 2024

Benchmark runs are scheduled for commit 223951e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Copy link

Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 223951e.

There were 60 benchmark results indicating a performance regression:

The full Conbench report has more details.

@zanmato1984 zanmato1984 changed the title Test to repro GH-41149: [C++][Acero] Fix asof join race May 10, 2024
@zanmato1984
Copy link
Collaborator Author

@ursabot please benchmark

@zanmato1984 zanmato1984 reopened this May 10, 2024
@ursabot
Copy link

ursabot commented May 10, 2024

Benchmark runs are scheduled for commit ade727e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@zanmato1984 zanmato1984 marked this pull request as draft May 10, 2024 15:59
@zanmato1984
Copy link
Collaborator Author

@ursabot please benchmark

@ursabot
Copy link

ursabot commented May 10, 2024

Benchmark runs are scheduled for commit 606f48c. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Copy link

⚠️ GitHub issue #41149 has been automatically assigned in GitHub to PR creator.

@zanmato1984 zanmato1984 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label May 10, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review May 10, 2024 18:46
@zanmato1984
Copy link
Collaborator Author

Hi @jorisvandenbossche @pitrou @westonpace , would you please help to take a look? This fix can bring python CI peace. Thanks.

Copy link

Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit ade727e.

There were 62 benchmark results indicating a performance regression:

The full Conbench report has more details.

Copy link

Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 606f48c.

There were 64 benchmark results indicating a performance regression:

The full Conbench report has more details.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is hard to understand why this fixes the issue at all. The comments talk about an "atomic call" but I fail to see what is atomic here. This might even work by chance...

@@ -548,8 +548,10 @@ class InputState {
// true when the queue is empty and, when memo may have future entries (the case of a
// positive tolerance), when the memo is empty.
// used when checking whether RHS is up to date with LHS.
bool CurrentEmpty() const {
return memo_.no_future_ ? Empty() : memo_.times_.empty() && Empty();
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the
// NOTE: The emptiness must be decided by an atomic call to Empty() in caller, due to the

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what do you call "atomic" here? I don't see any synchronization around calls to Empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"single call" might be a better description.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "single call". Thanks for suggesting the new naming.

// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the
// potential race with Push(), see GH-41614.
bool CurrentEmpty(bool empty) const {
return memo_.no_future_ ? empty : memo_.times_.empty() && empty;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add parentheses to make sure operator precedence is not mistaken?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

// Advances the RHS as far as possible to be up to date for the current LHS timestamp.
// Returns a pair of booleans. The first is if any RHS has advanced. The second is if
// all RHS are up to date for LHS.
Result<std::pair<bool, bool>> UpdateRhsAndCheckUpToDateWithLhs() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you return a struct with explanatory member names instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

cpp/src/arrow/acero/asof_join_node_test.cc Show resolved Hide resolved
cpp/src/arrow/acero/asof_join_node_test.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 13, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if I understand correctly you are saying that the value returned by rhs.Empty could change (from true to false I presume?) between the UpdateRhs call and the IsUpToDateWithLhsRow call and this becomes a logic bug?

It might be easier / simpler to wrap some part of ProcessInner in a mutex that is also wrapped around state modification done by the parts of the code that add data. However, I realize that may have performance impact.

@@ -548,8 +548,10 @@ class InputState {
// true when the queue is empty and, when memo may have future entries (the case of a
// positive tolerance), when the memo is empty.
// used when checking whether RHS is up to date with LHS.
bool CurrentEmpty() const {
return memo_.no_future_ ? Empty() : memo_.times_.empty() && Empty();
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"single call" might be a better description.

// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the
// potential race with Push(), see GH-41614.
bool CurrentEmpty(bool empty) const {
return memo_.no_future_ ? empty : memo_.times_.empty() && empty;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the memo_.times_.empty() value need to be obtained in a single call as well?

Copy link
Collaborator Author

@zanmato1984 zanmato1984 May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I can't tell for 100% sure. But from my understanding of the code so far, we are good. Unlike the race happening on the queue_, between one side being Push and the other being Process, the memo_ seems only to be manipulated by the Process thread which is in good sync with itself.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting committer review Awaiting committer review awaiting review Awaiting review awaiting changes Awaiting changes labels May 13, 2024
@zanmato1984
Copy link
Collaborator Author

It is hard to understand why this fixes the issue at all. The comments talk about an "atomic call" but I fail to see what is atomic here. This might even work by chance...

Thanks for looking. And sorry about the bad wording - I should've come up with a better name than "atomic call". But the issue this PR is trying to fix is carefully studied and the fix is not by chance. Maybe I should've put this more explicitly than in the PR description, the issue can be 100% reproduced by injecting special delay as shown in this commit
zanmato1984@ea3b24c, so the reason is clear enough.

I shall explain it more:
Suppose the left batch is pushed to asof join node at time 100:
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1398
Then at time 200 the Process thread starts and in turn calls ProcessInner:
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1034
Within ProcessInner, it first advances the right batch (necessary for emitting results) (still time 200, and it sees empty right batch):
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL967
Then at time 300 the right batch is pushed to asof join node:
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL1398
At last, time 400, ProcessInner checks if the right batch is up-to-date with the left batch (if yes, emits result):
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL974
which it should not be the case because the right batch is not advanced at time 200 (because it is not pushed yet). But in IsUpToDateWithLhsRow, it checks for emptiness again:
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL942
and
zanmato1984@ea3b24c#diff-5493b6ae7ea2a4d5cfb581034c076e9c4be7608382168de6d1301ef67b6c01eeL552
And this time, it's not empty (as the right batch is already pushed) and ProcessInner thinks "Great, you're up-to-date and let's emit some results".

As you can see this is a logical race about the right batch's emptiness between Push and ProcessInner. My original attempt to explain the fix is (improperly) describing it as an "atomic call". But as Weston suggested, maybe "single call" is better. And hope this explanation could give you more confidence on this fix :)

@zanmato1984
Copy link
Collaborator Author

zanmato1984 commented May 13, 2024

I think if I understand correctly you are saying that the value returned by rhs.Empty could change (from true to false I presume?) between the UpdateRhs call and the IsUpToDateWithLhsRow call and this becomes a logic bug?

Thank you Weston. Your presumption is correct. I've explained the issue in more detail in my previous reply to Antoine.

It might be easier / simpler to wrap some part of ProcessInner in a mutex that is also wrapped around state modification done by the parts of the code that add data. However, I realize that may have performance impact.

Actually I was trying to fix it using mutex, here is a single-mutex version 223951e (#41614) and a double-mutex version ade727e (#41614), both showing a up to 10% performance regression (disclaimer: it could also be the case that I'm bad at interpreting the ursabot's benchmark report). This forces me to think of a lockless fix (the current one). But if the performance down is acceptable when code complexity is a consideration, I'm totally fine to go back to the mutex solution (maybe the double-mutex one is a fair tradeoff). Any recommendation?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 13, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this fix once @pitrou is content. Thanks for digging into these tricky issues.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels May 14, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @zanmato1984 !

@pitrou pitrou merged commit 8f27e26 into apache:main May 14, 2024
39 of 40 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label May 14, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label May 14, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8f27e26.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in apache#40675 and apache#41149.

Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.

### Are these changes tested?

Include the UT to reproduce the issue.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**
In apache#40675 and apache#41149 , incorrect results are produced.
* GitHub Issue: apache#41149 
* Also closes apache#40675

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request May 29, 2024
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in apache#40675 and apache#41149.

Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.

### Are these changes tested?

Include the UT to reproduce the issue.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**
In apache#40675 and apache#41149 , incorrect results are produced.
* GitHub Issue: apache#41149 
* Also closes apache#40675

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting committer review Awaiting committer review Component: C++ Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Python] Sporadic asof join test failure
4 participants