-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41149: [C++][Acero] Fix asof join race #41614
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
@ursabot please benchmark |
Benchmark runs are scheduled for commit 223951e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 223951e. There were 60 benchmark results indicating a performance regression:
The full Conbench report has more details. |
@ursabot please benchmark |
Benchmark runs are scheduled for commit ade727e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
@ursabot please benchmark |
Benchmark runs are scheduled for commit 606f48c. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
|
Hi @jorisvandenbossche @pitrou @westonpace , would you please help to take a look? This fix can bring python CI peace. Thanks. |
Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit ade727e. There were 62 benchmark results indicating a performance regression:
The full Conbench report has more details. |
Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 606f48c. There were 64 benchmark results indicating a performance regression:
The full Conbench report has more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard to understand why this fixes the issue at all. The comments talk about an "atomic call" but I fail to see what is atomic here. This might even work by chance...
@@ -548,8 +548,10 @@ class InputState { | |||
// true when the queue is empty and, when memo may have future entries (the case of a | |||
// positive tolerance), when the memo is empty. | |||
// used when checking whether RHS is up to date with LHS. | |||
bool CurrentEmpty() const { | |||
return memo_.no_future_ ? Empty() : memo_.times_.empty() && Empty(); | |||
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the | |
// NOTE: The emptiness must be decided by an atomic call to Empty() in caller, due to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what do you call "atomic" here? I don't see any synchronization around calls to Empty
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"single call" might be a better description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to "single call". Thanks for suggesting the new naming.
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the | ||
// potential race with Push(), see GH-41614. | ||
bool CurrentEmpty(bool empty) const { | ||
return memo_.no_future_ ? empty : memo_.times_.empty() && empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add parentheses to make sure operator precedence is not mistaken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
// Advances the RHS as far as possible to be up to date for the current LHS timestamp. | ||
// Returns a pair of booleans. The first is if any RHS has advanced. The second is if | ||
// all RHS are up to date for LHS. | ||
Result<std::pair<bool, bool>> UpdateRhsAndCheckUpToDateWithLhs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you return a struct with explanatory member names instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if I understand correctly you are saying that the value returned by rhs.Empty
could change (from true to false I presume?) between the UpdateRhs
call and the IsUpToDateWithLhsRow
call and this becomes a logic bug?
It might be easier / simpler to wrap some part of ProcessInner
in a mutex that is also wrapped around state modification done by the parts of the code that add data. However, I realize that may have performance impact.
@@ -548,8 +548,10 @@ class InputState { | |||
// true when the queue is empty and, when memo may have future entries (the case of a | |||
// positive tolerance), when the memo is empty. | |||
// used when checking whether RHS is up to date with LHS. | |||
bool CurrentEmpty() const { | |||
return memo_.no_future_ ? Empty() : memo_.times_.empty() && Empty(); | |||
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"single call" might be a better description.
// NOTE: The emptiness must be decided by an atomic all to Empty() in caller, due to the | ||
// potential race with Push(), see GH-41614. | ||
bool CurrentEmpty(bool empty) const { | ||
return memo_.no_future_ ? empty : memo_.times_.empty() && empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the memo_.times_.empty()
value need to be obtained in a single call as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I can't tell for 100% sure. But from my understanding of the code so far, we are good. Unlike the race happening on the queue_
, between one side being Push
and the other being Process
, the memo_
seems only to be manipulated by the Process
thread which is in good sync with itself.
Thanks for looking. And sorry about the bad wording - I should've come up with a better name than "atomic call". But the issue this PR is trying to fix is carefully studied and the fix is not by chance. Maybe I should've put this more explicitly than in the PR description, the issue can be 100% reproduced by injecting special delay as shown in this commit I shall explain it more: As you can see this is a logical race about the right batch's emptiness between |
Thank you Weston. Your presumption is correct. I've explained the issue in more detail in my previous reply to Antoine.
Actually I was trying to fix it using mutex, here is a single-mutex version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this fix once @pitrou is content. Thanks for digging into these tricky issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @zanmato1984 !
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8f27e26. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in apache#40675 and apache#41149. Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens. ### What changes are included in this PR? Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`. ### Are these changes tested? Include the UT to reproduce the issue. ### Are there any user-facing changes? None. **This PR contains a "Critical Fix".** In apache#40675 and apache#41149 , incorrect results are produced. * GitHub Issue: apache#41149 * Also closes apache#40675 Authored-by: Ruoxi Sun <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in apache#40675 and apache#41149. Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens. ### What changes are included in this PR? Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`. ### Are these changes tested? Include the UT to reproduce the issue. ### Are there any user-facing changes? None. **This PR contains a "Critical Fix".** In apache#40675 and apache#41149 , incorrect results are produced. * GitHub Issue: apache#41149 * Also closes apache#40675 Authored-by: Ruoxi Sun <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in #40675 and #41149.
Turns out the root causes are the same - a logical race (as opposed to physical race which can be detected by sanitizers). By injecting special delay in various places in asof join, as shown in zanmato1984@ea3b24c, the issue can be reproduced almost 100%. And I have put some descriptions in that commit to explain how the race happens.
What changes are included in this PR?
Eliminate the logical race of emptiness by combining multiple call-sites of
Empty()
.Are these changes tested?
Include the UT to reproduce the issue.
Are there any user-facing changes?
None.
This PR contains a "Critical Fix".
In #40675 and #41149 , incorrect results are produced.