-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37515: [C++] Remove memory address optimization from ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)
if the ChunkedArray
can have NaN
values
#37579
Conversation
other) to only return early chunked arrays with the same address cannot have NaN values
Could you use "what this PR does" instead of "what problem this PR solves" for PR title? |
ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)
does not treat NaN
values as unequal consistentlyChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)
if the ChunkedArray
can have NaN
values
Just renamed the PR to describe what it does instead of the problem it fixes. |
Could you fix a lint error by running https://github.com/apache/arrow/actions/runs/6091569576/job/16528361395?pr=37579#step:5:7333
|
Will do! |
I'm not sure why the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
The failure on AppVeyor is #37555.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 85ec07e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579) ### Rationale for this change `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results. The program below illustrates this inconsistency: ```c++ #include <arrow/api.h> #include <arrow/type.h> #include <iostream> #include <math.h> #include <sstream> arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() { arrow::NumericBuilder<arrow::DoubleType> builder; std::shared_ptr<arrow::Array> array; ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4})); ARROW_RETURN_NOT_OK(builder.Finish(&array)); return arrow::ChunkedArray::Make({array}); } int main(int argc, char *argv[]) { auto maybe_chunked_array = make_chunked_array(); if (!maybe_chunked_array.ok()) { return -1; } auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe(); auto array = chunked_array->chunk(0); std::stringstream stream; stream << "chunked_array contents: "; stream << "\n\n"; stream << chunked_array->ToString(); stream << "\n\n"; stream << "chunked_array->Equals(chunked_array): "; stream << chunked_array->Equals(chunked_array); stream << "chunked_array->Equals(*chunked_array): "; stream << chunked_array->Equals(*chunked_array); std::cout << stream.str() << std::endl; } ``` Here is the output of this program: ```shell chunked_array contents: [ [ 0, 1, nan, 2, 4 ] ] chunked_array->Equals(chunked_array): 1 chunked_array->Equals(*chunked_array): 0 ``` ### What changes are included in this PR? Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF: - The two share the same address AND - They cannot have `NaN` values If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison. ### Are these changes tested? Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`. ### Are there any user-facing changes? Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values. * Closes: apache#37515 Authored-by: Sarah Gilmore <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579) ### Rationale for this change `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results. The program below illustrates this inconsistency: ```c++ #include <arrow/api.h> #include <arrow/type.h> #include <iostream> #include <math.h> #include <sstream> arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() { arrow::NumericBuilder<arrow::DoubleType> builder; std::shared_ptr<arrow::Array> array; ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4})); ARROW_RETURN_NOT_OK(builder.Finish(&array)); return arrow::ChunkedArray::Make({array}); } int main(int argc, char *argv[]) { auto maybe_chunked_array = make_chunked_array(); if (!maybe_chunked_array.ok()) { return -1; } auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe(); auto array = chunked_array->chunk(0); std::stringstream stream; stream << "chunked_array contents: "; stream << "\n\n"; stream << chunked_array->ToString(); stream << "\n\n"; stream << "chunked_array->Equals(chunked_array): "; stream << chunked_array->Equals(chunked_array); stream << "chunked_array->Equals(*chunked_array): "; stream << chunked_array->Equals(*chunked_array); std::cout << stream.str() << std::endl; } ``` Here is the output of this program: ```shell chunked_array contents: [ [ 0, 1, nan, 2, 4 ] ] chunked_array->Equals(chunked_array): 1 chunked_array->Equals(*chunked_array): 0 ``` ### What changes are included in this PR? Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF: - The two share the same address AND - They cannot have `NaN` values If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison. ### Are these changes tested? Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`. ### Are there any user-facing changes? Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values. * Closes: apache#37515 Authored-by: Sarah Gilmore <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Rationale for this change
ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)
assumes that if the twoChunkedArray
s share the same memory address, then they must be equal. However, this optimization doesn't take into account thatNaN
values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example,ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)
andChunkedArray::Equals(const ChunkedArray& other)
may return different results.The program below illustrates this inconsistency:
Here is the output of this program:
What changes are included in this PR?
Updated
ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)
to only returntrue
early IF:NaN
valuesIf both of those conditions are not satisfied,
ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)
will do the element-by-element comparison.Are these changes tested?
Yes. I added a new test case called
EqualsSameAddressWithNaNs
tochunked_array_test.cc
.Are there any user-facing changes?
Yes.
ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)
may returnfalse
even if the twoChunkedArray
s have the same memory address. This will only occur if theChunkedArray
's containNaN
values.ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
does not treatNaN
values as unequal consistently #37515