GH-37515: [C++] Remove memory address optimization from ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other) if the ChunkedArray can have NaN values #37579

Merged: 8 commits into apache:main, Sep 7, 2023

Conversation

sgilmore10
Member

@sgilmore10 sgilmore10 commented Sep 5, 2023

Rationale for this change

ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) assumes that if the two ChunkedArrays share the same memory address, then they must be equal. However, this optimization doesn't take into account that NaN values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) and ChunkedArray::Equals(const ChunkedArray& other) may return different results.

The program below illustrates this inconsistency:

#include <arrow/api.h>
#include <arrow/type.h>

#include <iostream>
#include <math.h>
#include <sstream>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() {
    arrow::NumericBuilder<arrow::DoubleType> builder;
    
    std::shared_ptr<arrow::Array> array;
    ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4}));
    ARROW_RETURN_NOT_OK(builder.Finish(&array));
    
    return arrow::ChunkedArray::Make({array});
}

int main(int argc, char *argv[])
{
    auto maybe_chunked_array = make_chunked_array();
    if (!maybe_chunked_array.ok()) {
        return -1;
    }
    auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe();
    auto array = chunked_array->chunk(0);
    
    std::stringstream stream;

    stream << "chunked_array contents: ";
    stream << "\n\n";
    stream << chunked_array->ToString();
    stream << "\n\n";
    stream << "chunked_array->Equals(chunked_array): ";
    stream << chunked_array->Equals(chunked_array);
    stream << "\n";
    stream << "chunked_array->Equals(*chunked_array): ";
    stream << chunked_array->Equals(*chunked_array);
    
    std::cout << stream.str() << std::endl;
}

Here is the output of this program:

chunked_array contents: 

[
  [
    0,
    1,
    nan,
    2,
    4
  ]
]

chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0
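
The root cause can be reproduced without Arrow. The following is a minimal sketch, not Arrow's implementation: `Values`, `ElementwiseEquals`, `ShortcutEquals`, and `RunExamples` are hypothetical names, and a plain `std::vector<double>` stands in for a `ChunkedArray`. It shows how an identity shortcut in a pointer overload can disagree with an element-by-element comparison once a NaN is present, because IEEE 754 defines `NaN != NaN`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a chunked array: a flat vector of doubles.
using Values = std::vector<double>;

// Element-by-element comparison. NaN compares unequal to everything,
// including itself, so any NaN makes this return false.
bool ElementwiseEquals(const Values& left, const Values& right) {
  if (left.size() != right.size()) return false;
  for (std::size_t i = 0; i < left.size(); ++i) {
    if (left[i] != right[i]) return false;
  }
  return true;
}

// Pointer overload with an identity shortcut, mirroring the behavior
// described above: same address means "equal", no elements inspected.
bool ShortcutEquals(const Values& self, const std::shared_ptr<Values>& other) {
  if (!other) return false;
  if (&self == other.get()) return true;
  return ElementwiseEquals(self, *other);
}

// Usage example: the two code paths disagree for data containing NaN.
void RunExamples() {
  auto values =
      std::make_shared<Values>(Values{0.0, 1.0, std::nan(""), 2.0, 4.0});
  assert(ShortcutEquals(*values, values));       // identity shortcut taken
  assert(!ElementwiseEquals(*values, *values));  // NaN != NaN
}
```

The two overloads only agree when the data contains no NaN, which is why the shortcut is safe for types that cannot hold NaN values at all.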

What changes are included in this PR?

Updated ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) to only return true early IF:

  • The two share the same address AND
  • They cannot have NaN values

If either of those conditions is not satisfied, ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) falls back to the element-by-element comparison.
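
The two conditions above can be sketched as follows. This is an illustration under stated assumptions rather than the code merged in this PR: `TypeInfo`, `Column`, `MayHaveNaN`, `Equals`, and `RunExamples` are hypothetical names, and the rule used here for "cannot have NaN values" (a type is NaN-capable if it is floating point or if any nested field is) is an assumption about how such a check might be made.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, simplified type descriptor (not Arrow's DataType).
struct TypeInfo {
  bool is_floating;                // leaf type that can store NaN
  std::vector<TypeInfo> children;  // fields of a nested type, if any
};

// Assumed rule: a type may hold NaN if it is floating point,
// or if any of its nested fields may hold NaN.
bool MayHaveNaN(const TypeInfo& type) {
  if (type.children.empty()) return type.is_floating;
  for (const auto& child : type.children) {
    if (MayHaveNaN(child)) return true;
  }
  return false;
}

// Hypothetical stand-in for a chunked array; values are kept as doubles
// regardless of the declared type to keep the sketch short.
struct Column {
  TypeInfo type;
  std::vector<double> values;
};

bool ElementwiseEquals(const Column& left, const Column& right) {
  if (left.values.size() != right.values.size()) return false;
  for (std::size_t i = 0; i < left.values.size(); ++i) {
    if (left.values[i] != right.values[i]) return false;  // NaN != NaN
  }
  return true;
}

// Return true early only if both conditions hold: same address AND the
// type cannot hold NaN. Otherwise compare element by element.
bool Equals(const Column& self, const std::shared_ptr<Column>& other) {
  if (!other) return false;
  if (&self == other.get() && !MayHaveNaN(self.type)) return true;
  return ElementwiseEquals(self, *other);
}

// Usage example covering both branches.
void RunExamples() {
  auto ints = std::make_shared<Column>(
      Column{TypeInfo{false, {}}, {0.0, 1.0, 2.0}});
  assert(Equals(*ints, ints));  // shortcut is safe: no NaN possible

  auto doubles = std::make_shared<Column>(
      Column{TypeInfo{true, {}}, {0.0, std::nan("")}});
  assert(!Equals(*doubles, doubles));  // full comparison sees the NaN
}
```

With this shape, the pointer overload gives the same answer as a plain element-by-element comparison, and the shortcut is kept only where it cannot change the result.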

Are these changes tested?

Yes. I added a new test case called EqualsSameAddressWithNaNs to chunked_array_test.cc.

Are there any user-facing changes?

Yes. ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other) may now return false even if the two ChunkedArrays have the same memory address. This will only occur if the ChunkedArrays contain NaN values.

@kou
Member

kou commented Sep 6, 2023

Could you use "what this PR does" instead of "what problem this PR solves" for PR title?

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Sep 6, 2023
@sgilmore10 sgilmore10 changed the title from "GH-37515: [C++] In some cases ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other) does not treat NaN values as unequal consistently" to "GH-37515: [C++] Remove memory address optimization from ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other) if the ChunkedArray can have NaN values" Sep 6, 2023
@sgilmore10
Member Author

Could you use "what this PR does" instead of "what problem this PR solves" for PR title?

Just renamed the PR to describe what it does instead of the problem it fixes.

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Sep 6, 2023
@kou
Member

kou commented Sep 6, 2023

Could you fix a lint error by running ninja format?

https://github.com/apache/arrow/actions/runs/6091569576/job/16528361395?pr=37579#step:5:7333

--- /arrow/cpp/src/arrow/chunked_array.cc
+++ /arrow/cpp/src/arrow/chunked_array.cc (after clang format)
@@ -127,7 +127,7 @@
   return false;
 }
 
-} //  namespace
+}  //  namespace
 
 bool ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) const {
   if (!other) {

@sgilmore10
Member Author

Will do!

@sgilmore10
Member Author

I'm not sure why the continuous-integration/appveyor/pr workflow failed. I don't think it has anything to do with my changes.

Member

@kou kou left a comment

+1

The failure on AppVeyor is #37555.

@kou kou merged commit 85ec07e into apache:main Sep 7, 2023
34 checks passed
@kou kou removed the "awaiting change review" label Sep 7, 2023
@github-actions github-actions bot added the "awaiting merge" label Sep 7, 2023
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 85ec07e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579)

* Closes: apache#37515

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579)
