Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41225: [C#] Slice value buffers when writing sliced list or binary arrays in IPC format #41230

Merged
merged 3 commits into from
Apr 16, 2024

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented Apr 16, 2024

Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

What changes are included in this PR?

Changes ArrowStreamWriter to write only the subset of the values that is needed rather than the full value buffer when writing a ListArray or BinaryArray, and compute shifted value offset buffers.

Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new ArrayComparer to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.

Copy link

⚠️ GitHub issue #41225 has been automatically assigned in GitHub to PR creator.

keyBuilder.Append("0");
valueBuilder.Append(0);

if (Length > 0)
Copy link
Contributor Author

@adamreeve adamreeve Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was needed because otherwise tests that create zero-length arrays would fail when doing a strict comparison, as the values array would be sliced so would not exactly match the input. It seems logical to only add an extra item to the last list if there is actually a list to add to.

The similar change to the ListViewType above wasn't strictly needed but was just made for consistency.

The ListType data generation also had the same behaviour before but I've made that generate more complex lists with nulls and a range of list sizes including empty lists, for more confidence that the code is correct.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 16, 2024
@CurtHagenlocher CurtHagenlocher merged commit 56186b9 into apache:main Apr 16, 2024
10 checks passed
@CurtHagenlocher CurtHagenlocher removed the awaiting committer review Awaiting committer review label Apr 16, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 16, 2024
@adamreeve adamreeve deleted the list_slice_ipc branch April 16, 2024 20:39
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 56186b9.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

paleolimbot added a commit that referenced this pull request Apr 17, 2024
… with non-zero offsets (#41259)

### Rationale for this change

After #41230 , the integration tests are failing on main .

### What changes are included in this PR?

The bit-by-bit comparison branch in the validity bitmap comparison is missing an offset on one side of the comparison. This PR adds that offset back in.

### Are these changes tested?

Via the integration test CI job.

### Are there any user-facing changes?

No
* GitHub Issue: #41258

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
CurtHagenlocher pushed a commit that referenced this pull request Apr 19, 2024
…ero length offsets to IPC format (#41303)

### Rationale for this change

Fixes the integration test failures caused by #41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from #41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: #41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
raulcd pushed a commit that referenced this pull request Apr 29, 2024
… arrays in IPC format (#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: #41225

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
raulcd pushed a commit that referenced this pull request Apr 29, 2024
…ero length offsets to IPC format (#41303)

### Rationale for this change

Fixes the integration test failures caused by #41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from #41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: #41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…uffers with non-zero offsets (apache#41259)

### Rationale for this change

After apache#41230 , the integration tests are failing on main .

### What changes are included in this PR?

The bit-by-bit comparison branch in the validity bitmap comparison is missing an offset on one side of the comparison. This PR adds that offset back in.

### Are these changes tested?

Via the integration test CI job.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41258

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…with zero length offsets to IPC format (apache#41303)

### Rationale for this change

Fixes the integration test failures caused by apache#41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from apache#41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: apache#41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…uffers with non-zero offsets (apache#41259)

### Rationale for this change

After apache#41230 , the integration tests are failing on main .

### What changes are included in this PR?

The bit-by-bit comparison branch in the validity bitmap comparison is missing an offset on one side of the comparison. This PR adds that offset back in.

### Are these changes tested?

Via the integration test CI job.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41258

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…with zero length offsets to IPC format (apache#41303)

### Rationale for this change

Fixes the integration test failures caused by apache#41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from apache#41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: apache#41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…uffers with non-zero offsets (apache#41259)

### Rationale for this change

After apache#41230 , the integration tests are failing on main .

### What changes are included in this PR?

The bit-by-bit comparison branch in the validity bitmap comparison is missing an offset on one side of the comparison. This PR adds that offset back in.

### Are these changes tested?

Via the integration test CI job.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41258

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…with zero length offsets to IPC format (apache#41303)

### Rationale for this change

Fixes the integration test failures caused by apache#41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from apache#41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: apache#41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…uffers with non-zero offsets (apache#41259)

### Rationale for this change

After apache#41230 , the integration tests are failing on main .

### What changes are included in this PR?

The bit-by-bit comparison branch in the validity bitmap comparison is missing an offset on one side of the comparison. This PR adds that offset back in.

### Are these changes tested?

Via the integration test CI job.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41258

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…with zero length offsets to IPC format (apache#41303)

### Rationale for this change

Fixes the integration test failures caused by apache#41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from apache#41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: apache#41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…uffers with non-zero offsets (apache#41259)

### Rationale for this change

After apache#41230 , the integration tests are failing on main .

### What changes are included in this PR?

The bit-by-bit comparison branch in the validity bitmap comparison is missing an offset on one side of the comparison. This PR adds that offset back in.

### Are these changes tested?

Via the integration test CI job.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41258

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…with zero length offsets to IPC format (apache#41303)

### Rationale for this change

Fixes the integration test failures caused by apache#41230

### What changes are included in this PR?

Only try to access the offset values if the array length is non-zero when writing list and binary arrays to IPC format.

### Are these changes tested?

Yes, I've manually run the integration tests with C# and Java to verify they pass (when also including the changes from apache#41264), and also added new unit tests for this.

### Are there any user-facing changes?

This may also be a bug that affects users but it isn't in a released version.
* GitHub Issue: apache#41302

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants