Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43949: [C++] io::BufferedInput: Fix invalid state after SetBufferSize #44387

Merged
merged 5 commits into from
Oct 24, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Oct 12, 2024

Rationale for this change

See #43949

The problem is Peek and Read both calls SetBufferSize, however:

  1. Read implicit says that, when SetBufferSize or read, the previous buffer is not being required. In this scenerio, bytes_buffered_ is always 0, since Read will consume all data. And new_buffer_size == std::min(new_buffer_size, (raw_read_bound_ - raw_read_total_))
  2. Peek still requires the buffer-size here. So, bytes_buffered_ might not 0. When Peek, the new_buffer_size would less than expected, which shrinks the buffer

What changes are included in this PR?

Update the Logic of SetBufferSize

  1. If bytes_buffered_ == 0, SetBufferSize can discard the current buffer
  2. Otherwise, SetBufferSize should resize minimal to buffer_size_ + (raw_read_bound_ - raw_read_total_), since it should considering the current buffer

Are these changes tested?

Yes

Are there any user-facing changes?

Bugfix

Copy link

⚠️ GitHub issue #43949 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Oct 12, 2024

@pitrou @felipecrv @wgtmac This might be a bit critical fixing

Also cc @biljazovic sorry for so slow replying

@github-actions github-actions bot added the awaiting review Awaiting review label Oct 12, 2024
cpp/src/arrow/io/buffered.cc Outdated Show resolved Hide resolved
cpp/src/arrow/io/buffered.cc Outdated Show resolved Hide resolved
cpp/src/arrow/io/buffered.cc Outdated Show resolved Hide resolved
Comment on lines 309 to 310
new_buffer_size =
std::min(new_buffer_size, buffer_size_ + (raw_read_bound_ - raw_read_total_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
new_buffer_size =
std::min(new_buffer_size, buffer_size_ + (raw_read_bound_ - raw_read_total_));
new_buffer_size =
std::min(new_buffer_size, buffer_pos_ + buffer_size_ + (raw_read_bound_ - raw_read_total_));

Isn't buffer_pos_ required to be kept as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|     buffer_size_            |
|    buffer_pos_  |

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad! I have confused bytes_buffered_ with buffer_size_.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you're right, I've a bug here. buffer_size_ might greater than buffer_pos_ + bytes_buffered_.

cpp/src/arrow/io/buffered.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 12, 2024
@pitrou
Copy link
Member

pitrou commented Oct 15, 2024

Are you calling SetBufferSize from your own code? I think it was a mistake to make this API public, it should only be used for initializing the buffered stream.

@mapleFU
Copy link
Member Author

mapleFU commented Oct 15, 2024

Are you calling SetBufferSize from your own code? I think it was a mistake to make this API public, it should only be used for initializing the buffered stream.

No, Peak will call SetBufferSize to adjust the size if possible. But call SetBufferSize explicitly will also cause the bug

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I have left a comment with a confusing comment. Thanks!

@@ -111,6 +111,7 @@ class ARROW_EXPORT BufferedInputStream
int64_t raw_read_bound = -1);

/// \brief Resize internal read buffer; calls to Read(...) will read at least
/// this many bytes from the raw InputStream if possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what's the intent of at least in the previous comment since it is incomplete. I suppose it should be will read at most new_buffer_size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both ok to me

@mapleFU
Copy link
Member Author

mapleFU commented Oct 16, 2024

@pitrou would you mind take a look when you have spare time? Thanks!

@mapleFU mapleFU requested a review from pitrou October 21, 2024 11:04
@mapleFU
Copy link
Member Author

mapleFU commented Oct 24, 2024

I'll merge it this week if no negative comment since it's a bugfix. Some comment can be address after bug is fixed

@@ -311,7 +326,7 @@ class BufferedInputStream::Impl : public BufferedBase {
}

// Increase the buffer size if needed.
if (nbytes > buffer_->size() - buffer_pos_) {
if (nbytes > buffer_size_ - buffer_pos_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the DCHECK below be changed as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll resume buffer_->size() but I think they're same in BufferedBase

Status SetBufferSize(int64_t new_buffer_size) {
if (new_buffer_size <= 0) {
return Status::Invalid("Buffer size should be positive");
}
if ((buffer_pos_ + bytes_buffered_) >= new_buffer_size) {
return Status::Invalid("Cannot shrink read buffer if buffered data remains");
return Status::Invalid(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this actually occur using the public APIs? If yes, can we add a test?

Copy link
Member Author

@mapleFU mapleFU Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes actually Peek test tests this. Previously, PeekPastBufferedBytes will raise error here.

@pitrou
Copy link
Member

pitrou commented Oct 24, 2024

By the way, is this rebased on git main already?

@mapleFU
Copy link
Member Author

mapleFU commented Oct 24, 2024

By the way, is this rebased on git main already?

Previously not but rebased now

@pitrou pitrou changed the title GH-43949: [C++] io::BufferedInput: Fixing the invalid state with SetBufferSize GH-43949: [C++] io::BufferedInput: Fix invalid state after SetBufferSize Oct 24, 2024
@pitrou pitrou merged commit 7d5a818 into apache:main Oct 24, 2024
41 checks passed
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 7d5a818.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants