-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43949: [C++] io::BufferedInput: Fix invalid state after SetBufferSize #44387
Conversation
|
@pitrou @felipecrv @wgtmac This might be a bit critical fixing Also cc @biljazovic sorry for so slow replying |
cpp/src/arrow/io/buffered.cc
Outdated
new_buffer_size = | ||
std::min(new_buffer_size, buffer_size_ + (raw_read_bound_ - raw_read_total_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new_buffer_size = | |
std::min(new_buffer_size, buffer_size_ + (raw_read_bound_ - raw_read_total_)); | |
new_buffer_size = | |
std::min(new_buffer_size, buffer_pos_ + buffer_size_ + (raw_read_bound_ - raw_read_total_)); |
Isn't buffer_pos_
required to be kept as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| buffer_size_ |
| buffer_pos_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad! I have confused bytes_buffered_
with buffer_size_
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but you're right, I've a bug here. buffer_size_
might greater than buffer_pos_ + bytes_buffered_
.
Are you calling |
No, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I have left a comment with a confusing comment. Thanks!
@@ -111,6 +111,7 @@ class ARROW_EXPORT BufferedInputStream | |||
int64_t raw_read_bound = -1); | |||
|
|||
/// \brief Resize internal read buffer; calls to Read(...) will read at least | |||
/// this many bytes from the raw InputStream if possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what's the intent of at least
in the previous comment since it is incomplete. I suppose it should be will read at most new_buffer_size
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both ok to me
@pitrou would you mind take a look when you have spare time? Thanks! |
I'll merge it this week if no negative comment since it's a bugfix. Some comment can be address after bug is fixed |
cpp/src/arrow/io/buffered.cc
Outdated
@@ -311,7 +326,7 @@ class BufferedInputStream::Impl : public BufferedBase { | |||
} | |||
|
|||
// Increase the buffer size if needed. | |||
if (nbytes > buffer_->size() - buffer_pos_) { | |||
if (nbytes > buffer_size_ - buffer_pos_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the DCHECK below be changed as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll resume buffer_->size()
but I think they're same in BufferedBase
Status SetBufferSize(int64_t new_buffer_size) { | ||
if (new_buffer_size <= 0) { | ||
return Status::Invalid("Buffer size should be positive"); | ||
} | ||
if ((buffer_pos_ + bytes_buffered_) >= new_buffer_size) { | ||
return Status::Invalid("Cannot shrink read buffer if buffered data remains"); | ||
return Status::Invalid( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this actually occur using the public APIs? If yes, can we add a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes actually Peek test tests this. Previously, PeekPastBufferedBytes
will raise error here.
Co-authored-by: Antoine Pitrou <[email protected]>
By the way, is this rebased on git main already? |
Previously not but rebased now |
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 7d5a818. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
See #43949
The problem is
Peek
andRead
both callsSetBufferSize
, however:Read
implicit says that, whenSetBufferSize
or read, the previous buffer is not being required. In this scenerio,bytes_buffered_
is always 0, sinceRead
will consume all data. Andnew_buffer_size == std::min(new_buffer_size, (raw_read_bound_ - raw_read_total_))
Peek
still requires the buffer-size here. So,bytes_buffered_
might not 0. WhenPeek
, thenew_buffer_size
would less than expected, which shrinks the bufferWhat changes are included in this PR?
Update the Logic of
SetBufferSize
bytes_buffered_ == 0
,SetBufferSize
can discard the current bufferSetBufferSize
should resize minimal tobuffer_size_ + (raw_read_bound_ - raw_read_total_)
, since it should considering the current bufferAre these changes tested?
Yes
Are there any user-facing changes?
Bugfix