Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33484: [C++][Compute] Implement Grouper::Reset #41352

Merged
merged 6 commits into from
May 14, 2024

Conversation

zanmato1984
Copy link
Collaborator

@zanmato1984 zanmato1984 commented Apr 23, 2024

Rationale for this change

Recently I've been working on some improvement for Grouper and I found adding Reset function could be beneficial. Then I trace down to #33484 from a TODO in code. Here comes this PR.

What changes are included in this PR?

Add Reset function for all the concrete Grouper implementations, and eliminate the recreation of Grouper in AnyKeysSegmenter.

Also add more RowSegmenter cases covering AnyKeysSegmenter.

Are these changes tested?

Yes. Legacy UTs should cover it well. Also added some new UTs.

Are there any user-facing changes?

None.

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984 zanmato1984 changed the title Implement grouper reset GH-33484: [C++][Compute] Implement Grouper::Reset Apr 23, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review April 23, 2024 17:13
@zanmato1984
Copy link
Collaborator Author

cc @westonpace

Copy link

⚠️ GitHub issue #33484 has been automatically assigned in GitHub to PR creator.

rows_minibatch_.Clean();
map_.cleanup();
RETURN_NOT_OK(map_.init(encode_ctx_.hardware_flags, ctx_->memory_pool()));
return Status::OK();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should enlarge the size of the temp_stack_? Or the stack's internal states like top_ may no enough for the next bunch of operations after grouper reset.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the temp vectors allocated by the temp_stack_ would be released at the end the of the call to each individual grouper member function (IMO that's why they are designed to be "temp vector"). Meaning by the time of Reset, the stack is empty already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i see, thanks.
TempVectorStack will hold a resizable buffer allocated in TempVectorStack::Init function until the end of the grouper object.
But the TempVectorHolder will release the used space (already allocated in TempVectorStack ) at the call to each of individual grouper member function as you said.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a DCHECK that the temp_stack_ is empty, then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's probably reasonable. But considering current temp stack doesn't have a public method for size/empty check, and I have other open PRs for temp stack restructure, I'd add the necessary methods after other PRs are done and get this one rebased.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 26, 2024
@zanmato1984
Copy link
Collaborator Author

Hi @pitrou , would you please help to take a look? Thanks.

cpp/src/arrow/compute/row/grouper.h Outdated Show resolved Hide resolved
@@ -223,12 +223,11 @@ struct AnyKeysSegmenter : public BaseRowSegmenter {

AnyKeysSegmenter(const std::vector<TypeHolder>& key_types, ExecContext* ctx)
: BaseRowSegmenter(key_types),
ctx_(ctx),
grouper_(nullptr),
grouper_(Grouper::Make(key_types, ctx).ValueOrDie()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reasonable possibility that this might fail, for example due to input types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without extra context, there is. But the factory method (cleverly) avoids this by a dummy instantiation of Grouper:

ARROW_RETURN_NOT_OK(Grouper::Make(key_types, ctx)); // check types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't it pointless to construct the Grouper twice? Can't the constructor simply take the Grouper that was constructed in the factory method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you are saying. Will update. Thanks for pointing this out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

map_.clear();
offsets_.clear();
key_bytes_.clear();
num_groups_ = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about encoders_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I saw in https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/row_encoder_internal.h, none of the encoders holds states that would change during the lifespan of the grouper.

rows_minibatch_.Clean();
map_.cleanup();
RETURN_NOT_OK(map_.init(encode_ctx_.hardware_flags, ctx_->memory_pool()));
return Status::OK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a DCHECK that the temp_stack_ is empty, then?

cpp/src/arrow/compute/row/grouper.cc Show resolved Hide resolved
// Assert next is the last (empty) segment.
ASSERT_OK_AND_ASSIGN(auto segment, segmenter->GetNextSegment(batch, offset));
ASSERT_TRUE(segment.offset >= batch.length);
ASSERT_TRUE(segment.is_open);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should something be checked about segment.length here? Or is undetermined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last segment.length is supposed to be 0. Updated.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zanmato1984 . I'll declare this good to merge, though I'd love if @westonpace could double-check this.

@pitrou pitrou merged commit ada965f into apache:main May 14, 2024
36 of 39 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label May 14, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request May 14, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ada965f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

Recently I've been working on some improvement for `Grouper` and I found adding `Reset` function could be beneficial. Then I trace down to apache#33484 from a TODO in code. Here comes this PR.

### What changes are included in this PR?

Add `Reset` function for all the concrete `Grouper` implementations, and eliminate the recreation of `Grouper` in `AnyKeysSegmenter`.

Also add more `RowSegmenter` cases covering `AnyKeysSegmenter`.

### Are these changes tested?

Yes. Legacy UTs should cover it well. Also added some new UTs.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#33484

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request May 29, 2024
### Rationale for this change

Recently I've been working on some improvement for `Grouper` and I found adding `Reset` function could be beneficial. Then I trace down to apache#33484 from a TODO in code. Here comes this PR.

### What changes are included in this PR?

Add `Reset` function for all the concrete `Grouper` implementations, and eliminate the recreation of `Grouper` in `AnyKeysSegmenter`.

Also add more `RowSegmenter` cases covering `AnyKeysSegmenter`.

### Are these changes tested?

Yes. Legacy UTs should cover it well. Also added some new UTs.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#33484

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants