Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RowFilter utility #32366

Merged
merged 8 commits into from
Sep 24, 2024
Merged

Add RowFilter utility #32366

merged 8 commits into from
Sep 24, 2024

Conversation

ahmedabu98
Copy link
Contributor

@ahmedabu98 ahmedabu98 commented Aug 29, 2024

Utility to easily filters columns in Beam Rows

Part of #32365
See #31807 for how this can work for portable dynamic destinations

@ahmedabu98
Copy link
Contributor Author

assign set of reviewers

Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


/**
* Configures this {@link RowFilter} to filter {@link Row} by removing the specified fields.
* Nested fields can be specified using dot-notation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a clear usecase for this? I'm not convinced it's worth the complexity (not just here, but for other transforms (possibly in other languages) that might also adopt these conventions, and further other tools that might want to do things like validation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about reaching nested fields?

No one asked for it AFAIK, just thought it would be a nice-to-have.

Beam already somewhat has this convention: our Select transform (that does a similar "drop" operation on Beam Rows) also reaches nested fields


/**
* A utility that filters fields from Beam {@link Row}s. This filter can be configured to indicate
* what fields you would like to either <strong>keep</strong> or <strong>drop</strong>. Afterward,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the unnesting only option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to the design doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I went ahead and added an unnest option where multiple fields can be provided. Let me know what you think -- I'm flexible on changing the name/implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a complete flattening will be what we want, e.g. in the example below, one might want to keep the sub-nesting structure intact (e.g. { bar: my_str, xyz: { baz: 456. qwe: 789 }}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unnesting implementation here will keep the structure intact (i.e. field names remain the same) and will fail at construction time if we end up with duplicate fields at the top-level (e.g. unnest: [baz.foo, bar.foo])

Copy link
Contributor

github-actions bot commented Sep 7, 2024

Reminder, please take a look at this pr: @m-trieu

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I have no problem with a dedicated Select transform that has this level of sophistication, I would like to avoid having this complexity built into all our IOs (and I see this pattern being useful for several of the ML operations as well). Basically, we want the minimum (or close to it, at least covering the common use cases) here to partition the input into two distinct, schema'd parts which can't be done as a preceding operation. (Anything more sophisticated can be done prior to this transform.)

Maybe this is worth discussing in a larger forum?


/**
* A utility that filters fields from Beam {@link Row}s. This filter can be configured to indicate
* what fields you would like to either <strong>keep</strong> or <strong>drop</strong>. Afterward,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a complete flattening will be what we want, e.g. in the example below, one might want to keep the sub-nesting structure intact (e.g. { bar: my_str, xyz: { baz: 456. qwe: 789 }}.

@ahmedabu98
Copy link
Contributor Author

ahmedabu98 commented Sep 15, 2024

@robertwb That makes sense, I can get behind that. We can always add more complexity later if it becomes a big ask.

So what I'm hearing is we want to support keeping/dropping top-level fields only? Am I understanding right

@robertwb
Copy link
Contributor

@robertwb That makes sense, I can get behind that. We can always add more complexity later if it becomes a big ask.

So what I'm hearing is we want to support keeping/dropping top-level fields only? Am I understanding right

Yep, exactly. Keep and drop look good now.

*
* <pre>{@code
* abc: 123
* foo:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think what will be useful here is to indicate that one only wants to write foo, but not nested as

foo:
  bar: my_str
  xyz:
    baz: 456
    que: 789

but to a sink expecting Foo type and the records are

bar: my_str
xyz:
  baz: 456
  que: 789

(e.g. I think it'd be quite common to have elements {path: ..., record: ...} and want to write out record.

To use unnest as written one would have to enumerate all the fields of foo (or record). This is why I was prosing naming this only and making there be exactly one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see, that does look like a better approach.

Does it make sense to make it a list option? It would server the purpose of only but also allow unnesting multiple rows. Just wondering because it would be inconvenient in the future if we ever wanted to turn it from a string to a list<string>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. should we also limit unnest to top-level fields?

ie. in your example one can do unnest: foo but not unnest: foo.xyz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's limit to top-level fields as well (for now at least).

For naming, I think it'd be useful to get lots of eyes/opinions. Created https://docs.google.com/document/d/1IIn4cjF9eYASnjSmVmmAt6ymFnpBxHgBKVPgpnQ12G4/edit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no support for a list unnest option by Monday, I'm down to settle on a string only option. After thinking a lil about it, it doesn't make too much sense to unnest multiple records, effectively merging their contents into one bigger record.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a commit to switch to the only option. PTAL

@liferoad liferoad added this to the 2.60.0 Release milestone Sep 23, 2024
Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and thanks for bearing with me. This is going to be a great enhancement for more than just managed IO sinks.

@ahmedabu98
Copy link
Contributor Author

Thanks! The feedback was much appreciated 🙏🏽

@ahmedabu98 ahmedabu98 merged commit c3be9f0 into apache:master Sep 24, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants