-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RowFilter utility #32366
Add RowFilter utility #32366
Conversation
assign set of reviewers |
Assigning reviewers. If you would like to opt out of this review, comment R: @m-trieu for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
||
/** | ||
* Configures this {@link RowFilter} to filter {@link Row} by removing the specified fields. | ||
* Nested fields can be specified using dot-notation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a clear usecase for this? I'm not convinced it's worth the complexity (not just here, but for other transforms (possibly in other languages) that might also adopt these conventions, and further other tools that might want to do things like validation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about reaching nested fields?
No one asked for it AFAIK, just thought it would be a nice-to-have.
Beam already somewhat has this convention: our Select
transform (that does a similar "drop" operation on Beam Rows) also reaches nested fields
|
||
/** | ||
* A utility that filters fields from Beam {@link Row}s. This filter can be configured to indicate | ||
* what fields you would like to either <strong>keep</strong> or <strong>drop</strong>. Afterward, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the unnesting only
option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment to the design doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I went ahead and added an unnest
option where multiple fields can be provided. Let me know what you think -- I'm flexible on changing the name/implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a complete flattening will be what we want, e.g. in the example below, one might want to keep the sub-nesting structure intact (e.g. { bar: my_str, xyz: { baz: 456. qwe: 789 }}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unnest
ing implementation here will keep the structure intact (i.e. field names remain the same) and will fail at construction time if we end up with duplicate fields at the top-level (e.g. unnest: [baz.foo, bar.foo]
)
sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowFilter.java
Outdated
Show resolved
Hide resolved
Reminder, please take a look at this pr: @m-trieu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I have no problem with a dedicated Select transform that has this level of sophistication, I would like to avoid having this complexity built into all our IOs (and I see this pattern being useful for several of the ML operations as well). Basically, we want the minimum (or close to it, at least covering the common use cases) here to partition the input into two distinct, schema'd parts which can't be done as a preceding operation. (Anything more sophisticated can be done prior to this transform.)
Maybe this is worth discussing in a larger forum?
|
||
/** | ||
* A utility that filters fields from Beam {@link Row}s. This filter can be configured to indicate | ||
* what fields you would like to either <strong>keep</strong> or <strong>drop</strong>. Afterward, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a complete flattening will be what we want, e.g. in the example below, one might want to keep the sub-nesting structure intact (e.g. { bar: my_str, xyz: { baz: 456. qwe: 789 }}
.
@robertwb That makes sense, I can get behind that. We can always add more complexity later if it becomes a big ask. So what I'm hearing is we want to support keeping/dropping top-level fields only? Am I understanding right |
Yep, exactly. Keep and drop look good now. |
* | ||
* <pre>{@code | ||
* abc: 123 | ||
* foo: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think what will be useful here is to indicate that one only wants to write foo, but not nested as
foo:
bar: my_str
xyz:
baz: 456
que: 789
but to a sink expecting Foo type and the records are
bar: my_str
xyz:
baz: 456
que: 789
(e.g. I think it'd be quite common to have elements {path: ..., record: ...} and want to write out record.
To use unnest
as written one would have to enumerate all the fields of foo
(or record
). This is why I was prosing naming this only
and making there be exactly one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see, that does look like a better approach.
Does it make sense to make it a list option? It would server the purpose of only
but also allow unnesting multiple rows. Just wondering because it would be inconvenient in the future if we ever wanted to turn it from a string
to a list<string>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. should we also limit unnest
to top-level fields?
ie. in your example one can do unnest: foo
but not unnest: foo.xyz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's limit to top-level fields as well (for now at least).
For naming, I think it'd be useful to get lots of eyes/opinions. Created https://docs.google.com/document/d/1IIn4cjF9eYASnjSmVmmAt6ymFnpBxHgBKVPgpnQ12G4/edit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no support for a list unnest
option by Monday, I'm down to settle on a string only
option. After thinking a lil about it, it doesn't make too much sense to unnest multiple records, effectively merging their contents into one bigger record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed a commit to switch to the only
option. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, and thanks for bearing with me. This is going to be a great enhancement for more than just managed IO sinks.
Thanks! The feedback was much appreciated 🙏🏽 |
Utility to easily filters columns in Beam Rows
Part of #32365
See #31807 for how this can work for portable dynamic destinations