Add RowFilter utility #32366

ahmedabu98 · 2024-08-29T19:18:52Z

Utility to easily filter columns in Beam Rows

Part of #32365
See #31807 for how this can work for portable dynamic destinations

ahmedabu98 · 2024-08-29T20:17:47Z

assign set of reviewers

github-actions · 2024-08-29T20:18:57Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

robertwb · 2024-08-30T20:30:24Z

sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowFilter.java

+
+  /**
+   * Configures this {@link RowFilter} to filter {@link Row} by removing the specified fields.
+   * Nested fields can be specified using dot-notation.


Do we have a clear use case for this? I'm not convinced it's worth the complexity, not just here but also for other transforms (possibly in other languages) that might adopt these conventions, and for other tools that might want to do things like validation.

This is about reaching nested fields?

No one asked for it AFAIK, just thought it would be a nice-to-have.

Beam already somewhat has this convention: our Select transform (that does a similar "drop" operation on Beam Rows) also reaches nested fields

robertwb · 2024-08-30T20:32:49Z

sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowFilter.java

+
+/**
+ * A utility that filters fields from Beam {@link Row}s. This filter can be configured to indicate
+ * what fields you would like to either <strong>keep</strong> or <strong>drop</strong>. Afterward,


What about the unnesting only option?

Added a comment to the design doc

For now I went ahead and added an unnest option where multiple fields can be provided. Let me know what you think -- I'm flexible on changing the name/implementation.

I don't think a complete flattening will be what we want, e.g. in the example below, one might want to keep the sub-nesting structure intact (e.g. { bar: my_str, xyz: { baz: 456, qwe: 789 }}).

The unnesting implementation here will keep the structure intact (i.e. field names remain the same) and will fail at construction time if we end up with duplicate fields at the top-level (e.g. unnest: [baz.foo, bar.foo])
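The duplicate check described above can be illustrated with a minimal sketch. It is not the code from this PR: `UnnestSketch`, `leafName`, and `resolveOutputFields` are hypothetical names, and field paths are modelled as plain strings rather than Beam schema fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the RowFilter implementation.
final class UnnestSketch {

  /** Returns the last segment of a dot-separated path, e.g. "baz.foo" gives "foo". */
  static String leafName(String path) {
    int lastDot = path.lastIndexOf('.');
    return lastDot < 0 ? path : path.substring(lastDot + 1);
  }

  /**
   * Maps each output field name (the leaf of the path) to the path it came from.
   * Fails fast if two paths would land on the same top-level name.
   */
  static Map<String, String> resolveOutputFields(List<String> unnestPaths) {
    Map<String, String> leafToPath = new LinkedHashMap<>();
    for (String path : unnestPaths) {
      String leaf = leafName(path);
      String previous = leafToPath.putIfAbsent(leaf, path);
      if (previous != null) {
        throw new IllegalArgumentException(
            "Unnesting '" + previous + "' and '" + path
                + "' would both produce a top-level field named '" + leaf + "'");
      }
    }
    return leafToPath;
  }

  public static void main(String[] args) {
    // Distinct leaf names are accepted.
    System.out.println(resolveOutputFields(List.of("foo.bar", "foo.xyz")));
    // Colliding leaf names are rejected before any row is processed.
    try {
      resolveOutputFields(List.of("baz.foo", "bar.foo"));
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Doing the check while the filter is being configured, rather than per element, is what lets the failure surface at construction time.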

github-actions · 2024-09-07T12:13:52Z

Reminder, please take a look at this pr: @m-trieu

robertwb

While I have no problem with a dedicated Select transform that has this level of sophistication, I would like to avoid having this complexity built into all our IOs (and I see this pattern being useful for several of the ML operations as well). Basically, we want the minimum (or close to it, at least covering the common use cases) here to partition the input into two distinct, schema'd parts which can't be done as a preceding operation. (Anything more sophisticated can be done prior to this transform.)

Maybe this is worth discussing in a larger forum?

robertwb · 2024-09-11T23:02:39Z

sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowFilter.java

+
+/**
+ * A utility that filters fields from Beam {@link Row}s. This filter can be configured to indicate
+ * what fields you would like to either <strong>keep</strong> or <strong>drop</strong>. Afterward,


I don't think a complete flattening will be what we want, e.g. in the example below, one might want to keep the sub-nesting structure intact (e.g. { bar: my_str, xyz: { baz: 456, qwe: 789 }}).

ahmedabu98 · 2024-09-15T19:50:32Z

@robertwb That makes sense, I can get behind that. We can always add more complexity later if it becomes a big ask.

So what I'm hearing is we want to support keeping/dropping top-level fields only? Am I understanding that right?

robertwb · 2024-09-17T23:30:11Z

> @robertwb That makes sense, I can get behind that. We can always add more complexity later if it becomes a big ask.
>
> So what I'm hearing is we want to support keeping/dropping top-level fields only? Am I understanding that right?

Yep, exactly. Keep and drop look good now.
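As a rough sketch of the agreed semantics (keep and drop act on top-level fields only, and nested specifiers are rejected), the following models a row as a plain `Map`. The class `KeepDropSketch` and its methods are hypothetical and do not reflect the RowFilter API; the sample data mirrors the Javadoc example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; rows are modelled as maps, not Beam Rows.
final class KeepDropSketch {

  /** Rejects dot-notation specifiers such as "foo.bar". */
  static void requireTopLevel(List<String> fields) {
    for (String field : fields) {
      if (field.contains(".")) {
        throw new IllegalArgumentException(
            "Nested field specifiers are not supported: " + field);
      }
    }
  }

  /** Returns a copy of the row containing only the named top-level fields. */
  static Map<String, Object> keep(Map<String, Object> row, List<String> fields) {
    requireTopLevel(fields);
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : row.entrySet()) {
      if (fields.contains(e.getKey())) {
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  /** Returns a copy of the row without the named top-level fields. */
  static Map<String, Object> drop(Map<String, Object> row, List<String> fields) {
    requireTopLevel(fields);
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : row.entrySet()) {
      if (!fields.contains(e.getKey())) {
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("abc", 123);
    row.put("foo", Map.of("bar", "my_str"));

    System.out.println(keep(row, List.of("abc")));
    System.out.println(drop(row, List.of("abc")));
    try {
      keep(row, List.of("foo.bar"));
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Nested values pass through untouched, so anything more selective would be done in a step before the filter, as suggested earlier in the thread.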

robertwb · 2024-09-17T23:35:25Z

sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowFilter.java

+   *
+   * <pre>{@code
+   * abc: 123
+   * foo:


So I think what will be useful here is to indicate that one only wants to write foo, but not nested as

foo:
  bar: my_str
  xyz:
    baz: 456
    que: 789

but to a sink expecting Foo type and the records are

bar: my_str
xyz:
  baz: 456
  que: 789

(e.g. I think it'd be quite common to have elements {path: ..., record: ...} and want to write out record).

To use unnest as written one would have to enumerate all the fields of foo (or record). This is why I was proposing naming this only and requiring exactly one field.

Ahh I see, that does look like a better approach.

Does it make sense to make it a list option? It would serve the purpose of only but also allow unnesting multiple rows. Just wondering because it would be inconvenient in the future if we ever wanted to turn it from a string to a list<string>.

P.S. should we also limit unnest to top-level fields?

i.e. in your example one can do unnest: foo but not unnest: foo.xyz

Yes, let's limit to top-level fields as well (for now at least).

For naming, I think it'd be useful to get lots of eyes/opinions. Created https://docs.google.com/document/d/1IIn4cjF9eYASnjSmVmmAt6ymFnpBxHgBKVPgpnQ12G4/edit

If there's no support for a list unnest option by Monday, I'm down to settle on a string only option. After thinking a lil about it, it doesn't make too much sense to unnest multiple records, effectively merging their contents into one bigger record.

Just pushed a commit to switch to the only option. PTAL
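To make the `only` semantics discussed in this thread concrete, here is a small sketch using plain maps in place of Beam Rows. `OnlySketch` and its `only` method are hypothetical names, and the requirement that the selected field holds a nested record is an assumption based on the {path: ..., record: ...} example, not something confirmed by the PR code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; rows are modelled as maps, not Beam Rows.
final class OnlySketch {

  /**
   * Selects exactly one top-level field and returns its contents as the output
   * record, so the result is not wrapped in the original field name.
   */
  static Map<String, Object> only(Map<String, Object> row, String field) {
    if (field.contains(".")) {
      throw new IllegalArgumentException(
          "'only' accepts a top-level field, got: " + field);
    }
    Object value = row.get(field);
    // Assumption: the selected field must itself be a nested record.
    if (!(value instanceof Map)) {
      throw new IllegalArgumentException(
          "'only' expects a field holding a nested record: " + field);
    }
    @SuppressWarnings("unchecked")
    Map<String, Object> nested = (Map<String, Object>) value;
    return nested;
  }

  public static void main(String[] args) {
    Map<String, Object> xyz = new LinkedHashMap<>();
    xyz.put("baz", 456);
    xyz.put("que", 789);

    Map<String, Object> record = new LinkedHashMap<>();
    record.put("bar", "my_str");
    record.put("xyz", xyz);

    Map<String, Object> element = new LinkedHashMap<>();
    element.put("path", "some/path"); // placeholder value
    element.put("record", record);

    // The sub-nesting under "xyz" stays intact; only the outer wrapper is removed.
    System.out.println(only(element, "record"));
  }
}
```

Because a single field is named, the caller does not need to enumerate the fields inside it, which was the drawback of the multi-field unnest option.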

robertwb

LGTM, and thanks for bearing with me. This is going to be a great enhancement for more than just managed IO sinks.

ahmedabu98 · 2024-09-24T11:43:32Z

Thanks! The feedback was much appreciated 🙏🏽

RowFilter

4cca791

github-actions bot added the java label Aug 29, 2024

ahmedabu98 mentioned this pull request Aug 29, 2024

Add RowStringInterpolator utility #32367

Merged

spotless

1027bdb

github-actions bot added the Next Action: Reviewers label Aug 29, 2024

robertwb reviewed Aug 30, 2024

github-actions bot added the slow-review label Sep 7, 2024

ahmedabu98 added 2 commits September 9, 2024 13:14

re-order to make public API more visible

eabb805

add unnest filter operation and tests

2558725

robertwb reviewed Sep 11, 2024

ahmedabu98 added 2 commits September 16, 2024 07:10

fail when nested fields are specified for 'keep' or 'drop'

1fedf83

doc nit

50540d3

github-actions bot removed the slow-review label Sep 17, 2024

robertwb reviewed Sep 17, 2024

ahmedabu98 added 2 commits September 18, 2024 17:25

update implementation to unnest all fields under a row

df74050

switch implementation from 'unnest' multiple fields to 'only' select …

50d10b1

…one field

liferoad added this to the 2.60.0 Release milestone Sep 23, 2024

robertwb approved these changes Sep 23, 2024

ahmedabu98 merged commit c3be9f0 into apache:master Sep 24, 2024
28 checks passed
