Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Java SDK blocked due to FnApi data stream multiplexing and ProcessBundleHandler exception handling #32714

Closed
1 of 17 tasks
scwhittle opened this issue Oct 9, 2024 · 0 comments · Fixed by #32857
Closed
1 of 17 tasks
Assignees

Comments

@scwhittle
Copy link
Contributor

What happened?

Java SDK pipelines using the FnApi can become wedged if the single data stream is blocked attempting to queue records for a process bundle handler which was never registered.
image

It seems like this same stuckness could also occur with the error handling if we did create the bundle handler successfully but encountered an exception during processing and unregistered the instruction id and then had more data show up for the instruction id. The healthy case ensures that the data for the instruction is drained before returning and removing the registration.
image

There could be a control stream issue where the instruction id doesn't arrive but I think that there is a race with error handling that could explain why the future is never notified. It seems possible that a handler for the instruction id was recieved and registered but then was removed due to processing completing with an error. If an element on the data stream arrives for the instruction id after this point it will create a new entry in the map for the instruction id which was removed, and will block forever because the instruction id will not be registered and have it's future notified.
image

To fix I think we could:

  1. catch and retry the exception creating the bundle processor
  2. add a timeout to the data multiplexing waiting for the instruction to arrive. If the timeout is exceeded we'd have to error out the data stream causing SDK restart but that is preferrable to being stuck forever. We could reduce this triggering for the identified races above by having a cache of recently errored instructions.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant