
[Bug]: PreparePubsubWriteDoFn does not guarantee that a message will not exceed hard request limits #31800

sjvanrossum opened this issue Jul 8, 2024


What happened?

While working on cleaning up PubsubIO.Write to support writes with ordering keys without explicit bucketing, I noticed that PreparePubsubWriteDoFn does not guarantee that a single message fits within the limits of a PublishRequest for either gRPC or REST clients.

The validation routine in PreparePubsubWriteDoFn only validates the explicit message size and counts 6 bytes of overhead for every attribute. This works often enough if maxPublishBatchSize is set to less than 9 MB for the gRPC client, but it is more likely to fail for REST clients, since there is additional overhead to consider: the data field is sent as a base64-encoded string, and string fields such as attribute keys/values and ordering keys may require escape sequences (minimally \", \\, \b, \t, \n, \f, \r and \uXXXX for the control characters 0-31). For JSON clients the data field must be less than 7.5 MB, attribute size depends on the escape-sequence overhead, and requests must not exceed 10 MiB.
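
As an illustration of that JSON-side accounting, here is a minimal sketch (hypothetical helper names, not Beam SDK code) of how the base64 expansion of the data field and the escape-sequence expansion of string fields could be estimated:

```java
import java.nio.charset.StandardCharsets;

// Rough estimate of JSON request overhead; illustrative only.
public class JsonSizeEstimates {
  // Base64 expands every 3 input bytes into 4 output characters (with padding).
  static int base64Length(int dataBytes) {
    return ((dataBytes + 2) / 3) * 4;
  }

  // Length of a JSON string literal body, counting escape sequences for
  // quote, backslash and control characters 0-31, and raw UTF-8 bytes otherwise.
  static int jsonStringLength(String s) {
    int len = 0;
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      i += Character.charCount(cp);
      if (cp == '"' || cp == '\\') {
        len += 2;                          // \" or \\
      } else if (cp == '\b' || cp == '\t' || cp == '\n' || cp == '\f' || cp == '\r') {
        len += 2;                          // two-character escape
      } else if (cp < 0x20) {
        len += 6;                          // \u00XX
      } else if (cp < 0x80) {
        len += 1;                          // ASCII
      } else if (cp < 0x800) {
        len += 2;                          // 2-byte UTF-8
      } else if (cp < 0x10000) {
        len += 3;                          // 3-byte UTF-8
      } else {
        len += 4;                          // 4-byte UTF-8 (surrogate pair)
      }
    }
    return len;
  }
}
```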

For gRPC clients, the static overhead of 6 bytes per attribute is valid only for map entries smaller than 128 B in protobuf wire format. The tag/length overhead may grow to 3 bytes each for the entry, key and value: attribute field tag (1 B) + map entry length (1-2 B) + key field tag (1 B) + key length (1-2 B) + value field tag (1 B) + value length (1-2 B). The total serialized size of any single-message publish request must currently not exceed 10 MB. The API accepts requests up to 10 MiB for both JSON and protobuf, and may omit the protobuf encoding overhead from request validation at some point in the future, in which case a single message will not exceed 10 MiB since its encoding overhead is less than 1 KiB.
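
To make the wire-format arithmetic concrete, here is a sketch of the per-entry overhead computation (the helpers are illustrative, not existing Beam or protobuf-java API):

```java
import java.nio.charset.StandardCharsets;

// Serialized size of a single attributes map entry in protobuf wire format;
// illustrative only.
public class ProtoAttributeOverhead {
  // Size of a varint-encoded value: 1 byte below 128, 2 bytes below 16384, etc.
  static int varintSize(int value) {
    int size = 1;
    while ((value >>>= 7) != 0) {
      size++;
    }
    return size;
  }

  // attributes field tag + entry length
  //   + (key field tag + key length + key bytes)
  //   + (value field tag + value length + value bytes)
  static int attributeEntrySize(String key, String value) {
    int keyBytes = key.getBytes(StandardCharsets.UTF_8).length;
    int valueBytes = value.getBytes(StandardCharsets.UTF_8).length;
    int entryPayload =
        1 + varintSize(keyBytes) + keyBytes
            + 1 + varintSize(valueBytes) + valueBytes;
    return 1 + varintSize(entryPayload) + entryPayload;
  }
}
```

For an empty key and value this comes out to the 6 bytes of static overhead; once the entry payload reaches 128 B the length prefixes grow, up to 9 bytes of overhead in total.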

Batched requests should be validated again by PubsubClient to ensure a single PublishRequest does not exceed these limits. That's not as simple either, since a JSON PublishRequest that stays under 10 MB or 10 MiB may still exceed those limits once the request is transcoded into a protobuf PublishRequest before it reaches Pub/Sub.
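
A minimal sketch of what such a batching check could look like, assuming per-message size estimates for both encodings are available (this is not the actual PubsubClient code):

```java
import java.util.ArrayList;
import java.util.List;

// Batches messages and flushes whenever either the protobuf or the JSON
// encoding of the pending request would exceed the request limit; sketch only.
public class DualEncodingBatcher<T> {
  private static final long REQUEST_LIMIT_BYTES = 10L * 1000 * 1000; // 10 MB

  private final List<T> batch = new ArrayList<>();
  private long protoBytes = 0;
  private long jsonBytes = 0;

  void add(T message, long messageProtoBytes, long messageJsonBytes) {
    if (!batch.isEmpty()
        && (protoBytes + messageProtoBytes > REQUEST_LIMIT_BYTES
            || jsonBytes + messageJsonBytes > REQUEST_LIMIT_BYTES)) {
      flush();
    }
    batch.add(message);
    protoBytes += messageProtoBytes;
    jsonBytes += messageJsonBytes;
  }

  void flush() {
    // Publish the accumulated batch here, then reset the counters.
    batch.clear();
    protoBytes = 0;
    jsonBytes = 0;
  }
}
```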

Filing this for visibility since I was already working on fixes for this while collaborating on #31608: one to correct the validation in PreparePubsubWriteDoFn and one to refactor batching in PubsubClient to account for this.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner