While working on cleaning up PubsubIO.Write to support writes with ordering keys without explicit bucketing I noticed that PreparePubsubWriteDoFn does not guarantee that a single message fits within the limits of a PublishRequest for either gRPC or REST clients.
The validation routine in PreparePubsubWriteDoFn only validates explicit message size and counts 6 bytes of overhead for every attribute. This works often enough when maxPublishBatchSize is set below 9 MB for the gRPC client, but it is more likely to fail for REST clients, where there is additional overhead to account for: the data field is sent as a base64-encoded string, and string fields such as attribute keys/values and ordering keys may require escape sequences (minimally required: \", \\, \b, \t, \n, \f, \r, and \uXXXX for the remaining control characters 0-31). For JSON clients the data field must be under 7.5 MB, attribute size depends on the overhead of escape sequences, and requests must not exceed 10 MiB.
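To see where the JSON-side numbers come from, here is a minimal sketch (hypothetical helper names, not Beam code) of the two expansion effects: base64 inflates the data field by 4/3, which is exactly why 7.5 MB of raw data fills a 10 MB budget, and JSON escaping can inflate attribute/ordering-key strings by up to 6x per control character:

```java
class JsonSizeEstimate {
    // Size of the data field after base64 encoding: every 3 input bytes
    // become 4 output characters (with padding).
    static int base64Size(int rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    // Worst-case JSON string size for an ASCII input: '\b','\t','\n','\f','\r',
    // '"' and '\\' escape to 2 characters, remaining control chars 0-31 escape
    // to \uXXXX (6 characters); everything else stays 1 byte.
    static int worstCaseJsonStringSize(byte[] s) {
        int size = 2; // surrounding quotes
        for (byte b : s) {
            if (b >= 0 && b < 0x20) {
                size += (b == '\b' || b == '\t' || b == '\n' || b == '\f' || b == '\r') ? 2 : 6;
            } else if (b == '"' || b == '\\') {
                size += 2;
            } else {
                size += 1;
            }
        }
        return size;
    }

    public static void main(String[] args) {
        // 7.5 MB of raw payload becomes exactly 10 MB of base64 text.
        System.out.println(base64Size(7_500_000)); // 10000000
    }
}
```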
For gRPC clients, the static overhead of 6 bytes per attribute is valid for any map entry under 128 B in protobuf wire format. The TLV overhead may grow to 3 bytes each for the entry, key, and value (attribute field tag (1 B) + map entry length (1-2 B) + key field tag (1 B) + key length (1-2 B) + value field tag (1 B) + value length (1-2 B)). The total serialized size of any single-message publish request currently must not exceed 10 MB. The API accepts requests up to 10 MiB for both JSON and protobuf and may, at some point in the future, omit the protobuf encoding overhead from request validation, in which case a single message will not exceed 10 MiB, since its encoding overhead is less than 1 KiB.
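The 6-to-9-byte range above can be sketched directly from the protobuf varint rules (a minimal illustration, not Beam code; helper names are hypothetical): each length prefix takes 1 byte below 128 and 2 bytes below 16384, so a small attribute costs 6 bytes of overhead while one at Pub/Sub's documented maximums (256 B key, 1024 B value) costs 9:

```java
class ProtoAttributeOverhead {
    // Number of bytes a non-negative value occupies as a protobuf varint.
    static int varintSize(int v) {
        int size = 1;
        while ((v >>>= 7) != 0) size++;
        return size;
    }

    // Wire-format overhead of one attributes map entry: the attributes field
    // tag + entry-length varint, plus (tag + length varint) for key and value.
    static int mapEntryOverhead(int keyLen, int valueLen) {
        int entryLen = 1 + varintSize(keyLen) + keyLen
                     + 1 + varintSize(valueLen) + valueLen;
        return 1 + varintSize(entryLen)
             + 1 + varintSize(keyLen)
             + 1 + varintSize(valueLen);
    }

    public static void main(String[] args) {
        // Small entry: every length fits in one varint byte -> 6 B overhead.
        System.out.println(mapEntryOverhead(10, 20)); // 6
        // Pub/Sub maximums (256 B key, 1024 B value): two-byte varints -> 9 B.
        System.out.println(mapEntryOverhead(256, 1024)); // 9
    }
}
```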
Batched requests should be validated again by PubsubClient to ensure a single PublishRequest does not exceed these limits. That's... not as simple either, since a JSON PublishRequest that stays under 10 MB or 10 MiB may still exceed those limits once it is transcoded into a protobuf PublishRequest before it hits Pub/Sub.
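The core problem is that the two encodings of the same message have different sizes, so a limit checked against one says nothing about the other. A minimal sketch (hypothetical names, data field only) comparing the JSON and protobuf wire sizes of the same payload:

```java
class EncodedSizeComparison {
    // Approximate JSON wire size of the data field alone: base64 inflates
    // the raw payload by 4/3 (padded to a multiple of 4).
    static int jsonDataFieldSize(int rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    // Protobuf wire size of the same field: tag byte + length varint + raw bytes.
    static int protoDataFieldSize(int rawBytes) {
        int v = rawBytes, lenBytes = 1;
        while ((v >>>= 7) != 0) lenBytes++;
        return 1 + lenBytes + rawBytes;
    }

    public static void main(String[] args) {
        int raw = 7_500_000;
        // JSON: 10,000,000 bytes for the base64 string alone;
        // protobuf: 7,500,005 bytes. Each limit must be checked against
        // the encoding it actually applies to.
        System.out.println(jsonDataFieldSize(raw));  // 10000000
        System.out.println(protoDataFieldSize(raw)); // 7500005
    }
}
```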
Filing this for visibility since I was working on fixes for this while collaborating on #31608: one to correct the validation in PreparePubsubWriteDoFn and one to refactor batching in PubsubClient to account for this.
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner