FIP-13: Canonical serialization for hashing messages #87
Replies: 7 comments 3 replies
-
I really like this proposal, it greatly simplifies the work that needs to be done here. The only downside I'm slightly concerned about is the data bloat attack vector. It might be worth creating a quick model to estimate what the data store would be at 10M users if we extrapolated our current message patterns on the Hub, and what it would look like if all 10M users were attackers maxing out byte size on the network.
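A back-of-envelope version of that model might look like the following; every constant here is an illustrative assumption (not a measured Hub number), just to show the shape of the calculation:

```typescript
// Rough storage model. All constants are illustrative assumptions,
// NOT measured Hub values.
const USERS = 10_000_000;
const MSGS_PER_USER = 5_000;   // assumed per-user message allowance
const AVG_MSG_BYTES = 300;     // assumed typical serialized message size
const MAX_MSG_BYTES = 2_048;   // assumed per-message byte cap

// Storage if current message patterns are extrapolated to 10M users.
const typicalGB = (USERS * MSGS_PER_USER * AVG_MSG_BYTES) / 1e9;
// Storage if all 10M users max out byte size (the data bloat attack).
const worstCaseGB = (USERS * MSGS_PER_USER * MAX_MSG_BYTES) / 1e9;

console.log(`typical: ~${typicalGB} GB, worst case: ~${worstCaseGB} GB`);
```

With these placeholder numbers the adversarial case is larger by exactly the ratio of max to average message size, which is the multiplier the model would pin down with real Hub data.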
-
@sanjayprabhu thanks for turning this into a draft. A few comments:
-
After attempting to implement this, one major issue came up with this approach: we always need to keep the original bytestream and ship it around when requested (e.g. for sync). If we ever re-encode the message, the hash is not guaranteed to be the same and the message will not validate. This introduces a lot of additional complexity in the hub and invalidates the original rationale for this approach (simplicity).

I went back and evaluated the different options, and unfortunately, there's just no easy answer. Everything has a tradeoff. I'll summarize them below:

1. Store a copy of the original bytestream along with the message

We could modify the Message schema to be:

This maintains the simplicity of the approach, and is relatively simple on the client side, at the cost of doubling the storage cost of the hubs (and on the wire). Does not seem like a good tradeoff, especially as we want to be mindful of storage requirements for hubs.

2. Transparently serialize/deserialize the original bytestream within the hubs

In this approach, there would be no change to the Message protobuf, but the typescript

When deserializing a message, we'd decode it as normal, and also attach the original bytestream to

3. Patch ts-proto/switch to a different protobuf library

Neither of these actually solves the problem, because protobuf is not designed to offer deterministic serialization, so the issue would just show up again in a different form. protobuf v3 actually has an option for deterministic serialization, however it's severely limited:

Note the last sentence.

4. Define our own serialization format (just for hashing)

In this approach, we'd still use protobufs, but define a custom approach to serializing the data for the purposes of computing the hash. There are significant downsides:

Within this, there are two ways we could go:

As a side note, I discovered Cosmos is essentially doing option 1. above: https://docs.cosmos.network/main/architecture/adr-020-protobuf-transaction-encoding
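For concreteness, option 2 above might be sketched roughly like this; the wrapper type, field names, and helper signatures are hypothetical, not actual hub code:

```typescript
// Hypothetical sketch of option 2: carry the original bytestream alongside
// the decoded message, and never re-encode when the original is available.
interface DecodedMessage<T> {
  message: T;                // the normally-decoded protobuf object
  originalBytes: Uint8Array; // the exact bytes received on the wire
}

function decodeKeepingBytes<T>(
  bytes: Uint8Array,
  decode: (b: Uint8Array) => T,
): DecodedMessage<T> {
  return { message: decode(bytes), originalBytes: bytes };
}

function encodeForWire<T>(
  msg: DecodedMessage<T>,
  encode: (m: T) => Uint8Array,
): Uint8Array {
  // Prefer the original bytes: re-encoding may produce a different
  // bytestream, which would break the hash and signature.
  return msg.originalBytes ?? encode(msg.message);
}
```

The complexity this hides is exactly the issue described above: every code path that touches a message must remember to thread the original bytes through, or validation silently breaks.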
-
@sanjayprabhu removing the FIP number, since we're only supposed to assign these when in the review stage
-
After evaluating the tradeoffs, 4. b) Define a convention for how to deterministically serialize protobufs is likely the best way forward. We would only need to define a convention for serialization; we don't need to worry about deserialization. The following rules from the Cosmos ADR-27 offer a good starting point.

However, there are no publicly available implementations. We'll need our own implementation in typescript, and likely a reference implementation in one other language to validate that everything works end-to-end across languages.
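To show the flavor of such a convention (this is not the actual ADR-27 rule set), here is a minimal sketch of a deterministic encoder covering varint scalar fields only; the function names and the `Map`-based field representation are invented for illustration:

```typescript
// Minimal canonical-encoding sketch (varint scalar fields only).
// Rules illustrated: ascending field-number order, default (zero) values
// omitted, minimal-length varints. Hypothetical helper, not hub code.
function writeVarint(out: number[], value: number): void {
  // Minimal-length base-128 varint encoding.
  do {
    let byte = value & 0x7f;
    value = Math.floor(value / 128);
    if (value > 0) byte |= 0x80;
    out.push(byte);
  } while (value > 0);
}

function canonicalEncode(fields: Map<number, number>): Uint8Array {
  const out: number[] = [];
  // Sort by field number so every implementation emits the same byte order.
  for (const fieldNo of [...fields.keys()].sort((a, b) => a - b)) {
    const value = fields.get(fieldNo)!;
    if (value === 0) continue; // skip default values, like proto3 writers do
    writeVarint(out, fieldNo << 3); // tag: field number, wire type 0 (varint)
    writeVarint(out, value);
  }
  return Uint8Array.from(out);
}
```

A real convention would also have to pin down length-delimited fields, nested messages, maps, and repeated fields, which is where most of the cross-language validation effort would go.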
-
This issue only affects signature verification when submitting messages to hubs, is that correct?
-
What is the issue with @farcasterxyz maintaining a ts-proto fork? I've seen teams like @solana-labs maintain their own fork of projects such as
-
Title: Canonical serialization for hashing messages
Type: Standards
Author: Sanjay Raveendran (@sanjayprabhu), Aditya (@adityapk00)
Abstract
We need either a canonical way to serialize messages, or a way to accept valid messages submitted by non-js libraries that serialize the data slightly differently. This FIP proposes that hubs support a new field with the raw serialized bytes and use that to verify the signature, avoiding this issue.
Problem
Verifying a message signature currently involves deserializing and then re-serializing it. Because different protobuf libraries (particularly in non-js languages) serialize data slightly differently, the raw bytes may not match when the hubs re-serialize the data to verify the signature, so we end up rejecting messages that are actually valid.
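As a concrete illustration (with hand-written wire bytes, not the output of any particular library): both of the following are legal proto3 encodings of the same logical message with an int field set to 0, yet their bytes, and therefore their hashes, differ.

```typescript
// Two legal wire encodings of the same logical message { field1: 0 }.
// Bytes are hand-written for illustration only.
const omitted = new Uint8Array([]);                // writer omits the zero-valued field
const explicitZero = new Uint8Array([0x08, 0x00]); // writer emits tag (field 1, varint) + 0

// A decoder yields field1 = 0 in both cases, but the raw bytes differ,
// so a hash over the re-serialized form need not match the signed bytes.
const bytesMatch = omitted.length === explicitZero.length;
```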
Specification
Hubs will accept a new optional field called `data_bytes` in the `Message` object, which will be set if the serialized form of `data` is not consistent with ts-proto. This field is mutually exclusive with `data`. When `data_bytes` is set, the hub will deserialize this field and overwrite `data`. It will also use the raw `data_bytes` bytestream to validate the hash and the signature. This allows the hub to support other protobuf implementations with minimal changes and little to no storage overhead.

Rationale
There is a lot of context on this issue, so it's recommended to read it first. This excerpt summarizes the issue:
Alternatives considered

- Patch `ts-proto` to serialize empty ints "correctly"
- Switch to `protobuf-ts` or `protobuf-es`
Release
This is a backwards compatible change since it only adds a new, optional field.
Target release for this proposal is protocol version 2023.11.15
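A rough sketch of the acceptance logic described in the Specification; the types and helper signatures below are placeholders, not the actual hub implementation:

```typescript
// Hypothetical sketch: when data_bytes is set, it is the source of truth
// for hash/signature checks, and data is rebuilt from it.
interface MessageData { /* fields elided */ }

interface Message {
  data?: MessageData;     // decoded form (ts-proto serialization)
  dataBytes?: Uint8Array; // raw bytes from a non-ts-proto serializer
}

function bytesToVerify(
  message: Message,
  encodeData: (d: MessageData) => Uint8Array,
  decodeData: (b: Uint8Array) => MessageData,
): Uint8Array {
  if (message.dataBytes !== undefined) {
    // Overwrite data with the deserialized bytes; hash and verify
    // the signature over the raw bytestream.
    message.data = decodeData(message.dataBytes);
    return message.dataBytes;
  }
  // No data_bytes: the ts-proto serialization of data is canonical.
  return encodeData(message.data!);
}
```

Because js clients never set `data_bytes`, the common path is unchanged, which is what makes the change backwards compatible.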