Write binary data in yaml canonical format instead of hex encoded strings #5

prestonvanloon · 2019-06-06T15:36:26Z

We're finding that unmarshalling yaml is very cumbersome with hex encoded strings as binary data.

Example: ssz: '0x00000000' cannot be inherently unmarshaled to a byte slice in go. go playground link.

Looking at the spec for yaml, this data should be represented as a base64 binary string with the prefix !!binary.

Edit: I no longer support adding !!binary as it is entirely unnecessary with modern libraries understanding that Base64 encoded strings are binary data. However, the tag resolution in the yaml spec for this scenario is unclear. It's probably safer to add !!binary explicitly.

The text was updated successfully, but these errors were encountered:

paulhauner · 2019-06-06T16:08:44Z

We configured our parser to handle this quite easily, so it's not a problem for us.

I'm not opposed to switching if it's a common problem, but my preference is that we keep the YAML and JSON representations of binary data the same, as to reduce total system complexity.

EDIT: When I refer to JSON, I'm referring to the minimal validator API spec. Here signatures, roots and other binary data are represented as 0x00...

prestonvanloon · 2019-06-06T16:44:26Z

Still investigating this proposal. We have been using hex encoded strings in eth1 for some time now and have existing functionality for that with JSON encoding in go-ethereum.

Example block returned from json-rpc in eth1.

{
    "number": 3,
    "hash": "0xef95f2f1ed3ca60b048b4bf67cde2195961e0bba6f70bcbea9a2c4e133e34b46",
    "parentHash": "0x2302e1c0b972d00932deb5dab9eb2982f570597d9d42504c05d9c2147eaf9c88",
    "nonce": "0xfb6e1a62d119228b",
    "sha3Uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
    "logsBloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
    "transactionsRoot": "0x3a1b03875115b79539e5bd33fb00d8f7b7cd61929d5a3c574f507b8acf415bee",
    "stateRoot": "0xf1133199d44695dfa8fd1bcfe424d82854b5cebef75bddd7e40ea94cda515bcb",
    "receiptsRoot: '0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421',
    "miner": "0x8888f1f195afa192cfee860698584c030f4c9db1",
    "difficulty": '21345678965432',
    "totalDifficulty": '324567845321',
    "size": 616,
    "extraData": "0x",
    "gasLimit": 3141592,
    "gasUsed": 21662,
    "timestamp": 1429287689,
    "transactions": [
        "0x9fc76417374aa880d4449a1f7f31ec597f00b1f6f3dd2d66f4c9c6c445836d8b"
    ],
    "uncles": []
}

Suggesting a shift from hex encoded strings to base64 encoding would be annoying to say the least.

Will follow up with a bit more before closing this out, but it's looking like it should be hex encoded.

prestonvanloon · 2019-06-06T16:59:15Z

It looks like go-ethereum uses type overrides (ex) which uses gencodec to create custom json/yaml parsers to map to go structs (ex).
So, it looks like we can generate some extra code to support hex encoded strings to represent binary data and close out this issue.

The follow up, if its worth following up, is why are we using hex encoded strings to represent binary data in the first place instead of the seemingly more intuitive Base64 encoding?

arnetheduck · 2019-06-06T17:15:27Z

one reason to use one encoding over another is that it should ideally match whatever is used in canonical value printers so it's easy to compare logs with yaml files..

prestonvanloon · 2019-06-06T20:59:11Z

Good thinking. Clients are more likely to print hex encoded strings for binary data.

prestonvanloon · 2019-06-18T00:23:43Z

Reopening this. We've found it too infeasible to writing special decoding/encoding support for these byte arrays as hex encoded strings. Instead, we overwrite the hex strings to base64 in the yaml, then ingest that data to the tests.

With regards to "do what the JSON API does", I recant my earlier statement and I don't support hex encoded strings in JSON going forward. I see little benefit to hex encoded over base64 and it means we have to write a special work arounds to support it.

The latter comment is probably for a different discussion topic.

prestonvanloon · 2019-06-19T20:37:09Z

Another note in support of base64, we've found that these yaml files are saving about 30% space on disk as base64 rather than hex encoded.

protolambda · 2019-06-19T20:51:56Z

For size, we could also compress, encode suites in SSZ (or other), or split mainnet suites (huge) from minimal suites (reasonable) in another repo. If there's an issue with size, clients should put that forward, and describe their resources / caching. But I don't think there's much of a problem (although yes, yaml encoded mainnet tests are very large).

Generally I think it's perfectly fine to load tests in a convenient way, i.e. load into a map, translate python names and hex values, and load into the appropriate types. Better than describing all new types just for testing. In Go specifically, I just load them through https://github.com/mitchellh/mapstructure and do some translation, then put them directly in the types I use for state-transitioning. No extra code, or per-type conversion. And while I agree it's far from perfect, it's affordable (tests loading doesn't take that long) and easy.

You could generate structs with decoder definitions or use a custom yaml decoder, but I really just think running (and then passing) the tests at this stage is worth much more than (over-)engineering the test runner. Eventually, yes, it would be very nice. But first things first.

prestonvanloon · 2019-06-19T20:55:32Z

It's not necessarily an issue with size, but difficulty of using existing tools with hex encoded strings for seemingly no benefit.

protolambda · 2019-06-19T20:59:02Z

If other clients are in for a change too, we can change it. But don't want to break the working setup of anyone right before spec freeze.

arnetheduck · 2019-06-19T21:24:46Z

The point of yaml seems to be to optimize utility for humans, where hex is more commonly used - ie echo hello | sha256sum prints in hex.

Else we might as well go with a binary format that can be memory-mapped instead of parsed - that would at least net a simplification and speedup of test-running.

From nimbus, we're fine with keeping things interoperable and simple with hex too, and not too much yaml magic.

prestonvanloon · 2019-06-19T21:58:15Z

Thanks for the context. I now understand some of the reasoning that went into choosing hex for yaml, but I disagree with this decision. When we need to read the data as humans, we would likely do that in some form of test runner. I.e. Method failed. Wanted 0xf2d, got 0xabb. This formatting could be done with whatever stdout printer is being used without subjecting others to the cost having yaml data as hex.

We (and others) have had to write custom yaml processing logic or workarounds to support hex encoded strings in over 1 million of lines of yaml in this repository.

Maybe it is too late to undo this decision before spec freeze, but we think it is a technical debt that we'll be paying for a long time.

protolambda · 2019-06-19T22:12:02Z

@prestonvanloon Can you form a concrete counter proposal, to be executed after spec freeze, if it works for everyone?

prestonvanloon · 2019-06-19T22:36:06Z

This issue is my proposal. Base64 is the canonical format for binary data representation in yaml.
Nearly every library supports ingesting base64 as binary.

Sources:
https://yaml.org/spec/1.2/spec.html
https://yaml.org/type/binary.html

Canonical: Base64 without white space and line breaks.

protolambda · 2019-06-19T22:43:35Z

Was thinking of something more radical. If we give up on readability, there should be much better options.

prestonvanloon · 2019-06-19T22:57:30Z

We have no preference on readability or format as long as we don't have to waste time implementing things that already exist. YAML is fine if we use YAML as intended. JSON is fine because the JSON spec suggests Base64 as well.

This isn't giving up on readability. YAML is readable.

djrtwo · 2019-06-25T03:19:51Z

We don't intend to use any of the fancy !! YAML notation due to the more simple parsers not supporting this usage. The 0x is super common in ethereum and human readable and speakable.

I think we give something up if fields in tests such as pubkeys become difficult to discuss due to the base64 notation, especially considering pubkeys will almost certainly be represented in hex to users.

prestonvanloon · 2019-06-25T14:31:03Z

@djrtwo For some additional context, we have wasted at least one eng week to write tooling to support non-canonical formatting with our standard libraries. I think it is a bad idea to use hex for everything.

It makes sense when you need representation to be human readable for end users, like a wallet address in UI, but binary data for test ingestion is not one of those use cases. This data is not end user facing and is causing a lot of pain for little/no benefit.

protolambda · 2019-07-31T00:31:44Z

So pretty much all data, unless it's not a SSZ object, will also be available as ssz-encoded raw bytes in the next testing release. See ethereum/consensus-specs#1320
An example can be found here: https://github.com/ethereum/eth2.0-spec-tests/tree/tests-v082-reorg/tests/minimal/phase0/operations/deposit/pyspec_tests/success_top_up

So effectively no yaml / hex encoded data, unless for a small test metadata detail (which shouldn't touch ssz/production code anyway). Does this work for Prysm @prestonvanloon ?

prestonvanloon · 2019-08-04T17:41:33Z

unless for a small test metadata detail

What does this mean? There will still be some required yaml?

Switching to ssz encoded data will most likely work for us. Thanks for the recent improvements.

protolambda · 2019-08-04T18:28:40Z

@prestonvanloon there are a few small files that just say things like "there are 10 deposits to read from ssz files", "there are 6 blocks to read", "BLS required here". Etc. etc. Nothing hard to parse really.

prestonvanloon closed this as completed Jun 6, 2019

prestonvanloon reopened this Jun 18, 2019

prestonvanloon mentioned this issue Aug 5, 2019

[Mega tracking] - Align Prysm to 0.8.2 prysmaticlabs/prysm#3144

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write binary data in yaml canonical format instead of hex encoded strings #5

Write binary data in yaml canonical format instead of hex encoded strings #5

prestonvanloon commented Jun 6, 2019 •

edited

Loading

paulhauner commented Jun 6, 2019 •

edited

Loading

prestonvanloon commented Jun 6, 2019 •

edited

Loading

prestonvanloon commented Jun 6, 2019

arnetheduck commented Jun 6, 2019

prestonvanloon commented Jun 6, 2019

prestonvanloon commented Jun 18, 2019

prestonvanloon commented Jun 19, 2019

protolambda commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019

protolambda commented Jun 19, 2019

arnetheduck commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019 •

edited

Loading

protolambda commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019

protolambda commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019

djrtwo commented Jun 25, 2019 •

edited

Loading

prestonvanloon commented Jun 25, 2019 •

edited

Loading

protolambda commented Jul 31, 2019

prestonvanloon commented Aug 4, 2019

protolambda commented Aug 4, 2019

Write binary data in yaml canonical format instead of hex encoded strings #5

Write binary data in yaml canonical format instead of hex encoded strings #5

Comments

prestonvanloon commented Jun 6, 2019 • edited Loading

paulhauner commented Jun 6, 2019 • edited Loading

prestonvanloon commented Jun 6, 2019 • edited Loading

prestonvanloon commented Jun 6, 2019

arnetheduck commented Jun 6, 2019

prestonvanloon commented Jun 6, 2019

prestonvanloon commented Jun 18, 2019

prestonvanloon commented Jun 19, 2019

protolambda commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019

protolambda commented Jun 19, 2019

arnetheduck commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019 • edited Loading

protolambda commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019

protolambda commented Jun 19, 2019

prestonvanloon commented Jun 19, 2019

djrtwo commented Jun 25, 2019 • edited Loading

prestonvanloon commented Jun 25, 2019 • edited Loading

protolambda commented Jul 31, 2019

prestonvanloon commented Aug 4, 2019

protolambda commented Aug 4, 2019

prestonvanloon commented Jun 6, 2019 •

edited

Loading

paulhauner commented Jun 6, 2019 •

edited

Loading

prestonvanloon commented Jun 6, 2019 •

edited

Loading

prestonvanloon commented Jun 19, 2019 •

edited

Loading

djrtwo commented Jun 25, 2019 •

edited

Loading

prestonvanloon commented Jun 25, 2019 •

edited

Loading