This document describes the B3 binary format, including data type formats, the structure of composite items, and the format of the item header. It also documents semantics and policies for the two main types of packer/encoder - schema and schemaless.
Note: Here and in the Python code, the term encode means pack to bytes, and decode means unpack from bytes.
These are in rough order of priority.
- Simplicity & Correctness -> Security
- Interoperability & Compatibility (forward & backward)
- Easy to install and use (no 3rd party libraries)
- Flexibility
- Compactness
- Performance
B3 is a TLV ("type-length-value") based format. All data structures are encoded on the wire as a series of items. Each item has a value, size, data type, and optional key.
B3 is a 'bottom up' format, where all items have a known size. This means it:
- does not support "unknown size" items (see footnote 1),
- does support nested items - the 'inner' items must first be encoded, then included in outer items.
B3 uses LEB128-format variable length encoded integers in a number of places, unsigned ("UVARINT") and signed ("SVARINT"). This format ensures small numbers use few bytes. See https://en.wikipedia.org/wiki/LEB128 for more information.
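As an illustration of the varint encoding, here is a minimal Python sketch. The function names are hypothetical and are not taken from the B3 codebase. The unsigned functions follow standard unsigned LEB128. The signed functions assume that SVARINT means "zigzag-map the integer, then encode it as a UVARINT", which is how the data types table describes it; treat that as an assumption rather than the reference behaviour.

```python
def encode_uvarint(n):
    """Encode a non-negative int as unsigned LEB128 (7 bits per byte)."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uvarint(buf, index=0):
    """Decode an unsigned LEB128 int. Returns (value, next_index)."""
    result = 0
    shift = 0
    while True:
        byte = buf[index]
        index += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:      # no continuation bit: this was the last byte
            return result, index
        shift += 7


def encode_svarint(n):
    """Zigzag-map a signed int (0,-1,1,-2 -> 0,1,2,3), then uvarint it."""
    zigzag = (n << 1) if n >= 0 else ((-n << 1) - 1)
    return encode_uvarint(zigzag)


def decode_svarint(buf, index=0):
    """Inverse of encode_svarint. Returns (value, next_index)."""
    zigzag, index = decode_uvarint(buf, index)
    return (zigzag >> 1) ^ -(zigzag & 1), index
```

Small magnitudes stay small on the wire: `encode_uvarint(127)` is one byte, `encode_uvarint(128)` is two, and `encode_svarint(-1)` is one byte.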
Units of data are encoded as "Items".
An item consists of a mandatory control byte, followed by some optional header components, followed by the item value's encoded bytes, as follows:
1. The control byte begins the item, and is always present.
2. The data type number (if any) immediately follows (1) and is encoded as a UVARINT.
3. The key (if any) follows (2) and is encoded as described below.
4. The data length (if any) follows (3) and is encoded as a UVARINT.
5. The item value's data bytes follow (4).
On the wire, this looks like:
<BYTE control> [UVARINT type] [BYTES key] [UVARINT data length] [BYTES data]
------------------------ item_header -------------------------- -- codecs --
The control byte dictates the presence/absence of item components, and also holds the data type number for types 0-14. Its bits are as follows, from most significant to least significant:
+------------+------------+------------+------------+------------+------------+------------+------------+
| data type | data type | data type | data type | has data | null/zero | key type | key type |
+------------+------------+------------+------------+------------+------------+------------+------------+
The upper 4 bits of the control byte form an integer from 0-15, which is the item's data type number.
Values 0-14 correspond to the core data types documented in the Data Types section below.
Value 15 means the actual data type number is encoded as a UVARINT immediately following the control byte.
Data type numbers 96 through 8191 inclusive are open for use as User-Defined Types. All other type numbers are reserved for use by the B3 standard.
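The control byte layout can be sketched in Python as below. The function and constant names are hypothetical. The bit masks assume the table above lists bits from most significant to least significant, which is consistent with the data type occupying the upper 4 bits.

```python
HAS_DATA_BIT = 0x08    # assumed: bit 3
NULL_ZERO_BIT = 0x04   # assumed: bit 2
KEY_TYPE_MASK = 0x03   # assumed: bits 1-0
EXTENDED_TYPE = 15     # nibble value meaning "type number follows as UVARINT"


def pack_control_byte(data_type, has_data, null_zero, key_type):
    """Build the control byte. Types above 14 are signalled with nibble 15;
    the caller must then append the real type number as a UVARINT."""
    if data_type < 0:
        raise ValueError("data type number must be non-negative")
    if not 0 <= key_type <= 3:
        raise ValueError("key type must fit in 2 bits")
    nibble = data_type if data_type < EXTENDED_TYPE else EXTENDED_TYPE
    control = nibble << 4
    if has_data:
        control |= HAS_DATA_BIT
    if null_zero:
        control |= NULL_ZERO_BIT
    return control | key_type


def unpack_control_byte(control):
    """Return (type_nibble, has_data, null_zero, key_type)."""
    return (
        control >> 4,
        bool(control & HAS_DATA_BIT),
        bool(control & NULL_ZERO_BIT),
        control & KEY_TYPE_MASK,
    )


def is_user_defined_type(data_type):
    """True for type numbers in the user-defined range (96-8191 inclusive)."""
    return 96 <= data_type <= 8191
```

For example, a UTF8 item (type 1) with data and a string key (key type bits `1 0`) packs to `0x1A` under these assumptions.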
There are 4 possible types of key:
- No key is present at all (no key bytes follow the control byte)
- Integer key - UVARINT encoded
- String key - UTF8 encoded, with the UTF8 size in bytes first, UVARINT encoded.
- Bytes key - the raw bytes, with the byte size first, UVARINT encoded.
Control byte 'key type' bits select which is present, as follows:
0 0 no key
0 1 integer key
1 0 string key
1 1 bytes key
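A minimal Python sketch of key encoding follows. `encode_key` and the `KEY_*` constants are hypothetical names, and the sketch assumes Python 3 types (`str` for string keys, `bytes` for bytes keys). It returns the two key type bits alongside the bytes that would be placed in the item header.

```python
KEY_NONE, KEY_INT, KEY_STR, KEY_BYTES = 0b00, 0b01, 0b10, 0b11


def encode_uvarint(n):
    """Unsigned LEB128 encoding of a non-negative int."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_key(key):
    """Return (key_type_bits, key_bytes) for None, int, str or bytes keys."""
    if key is None:
        return KEY_NONE, b""
    if isinstance(key, bool):
        # bool is a subclass of int in Python; reject it to avoid surprises
        raise TypeError("bool is not a supported key type")
    if isinstance(key, int):
        return KEY_INT, encode_uvarint(key)
    if isinstance(key, str):
        raw = key.encode("utf-8")
        # the size prefix is the UTF8 byte count, not the character count
        return KEY_STR, encode_uvarint(len(raw)) + raw
    if isinstance(key, (bytes, bytearray)):
        return KEY_BYTES, encode_uvarint(len(key)) + bytes(key)
    raise TypeError("unsupported key type: %r" % type(key))
```

Note that a non-ASCII string key such as `"é"` gets a size prefix of 2, because the prefix counts encoded bytes.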
If the has data bit is 0,
- The item's data value is either Null, or the zero-value for its type.
- The null/zero bit controls whether the value is Null or the zero-value.
- There are no data length bytes.
- There are no data bytes.
If the has data bit is 1,
- The data length must be present.
- The data bytes must be present.
- The null/zero bit is not used by the header and is free for use by the data codecs.
- (Special case: the BOOL data type uses the null/zero bit to hold its value, so for BOOL there are no data bytes or length even though this bit is 1.)
If has_data is 0:
- if the null/zero bit is 0, the item value is the zero-value for its data type (0, empty string, etc.)
- if the null/zero bit is 1, the item value is NULL (None, nil, etc.)
If has_data is 1:
- the null/zero bit's value is not defined by the header and is free for use by the data type codecs
- data type BOOL uses this bit to carry its True/False value
The data length is encoded as a UVARINT.
Data length will be present when has_data = 1 and not present otherwise.
This means no bytes are wasted if encoding NULLS or zero values for items.
BOOL uses the null/zero (is_null) bit to carry its true/false value.
BOOL is the only data type for which there is no data length and no data payload bytes, even though has_data is 1.
All other data types must adhere to the "has_data controls length & payload presence" rule.
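The has_data and null/zero rules can be sketched as a small keyless-item encoder. All names here are hypothetical, the bit positions (0x08 for has_data, 0x04 for null/zero) are assumed from the control byte layout, and two further assumptions are labelled in the comments: an empty payload is treated as the zero-value, and both True and False are encoded with has_data set.

```python
B3_BOOL = 2        # core type number for BOOL
HAS_DATA = 0x08    # assumed position of the has_data bit
NULL_ZERO = 0x04   # assumed position of the null/zero bit


def encode_uvarint(n):
    """Unsigned LEB128 encoding of a non-negative int."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_keyless_item(data_type, data=None, is_null=False):
    """Encode a keyless item of a core type (0-14, excluding BOOL)."""
    if not 0 <= data_type <= 14 or data_type == B3_BOOL:
        raise ValueError("this sketch handles non-BOOL core types only")
    control = data_type << 4
    if is_null:
        return bytes([control | NULL_ZERO])   # has_data=0, null/zero=1
    if not data:
        # Assumption: None or an empty payload means the zero-value.
        return bytes([control])               # has_data=0, null/zero=0
    # has_data=1: the length and the payload must both be present
    return bytes([control | HAS_DATA]) + encode_uvarint(len(data)) + bytes(data)


def encode_keyless_bool(value):
    """Encode BOOL: value lives in the null/zero bit, no length or payload."""
    if value is None:
        return bytes([(B3_BOOL << 4) | NULL_ZERO])  # NULL, as for other types
    # Assumption: both True and False are sent with has_data=1.
    control = (B3_BOOL << 4) | HAS_DATA
    if value:
        control |= NULL_ZERO
    return bytes([control])
```

Under these assumptions a NULL or zero-value item costs exactly one byte, as does any BOOL.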
Lists (arrays) and Dicts (maps) are easily supported simply by using a series of items. More exotic data structures can be created by using a combination of keys and nesting, as follows.
A list is encoded as a series of items on the wire; the order of the list elements is the order in which the items appear on the wire. Keys are typically not used when working with lists.
A dict is also encoded as a series of items on the wire. Item keys are required.
An item's data bytes can themselves be a series of encoded items, which is how nesting is achieved on a linear wire, as shown here:
--item--  --item--  ----------item----------  --item--  --item--
[hdr|data][hdr|data][hdr|--------data--------][hdr|data][hdr|data] etc
                         [hdr|data][hdr|data]
                         --item--  --item--
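The 'bottom up' nesting shown in the diagram can be sketched in Python as follows. `encode_item` is a hypothetical helper that handles only keyless items with a non-empty payload, and the has_data bit position (0x08) is assumed from the control byte layout. The inner items are encoded first, and their concatenated bytes become the data of the outer LIST item.

```python
B3_UTF8 = 1    # core type numbers from the data types table
B3_LIST = 13
HAS_DATA = 0x08  # assumed position of the has_data bit


def encode_uvarint(n):
    """Unsigned LEB128 encoding of a non-negative int."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_item(data_type, data):
    """Keyless item with has_data=1: control byte, length, then payload."""
    control = (data_type << 4) | HAS_DATA
    return bytes([control]) + encode_uvarint(len(data)) + data


# Step 1: encode the inner items. Their sizes are now known.
inner = encode_item(B3_UTF8, b"a") + encode_item(B3_UTF8, b"bc")

# Step 2: wrap the inner bytes as the payload of an outer LIST item.
outer = encode_item(B3_LIST, inner)
```

Here `inner` is 7 bytes long, so the outer header is the LIST control byte followed by a one-byte length of 7.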
The core data types are as follows. Unknown data types can be interpreted as BYTES and passed through successfully, as the length is always known and explicit.
| Name | Number | Info |
|---|---|---|
| BYTES | 0 | Array of bytes. For bytes in py3 (note: str in py2). |
| UTF8 | 1 | UTF8 strings. For str in py3 and unicode in py2. |
| BOOL | 2 | True or False. |
| UVARINT | 3 | Unsigned varint (small or very large ints). |
| SVARINT | 4 | Signed varint, zigzag encoded. |
| U64 | 5 | Unsigned 64bit integer. |
| S64 | 6 | Signed 64bit integer. |
| FLOAT64 | 7 | IEEE754 64bit signed float. |
| DECIMAL | 8 | Arbitrary precision decimals. |
| SCHED | 9 | Datetime with tz/offset/subsec etc. for future times. |
| | 10 | Reserved for future use. |
| | 11 | Reserved for future use. |
| | 12 | Reserved for future use. |
| LIST | 13 | (d) list-like composite object. |
| DICT | 14 | (e) dict-like composite object. |
| | 15 | Reserved. |
| COMPLEX | 16 | Encoded as 3 float64s. |
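Because every item carries an explicit length, a reader can walk a buffer and pass unknown types through as raw bytes. The Python sketch below illustrates this. `iter_items` is a hypothetical name, the bit masks are assumed from the control byte layout, and the only type-specific knowledge it needs is the BOOL special case (no length or payload even when has_data is 1).

```python
B3_BOOL = 2


def decode_uvarint(buf, index):
    """Decode an unsigned LEB128 int. Returns (value, next_index)."""
    result = 0
    shift = 0
    while True:
        byte = buf[index]
        index += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index
        shift += 7


def iter_items(buf):
    """Yield (data_type, key, has_data, null_zero, data) for each item."""
    i = 0
    while i < len(buf):
        control = buf[i]
        i += 1
        data_type = control >> 4
        has_data = bool(control & 0x08)
        null_zero = bool(control & 0x04)
        key_type = control & 0x03
        if data_type == 15:              # extended type number follows
            data_type, i = decode_uvarint(buf, i)
        key = None
        if key_type == 1:                # integer key
            key, i = decode_uvarint(buf, i)
        elif key_type in (2, 3):         # string or bytes key, size first
            key_len, i = decode_uvarint(buf, i)
            key = bytes(buf[i:i + key_len])
            i += key_len
            if key_type == 2:
                key = key.decode("utf-8")
        data = b""
        if has_data and data_type != B3_BOOL:
            data_len, i = decode_uvarint(buf, i)
            if i + data_len > len(buf):
                raise ValueError("item data runs past end of buffer")
            data = bytes(buf[i:i + data_len])
            i += data_len
        yield data_type, key, has_data, null_zero, data
```

A caller that does not recognise `data_type` can keep `data` as opaque bytes. For LIST and DICT items, calling `iter_items` again on `data` walks the nested items.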
Footnotes
1. The only use-cases we could think of for unknown-size items were a) huge data structures like DB tables, which will have their own sizing anyway, and b) streaming TCP scenarios, which always end up being chunked anyway for a better UX.