Column-oriented binary format for immutable datasets
zebra is a typed data format for storing arbitrary combinations of sums, products and arrays in compressed form. It achieves high compression by decomposing this data into a "struct of arrays" representation on disk. zebra stores data in blocks so that files can be streamed without loading the entire file into memory.
[Image from the Dremel paper]
The "struct of arrays" representation used on disk is very similar to that used by Dremel and Parquet, and is known as column oriented, as opposed to row or record oriented. This allows for higher compression as similar data is grouped together. One notable difference that Zebra has from Dremel and Parquet is that Zebra supports full sum types (i.e. Rust-like enums) rather than just optional fields.
What this looks like on disk can be hard to visualise, so here are a number of examples which show how logical types and data are flattened into the physical column-oriented representation used on disk. You can see in the examples that arrays in adjacent rows are simply concatenated, and a segment descriptor containing the array lengths is stored to keep track of which values belong to which row. Sum types use a tag to indicate which variant is relevant for a given row.
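As a rough illustration, here is a minimal Haskell sketch of that flattening. The types and the `flatten` function are hypothetical and exist only to show the shape of the transformation; they are not zebra's actual API. A logical row containing an array and a sum type is decomposed into flat value columns, a segment descriptor of lengths, and a tag column.

```haskell
import Data.Int (Int64)

-- Hypothetical logical row type, for illustration only (not zebra's API).
data Status = Ok | Failed Int64          -- a sum type with two variants

data Row = Row {
    rowReadings :: [Int64]               -- a nested array per row
  , rowStatus   :: Status
  }

-- The physical, column-oriented layout: flat arrays of leaf values, a
-- segment descriptor of lengths for the nested array, and a tag column
-- plus a payload column for the sum type.
data Columns = Columns {
    readingLengths :: [Int64]   -- segment descriptor: one length per row
  , readingValues  :: [Int64]   -- all readings, concatenated across rows
  , statusTags     :: [Int64]   -- 0 = Ok, 1 = Failed
  , failedCodes    :: [Int64]   -- payloads of the Failed variant only
  }

flatten :: [Row] -> Columns
flatten rows =
  Columns {
      readingLengths = [ fromIntegral (length (rowReadings r)) | r <- rows ]
    , readingValues  = concatMap rowReadings rows
    , statusTags     = [ tag (rowStatus r) | r <- rows ]
    , failedCodes    = [ code | Failed code <- map rowStatus rows ]
    }
  where
    tag Ok         = 0
    tag (Failed _) = 1
```

Reading a row back then means slicing `readingValues` using the running sums of `readingLengths`, and consulting `statusTags` to decide which variant column the row's payload lives in.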
Zebra uses a number of techniques to achieve a relatively high compression ratio, given the speed at which it can be encoded and decoded.
When compressing integers, Zebra uses three techniques stacked on top of each other. First, it uses frame-of-reference encoding, which offsets each value in the column by the midpoint of the column's minimum and maximum, so that large numbers become small numbers.
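A minimal sketch of that step for a signed 64-bit column (the function name is illustrative, not zebra's API):

```haskell
import Data.Int (Int64)

-- Frame-of-reference encoding (sketch): store the midpoint once, then
-- store every value as a small offset from it.
frameOfReference :: [Int64] -> (Int64, [Int64])
frameOfReference [] = (0, [])
frameOfReference xs =
  let lo       = minimum xs
      hi       = maximum xs
      midpoint = lo + (hi - lo) `div` 2   -- written to avoid (lo + hi) overflowing
  in  (midpoint, map (subtract midpoint) xs)

-- ghci> frameOfReference [1000000, 1000003, 999998, 1000001]
-- (1000000,[0,3,-2,1])
```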
Zebra then uses zig-zag encoding to map signed numbers onto unsigned ones, so that small negative numbers can also be encoded using a small number of bits.
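The standard zig-zag construction looks like this in Haskell (a sketch; zebra's exact implementation may differ):

```haskell
import Data.Bits (shiftL, shiftR, xor, (.&.))
import Data.Int (Int64)
import Data.Word (Word64)

-- Zig-zag encoding interleaves signed values onto the unsigned range:
--   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
-- so values close to zero need few bits regardless of sign.
zigZag :: Int64 -> Word64
zigZag n = fromIntegral ((n `shiftL` 1) `xor` (n `shiftR` 63))

unZigZag :: Word64 -> Int64
unZigZag w = fromIntegral (w `shiftR` 1) `xor` negate (fromIntegral (w .&. 1))
```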
Finally, Zebra finds the maximum number of bits required to store any of the numbers in the column. If that happens to be 6 bits, as in the example below, then 16 x 6-bit numbers can be packed into 6 x 16-bit numbers. All of these techniques are detailed and benchmarked in Lemire 2012.
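Below is a sketch of fixed-width bit packing into 16-bit words, under the stated assumptions; it is illustrative only and not zebra's actual packer. With a width of 6 bits, sixteen values (96 bits) occupy exactly six 16-bit words.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word16, Word64)

-- Number of bits needed to represent an (already zig-zagged) value.
-- The width for a whole column is: maximum (map bitsNeeded column).
bitsNeeded :: Word64 -> Int
bitsNeeded 0 = 0
bitsNeeded n = 1 + bitsNeeded (n `shiftR` 1)

-- Pack values of `width` bits each into a stream of 16-bit words.
-- Assumes width <= 48 so the 64-bit accumulator never overflows.
packBits :: Int -> [Word64] -> [Word16]
packBits width = go 0 0
  where
    go acc used vs
      | used >= 16 =
          -- A full output word is ready: emit the low 16 bits.
          fromIntegral (acc .&. 0xffff) : go (acc `shiftR` 16) (used - 16) vs
      | otherwise =
          case vs of
            []       -> [ fromIntegral acc | used > 0 ]   -- trailing partial word
            v : rest -> go (acc .|. (v `shiftL` used)) (used + width) rest

-- ghci> length (packBits 6 (replicate 16 63))
-- 6
```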
For strings, Zebra simply uses Snappy compression; however, because the data is organised in columns, similar strings sit next to each other and it achieves a higher compression ratio than usual.
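As a rough sketch of what that looks like, assuming the `snappy` package from Hackage (`Codec.Compression.Snappy`) and an illustrative function name, a string column can be stored as a lengths column plus one Snappy-compressed blob of concatenated bytes:

```haskell
import qualified Codec.Compression.Snappy as Snappy  -- from the `snappy` package (assumed)
import qualified Data.ByteString as B

-- Sketch only: a string column as a segment descriptor of lengths plus one
-- Snappy-compressed buffer of the concatenated bytes.
compressStringColumn :: [B.ByteString] -> ([Int], B.ByteString)
compressStringColumn strings =
  ( map B.length strings                 -- lengths, kept as an integer column
  , Snappy.compress (B.concat strings)   -- one compressed buffer for the column
  )
```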
Given the current state of this library, its performance sensitivity, and its criticality to ongoing production jobs that are still in a very early/tentative state, there should be mandatory review for production implications. Due to time and availability constraints this is currently @jystic, so any reviews should hold off on merging until he can sweep. Whilst this might be a bit more restrictive than we normally are with most libraries, it is the safest option for the short term while stability is low, churn is high, and the library remains on the critical path for many production tasks.