Skip to content

Column-oriented binary format for immutable datasets

License

Notifications You must be signed in to change notification settings

HuwCampbell/zebra

 
 

Repository files navigation

zebra

Column-oriented binary format for immutable datasets

Overview

zebra is a typed data format for storing arbitrary combinations of sums, products and arrays in compressed form. It achieves high compression by decomposing this data in to a "struct of arrays" representation on disk. zebra stores data in blocks so that files can be streamed without loading the entire file in memory.

Column-oriented

Image from the Dremel paper

The "struct of arrays" representation used on disk is very similar to that used by Dremel and Parquet, and is known as column oriented, as opposed to row or record oriented. This allows for higher compression as similar data is grouped together. One notable difference that Zebra has from Dremel and Parquet is that Zebra supports full sum types (i.e. Rust-like enums) rather than just optional fields.

What this looks like on disk can be hard to visualise, here are a number of examples which show how logical types and data are flattened into the physical column oriented representation which is used on disk. You can see in the examples that arrays which exist in adjacent rows are simply concatenated and a segment descriptor with the array lengths is stored to keep track of which values belong to which row. Sum types use a tag to indicate which variant is relevant for a given row.

Example 1

Type

Data

Example 2

Type

Data

Example 3

Type

Data

Example 4

Type

Data

Compression

Zebra uses a number of techniques to achieve a relatively high compression ratio, given the speed at which it can be encoded and decoded.

When compressing integers, Zebra uses three techniques stacked on top of each other. First, it uses a frame of reference encoding which offsets each value in the column by the midpoint of the minimum and the maximum so that big numbers become small numbers.

Zebra then uses zig-zag encoding to make all numbers positive, so that small negative numbers can be encoded using a small number of bits.

Finally Zebra finds the maximum number of bits required to store each of the numbers in the column. If that happens to be 6-bits as in the example below, then 16 x 6-bit numbers can be packed in to 6 x 16-bit numbers. All of these techniques are detailed and benchmarked in Lemire 2012.

For strings, Zebra simply uses Snappy compression, however because data is organised in columns, it achieves a higher than usual compression ratio.

Mandatory Reviewers

Given the current state of this library, its performance sensitivity, and is criticality to ongoing production jobs that are still in a very early/tentative state. There should be some mandatory review, for production implications. Due to time and availability constraints this is currently @jystic. So any reviews should hold off on merge until he can sweep. Whilst this might be a bit more restrictive then we normally are on most libraries, it is the safest option for the short term while stability is low / churn is high, and it remains on critical path for many production tasks.

About

Column-oriented binary format for immutable datasets

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Haskell 83.7%
  • C 12.5%
  • Shell 1.2%
  • Roff 1.2%
  • Makefile 0.8%
  • Objective-C 0.4%
  • C++ 0.2%