feat: add tdigest data structure for statistics (#71)
This pull request implements a simplified version of [Ted Dunning's
TDigest algorithm](https://arxiv.org/pdf/1902.04023.pdf) for efficient
quantile/cdf computations.

It supports a fully parallelizable, memory-bounded computation scheme,
along with an easy API.
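
As a rough illustration of why the scheme parallelizes (a sketch, not the code in this commit): two digests built independently can be combined by pooling their centroids, sorting them by mean, and greedily re-clustering under the paper's `k_1` scale function, which bounds the number of centroids by the compression parameter. The names `Centroid`, `merge`, and `delta` are assumptions made for this example.

```rust
use std::f64::consts::PI;

/// A cluster of nearby values summarized by its mean and total weight.
/// (Hypothetical type for this sketch.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Centroid {
    pub mean: f64,
    pub weight: f64,
}

/// Scale function k_1 from the paper: k(q) = delta / (2*pi) * asin(2q - 1).
fn k(q: f64, delta: f64) -> f64 {
    delta / (2.0 * PI) * (2.0 * q - 1.0).clamp(-1.0, 1.0).asin()
}

/// Inverse of k_1, saturating at q = 1 once k reaches its maximum delta / 4.
fn k_inv(k: f64, delta: f64) -> f64 {
    if k >= delta / 4.0 {
        return 1.0;
    }
    ((2.0 * PI * k / delta).sin() + 1.0) / 2.0
}

/// Merges two digests into one whose size is bounded by `delta`.
/// Each input may have been built on a different thread.
pub fn merge(a: &[Centroid], b: &[Centroid], delta: f64) -> Vec<Centroid> {
    let mut all: Vec<Centroid> = a.iter().chain(b).copied().collect();
    if all.is_empty() {
        return all;
    }
    all.sort_by(|x, y| x.mean.total_cmp(&y.mean));
    let total: f64 = all.iter().map(|c| c.weight).sum();

    let mut out = Vec::new();
    let mut cur = all[0];
    let mut done = 0.0; // weight of centroids already emitted
    let mut q_limit = k_inv(k(0.0, delta) + 1.0, delta);
    for c in &all[1..] {
        let q = (done + cur.weight + c.weight) / total;
        if q <= q_limit {
            // Absorb `c` into the current cluster (weighted mean).
            let w = cur.weight + c.weight;
            cur.mean = (cur.mean * cur.weight + c.mean * c.weight) / w;
            cur.weight = w;
        } else {
            // The current cluster is full: emit it and start a new one.
            out.push(cur);
            done += cur.weight;
            q_limit = k_inv(k(done / total, delta) + 1.0, delta);
            cur = *c;
        }
    }
    out.push(cur);
    out
}
```

Because `merge` only needs the two centroid lists, workers can each summarize a shard of the data and the partial digests can be reduced pairwise in any order; clusters stay small near the tails and grow toward the median.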

It is simplified in two respects:
- Linear interpolation is only done for the quantile, not the CDF.
- Linear interpolation in general could be done more efficiently by
leveraging unit-weighted centroids at the edges, as described in the
paper.
 
Both are marked as TODOs in the code for future work.
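
To make the first simplification concrete, here is a minimal sketch (not the code in this commit) of a quantile query that interpolates linearly between centroid means, next to a CDF query that does not interpolate and simply sums the weight of centroids at or below the query point. It assumes the centroids are sorted by mean with positive weights, and that each centroid's mean sits at the midpoint of its weight range; `Centroid`, `quantile`, and `cdf` are names invented for this example.

```rust
/// A cluster of nearby values summarized by its mean and total weight.
/// (Hypothetical type for this sketch.)
#[derive(Clone, Copy, Debug)]
pub struct Centroid {
    pub mean: f64,
    pub weight: f64,
}

/// Estimates the value at quantile `q` in [0, 1], interpolating linearly
/// between the two centroids whose midpoints bracket the target rank.
pub fn quantile(centroids: &[Centroid], q: f64) -> Option<f64> {
    if centroids.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let total: f64 = centroids.iter().map(|c| c.weight).sum();
    let target = q * total;

    let mut cum = 0.0; // weight strictly before the current centroid
    let mut prev: Option<(f64, f64)> = None; // (midpoint rank, mean)
    for c in centroids {
        let mid = cum + c.weight / 2.0;
        if target <= mid {
            return Some(match prev {
                // Left of the first midpoint: clamp to the first mean.
                None => c.mean,
                Some((pm, pmean)) => {
                    pmean + (c.mean - pmean) * (target - pm) / (mid - pm)
                }
            });
        }
        prev = Some((mid, c.mean));
        cum += c.weight;
    }
    // Right of the last midpoint: clamp to the last mean.
    centroids.last().map(|c| c.mean)
}

/// Estimates P(X <= x) as a step function, without interpolation.
pub fn cdf(centroids: &[Centroid], x: f64) -> Option<f64> {
    if centroids.is_empty() {
        return None;
    }
    let total: f64 = centroids.iter().map(|c| c.weight).sum();
    let below: f64 = centroids
        .iter()
        .take_while(|c| c.mean <= x)
        .map(|c| c.weight)
        .sum();
    Some(below / total)
}
```

With four unit-weight centroids at means 1, 2, 3, and 4, `quantile` at 0.5 lands halfway between the second and third means, while `cdf` at 2.5 counts only the two centroids at or below that point.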

Most open-source implementations found online had bugs, were incomplete,
or were too complex (i.e., **poorly written**).

Includes unit tests on uniform and weighted distributions.

Next steps: Implementing the HyperLogLog algorithm for NDistinct.
AlSchlo authored Feb 24, 2024
1 parent 442600e commit f390059
Showing 7 changed files with 410 additions and 0 deletions.
65 changes: 65 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -6,5 +6,6 @@ members = [
"optd-datafusion-repr",
"optd-sqlplannertest",
"optd-adaptive-demo",
"gungnir",
]
resolver = "2"
2 changes: 2 additions & 0 deletions README.md
@@ -37,6 +37,8 @@ The documentation is available in the mdbook format in the [docs](docs) director
* `optd-datafusion-repr`: Representation of Apache Arrow Datafusion plan nodes in optd.
* `optd-adaptive-demo`: Demo of adaptive optimization capabilities of optd. More information available in the [docs](docs/).
* `optd-sqlplannertest`: Planner test of optd based on [risinglightdb/sqlplannertest-rs](https://github.com/risinglightdb/sqlplannertest-rs).
* `gungnir`: Scalable, memory-efficient, and parallelizable statistical methods for cardinality estimation (e.g. TDigest, HyperLogLog).


# Related Works

11 changes: 11 additions & 0 deletions gungnir/Cargo.toml
@@ -0,0 +1,11 @@
[package]
name = "gungnir"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
itertools = "0.11"
rand = "0.8"
crossbeam = "0.8"
3 changes: 3 additions & 0 deletions gungnir/src/lib.rs
@@ -0,0 +1,3 @@
#![allow(clippy::new_without_default)]

pub mod stats;
1 change: 1 addition & 0 deletions gungnir/src/stats.rs
@@ -0,0 +1 @@
mod tdigest;