optimize: speed up stat gen by factor x15 (#167)

Made stat generation faster using Rayon's thread pool. Various improvements, such as:

- slightly less copying
- optimized MG
- process in parallel using Rayon, taking advantage of fold/reduce
- moved to Parquet (WIP to convert into Parquet automatically)
- read Parquet in parallel (one reader for each row-group); this granularity is sufficient for big enough datasets

This takes 30s for JOB 1D stats on my computer, vs 7:30min before. Postgres takes 1:30min for loading and 22s for the stat gen, so we beat it, depending on how we view it. On "real" datacenter hardware (i.e. 512 cores), we would **crush** it; we'll test that soon.

Finally coming together :-)
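For readers unfamiliar with the fold/reduce pattern mentioned above, here is a minimal sketch of the idea: each row group is folded into a partial statistic by its own worker, and the partials are then merged. It is not the code from this commit. It uses `std::thread::scope` in place of Rayon so that it has no dependencies, and `Counts`, `fold_row_group`, `merge`, and `parallel_counts` are hypothetical names chosen for illustration. Row groups are plain in-memory vectors here rather than Parquet readers.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical per-column statistic: value -> occurrence count.
type Counts = HashMap<i64, u64>;

/// "Fold" step: build a partial result from a single row group.
fn fold_row_group(rows: &[i64]) -> Counts {
    let mut counts = Counts::new();
    for &v in rows {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
}

/// "Reduce" step: merge two partial results into one.
fn merge(mut a: Counts, b: Counts) -> Counts {
    for (k, n) in b {
        *a.entry(k).or_insert(0) += n;
    }
    a
}

/// Spawn one scoped worker per row group, then merge the partial results.
/// Assumption: plain std threads stand in for Rayon's thread pool here.
fn parallel_counts(row_groups: &[Vec<i64>]) -> Counts {
    thread::scope(|s| {
        let handles: Vec<_> = row_groups
            .iter()
            .map(|g| s.spawn(move || fold_row_group(g)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .fold(Counts::new(), merge)
    })
}
```

Calling `parallel_counts(&groups)` on a slice of row groups returns the merged counts. Because `merge` is associative and the empty map is its identity, the partial results can be combined in any grouping, which is the property that lets a work-stealing pool such as Rayon's split the work freely.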
cmu-db · Apr 30, 2024 · 74dc3ff
1 parent 5528eec
commit 74dc3ff

Showing 11 changed files with 411 additions and 244 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/optd-datafusion-repr/Cargo.toml b/optd-datafusion-repr/Cargo.toml
@@ -25,6 +25,7 @@ assert_approx_eq = "1.1.0"
 serde = { version = "1.0", features = ["derive"] }
 serde_with = {version = "3.7.0", features = ["json"]}
 bincode = "1.3.3"
+rayon = "1.10"
 union-find = { git = "https://github.com/Gun9niR/union-find-rs.git", rev = "794821514f7daefcbb8d5f38ef04e62fc18b5665" }
 test-case = "3.3"
 chrono = "0.4"