[ENH]: add crate to manage datasets for benchmarking #2797

codetheweb · 2024-09-14T00:07:53Z

Description of changes

Adds a new crate, `benchmark-datasets`, with utilities to download, cache, and stream three datasets: an English Wikipedia article snapshot, the SciDocs corpus, and Microsoft's MS MARCO search query dataset.

Additionally, there are helpers to construct a subset of search queries where every query in the subset has at least N results in a specified corpus.
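
As a rough illustration of the subset idea only (not the crate's actual API): the sketch below keeps queries that match at least `min_results` documents, capped at `max_queries`. The function name, the in-memory slices, and the use of plain substring matching as the definition of a "result" are all assumptions made for this example.

```rust
/// Hypothetical helper, not part of `benchmark-datasets`.
/// Returns up to `max_queries` queries that each match at least
/// `min_results` documents in `corpus`.
/// Assumption: a "match" is a case-sensitive substring test.
fn queries_with_min_results<'a>(
    queries: &[&'a str],
    corpus: &[&str],
    min_results: usize,
    max_queries: usize,
) -> Vec<&'a str> {
    queries
        .iter()
        .copied()
        .filter(|query| {
            // Stop scanning the corpus once the threshold is reached.
            corpus
                .iter()
                .filter(|doc| doc.contains(*query))
                .take(min_results)
                .count()
                >= min_results
        })
        // Keep the original query order and cap the subset size.
        .take(max_queries)
        .collect()
}
```

The early `take(min_results)` matters for large corpora: a query only needs to be counted up to the threshold, not across every document.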

Test plan

How are these changes tested?

Partly covered by tests in rust/benchmark-datasets/src/types.rs; the crate will be exercised more thoroughly by the next PR in this stack.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

n/a

github-actions · 2024-09-14T00:08:04Z

codetheweb · 2024-09-17T22:11:56Z

[ENH]: add full text index querying benchmark #2816
[ENH]: add crate to manage datasets for benchmarking #2797 👈
main

This stack of pull requests is managed by Graphite.

HammadB · 2024-09-20T00:19:07Z

rust/benchmark-datasets/src/types.rs

+    /// Returns a subset of queries from the dataset that have at least `min_results_per_query` results in the `corpus_dataset`.
+    /// The subset will contain at most `max_num_of_queries` queries.
+    ///
+    /// Because constructing this subset can be expensive (and different subsets may lead to different downstream test results), by default the constructed subset is stored in the `dataset_files/` directory in the root of this crate.


We should maybe use Git LFS for that folder.

These files are pretty small: assuming an average of 32 bytes per query, 10,000 queries (more than we would ever need) come to only 320 kB.
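
For context, the store-on-first-build behaviour described in the doc comment could look roughly like the sketch below. This is an assumption-laden illustration, not the code in this PR: the function name, the one-query-per-line text format, and the assumption that queries contain no newlines are all hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper, not part of `benchmark-datasets`.
/// Loads a cached query subset from `cache_path` if the file exists;
/// otherwise calls `build`, stores the result, and returns it.
/// Assumption: one query per line, and no query contains a newline.
fn load_or_build_subset<F>(cache_path: &Path, build: F) -> io::Result<Vec<String>>
where
    F: FnOnce() -> Vec<String>,
{
    if cache_path.exists() {
        // Reuse the stored subset so repeated runs see the same queries.
        let contents = fs::read_to_string(cache_path)?;
        return Ok(contents.lines().map(str::to_owned).collect());
    }

    // Cache miss: run the expensive construction once.
    let subset = build();
    if let Some(parent) = cache_path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(cache_path, subset.join("\n"))?;
    Ok(subset)
}
```

Files written this way are plain text, so they stay small and diff cleanly if they are checked in.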

HammadB · 2024-09-20T00:38:45Z

rust/benchmark-datasets/src/datasets/wikipedia.rs

+                    let client = reqwest::Client::new();
+                    let response = client
+                        .get(
+                            "https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=1", // todo: less sketchy source


rust/benchmark-datasets/src/datasets/wikipedia.rs

HammadB

nits

codetheweb · 2024-09-24T16:38:50Z

Merge activity

Sep 24, 12:38 PM EDT: @codetheweb started a stack merge that includes this pull request via Graphite.
Sep 24, 12:40 PM EDT: Graphite rebased this pull request as part of a merge.
Sep 24, 12:41 PM EDT: @codetheweb merged this pull request with Graphite.

codetheweb force-pushed the feat-benchmark-datasets branch 3 times, most recently from 3cc654c to c221e99 Compare September 17, 2024 22:11

codetheweb mentioned this pull request Sep 17, 2024

[ENH]: add full text index querying benchmark #2816

Merged

codetheweb force-pushed the feat-benchmark-datasets branch 2 times, most recently from c5ab47a to f958d1a Compare September 19, 2024 23:26

codetheweb marked this pull request as ready for review September 19, 2024 23:26

codetheweb requested a review from HammadB September 19, 2024 23:27