[ENH]: add crate to manage datasets for benchmarking #2797
/// Returns a subset of queries from the dataset that have at least `min_results_per_query` results in the `corpus_dataset`.
/// The subset will contain at most `max_num_of_queries` queries.
///
/// Because constructing this subset can be expensive (and different subsets may lead to different downstream test results), by default the constructed subset is stored in the `dataset_files/` directory in the root of this crate.
we should use git lfs for that folder maybe
These files are pretty small: assuming queries average, say, 32 bytes each, 10,000 queries (more than we would ever need) is only 320 kB.
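For readers following this thread, here is a minimal sketch of the behavior the doc comment describes: pick at most `max_num_of_queries` queries that each have at least `min_results_per_query` results, and store the result on disk so later runs reuse the same subset. This is not the crate's actual code. The function names `select_query_subset` and `load_or_build_subset`, the `count_results` closure, and the one-query-per-line file format are all assumptions made for illustration, and the sketch assumes queries contain no newlines.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not the crate's API): keeps queries for which
/// `count_results` reports at least `min_results_per_query` hits, and
/// stops once `max_num_of_queries` queries have been collected.
fn select_query_subset<F>(
    queries: &[String],
    min_results_per_query: usize,
    max_num_of_queries: usize,
    mut count_results: F,
) -> Vec<String>
where
    F: FnMut(&str) -> usize,
{
    queries
        .iter()
        .filter(|q| count_results(q.as_str()) >= min_results_per_query)
        // `take` is lazy, so no further queries are counted after the cap.
        .take(max_num_of_queries)
        .cloned()
        .collect()
}

/// Hypothetical cache (assumed format: one query per line). Loads the subset
/// from `cache_path` if the file exists; otherwise calls `build` and writes
/// the result so the same subset is reused on later runs.
fn load_or_build_subset<B>(cache_path: &Path, build: B) -> io::Result<Vec<String>>
where
    B: FnOnce() -> Vec<String>,
{
    if cache_path.exists() {
        let text = fs::read_to_string(cache_path)?;
        return Ok(text.lines().map(str::to_owned).collect());
    }
    let subset = build();
    if let Some(parent) = cache_path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(cache_path, subset.join("\n"))?;
    Ok(subset)
}
```

At the estimate given above (32 bytes × 10,000 queries = 320,000 bytes), a plain-text cache like this stays small enough to commit directly, which is the point being made about Git LFS.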
let client = reqwest::Client::new();
let response = client
    .get(
        "https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=1", // todo: less sketchy source
lol
nits
## Description of changes

Adds a new crate, `benchmark-datasets`. Contains utils to download, cache, and stream from three separate datasets: an English Wikipedia article snapshot, the SciDocs corpus, and the Microsoft MARCO search query dataset. Additionally, there are helpers to construct a subset of search queries where every query in the subset has at least N results in a specified corpus.

## Test plan

*How are these changes tested?*

Partly tested in `rust/benchmark-datasets/src/types.rs`, but will be more comprehensively consumed in the next PR in this stack.

## Documentation Changes

*Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs repository](https://github.com/chroma-core/docs)?*

n/a