This is a benchmarking tool for Qdrant's sparse vector implementation using the NeurIPS 2023 datasets.
This task is based on the common MSMARCO passage retrieval dataset, which has 8,841,823 text passages, encoded into sparse vectors using the SPLADE model. The vectors have a large dimension (about 30,000), but each vector in the base dataset has an average of approximately 120 nonzero elements. The query set contains 6,980 text queries, embedded by the same SPLADE model. The average number of nonzero elements in the query set is approximately 49 (since text queries are generally shorter). Given a sparse query vector, the index should return the top-k results according to the maximal inner product between the vectors.
This section quotes the big-ann-benchmarks repo.
Name | Description | download link | #rows | ground truth |
---|---|---|---|---|
full |
Full base dataset | 5.5 GB | 8,841,823 | 545K |
1M |
1M slice of base dataset | 636.3 MB | 1,000,000 | 545K |
small |
100k slice of base dataset | 64.3 MB | 100,000 | 545K |
queries.dev |
queries file | 1.8 MB | 6,980 | N/A |
The datasets will be automatically downloaded and extracted into the data
folder when running the benchmark.
pyenv install 3.10.10
- install python 3.10.10pyenv local 3.10.10
- set python versionpip install virtualenv
- install venv managervirtualenv venv
- create virtual envsource venv/bin/activate
- enter venvpip install -r requirements.txt
- install dependencies
Usage: main.py [OPTIONS]
Sparse vector benchmark tool for Qdrant.
Options:
--host TEXT The host of the Qdrant server
--skip-creation BOOLEAN Whether to skip collection creation
--dataset TEXT Dataset to use: small, 1M, full
--slow-ms INTEGER Slow query threshold in milliseconds
--search-limit INTEGER Search limit
--data-path TEXT Path to the data files
--results-path TEXT Path to the results files
--segment-number INTEGER Number of segments
--analyze-data BOOLEAN Whether to analyze data
--check-ground-truth BOOLEAN Whether to check results against ground
truth
--graph-y-range TEXT Y axis range for the graph to help compare
plots
--upsert-batch-size INTEGER Number of vectors per batch upserts
--parallel-batch-upsert INTEGER
Number of parallel batch upserts
--on-disk-index BOOLEAN Whether to use on-disk index
--help Show this message and exit.
e.g. to create a collection from the small
dataset:
python main.py --skip-creation false --dataset small
The results are printed in the console and additional plots are stored in the results
folder.
e.g. for the full
dataset with Azure instances with separated client from server
- server
Standard D8ds v5 (8 vcpus, 32 GiB memory)
- client
Standard D4s v3 (4 vcpus, 16 GiB memory)
Shows the distribution of the number of non-zero elements in the sparse vectors for the small
dataset.