This repository is the virtual appendix for the paper "Static Pruning for Multi-Representation Dense Retrieval", published at ACM DocEng 2023.
It uses PyTerrier and PyTerrier_ColBERT to demonstrate how static pruning methods can reduce the size of ColBERT indices without loss of effectiveness.
This repository contains four notebooks:
- idf-pruning.ipynb - PyTerrier notebook to run the Original, IDF doc-centric and IDF-uniform approaches from Section 5.1, as well as create the indices for Section 5.2.
- baseline-pruning.ipynb - PyTerrier notebook to run the random doc-centric and stopword approaches from Section 5.1.
- pruned-indices.ipynb - PyTerrier notebook to create the runs for the indices in Section 5.2.
- results-analysis.ipynb - PyTerrier notebook to create the figures for Section 5.1.
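As a rough illustration of the difference between the doc-centric and uniform IDF approaches named above, here is a minimal pure-Python sketch. It is not the code from the notebooks: it works on plain tokens rather than ColBERT embeddings, and the function names, the `keep_ratio` and `min_idf` parameters, and the toy corpus are all assumptions made for this example. The assumed reading is that doc-centric pruning keeps a fixed share of the highest-IDF tokens in each document, while uniform pruning applies one collection-wide IDF threshold.

```python
import math
from collections import Counter


def compute_idf(docs):
    """Return log(N / df) for every token in a list of tokenised documents."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


def prune_doc_centric(doc, idf, keep_ratio):
    """Keep the highest-IDF positions of one document, preserving token order.

    keep_ratio is a hypothetical parameter: the share of tokens to retain.
    """
    n_keep = max(1, math.ceil(len(doc) * keep_ratio))
    # Rank positions by descending IDF; ties are broken by original position.
    ranked = sorted(range(len(doc)), key=lambda i: (-idf[doc[i]], i))
    kept = sorted(ranked[:n_keep])
    return [doc[i] for i in kept]


def prune_uniform(doc, idf, min_idf):
    """Drop every token whose IDF falls below one collection-wide threshold.

    min_idf is a hypothetical parameter: the global IDF cut-off.
    """
    return [tok for tok in doc if idf[tok] >= min_idf]


# Toy corpus, invented for illustration only.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
idf = compute_idf(docs)

print(prune_doc_centric(docs[0], idf, keep_ratio=0.5))  # ['cat', 'on', 'mat']
print(prune_uniform(docs[0], idf, min_idf=0.1))         # ['cat', 'sat', 'on', 'mat']
```

In this sketch the doc-centric variant removes the same share of every document, whereas the uniform variant can remove very different amounts from different documents. The notebooks remain the reference for how the paper's experiments were actually run.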
In order to run the above notebooks, you will require a PyTerrier ColBERT index of the MSMARCO v1 passage ranking corpus. This can be created using the following PyTerrier code:
import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

# ColBERT model checkpoint used to encode the passages
checkpoint = "http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip"

# Build the index named "index_name" under /path/to/index
indexer = ColBERTIndexer(checkpoint, "/path/to/index", "index_name", chunksize=3)
indexer.index(pt.get_dataset("msmarco_passage").get_corpus_iter())

# Notebook shell command: expose the FAISS file under the name ivfpq.faiss
!ln -s /path/to/index/index_name/ivfpq.262144.faiss /path/to/index/index_name/ivfpq.faiss
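The `!ln -s` line only works inside a notebook. When running the indexing code as a plain Python script, the same symbolic link could be created with `pathlib`, as in the sketch below. The helper name `link_faiss_index` is hypothetical, and the two default file names are simply copied from the command above.

```python
from pathlib import Path


def link_faiss_index(index_dir, source_name="ivfpq.262144.faiss", link_name="ivfpq.faiss"):
    """Create index_dir/link_name as a symlink to index_dir/source_name.

    Hypothetical helper; the default names mirror the ln -s command.
    """
    index_dir = Path(index_dir)
    source = index_dir / source_name
    link = index_dir / link_name
    if not source.exists():
        raise FileNotFoundError(f"FAISS file not found: {source}")
    if link.is_symlink() or link.exists():
        # Leave any existing file or link untouched.
        return link
    link.symlink_to(source.resolve())
    return link
```

Calling `link_faiss_index("/path/to/index/index_name")` after indexing has finished would then play the role of the shell command. The function raises `FileNotFoundError` if the source FAISS file is missing, which makes an incomplete indexing run easier to spot.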
The final index occupies 185GB of disk space, as mentioned in the paper.
A separate notebook provides all indexing and results for the TREC Covid corpus.
The following packages are required:
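To compare the on-disk size of an index (original or pruned) against this figure, a small standard-library helper such as the following may be handy. It is a sketch rather than part of the repository: the function names are hypothetical, and it assumes decimal gigabytes (10^9 bytes), which may or may not match the convention used in the paper. Symbolic links are skipped so that a linked `ivfpq.faiss` file is not counted twice.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def format_gb(n_bytes):
    """Format a byte count in decimal gigabytes (assumed: 1 GB = 10**9 bytes)."""
    return f"{n_bytes / 1e9:.1f}GB"
```

For example, `format_gb(directory_size_bytes("/path/to/index/index_name"))` would report the size of the index directory used above.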
pip install git+https://github.com/terrierteam/pyterrier_colbert.git
pip install setuptools==59.5.0
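Because the second command pins an exact `setuptools` version, it can be useful to confirm the pin from Python before launching the notebooks. The sketch below uses only `importlib.metadata` from the standard library; the helper name `check_pin` is hypothetical and is not part of PyTerrier or this repository.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, expected):
    """Return (ok, installed); installed is None if the package is missing."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False, None
    return installed == expected, installed
```

A call such as `ok, installed = check_pin("setuptools", "59.5.0")` returns `ok == True` only when the installed version matches the pin exactly; otherwise `installed` shows which version was found.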
@inproceedings{doceng2023,
title = {Static Pruning for Multi-Representation Dense Retrieval},
author = {Antonio Acquavia and Craig Macdonald and Nicola Tonellotto},
booktitle = {Proceedings of ACM DocEng},
year = {2023},
}
- Antonio Acquavia, University of Pisa
- Craig Macdonald, University of Glasgow
- Nicola Tonellotto, University of Pisa