This page describes how to reproduce the learning-to-rank (LTR) experiments in the following paper:
Yue Zhang, Chengcheng Hu, Yuqi Liu, Hui Fang, and Jimmy Lin. Learning to Rank in the Age of Muppets: Effectiveness–Efficiency Tradeoffs in Multi-Stage Ranking. Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 64–73, 2021.
This guide contains instructions for running the LTR baseline on the MS MARCO passage reranking task. LTR serves as a second-stage reranker after BM25 retrieval.
We're going to use the repository root as the working directory.
mkdir collections/msmarco-ltr-passage
Download our already trained IBM model:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/model-ltr-ibm.tar.gz -P collections/msmarco-ltr-passage/
tar -xzvf collections/msmarco-ltr-passage/model-ltr-ibm.tar.gz -C collections/msmarco-ltr-passage/
Download our already trained LTR model:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/model-ltr-msmarco-passage-mrr-v1.tar.gz -P collections/msmarco-ltr-passage
tar -xzvf collections/msmarco-ltr-passage/model-ltr-msmarco-passage-mrr-v1.tar.gz -C collections/msmarco-ltr-passage/
The following command generates our reranking result with our prebuilt index:
python -m pyserini.search.lucene.ltr \
--index msmarco-passage-ltr \
--model collections/msmarco-ltr-passage/msmarco-passage-ltr-mrr-v1 \
--ibm-model collections/msmarco-ltr-passage/ibm_model/ \
--topic tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--output runs/run.ltr.msmarco-passage.tsv
Inference speed will vary; on our orca machine, the run takes about half an hour to complete.
Note that internally, retrieval depends on tokenization with spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
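If in doubt, a quick check along the following lines can catch a version mismatch before the (fairly long) run. This is only a sketch; the blank English pipeline is for illustration, and the actual feature extraction may configure spaCy differently:

import spacy

# The guide assumes spaCy v3.2.1; warn if a different version is installed,
# since tokenization can change between releases.
expected = "3.2.1"
if spacy.__version__ != expected:
    print(f"Warning: found spaCy {spacy.__version__}, this guide assumes {expected}")

# Tokenize a sample query with a blank English pipeline, just to eyeball the output.
nlp = spacy.blank("en")
print([token.text for token in nlp("what is the capital of france")])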
Here, our model was trained to maximize MRR@10.
We can also train other models from scratch, following the training guide, and replace the --model argument with the newly trained model directory.
After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
runs/run.ltr.msmarco-passage.tsv
#####################
MRR @10: 0.24723580979669724
QueriesRanked: 6980
#####################
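For reference, MRR@10 is the reciprocal rank of the first relevant passage within the top 10 results (zero if there is none), averaged over all queries. A toy sketch with made-up data (not values from the run above) illustrates the computation:

# Toy illustration of MRR@10; the rankings and qrels below are made up.
def rr_at_10(ranked_docids, relevant):
    for rank, docid in enumerate(ranked_docids[:10], start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

queries = {
    "q1": (["d2", "d5", "d9"], {"d5"}),  # first relevant at rank 2 -> 0.5
    "q2": (["d1", "d3", "d4"], {"d8"}),  # no relevant passage in top 10 -> 0.0
}
mrr10 = sum(rr_at_10(run, rel) for run, rel in queries.values()) / len(queries)
print(mrr10)  # 0.25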
We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10.
For that we first need to convert the run file into TREC format:
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.ltr.msmarco-passage.tsv --output runs/run.ltr.msmarco-passage.trec
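The conversion only changes the layout of each line: the MS MARCO run format is a tab-separated (qid, docid, rank) triple, while the TREC format adds the literal "Q0" column, a score, and a run tag. A rough sketch of the per-line transformation, with made-up values (the actual converter's score and tag may differ):

# Sketch of the per-line conversion from MS MARCO to TREC run format.
# The qid/docid values are made up; the real script's scoring and tag may differ.
msmarco_line = "101\t7187158\t1"           # qid \t docid \t rank
qid, docid, rank = msmarco_line.strip().split("\t")
score = 1.0 / int(rank)                    # a simple rank-based score for illustration
trec_line = f"{qid} Q0 {docid} {rank} {score} ltr"
print(trec_line)                           # 101 Q0 7187158 1 1.0 ltr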
And then run the trec_eval tool:
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap \
msmarco-passage-dev-subset runs/run.ltr.msmarco-passage.trec
map all 0.2552
recall_1000 all 0.8573
Average precision (AP) and recall@1000 (recall at rank 1000) are the two metrics we care about the most. AP captures aspects of both precision and recall in a single metric and is the most common metric used by information retrieval researchers; trec_eval reports its mean over all queries, i.e., mean average precision (MAP). On the other hand, recall@1000 provides the upper-bound effectiveness of downstream reranking modules (i.e., rerankers are useless if there isn't a relevant document in the results).
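To make the two metrics concrete, here is a small toy sketch (with made-up rankings and qrels, not values from the run above) that computes AP and recall@k for a single query; trec_eval reports their averages over all queries:

# Toy illustration of average precision (AP) and recall@k for one query.
def average_precision(ranked_docids, relevant):
    # Mean of precision@i over the ranks i at which a relevant doc appears.
    hits, precision_sum = 0, 0.0
    for i, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def recall_at_k(ranked_docids, relevant, k=1000):
    # Fraction of the relevant docs that appear in the top k.
    return len(set(ranked_docids[:k]) & relevant) / len(relevant) if relevant else 0.0

ranked = ["d3", "d7", "d1", "d9", "d4"]   # made-up ranking
relevant = {"d1", "d4"}                   # made-up qrels for this query
print(average_precision(ranked, relevant))      # (1/3 + 2/5) / 2 ≈ 0.3667
print(recall_at_k(ranked, relevant, k=1000))    # 1.0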
To build an index from scratch, we need to preprocess the collection:
First, download the MS MARCO passage dataset collectionandqueries.tar.gz, per instructions here.
python scripts/ltr_msmarco/convert_passage.py \
--input collections/msmarco-passage/collection.tsv \
--output collections/msmarco-ltr-passage/ltr_collection.json
The above script will convert the collection to JSON files with text_unlemm, analyzed, text_bert_tok, and raw fields.
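A quick way to sanity-check the conversion is to inspect the first record and confirm all four fields are present. This sketch assumes the converter writes one JSON object per line; adjust the reading logic if the actual output layout differs:

import json

# Print the field names of the first converted record as a sanity check.
path = "collections/msmarco-ltr-passage/ltr_collection.json"
with open(path) as f:
    first_record = json.loads(f.readline())
print(sorted(first_record.keys()))
# Expect to see at least: analyzed, raw, text_bert_tok, text_unlemm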
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
Next, we need to convert the MS MARCO JSON collection into Anserini's JSONL format:
python scripts/ltr_msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-ltr-passage/ltr_collection.json \
--output-folder collections/msmarco-ltr-passage/ltr_collection_jsonl
The above script should generate nine JSONL files in collections/msmarco-ltr-passage/ltr_collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
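To verify the split, a simple sketch that counts the lines in every file of the output folder should add up to the 8,841,823 passages in the collection:

import glob

# Count lines per generated JSONL file and in total.
files = sorted(glob.glob("collections/msmarco-ltr-passage/ltr_collection_jsonl/*"))
total = 0
for path in files:
    with open(path) as f:
        count = sum(1 for _ in f)
    total += count
    print(f"{path}: {count:,} lines")
print(f"total passages: {total:,}")   # expect 8,841,823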
We can now index these docs as a JsonCollection:
python -m pyserini.index.lucene \
--collection JsonCollection \
--input collections/msmarco-ltr-passage/ltr_collection_jsonl \
--index indexes/lucene-index-msmarco-passage-ltr \
--generator DefaultLuceneDocumentGenerator \
--threads 9 \
--storePositions --storeDocvectors --storeRaw --pretokenized
Note that the --pretokenized option tells Pyserini to use the whitespace analyzer to preserve the existing tokenization.
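To see what this changes, the sketch below contrasts Lucene's default English analysis with plain whitespace splitting. It assumes Pyserini's analysis helpers (Analyzer and get_lucene_analyzer from pyserini.analysis), and the exact stemmed tokens may vary by version:

from pyserini.analysis import Analyzer, get_lucene_analyzer

text = "Running quickly toward the libraries"

# Default Lucene English analysis: lowercasing, stopword removal, stemming.
default_analyzer = Analyzer(get_lucene_analyzer())
print(default_analyzer.analyze(text))   # e.g., ['run', 'quickli', 'toward', 'librari']

# With --pretokenized, documents are instead split on whitespace only, so the
# tokens produced by the preprocessing scripts are indexed exactly as given.
print(text.split())                     # ['Running', 'quickly', 'toward', 'libraries']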
Reproduction Log*
- Results reproduced by @Dahlia-Chehata on 2021-07-17 (commit a6b6545)
- Results reproduced by @lintool on 2022-04-02 (commit 88e9a74)