Pyserini: Train Learning-To-Rank Reranking Models for MS MARCO Passage

Data Preprocessing

Please first follow the Pyserini BM25 retrieval guide to obtain our reranking candidates.

Then, download the file containing the training triples and uncompress it:

wget https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz -P collections/msmarco-passage/
gzip -d collections/msmarco-passage/qidpidtriples.train.full.2.tsv.gz
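If you want to verify the download, the triples file is plain TSV where each row holds a query id, a relevant (positive) passage id, and a non-relevant (negative) passage id. A minimal sketch to peek at the first few rows (the path matches the download above):

```python
# Peek at the first rows of the MS MARCO training triples.
# Assumes the standard layout: <qid, positive pid, negative pid> per row.
import csv

with open('collections/msmarco-passage/qidpidtriples.train.full.2.tsv') as f:
    reader = csv.reader(f, delimiter='\t')
    for i, (qid, pos_pid, neg_pid) in enumerate(reader):
        print(f'qid={qid}  positive pid={pos_pid}  negative pid={neg_pid}')
        if i == 2:  # only the first three rows; the full file is large
            break
```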

Next, we use collections/msmarco-ltr-passage/ as the working directory for the preprocessed data.

mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco/convert_queries.py \
  --input collections/msmarco-passage/queries.eval.small.tsv \
  --output collections/msmarco-ltr-passage/queries.eval.small.json 

python scripts/ltr_msmarco/convert_queries.py \
  --input collections/msmarco-passage/queries.dev.small.tsv \
  --output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco/convert_queries.py \
  --input collections/msmarco-passage/queries.train.tsv \
  --output collections/msmarco-ltr-passage/queries.train.json

The above scripts convert queries into JSON objects with text, text_unlemm, raw, and text_bert_tok fields. The first two scripts take about a minute each; the third takes considerably longer (about 1.5 hours).
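To confirm the conversion worked, you can inspect one converted query. This sketch assumes the output is JSONL (one JSON object per line) with the fields listed above; adjust accordingly if the script writes a single JSON array instead:

```python
# Inspect the first converted query to check the expected fields are present.
import json

with open('collections/msmarco-ltr-passage/queries.dev.small.json') as f:
    example = json.loads(f.readline())

for field in ('text', 'text_unlemm', 'raw', 'text_bert_tok'):
    print(field, '->', example.get(field))
```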

Run the following command to download the pre-built index into the local cache:

python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-passage-ltr')"

Note that you can also build the index from scratch by following this guide.
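As a quick sanity check that the pre-built index loaded correctly, you can issue a keyword query against it (the query text here is just an example):

```python
# Load the pre-built LTR index from the cache and run a test search.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage-ltr')
hits = searcher.search('what is a lobster roll', k=5)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:10} {hit.score:.4f}')
```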

Download pretrained IBM models

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/model-ltr-ibm.tar.gz -P collections/msmarco-ltr-passage/
tar -xzvf collections/msmarco-ltr-passage/model-ltr-ibm.tar.gz -C collections/msmarco-ltr-passage/
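If you want to check what the archive contains, a small sketch that lists its members (the path matches the download above):

```python
# List the contents of the pretrained IBM model tarball.
import tarfile

with tarfile.open('collections/msmarco-ltr-passage/model-ltr-ibm.tar.gz') as tar:
    for member in tar.getmembers():
        print(member.name, member.size)
```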

Training the Model From Scratch

python scripts/ltr_msmarco/train_ltr_model.py  \
 --index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 

Compare the numbers at the bottom of your output with the values below for a quick sanity check.

recall@10:0.48367956064947465
recall@20:0.5796442215854822
recall@50:0.683966093600764
recall@100:0.7545964660936009
recall@200:0.8033428844317098
recall@500:0.8454512893982808
recall@1000:0.8573424068767909
Total training time: XXXX s
Done!

Note that the numbers may vary due to the randomness of LambdaRank. As long as your outputs are close to these values, your training completed correctly.

The training script saves the trained model under runs/ with the run date in the file name. You can pass this file as the --model parameter for reranking.
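Since the file name depends on the run date, a small helper sketch to locate the most recently written model under runs/:

```python
# Find the newest file under runs/, which should be the model just trained.
import glob
import os

candidates = glob.glob('runs/*')
if candidates:
    latest = max(candidates, key=os.path.getmtime)
    print('Pass this file to --model:', latest)
```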

The number of negative samples used in training can be changed with --neg-sample; the default is 10.

Change the Optimization Goal of Your Trained Model

The script trains a model which optimizes MRR@10 by default.

You can change mrr_at_10 in this function and here to recall_at_20 to train a model that optimizes recall@20.

You can also define your own metric function in the same format as this one and change the corresponding places mentioned above to train toward a different optimization goal; see the sketch below.
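As a rough illustration only, a hypothetical recall-style metric. The exact signature must match the mrr_at_10 / recall_at_20 functions in train_ltr_model.py; this sketch assumes the function receives the relevance labels of a query's ranked candidates plus the total number of relevant passages, and returns a single score:

```python
# Hypothetical metric in the spirit of recall_at_20; the real signature
# must mirror the existing metric functions in train_ltr_model.py.
def recall_at_20(sorted_labels, total_relevant):
    """Fraction of relevant passages found in the top 20 ranked candidates."""
    if total_relevant == 0:
        return 0.0
    return sum(sorted_labels[:20]) / total_relevant
```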

Reproduction Log*