
Query Performance Prediction using Relevance Judgments Generated by Large Language Models

This repository complements the following papers:

  1. Query Performance Prediction using Relevance Judgments Generated by Large Language Models
    • In this paper, we propose a new query performance prediction (QPP) framework, QPP-GenRE, which first automatically generates relevance judgments for the ranked list returned for a given query, and then regards the generated relevance judgments as pseudo labels to compute different IR evaluation measures. QPP-GenRE can be integrated with various methods for judging relevance; we show its success when equipped with LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct, which we fine-tune to generate relevance judgments automatically.
  2. Can We Use Large Language Models to Fill Relevance Judgment Holes?
    • In this paper, we fine-tune Llama-3-8B and Llama-3-8B-Instruct to generate relevance judgments in the context of conversational search.

This repository is structured into the following parts:

  1. Installation
  2. Query Performance Prediction using Relevance Judgments Generated by Large Language Models
    • 2.1 Prerequisite
    • 2.2 Inference using fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct
    • 2.3 Fine-tuning LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct
    • 2.4 In-context learning using LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct
    • 2.5 Evaluation
    • 2.6 The results of scaled Mean Absolute Ranking Error (sMARE)
  3. Can We Use Large Language Models to Fill Relevance Judgment Holes?
    • 3.1 Prerequisite
    • 3.2 Inference using fine-tuned LLaMA-7B
    • 3.3 Zero-shot prompting using Llama-3-8B and Llama-3-8B-Instruct
    • 3.4 Inference using fine-tuned Llama-3-8B and Llama-3-8B-Instruct
    • 3.5 Fine-tuning Llama-3-8B and Llama-3-8B-Instruct
    • 3.6 Evaluation

⚙️ 1. Installation

Install dependencies

pip install -r requirements.txt

2. Query Performance Prediction using Relevance Judgments Generated by Large Language Models

2.1 Prerequisite

Download datasets

Please first download dataset.zip (containing queries, run files, qrels files and files containing the actual retrieval quality of queries) from here, and then unzip it in the current directory.

Then, please download the prebuilt Lucene indexes of the MS MARCO V1 and V2 passage ranking collections from Pyserini:

wget -P ./datasets/msmarco-v1-passage/ https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.msmarco-v1-passage-full.20221004.252b5e.tar.gz --no-check-certificate
tar -zxvf  ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e.tar.gz -C ./datasets/msmarco-v1-passage/

wget -P ./datasets/msmarco-v2-passage/ https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a.tar.gz --no-check-certificate
tar -zxvf  ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a.tar.gz -C ./datasets/msmarco-v2-passage/

Fetch the original weights of LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct

For LLaMA-7B, please refer to the LLaMA repository to fetch the original weights. Then, follow the instructions from here to convert the original weights to the Hugging Face Transformers format. Next, set your local path to the converted weights as an environment variable, which will be used in the following steps.

export LLAMA_7B_PATH={your path to the weights of LLaMA-7B (Hugging Face Transformers format)}

For Llama-3-8B and Llama-3-8B-Instruct, we can directly fetch weights from Hugging Face. Please set your own token and your cache directory:

export TOKEN={your token to use as HTTP bearer authorization for remote files}
export CACHE_DIR={your cache path that stores the weights of Llama 3}
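
For reference, these two variables are consumed by Hugging Face Transformers when the scripts download and load Llama 3. Below is a minimal sketch of such a load (judge_relevance.py may wire these up differently):

# Sketch: how TOKEN and CACHE_DIR are typically passed to Hugging Face
# Transformers when loading a gated Llama 3 model.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
token = os.environ["TOKEN"]          # HTTP bearer token for gated model files
cache_dir = os.environ["CACHE_DIR"]  # local directory that stores the weights

tokenizer = AutoTokenizer.from_pretrained(model_name, token=token, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(model_name, token=token, cache_dir=cache_dir)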

Download the checkpoints of fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct

For the reproducibility of the results reported in the paper, please download the checkpoints of our fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct.

After downloading, please unzip them in a new directory ./checkpoint/.

Note

We use 4-bit quantized LLaMA-7B for both inference and fine-tuning in this paper; all experiments in our paper were conducted on an NVIDIA A100 Tensor Core GPU (40GB).
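
For context, 4-bit loading in Hugging Face Transformers is commonly configured through bitsandbytes; below is a hedged sketch of such a load (the exact settings used by our scripts may differ):

# Sketch of a typical 4-bit model load via bitsandbytes; illustrative settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype on an A100
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)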

🚀 2.2 Inference using fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct

This part shows how to directly use our released checkpoints of fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct to predict the performance of BM25 and ANCE on the TREC-DL 19, 20, 21 and 22 datasets. Please run judge_relevance.py and predict_measures.py sequentially to finish one prediction for one ranker on one dataset. Specifically, judge_relevance.py automatically generates relevance judgments for a ranked list returned by BM25 or ANCE; the generated relevance judgments are saved to ./output/. predict_measures.py then computes IR evaluation measures, such as RR@10 and nDCG@10, from the generated relevance judgments (pseudo labels); the computed values of an IR evaluation measure are regarded as predicted QPP scores, which are expected to approximate the actual values of that measure. Predicted QPP scores for a dataset are saved to a folder corresponding to that dataset, e.g., QPP scores for BM25 or ANCE on TREC-DL 19 are saved to ./output/dl-19-passage.
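
To make the pipeline concrete, below is a minimal sketch of what predict_measures.py conceptually does: treat the generated binary judgments as pseudo-qrels and compute a measure (nDCG@10 here) per query as the predicted QPP score. The file formats follow standard TREC conventions; the output filename below is a hypothetical placeholder, and the actual script supports more measures and ranking cut-offs.

# Conceptual sketch (not the actual predict_measures.py): compute nDCG@10 per
# query from generated binary relevance judgments (pseudo-qrels) and a TREC run.
import math
from collections import defaultdict

def read_qrels(path):
    qrels = defaultdict(dict)  # qid -> {docid: relevance}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels

def read_run(path):
    run = defaultdict(list)  # qid -> [(rank, docid), ...]
    with open(path) as f:
        for line in f:
            qid, _, docid, rank, _score, _tag = line.split()
            run[qid].append((int(rank), docid))
    return {qid: [d for _, d in sorted(pairs)] for qid, pairs in run.items()}

def ndcg_at_10(ranked_docids, rels):
    dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_docids[:10]))
    ideal = sorted(rels.values(), reverse=True)[:10]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

pseudo_qrels = read_qrels("./output/generated-judgments")  # hypothetical filename
run = read_run("./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt")
predicted_qpp = {qid: ndcg_at_10(docs, pseudo_qrels.get(qid, {})) for qid, docs in run.items()}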

Predicting the performance of BM25 on TREC-DL 19

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-19-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-2790.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt  \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt  \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

Predicting the performance of BM25 on TREC-DL 20

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-20-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-2790.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt  \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt  \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000

Predicting the performance of BM25 on TREC-DL 21

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-21-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860.k1000 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000 

Predicting the performance of BM25 on TREC-DL 22

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-22-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860.k1000 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000

Predicting the performance of ANCE on TREC-DL 19

python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt  \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--qrels_path  ./output/dl-19-passage.original-ance-msmarco-v1-passage-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-checkpoint-2790.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

Predicting the performance of ANCE on TREC-DL 20

python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt  \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--qrels_path  ./output/dl-20-passage.original-ance-msmarco-v1-passage-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-checkpoint-2790.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000

🛠️ 2.3 Fine-tuning LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct

Run the following commands to fine-tune 4-bit quantized LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct using QLoRA on the task of judging the relevance of a passage to a given query, on the development set of MS MARCO V1. For each query in the development set, we use the relevant passages from the qrels file as positives, and randomly sample negative passages from the ranked list (1,000 items) returned by BM25; a sketch of this sampling follows. A checkpoint is saved to ./checkpoint/ after each epoch.
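
The construction of training pairs described above can be sketched as follows (a simplified illustration; the script's actual sampling and prompt formatting may differ):

# Sketch: build fine-tuning examples from dev-set qrels (positives) and
# randomly sampled unjudged passages from BM25's top-1000 (negatives).
import random

def build_examples(qrels, run, num_negs=1, neg_top=1000, seed=42):
    # qrels: qid -> {docid: relevance}; run: qid -> [docid, ...] in rank order
    rng = random.Random(seed)
    examples = []  # (qid, docid, "Relevant"/"Irrelevant")
    for qid, rels in qrels.items():
        for docid, rel in rels.items():
            if rel > 0:
                examples.append((qid, docid, "Relevant"))
        candidates = [d for d in run.get(qid, [])[:neg_top] if d not in rels]
        for docid in rng.sample(candidates, min(num_negs, len(candidates))):
            examples.append((qid, docid, "Irrelevant"))
    return examples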

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 5 \
--num_negs 1 \
--neg_top 1000 \
--prompt binary

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 5 \
--num_negs 2 \
--neg_top 1000 \
--prompt binary

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 5 \
--num_negs 2 \
--neg_top 1000 \
--prompt binary

Note

Fine-tuning LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct using QLoRA for 5 epochs on the development set of MS MARCO V1 takes about an hour and a half on an NVIDIA A100 GPU.
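
For reference, a typical QLoRA setup attaches LoRA adapters to a 4-bit quantized model via peft; the sketch below uses illustrative hyperparameters, not necessarily those used in the paper:

# Sketch of a QLoRA setup with peft; ranks, alpha, and target modules are
# illustrative defaults and may differ from the repository's configuration.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable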

🛞 2.4 In-context learning using LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct

In the in-context learning setting, we freeze the parameters of the LLMs. We randomly sample human-labeled demonstration examples (each in the format "<query, passage, relevant/irrelevant>") from the development set of MS MARCO V1 (the same set used for fine-tuning in the previous part), and insert the sampled demonstration examples into the input of LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct with original weights. Specifically, we sample two demonstration examples: one with a passage labeled as relevant (<query, passage, relevant>) and one with an irrelevant passage (<query, passage, irrelevant>). Our preliminary experiments show that two demonstration examples work best, so we stick with this setting.

Sampled demonstration examples

Note that we sampled the following demonstration examples and use them for all few-shot prompting experiments (a prompt-assembly sketch follows the examples):

Question: avatar the last airbender game
Passage: Avatar: The Last Airbender: The Video Game (known as Avatar: The Legend of Aang in Europe) is a video game based on the animated television series of the same name for Game Boy Advance, Microsoft Windows, Nintendo GameCube, Nintendo DS, PlayStation 2, PlayStation Portable, Wii, and Xbox.
Output: Relevant
Question: avatar the last airbender game
Passage: Fans of Avatar: The Last Airbender have been feverishly looking forward to this weekend: Michael Dante DiMartino and Bryan Konietzko continue the mythology of their Airbender series with Nickelodeon's The Legend of Korra, which premieres Saturday at 11 a.m.
Output: Irrelevant
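
Conceptually, these two demonstrations are prepended to every test <query, passage> pair before asking the frozen model for a judgment. Below is a hedged sketch of this assembly (judge_relevance.py's actual prompt template may differ; the passages are truncated here for brevity):

# Sketch: assemble a two-shot prompt from the fixed demonstrations above.
DEMONSTRATIONS = (
    "Question: avatar the last airbender game\n"
    "Passage: Avatar: The Last Airbender: The Video Game (known as Avatar: "
    "The Legend of Aang in Europe) is a video game based on ...\n"  # truncated
    "Output: Relevant\n"
    "Question: avatar the last airbender game\n"
    "Passage: Fans of Avatar: The Last Airbender have been feverishly looking "
    "forward to this weekend ...\n"  # truncated
    "Output: Irrelevant\n"
)

def build_prompt(query: str, passage: str) -> str:
    return f"{DEMONSTRATIONS}Question: {query}\nPassage: {passage}\nOutput:"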

Predicting the performance of BM25 on TREC-DL 19

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-19-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000

Predicting the performance of BM25 on TREC-DL 20

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-20-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000


# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000

Predicting the performance of BM25 on TREC-DL 21

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-21-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000

Predicting the performance of BM25 on TREC-DL 22

# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-22-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000

# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000

# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct"  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt  \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv  \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary

python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path  ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000

📏 2.5 Evaluation

We provide detailed commands to evaluate the QPP effectiveness of QPP-GenRE for predicting the performance of BM25 or ANCE in terms of RR@10 or nDCG@10. QPP effectiveness is measured by the Pearson and Kendall correlation coefficients between the actual performance of a ranker on a set of queries and its predicted performance on the same queries.
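
Concretely, the correlation computation over per-query scores can be sketched as follows (scipy is an assumption here; evaluate_qpp.py may compute these coefficients differently):

# Sketch: Pearson and Kendall correlations between actual and predicted
# per-query performance; evaluate_qpp.py may implement this differently.
from scipy.stats import kendalltau, pearsonr

def qpp_correlations(actual, predicted):
    # actual, predicted: qid -> per-query score of the ranker
    qids = sorted(set(actual) & set(predicted))
    x = [actual[q] for q in qids]
    y = [predicted[q] for q in qids]
    return pearsonr(x, y)[0], kendalltau(x, y)[0]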

Note

TREC-DL 19, 20, 21 and 22 provide multi-graded relevance judgments per query, whereas an LLM in QPP-GenRE can only generate binary relevance judgments, because the training set of QPP-GenRE contains only binary relevance judgments. For RR@10, we treat relevance grade ≥ 2 as positive when computing the actual values of RR@10 (see the sketch below). For nDCG@10, the actual values are computed from the human-labeled multi-graded relevance judgments, while the values predicted by QPP-GenRE are computed from the binary relevance judgments generated by an LLM. Although QPP-GenRE thus uses nDCG@10 computed from binary judgments to approximate nDCG@10 computed from multi-graded judgments, it still achieves promising QPP effectiveness in terms of Pearson and Kendall correlation coefficients.
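
For example, the binarization used when computing the actual RR@10 values can be sketched as:

# Sketch: binarize graded TREC-DL qrels, treating grade >= 2 as relevant,
# before computing the actual RR@10 values.
def binarize_qrels(qrels, threshold=2):
    # qrels: qid -> {docid: graded relevance}
    return {qid: {docid: int(rel >= threshold) for docid, rel in rels.items()}
            for qid, rels in qrels.items()}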

Evaluate QPP effectiveness of QPP-GenRE for predicting the performance of BM25 in terms of RR@10 and nDCG@10

The following commands will produce files recording QPP results in the directories of ./output/dl-19-passage/, ./output/dl-20-passage/, ./output/dl-21-passage/, and ./output/dl-22-passage/, respectively:

python -u evaluate_qpp.py \
--pattern './output/dl-19-passage/dl-19-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-19-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10

python -u evaluate_qpp.py \
--pattern './output/dl-20-passage/dl-20-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-20-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10

python -u evaluate_qpp.py \
--pattern './output/dl-21-passage/dl-21-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v2-passage/ap/dl-21-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10 

python -u evaluate_qpp.py \
--pattern './output/dl-22-passage/dl-22-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v2-passage/ap/dl-22-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10

Evaluate QPP effectiveness of QPP-GenRE for predicting the performance of ANCE in terms of RR@10 and nDCG@10

The following commands will produce files recording QPP results in the directories of ./output/dl-19-passage/ and ./output/dl-20-passage/, respectively:

python -u evaluate_qpp.py \
--pattern './output/dl-19-passage/dl-19-passage.original-ance-msmarco-v1-passage-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-19-passage.ap-original-ance-msmarco-v1-passage-1000.json \
--target_metrics mrr@10 ndcg@10

python -u evaluate_qpp.py \
--pattern './output/dl-20-passage/dl-20-passage.original-ance-msmarco-v1-passage-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-20-passage.ap-original-ance-msmarco-v1-passage-1000.json \
--target_metrics mrr@10 ndcg@10

👍 2.6 The results of scaled Mean Absolute Ranking Error (sMARE)

We calculate sMARE values for our method and all baselines, using the code released by the authors of sMARE.

The following tables show that our method obtains the lowest sMARE on each dataset (lower values indicate better QPP effectiveness) when predicting the performance of either BM25 or ANCE in terms of RR@10 and nDCG@10.
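
As a reference, sMARE ranks the queries by actual and by predicted performance and averages the normalized absolute rank differences. The sketch below follows our reading of the metric; the authors' released code is authoritative:

# Sketch of sMARE: mean, over queries, of |predicted rank - actual rank| / |Q|.
from scipy.stats import rankdata

def smare(actual_scores, predicted_scores):
    # actual_scores, predicted_scores: aligned per-query score lists
    n = len(actual_scores)
    rank_actual = rankdata(actual_scores)        # ranks 1..n (ties averaged)
    rank_predicted = rankdata(predicted_scores)
    return float(abs(rank_predicted - rank_actual).sum() / (n * n))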

Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 19.

Method sMARE
Clarity 0.352
WIG 0.291
NQC 0.313
πœŽπ‘šπ‘Žπ‘₯ 0.296
n(𝜎π‘₯%) 0.286
SMV 0.313
UEF(NQC) 0.290
RLS(NQC) 0.318
QPP-PRP 0.297
NQAQPP 0.315
BERTQPP 0.318
qppBERT-PL 0.275
M-QPPF 0.283
QPP-GenRE (ours) 0.196

Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 20.

Method sMARE
Clarity 0.320
WIG 0.245
NQC 0.249
πœŽπ‘šπ‘Žπ‘₯ 0.255
n(𝜎π‘₯%) 0.279
SMV 0.251
UEF(NQC) 0.261
RLS(NQC) 0.294
QPP-PRP 0.287
NQAQPP 0.315
BERTQPP 0.287
qppBERT-PL 0.302
M-QPPF 0.250
QPP-GenRE (ours) 0.157

Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 21.

Method sMARE
Clarity 0.285
WIG 0.276
NQC 0.276
πœŽπ‘šπ‘Žπ‘₯ 0.286
n(𝜎π‘₯%) 0.288
SMV 0.273
UEF(NQC) 0.315
RLS(NQC) 0.272
QPP-PRP 0.311
NQAQPP 0.285
BERTQPP 0.305
qppBERT-PL 0.269
M-QPPF 0.267
QPP-GenRE (ours) 0.237

Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 22.

Method sMARE
Clarity 0.317
WIG 0.315
NQC 0.330
πœŽπ‘šπ‘Žπ‘₯ 0.322
n(𝜎π‘₯%) 0.309
SMV 0.322
UEF(NQC) 0.325
RLS(NQC) 0.316
QPP-PRP 0.316
NQAQPP 0.280
BERTQPP 0.306
qppBERT-PL 0.295
M-QPPF 0.289
QPP-GenRE (ours) 0.249

Table: Predicting the performance of ANCE in terms of RR@10 on TREC-DL 19.

Method sMARE
Clarity 0.335
WIG 0.307
NQC 0.307
πœŽπ‘šπ‘Žπ‘₯ 0.281
n(𝜎π‘₯%) 0.287
SMV 0.278
UEF(NQC) 0.266
RLS(NQC) 0.269
QPP-PRP 0.296
Dense-QPP 0.317
NQAQPP 0.316
BERTQPP 0.286
qppBERT-PL 0.274
M-QPPF 0.291
QPP-GenRE (ours) 0.119

Table: Predicting the performance of ANCE in terms of RR@10 on TREC-DL 20.

Method sMARE
Clarity 0.325
WIG 0.333
NQC 0.302
πœŽπ‘šπ‘Žπ‘₯ 0.306
n(𝜎π‘₯%) 0.339
SMV 0.294
UEF(NQC) 0.335
RLS(NQC) 0.302
QPP-PRP 0.307
Dense-QPP 0.292
NQAQPP 0.368
BERTQPP 0.365
qppBERT-PL 0.359
M-QPPF 0.321
QPP-GenRE (ours) 0.228

Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 19.

Method sMARE
Clarity 0.309
WIG 0.239
NQC 0.239
πœŽπ‘šπ‘Žπ‘₯ 0.236
n(𝜎π‘₯%) 0.238
SMV 0.241
UEF(NQC) 0.236
RLS(NQC) 0.233
QPP-PRP 0.287
NQAQPP 0.295
BERTQPP 0.273
qppBERT-PL 0.296
M-QPPF 0.264
QPP-GenRE (ours) 0.198

Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 20.

Method sMARE
Clarity 0.251
WIG 0.213
NQC 0.215
πœŽπ‘šπ‘Žπ‘₯ 0.211
n(𝜎π‘₯%) 0.206
SMV 0.218
UEF(NQC) 0.227
RLS(NQC) 0.223
QPP-PRP 0.305
NQAQPP 0.272
BERTQPP 0.248
qppBERT-PL 0.274
M-QPPF 0.243
QPP-GenRE (ours) 0.177

Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 21.

Method sMARE
Clarity 0.307
WIG 0.252
NQC 0.266
πœŽπ‘šπ‘Žπ‘₯ 0.258
n(𝜎π‘₯%) 0.264
SMV 0.271
UEF(NQC) 0.262
RLS(NQC) 0.286
QPP-PRP 0.341
NQAQPP 0.266
BERTQPP 0.261
qppBERT-PL 0.279
M-QPPF 0.259
QPP-GenRE (ours) 0.201

Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 22.

Method sMARE
Clarity 0.307
WIG 0.265
NQC 0.282
πœŽπ‘šπ‘Žπ‘₯ 0.283
n(𝜎π‘₯%) 0.264
SMV 0.276
UEF(NQC) 0.282
RLS(NQC) 0.284
QPP-PRP 0.339
NQAQPP 0.283
BERTQPP 0.273
qppBERT-PL 0.289
M-QPPF 0.283
QPP-GenRE (ours) 0.249

Table: Predicting the performance of ANCE in terms of nDCG@10 on TREC-DL 19.

Method sMARE
Clarity 0.366
WIG 0.213
NQC 0.221
πœŽπ‘šπ‘Žπ‘₯ 0.223
n(𝜎π‘₯%) 0.239
SMV 0.228
UEF(NQC) 0.221
RLS(NQC) 0.224
QPP-PRP 0.309
Dense-QPP 0.212
NQAQPP 0.329
BERTQPP 0.309
qppBERT-PL 0.343
M-QPPF 0.292
QPP-GenRE (ours) 0.186

Table: Predicting the performance of ANCE in terms of nDCG@10 on TREC-DL 20.

Method sMARE
Clarity 0.345
WIG 0.297
NQC 0.254
πœŽπ‘šπ‘Žπ‘₯ 0.250
n(𝜎π‘₯%) 0.305
SMV 0.250
UEF(NQC) 0.250
RLS(NQC) 0.254
QPP-PRP 0.294
Dense-QPP 0.242
NQAQPP 0.304
BERTQPP 0.304
qppBERT-PL 0.324
M-QPPF 0.274
QPP-GenRE (ours) 0.228

3. Can We Use Large Language Models to Fill Relevance Judgment Holes?

3.1 Prerequisite

Download datasets

Please first download dataset.zip (containing queries, qrels files and corpus) from here, and then unzip it in the current directory.

Then, please run the following command to preprocess the dataset:

python -u preprocessing.py \
--raw_data_path ./datasets/ikat/raw/splitted_data.txt

Fetch the original weights of Llama-3-8B and Llama-3-8B-Instruct

One can directly fetch the original weights of Llama-3-8B and Llama-3-8B-Instruct from Hugging Face. Please set the following variables:

export TOKEN={your token to use as HTTP bearer authorization for remote files}
export CACHE_DIR={your cache path that stores the weights of Llama 3}

Download the checkpoints of fine-tuned Llama-3-8B and Llama-3-8B-Instruct

For the reproducibility of the results reported in the paper, please download the checkpoints of our fine-tuned Llama-3-8B and Llama-3-8B-Instruct.

After downloading, please unzip them in a new directory ./checkpoint/.

3.2 Inference using fine-tuned LLaMA-7B

# inference on the test split
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH}  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path  ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output \
--batch_size 16 \
--infer --rj --prompt binary

# inference on the whole set
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH}  \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path  ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat.qrels \
--output_dir ./output \
--batch_size 16 \
--infer --rj --prompt binary

3.3 Zero-shot prompting using Llama-3-8B and Llama-3-8B-Instruct

# inference on the test split (Llama-3-8B)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj 

# inference on the test split (Llama-3-8B-Instruct)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj 

# inference on the whole set (Llama-3-8B)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj 

# inference on the whole set (Llama-3-8B-Instruct)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj 

3.4 Inference using fine-tuned Llama-3-8B and Llama-3-8B-Instruct

# inference on the test split (Llama-3-8B)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name ikat-train.Meta-Llama-3-8B/checkpoint-3374 \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--batch_size 32 \
--prompt ikat \
--infer --rj 


# inference on the test split (Llama-3-8B-Instruct)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name ikat-train.Meta-Llama-3-8B-Instruct/checkpoint-3374 \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--batch_size 32 \
--prompt ikat \
--infer --rj 

3.5 Fine-tuning Llama-3-8B and Llama-3-8B-Instruct

# fine-tune Llama-3-8B on the training split
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-train.qrels \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 10 \
--prompt ikat \
--rj 

# fine-tune Llama-3-8B-Instruct on the training split
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-train.qrels \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 10 \
--prompt ikat \
--rj 

3.6 Evaluation

# evaluate fine-tuned LLaMA-7B on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.manual-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860 \
--binary --pre_is_binary

# evaluate fine-tuned LLaMA-7B on the whole set
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat.qrels \
--qrels_pred_dir ./output/ikat.manual-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860 \
--binary --pre_is_binary

# evaluate Llama-3-8B on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.Meta-Llama-3-8B

# evaluate Llama-3-8B-Instruct on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.Meta-Llama-3-8B-Instruct

# evaluate Llama-3-8B on the whole set
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat.qrels \
--qrels_pred_dir ./output/ikat.Meta-Llama-3-8B

# evaluate Llama-3-8B-Instruct on the whole set
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat.qrels \
--qrels_pred_dir ./output/ikat.Meta-Llama-3-8B-Instruct

# evaluate fine-tuned Llama-3-8B on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.manual-Meta-Llama-3-8B-ckpt-ikat-train.Meta-Llama-3-8B-checkpoint-3374

# evaluate fine-tuned Llama-3-8B-Instruct on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.manual-Meta-Llama-3-8B-Instruct-ckpt-ikat-train.Meta-Llama-3-8B-Instruct-checkpoint-3374
