Why are there so many documents in the MIRACL-Korean index? #1903

steven-channel · 2024-05-30T00:40:09Z

I'm running the command for evaluating the mDPR model on the MIRACL-Korean benchmark, as provided here: https://castorini.github.io/pyserini/2cr/miracl.html

The command is:

python -m pyserini.search.faiss \
  --threads 16 --batch-size 512 \
  --encoder-class auto \
  --encoder castorini/mdpr-tied-pft-msmarco-ft-miracl-ko \
  --topics miracl-v1.0-ko-dev \
  --index miracl-v1.0-ko-mdpr-tied-pft-msmarco-ft-miracl-ko \
  --output run.miracl.mdpr-tied-pft-msmarco-ft-miracl.ko.dev.txt --hits 1000

As far as I know, the MIRACL-Korean dev set contains 3,057 documents (i.e., positive and negative passages), including some duplicates. However, the index used by this command contains 1,486,752 documents. Why are there so many?
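For reference, here is a minimal sketch of how a judged-passage count like the one above could be tallied, including how duplicates across queries collapse. It assumes the judgments are available as whitespace-separated, TREC-style qrels lines (query ID, iteration, passage ID, relevance); that layout is an assumption on my part, and the function name and sample rows are hypothetical, not taken from MIRACL or Pyserini.

```python
from typing import Iterable, Tuple


def count_judged_passages(qrels_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total judgments, unique passage IDs) from TREC-style qrels lines.

    Assumed line layout (an assumption, not confirmed by the source):
        <query_id> <iteration> <passage_id> <relevance>
    """
    total = 0
    unique_docids = set()
    for line in qrels_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _qid, _iteration, docid, _relevance = fields
        total += 1
        unique_docids.add(docid)  # the same passage may be judged for several queries
    return total, len(unique_docids)


# Made-up sample rows for illustration only, not real MIRACL judgments.
sample = [
    "q1 Q0 doc-a 1",
    "q1 Q0 doc-b 0",
    "q2 Q0 doc-a 1",
]
total, unique = count_judged_passages(sample)
print(total, unique)
```

To run this on a real file, the lines could be passed in with `open(path)` in place of `sample`; the count it produces covers only judged passages, which is the figure I am comparing against the index size.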
