Format mismatch when trying to load trec-car index. #962

ZanezZephyrs · 2022-01-19T12:21:43Z

Hi, I was trying to build the trec-car index and access it with Pyserini. Here is what I did.

First, I cloned and installed Anserini and anserini-tools:

git clone https://github.com/castorini/Anserini.git
cd Anserini
mvn clean package appassembler:assemble
tar xvfz tools/eval/trec_eval.9.0.4.tar.gz -C tools/eval/ && cd tools/eval/trec_eval.9.0.4 && make
cd ../ndeval && make

Then I indexed my collection (one .cbor file from http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz) with the following command:

-generator DefaultLuceneDocumentGenerator -threads 1 -input ./paragraphCorpus -index \
./lucene-index.car17 -storeRaw

With this process, I was able to generate the index without errors, but when I tried to open it with Pyserini's SimpleSearcher I got the following Java exception:

jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/datadrive/trec_car/Anserini/lucene-index.car17/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

Some other information that might be useful:

Pyserini Version - 0.14
Java Version - 11
Anserini and Anserini-tools - cloned from master

I am not sure what the problem is here, but I would appreciate any help.

Answered by lintool

lintool · 2022-01-19T12:42:43Z

Maintainer

You're getting Lucene index incompatibility issues.

Anserini master is based on Lucene 8.11 now: https://github.com/castorini/anserini/blob/master/docs/release-notes/release-notes-v0.14.0.md

Previously it was based on Lucene 8.3, and the Pyserini 0.14.0 artifact on PyPI is still based on Lucene 8.3.

Are you indexing from Pyserini (Python) or Anserini (Java)?

If you're in Java-land exclusively (both indexing and search), you should be fine. You're probably mixing Python and Java in a way that's exposing this incompatibility.
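
To make the mismatch concrete: the exception message itself reports the format version found in the index and the range the reader supports. Here is a minimal sketch that pulls those numbers out, assuming the message keeps the shape shown in the question. The helper name `parse_format_error` is hypothetical and is not part of Pyserini or Lucene.

```python
import re

# Matches the tail of messages like "...: 10 (needs to be between 7 and 9)".
# Assumption: the message layout is the one quoted in the question.
_PATTERN = re.compile(r":\s*(\d+)\s*\(needs to be between (\d+) and (\d+)\)")


def parse_format_error(message):
    """Return (found, min_supported, max_supported), or None if no match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    found, low, high = (int(group) for group in match.groups())
    return found, low, high


message = (
    "Format version is not supported (resource BufferedChecksumIndexInput("
    'MMapIndexInput(path="/datadrive/trec_car/Anserini/lucene-index.car17/segments_1"))): '
    "10 (needs to be between 7 and 9)"
)

found, low, high = parse_format_error(message)
if found > high:
    # The index was written by a newer Lucene than the one doing the reading.
    print(f"index format {found} is newer than the reader supports ({low}-{high})")
```

A found value above the supported range, as here, points to an index written by a newer Lucene than the one bundled with the searcher.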

2 replies

ZanezZephyrs Jan 19, 2022
Author

I see.

So Anserini versions prior to 0.14 still use Lucene 8.3, correct?

I was indexing with Anserini, for no particular reason. I am indexing again with Pyserini, so this should solve my index format problems.

Thanks for the fast response :D.

lintool Jan 19, 2022
Maintainer

So Anserini versions prior to 0.14 still use Lucene 8.3, correct?

Yes. When we push out 0.15.0 onto PyPI, it'll be based on Lucene 8.11.
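
Putting the versions from this thread together: as a rule of thumb, an index needs to be read by a Lucene release at least as new as the one that wrote it. Below is a rough sketch of that rule, using only the version pairings mentioned above. The dictionary keys and the `can_read` helper are illustrative and are not part of either toolkit. The 0.15.0 entry reflects the stated plan, not a confirmed release, and the check ignores limits across Lucene major versions.

```python
# Lucene versions per toolkit build, as stated in this thread.
# Keys are illustrative labels, not real package identifiers.
LUCENE_VERSION = {
    "anserini-master": (8, 11),
    "pyserini-0.14.0": (8, 3),
    "pyserini-0.15.0": (8, 11),  # planned at the time of the thread
}


def can_read(searcher, indexer):
    """Heuristic: the reader's Lucene must be at least as new as the writer's."""
    return LUCENE_VERSION[searcher] >= LUCENE_VERSION[indexer]


# The combination from the question: index built with Anserini master,
# searched with the Pyserini 0.14.0 artifact from PyPI.
print(can_read("pyserini-0.14.0", "anserini-master"))

# Indexing and searching with the same build.
print(can_read("pyserini-0.14.0", "pyserini-0.14.0"))
```

This is why re-indexing with the same Pyserini build that does the searching avoids the error.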
