msmarco-v2-doc-segmented

Lucene index of the MS MARCO V2 segmented document corpus.

Note that there are three variants of this index:

  • msmarco-v2-doc-segmented (132G uncompressed): the "default" version, which stores term frequencies and the raw text. It supports bag-of-words queries, but not phrase queries or relevance feedback.
  • msmarco-v2-doc-segmented-slim (26G uncompressed): the "slim" version, which stores term frequencies only. It supports bag-of-words queries, but not phrase queries or relevance feedback, and the raw text cannot be fetched from this index.
  • msmarco-v2-doc-segmented-full (233G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. It supports bag-of-words queries, phrase queries, and relevance feedback.
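
The trade-off between the three variants can be sketched as a small lookup that picks the smallest index covering the features you need. This is only an illustration: `pick_variant` and its yes/no arguments are hypothetical names, not part of Anserini, and the sizes in the comments are the uncompressed figures listed above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Anserini): print the smallest index
# variant that covers the requested features.
#   $1: "yes" if the raw document text must be retrievable
#   $2: "yes" if phrase queries or relevance feedback are needed
pick_variant() {
  local need_raw="$1" need_phrase_or_rf="$2"
  if [ "$need_phrase_or_rf" = "yes" ]; then
    # Only the "full" variant supports phrase queries and relevance feedback.
    echo "msmarco-v2-doc-segmented-full"   # 233G uncompressed
  elif [ "$need_raw" = "yes" ]; then
    # The "default" variant adds the raw text to term frequencies.
    echo "msmarco-v2-doc-segmented"        # 132G uncompressed
  else
    # Bag-of-words retrieval alone only needs term frequencies.
    echo "msmarco-v2-doc-segmented-slim"   # 26G uncompressed
  fi
}

pick_variant no no
pick_variant yes no
pick_variant no yes
```

Since the "full" variant also stores the raw text, the first argument only matters when phrase queries and relevance feedback are not required.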

These indexes were generated on 2022/08/08 at Anserini commit 4d6d2a on damiano with the following commands:

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/ \
  -index indexes/lucene-index.msmarco-v2-doc-segmented.20220808.4d6d2a/ \
  -storeRaw -optimize \
  >& logs/log.msmarco-v2-doc-segmented.20220808.4d6d2a.txt &

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/ \
  -index indexes/lucene-index.msmarco-v2-doc-segmented-slim.20220808.4d6d2a/ \
  -optimize \
  >& logs/log.msmarco-v2-doc-segmented-slim.20220808.4d6d2a.txt &

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/ \
  -index indexes/lucene-index.msmarco-v2-doc-segmented-full.20220808.4d6d2a/ \
  -storePositions -storeDocvectors -storeRaw -optimize \
  >& logs/log.msmarco-v2-doc-segmented-full.20220808.4d6d2a.txt &

In May 2024, the indexes were repackaged to adopt a more consistent naming scheme.