Lucene index of the MS MARCO V2 segmented document corpus.
Note that there are three variants of this index:
msmarco-v2-doc-segmented (132G uncompressed): the "default" version, which stores term frequencies and the raw text. This supports bag-of-words queries, but no phrase queries and no relevance feedback.
msmarco-v2-doc-segmented-slim (26G uncompressed): the "slim" version, which stores term frequencies only. This supports bag-of-words queries, but no phrase queries and no relevance feedback. There is no way to fetch the raw text from this index.
msmarco-v2-doc-segmented-full (233G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. This supports bag-of-words queries, phrase queries, and relevance feedback.
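The trade-offs above amount to picking the smallest variant that covers the features you need. A minimal sketch of that decision in bash follows; the helper `pick_variant` and its yes/no arguments are hypothetical and not part of Anserini.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Anserini tool): print the smallest index variant
# that covers the requested features. Each argument is "yes" or "no":
#   $1 = need raw text, $2 = need phrase queries, $3 = need relevance feedback
pick_variant() {
  local need_raw="$1" need_phrase="$2" need_rf="$3"
  if [ "$need_phrase" = "yes" ] || [ "$need_rf" = "yes" ]; then
    # Only the "full" variant stores positions and document vectors.
    printf '%s\n' "msmarco-v2-doc-segmented-full"
  elif [ "$need_raw" = "yes" ]; then
    # The "default" variant adds the raw text on top of term frequencies.
    printf '%s\n' "msmarco-v2-doc-segmented"
  else
    # Bag-of-words retrieval alone needs only term frequencies.
    printf '%s\n' "msmarco-v2-doc-segmented-slim"
  fi
}

# Example: bag-of-words retrieval only, no raw text needed.
pick_variant no no no
```

The "full" variant also stores the raw text, so any request for phrase queries or relevance feedback resolves to it regardless of the first argument.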
These indexes were generated on 2022/08/08 at Anserini commit 4d6d2a on damiano with the following commands:
nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/ \
  -index indexes/lucene-index.msmarco-v2-doc-segmented.20220808.4d6d2a/ \
  -storeRaw -optimize \
  >& logs/log.msmarco-v2-doc-segmented.20220808.4d6d2a.txt &

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/ \
  -index indexes/lucene-index.msmarco-v2-doc-segmented-slim.20220808.4d6d2a/ \
  -optimize \
  >& logs/log.msmarco-v2-doc-segmented-slim.20220808.4d6d2a.txt &

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/ \
  -index indexes/lucene-index.msmarco-v2-doc-segmented-full.20220808.4d6d2a/ \
  -storePositions -storeDocvectors -storeRaw -optimize \
  >& logs/log.msmarco-v2-doc-segmented-full.20220808.4d6d2a.txt &
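The three commands above differ only in the index name suffix and the storage flags. The dry-run sketch below makes that explicit by printing the three `IndexCollection` invocations rather than running them. The helpers `flags_for` and `build_cmd` are hypothetical names introduced here for illustration, and the `nohup` wrapper and log redirection are left out for brevity.

```shell
#!/usr/bin/env bash
# Dry run: prints the indexing command for each variant instead of executing it.
# flags_for and build_cmd are illustrative helpers, not part of Anserini.
stamp="20220808.4d6d2a"
base="msmarco-v2-doc-segmented"

# Storage flags per variant suffix ("" is the default variant).
flags_for() {
  case "$1" in
    "")      printf '%s\n' "-storeRaw -optimize" ;;
    "-slim") printf '%s\n' "-optimize" ;;
    "-full") printf '%s\n' "-storePositions -storeDocvectors -storeRaw -optimize" ;;
  esac
}

# Assemble the full command line for one variant suffix.
build_cmd() {
  local suffix="$1"
  local name="${base}${suffix}.${stamp}"
  printf '%s %s %s %s %s\n' \
    "target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection" \
    "-generator DefaultLuceneDocumentGenerator -threads 18" \
    "-input /scratch2/collections/msmarco/msmarco_v2_doc_segmented/" \
    "-index indexes/lucene-index.${name}/" \
    "$(flags_for "$suffix")"
}

for suffix in "" "-slim" "-full"; do
  build_cmd "$suffix"
done
```

Only `-storePositions`, `-storeDocvectors`, and `-storeRaw` change between variants, which is what accounts for the differences in index size and supported query types.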
In May 2024, indexes were repackaged to adopt a more consistent naming scheme.