I'm trying to add a new document for an existing index(Clueweb09) #1931
I'm trying to add a new document to an existing index (ClueWeb09):

    import json
    import subprocess

    from pyserini import collection, index
    from pyserini.index.lucene import LuceneIndexer

    # Convert the ClueWeb09 collection to JSONL.
    collection = collection.Collection('ClueWeb09Collection', 'data/clueweb')
    generator = index.Generator('DefaultLuceneDocumentGenerator')
    with open('data/jsonl/clueweb.jsonl', 'w') as f:
        for (i, fs) in enumerate(collection):
            for (j, doc) in enumerate(fs):
                parsed = generator.create_document(doc)
                docid = parsed.get('id')  # FIELD_ID
                contents = parsed.get('contents')  # FIELD_BODY
                f.write(json.dumps({'id': docid, 'contents': contents}) + '\n')

    # Build the Lucene index from the JSONL file.
    command = (
        'python -m pyserini.index.lucene '
        '--collection JsonCollection '
        '--input data/jsonl '
        '--index data/index '
        '--generator DefaultLuceneDocumentGenerator '
        '--threads 4 '
        '--storeDocvectors --storeContents'
    )
    result = subprocess.run(command, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # Try to append one more document to the existing index.
    indexer = LuceneIndexer('data/index', append=True)
    doc = "testing"
    id = "clueweb09-en0000-00-327300000"
    indexer.add_doc_dict({"id": id, "contents": doc})
    indexer.close()

But I get an error when I run:

    indexer.add_doc_dict({"id": id, "contents": doc})
@lintool Do you know how to fix this, or how to add a new document with the command above? Thanks
Answered by tomer92808888 on Jul 17, 2024
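I managed to find the solution. Pass the index path and the storage options through args instead of passing the index directory directly:

    from pyserini.index.lucene import LuceneIndexer

    # Same index path and storage flags that were used to build the index.
    args = ["-index", "data/index", "-storeDocvectors", "-storeContents"]
    indexer = LuceneIndexer(append=True, args=args)
    doc = "testing"
    id = "clueweb09-en0000-00-327300000"
    indexer.add_doc_dict({"id": id, "contents": doc})
    indexer.close()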
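For anyone who wants to reuse this, the same approach can be wrapped in a small helper. This is only a minimal sketch, not part of Pyserini: `build_indexer_args` and `append_docs` are hypothetical names, and it assumes the index was built with `--storeDocvectors --storeContents` as in the question. The pyserini import is deferred into the function so the argument-building part can be checked without a Java runtime.

```python
from typing import Dict, Iterable, List


def build_indexer_args(index_dir: str,
                       store_docvectors: bool = True,
                       store_contents: bool = True) -> List[str]:
    """Build the args list handed to LuceneIndexer (hypothetical helper)."""
    args = ["-index", index_dir]
    if store_docvectors:
        args.append("-storeDocvectors")
    if store_contents:
        args.append("-storeContents")
    return args


def append_docs(index_dir: str, docs: Iterable[Dict[str, str]]) -> None:
    """Append {'id': ..., 'contents': ...} dicts to an existing index (hypothetical helper)."""
    # Deferred import: pyserini needs a Java runtime to load.
    from pyserini.index.lucene import LuceneIndexer

    indexer = LuceneIndexer(append=True, args=build_indexer_args(index_dir))
    try:
        for doc in docs:
            indexer.add_doc_dict(doc)
    finally:
        # Always close the indexer, even if adding a document fails.
        indexer.close()
```

With that in place, the call from the answer would look like `append_docs("data/index", [{"id": "clueweb09-en0000-00-327300000", "contents": "testing"}])`.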
Answer selected by
tomer92808888