Updating Index in Pyserini Without Full Reindexing when Document Contents Change #1964

Open
LaplaceXD opened this issue Aug 20, 2024 · 4 comments

Comments

@LaplaceXD

LaplaceXD commented Aug 20, 2024

Hi all,

I'm working on a project with Pyserini and would appreciate some guidance on efficiently updating document indexes. My goal is to avoid reindexing the entire document list whenever a document changes. Initially, I planned to delete the specific document from the index and then append the updated version. However, I couldn't find an IndexWriter module that allows for document deletion.

I also tried using the -uniqueDocid flag with the LuceneIndexer set to append mode, but it didn't seem to remove the old document entry from the index.

At this point, I'm uncertain whether this approach is possible in Pyserini or if there's a more suitable method for incremental indexing. Any guidance or references to relevant code or examples would be greatly appreciated.

Thanks in advance for your insights!

@lintool
Member

lintool commented Aug 20, 2024

See #1451 - does this help?

@LaplaceXD
Author

LaplaceXD commented Aug 20, 2024

> See #1451 - does this help?

Unfortunately, it doesn't. It covers one of our use cases, adding new documents, but not the other: when the contents of an existing document change, we want the index to update accordingly.

>>> from pyserini.index.lucene import LuceneIndexer, IndexReader
>>> indexer = LuceneIndexer("index")
2024-08-21 01:51:56,150 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:138) - Using DefaultEnglishAnalyzer
2024-08-21 01:51:56,153 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:139) - Stemmer: porter
2024-08-21 01:51:56,153 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:140) - Keep stopwords? false
2024-08-21 01:51:56,153 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Stopwords file: null
Aug 21, 2024 1:51:56 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
>>> indexer.add_doc_dict({'id': '0', 'contents': 'Hello there!'})
>>> indexer.add_doc_dict({'id': '1', 'contents': 'A completely unique document.'})
>>> indexer.close()
>>> reader = IndexReader("index")
>>> reader.stats()
{'total_terms': 4, 'documents': 2, 'non_empty_documents': 2, 'unique_terms': 4}
>>> indexer = LuceneIndexer("index", append=True)
2024-08-21 01:52:54,745 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:138) - Using DefaultEnglishAnalyzer
2024-08-21 01:52:54,745 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:139) - Stemmer: porter
2024-08-21 01:52:54,746 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:140) - Keep stopwords? false
2024-08-21 01:52:54,746 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Stopwords file: null
>>> indexer.add_doc_dict({'id': '1', 'contents': 'A new document!'})
>>> indexer.close()
>>> reader = IndexReader("index")
>>> reader.stats()
{'total_terms': 6, 'documents': 3, 'non_empty_documents': 3, 'unique_terms': -1}

Here, in the second invocation of reader.stats(), I expected the re-addition of document id 1 to overwrite the existing document in the index instead of being treated as a different document.
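To make the expectation concrete, here is a small pure-Python model of the two behaviours. It does not use Pyserini or Lucene at all; `append_docs` and `upsert_docs` are hypothetical names for illustration, and plain lists stand in for the index. It only mirrors the document counts from the session above (3 with append, 2 if adds were treated as upserts).

```python
def append_docs(index, docs):
    """Append-only: every add creates a new entry, even for a repeated id."""
    return index + list(docs)


def upsert_docs(index, docs):
    """Upsert: an add with an existing id replaces the earlier entry."""
    by_id = {doc["id"]: doc for doc in index}
    for doc in docs:
        by_id[doc["id"]] = doc
    return list(by_id.values())


initial = [
    {"id": "0", "contents": "Hello there!"},
    {"id": "1", "contents": "A completely unique document."},
]
update = [{"id": "1", "contents": "A new document!"}]

appended = append_docs(initial, update)  # what the session above shows
upserted = upsert_docs(initial, update)  # what I was expecting

print(len(appended))  # 3
print(len(upserted))  # 2
```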

@lintool
Member

lintool commented Aug 20, 2024

Unfortunately, the document deletion bindings have not been exposed on the Java end (from Lucene), so this is not currently doable. You're certainly welcome to send a PR to implement this functionality... otherwise, this feature request is noted and we might circle back to implement when our team has extra cycles.
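Until deletion is exposed, one possible stopgap (not something proposed in this thread, and not a Pyserini feature) is to keep appending and hide stale versions at query time. The sketch below assumes each update is indexed under a versioned id such as `1@v2` and that a small side table records the latest version per logical id. `VersionedIds`, the `@v` separator, and `filter_hits` are all hypothetical names; the returned string is what would be passed as `'id'` to `add_doc_dict`.

```python
class VersionedIds:
    """Tracks the latest version of each logical document id (hypothetical helper)."""

    SEP = "@v"  # assumed separator; must not occur in real document ids

    def __init__(self):
        self._latest = {}  # logical id -> latest version number

    def next_docid(self, doc_id):
        """Return the versioned id to index for a new or updated document."""
        version = self._latest.get(doc_id, 0) + 1
        self._latest[doc_id] = version
        return f"{doc_id}{self.SEP}{version}"

    def is_current(self, versioned_id):
        """True only if this versioned id is the latest one for its document."""
        doc_id, sep, version = versioned_id.rpartition(self.SEP)
        if not sep or not version.isdigit():
            return False
        return self._latest.get(doc_id) == int(version)

    def filter_hits(self, docids):
        """Drop stale versions from a ranked list of docids, keeping order."""
        return [d for d in docids if self.is_current(d)]


ids = VersionedIds()
first = ids.next_docid("0")  # '0@v1'
old = ids.next_docid("1")    # '1@v1'
new = ids.next_docid("1")    # '1@v2', supersedes '1@v1'

ranked = [old, first, new]   # stand-in for docids returned by a search
current = ids.filter_hits(ranked)
print(current)  # ['0@v1', '1@v2']
```

The trade-offs are real: stale versions stay in the index, so they still take space and still count toward collection statistics, and the version table has to be persisted alongside the index. A periodic full rebuild would be needed to clear them out.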

@LaplaceXD
Author

LaplaceXD commented Aug 20, 2024

I see, thanks for the clarification! I'll see what I can do in the meantime.

@LaplaceXD LaplaceXD changed the title from "Updating Indexes in Pyserini Without Full Reindexing" to "Updating Index in Pyserini Without Full Reindexing when Document Contents Change" on Aug 22, 2024